Extracting Text from Word DOCX Files Using Pandoc

Back in the day, I would use antiword to extract the text from a Word .DOC file. But it only understands DOC. Over the years, more and more Word files have been using the “open” (ha ha) DOCX format, which antiword doesn’t read. So I found Pandoc, which does much, much more.

$ pandoc -f docx -t plain some.docx > some.txt
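
If you have a whole folder of Word files, the same command can be wrapped in a small loop. This is only a rough sketch: the txt_name helper is a name I made up for illustration, and it assumes the files sit in the current directory and end in .docx.

```shell
#!/bin/sh
# Hypothetical helper: map "name.docx" to "name.txt" by stripping the suffix.
txt_name() {
    printf '%s\n' "${1%.docx}.txt"
}

# Convert every .docx in the current directory, but only if pandoc is installed.
if command -v pandoc >/dev/null 2>&1; then
    for f in ./*.docx; do
        [ -e "$f" ] || continue   # the glob matched nothing; skip
        pandoc -f docx -t plain "$f" -o "$(txt_name "$f")"
    done
else
    echo "pandoc not found on PATH" >&2
fi
```

Each output file lands next to its source, so report.docx becomes report.txt.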

There’s also a convert-to-Markdown option:

$ pandoc -f docx -t markdown some.docx > some.md

I find, however, that the Markdown produced by docx2md is more to my liking. It’s less cluttered because, unlike pandoc, it doesn’t try to reproduce the Word document’s formatting faithfully and keeps only the basics.

To install pandoc you can use the MacPorts version, but lately I’ve found it easier simply to install the official binary PKG for Mac OS X.
