Tag Archives: file format

Extracting Text from Word DOCX files using Pandoc

Back in the day, I would use antiword to extract the text from a Word .DOC file. But it only understands DOC. Over the years, more and more Word files have been using the “open” (ha ha) DOCX format, which antiword doesn’t read. So I found Pandoc, which does much, much more.

$ pandoc -f docx -t plain some.docx > some.txt

There’s also a convert-to-Markdown option:

$ pandoc -f docx -t markdown some.docx > some.md

I find, however, that the Markdown produced by docx2md is more to my liking. It’s less cluttered because it doesn’t try to match the Word document’s formatting as closely as pandoc does; it keeps only the basics.

To install pandoc you can use the MacPorts version, but lately I’ve found it easier simply to install the official binary package (PKG) for Mac OS X.
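For a whole folder of Word files, the one-off commands above can be wrapped in a small loop. This is only a sketch: it assumes pandoc is on your PATH, and the helper name txt_name and the choice to write each .txt next to its .docx are my own, not anything pandoc provides.

```shell
#!/bin/sh
# Sketch: convert every .docx in the current directory to plain text.
# txt_name is a hypothetical helper, not part of pandoc.

# Map "report.docx" to "report.txt".
txt_name() {
    printf '%s\n' "${1%.docx}.txt"
}

for f in *.docx; do
    # If nothing matches, the glob stays literal; skip it.
    [ -e "$f" ] || continue
    pandoc -f docx -t plain "$f" -o "$(txt_name "$f")"
done
```

Swapping `-t plain` for `-t markdown` (and `.txt` for `.md` in the helper) should give the Markdown variant.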

Pandoc for Word Document conversion

I just discovered pandoc. Well, I first bookmarked it in 2008, and again in 2016, so I guess I rediscovered it. But what I mean is that I finally discovered what to use it for: converting Word files to Markdown. It’s dead easy:

$ pandoc -f docx -t markdown sample.docx > sample.md
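If the default output is not quite what you want, two pandoc options may help: `--wrap=none` stops pandoc from hard-wrapping paragraphs, and `--extract-media=DIR` saves embedded images to a directory and links to them from the Markdown. A minimal sketch, where the function name docx_to_md and the `_media` directory naming are my assumptions:

```shell
#!/bin/sh
# Sketch: DOCX to Markdown without hard line wraps, images saved alongside.
# docx_to_md and the "<name>_media" directory are hypothetical choices.
docx_to_md() {
    src=$1
    base=${src%.docx}
    pandoc -f docx -t markdown --wrap=none \
        --extract-media="${base}_media" \
        "$src" -o "${base}.md"
}

# Usage: docx_to_md sample.docx   (writes sample.md and sample_media/)
```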

I’ve been using antiword for years to convert older Word (DOC) files to text, but it doesn’t do DOCX, and, instead of producing Markdown or something more neutral, it tries to recreate the DOC experience in text by centering lines, etc. Not complaining: it gets me plain text and I can take it from there, but Markdown is a big improvement. Being able to read DOCX is even better, since, apart from pandoc, the only way I knew to read those at the command line was via Libre/OpenOffice:

$ libreoffice --headless --convert-to "txt:Text (encoded):UTF8" sample.docx

(LibreOffice writes sample.txt into the current directory on its own, so no redirect is needed.)
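Since either tool will do the job, a script can use whichever one happens to be installed. The following is a sketch under assumptions: pick_converter and docx_to_txt are names I made up, and the LibreOffice binary is assumed to be called libreoffice (on some systems it is soffice instead).

```shell
#!/bin/sh
# Sketch: extract text from a .docx with whichever tool is installed.
# pick_converter and docx_to_txt are hypothetical helper names.

# Print the name of the first available converter, or "none".
pick_converter() {
    if command -v pandoc >/dev/null 2>&1; then
        echo pandoc
    elif command -v libreoffice >/dev/null 2>&1; then
        echo libreoffice
    else
        echo none
    fi
}

docx_to_txt() {
    case "$(pick_converter)" in
        pandoc)
            pandoc -f docx -t plain "$1" -o "${1%.docx}.txt" ;;
        libreoffice)
            # LibreOffice names the output file itself and puts it in --outdir.
            libreoffice --headless --convert-to "txt:Text (encoded):UTF8" \
                --outdir "$(dirname "$1")" "$1" ;;
        *)
            echo "neither pandoc nor libreoffice found" >&2
            return 1 ;;
    esac
}

# Usage: docx_to_txt sample.docx
```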

(I see now, when it is too late, that there is also code to do this in Ruby: antiword-xp-rb. I hope that’s an awesome tool, but it took me 9 years to figure out what to do with pandoc, so don’t wait for me to tell you.)