Pandoc for Word Document conversion

I just discovered pandoc. Well, I first bookmarked it in 2008, and again in 2016, so I guess I rediscovered it. But what I mean is that I finally discovered what to use it for: converting Word files to Markdown. It’s dead easy:

$ pandoc -f docx -t markdown sample.docx > sample.md
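If there is a whole folder of Word files, the same command can be wrapped in a small shell function. This is only a sketch: `md_name` and `docx_to_md` are made-up names, and the `PANDOC` variable is just an assumed convenience for pointing at a different binary. The only pandoc options used are `-f`, `-t`, and `-o` (write to a file instead of redirecting stdout).

```shell
#!/bin/sh
# Hypothetical helper: derive the Markdown file name from a DOCX file name.
md_name() {
    printf '%s\n' "${1%.docx}.md"
}

# Hypothetical helper: convert every DOCX file passed as an argument.
# PANDOC may be set to another binary; it defaults to pandoc on the PATH.
docx_to_md() {
    for f in "$@"; do
        # Stop at the first file that fails to convert.
        "${PANDOC:-pandoc}" -f docx -t markdown -o "$(md_name "$f")" "$f" || return 1
    done
}
```

After sourcing this, `docx_to_md *.docx` should leave a `.md` file next to each `.docx` file in the current directory.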

I’ve been using Antiword for years to convert pre-2007 Word (DOC) files to text, but it doesn’t do DOCX, and, instead of producing Markdown or something more neutral, it tries to recreate the DOC experience in text by centering lines, etc. Not complaining: it gets me plain text and I can take it from there, but Markdown is a big improvement. Being able to read DOCX is even better, since, apart from pandoc, the only way I knew to read those at the command line was via Libre/OpenOffice:

$ libreoffice --headless --convert-to "txt:Text (encoded):UTF8" sample.docx

(I see now, when it is too late, that there is also Ruby code to do this: antiword-xp-rb. I hope that’s an awesome tool, but it took me 9 years to figure out what to do with pandoc, so don’t wait for me to tell you.)