
Extracting Text from Word DOCX files using Pandoc

Back in the day, I would use antiword to extract the text from a Word .DOC file. But it only understands DOC. Over the years, more and more Word files have been using the “open” (ha ha) DOCX format, which antiword doesn’t read. So I found Pandoc, which does much, much more.

$ pandoc some.docx -t plain > some.txt

There’s also a convert-to-Markdown option:

$ pandoc some.docx -t markdown > some.md

I find, however, that the Markdown produced by docx2md is more to my liking. It’s less cluttered, because it keeps only the basic formatting rather than aiming at pandoc’s degree of fidelity to the Word document.
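
For a whole folder of Word files, the plain-text command can be wrapped in a small loop. What follows is only a sketch, assuming a POSIX shell and a pandoc recent enough to read docx. The function names txt_name and convert_all are my own inventions for illustration, not part of pandoc.

```shell
# Hypothetical helper: swap a trailing .docx for .txt to get the output name.
txt_name() {
    printf '%s\n' "${1%.docx}.txt"
}

# Hypothetical helper: convert every .docx in a directory (default: the
# current one) to plain text, writing each .txt next to its source file.
convert_all() {
    dir="${1:-.}"
    for f in "$dir"/*.docx; do
        # If nothing matched, the glob stays literal; skip it.
        [ -e "$f" ] || continue
        pandoc -f docx -t plain -o "$(txt_name "$f")" "$f"
    done
}
```

After pasting the two functions into a shell, running convert_all ~/Documents should leave a .txt beside each .docx in that folder. Using -o rather than a redirect keeps the output name in one place, and the quoting is there so filenames with spaces survive.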

To install pandoc you can use the MacPorts version, but lately I’ve found it easier simply to install the official binary package (PKG) for Mac OS X.

MacPorts revisited

I’m setting up my new (old) iMac and I thought I’d give MacPorts another try. I’d used it since the late 2000s. (I forget why I moved from Fink to MacPorts.) For the last five (?) years or so I’ve used Homebrew, but I’ve always been uneasy about making /usr/local writable, and I was never convinced by the blandishments on the Homebrew site. I’ve heard about Nix, but before trying something really different, I thought I’d give MacPorts another look.

So far, so good. I prefer to need sudo to install software. The only real problem I’ve encountered is that the MacPorts version of Pandoc is so old it can’t read docx files. (#sad!) But Pandoc has its own installer, so I’m trying that out too.