{"id":1169,"date":"2020-05-24T17:17:13","date_gmt":"2020-05-25T01:17:13","guid":{"rendered":"https:\/\/accretiondisc.com\/blog\/?p=1169"},"modified":"2020-05-24T17:17:13","modified_gmt":"2020-05-25T01:17:13","slug":"extracting-text-from-word-docx-files-using-pandoc","status":"publish","type":"post","link":"https:\/\/accretiondisc.com\/blog\/2020\/05\/24\/extracting-text-from-word-docx-files-using-pandoc\/","title":{"rendered":"Extracting Text from Word DOCX files using Pandoc"},"content":{"rendered":"\n<p class=\"wp-block-paragraph\">Back in the day, I would use <a href=\"http:\/\/www.winfield.demon.nl\/\">antiword<\/a> to extract the text from a Word .DOC file. But it only understands DOC. Over the years, more and more Word files have been using the &#8220;open&#8221; (ha ha) DOCX format, which antiword doesn&#8217;t read. So I found <a href=\"https:\/\/pandoc.org\/\">Pandoc<\/a>, which does much, much more.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">$ pandoc some.docx -t plain > some.txt<\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">There&#8217;s also a convert-to-Markdown option:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">$ pandoc some.docx -t markdown > some.md<\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">I find, however, that the Markdown produced by <a href=\"https:\/\/github.com\/mattn\/docx2md\">docx2md<\/a> is more to my liking. It&#8217;s less cluttered because it doesn&#8217;t try to reproduce the Word document&#8217;s formatting as faithfully as <em>pandoc<\/em> does; it keeps only the basics.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">To install pandoc, you can use the MacPorts version, but lately I&#8217;ve found it easier simply to install the <a href=\"https:\/\/pandoc.org\/installing.html\">official macOS binary PKG<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Back in the day, I would use antiword to extract the text from a Word .DOC file. But it only understands DOC. 
Over the years, more and more Word files have been using the &#8220;open&#8221; (ha ha) DOCX format, which antiword doesn&#8217;t read. So I found Pandoc, which does much, much more. $ pandoc -i [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_feature_clip_id":0,"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":true,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"font":"","enabled":false},"version":2},"jetpack_post_was_ever_published":false},"categories":[52],"tags":[1084,1083,1085,970,975,1034,969],"class_list":["post-1169","post","type-post","status-publish","format-standard","hentry","category-technology","tag-antiword","tag-docx","tag-docx2md","tag-file-format","tag-markdown","tag-pandoc","tag-word"],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"jetpack_shortlink":"https:\/\/wp.me\/paRqpr-iR","_links":{"self":[{"href":"https:\/\/accretiondisc.com\/blog\/wp-json\/wp\/v2\/posts\/1169","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/accretiondisc.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/accretiondisc.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/accretiondisc.com\/blog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/accretiondisc.com\/blog\/wp-json\/wp\/v2\/comments?post=1169"}],"version-history":[{"count":0,"href":"https:\/\/accretiondisc.com\/blog\/wp-json\/wp\/v2\/posts\/1169\/revisions"}],"wp:attachment":[{"href":"https:\/\/accretiondisc
.com\/blog\/wp-json\/wp\/v2\/media?parent=1169"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/accretiondisc.com\/blog\/wp-json\/wp\/v2\/categories?post=1169"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/accretiondisc.com\/blog\/wp-json\/wp\/v2\/tags?post=1169"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}