{"id":355,"date":"2009-02-09T12:13:30","date_gmt":"2009-02-09T20:13:30","guid":{"rendered":"http:\/\/messofpottage.com\/blog\/?p=355"},"modified":"2009-02-09T12:13:30","modified_gmt":"2009-02-09T20:13:30","slug":"iconv-is-the-heat","status":"publish","type":"post","link":"https:\/\/accretiondisc.com\/blog\/2009\/02\/09\/iconv-is-the-heat\/","title":{"rendered":"iconv is the heat"},"content":{"rendered":"<p>I&#8217;ve been looking for a software tool that would convert foreign characters into a poor substitute.<\/p>\n<p>Call me Ugly McAmerican. I don&#8217;t care.<\/p>\n<p>My language has been worn down &#8212; I would say, &#8220;polished&#8221; &#8212; like a river rock to the point where it doesn&#8217;t have a million characters or funny accent marks or any of that stuff. Now, I don&#8217;t mind if your language uses them. I don&#8217;t even mind if we have a common encoding. What I do mind is that none of my tools work with your stupid common encoding. When <em>grep<\/em> and <em>sed<\/em> and <em>diff<\/em> and <em>ruby<\/em> all know what to do with your ?q???????, give me a call.<\/p>\n<p>In the meantime, I plan to go on working in <a href=\"http:\/\/en.wikipedia.org\/wiki\/ASCII\">ASCII<\/a> as much as possible. Then, when necessary, I&#8217;ll use tools to convert ugly-quotes to pretty ones, or turn <code>...<\/code> into ellipses, etc.<\/p>\n<p><!--more-->On the Mac, <em>TextWrangler<\/em> is one of those tools. It will convert <a href=\"http:\/\/en.wikipedia.org\/wiki\/UTF-8\">UTF-8<\/a> and <a href=\"http:\/\/en.wikipedia.org\/wiki\/ISO_8859-1\">Latin-1<\/a> text files to ASCII. (It will also convert smart quotes to stupid, and vice-versa.)<\/p>\n<p>Sadly, I don&#8217;t have anything to do that on my Linux machine.<\/p>\n<p>Well, I didn&#8217;t. But I do now. In fact, I had it all along, because <em>iconv<\/em> can do it.<\/p>\n<p>What <em>iconv<\/em> does is translate from one character encoding to another. 
For example, you can use it to convert from ASCII to ISO Latin-1 to UTF-8. That would be a boring conversion, but the reverse is not: an ASCII <strong>e<\/strong> is still an <strong>e<\/strong> in UTF-8, but a UTF-8 <strong>\u00e9<\/strong> isn&#8217;t necessarily going to be an ASCII <strong>e<\/strong>.<\/p>\n<p>The problem is that <em>iconv<\/em> is really badly documented. Its man page is 266 words long. The Perl equivalent (<em>piconv<\/em>) has a man page 394 words long. Neither one told me what I needed to know. They assumed I would be happy converting <em>r\u00e9sum\u00e9<\/em> to <em>r?sum?<\/em>.<\/p>\n<p>Well, long story short, I ran across <a href=\"http:\/\/blog.grayproductions.net\/articles\/encoding_conversion_with_iconv\">this page describing Ruby&#8217;s <em>iconv<\/em> library<\/a>, and it happened to mention something that isn&#8217;t documented in the <em>iconv<\/em> man page:<\/p>\n<blockquote><p><code>\/\/TRANSLIT<\/code> can&#8217;t convert absolutely everything you will see in the wild, so it&#8217;s still possible to get errors when using it. You can combine the modes though by specifying <code>\/\/TRANSLIT\/\/IGNORE<\/code>. That will ask iconv to transliterate what it can and drop the rest. Note that order does matter there, you need to be sure it tries transliteration before ignoring the character.<\/p><\/blockquote>\n<p>(Oddly, <em>piconv<\/em> knows about <code>\/\/TRANSLIT<\/code>, but doesn&#8217;t seem to know <code>\/\/IGNORE<\/code>. 
Bummer.)<\/p>\n<p>So, whenever I get a text file (e.g., when I copy something from a web page, or use <em>antiword<\/em> to convert the Word document you mailed me into something useful) all I have to do is run it through <em>iconv<\/em> thus:<\/p>\n<blockquote>\n<pre>antiword file.doc \\\n| iconv -f UTF8 -t ASCII\/\/TRANSLIT\/\/IGNORE \\\n> file.txt\n<\/pre>\n<\/blockquote>\n<p>O frabjous day.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>I&#8217;ve been looking for a software tool that would convert foreign characters into a poor substitute. Call me Ugly McAmerican. I don&#8217;t care. My language has been worn down &#8212; I would say, &#8220;polished&#8221; &#8212; like a river rock to the point where it doesn&#8217;t have a million characters or funny accent marks or any [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":false,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"font":"","enabled":false},"version":2}},"categories":[23,52],"tags":[],"class_list":["post-355","post","type-post","status-publish","format-standard","hentry","category-life","category-technology"],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"jetpack_shortlink":"https:\/\/wp.me\/paRqpr-5J","_links":{"self":[{"href":"https:\/\/accretiondisc.com\/blog\/wp-json\/wp\/v2\/posts\/355","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/accretiondisc.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/accretiondisc.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/accretiondisc.com\/blog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href"
:"https:\/\/accretiondisc.com\/blog\/wp-json\/wp\/v2\/comments?post=355"}],"version-history":[{"count":0,"href":"https:\/\/accretiondisc.com\/blog\/wp-json\/wp\/v2\/posts\/355\/revisions"}],"wp:attachment":[{"href":"https:\/\/accretiondisc.com\/blog\/wp-json\/wp\/v2\/media?parent=355"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/accretiondisc.com\/blog\/wp-json\/wp\/v2\/categories?post=355"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/accretiondisc.com\/blog\/wp-json\/wp\/v2\/tags?post=355"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}