I’ve been looking for a software tool that would convert foreign characters into a poor substitute.
Call me Ugly McAmerican. I don’t care.
My language has been worn down — I would say, “polished” — like a river rock to the point where it doesn’t have a million characters or funny accent marks or any of that stuff. Now, I don’t mind if your language uses them. I don’t even mind if we have a common encoding. What I do mind is that none of my tools work with your stupid common encoding. When grep and sed and diff and ruby all know what to do with your ?q???????, give me a call.
In the meantime, I plan to go on working in ASCII as much as possible. Then, when necessary, I’ll use tools to convert ugly-quotes to pretty ones, or turn ... into ellipses, etc.
On the Mac, TextWrangler is one of those tools. It will convert UTF-8 and Latin-1 text files to ASCII. (It will also convert smart quotes to stupid, and vice-versa.)
Sadly, I don’t have anything to do that on my Linux machine.
Well, I didn’t. But I do now. In fact, I had it all along, because iconv can do it.
What iconv does is translate from one character encoding to another. For example, you can use it to convert from ASCII to ISO Latin-1 or to UTF-8. That would be a boring conversion, because every ASCII character survives unchanged, but the reverse is not boring at all: an ASCII e is still an e in UTF-8, but a UTF-8 é has no ASCII equivalent, so it isn’t automatically going to become an ASCII e.
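Here is a quick sketch of that asymmetry at the shell. It assumes an iconv that accepts the encoding names ASCII and UTF-8 (the GNU one does). The variable names are mine, and the octal escapes \303\251 are just the two UTF-8 bytes for é, written out so the script itself stays pure ASCII.

```shell
# Plain ASCII text is already valid UTF-8, so this conversion changes nothing.
ascii_text='resume'
converted=$(printf '%s' "$ascii_text" | iconv -f ASCII -t UTF-8)

# The reverse direction is lossy. A strict conversion of an accented
# character to ASCII is expected to fail, because ASCII has no such character.
# '\303\251' is the UTF-8 byte sequence for e-acute.
if printf 'r\303\251sum\303\251' | iconv -f UTF-8 -t ASCII >/dev/null 2>&1; then
    strict_status=converted
else
    strict_status=failed
fi

echo "ASCII to UTF-8: $converted"
echo "UTF-8 to ASCII (strict): $strict_status"
```

As far as I can tell, GNU iconv refuses the second conversion outright. Other implementations may substitute a placeholder character instead, so treat the "failed" result as typical rather than guaranteed.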
The problem is that iconv is really badly documented. Its man page is 266 words long. The perl equivalent (piconv) has a man page 394 words long. Neither one told me what I needed to know. They assumed I would be happy converting résumé to r?sum?.
Well, long story short, I ran across this page describing ruby’s iconv library, and it happened to mention something that isn’t documented in the iconv man page:
//TRANSLIT can’t convert absolutely everything you will see in the wild, so it’s still possible to get errors when using it. You can combine the modes though by specifying //TRANSLIT//IGNORE. That will ask iconv to transliterate what it can and drop the rest. Note that order does matter there, you need to be sure it tries transliteration before ignoring the character.
(Oddly, piconv knows about //TRANSLIT, but doesn’t seem to know //IGNORE. Bummer.)
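To try the combined mode without hunting for a Word document, here is a small sketch. It assumes a GNU-style iconv that understands the //TRANSLIT//IGNORE suffix; the sample text and variable names are made up for illustration. What each character turns into depends on your iconv and your locale (an é might come out as e or as ?), so the script only checks that nothing outside printable ASCII survives.

```shell
# Sample input, spelled with octal escapes so this script is itself pure ASCII:
# "resume" with two e-acutes, followed by a horizontal ellipsis (U+2026).
# Transliterate what can be approximated, drop whatever cannot.
# Some iconv builds may exit non-zero after dropping or approximating
# characters, so the exit status is deliberately not treated as fatal here.
cleaned=$(printf 'r\303\251sum\303\251\342\200\246' \
    | iconv -f UTF-8 -t ASCII//TRANSLIT//IGNORE 2>/dev/null) || true

# Count any bytes outside printable ASCII that survived the conversion.
leftover=$(printf '%s' "$cleaned" | LC_ALL=C tr -d ' -~' | wc -c | tr -d ' ')

echo "cleaned: $cleaned"
echo "non-ASCII bytes left: $leftover"
```

If the count is anything other than 0, your iconv is probably not honoring the suffix the way the quoted page describes.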
So, whenever I get a text file (e.g., when I copy something from a web page, or use antiword to convert the Word document you mailed me into something useful) all I have to do is run it through iconv thus:
antiword file.doc \
  | iconv -f UTF8 -t ASCII//TRANSLIT//IGNORE \
  > file.txt
O frabjous day.