Some Character Encoding Gotchas

10:31 Thu 16 Jul 2009

While scripting my reStructuredText to WordPress workflow, I ran into a bunch of character encoding problems.

The first thing to keep in mind is that character encoding is not a metadata property of a file. That is, unless a file format itself carries metadata that includes the encoding, the only way the OS can determine the character encoding is by reading the file and guessing.

I knew this, and I had been caught by it before, but I “knew” it only in the sense that if you had asked me whether there is some way outside the file for the OS to know the encoding, I would have answered “no”. That didn’t stop me from acting as if some such magical property existed.

I started out by playing with rst2html and opening its output in Firefox. As far as character encoding went, that was fine. Then I switched to rst2wp, which outputs a truncated version of the HTML containing essentially just the body text that you’re going to paste into WordPress. It worked, but the character encodings were screwed up. I spent a lot of time combing the source code for differences between the two tools before finding the obvious answer: rst2wp outputs just the body text.

In other words, it outputs to a file that doesn’t contain this:

<?xml version="1.0" encoding="UTF-8"?>

Without that declaration, the browser has to guess the encoding. Viewing the page source doesn’t help either, because the source view is also rendered in whatever encoding the browser thinks the page uses (which makes sense, but makes the problem hard to spot).
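The failure mode here is just bytes decoded under the wrong assumption. A minimal sketch of what a guessing browser does when it guesses wrong (the string is illustrative):

```python
# The author's u"café" arrives on the wire as the UTF-8 bytes b'caf\xc3\xa9'.
data = u"caf\u00e9".encode("utf-8")

right = data.decode("utf-8")    # back to u"café" — the declared encoding
wrong = data.decode("latin-1")  # u"cafÃ©" — the classic mojibake you see
                                # when the browser assumes Latin-1
```

Same bytes, two readings; only the declaration tells the browser which one was meant.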

After I got past that issue, the next one involved writing to standard output from Python. The output was consistently screwed up, and I couldn’t figure it out, because when I sent the output to Terminal the encodings were fine. But if I piped the same script into another script, or wrote to stdout from within the script, they were wrong. It turns out that the LC_CTYPE environment variable needs to be set appropriately for Python to write your desired encoding to stdout, e.g. (for tcsh):

setenv LC_CTYPE en_US.utf-8

(Substituting your preferred encoding for en_US.utf-8, of course.)
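An alternative that sidesteps the locale entirely is to encode explicitly before writing, so the output is UTF-8 whether stdout is a terminal or a pipe. A minimal sketch (codecs.getwriter is standard library; the wrapper function name is my own):

```python
import codecs

def utf8_writer(byte_stream):
    # Wrap a byte stream so any unicode text written to it is
    # encoded as UTF-8, regardless of what LC_CTYPE says.
    return codecs.getwriter("utf-8")(byte_stream)
```

In a script you would wrap sys.stdout (sys.stdout.buffer on Python 3) once at startup and send all output through the wrapper.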

This last one is specific to OS X. I copy to and from the clipboard using pbpaste and pbcopy as well as the usual keyboard shortcuts, and when doing so from my scripts, the character encodings would get screwed up. To force OS X to use UTF-8 by default, you need to set defaultStringEncoding, which is done in tcsh with:

setenv __CF_USER_TEXT_ENCODING 0x1F5:0x8000100:0x8000100

Note that 0x1F5 is your uid (here, 501) expressed in hex.
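If your uid isn’t 501, the hex component is easy to compute; a throwaway sketch (os.getuid is standard library):

```python
import os

# The first field of __CF_USER_TEXT_ENCODING is your uid in hex.
print("0x%X" % 501)          # prints 0x1F5
print("0x%X" % os.getuid())  # the same conversion for your own account
```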

Hopefully those three tips will save some readers a little pain in their scripting.
