Mon, 07 Mar 2005

UTF-8

Last weekend I was feeling both bored and geeky, so I did something I'd been meaning to do for a while: I switched to UTF-8. I'm running Debian unstable, and the transition was relatively painless, though I did run into some problems.

Markus Kuhn's Unicode page proved invaluable for both theory and practice, as did his UTF-8 example files. Also of use was Radovan Garabik's Debian howto for switching to UTF-8.

I got rid of gnome-terminal, sadly, and went back to vanilla xterm. There were some aspects of UTF-8 that gnome-terminal didn't support (combining characters, notably), and there wasn't a good Unicode font that it could use. (The only monospace font with any sort of reasonable coverage was FreeMono, which looks horrible. Terminus was actually decent in the Latin-1 sections, but I'd need more than 1152x864 to use it the way I'd want. The fact that gnome-terminal refused to use traditional X fonts is a separate rant.) I'm using xterm as xterm -fn '-misc-fixed-medium-r-semicondensed-*-*-120-*-*-*-*-iso10646-1'.

screen supports UTF-8 nicely. I merely set defutf8 on in my .screenrc. Debian has a separate package for mutt with UTF-8 support; it's mutt-utf8. Once installed, it diverts existing mutt binaries to mutt.ncurses, so just typing mutt works. irssi happily handled UTF-8 without any intervention from me. In order to get w3m working, I had to compile and install w3m-m17n.

XEmacs seems uneasy with the whole thing. I'm using xemacs21-mule and I have

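;; Load the Mule-UCS Unicode definitions.
(require 'un-define)
;; Try UTF-8 first when autodetecting a file's encoding.
(set-coding-priority-list '(utf-8))
;; Map the utf-8 coding category to the utf-8 coding system.
(set-coding-category-system 'utf-8 'utf-8)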

in my startup files. That enables UTF-8 support and autodetects files that already have UTF-8 characters in them. I still need to figure out how to open other files as UTF-8 (default translation still seems to be ISO-8859-1). I also need to look at the displaying of Unicode characters. XEmacs is running in screen in a UTF-8-aware xterm, so things should display properly, but most Unicode characters are displayed as tildes. Finally, it appears that the easiest way to enter Unicode characters is to call the function insert-ucs-character and type in the decimal (not hex!) number of the Unicode codepoint. Addendum: XEmacs does seem to actually mess up some UTF-8 characters. Sigh.
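Since most Unicode charts list codepoints in hex, the decimal requirement means a conversion step. Here is a minimal sketch in Python of that conversion; the function name decimal_codepoint is made up for illustration and has nothing to do with XEmacs itself.

```python
def decimal_codepoint(spec):
    """Convert a hex codepoint such as 'U+00E9', '0xE9', or 'E9' to decimal."""
    s = spec.strip().upper()
    # Drop a leading 'U+' or '0X' prefix if one is present.
    if s.startswith("U+") or s.startswith("0X"):
        s = s[2:]
    return int(s, 16)


# Print the decimal number to type for a few sample codepoints.
for spec in ("U+00E9", "U+16A0"):
    n = decimal_codepoint(spec)
    print(spec, "->", n, repr(chr(n)))
```

The decimal number printed is what you would type at the insert-ucs-character prompt.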

I played a little with other editors to see what I could do with them. yudit seems the best of the lot, but it's GUI-only. qemacs doesn't look too bad, but it had some problems detecting UTF-8 documents, which led to munged characters when I saved. And apparently vim has excellent UTF-8 support. Figures.
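To illustrate the kind of detection problem that munges characters, here is a rough Python sketch of one common heuristic: try a strict UTF-8 decode and fall back to ISO-8859-1 if it fails. This is an assumption about how an editor might do it, not a description of what qemacs or any other editor actually does, and decode_guess is a hypothetical name.

```python
def decode_guess(data):
    """Return (text, encoding), trying strict UTF-8 before ISO-8859-1."""
    try:
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Every byte sequence is valid ISO-8859-1, so this cannot fail.
        return data.decode("iso-8859-1"), "iso-8859-1"


utf8_bytes = "é".encode("utf-8")      # two bytes: 0xC3 0xA9
latin1_bytes = "é".encode("iso-8859-1")  # one byte: 0xE9

print(decode_guess(utf8_bytes))
print(decode_guess(latin1_bytes))

# What a wrong guess looks like: UTF-8 bytes read as ISO-8859-1.
print(utf8_bytes.decode("iso-8859-1"))
```

If an editor misreads UTF-8 as ISO-8859-1 and then saves the buffer as UTF-8, each non-ASCII character gets encoded a second time, which is one way to end up with munged text.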

zsh does not support UTF-8. (Though it's one of two items in the TODO list.) It passes things through literally enough that you can paste UTF-8 into a command line and have the app handle it, but you can't edit Unicode on the command line. It also doesn't deal properly with the size of UTF-8 characters, so no UTF-8 in my shell prompt.
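The size problem comes down to three different measurements of the same string: bytes, codepoints, and terminal columns. A byte-counting program gets the prompt width wrong as soon as a character takes more than one byte. Here is a small Python sketch of the difference; display_width is a hypothetical helper and only a crude approximation of what a real wcwidth() does.

```python
import unicodedata


def display_width(text):
    """Roughly estimate how many terminal columns a string occupies."""
    width = 0
    for ch in text:
        if unicodedata.combining(ch):
            # Combining marks are drawn on top of the previous character.
            continue
        # Wide and fullwidth East Asian characters take two columns.
        width += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return width


for s in ("aeiou", "àéîõü", "e\u0301"):
    print(repr(s), len(s.encode("utf-8")), len(s), display_width(s))
```

The three numbers printed per string are bytes, codepoints, and estimated columns; they agree only for plain ASCII.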

Just for the fun of it, I switched the blog pages here over to UTF-8. I might switch the web server as a whole, but that could break some of my text files. It doesn't really matter, because I generally use the HTML entities for non-ASCII characters, anyway. Why type "àéîõü" (UTF-8) when I can use "&agrave;&eacute;&icirc;&otilde;&uuml;" (HTML entities) and be much more portable? Of course, UTF-8 does let me put in things like "ᚠᛁᚢᛖ᛫ᚠᛟᛏ᛫ᚻᛁᚷᚻ᛫ᚦᛖ᛫ᛞᚩᚱ᛫ᚨᚾᛞ᛫ᚦᚱᛟ᛫ᛗᚨᚤ᛫ᚹᚨᛚᚴ᛫ᚨᛒᚱᛖᚨᛋᛏ". }:>
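For anyone who wants the entity form without typing it by hand, here is a minimal Python sketch of the conversion. It uses the standard library's html.entities table; to_entities is a made-up name, and the sketch deliberately leaves ASCII alone, so it does not escape &, < or >.

```python
from html.entities import codepoint2name


def to_entities(text):
    """Replace each non-ASCII character with an HTML entity."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 128:
            out.append(ch)  # plain ASCII passes through untouched
        elif cp in codepoint2name:
            out.append("&" + codepoint2name[cp] + ";")  # named entity
        else:
            out.append("&#" + str(cp) + ";")  # decimal numeric reference
    return "".join(out)


print(to_entities("àéîõü"))
print(to_entities("\u16a0"))  # a rune has no named entity
```

Characters without a named entity, the runes included, fall back to numeric references, which stay pure ASCII but are not exactly readable in the source.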


Phil! Gold