Joel on Unicode

Tue, Oct 14, 2003

I'm sure everyone has seen this by now, but Joel is dead on about Unicode.  I worked on some of the codepage reloading code in IE, so I've lived some of this: both recognizing the BOM for Unicode and implementing the reload of an already-started page when we hit the codepage meta tag.  Most of IE is fully Unicode, of course.  The only exception is when we deal with URLs because, back in the day, they were restricted to 7-bit ASCII.
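For anyone who hasn't run into it, BOM recognition amounts to checking the first few bytes of a stream against a short table. The following is only a minimal Python sketch of the idea, not IE's actual code; the `sniff_bom` name and its return shape are made up for illustration.

```python
import codecs

# Longest BOMs first: the UTF-32-LE BOM (FF FE 00 00) begins with the
# same two bytes as the UTF-16-LE BOM (FF FE), so order matters.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def sniff_bom(data: bytes):
    """Return (encoding_name, bom_length), or (None, 0) if no BOM is present.

    Hypothetical helper for illustration only.
    """
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0

encoding, skip = sniff_bom(codecs.BOM_UTF8 + "héllo".encode("utf-8"))
print(encoding, skip)
```

When there is no BOM, the caller has to fall back on something else (an HTTP header, a meta tag, or a guess), which is where the reload-on-meta case comes from.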

There are some other issues that Joel didn't bring up that devs should be aware of:

  • Sorting Unicode is hard.  Each language has its own sort rules, so the same set of strings may be sorted differently depending on the language.  For example, Swedish sorts 'ä' after 'z', while German dictionary order treats it like 'a'.
  • Capitalization can be hard too.  There was a bug in IE at one point where we didn't handle capitalization of a Japanese string correctly and someone's name was transformed into "dead body on beach."  At the very least, that is a rude thing to call someone.
  • Be aware that text that looks the same on the screen can be backed by different Unicode codepoints.  This is a problem with going to Unicode for URLs, as it makes domain squatting more difficult to deal with.  Two people might own two domains that look the same when typed in but are actually two different domains from a Unicode codepoint point of view.  As an analogy, I was looking at an ancient typewriter the other day that one of our PMs has outside of her office.  It doesn't have a '1' key!  The solution is to just use the 'l' (L) key instead.  Now imagine that the keys are font glyphs as displayed on the screen and that the concepts of '1' and 'l' are Unicode codepoints.  'He11o' and 'Hello' would look exactly alike on the screen but would actually be different domains.
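To make the look-alike point concrete, here is a small Python sketch. The two labels are made-up examples (not real registrations) and the `codepoints` helper is hypothetical. One label is all Latin; the other swaps in U+0430, the Cyrillic small letter a, which most fonts draw the same as Latin 'a'.

```python
import unicodedata

def codepoints(s: str) -> list:
    """List each character as 'U+XXXX NAME' so look-alikes can be told apart."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}" for ch in s]

# Hypothetical domain labels, for illustration only.
latin_label = "paypal"
mixed_label = "p\u0430yp\u0430l"  # U+0430 CYRILLIC SMALL LETTER A replaces 'a'

# The strings usually render the same but do not compare equal.
print(latin_label == mixed_label)
for a, b in zip(codepoints(latin_label), codepoints(mixed_label)):
    if a != b:
        print(a, "vs", b)

# Normalization fixes one class of look-alikes: precomposed vs. combining forms.
composed = "\u00e9"     # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"  # 'e' followed by COMBINING ACUTE ACCENT
same_after_nfc = unicodedata.normalize("NFC", decomposed) == composed

# It does not help across scripts: Cyrillic 'а' stays Cyrillic under NFKC.
still_different = unicodedata.normalize("NFKC", mixed_label) != latin_label
print(same_after_nfc, still_different)
```

So normalizing before comparing is necessary but not sufficient. Catching cross-script look-alikes takes an extra policy on top, such as rejecting labels that mix scripts or checking against a table of confusable characters.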