Unicode headed toward World Domination™
The Google Blog has a chart showing that there is a very clear trend toward Unicode adoption.
Apparently their numbers refer to UTF-8 alone (as opposed to UTF-16/UCS-2 or, haha, UTF-32/UCS-4), which again is good news. (Though one wonders if there is any uptake of UTF-16 on the web… I hope not.)
The data is “Google internal”… peer-reviewed, it ain’t.
Thanks to Won for the pointer!
6 comments.
Yay, long live Unicode! :)
Hear, hear!
And many thanks for your comment, Robin ☺
What’s wrong with UTF-16 and other Unicode encodings?
There’s nothing inherently wrong with UTF-16 or any other transformation format of Unicode.
But I think the web is heading toward standardization on UTF-8, because it’s backwards-compatible with ASCII (though not, at the byte level, with Latin-1 or the annoying CP1252 gremlins), and it has the widest support in applications.
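To make the compatibility point concrete, here is a minimal Python sketch. The sample strings and the helper name `decodes_as_utf8` are made up for illustration. It suggests that plain ASCII text is byte-for-byte identical in UTF-8, while Latin-1 and CP1252 bytes above 0x7F are generally not valid UTF-8 at all.

```python
# Illustrative sketch: sample strings and helper name are assumptions.
ascii_text = "Unicode"
accented = "café"

# Pure ASCII text yields exactly the same bytes in ASCII and UTF-8.
assert ascii_text.encode("ascii") == ascii_text.encode("utf-8")

# Latin-1 stores é as one byte (0xE9); UTF-8 stores it as two (0xC3 0xA9).
latin1_bytes = accented.encode("latin-1")
utf8_bytes = accented.encode("utf-8")

# A CP1252 "gremlin": the curly apostrophe is the single byte 0x92.
cp1252_bytes = "it\u2019s".encode("cp1252")


def decodes_as_utf8(data: bytes) -> bool:
    """Return True if data is well-formed UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(decodes_as_utf8(utf8_bytes))    # real UTF-8
print(decodes_as_utf8(latin1_bytes))  # Latin-1 bytes fed to a UTF-8 decoder
print(decodes_as_utf8(cp1252_bytes))  # CP1252 bytes fed to a UTF-8 decoder
```

Only the first of the three checks succeeds; the legacy-encoded bytes are rejected rather than silently misread.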
Because of the way it’s defined, UTF-8 is, in a way, self-validating. For a file to be processed as UTF-8, it pretty much has to be UTF-8. (As any Python programmer familiar with the notorious UnicodeDecodeError can attest.) Asking to decode a UTF-16 file does hardly any such “validation”, because nearly any even-length sequence of bytes is valid UTF-16; only unpaired surrogates get rejected. (It also takes up more memory, but that hardly matters these days… it’s just text.)
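A rough Python sketch of that difference, assuming a file that was really saved as Latin-1 (the sample text and the helper name `can_decode` are hypothetical): the UTF-8 decoder refuses the bytes, while the UTF-16 decoder happily turns them into unrelated characters. UTF-16 decoding is not entirely unchecked, though; odd-length input and unpaired surrogates still fail.

```python
# Illustrative sketch: helper name and sample data are assumptions.
def can_decode(data: bytes, encoding: str) -> bool:
    """Return True if data decodes cleanly with the given encoding."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


# Twelve bytes of Latin-1 text, i.e. neither UTF-8 nor UTF-16.
legacy = "naïve résumé".encode("latin-1")

print(can_decode(legacy, "utf-8"))      # rejected: 0xEF is not followed by continuation bytes
print(can_decode(legacy, "utf-16-le"))  # accepted, but the result is mojibake
print(legacy.decode("utf-16-le"))       # six unrelated characters

# The few things a UTF-16 decoder does reject:
lone_surrogate = b"\x00\xd8A\x00"       # high surrogate U+D800 followed by 'A'
odd_length = b"abc"                     # three bytes cannot form 16-bit units

print(can_decode(lone_surrogate, "utf-16-le"))
print(can_decode(odd_length, "utf-16-le"))
```

So a successful UTF-8 decode is decent evidence that the data really is UTF-8, whereas a successful UTF-16 decode says very little.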
<nit-picky>Not for scripts that sit mainly above the 7-bit border.</nit-picky>
That’s not nit-picky, that’s an important point. The “UTF-8 only” attitude strikes me as Latin-centric.
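For anyone who wants numbers on the size question, here is a small Python sketch. The sample phrases and the function name `encoded_sizes` are just illustrations, and the BOM-free `-le` codec variants are used so that only the text itself is counted.

```python
# Illustrative sketch: sample phrases and function name are assumptions.
ENCODINGS = ("utf-8", "utf-16-le", "utf-32-le")

samples = {
    "English": "Hello, world",
    "Hindi": "नमस्ते दुनिया",
    "Japanese": "こんにちは世界",
}


def encoded_sizes(text: str) -> dict:
    """Return the encoded length in bytes of text for each encoding."""
    return {enc: len(text.encode(enc)) for enc in ENCODINGS}


for name, text in samples.items():
    print(name, encoded_sizes(text))
```

The English sample comes out at 12 bytes in UTF-8 versus 24 in UTF-16, while the Japanese one needs 21 bytes in UTF-8 versus 14 in UTF-16. Which encoding is "smaller" depends on the script.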