Unicode vs. Latin-1… DEATHMATCH
And now for some highly inaccurate but hopefully provocative statistics on the progress of Unicode on the web…
Emily Chang’s eHub is “a constantly updated list of web applications, services, resources, blogs or sites with a focus on next generation web (web 2.0), social software, blogging, Ajax, Ruby on Rails, location mapping, open source, folksonomy, design and digital media sharing.”
Holy smokes that’s a lotta buzzwords!
Presumably the folks building such web applications are pretty up-to-date with regard to web standards and such, and I was wondering how clued-in they were about character encodings.
So I took 5 minutes and got all the URLs out of eHub’s front page, put them in a file called “ehublinks.html”, and said this to my bash shell:
$ wget -i ehublinks.html
$ grep -i charset index.html* | tr '[:upper:]' '[:lower:]' | tr ' ' '\12' | grep chars | tr '";' '\12' | grep chars | sort | uniq -c | sort -n
1 charset
1 charset=
5 charset=windows-1252
66 charset=iso-8859-1
100 charset=utf-8
66 in latin-1, 100 in utf-8.
That’s out of two-hundred-some-odd pages I downloaded, so it’s hardly accurate. Plus there’s the fact that the encoding that pages claim to be in isn’t necessarily what the server is sending.
But whatever, ballparks, ballparks.
The good news is that Unicode (utf-8) is winning, the bad news is that latin-1 won’t be going away any time soon.
And the even worse news is that my crufty little survey is probably half wrong anyway, since servers don’t necessarily send a page in the encoding the page claims to be in.
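For anyone who wants to check that caveat, here is a rough sketch of how the charset in the HTTP Content-Type header could be compared with the one declared in the markup. The function names charset_of and compare_charsets are made up for illustration, and the sketch assumes curl is installed; it only looks at the first charset it finds, so treat the result as a hint rather than a verdict.

```shell
# Hypothetical helper: print the first charset value found on stdin, lowercased.
# Works on both an HTTP header line and a <meta> tag.
charset_of() {
  grep -io 'charset=["'\'']\{0,1\}[a-z0-9_.:-]*' | head -n 1 |
    tr '[:upper:]' '[:lower:]' | sed 's/^charset=//; s/^["'\'']//'
}

# Hypothetical helper: show the header charset next to the markup charset.
# Assumes curl is available; -s silences progress, -I asks for headers only.
compare_charsets() {
  header=$(curl -sI "$1" | grep -i '^content-type:' | charset_of)
  markup=$(curl -s "$1" | charset_of)
  printf '%s\theader=%s\tmarkup=%s\n' "$1" "${header:-none}" "${markup:-none}"
}

# Usage (needs network access, so it is left commented out):
#   compare_charsets http://example.com/
```

A page where the two values disagree, or where the header has no charset at all, is exactly the kind that makes a grep-based tally unreliable.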
But about those windows-1252 people… GOOD GRIEF.

A lot of stuff marked Latin-1 is actually Windows-1252: it contains curly quotes, which are valid characters in Windows-1252 but fall on control characters in Latin-1.
http://intertwingly.net/stories/2004/04/14/i18n.html#CleaningWindows
http://en.wikipedia.org/wiki/Windows-1252
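
A quick way to spot such pages, as a sketch: count the bytes in the range 0x80 to 0x9F (octal 200 to 237), which Latin-1 reserves for control characters and Windows-1252 uses for curly quotes, dashes and the like. The function name count_c1_bytes is made up for illustration. This only makes sense for files that claim to be Latin-1, since UTF-8 uses the same byte values as continuation bytes.

```shell
# Hypothetical helper: count bytes in 0x80-0x9F (octal 200-237) in a file.
# tr -cd deletes everything that is NOT in the given set, so only the
# suspicious bytes survive to be counted by wc.
count_c1_bytes() {
  LC_ALL=C tr -cd '\200-\237' < "$1" | wc -c | tr -d ' '
}

# Usage, over the pages fetched by wget (left commented out):
#   for f in index.html*; do
#     printf '%s %s\n' "$(count_c1_bytes "$f")" "$f"
#   done
```

A nonzero count on a page declared as iso-8859-1 suggests it is really Windows-1252.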