Hacklog: Blogamundo — poking holes in the language barrier since approximately 1 month from now

Unicode vs. Latin-1… DEATHMATCH

Written by Patrick Hall, 2 years, 7 months ago.

And now for some highly inaccurate but hopefully provocative statistics on the progress of Unicode on the web…

Emily Chang’s eHub is “a constantly updated list of web applications, services, resources, blogs or sites with a focus on next generation web (web 2.0), social software, blogging, Ajax, Ruby on Rails, location mapping, open source, folksonomy, design and digital media sharing.”

Holy smokes that’s a lotta buzzwords!

Presumably the folks building such web applications are pretty up-to-date with regard to web standards and such, and I was wondering how clued-in they were about character encodings.

So I took 5 minutes and got all the URLs out of eHub’s front page, put them in a file called “ehublinks.html”, and said this to my bash shell:


$ wget -i ehublinks.html
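$ # lowercase every line mentioning a charset, split on spaces, quotes and semicolons, then count each charset=... token
$ grep -i charset index.html* | tr 'A-Z' 'a-z' | tr ' ' '\12' | grep chars | tr '";' '\12' | grep chars | sort | uniq -c | sort -n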

1 charset
1 charset=
5 charset=windows-1252
66 charset=iso-8859-1
100 charset=utf-8

66 in latin-1, 100 in utf-8.

That’s out of two-hundred-some-odd pages I downloaded, so it’s hardly accurate. Plus there’s the fact that the encoding that pages claim to be in isn’t necessarily what the server is sending.

But whatever, ballparks, ballparks.

The good news is that Unicode (UTF-8) is winning; the bad news is that Latin-1 won’t be going away any time soon.

And the even worse news is that my crufty little survey is probably half wrong anyway, since servers don’t necessarily actually send the page in the encoding that the page says it’s in.
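For anyone who wants to check that caveat rather than just fret about it, here is a rough sketch of how the two claims could be compared. The function names `header_charset` and `meta_charset` are made up for this example, and the regexes are deliberately naive: they assume a `charset=...` token in the `Content-Type` header and in the page markup, and will miss anything more exotic.

```shell
#!/usr/bin/env bash
# Hypothetical helpers, not part of the original survey.

# Read HTTP response headers on stdin and print the charset named in the
# last Content-Type header (lowercased). Prints nothing if none is given.
header_charset() {
  tr -d '\r' |
    grep -i '^content-type:' |
    tail -n 1 |
    tr 'A-Z' 'a-z' |
    sed -n "s/.*charset=[\"']\{0,1\}\([a-z0-9_.:-]*\).*/\1/p"
}

# Read HTML on stdin and print the first charset=... value found in it
# (lowercased). Prints nothing if the page never mentions a charset.
meta_charset() {
  tr 'A-Z' 'a-z' |
    LC_ALL=C grep -ao "charset=[\"']\{0,1\}[a-z0-9_.:-]*" |
    head -n 1 |
    sed "s/charset=[\"']\{0,1\}//"
}

# Example use (needs network access, so it is left commented out):
#   url=http://example.com/
#   echo "header says: $(curl -sIL "$url" | header_charset)"
#   echo "page says:   $(curl -sL "$url" | meta_charset)"
```

When the two disagree, browsers generally trust the HTTP header over the tag in the page, so a page counted above as Latin-1 may well be arriving labeled as something else.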

But about those windows-1252 people… GOOD GRIEF.

1 Comment for 'Unicode vs. Latin-1… DEATHMATCH'

  1. Comment received 2 years, 3 months ago from John

    A lot of stuff marked Latin-1 is actually Win-1252, since it contains curly quotes, which are valid characters in 1252 but control characters in Latin-1
    http://intertwingly.net/stories/2004/04/14/i18n.html#CleaningWindows
    http://en.wikipedia.org/wiki/Windows-1252
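
John's point can be checked mechanically, at least roughly. Bytes 0x80 through 0x9F are control characters in ISO-8859-1 but printable characters (curly quotes, dashes and friends) in Windows-1252, so a page that claims Latin-1 and contains any of them is probably 1252 in disguise. The sketch below counts those bytes; `c1_byte_count` is a made-up name, and the check only makes sense for pages declared as Latin-1, since UTF-8 uses the same byte values as continuation bytes.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count the bytes on stdin that fall in 0x80-0x9F
# (octal 200-237). tr -cd deletes everything outside that range, and
# wc -c counts what is left.
c1_byte_count() {
  LC_ALL=C tr -cd '\200-\237' | wc -c | tr -d ' '
}

# Example use on one of the downloaded pages (file name is illustrative):
#   c1_byte_count < index.html
```

A count of zero does not prove the page is honest Latin-1, only that it never uses the characters where the two encodings differ.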
