If you’re new to Unicode, or even if you’re not, this is a great summary of the “big picture.”
(And speaking of big pictures of Unicode… this one is certainly not to be missed: “I estimated I could print the whole thing on about a 36″×36″ poster. Well, my estimates were off. It turned out to be about 6 feet by 12 feet…”)
That is to say, if I put some text in a box, can you see it?
Some text like this:
Kaj je Unicode?
Well, you could see that for sure, even if you couldn’t read it. (It happens to be Slovenian.) But what about:
유니코드에 대해?
I’d bet you could see that. Well… maybe. Actually, I’m not sure. (If you must know, it’s Korean.)
ዩኒኮድ ምንድን ነው?
I’m pretty darn sure you couldn’t see that, but I can’t be positive.
This is actually a pretty serious problem: if you write server-side software, you don’t really know what languages your users can see (well, what scripts they can see; several languages can be written in the same writing system, after all), because you don’t know what fonts they have installed.
In the next exciting episode of Blogamundo Hacklog, I’m going to be stepping through a bit of JavaScript that Jim Ley wrote on his blog, which seems to be a very good first step towards resolving this problem.
I wrote a little wrapper for his code and it seems to succeed in answering the question “what languages can I send this user and be confident that they will be rendered at least somewhat?”
But it’s not done yet.
So I guess this post was kind of a commercial for the next post.
PS. The last one was Amharic. Thought I wasn’t gonna tell you, huh. ☺
When you’re trying to build a multilingual site, as we are, one of the problems you have to solve is how to help people choose between languages. A traditional approach to this is to use flag icons.
I think that method of choosing between languages is a mistake. Why? Well, aside from the fact that you’d have to futz around with a bazillion tiny icons, put it this way:
Quick! Choose your language.
I ran across a 2002 article from Jukka Korpela which makes just the points I’ve been thinking of. Despite its age, it’s more relevant than ever, because we have widespread Unicode. He describes the right way to designate languages:
There is a perfect symbol for any language which you can use on the Web: the name of the language in the language itself, e.g. English (or British English or US English, if needed), svenska, suomi, Deutsch, français. … If a reader doesn’t know the name of language X in X, he probably does not know X enough for the link to be of use to him.
And, I would add, if the user doesn’t have a font installed to read, say, বাংলা or Česky or 中文 or Македонски or 日本語 or whatever… the links will probably show up as a bunch of question marks or boxes or something (as some of those language names may have for you).
Flags don’t represent languages!
One could argue such unreadable links constitute a usability problem with your site, but if the user doesn’t have the fonts for the language in question, why would they follow the link anyway?
An alternative is something like the following, with the localized translation of the name following the name itself:
বাংলা Bengali
Česky Czech
中文 Chinese
Македонски Macedonian
日本語 Japanese
(These would probably be links or perhaps radio buttons.)
Or even use the increasingly ubiquitous language codes to give folks without the font in question a hint (ignoring for the moment that the ISO 639 language codes are a bit of a mess themselves):
বাংলা [ben]
Česky [cze]
中文 [zho]
Македонски [mac]
日本語 [jpn]
And then there is the question of how you want to mark up the choices — assuming your list isn’t too terribly long, a drop-down might suffice:
(That select box doesn’t do anything, of course…)
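For the curious, here is a rough Ruby sketch of how such a drop-down could be generated from the native names and codes listed above. The method name language_select and the form field name "language" are made up for illustration; this is one possible approach, not how any particular site does it.

```ruby
require 'cgi'

# Native name plus the language code, taken from the list above.
LANGUAGES = [
  ["বাংলা", "ben"],
  ["Česky", "cze"],
  ["中文", "zho"],
  ["Македонски", "mac"],
  ["日本語", "jpn"]
].freeze

# Build a <select> whose options are labeled "native name [code]".
# The default field name "language" is a hypothetical choice.
def language_select(languages, name = "language")
  options = languages.map do |native_name, code|
    label = CGI.escapeHTML("#{native_name} [#{code}]")
    %(  <option value="#{CGI.escapeHTML(code)}">#{label}</option>)
  end
  [%(<select name="#{CGI.escapeHTML(name)}">), *options, "</select>"].join("\n")
end

puts language_select(LANGUAGES)
```

The code in brackets gives readers without the right font something legible to go on, which is the whole point of the hint.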
In any case, use the language name, not a flag. Wikipedia’s front page has the right idea.
I ran across the website of Lãngüagê Liñè (do they read Sam Ruby?), a telephone-based interpretation site in the UK, and poked around in their translations.
Now it is true that taking an image of text in an unfamiliar script and transcribing it is a pretty hard thing to do if you don’t know the language.
But.
Why would you turn text into images in a language whose alphabet only contains 20-odd roman letters with nary a solitary accent? Here’s a screenshot of the… er.. image:
Fortunately it seems that the UN site has seen the error of its ways and is now actively soliciting the help of transcribers to get those images transcribed.
But good grief. It makes no sense whatsoever to use images for a language like Somali… this isn’t even localization… it’s just common sense.
(I don’t know why the Somali language is following me around the internet, but I’ve blogged about it before.)
Written by Jonas Galvez,
2 years, 6 months ago. Tags: Code, json, regex, ruby, yaml.
I wrote a new Ruby JSON serializer using Jamis Buck’s Ruby Tokenizer. It’s an order of magnitude safer than my previous attempt. I still use inspect() to create a string representation of the object, but now I’m using the tokenizer to safely replace => by : (properly ignoring => inside strings, that is) instead of that heinous regular expression.
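require 'rubygems'
require 'syntax' # the "syntax" gem; require_gem was removed from later RubyGems releases

class RubyJSONSerializer
  @@tokenizer = Syntax.load "ruby"

  def ruby_to_json(obj)
    res = ""
    symbol = false # true right after a symbol key has been emitted
    @@tokenizer.tokenize(obj.inspect) do |token|
      if token.group == :punct
        if symbol
          # the regex below expects the symbol token to carry the "=" (e.g. :bar=),
          # so only the leftover ">" has to be dropped here
          res += token.gsub(/>/, '')
          symbol = false
        else
          # a real hash rocket outside of any string: turn it into a JSON colon
          res += token.gsub(/=>/, ': ')
        end
      elsif token.group == :symbol
        # emit the symbol key as a quoted JSON key followed by a colon
        res += "\"#{token.match(/:(.+)=/)[1]}\": "
        symbol = true
      else
        # strings, numbers and everything else pass through untouched
        res += token
      end
    end
    res
  end
end

s = RubyJSONSerializer.new
o = {"foo"=>"foo => bar", :bar=>"bar => foo", "meh"=>[1,2,3], :n=>123}
puts s.ruby_to_json(o)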
Let me know if you spot anything wrong. Update: since a couple friends asked me about this, I thought I’d clarify it here: I’m using Ruby’s built-in YAML parser (syck, I believe, which is also available for Python) to parse JSON. It’s been flawless so far.
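To make the YAML remark concrete, here is a minimal sketch of parsing a small JSON document with Ruby's yaml library. It assumes a current Ruby, where the bundled parser is Psych rather than syck, and a deliberately simple document; unusual JSON edge cases may not all parse the same way.

```ruby
require 'yaml'

json_text = '{"foo": "foo => bar", "meh": [1, 2, 3], "n": 123, "ok": true, "none": null}'

# YAML's flow syntax covers this JSON document, so the YAML parser reads it directly.
data = YAML.safe_load(json_text)

p data["foo"]
p data["meh"]
```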
Update 2: I should also point out that this only works for primitive datatypes. If you have an object whose to_s method is not JSON-enabled, beware that the serializer will silently produce invalid JSON. To avoid this, make sure you’re creating a Ruby object from scratch:
# good
obj = ruby_to_json({"foo"=>MyActiveRecordInstance.foo, "bar"=>MyActiveRecordInstance.bar})
# bad
obj = ruby_to_json(MyActiveRecordInstance)
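One way to follow that advice, sketched with a hypothetical Struct standing in for an ActiveRecord instance: copy the fields you want into a plain hash of primitives first. The check at the end uses Ruby's standard json library instead of the serializer above, purely to confirm that a hash built this way round-trips cleanly.

```ruby
require 'json'

# Hypothetical stand-in for a model object; not a real ActiveRecord class.
Person = Struct.new(:name, :age)
person = Person.new("Zoë", 30)

# Build the object to serialize from scratch, using only primitive values.
plain = { "name" => person.name, "age" => person.age }

json = JSON.generate(plain)
puts json

# Parsing the output back should give the same hash.
puts JSON.parse(json) == plain
```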
Written by Jonas Galvez,
2 years, 6 months ago. Tags: Code, json, regex, ruby, yaml.
JSON is valid YAML, valid Python, but not… valid Ruby. That’s a bummer. Fortunately, Ruby’s object notation is very close to JSON, and with the help of inspect() and a couple of most-likely-not-so-smart regular expressions, you can convert a simple Ruby object into JSON with 2 lines of code:
That might have been a little too much for my regex skills, so watch out. You’ll notice I’m not using negative look-behinds, but that’s because Ruby still has no support for them (they arrive only in Ruby 1.9). [Update: I posted a new implementation using a much safer technique]
Here is a piece of advice for the gearheads. Posted at 6 am. Y’all know who you are, you’ve been where we are right now, oh bleary-eyed brethren. Here it is:
Never test an application under development with plain-ASCII content.
Just don’t bother.
Yes, you will establish that stuff is being saved where it should be, that data is flowing wherever it needs to flow, that your models and view and controllers and bits and bobs and kitchen sinks are all cooperating in Rube Goldbergian glory.
But you haven’t done anything to test whether all the pieces are cooperating with regard to encodings.
In other words, you’re kidding yourself.
Don’t say this:
“Hmm, I’ll just stick some sample text into this text entry box to make sure everything’s working… let’s see… I’ll type in abcde.”
Don’t do that.
Please don’t do that.
Do this:
“Hmm, let’s go over here and look at the funny-looking Unicode-encoded scripts on this page. What is this one, Zlatiborian? “ዩኒኮድ ምቃሩ?” “유니코드에 대해?” “Unicode คืออะไร?” “რა არის უნიკოდი?” “यूनिकोड क्या है?” What is all this stuff? I have no idea! Yeah, that sounds about right. Let’s stick all that in the text box!”
Now you’re talking. Cut and paste some of that Zlatiborian (or whatever) into your application, my friend. You don’t have to be able to understand it. You don’t have to read it. You don’t even have to recognize it.
Because you can recognize gibberish when you see it. I recently learned a neat word for the phenomenon: Mojibake. Here’s a page in Russian. Even if you’ve never read a word of Russian in your life, you’ll see where the examples of Mojibake are on that page.
Do it early, do it often. The Mojibake will come creeping out of the woodwork, believe me. But you’ve got to fix it eventually, right?
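Here is a small Ruby sketch of why ASCII test data hides the problem. The method through_misconfigured_layer is a made-up stand-in for any component (a database connection, a form handler, whatever) that wrongly assumes incoming bytes are ISO-8859-1; real bugs will differ in the details.

```ruby
# Simulate a layer that wrongly treats UTF-8 bytes as ISO-8859-1 (Latin-1)
# and "helpfully" converts them to UTF-8 again.
def through_misconfigured_layer(text)
  text.dup.force_encoding("ISO-8859-1").encode("UTF-8")
end

ascii   = "abcde"
unicode = "유니코드에 대해? Unicode คืออะไร?"

# ASCII comes out unchanged, so the bug stays invisible.
puts through_misconfigured_layer(ascii) == ascii     # prints true

# Non-ASCII text comes out as Mojibake, so the bug shows up immediately.
puts through_misconfigured_layer(unicode) == unicode # prints false
```

Same broken layer, two very different test results. Only the second one tells you anything.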
P.S. By the way, chances are that you didn’t have the fonts for all the excerpts of uncommon scripts I stuck in that quote up there. If you’re writing software that is going to have multilingual users, maybe you should think about investing in some fonts? Or at least, downloading some?
This can make it really hard to evaluate websites for accessibility. And yet designers have to deal with these browsers. Here, for instance, is a recent article with some CSS tips related to screen readers: Simple, accessible “more” links.
Since Blogamundo is all about texts in various languages, a question arises:
How do screen readers identify the language of the text they’re reading?
I have buckets of links about automatic language identification at del.icio.us, and I’ve read a fair amount about the statistical approaches to language identification. (The approach used in Mozilla is a familiar application.) But do screen readers depend upon such an approach, or do they depend on markup? If so, what kind of markup do they look for?
This is all quite apart from the question of which languages screen readers can pronounce, which is an interesting question in its own right.
So anyway, if anyone out there has some experience with screen readers, I would love to hear about how they handle languages.