hacklog

Unicode Script property and Javascript

Written by Patrick Hall, November 21st, 2008

Dear lazyweb,

I would like a Javascript function to work like this:

magicalFunction('カ')
→ 'Katakana'

magicalFunction('a')
→ 'Latin'

magicalFunction('አ')
→ 'Ethiopic'

In other words, I want to be able to access the script property described in UAX #24: Script Names.

This actually exists already in Perl regular expressions, where you can just say \p{Katakana} in a regex to match Katakana characters.

Maybe such a thing could end up in the next version of Javascript… not that I have the slightest idea where to make that suggestion. But in the meantime, it seems to me that there should be a unicodescripts.js or some such.

Any ideas on what would be an efficient programming approach to implementing such a data structure, something that might be reasonably squeezed into a .js file?
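
For what it's worth, JavaScript engines that implement Unicode property escapes (standardized in ES2018) accept `\p{Script=Katakana}` in a regex with the `u` flag, much like Perl. Here is a rough sketch of `magicalFunction` built on that. It assumes such an engine, and the `CANDIDATE_SCRIPTS` list is made up for this example and is nowhere near the full set of script names in UAX #24.

```javascript
// Hypothetical, incomplete list of scripts to test against.
// Extend it with whichever UAX #24 script names you care about.
const CANDIDATE_SCRIPTS = ['Latin', 'Katakana', 'Hiragana', 'Ethiopic', 'Cyrillic', 'Greek', 'Han'];

// Build one anchored regex per script up front, e.g. /^\p{Script=Katakana}$/u.
// The `u` flag is required for \p{...} and makes the regex match whole code points.
const SCRIPT_TESTS = CANDIDATE_SCRIPTS.map((name) => ({
  name,
  regex: new RegExp(`^\\p{Script=${name}}$`, 'u'),
}));

// Returns the first candidate script whose regex matches the single
// character `ch`, or null if none of the candidates match.
function magicalFunction(ch) {
  const match = SCRIPT_TESTS.find(({ regex }) => regex.test(ch));
  return match ? match.name : null;
}

console.log(magicalFunction('カ'));
console.log(magicalFunction('a'));
console.log(magicalFunction('አ'));
```

Trying each script in turn is linear in the number of candidates, which is fine for a handful of scripts but not a substitute for a real lookup table.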

Update: Longtime reader Edward O’Connor emails to suggest xregexp:

…you should check out the unicode plugin for xregexp:


http://blog.stevenlevithan.com/archives/xregexp-unicode-plugin

This does pretty much exactly what you want.

Lazyweb, the greatest programming platform in history!

In vaguely related news, rubyistas out there should check out Edward’s talk from MerbCamp.

4 Comments for 'Unicode Script property and Javascript'

  1. Comment received November 21st, 2008 from Robin

    In ICU it’s implemented with a large C array, where the indices are the code points and the values the code for the script. But that’s not really an option for Javascript.

    Maybe the whole data could be encoded into a large binary string and then you’d have special functions to access the right bits from the string.

    Or, as the data is a mapping from a range (of numbers) to a value, you could store the ranges in a tree. Finding an element is not as fast as with the string (logarithmic instead of O(1)) but as memory-efficient as it gets.

    Some time ago I implemented another approach in Python, a stack of dicts, which is kind of a compromise of the above solutions in terms of speed and memory. Contact me if you’re interested in the code.

    Anyway, interesting problem :).

  2. Comment received November 21st, 2008 from Edward O'Connor

    You’re looking for this.

  3. Comment received November 21st, 2008 from dda

    http://www.sungnyemun.org/ScriptName.html

  4. Comment received November 21st, 2008 from Patrick Hall

    Thanks folks… yeah, dda that seems to be a workable solution.

    Edward, the one you linked is great but it seems to only handle looking up blocks, not scripts, which happens to be slightly different from what I was looking for.

    Dda, maybe you should get in touch with the guy who wrote the plugin Edward linked?
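
The range idea from Robin's comment (store ranges of code points, look them up in logarithmic time) can be sketched in a few lines. This version swaps the tree for a binary search over a sorted array, which gives the same logarithmic lookup. The `SCRIPT_RANGES` table and the `scriptOf` name are assumptions for illustration: the table holds only a few sample ranges, and a real one would be generated from the Scripts.txt data file that accompanies UAX #24.

```javascript
// Illustrative excerpt only: [firstCodePoint, lastCodePoint, scriptName],
// sorted by firstCodePoint and non-overlapping. A real table would be
// generated from Scripts.txt and contain many more rows.
const SCRIPT_RANGES = [
  [0x0041, 0x005A, 'Latin'],    // A..Z
  [0x0061, 0x007A, 'Latin'],    // a..z
  [0x1290, 0x12B0, 'Ethiopic'], // one of several Ethiopic ranges
  [0x30A1, 0x30FA, 'Katakana'],
];

// Binary search for the range containing the first code point of `ch`.
// Returns the script name, or null if the code point is not in the table.
function scriptOf(ch) {
  const cp = ch.codePointAt(0);
  if (cp === undefined) return null; // empty string

  let lo = 0;
  let hi = SCRIPT_RANGES.length - 1;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    const [first, last, name] = SCRIPT_RANGES[mid];
    if (cp < first) {
      hi = mid - 1;
    } else if (cp > last) {
      lo = mid + 1;
    } else {
      return name;
    }
  }
  return null;
}

console.log(scriptOf('カ'));
console.log(scriptOf('a'));
console.log(scriptOf('አ'));
```

Because adjacent code points usually share a script, the range table stays far smaller than one entry per code point, which is what makes it plausible to squeeze into a .js file.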
