hacklog

Unicode Normalization in Ruby?

Written by Patrick Hall, July 19th, 2008

Last week I gave a little talk on Unicode at the DC Ruby Users Group. I have met some really interesting folks in that group; if you’re in the DC area and into Ruby I highly recommend it.

The talk was a high-level overview of “why Unicode matters,” more than a nitty-gritty down-to-the-bits sort of thing. In my experience the former issue is often more problematic than the latter, so that’s where I focused my attention.

Anyway, as a result of chatting with some folks I decided I would try to get a couple of very small-scale open source Ruby projects rolling. There are some READMEs scribbled at github.com, and I’m going to try to work regularly there.

I have two initial ideas:

  1. Trying to do a pure-Ruby port of Python’s unicodedata module
  2. Statistical language identification with Ruby 1.9

#1 is something I miss a lot in Ruby.
#2 is something I’ve had some success with in Python already, and I’d like to get it running in Ruby and turn it into a gem or something, as I imagine it would be of use to others.

Basically, I’m interested in collaborating on Ruby stuff that intersects with language, i18n, l10n, and all the rest of the stuff I babble about around these parts. Comments welcome…

PS: This stuff will be GPL’d. Free Software FTW.

4 Comments for 'Unicode Normalization in Ruby?'

  1. Comment received July 19th, 2008 from RSL

    Hey, not sure it fits your needs but I ported Perl’s Unidecode over to Ruby. It’s available as a gem [http://rubyforge.org/projects/unidecode/] or as part of my Rails plugin Stringex [http://github.com/rsl/stringex/tree/master]. Very excited about checking out some of your work. That Python Unicodedata seems a lot different than the Perl library I ported.

  2. Comment received July 19th, 2008 from Patrick Hall

    Hey Russell,

    Just installed your gem, will take a look. I believe I have run across Text::Unidecode somewhere in my sordid past… Sean Burke writes neat stuff.

    Have you ever seen John C. Cowan’s DiacriticFolding.txt?

    There’s a Unicode Technical Report:

    http://unicode.org/reports/tr30/

    And the file:

    http://unicode.org/reports/tr30/datafiles/DiacriticFolding.txt

    I have a hideous Python script that uses that file to remove accents (much though it pains my diacritico-phile nature!):

    http://ruphus.com/svn/accents/accents.py

    The lines:

    text = normalize('NFC', text)
    text = normalize('NFD', text)

    …are one of the things I want to figure out how to do in Ruby. (That’s actually what I was planning on blogging about in this post, and forgot to, as the title indicates!)
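For readers who have not met those two calls, here is a minimal Python sketch of what NFC and NFD do. It is not the accents.py script (which relies on DiacriticFolding.txt); it swaps in a simpler, well-known trick: decompose with NFD, drop the combining marks, then recompose. It assumes Python 3 string literals, and `strip_accents` is a hypothetical helper name invented for this example.

```python
import unicodedata


def strip_accents(text):
    """Hypothetical helper: remove combining marks via NFD, then recompose."""
    # NFD splits precomposed characters into base letter + combining marks.
    decomposed = unicodedata.normalize('NFD', text)
    # General category 'Mn' (nonspacing mark) covers the combining accents.
    stripped = ''.join(
        ch for ch in decomposed if unicodedata.category(ch) != 'Mn'
    )
    # NFC recomposes whatever can still be composed.
    return unicodedata.normalize('NFC', stripped)


# 'e' followed by COMBINING ACUTE ACCENT composes to a single code point.
composed = unicodedata.normalize('NFC', 'e\u0301')
# The precomposed character decomposes back into two code points.
decomposed = unicodedata.normalize('NFD', '\u00e9')

print(len(composed), len(decomposed))  # 1 2
print(strip_accents('caf\u00e9'))      # cafe
```

This only handles accents that decompose canonically; letters such as "ø" or "ł" have no decomposition, which is one reason a folding table like DiacriticFolding.txt is still useful.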

  3. Comment received July 19th, 2008 from Robin

    Hi! I’m interested in how you plan to do language identification? I have used a method of comparing the most frequent trigrams in a text to reference models and then picking the best matching model. It’s similar to how languid does it and works pretty well. What technique do you use?

  4. Comment received July 19th, 2008 from Patrick Hall

    Hey Robin,

    I used a bigram model and cosine similarity. But there are a bunch of possible algorithms out there. What I would really like to do would be to set up a good testing infrastructure and sample corpus (starting with the UDHR), and test a bunch of algorithms, find the fastest, most accurate option…
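To make "bigram model and cosine similarity" concrete, here is a minimal Python sketch of the general idea. It is not the implementation mentioned above: the function names are hypothetical, and the two training strings are toy sentences made up for the example, where a real model would be built from a corpus such as the UDHR.

```python
import math
from collections import Counter


def bigram_profile(text):
    """Count overlapping character bigrams in lowercased text."""
    text = text.lower()
    return Counter(text[i:i + 2] for i in range(len(text) - 1))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse bigram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def identify(text, models):
    """Return the language whose profile is most similar to the text."""
    profile = bigram_profile(text)
    return max(models, key=lambda lang: cosine_similarity(profile, models[lang]))


# Toy training data, invented for this sketch (an assumption, not a corpus).
samples = {
    'english': 'the quick brown fox jumps over the lazy dog and then the other thing',
    'spanish': 'el rápido zorro marrón salta sobre el perro perezoso y luego la otra cosa',
}
models = {lang: bigram_profile(text) for lang, text in samples.items()}

print(identify('the dog', models))   # english
print(identify('el perro', models))  # spanish
```

With profiles this small the result is only reliable for inputs that share bigrams with the toy sentences, which is exactly why a proper test corpus matters.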

    Another thing which is missing in my current implementation, and in language id in general so far as I can tell, is taking full advantage of the script information which is built into Unicode. Languid does some of that, but it’s hand crafted, and it seems to me that it would be better to extract the character→script mappings from Unicode’s own data sources, namely Scripts.txt, documented here.
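As a rough illustration of pulling character→script mappings out of a Scripts.txt-style file, here is a small Python sketch. The `SAMPLE` text is a handful of hand-typed lines in the "code point range ; script name # comment" layout that the file uses; it is not a copy of the real data file, and `parse_scripts` and `script_of` are hypothetical names.

```python
import bisect

# Hand-typed lines in the layout Scripts.txt uses (NOT the real file).
SAMPLE = """\
# comment lines start with a hash
0041..005A    ; Latin # A-Z
0061..007A    ; Latin # a-z
00AA          ; Latin # a single code point
0391..03A1    ; Greek
0410..044F    ; Cyrillic
05D0..05EA    ; Hebrew
"""


def parse_scripts(lines):
    """Return a sorted list of (start, end, script) tuples."""
    ranges = []
    for line in lines:
        line = line.split('#', 1)[0].strip()  # drop comments and blanks
        if not line:
            continue
        codepoints, script = (field.strip() for field in line.split(';'))
        start, _, end = codepoints.partition('..')
        # A line without '..' describes a single code point.
        ranges.append((int(start, 16), int(end or start, 16), script))
    ranges.sort()
    return ranges


def script_of(char, ranges):
    """Look up a character's script; 'Unknown' if no range covers it."""
    cp = ord(char)
    # Find the last range whose start is <= cp (ranges do not overlap).
    i = bisect.bisect_right(ranges, (cp, float('inf'), '')) - 1
    if i >= 0 and ranges[i][0] <= cp <= ranges[i][1]:
        return ranges[i][2]
    return 'Unknown'


ranges = parse_scripts(SAMPLE.splitlines())
print(script_of('A', ranges))       # Latin
print(script_of('\u044f', ranges))  # Cyrillic
```

A language identifier could use such a lookup as a first pass: text that is entirely Hebrew or Cyrillic script rules out most candidate languages before any n-gram statistics are computed. Characters missing from the tiny sample (digits, for instance) come back as 'Unknown' here, whereas the real file assigns them a script.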
