hacklog

The Zero-width Space

Written by Patrick Hall, December 28th, 2006

Here’s something I need to look up, but I thought I’d blog it first, so you can share in my confusion (or alleviate it).

Several languages don’t use spaces to separate words: Thai, Chinese, Japanese, Khmer, Lao… (I’ve blogged about this elsewhere before).

But I’m pretty sure it’s safe to say that every language has “words.” You have to be able to identify words if you want to create lexicons, be they monolingual or bilingual.

Now, Unicode has this code point called U+200B ZERO WIDTH SPACE. It seems to me (and this is what I need to look up) that one could use said character to represent the divisions between words without actually “damaging” the orthography of the languages that don’t officially divide sequences of characters into words.
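To make the idea concrete, here’s a minimal Python sketch of what I have in mind. It assumes the words have already been segmented somehow (the Thai example is hand-segmented, and the function names are just ones I made up for illustration); it only shows that U+200B can carry the boundaries invisibly and be stripped out again without touching the visible text.

```python
# Sketch only: assumes the text is already split into words by some
# other means (here, by hand). Function names are hypothetical.

ZWSP = "\u200b"  # U+200B ZERO WIDTH SPACE


def mark_boundaries(words):
    """Join pre-segmented words with invisible ZWSP separators."""
    return ZWSP.join(words)


def split_on_boundaries(text):
    """Recover the word list from ZWSP-marked text."""
    return text.split(ZWSP)


def strip_boundaries(text):
    """Remove every ZWSP, giving back the unmarked orthography."""
    return text.replace(ZWSP, "")


# "ภาษาไทย" ("Thai language"), hand-segmented into two words.
words = ["ภาษา", "ไทย"]
marked = mark_boundaries(words)

print(split_on_boundaries(marked) == words)       # boundaries survive a round trip
print(strip_boundaries(marked) == "".join(words))  # visible text is unchanged
```

The marked string is one code point longer than the original per boundary, so anything that compares or indexes strings would need to strip or ignore U+200B first.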

What I don’t know:

  • Whether that’s what this character is for
  • Whether people actually already do that in any such language when typing (somehow)
  • Whether this character would be appropriate to insert automatically with software that identifies word boundaries automatically (spellcheckers, for instance)

I’m guessing that the answers are “Yes, No, Yes.”

(I just posted this before looking it up on the outside chance that some informed person might drop by and get an answer out there for the search engines… will update…)

Types, Tokens, Typology

Written by Patrick Hall, December 22nd, 2006

Introduction to Information Retrieval book online

Written by Patrick Hall, December 15th, 2006

Goblin

Written by Patrick Hall, December 15th, 2006

New Public CLDR Mailing List

Written by Patrick Hall, December 13th, 2006

A list for you to think about

Written by Patrick Hall, December 12th, 2006

Okay, sometimes it’s funny.

Written by Patrick Hall, December 11th, 2006

Ascii-only Languages

Written by Patrick Hall, December 6th, 2006