The Zero-width Space
Here’s something I need to look up, but I thought I’d blog it first, so you can share in my confusion (or alleviate it).
Several languages don’t use spaces to separate words: Thai, Chinese, Japanese, Khmer, Lao… (I’ve blogged about this elsewhere before).
But I’m pretty sure it’s safe to say that every language has “words.” You have to be able to identify words if you want to create lexicons, be they monolingual or bilingual.
Now, Unicode has this code point called U+200B ZERO WIDTH SPACE. It seems to me (and this is what I need to look up) that one could use said character to represent the divisions between words without actually “damaging” the orthography of the languages that don’t officially divide sequences of characters into words.
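To make the idea concrete, here's a minimal sketch in Python. It assumes the words have already been segmented by hand or by some other tool; the function names and the two-word Thai sample ("ภาษา" + "ไทย", i.e. "Thai language") are my own illustrations, not part of any standard or library. All it shows is that U+200B can sit between words and be stripped back out to recover the original, unspaced text.

```python
import unicodedata

ZWSP = "\u200b"  # U+200B ZERO WIDTH SPACE


def mark_word_boundaries(words):
    """Join already-segmented words with U+200B (hypothetical helper)."""
    return ZWSP.join(words)


def split_on_boundaries(text):
    """Recover the word list from text marked with U+200B."""
    return text.split(ZWSP)


def strip_boundaries(text):
    """Remove every U+200B, giving back the unmarked text."""
    return text.replace(ZWSP, "")


# Hand-segmented sample; the segmentation itself is assumed, not computed.
words = ["ภาษา", "ไทย"]
marked = mark_word_boundaries(words)

print(unicodedata.name(ZWSP))               # the character's official Unicode name
print(len("".join(words)), len(marked))     # marked text is one code point longer
print(split_on_boundaries(marked) == words)
print(strip_boundaries(marked) == "".join(words))
```

The marked string should look the same as the unmarked one on screen, but it's a different sequence of code points, so anything that compares or searches text would need to strip or ignore U+200B first.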
What I don’t know:
- Whether that’s what this character is for
- Whether people actually already do that in any such language when typing (somehow)
- Whether it would be appropriate for software that identifies word boundaries (spellcheckers, for instance) to insert this character automatically
I’m guessing that the answers are “Yes, No, Yes.”
(I just posted this before looking it up on the outside chance that some informed person might drop by and get an answer out there for the search engines… will update…)
