The Zero-width Space
Here’s something I need to look up, but I thought I’d blog it first, so you can share in my confusion (or alleviate it).
Several languages don’t use spaces to separate words: Thai, Chinese, Japanese, Khmer, Lao… (I’ve blogged about this elsewhere before).
But I’m pretty sure it’s safe to say that every language has “words.” You have to be able to identify words if you want to create lexicons, be they monolingual or bilingual.
Now, Unicode has this code point called U+200B ZERO WIDTH SPACE. It seems to me (and this is what I need to look up) that one could use said character to represent the divisions between words without actually “damaging” the orthography of the languages that don’t officially divide sequences of characters into words.
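To make the idea concrete, here is a minimal Python sketch. The two Thai words are just an illustrative sample I picked, not anything from a real segmenter. It shows that joining words with U+200B adds code points to the string but leaves the visible text recoverable by simply removing them.

```python
import unicodedata

ZWSP = "\u200b"  # U+200B ZERO WIDTH SPACE

# Illustrative sample only: "language" + "Thai".
words = ["ภาษา", "ไทย"]

plain = "".join(words)     # how the text is normally written
marked = ZWSP.join(words)  # same text with an invisible word boundary

print(unicodedata.name(ZWSP))             # official character name
print(len(plain), len(marked))            # marked is one code point longer
print(marked.replace(ZWSP, "") == plain)  # stripping ZWSP restores the original
```

Whether a given renderer really draws `marked` identically to `plain` depends on the font and the application, so treat the "no damage" claim as something to verify per platform.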
What I don’t know:
- Whether that’s what this character is for
- Whether people actually already do that in any such language when typing (somehow)
- Whether this character would be appropriate to insert automatically with software that identifies word boundaries automatically (spellcheckers, for instance)
I’m guessing that the answers are “Yes, No, Yes.”
(I just posted this before looking it up on the outside chance that some informed person might drop by and get an answer out there for the search engines… will update…)
7 comments.
Yes, that is absolutely what ZWSP is for. However, the main use is neither to assist lexical analysis nor to record its results (though those are quite plausible uses) but to reflect valid word breaks in Thai.
Thai, as you say, does not mark word breaks visibly, but (unlike Chinese and Japanese) allows line breaks only at word breaks. Consequently, deciding where to word-wrap Thai text requires a complex morphological analyzer that knows quite a lot about Thai and still sometimes gets it wrong — unless ZWSP is used to assist. That’s why (the last I heard) there was still a separate Thai edition of Windows, even though all other editions are now interchangeable if you alter the i18n settings.
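As a rough illustration of how ZWSP can assist line breaking, here is a minimal Python sketch of a wrapper that breaks only at ZWSP positions. The function name `wrap_at_zwsp` is hypothetical, and it assumes that `len()` is an acceptable stand-in for display width, which is not true in general for Thai because combining vowels and tone marks take no column of their own.

```python
ZWSP = "\u200b"  # U+200B ZERO WIDTH SPACE

def wrap_at_zwsp(text, width):
    """Break text into lines of at most `width` code points,
    breaking only where a ZWSP marks a word boundary.
    A single word longer than `width` is kept whole on its own line."""
    lines = []
    current = ""
    for word in text.split(ZWSP):
        if current and len(current) + len(word) > width:
            lines.append(current)
            current = word
        else:
            current += word
    if current:
        lines.append(current)
    return lines

# Illustrative sample text: "language", "Thai", repeated.
sample = ZWSP.join(["ภาษา", "ไทย", "ภาษา", "ไทย"])
for line in wrap_at_zwsp(sample, 7):
    print(line)
```

Note that the returned lines no longer contain the ZWSP characters; a real layout engine would leave the stored text alone and only use the break opportunities.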
I should also point out that the notion of “word” in Chinese is more than a little difficult. The word for “word” is ci2, which is a technical term; Chinese people don’t naturally think of their speech or writing as being organized at that level. Instead, the normal level of organization (and of line-breaking) is at the zi4 level, which can be translated “character”, “morpheme”, or “syllable” — all of them technical terms in English.
Hi John,
Word breaks are what I had in mind. I tend to use “lexicon” to mean something like “machine readable dictionary,” and even then, only in the simplest sense; so I wasn’t thinking of morphemes or anything fancy like that. What I’m mostly concerned with is extracting words from text so I can look them up in bilingual Thai/* dictionaries.
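If the text already carries ZWSP at word breaks, the extraction step becomes trivial. Here is a minimal Python sketch of what I mean; the two-entry lexicon and the `gloss` function are made up for illustration, and a real Thai/* dictionary would of course be much larger.

```python
ZWSP = "\u200b"  # U+200B ZERO WIDTH SPACE

# Toy bilingual lexicon; the entries are illustrative only.
THAI_ENGLISH = {"ภาษา": "language", "ไทย": "Thai"}

def gloss(text, lexicon):
    """Split text on ZWSP and pair each word with its translation.
    Words missing from the lexicon are paired with None."""
    words = [w for w in text.split(ZWSP) if w]
    return [(w, lexicon.get(w)) for w in words]

print(gloss("ภาษา" + ZWSP + "ไทย", THAI_ENGLISH))
```

One thing to watch: Python's `str.split()` with no argument does not treat U+200B as whitespace, so the separator has to be passed explicitly.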
The definition of “word” in the context of Chinese is a whole ‘nother ball of wax, as the saying goes. Lately for whatever reason I’ve been thinking about the Thai/Lao/Khmer case — those seem to be birds of a feather.
All of this reminds me that I have a half-written blog post with some thoughts about statistical word splitting that I should finish up and post…
Thanks for dropping by!
Lao Script for Windows, an add-on to help use Lao with Windows applications, has used automatic insertion of ZWSP to split Unicode Lao text for several years, since MS do not yet provide display-time wrapping for Lao. Most but not all Unicode-aware applications handle it correctly. Lao is simpler than Thai (in some ways) as spelling is phonetic, so syllable-rule-based insertion works reasonably well. But LSWin currently also provides dictionary-based insertion/wrapping. More info about this is on my website, or email me directly.
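For readers wondering what dictionary-based insertion might look like, here is a minimal Python sketch using greedy longest-match against a word list. This is a generic textbook technique and is not a description of how LSWin actually works; the function name `insert_zwsp` and the two-word lexicon are assumptions made for illustration.

```python
ZWSP = "\u200b"  # U+200B ZERO WIDTH SPACE

def insert_zwsp(text, lexicon):
    """Insert ZWSP between words found by greedy longest-match.
    Runs of characters that match no lexicon entry are kept together."""
    max_len = max((len(w) for w in lexicon), default=0)
    pieces = []
    unknown = []  # characters not covered by any lexicon entry
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shorter ones.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in lexicon:
                if unknown:
                    pieces.append("".join(unknown))
                    unknown = []
                pieces.append(candidate)
                i += size
                break
        else:
            # No entry starts here; extend the current unknown run.
            unknown.append(text[i])
            i += 1
    if unknown:
        pieces.append("".join(unknown))
    return ZWSP.join(pieces)

# Toy lexicon: "language" and "Lao".
LEXICON = {"ພາສາ", "ລາວ"}
print(insert_zwsp("ພາສາລາວ", LEXICON) == "ພາສາ" + ZWSP + "ລາວ")
```

Greedy matching picks the wrong split whenever a long entry swallows the start of the next word, which is presumably one reason real tools combine a dictionary with syllable rules or statistics.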
* Whether that’s what this character is for
Yes.
The Unicode 5.0 book about ZWSP: “The U+200B ZERO WIDTH SPACE indicates a word boundary, except that it has no width. Zero-width space characters are intended to be used in languages that have no visible word spacing to represent word breaks, such as Thai, Khmer, and Japanese.”
* Whether people actually already do that in any such language when typing (somehow)
No.
From my experience with Thai and Japanese translators, they don’t use ZWSP.
* Whether this character would be appropriate to insert automatically with software that identifies word boundaries automatically (spellcheckers, for instance)
I would say “rather not.”
If the spell-checker/lexer/line-breaking algorithms are tightly integrated with the application doing the layout, then it should not change the text (same as the module hyphenating English text does not physically add a hyphen in the text representation).
If the component that is aware of the language works outside the main application, and the main application is not language-aware, then it is probably acceptable. This might be a hack to add some level of language support to applications that don’t have any (scenario: adding ZWSP to the Thai text for application X, if the application understands ZWSP, but “does not know Thai”).
John -
Thanks for the comment. It’s interesting to compare your experience of inserting ZWSP into Lao text with Mihai’s comment. Does your tool end up saving those characters into the text that was input? Could you give a rough description of just how big the dictionary of exceptions you use for Lao is?
Mihai -
Thanks for the detailed description. I just got a shiny copy of the Standard, I’ll definitely look it up (I’m sure it’s on unicode.org somewhere too…)
I find the answer to the third issue quite interesting. As it happens, the application we’re building does a fair amount of modification to incoming content in some cases (running Tidy on it, removing Javascript, etc, for instance), so I can’t see where adding ZWSP would be too big of a deal. I certainly agree that adding them into someone’s desktop Word file or something, on the other hand, would not be wise.
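One consequence of storing ZWSP in content is that exact string comparison and search stop matching text that was typed without it. A minimal Python sketch of the kind of normalization I would expect to need; the helper names are hypothetical and not part of our application:

```python
ZWSP = "\u200b"  # U+200B ZERO WIDTH SPACE

def strip_zwsp(text):
    """Return text with every ZWSP removed."""
    return text.replace(ZWSP, "")

def same_text(a, b):
    """Compare two strings while ignoring ZWSP word-break markers."""
    return strip_zwsp(a) == strip_zwsp(b)

def contains(haystack, needle):
    """Substring search that ignores ZWSP on both sides."""
    return strip_zwsp(needle) in strip_zwsp(haystack)

stored = "ภาษา" + ZWSP + "ไทย"  # content after ZWSP insertion
query = "ภาษาไทย"               # what a user types, without ZWSP

print(stored == query)           # False: the raw strings differ
print(same_text(stored, query))  # True once ZWSP is ignored
```

Stripping before re-inserting also keeps the insertion step idempotent, so running it twice over the same content does not pile up extra characters.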
By the way, I checked out your site and I really like it. A couple favorite posts:
Internationalization Cookbook - Basic lingo (like those illustrations)
Internationalization Cookbook - What is a “Unicode application”?
As a Linux guy, sometimes I must admit I’m jealous of all the cool i18n stuff that MS programmers have available, thanks to the fine work of folks like Michael Kaplan.
Looking forward to more posts!
I am newly learning Khmer, which is a latecomer to Unicode and computerization.
In the push to get everyone creating Khmer text to use a standard character set, it appears that the Unicode push is having the side effect of training typists to insert the ZWSP between words. The evidence I’ve run across is anecdotal and scanty, but it would appear that “not typing a space” would be accurate for prior keyboarding, and that it is no longer true for Khmer Unicode keyboarding. I think the need to help browsers place line breaks is what reinforces this, as anyone with a computer is sensitive to how their content is displayed on the web.
The development of automated programs to insert word-breaks is thus being spurred to facilitate transformation into Unicoded text in Khmer, as well.
(Again, I report as a neophyte.)
Roger Sperberg
Hi,
I’m a Khmer speaker. I followed your link at KhmerOS website.
I just want to report that ZWSP does exist on the NiDA Standard Khmer Unicode keyboard (formerly the KhmerOS keyboard), on the “Space Bar” key. Most of the time when I type Khmer I use this character, especially in a word processor. A word that does not fit on the line then breaks down to the next line automatically.