Hacklog: Blogamundo — poking holes in the language barrier since approximately 1 month from now

Don’t sort stuff in Unicode with Bash?

Written by Patrick Hall, 4 months, 2 weeks ago.

Update: Okay, duh: I shouldn’t have called it “Bash”. What I meant was, “whatever the sort utility is in my default terminal.” Which, as Bryan points out in a comment below, has nothing to do with Bash: it’s GNU Sort. More updates below.

I have a little text file with “Hello World” in lots of languages, which I often use for testing. I extracted a few lines with various scripts and saved that as helloworld.txt.

$ cat helloworld.txt
สวัสดีราคาถูก!  Thai
Habari dunia!   Kiswahili
Halló heimur!   Icelandic
Saluton Mondo!  Esperanto
Sveika, pasaule!        Latvian
Привет, мир!    Russian
ሠላም ዓለም!        Amharic
안녕, 세상!     Korean
Chào thế giới!  Vietnamese
Hallo, wrâld    Frisian
Hallo verden!   Norwegian/Bokmal
Laba ryta, pasauli!     Lithuanian

For my first amazing trick, I sort the file with the Bash shell built-in:

$ sort helloworld.txt
ሠላም ዓለም!        Amharic
Chào thế giới!  Vietnamese
Habari dunia!   Kiswahili
Halló heimur!   Icelandic
Hallo verden!   Norwegian/Bokmal
Hallo, wrâld    Frisian
안녕, 세상!     Korean
Laba ryta, pasauli!     Lithuanian
Saluton Mondo!  Esperanto
Sveika, pasaule!        Latvian
สวัสดีราคาถูก!  Thai
Привет, мир!    Russian

…which sucks. Because obviously Bash is ignoring anything fancy (Amharic, Korean, Thai) and sorting strictly by whatever ASCII shows up in the line. (Hard to say whether the «ó» in Icelandic is being considered, but shouldn’t it come after «o» anyway?)

I also installed and tried another terminal called rxvt-unicode, which supposedly has better Unicode support. I got the same results as in Bash under gnome-terminal, which suggests to me that the problem is Bash, or somewhere deeper, and not the terminal itself. Next I tried the same sort in Python:

$ python
>>> lines = open('helloworld.txt').read().decode('utf-8').splitlines()
>>> for line in sorted(lines): print line
...
Chào thế giới!  Vietnamese
Habari dunia!   Kiswahili
Hallo verden!   Norwegian/Bokmal
Hallo, wrâld    Frisian
Halló heimur!   Icelandic
Laba ryta, pasauli!     Lithuanian
Saluton Mondo!  Esperanto
Sveika, pasaule!        Latvian
Привет, мир!    Russian
สวัสดีราคาถูก!  Thai
ሠላም ዓለም!        Amharic
안녕, 세상!     Korean

Python does better; clearly things are being sorted according to their Unicode code points. Which of course is a far cry from following UTS #10: Unicode Collation Algorithm, but that has to do with locales and all that.
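For anyone on Python 3, here is a rough equivalent of the session above. It is only a sketch: the greetings are a subset of helloworld.txt pasted in as a list, and `code_point_key` is a made-up helper name. It shows that the default string sort and an explicit sort on code point numbers agree.

```python
# Python 3 sketch: str objects compare code point by code point.
greetings = [
    "Halló heimur!",
    "Hallo verden!",
    "Привет, мир!",
    "ሠላም ዓለም!",
    "안녕, 세상!",
    "Chào thế giới!",
]


def code_point_key(line):
    """Hypothetical helper: the line as a list of code point numbers."""
    return [ord(ch) for ch in line]


by_default = sorted(greetings)
by_code_point = sorted(greetings, key=code_point_key)

for line in by_default:
    print(line)
```

Note that «ó» (U+00F3) lands after plain «o» (U+006F) here purely because of its code point number, not because of any language-specific rule.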

In any case, I won’t be trusting Bash to sort Unicode files any more.

(I’d be interested to know what the default sort does to the initial input in various other programming languages, comments welcome.)

Update:

After Bryan’s comment pointed out that I wasn’t even dealing with Bash, but rather with GNU sort, I read through the manual and discovered the following trick in a footnote:

$ export LC_ALL=C; sort helloworld.txt
Chào thế giới!  Vietnamese
Habari dunia!   Kiswahili
Hallo verden!   Norwegian/Bokmal
Hallo, wrâld    Frisian
Halló heimur!   Icelandic
Laba ryta, pasauli!     Lithuanian
Saluton Mondo!  Esperanto
Sveika, pasaule!        Latvian
Привет, мир!    Russian
สวัสดีราคาถูก!  Thai
ሠላም ዓለም!        Amharic
안녕, 세상!     Korean

Which seems to be what I was looking for.
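A likely reason this matches the Python output: in the C locale, sort compares raw bytes, and UTF-8 is designed so that byte order and code point order agree. Here is a small Python 3 sketch of that idea, assuming the file is UTF-8 and using a few of the greetings as stand-in data (`utf8_bytes` is just an illustrative name):

```python
# Python 3 sketch: sorting by UTF-8 bytes gives the same order as
# sorting by code point, which is what LC_ALL=C effectively does
# to a UTF-8 file.
greetings = [
    "สวัสดีราคาถูก!",
    "Halló heimur!",
    "Hallo, wrâld",
    "Привет, мир!",
    "ሠላም ዓለም!",
    "안녕, 세상!",
]


def utf8_bytes(line):
    """Illustrative key function: the raw UTF-8 bytes of the line."""
    return line.encode("utf-8")


by_bytes = sorted(greetings, key=utf8_bytes)
by_code_point = sorted(greetings)

print(by_bytes == by_code_point)
```

This is still not linguistic collation; it is just a predictable order that ignores the locale entirely.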

“Translation is hugely useful!”

Written by Patrick Hall, 5 months ago.

Here’s a post from Salon.com’s interesting How the World Works Globalization blog about an amazing translator:

Salon.com Technology | Fragments of the Tocharian

Translation is an under-appreciated art, but Ji Xianlin risked his life for his craft:

…he secretly translated the entire Indian epic, “The Ramayana,” from the original Sanskrit into Chinese, while experiencing the travails that afflicted nearly all Chinese intellectuals during the Cultural Revolution.

Ji’s observation about the utility of translation is trenchant:

It is translation that has preserved the perpetual youth of Chinese civilization. Translation is hugely useful!

We agree.

New Wikipedia Parser i18n

Written by Patrick Hall, 7 months, 2 weeks ago.

From the Wikipedia mailing list:

[Wikipedia-l] New parser in the works - please help

On the outside chance that any reader of this blog hasn’t edited Wikipedia, what they’re talking about here is the parser that interprets the markup that’s used to write Wikipedia articles. (You can see some by clicking “edit” on any Wikipedia article.) This stuff is called “Wikitext.”

An interesting detail of the project from our point of view is that attention is being paid to internationalizing the parser:

Some of what some people would think of as a “stupid parser trick” is
in fact important - e.g. L”’uomo'’ which renders as L<i>uomo</i>
(necessary for French and Italian).

So: we need to know what MediaWiki quirks are supporting important
constructs in languages other than English (which is the language the
list is in, and is the native language of most of the participants),
and particularly in non-European languages.

This list is unlikely to implement new features, e.g. (an example
brought up by GerardM) the double-apostrophe in Neapolitan. But we
really need to know about present important features that wouldn’t be
obvious to an English-speaker going through the present parser code.

It will be interesting to follow this stuff; I do a fair amount of trying to parse up Wikitext myself.

(PS. Wordpress has totally borked up that Italian example in the quote above; just check it out on the original link.)

What the Heck is a Language Model? 5 Minute Answer

Written by Patrick Hall, 7 months, 3 weeks ago.

Scrabble.

Scrabble is based on a language model.

Specifically, the set of Scrabble tiles, each with a letter and a numerical value, constitutes a language model.

When you score a word in Scrabble (forget triple word scores and all that), you’re using a language model to evaluate how “good” the word is.

Now it so happens that the Scrabble language model of English is a bit odd. Namely, you get a high score in Scrabble for words that have uncommon letters. So if you look at records of professional Scrabble games, the words sometimes get so nutty they barely resemble English. (X’s and Y’s and Q’s all over the place, but nary an S.)

Usually in Natural Language Processing you want your model to capture “normalness,” not “weirdness.” So instead of giving a high score to the rare letters “Q” or “X”, you’d give high scores to the far more common “S” and “E”. An easy way to build a simple language model along those lines is simply to take a bunch of text in the language in question and count up each letter. The letter’s score is the letter’s frequency (maybe normalized to a score between 0 and 100, say).
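Here is a minimal Python 3 sketch of that counting idea. The function names (`letter_model`, `score_word`) and the tiny training text are made up for illustration, and scaling so that the most common letter gets 100 is just one possible normalization.

```python
from collections import Counter


def letter_model(text):
    """Toy model: map each letter to a score from 0 to 100.

    The most frequent letter in the training text scores 100 and
    every other letter is scaled relative to it.
    """
    counts = Counter(ch for ch in text.lower() if ch.isalpha())
    if not counts:
        return {}
    top = max(counts.values())
    return {ch: 100.0 * n / top for ch, n in counts.items()}


def score_word(word, model):
    """Score a word Scrabble-style: add up the scores of its letters."""
    return sum(model.get(ch, 0.0) for ch in word.lower())


model = letter_model("see the bees")
print(model["e"], model["s"])
print(score_word("see", model))
```

With a real corpus in place of the three-word training text, words made of common letters come out with high scores, the reverse of Scrabble.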

Now, we’re talking about a very simple language model here, but that’s the general idea.

(And it’s not hard to think of a quick application: you could probably take the Scrabble letter distributions in lots of languages and use them as a simple language recognition tool. Score a mystery text according to each of those models, and see which one returns the lowest value. That’d be your best guess. It might suck, since identifying languages is easier if you use sequences of two or more letters, but it might work, too.)
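A rough Python 3 sketch of the recognition idea, under some loud assumptions: it uses frequency-style models built from tiny made-up sample texts rather than real Scrabble tile sets, so the best guess is the model with the highest average score. With Scrabble-style values, where rare letters score high, you would pick the lowest instead. All function names are hypothetical.

```python
from collections import Counter


def letter_frequencies(text):
    """Map each letter to its relative frequency in the sample text."""
    counts = Counter(ch for ch in text.lower() if ch.isalpha())
    total = sum(counts.values())
    if total == 0:
        return {}
    return {ch: n / total for ch, n in counts.items()}


def average_score(text, model):
    """Average per-letter score of the text under one model."""
    letters = [ch for ch in text.lower() if ch.isalpha()]
    if not letters:
        return 0.0
    return sum(model.get(ch, 0.0) for ch in letters) / len(letters)


def guess_language(text, models):
    """Return the name of the model that scores the text highest."""
    return max(models, key=lambda name: average_score(text, models[name]))


# Tiny stand-in samples; a real model would need far more text.
models = {
    "english": letter_frequencies("the quick brown fox jumps over the lazy dog"),
    "russian": letter_frequencies("привет мир это простой текст"),
}

print(guess_language("hello world", models))
print(guess_language("мир", models))
```

Two languages in different scripts is the easy case, of course; telling apart languages that share an alphabet is where single-letter counts start to struggle.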

Loonicode U+0009

Written by Patrick Hall, 7 months, 3 weeks ago.
Loonicode 9

A Unicode Question: Character Decompositions

Written by Patrick Hall, 7 months, 3 weeks ago.

I fancy myself something of a Unicode fanatic, but I don’t pretend to understand all, or even most, of the specs on the topic. I’m very much a learn-as-I-go kind of guy, which I think is an okay way to learn Unicode stuff, since I pretty much deal with it every day.

End of preamble, beginning of post-preamble:

Some letters can be automatically broken down (“decomposed,” I think, is the right term) into more characters, some of which don’t normally stand on their own.

For instance, here’s a Thai “letter”:

ด้

It’s actually a consonant plus a tone mark, and it’s possible to rip those two parts out and look at them:

U+0E14: THAI CHARACTER DO DEK (ด)

U+0E49: THAI CHARACTER MAI THO (◌้)

As you can see (well, as you can see if you have a pretty complete Thai font), there is a “letter” called Do Dek, and another called Mai Tho. Mai Tho is the tone mark, and it’s attached to Do Dek. As a loose analogy for Roman alphabet fans, it’s as if an i with a dot and an i without a dot were distinct letters.

Come to think of it, they are, in Turkish: U+0131: LATIN SMALL LETTER DOTLESS I (ı).

But anyway, the point is that sometimes from a linguistic viewpoint you want to do this ripping apart. For an automatic transliteration project I’ve been working on (about which more later), it will be useful to be able to access this kind of info for Thai; it sort of turns an abugida into an alphabet.

However, it doesn’t seem to be the case that such decompositions are universal in Unicode land. The specific case I have in mind is Amharic, which is also an abugida (that’s the language that the word comes from, as a matter of fact), for which there appears to be no decomposition.

That is to say, there is no way to decompose the characters:

U+1200: ETHIOPIC SYLLABLE HA (ሀ)

U+1201: ETHIOPIC SYLLABLE HU (ሁ)

…in such a way that we can “get at the vowel parts” as independent characters, and see that they are both variants, in some sense, of the “h part.”

So:

  1. Am I wrong about Amharic?
  2. Is this sort of thing purely script-specific in Unicode, or is there a general policy that says “decompositions should be available if possible”?
  3. If the answer to #2 is “there is no general policy,” is there at least a list somewhere that will tell me which writing systems do and do not have such decompositions?
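One way to poke at these questions is to ask the Unicode Character Database directly. Here is a Python 3 sketch using the standard `unicodedata` module; `describe` is a made-up helper, and the results reflect whatever Unicode version ships with your Python. It suggests that the Thai example is already stored as two separate code points, that a precomposed Latin letter like ó has a canonical decomposition, and that the Ethiopic syllable has none.

```python
import unicodedata


def describe(text):
    """Hypothetical helper: (code point, name, decomposition) per character."""
    return [
        ("U+%04X" % ord(ch), unicodedata.name(ch, "<unnamed>"),
         unicodedata.decomposition(ch))
        for ch in text
    ]


thai = "\u0e14\u0e49"   # DO DEK followed by MAI THO
latin = "\u00f3"        # precomposed LATIN SMALL LETTER O WITH ACUTE
ethiopic = "\u1201"     # ETHIOPIC SYLLABLE HU

for sample in (thai, latin, ethiopic):
    for row in describe(sample):
        print(row)

# NFD normalization applies canonical decompositions where they exist.
latin_nfd = unicodedata.normalize("NFD", latin)
ethiopic_nfd = unicodedata.normalize("NFD", ethiopic)
print(len(latin_nfd), len(ethiopic_nfd))
```

An empty string in the third column means the database records no decomposition for that character.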

And now a one-line post

Written by Patrick Hall, 7 months, 4 weeks ago.

Putting Unicode text into LaTeX is way too hard, the end.

A Revolution in Linguistic Mapping?

Written by Patrick Hall, 8 months ago.

I just got my Country Codes of the World map in the mail from John Yunker at Byte Level Research. It puts scaled country domains onto a map of the world. Quite interesting to look at―a few things that stood out for me:

  • Australia is so much dinkier than I thought.
  • Everybody knows that China is huge, but India is more or less equally huge.
  • There are some surprises: Golly, Bangladesh.

Looking at the map got me thinking about possibilities for mapping languages: there are a lot more languages out there than country domains, but with GPS-enabled cell phones becoming ubiquitous, this kind of research seems poised to explode.

The interface is easy to picture: you go to a URL on your cell phone and it says “What language(s) do you use?” The application then maps those results onto a map. Simple.

This would deal with a pet peeve of mine regarding most linguistic mapping: it’s really hard to get current information on who speaks what where. In my own country, the US, you will often see language maps that would lead one to believe that the eastern seaboard is flowing with speakers of Algonquian languages.

Now, the fact that Algonquian and so many other historical languages are no longer flourishing here is something I mourn, but it’s also a fact. There are certainly more Mandarin speakers in Massachusetts than there are speakers of Massachusett. It’s just the state of languages in our world.

But we still don’t have enough detail about that state. It will be very interesting to see, when we do. (In the case of the US, I would hope that such an application could thoroughly deep-six the notion that this country is “monolingual,” an absurd myth.)

Even without the use of GPS, you can run informal mapping “experiments” already: here’s one I did on a neat new site called Ask500People:

Do you hear more than one language when you walk around where you live?

Dear Jeff Bezos

Written by Patrick Hall, 8 months, 1 week ago.

You are losing sales because your sites are in different encodings.

Amazon.com is in ISO-8859-1.

Amazon.co.jp is in SHIFT-JIS.

If your customers type a Japanese book title (like ねじまき鳥クロニクル) on amazon.com, they get hello crazy talk, like this: ねじまき鳥クロニクル

“Hello crazy talk” is best translated as “no sale for you.”

Sincerely,

The UTF-8 Avenger
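For the curious, here is a Python 3 sketch of why the two encodings don’t get along. It only illustrates the encodings themselves, not how Amazon’s servers or any browser actually handle form input; `can_encode` is a made-up helper.

```python
title = "ねじまき鳥クロニクル"


def can_encode(text, encoding):
    """Hypothetical helper: True if every character fits in the encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# ISO-8859-1 has no Japanese characters at all; the other two do.
for encoding in ("iso-8859-1", "shift_jis", "utf-8"):
    print(encoding, can_encode(title, encoding))

# Reading bytes back with the wrong encoding is one recipe for crazy talk.
garbled = title.encode("utf-8").decode("iso-8859-1")
print(garbled == title)
```

UTF-8 is the only one of the three that could serve both sites without conversion.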

Counting Words in the Age of Unicode

Written by Patrick Hall, 8 months, 1 week ago.

It’s pretty easy to get a reasonable word count in a language where words are separated by spaces, using Python in this case:

(Note: I’m using the Universal Declaration of Human Rights as sample texts — you can too: http://www.unicode.org/udhr/ )


>>> len(open('udhr_eng.txt').read().split())
1778

It’s a bit trickier if the text is in a language that uses something besides a space to delimit words, like Amharic. But it’s still not hard:


>>> len(open('udhr_amh.txt').read().decode('utf-8'))
6603
>>> len(open('udhr_amh.txt').read().decode('utf-8').split(u'፡'))
974

(Splitting on U+1361 ETHIOPIC WORDSPACE there, which you probably can’t see unless you have an Ethiopic font.)
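In Python 3, where strings are already Unicode, the same two counts could be sketched like this. `count_words` is a made-up helper that works on a string rather than a file, and unlike the one-liners above it throws away empty pieces, so its numbers may differ slightly from those sessions.

```python
ETHIOPIC_WORDSPACE = "\u1361"


def count_words(text, separator=None):
    """Count non-empty chunks of text.

    separator=None splits on any whitespace; otherwise the text is
    split on the separator and then on whitespace, so line breaks
    next to a separator do not glue two words together.
    """
    if separator is None:
        return len(text.split())
    chunks = []
    for part in text.split(separator):
        chunks.extend(part.split())
    return len(chunks)


english = "All human beings are born free"
amharic = ETHIOPIC_WORDSPACE.join(["ሠላም", "ዓለም"])

print(count_words(english))
print(count_words(amharic, ETHIOPIC_WORDSPACE))
```

For a real file you would read it with `open(path, encoding="utf-8")` first and pass the resulting string in.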

But, for languages like Japanese, things get tricky:


>>> len(open('udhr_jpn.txt').read().decode('utf-8').split())
123

Which is obviously wrong.

Because Japanese doesn’t delimit words with… well, anything really. The same sort of challenge holds for Chinese, and also for Southeast Asian languages like Thai and Khmer (Cambodian), which use spaces to delimit phrases, not words.

So what’s a translator to do?

Here’s my opinion:

1. Count letters (codepoints, to be precise)
2. Come up with an average letters/word number for each language, e.g., “Thai text has n letters per word on average” (with a real value for n which would require a bit of research)
3. Calculate (1)/(2), and charge based on that number.
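The three steps above can be sketched in Python 3 as follows. Everything here is an assumption for illustration: the function names, the choice to skip whitespace when counting code points, and above all the letters-per-word figure, which is a placeholder and not a researched value.

```python
def count_letters(text):
    """Step 1: count code points, skipping whitespace (my assumption)."""
    return sum(1 for ch in text if not ch.isspace())


def estimate_words(text, letters_per_word):
    """Step 3: divide the letter count by the per-language average."""
    if letters_per_word <= 0:
        raise ValueError("letters_per_word must be positive")
    return count_letters(text) / letters_per_word


# Step 2 placeholder: NOT a real figure for Thai or any other language.
HYPOTHETICAL_LETTERS_PER_WORD = 4.0

sample = "สวัสดีราคาถูก"
print(count_letters(sample))
print(estimate_words(sample, HYPOTHETICAL_LETTERS_PER_WORD))
```

Note that combining marks count as code points here, so the per-language average would have to be measured the same way.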

One question immediately springs to mind: how much variation is there in the average letters/word number? Is the ratio too variable to be useful?
