Hacklog: Blogamundo — poking holes in the language barrier since approximately 1 month from now

New Wikipedia Parser i18n

Written by Patrick Hall, 6 months ago.

From the Wikipedia mailing list:

[Wikipedia-l] New parser in the works - please help

On the outside chance that any reader of this blog hasn’t edited Wikipedia, what they’re talking about here is the parser that interprets the markup that’s used to write Wikipedia articles. (You can see some by clicking “edit” on any Wikipedia article.) This stuff is called “Wikitext.”

An interesting detail of the project from our point of view is that attention is being paid to internationalizing the parser:

Some of what some people would think of as a “stupid parser trick” is
in fact important - e.g. L”’uomo'’ which renders as L<i>uomo</i>
(necessary for French and Italian).

So: we need to know what MediaWiki quirks are supporting important
constructs in languages other than English (which is the language the
list is in, and is the native language of most of the participants),
and particularly in non-European languages.

This list is unlikely to implement new features, e.g. (an example
brought up by GerardM) the double-apostrophe in Neapolitan. But we
really need to know about present important features that wouldn’t be
obvious to an English-speaker going through the present parser code.

It will be interesting to follow this stuff; I do a fair amount of trying to parse up Wikitext myself.

(PS. WordPress has totally borked up that Italian example in the quote above; just check it out at the original link.)

What the Heck is a Language Model? 5 Minute Answer

Written by Patrick Hall, 6 months ago.

Scrabble.

Scrabble is based on a language model.

Specifically, the set of Scrabble tiles, each with a letter and a numerical value, constitutes a language model.

When you score a word in Scrabble (forget triple word scores and all that), you’re using a language model to evaluate how “good” the word is.

Now it so happens that the Scrabble language model of English is a bit odd. Namely, you get a high score in Scrabble for words that have uncommon letters. So if you look at records of professional Scrabble games, the words sometimes get so nutty they barely resemble English. (X’s and Y’s and Q’s all over the place, but nary an S.)

Usually in Natural Language Processing you want your model to capture “normalness,” not “weirdness.” So instead of giving a high score to the rare letters “Q” or “X”, you’d give high scores to the far more common “S” and “E”. An easy way to build a simple language model along those lines is to take a bunch of text in the language in question and count up each letter. The letter’s score is the letter’s frequency (maybe normalized to a score between 0 and 100, say).

Now, we’re talking about a very simple language model here, but that’s the general idea.
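Here is a minimal sketch of that letter-counting model in Python. The function names, the toy sample sentence, and the choice to scale so that the most common letter scores exactly 100 are all assumptions made for illustration; a real model would be built from far more text.

```python
from collections import Counter

def letter_model(sample_text):
    """Map each letter to a 0-100 score proportional to its frequency in sample_text."""
    counts = Counter(ch for ch in sample_text.lower() if ch.isalpha())
    if not counts:
        return {}
    top = max(counts.values())
    # Assumption: scale so the most frequent letter gets exactly 100.
    return {letter: 100 * n / top for letter, n in counts.items()}

def score_word(word, model):
    """Sum the letter scores, Scrabble style; letters the model never saw score 0."""
    return sum(model.get(ch, 0) for ch in word.lower())

# Toy sample text (hypothetical); far too small for a serious model.
sample = "she sells sea shells by the sea shore"
model = letter_model(sample)

for word in ["sees", "by", "qx"]:
    print(word, score_word(word, model))
```

Unlike Scrabble tiles, this model rewards the common letters, so an ordinary-looking word outscores one built from rarities.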

(And it’s not hard to think of a quick application: you could probably take the Scrabble letter distributions in lots of languages and use them as a simple language recognition tool. Score a mystery text according to each of those models, and see which one returns the lowest value. That’d be your best guess. It might suck ―identifying languages is easier if you use sequences of two or more letters―but it might work, too.)
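A rough sketch of that language-guessing idea follows. It swaps in something other than actual Scrabble tile values: each model is a set of letter probabilities counted from a sample sentence, a text is scored by its average log-probability per letter, and the highest score wins (with real tile values, where rare letters cost more, you would pick the lowest total instead). The sample sentences, the function names, and the `floor` probability for unseen letters are assumptions.

```python
import math
from collections import Counter

def letter_probs(sample_text):
    """Relative frequency of each letter in sample_text."""
    counts = Counter(ch for ch in sample_text.lower() if ch.isalpha())
    total = sum(counts.values())
    return {letter: n / total for letter, n in counts.items()}

def avg_log_prob(text, probs, floor=1e-6):
    """Average log-probability per letter; unseen letters get the (assumed) floor."""
    letters = [ch for ch in text.lower() if ch.isalpha()]
    if not letters:
        return float("-inf")
    return sum(math.log(probs.get(ch, floor)) for ch in letters) / len(letters)

def guess_language(text, models):
    """Return the name of the model that scores the text highest."""
    return max(models, key=lambda name: avg_log_prob(text, models[name]))

# Hypothetical one-sentence samples; real models need much more text.
SAMPLES = {
    "english": "the quick brown fox jumps over the lazy dog while the other dogs sleep in the shade",
    "spanish": "el veloz zorro marrón salta sobre el perro perezoso mientras los otros perros duermen a la sombra",
}
MODELS = {name: letter_probs(text) for name, text in SAMPLES.items()}

print(guess_language("the dog sleeps", MODELS))
```

With models this small the guess leans heavily on letters that one sample happens to lack, which is one more reason to expect sequences of two or more letters to do better.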

Loonicode U+0009

Written by Patrick Hall, 6 months ago.
Loonicode 9

A Unicode Question: Character Decompositions

Written by Patrick Hall, 6 months ago.

I fancy myself something of a Unicode fanatic, but I don’t pretend to understand all, or even most, of the specs on the topic. I’m very much a learn-as-I-go kind of guy, which I think is an okay way to learn Unicode stuff, since I pretty much deal with it every day.

End of preamble, beginning of post-preamble:

Some letters can be automatically broken down (“decomposed,” I think, is the right term) into more characters, some of which don’t normally stand on their own.

For instance, here’s a Thai “letter”:

ด้

It’s actually a consonant plus a tone mark, and it’s possible to rip those two parts out and look at them:

U+0E14: THAI CHARACTER DO DEK (ด)

U+0E49: THAI CHARACTER MAI THO (◌้)

As you can see (well, as you can see if you have a pretty complete Thai font), there is a “letter” called Do Dek, and another called Mai Tho. Mai Tho is the tone mark, and it’s attached to Do Dek. As a loose analogy for Roman alphabet fans, it’s as if an i with a dot and an i without a dot were distinct letters.

Come to think of it, they are, in Turkish: U+0131: LATIN SMALL LETTER DOTLESS I (ı).

But anyway, the point is that sometimes from a linguistic viewpoint you want to do this ripping apart. For an automatic transliteration project I’ve been working on (about which more later), it will be useful to be able to access this kind of info for Thai; it sort of turns an abugida into an alphabet.
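For what it’s worth, the ripping apart of the Thai example can be sketched with Python’s standard `unicodedata` module. The string is written with escapes so the two code points are unambiguous; the helper name `describe` is made up for this sketch.

```python
import unicodedata

def describe(text):
    """List (code point, name, general category) for every character in text."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))
        for ch in text
    ]

# The Thai example from above: DO DEK followed by MAI THO.
thai = "\u0e14\u0e49"

for code_point, name, category in describe(thai):
    print(code_point, name, category)
```

The general category separates the base letter (`Lo`, “other letter”) from the attached mark (`Mn`, “nonspacing mark”), which is the kind of distinction a transliteration tool can key on.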

However, it doesn’t seem to be the case that such decompositions are universal in Unicode land. The specific case I have in mind is Amharic, which is also written in an abugida (that’s the language that the word comes from, as a matter of fact), and for which there appears to be no decomposition.

That is to say, there is no way to decompose the characters:

U+1200: ETHIOPIC SYLLABLE HA (ሀ)

U+1201: ETHIOPIC SYLLABLE HU (ሁ)

…in such a way that we can “get at the vowel parts” as independent characters, and see that they are both variants, in some sense, of the “h part.”

So:

  1. Am I wrong about Amharic?
  2. Is this sort of thing purely script-specific in Unicode, or is there a general policy that says “decompositions should be available if possible”?
  3. If the answer to #2 is “there is no general policy,” is there at least a list somewhere that will tell me which writing systems do and do not have such decompositions?
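One way to poke at the first question is to ask the Unicode data that ships with Python. The sketch below looks up the decomposition mapping and the NFD-normalized form of the two Ethiopic syllables, with é thrown in for contrast; the helper name `canonical_parts` is made up, and the result only reflects the Unicode version bundled with whatever Python runs it.

```python
import unicodedata

def canonical_parts(ch):
    """Return the NFD form of ch as a list of (code point, name) pairs."""
    decomposed = unicodedata.normalize("NFD", ch)
    return [(f"U+{ord(c):04X}", unicodedata.name(c)) for c in decomposed]

# ETHIOPIC SYLLABLE HA, ETHIOPIC SYLLABLE HU, and LATIN SMALL LETTER E WITH ACUTE.
for ch in ["\u1200", "\u1201", "\u00e9"]:
    mapping = unicodedata.decomposition(ch) or "(no decomposition mapping)"
    print(f"U+{ord(ch):04X}", mapping, canonical_parts(ch))
```

The two Ethiopic syllables come back as single, indivisible code points, while é splits into a base letter plus a combining accent.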

And now a one-line post

Written by Patrick Hall, 6 months, 1 week ago.

Putting Unicode text into LaTeX is way too hard, the end.

A Revolution in Linguistic Mapping?

Written by Patrick Hall, 6 months, 2 weeks ago.

I just got my Country Codes of the World map in the mail from John Yunker at Byte Level Research. It puts scaled country domains onto a map of the world. Quite interesting to look at―a few things that stood out for me:

  • Australia is so much dinkier than I thought.
  • Everybody knows that China is huge, but India is more or less equally huge.
  • There are some surprises: Golly, Bangladesh.

Looking at the map got me thinking about possibilities for mapping languages: there are a lot more languages out there than country domains, but with GPS-enabled cell phones becoming ubiquitous, this kind of research seems poised to explode.

The interface is easy to picture: you go to a URL on your cell phone and it says “What language(s) do you use?” The application then plots those results on a map. Simple.
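The bookkeeping behind such an application could be as small as the sketch below: each report is a GPS position plus the languages the person says they use, and reports are tallied per grid cell so a map layer can draw them. Everything here is hypothetical, including the function names, the one-degree cell size, and the three invented sample reports.

```python
import math
from collections import Counter, defaultdict

def grid_cell(lat, lon, cell_degrees=1.0):
    """Snap a coordinate to the south-west corner of its grid cell."""
    return (
        math.floor(lat / cell_degrees) * cell_degrees,
        math.floor(lon / cell_degrees) * cell_degrees,
    )

def tally(reports, cell_degrees=1.0):
    """Count how often each language is reported in each grid cell."""
    cells = defaultdict(Counter)
    for lat, lon, languages in reports:
        cells[grid_cell(lat, lon, cell_degrees)].update(languages)
    return cells

# Invented sample reports: (latitude, longitude, languages used).
reports = [
    (42.36, -71.06, ["English", "Mandarin"]),
    (42.37, -71.11, ["Mandarin"]),
    (40.71, -74.01, ["Spanish", "English"]),
]

for cell, counts in tally(reports).items():
    print(cell, dict(counts))
```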

This would deal with a pet peeve of mine regarding most linguistic mapping: it’s really hard to get current information on who speaks what where. In my own country, the US, you will often see language maps that would lead one to believe that the eastern seaboard is flowing with speakers of Algonquian languages.

Now, the fact that Algonquian and so many other historical languages are no longer flourishing here is something I mourn, but it’s also a fact. There are certainly more Mandarin speakers in Massachusetts than there are speakers of Massachusett. It’s just the state of languages in our world.

But we still don’t have enough detail about that state. It will be very interesting to see, when we do. (In the case of the US, I would hope that such an application could thoroughly deep-six the notion that this country is “monolingual,” an absurd myth.)

Even without the use of GPS, you can run informal mapping “experiments” already: here’s one I did on a neat new site called Ask500People:

Do you hear more than one language when you walk around where you live?

Dear Jeff Bezos

Written by Patrick Hall, 6 months, 2 weeks ago.

You are losing sales because your sites are in different encodings.

Amazon.com is in ISO-8859-1.

Amazon.co.jp is in SHIFT-JIS.

If your customers type a Japanese book title (like ねじまき鳥クロニクル) on amazon.com, they get hello crazy talk, like this: ねじまき鳥クロニクル

“Hello crazy talk” is best translated as “no sale for you.”

Sincerely,

The UTF-8 Avenger
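A small Python sketch of why the two encodings in the letter above don’t mix. It does not claim to reproduce what Amazon’s servers or any particular browser actually do; it only shows three things that are true of the encodings themselves: ISO-8859-1 cannot represent the title at all, a common fallback is to replace each character with a numeric character reference, and Shift-JIS bytes read as ISO-8859-1 come out as nonsense. The variable names are made up.

```python
title = "ねじまき鳥クロニクル"

# ISO-8859-1 has no Japanese characters, so a strict encode fails outright.
try:
    title.encode("iso-8859-1")
    fits_latin1 = True
except UnicodeEncodeError:
    fits_latin1 = False

# One fallback: swap each unencodable character for a numeric character reference.
as_char_refs = title.encode("iso-8859-1", errors="xmlcharrefreplace").decode("ascii")

# Another failure mode: Shift-JIS bytes misread as ISO-8859-1.
mojibake = title.encode("shift_jis").decode("iso-8859-1")

# UTF-8 on both ends round-trips the title unchanged.
round_trip = title.encode("utf-8").decode("utf-8")

print(fits_latin1)
print(as_char_refs)
print(repr(mojibake))
print(round_trip == title)
```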