hacklog

A tale of language identification

Written by Patrick Hall, January 6th, 2009

Random story, which I found rather interesting:

In my endless trudging through every story related to language that comes over the wire, I came across this sad tale:

NewsChannel 5.com - Nashville, Tennessee - Translation Services Needed In Domestic Homicide

The story is about a homicide in Tennessee involving immigrants from Africa. But the article doesn’t say where in Africa the people involved are from; it just gives their names:

  • Yoranda Ntahomvukiye
  • Pascal Gahungu
  • Hitimana Noel

I started by searching those names:

hitimana gahungu ntahomvukiye - Google Search

(I noted that some of the names seem to be French in origin, but left those out of the search.)

Unsurprisingly, the article itself came up first. But three hits down I ran across this:

Survit-Banguka, Tutsi du Burundi

Another unfortunate-looking article… but it wasn’t the content I was interested in, just the names. Clearly this particular article was about Burundi, but the match could still be a chance resemblance to names from another Bantu language (there are tons of those).

Since Wikipedia tells us that the languages of Burundi are Kirundi, French, and a bit of Swahili, we have a pretty small set to work with (and more evidence if we consider the names Pascal and Noel).

At this point we need a bit of secondary validation… an opportunity to try automatic language ID and see if it supports the idea that the language in question is in fact one spoken in Burundi. It so happened that the names on that web page were all in upper case, so I just saved them to a text file and threw together the worst tokenizer in human history:

$ cat Act160206.txt |\
tr '\12' ' ' |\
python3 -c 'import sys; print(sys.stdin.read().lower())'
macumi miburo nyawenda nyabenda nyandwi gahungu hakizimana
minani habonimana hatungimana kayoya nahimana ndayisenga
ndikumana nsabimana ntakimazi ntirampeba nzeyimana bigirimana
bucumi karenzo nizigiyimana ntirandekura prison singirankabo misago
ndereyimana rwasa sindakira baranyizigiye bukuru congera matenari
mayoya mbonihankuye misigaro mpangaje muhutu musabimana
mvuyishanga ndabaneze ndaruzaniye ndenzako ndikumugongo ndikuriyo
ndiyenivyo ndogomba ngendabanyikwa ngenzebuhoro ngerageze nibigira
nibona nimpagaritse niyonkuru ntaconayigize ntahimpera ntahondi
ntahongendera ntezahorirwa ntirabampa ntungukobiri nyambikiye
nyedetse nzigirabarya nzoyisaba rukeratabaro sindayigaya siniremera
sinzinkayo sinzobakwira sinzumunsi toyi twagirayezu yongoro butoyi
gahutu kabura manirakiza mbarushimana mbonimpa mushengezi
mvuyekure nduhirubusa nduwimana ngendakumana ngiriyabandi
niyonzima nkezabahizi nsanzerugeze nshimirimana ntahombaye
bagumako  bambarukontari  bangirinama  banyiyezako  bidahari
bukobero  camihigo  habimana  kaburo  kana   kayemba  kibugebuge  kubukubu
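
The same "tokenizer" can be sketched as a plain Python function. This is only an illustration: the name flatten_and_lower and the layout of the sample input are assumptions, and unlike the tr pipeline above it collapses runs of whitespace into single spaces.

```python
def flatten_and_lower(text):
    """Join all whitespace-separated tokens with single spaces and lowercase them."""
    return " ".join(text.split()).lower()


# Hypothetical input layout: upper-case names separated by newlines and blank lines.
sample = "MACUMI\nMIBURO\n\nNYAWENDA NYABENDA\n"
print(flatten_and_lower(sample))
```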

That gave me a chunk of text that looked big enough to paste into my language identifier, et voilà, the response is Rundi. Ki- and similar prefixes show up on the names of many Bantu languages (Kirundi, Kiswahili), so Rundi is just Kirundi without the prefix: the identifier did in fact label that list of names as Kirundi.
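
The identifier itself isn't shown here. One common way such tools work is to compare character n-gram counts against a profile for each language, and the toy sketch below assumes that approach. The function names and the two labels are hypothetical, and the two tiny "profiles" are built only from text quoted in this post, so they are nowhere near real language models.

```python
import math
from collections import Counter


def trigram_profile(text):
    """Count character trigrams, padding each word with a space on both sides."""
    counts = Counter()
    for word in text.lower().split():
        padded = f" {word} "
        for i in range(len(padded) - 2):
            counts[padded[i:i + 3]] += 1
    return counts


def cosine_similarity(a, b):
    """Cosine similarity between two trigram count vectors (0.0 if either is empty)."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def identify(text, profiles):
    """Return the label of the profile most similar to the text."""
    query = trigram_profile(text)
    return max(profiles, key=lambda label: cosine_similarity(query, profiles[label]))


# Toy training data, both taken from this post; the labels are hypothetical.
PROFILES = {
    "names_from_burundi_page": trigram_profile(
        "macumi miburo nyawenda nyabenda nyandwi gahungu hakizimana "
        "minani habonimana hatungimana kayoya nahimana ndayisenga "
        "ndikumana nsabimana ntakimazi ntirampeba nzeyimana bigirimana"
    ),
    "english_prose": trigram_profile(
        "the story is about a homicide in tennessee involving immigrants from africa"
    ),
}

# The three surnames from the news story, lowercased.
print(identify("ntahomvukiye gahungu hitimana", PROFILES))
```

A real identifier would need one profile per candidate language, each trained on far more text, before its answer meant anything.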

Does anyone out there have reason to doubt this identification of the language of those names? Could I convince myself more, somehow?
