A tale of language identification
Random story, which I found rather interesting:
In my endless trudging through every story related to language that comes over the wire, I came across this sad tale:
NewsChannel 5.com - Nashville, Tennessee - Translation Services Needed In Domestic Homicide
The story is about a homicide in Tennessee involving immigrants from Africa. But the article doesn’t say where in Africa the people involved are from; it just gives their names:
- Yoranda Ntahomvukiye
- Pascal Gahungu
- Hitimana Noel
I started by searching those names:
hitimana gahungu ntahomvukiye - Google Search
(I noted that some of the names seem to be French in origin, but left those out of the search.)
Unsurprisingly, the article itself came up first. But three hits down I ran across this:
Survit-Banguka, Tutsi du Burundi
Another unfortunate-looking article… but it wasn’t the content I was interested in, just the names. Clearly this particular article was about Burundi, but the names could still just happen to resemble those of some other Bantu language (there are tons of those).
Since Wikipedia tells us that the languages of Burundi are Kirundi, French, and a bit of Swahili, we have a pretty small set to work with (and the French given names Pascal and Noel are a bit more evidence pointing the same way).
At this point we need a bit of secondary validation… an opportunity to try automatic language ID and see whether it supports the idea that the language in question is in fact one from Burundi. It so happened that the names on that web page were all in upper case, so I just saved it in a text file and threw together the worst tokenizer in human history:
$ cat Act160206.txt |\
    tr '\12' ' ' |\
    python -c 'import sys; print(sys.stdin.read().lower())'
macumi miburo nyawenda nyabenda nyandwi gahungu hakizimana minani habonimana hatungimana kayoya nahimana ndayisenga ndikumana nsabimana ntakimazi ntirampeba nzeyimana bigirimana bucumi karenzo nizigiyimana ntirandekura prison singirankabo misago ndereyimana rwasa sindakira baranyizigiye bukuru congera matenari mayoya mbonihankuye misigaro mpangaje muhutu musabimana mvuyishanga ndabaneze ndaruzaniye ndenzako ndikumugongo ndikuriyo ndiyenivyo ndogomba ngendabanyikwa ngenzebuhoro ngerageze nibigira nibona nimpagaritse niyonkuru ntaconayigize ntahimpera ntahondi ntahongendera ntezahorirwa ntirabampa ntungukobiri nyambikiye nyedetse nzigirabarya nzoyisaba rukeratabaro sindayigaya siniremera sinzinkayo sinzobakwira sinzumunsi toyi twagirayezu yongoro butoyi gahutu kabura manirakiza mbarushimana mbonimpa mushengezi mvuyekure nduhirubusa nduwimana ngendakumana ngiriyabandi niyonzima nkezabahizi nsanzerugeze nshimirimana ntahombaye bagumako bambarukontari bangirinama banyiyezako bidahari bukobero camihigo habimana kaburo kana kayemba kibugebuge kubukubu
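For anyone who would rather not chain shell commands, here is a minimal Python sketch of roughly the same crude tokenizer. The function name `crude_tokens` is made up for illustration, and unlike the pipeline above it also throws away digits and punctuation, which is an addition on my part rather than something the one-liner did.

```python
import re


def crude_tokens(text):
    """Lowercase the text and return its runs of letters as a list."""
    # [^\W\d_]+ matches one or more Unicode letters: word characters
    # minus digits and the underscore. Newlines, spaces, digits and
    # punctuation all act as separators.
    return re.findall(r"[^\W\d_]+", text.lower())


# Hypothetical input: a few upper-case names, one per line.
sample = "GAHUNGU\nHAKIZIMANA\nMINANI\n"
print(" ".join(crude_tokens(sample)))
```

Joining the tokens with spaces gives one long line of lower-case words, which is the shape a paste-in language identifier tends to want.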
That gave me a chunk of text that looked big enough to paste into my language identifier, et voilà, the response is Rundi. Ki- and similar prefixes are attached to the names of many Bantu languages, so Rundi and Kirundi are the same language: the identifier did in fact label that list of names as Kirundi.
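To give a feel for why a list of bare names can be enough to go on, here is a rough sketch of one common ingredient of language identification: character n-gram overlap. This is not a description of how my identifier works. The function names `char_ngrams` and `ngram_coverage` are invented for the example, and the reference list is just a handful of names copied from the output above.

```python
def char_ngrams(word, n=3):
    """Return the set of character n-grams in a word, padded with '_'."""
    padded = "_" + word.lower() + "_"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_coverage(queries, reference, n=3):
    """Fraction of the queries' n-grams that also occur in the reference."""
    ref = set()
    for word in reference:
        ref |= char_ngrams(word, n)
    query = set()
    for word in queries:
        query |= char_ngrams(word, n)
    if not query:
        return 0.0
    return len(query & ref) / len(query)


# A small sample of names from the list above (not the full list).
reference_names = ["gahungu", "hakizimana", "habonimana", "nahimana",
                   "ntahimpera", "ntahombaye", "mvuyekure", "mvuyishanga"]
# The three surnames from the news story.
query_names = ["ntahomvukiye", "gahungu", "hitimana"]

print(ngram_coverage(query_names, reference_names))
```

A high coverage score only says that the names share a lot of letter sequences with the reference list. A real identifier compares the text against profiles for many languages at once, and a score like this cannot by itself separate closely related Bantu languages.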
Curious to know if anyone out there has reason to doubt this identification of the language of those names? Could I convince myself more, somehow?
