Hacklog: Blogamundo — poking holes in the language barrier since approximately 1 month from now

Loonicode+0005

Written by Patrick Hall, 1 year, 10 months ago.
Loonicode 5

In which I tell you a Secret

Written by Patrick Hall, 1 year, 10 months ago.

You know, I spend all my time working on Blogamundo, but there is another linguistic endeavor I really want to get into some day, and that’s to do fieldwork on an unwritten language.

Which is why Eric Bakovic’s post on language documentation over at Language Log caught my eye.

It’s always seemed to me that “field work” (a term I’ve never liked) is the quintessential linguistics task. But alas, I’ve never really done any, to speak of. UCSD students, it seems, are luckier in this regard, as the article Eric points to describes:

Linguistics is the scientific study of language. It can be a highly theoretical field, and a minority of linguistics graduate programs in the country require hands-on courses in documenting and unraveling little-studied languages.

Somehow I managed to miss out on Linguistics 140 at Berkeley. I did take John Ohala’s phonology course, where we all interviewed a speaker of some language, and then wrote a description of the sound system of their language. That was my absolute favorite class–I must have driven the poor exchange student down the hallway well-nigh insane with my semester’s worth of pestering about the sounds of Japanese. Forgive me, Ayumi Taniguchi, wherever you are.

But Mary Haas was right: to really study a language, you have to go the whole nine yards, and that means producing a dictionary, a grammar, and finally some texts. Sounds about right to me.

But who knows, maybe some day I’ll go back to school or figure out how to get a grant or something and then do some real fieldwork.

Man, that would be fun.

My Favorite Data Structure

Written by Patrick Hall, 1 year, 10 months ago.

The “dictionary” data structure (also called mappings or hashes or associative arrays) is the workhorse of representing relationships between pieces of data in many programming languages, including Python.

But these data structures map each key to exactly one value at a time.

I’ll show you what I mean. We start with a dictionary:

(By the way, should you try to cut & paste any of this code, watch out for the newlines… I’ll post a plain text version soon. Wordpress has some rather… helpful ideas about how to fix quotes and stuff. There are also some wayward backslashes but… my will grows weak.)

>>> d = {}

And then we add a few keys and values:

>>> d['a'] = 1
	
>>> d['b'] = 2
>>> d['c'] = 3
>>> print d
{'a': 1, 'c': 3, 'b': 2}

But if we put in a new value for ‘a’:

>>> d['a'] = 10
>>> print d
{'a': 10, 'c': 3, 'b': 2}

The old value is lost.

But what if you want to keep the old value around, that is, to collect all the values that ‘a’ is set to?

That’s where we need something that might be called a “Collection.” (Thanks to Jonas for helping industrialize this). If you’re familiar with Python classes, you’ll recognize that this is a subclass of the dictionary type — that means that you can add to the collection and look things up just as if it were a normal dictionary.

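class Collection(dict):
    """A dict that collects every value assigned to a key in a list."""

    def __init__(self, *args, **kargs):
        dict.__init__(self, *args, **kargs)
        # Wrap any initial values in lists so later assignments can append.
        for k, v in list(self.items()):
            dict.__setitem__(self, k, [v])

    def __setitem__(self, item, value):
        if item in self:
            # Key already seen: add to its list instead of overwriting.
            self[item].append(value)
        else:
            dict.__setitem__(self, item, [value])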

The implementation is a bit tricky, perhaps, but it’s easy enough to use…

>>> from collection import Collection
	
>>> c = Collection()
>>> c['a'] = 1
>>> c['a'] = 10
>>> c['b'] = 2
>>> c['c'] = 3

Notice that we entered 1 and 10 as the values of the key ‘a’, now:

>>> print c
{'a': [1, 10], 'c': [3], 'b': [2]}
	

So if we look up the value of the key ‘a’, we get both of them back in a list:

>>> print c['a']
[1, 10]

This is great, because it lets you pigeonhole data into categories. For instance, say you wanted to group words into lists according to their first letter:

>>> randomwords = """about always asking authors brainymedia
... brainyquote comedian december dictionary exact
... history inquire paula poundstone taken their topics trivia trying""".split()

>>> c = Collection()
>>> for word in randomwords:
...     firstletter = word[0]
...     c[firstletter] = word

>>> for k,v in c.items(): print k, v
a ['about', 'always', 'asking', 'authors']
c ['comedian']
b ['brainymedia', 'brainyquote']
e ['exact']
d ['december', 'dictionary']
i ['inquire']
h ['history']
p ['paula', 'poundstone']
t ['taken', 'their', 'topics', 'trivia', 'trying']

Neat right?

Here’s a somewhat more realistic example, where we create a mapping of letters to all the words that contain those letters:

>>> letters = list(set(''.join(randomwords)))
>>> # this creates a list of all the letters in our random word list
>>> print letters
['a', 'c', 'b', 'e', 'd', 'g', 'i', 'h', 'k', 'm', 'l', 'o',
'n', 'q', 'p', 's', 'r', 'u', 't', 'w', 'v', 'y', 'x']

>>> c = Collection()

>>> for letter in letters:
...     for word in randomwords:
...         if letter in word:
...             c[letter] = word
>>> print c['a']
['about', 'always', 'asking', 'authors', 'brainymedia', 'brainyquote', 'comedian', 'dictionary', 'exact', 'paula', 'taken', 'trivia']
>>> for k,v in c.items(): print k, len(v)
a 12
c 5
b 4
e 9
d 5
g 2
i 11
h 3
k 2
m 3
l 2
o 8
n 9
q 2
p 3
s 6
r 10
u 6
t 12
w 1
v 1
y 6
x 1

There’s nothing revolutionary about any of this; it’s just convenient, and I’ve found that, for whatever reason, this structure is useful in linguistic contexts.
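For comparison, here is a minimal sketch of the same first-letter grouping using `collections.defaultdict` from the standard library instead of the Collection class. This is a swapped-in technique, not the class above: you append to the list yourself rather than assigning to the key. The function name `group_by_first_letter` is made up for illustration, and the code is written for Python 3.

```python
from collections import defaultdict

# Same word list as in the examples above.
randomwords = """about always asking authors brainymedia
brainyquote comedian december dictionary exact
history inquire paula poundstone taken their topics trivia trying""".split()


def group_by_first_letter(words):
    """Map each first letter to the list of words that start with it."""
    groups = defaultdict(list)  # a missing key starts out as an empty list
    for word in words:
        groups[word[0]].append(word)
    return groups


by_letter = group_by_first_letter(randomwords)
print(by_letter['a'])  # ['about', 'always', 'asking', 'authors']
```

The trade-off is that a defaultdict makes the appending explicit, while Collection hides it behind plain assignment.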

Another reason Wikipedia Works

Written by Patrick Hall, 1 year, 10 months ago.

John Yunker has an interesting observation about why Wikipedia is so successful as a multilingual site:

How Wikipedia Manages Multilingual Content Expectations. Going Global: Adventures in Web globalization and all that it entails

Not only does it offer content in more than 100 languages, it does a good job of managing content expectations.

It does not hide how much content it offers in each language. In fact, it tells you upfront how many articles it offers, such as 168,000+ articles in Swedish and 143,000+ articles in Português.

This is really true. It doesn’t take much reflection to agree that it’s annoying to come across a website with language options that turn out to lead to only a few pages in the language. I for one would like to know what the relative proportions of multilingual content are on multinational sites like Deutsche Welle.

For Blogamundo, the information I’m most curious to see is just which pairs of languages people choose to translate between. As far as I know that sort of information isn’t terribly easy to come by. We’ll make it as accessible as we can, here.

Hey Google…

Written by Patrick Hall, 1 year, 11 months ago.

What the heck?

I search for punc and get hits for punch.

Head a splodes!

Þšéüđø-lõçålîżáŧïòň? Pseudolocalization.

Written by Patrick Hall, 1 year, 11 months ago.

What a cool idea. Within a wide-ranging thread I wandered into via Found in Translation, I ended up at this post by Erik Schwiebert:

One of the ways we deal with that is a process called “pseudo-localization.” This has nothing to do with ‘pseudo-code’; instead, it is a way of forcing text into some translation automatically, yet still have that text be mostly readable. It works by taking the normal Roman alphabet and changing each of the characters into some similar character, perhaps one with an accent, or a copyright symbol instead of a C. We also pad each string with extra text to make it wider to check for dialog mis-layout and string insertions.

So “pseudo-localization” might become “[=== Þšéüđø-lõçålîżáŧïòň ===]” — still mostly humanly-readable, wider to force dialog layout, and bracketed so we can tell if a dev hardcoded string insertions. We can do this in an entirely automated fashion, and this technique lets us test perhaps 50% of Office as if it were localized, so that we can catch obvious dev mistakes right away.

This reminds me of Sam Ruby’s Survival Guide to Internationalization, where he uses “Iñtërnâtiônàlizætiøn” as a test phrase. In a previous post I advocated smushing any old non-ASCII text into every nook and cranny of your application.

I still think that’s worth doing, but there are advantages to this “pseudo-localization” tool. For one thing, it’s easier to pronounce something like “Þšéüđø-lõçålîżáŧïòň”. (You can just say “pseudo-localization”!) This is useful when all the developers you’re working with don’t speak the same set of languages.

Far more important is the fact that it’s automated. That means that you can use this sort of stuff in unit tests, for instance.


# -*- coding: utf-8 -*-
import re
import random

# Map each plain letter to an accented lookalike (later duplicates win).
pseudo = u"Þšéüđølõçålîżáŧïòň"
plain = u"pseudolocalization"
pseudomap = dict(zip(plain, pseudo))

# Regex matching words spelled only with letters we can map.
sample = ''.join(set(plain))
sampleRE = re.compile('^[' + sample + ']+$')

allwords = [w.strip() for w in open('/usr/share/dict/words')]

# Pick 25 random mappable words and print each next to its pseudo form.
samplewords = [w for w in allwords if sampleRE.match(w)]
random.shuffle(samplewords)
afewwords = samplewords[:25]
for w in afewwords: print w, ''.join([pseudomap[c] for c in w])

The excitement is unbearable!

autopilot áüŧòÞïlòŧ
canine çáňïňé
clueless çlüéléšš
colonialists çòlòňïálïšŧš
consciousnesses çòňšçïòüšňéššéš
cuddliest çüđđlïéšŧ
ipecacs ïÞéçáçš
opines òÞïňéš
outlast òüŧlášŧ
postponed ÞòšŧÞòňéđ
punctuates Þüňçŧüáŧéš
salon šálòň
saltiest šálŧïéšŧ
sappiest šáÞÞïéšŧ
snot šňòŧ
spectacles šÞéçŧáçléš
titillates ŧïŧïlláŧéš
toilette ŧòïléŧŧé
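
The script above only swaps letters. A fuller sketch of what Erik describes (swap the letters, pad the string, bracket it) might look something like this. To be clear, this is not Microsoft's tool: `pseudolocalize`, `looks_localized` and the `===` padding are names and choices made up for illustration, and the code is written for Python 3.

```python
# Hypothetical helpers; the mapping is built the same way as in the script above.
PLAIN = "pseudolocalization"
PSEUDO = "Þšéüđølõçålîżáŧïòň"
PSEUDOMAP = dict(zip(PLAIN, PSEUDO))  # later duplicates win, e.g. 'o' -> 'ò'


def pseudolocalize(text, pad="==="):
    """Swap mapped letters, leave everything else alone, then pad and bracket."""
    body = "".join(PSEUDOMAP.get(ch, ch) for ch in text)
    return "[%s %s %s]" % (pad, body, pad)


def looks_localized(text):
    """A string that went through pseudolocalize() keeps its brackets."""
    return text.startswith("[") and text.endswith("]")


print(pseudolocalize("salon"))  # [=== šálòň ===]
```

In a unit test, the idea would be to run every user-visible string through something like `pseudolocalize` and then assert that whatever reaches the screen still passes `looks_localized`; a string that was hardcoded or glued together from pieces loses its brackets.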

Is there such a thing as “Linguistic Computing”?

Written by Patrick Hall, 1 year, 11 months ago.

Jon Udell has this to say about mentorship and open source:

Open source software development, as a profession, is an early adopter of a work style that can also characterize many other professions. The key aspects of that work style are transparency, accountability, network-mediated collaboration, and narration of work.

My own educational background is in linguistics, not programming. The only reason that I’ve been able to become a hacker is the dynamic that Jon describes: if you want to learn to hack, there is a “guild” waiting for you. You can rise to whatever level of expertise you work up to. It’s a society, a culture. There are tour guides. There are neighborhood hangouts. There are public libraries. And there are blogs, too many to link.

This was all a welcome discovery to me when I learned the hard way that the job opportunities for a language geek pale in comparison to the job security of a language geek who can program.

So, it’s no exaggeration to say that open source has offered me a career.

But, some of the things I’ve had to learn were too hard to learn. I’ve read more conflicting theories about the right way to handle Unicode than I care to remember. But Unicode should be bread and butter to anyone who studies language with a computer! Unicode is an absolute fundamental. Imagine if there were no man page for bash, and you start to understand the sort of hoops a would-be language hacker has to leap through when approaching Unicode for the first time.

Programmers often scoff at such complaints, because they’ll tell you “it’s just character encoding, what’s so hard about it?” That’s because most programmers don’t remember what it’s like not to know a character encoding from a salad fork. And the way that individual writing systems are set up within Unicode has its complications as well. And this is just one subtopic.

Unlike the “guild” that welcomed me as I tiptoed my way into general programming and hackerdom, there are no open arms to help you into the world of “linguistic computing.” There simply is no “conscious community” of people interested in thinking of the intersection of these topics as a unified discipline.

Perhaps this is all my own wishful thinking. Perhaps there is no discernible discipline surrounding the intersection of language and computing.

You tell me: is there a thread here?

  • localization
  • web-based language study
  • language creation (!)
  • computational linguistics
  • natural language processing
  • Unicode
  • cross-language retrieval
  • computer aided translation
  • internationalization
  • statistical language modelling
  • information retrieval
  • encoding and keyboarding issues

I look at that list and I see a connection: they’re all to do with “linguistic computing.” (And no, I don’t like that term either, but “language hacking” sounds even worse…)

I know that there are people who share an interest in this “thing.” And sometimes, when a programmer happens to get involved in this stuff, they become aficionados: Jonas (the real programming brains behind Blogamundo) wasn’t into this stuff at all, really, before we became friends, but now he’s a dyed-in-the-wool Unicode fanatic, and increasingly a language geek in his own right. But by and large, it’s not “discoverable” on the web.

Why isn’t there more of a sense of community among:

  • linguists who want to dip their feet into programming
  • programmers who want to learn more about the mysterious number crunching that goes on in stuff like chardet and SpamBayes
  • Language technology professionals who want to promote their ideas and their code to a wider audience — be those ideas from machine translation, information retrieval, l10n, etc.

There are web communities for environmental geeks, economics geeks, nanotech geeks, and god knows how many for politics geeks.

And I hasten to point out that there is no shortage of fantastic blogs about linguistics, language, globalization, writing systems, translation, or localization issues.

But I still feel like this “discipline” of linguistic computing is a distinct thing, which could and should serve as the basis for a vibrant community of people who help each other learn about code, language, and all the rest of this stuff, together.

Am I just being a hippy with all this talk, or does anyone agree?

Some random linkage…

Written by Patrick Hall, 1 year, 11 months ago.

Here’s a couple thoughts.

Detecting Spam Web Pages through Content Analysis
Via News you can Bruise, a nice paper on spam detection. But this little buried tidbit was of most interest to yours truly: “In our data set, the majority of the pages (about 54%) are written in the English language, as determined by the parser used by MSN Search.” Just over half, that’s all! The authors included a member of Microsoft Research, so presumably they had pretty unlimited access to the crawler. Their language identification algorithm is proprietary, according to the footnote, so who knows how good it is. But if that number is accurate, it’s even more evidence that English isn’t the 500-pound gorilla of the net anymore…
Indiantelevision.com > Media, Advertising & Marketing Watch > NRS 2005: ‘Jagran’ topples ‘Bhaskar’ to claim top slot
Remember print? I would suspect that in India newspapers are still a relatively more important source of news than the web or TV. (Anybody know whether that’s true?) Anyway, this article describes a telling recent change: the newspaper with the largest circulation on the subcontinent is now in Hindi, not English: India Today. (Just checked their site… it’s not Unicode, ew, ew!!!)

*ehem*

Written by Patrick Hall, 1 year, 11 months ago.

No comment.

How do you Google Something you can’t Spell?

Written by Patrick Hall, 1 year, 11 months ago.

The most recent installment of a series of fun posts on language identification came and went too soon for me to notice and send in my guess (I woulda been right, too, dagnabbit!), but the ensuing discussion has gotten me thinking about something. I’ll just bring the topic up tonight. Er, this morning.

(Psst: dig their new book.)

Anyway, the cat’s out of the bag now, the language in question was Romanian.

It just so happens that a few days ago I was reading an interesting short paper called “Using N-grams to Process Hindi Queries with Transliteration Variations” (pdf). As you may be aware, there’s a lot of Hindi on the web that’s written in transliterated form, especially Bollywood lyrics. (The motivations for transliterating such texts instead of writing them in Devanāgarī merit another post…)

I’ve been hacking on a bit of code to simulate the approach in the paper, but it’s not really ready for prime time yet. Anyway, here’s what reminded me of the paper:

Language Log transcription:

Gaşca-i adunată din mii
Nu s-a schimbat
În pliu fu şi în fiţe o ţine ne-ncetat
Tatuaje noi, inele şi cercei
Stând doar pe MTV
Şi nu ne pasă ce zic ei

E o lume nouă
Una nouă nouă nouă
E o lume nouă

Transcription found online:

Gashca-i adunata,
nimic nu s-a schimbat
In chefuri si in fite o tinem ne-ncetat
Cu tatuaje noi, inele si cercei
Stam doar pe MTV
si nu ne pasa ce zic ei

E o lume 9, 1@999
E o lume 9, yeah-yeah-yeah-yeah
E o lume 9, 1@999
E o lume 9, hei-hei-hei-hei.

The “authentic” version is the one with the “wrong” orthography. That’s a pretty common situation — Brazilians, for instance, will write «naum» for não, «eh» for é, and so on. These web orthographies may not make language teachers smile, but they’re by no means marginal in statistical terms.

And from the point of view of a web search, the difference is more than academic. After all, the search Gaşca-i adunată din mii fails, but Gashca-i adunata, nimic nu s-a schimbat succeeds.

And this problem is what the paper describes: a simple method of using n-grams (substrings of words) to perform fuzzy matches. The queries are for song titles, as a matter of fact: a query such as jane na nazar jigar pehchanay, in one of a bazillion idiosyncratic transliterations, is run against a database where the song might actually be recorded as jaane na nazar pehchaane jigar yeh kaun. I presume that pehchanay and pehchaane, jane and jaane are in fact different ways to transcribe the same word.
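
To give a flavor of the idea, here is a minimal sketch of n-gram fuzzy matching. It is not the paper's exact method: the use of bigrams, the `_` padding at word edges, and the Dice coefficient as the similarity score are all assumptions on my part, and the function names are made up. Written for Python 3.

```python
def ngrams(word, n=2):
    """Return the set of character n-grams of a word, padded with '_' at both ends."""
    padded = "_" * (n - 1) + word + "_" * (n - 1)
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def dice(a, b, n=2):
    """Dice coefficient over n-gram sets: 1.0 for identical words, 0.0 for no overlap."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    return 2.0 * len(ga & gb) / (len(ga) + len(gb))


def best_match(query_word, candidates, n=2):
    """Pick the candidate whose n-grams overlap most with the query word."""
    return max(candidates, key=lambda c: dice(query_word, c, n))


title = "jaane na nazar pehchaane jigar yeh kaun".split()
print(round(dice("pehchanay", "pehchaane"), 2))  # 0.7
print(best_match("jane", title))                 # jaane
```

Even this crude score pairs each query spelling with its counterpart in the stored title, which is roughly why such a simple approach gets surprisingly far.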

The search engines of today don’t really allow you to get around such problems — and while Google’s spelling suggestion tool makes an effort, it doesn’t really help much here.
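
One cheap partial workaround (my own aside, not something from the paper) is to fold the diacritics away before comparing strings. A minimal Python 3 sketch using the standard `unicodedata` module; `fold_diacritics` is a made-up name:

```python
import unicodedata


def fold_diacritics(text):
    """Decompose characters (NFKD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(fold_diacritics("Şi nu ne pasă ce zic ei"))  # Si nu ne pasa ce zic ei
```

That makes «adunată» match «adunata», but it does nothing for «Gaşca» versus «Gashca», which is where fuzzy matching earns its keep.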

Anyway, like I said, I’m just bringing up the topic. Fuzzy matching on stuff like this is a fun thing to code. It’s surprising how successful a simple approach can be. I’ll try to post that code if I get around to making it readable.