hacklog

Is it possible to mine translations from the Flickr API’s clusters?

Written by Patrick Hall, August 25th, 2008

I have too much to do at the moment to think much about this, but here ya go, half baked for your masticatory delectation:

I was talking to my homey Carlos about the possibility of getting useful translation information out of Flickr’s API.

In particular, I had been tipped off by a rather old article on Flickr’s i18n endeavors, which mentioned the possibility of finding translations for tags within automatic tag clustering:

…Flickr’s tag cluster analysis tools, which monitor which tags are commonly used in conjunction with other tags, can bridge gaps. For example, a Japanese user who types in the Japanese characters for “Tokyo” can click to see clusters of related tags, the top one of which is the English term “Tokyo.”

But while Japanese-language Flickr users evidently often add the “Tokyo” tag in English, the converse isn’t necessarily true, meaning that the tag cluster bridge in some cases runs only one way. Flickr’s “Toyko” English tag cluster doesn’t include the Japanese characters for Tokyo as a common tag companion.

The idea of mining Flickr tags for translations is certainly intriguing, but as far as I can tell, the difficulty lies in the fact that the clusters that Flickr returns are hard to filter by language, let alone by meaning.

So yes, the cluster of tags that go along with “tokyo” (japan, night, shibuya, shinjuku, harajuku, street, 東京, 日本, people, city) contains “東京” (“Tokyo”), but it also contains “日本” (Nihon, “Japan”). I don’t see an obvious path to nailing down just which of those terms is the right translation from such a list, statistically or otherwise.
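To make the problem concrete, here is a rough Python sketch. It does not talk to Flickr at all: the tag list is just typed in by hand from the cluster quoted above, and the helper names (`is_cjk`, `candidates`) are made up for illustration. All it does is keep the tags written entirely in CJK ideographs, which is about the crudest "filter by language" I can think of.

```python
import unicodedata

# Tags typed in by hand from the "tokyo" cluster quoted above
# (an assumption for illustration; nothing here is fetched from Flickr).
CLUSTER = ["japan", "night", "shibuya", "shinjuku", "harajuku",
           "street", "東京", "日本", "people", "city"]


def is_cjk(tag):
    """Return True if the tag is non-empty and every character is a CJK ideograph.

    This checks the Unicode character name, so it is a script test,
    not a language test: it cannot tell Japanese from Chinese.
    """
    return bool(tag) and all(
        unicodedata.name(ch, "").startswith("CJK UNIFIED IDEOGRAPH")
        for ch in tag
    )


def candidates(cluster):
    """Keep only the tags that pass the script test, in their original order."""
    return [tag for tag in cluster if is_cjk(tag)]


if __name__ == "__main__":
    print(candidates(CLUSTER))
```

Run on that list, the filter keeps both 東京 and 日本, so the script test narrows things down but leaves exactly the ambiguity described above: nothing in the cluster itself says which of the two means “Tokyo”.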

But I haven’t thought too hard, hoping that maybe someone else has!

A plain-English description of a Computational Linguistics Thesis

Written by Patrick Hall, August 22nd, 2008

No Comment Necessary.

Written by Patrick Hall, August 19th, 2008

Scripts.txt - How to look up what writing system a Unicode character is from (uh, kind of)

Written by Patrick Hall, August 14th, 2008

Microsoft trademarked “i’m”?

Written by Patrick Hall, August 6th, 2008

Computational, You Say?

Written by Patrick Hall, August 5th, 2008

Is there something wrong with putting Unicode into Javascript source code?

Written by Patrick Hall, August 5th, 2008

Language Selection on Linkedin

Written by Patrick Hall, August 4th, 2008