Is it possible to mine translations from the Flickr API’s clusters?
I have too much to do at the moment to think much about this, but here ya go, half baked for your masticatory delectation:
I was talking to my homey Carlos about the possibility of getting useful translation information out of Flickr’s API.
In particular, I had been tipped off by a rather old article on Flickr’s i18n endeavors, which mentioned the possibility of finding translations for tags within automatic tag clustering:
…Flickr’s tag cluster analysis tools, which monitor which tags are commonly used in conjunction with other tags, can bridge gaps. For example, a Japanese user who types in the Japanese characters for “Tokyo” can click to see clusters of related tags, the top one of which is the English term “Tokyo.”
But while Japanese-language Flickr users evidently often add the “Tokyo” tag in English, the converse isn’t necessarily true, meaning that the tag cluster bridge in some cases runs only one way. Flickr’s “Toyko” English tag cluster doesn’t include the Japanese characters for Tokyo as a common tag companion.
The idea of mining Flickr tags for translations is certainly intriguing, but as far as I can tell, the difficulty lies in the fact that the clusters that Flickr returns are hard to filter by language, let alone by meaning.
So yes, the cluster for photos tagged with “tokyo” (japan, night, shibuya, shinjuku, harajuku, street, 東京, 日本, people, city) contains “東京” (“Tokyo”), but it also contains “日本” (Nihon, “Japan”). I don’t see an obvious path to nailing down just which of those terms is the right translation from such a list, statistically or otherwise.
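To make the problem concrete, here is a rough sketch of the most naive filter I can think of: keep only the tags written entirely in CJK ideographs. The tag list is just the “tokyo” cluster copied from above, and the helper names (`is_cjk_tag`, `translation_candidates`) are made up for illustration. Nothing here talks to the Flickr API, and checking Unicode character names is only a stand-in for real language detection (it would miss kana, for one thing).

```python
import unicodedata

# Tags copied by hand from the "tokyo" cluster quoted above (not fetched live).
TOKYO_CLUSTER = ["japan", "night", "shibuya", "shinjuku", "harajuku",
                 "street", "東京", "日本", "people", "city"]


def is_cjk_tag(tag):
    """Crude script check: True if every character's Unicode name starts with 'CJK'.

    This is an illustrative heuristic, not language detection. It ignores
    kana and cannot tell Japanese from Chinese.
    """
    return bool(tag) and all(
        unicodedata.name(ch, "").startswith("CJK") for ch in tag
    )


def translation_candidates(cluster):
    """Return the tags in a cluster that pass the crude CJK script check."""
    return [tag for tag in cluster if is_cjk_tag(tag)]


candidates = translation_candidates(TOKYO_CLUSTER)
print(candidates)  # ['東京', '日本']
```

Filtering by script gets rid of the English tags, but both “東京” and “日本” survive, which is exactly the ambiguity: nothing in the cluster itself says which one means “Tokyo”.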
But I haven’t thought too hard about it, hoping that maybe someone else has!
