Musings on a Language Survey of Twitter
In the wake of my painfully hand-constructed survey of languages on Twitter on election night, my interest is piqued: what would it take to come up with some REAL numbers, automatically?
The problem with running statistical language identification on Twitter is that twits are too short: 140 characters.* My simple language ID tool needs a lot more than that.
But it seems to me that there is a good solution, like this:
- Get a bunch of twits (somehow)
- Eliminate everything that shows up as English
- Get more twits from the users that posted the non-English ones (using the API, again)
- Run language ID on a bunch of those users’ twits, assuming that they twitter in just one language (it’s quite common for people to twitter in many languages, but gotta start somewhere…)
- Count.
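The steps above could be sketched roughly as follows. This is only an outline under stated assumptions: `identifyLanguage` and `fetchUserTweets` are hypothetical callbacks that the caller supplies (they are not real Twitter API calls), each tweet is assumed to be an object with `user` and `text` fields, and a real fetch would be asynchronous rather than the plain function call used here.

```javascript
// Sketch of the survey pipeline. `identifyLanguage` and `fetchUserTweets`
// are hypothetical callbacks supplied by the caller, not real Twitter APIs.
function surveyLanguages(sampleTweets, identifyLanguage, fetchUserTweets) {
  // Steps 1-2: drop every sample tweet that is identified as English ("en").
  const nonEnglish = sampleTweets.filter(
    (tweet) => identifyLanguage(tweet.text) !== "en"
  );

  // Step 3: collect the distinct users behind the remaining tweets.
  const users = new Set(nonEnglish.map((tweet) => tweet.user));

  // Steps 4-5: run language ID on each user's combined tweets and count
  // one vote per user, assuming each user writes in a single language.
  const counts = new Map();
  for (const user of users) {
    const combined = fetchUserTweets(user).join(" ");
    const lang = identifyLanguage(combined);
    counts.set(lang, (counts.get(lang) || 0) + 1);
  }
  return counts;
}
```

Combining a user's tweets into one string is what gives the language ID step more text to work with than a single 140-character twit.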
I’ll get around to that… any minute now.
*Incidentally, do multibyte languages get their maximum length cropped on Twitter?

Interesting question. I signed up for Twitter just in order to answer it. The answer is YES. The Javascript interface appears to count characters while the server counts bytes. You FAIL, Twitter!
Well, not so fast… I’m thinking Twitter’s not to blame here. Keep in mind that Twitter is designed to be used over SMS, and that’s a dinosaur of a technology right there. So I think that the much-mentioned 140 “character” limit is really the 140 byte limit built into SMS.
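As a back-of-the-envelope check on that byte budget, here is a small sketch of how many characters fit in one SMS payload under its two common encodings. It assumes the standard 140-octet payload, 7 bits per character for the GSM default alphabet, and 2 bytes per character for UCS-2; the function name `smsCapacity` is made up for illustration, and nothing here says how Twitter itself applies the limit.

```javascript
// Characters that fit in a single SMS payload of `payloadBytes` octets.
// `smsCapacity` is a hypothetical helper name used only for this sketch.
function smsCapacity(payloadBytes) {
  return {
    // GSM 7-bit default alphabet: characters are packed 7 bits each.
    gsm7: Math.floor((payloadBytes * 8) / 7),
    // UCS-2: every character takes 2 bytes.
    ucs2: Math.floor(payloadBytes / 2),
  };
}

console.log(smsCapacity(140)); // { gsm7: 160, ucs2: 70 }
```

So the same 140 bytes hold 160 characters of plain Latin text but only 70 characters of, say, Japanese or Arabic.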
So then, it wouldn’t be Twitter’s fault, would it?
The inconsistency between what the Web interface tells me and the actual limit would definitely be Twitter’s fault.
I tried the tool, and found that it doesn’t cover a very large set of languages. Do you have the list anywhere?
For a part of my MSc thesis, I used a very simple algorithm to identify Persian text. The challenge is that it shares the alphabet with Arabic, Urdu, and 7-8 other languages.
Will let you know more if you are interested.
Yo Ke:
Can Javascript count bytes? It seems to count everything in code points for me:
>>> "abc".length
3
>>> "カキク".length
3
>>> "أبج".length
3
I’m not sure how one could go about getting Javascript to count bytes, but it seems to me that that’s what it would take to get a count of how many bytes would actually be used in some given text.
Oh wait, they could do this:
http://www.inter-locale.com/demos/countBytes.html
So yeah, fail on Twitter’s part. Thanks.
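For reference, here is a minimal sketch of one way to count bytes in JavaScript. It assumes the server stores text as UTF-8 (nothing in this thread confirms that), it uses the standard `TextEncoder` API, and it is not necessarily what the linked page does. The helper name `utf8ByteLength` is made up for illustration. Note that `.length` counts UTF-16 code units, which happens to equal the number of code points for the three strings tested above.

```javascript
// Count the UTF-8 bytes a string would occupy, as opposed to its .length.
// Assumes UTF-8 on the server side; `utf8ByteLength` is a hypothetical name.
function utf8ByteLength(text) {
  return new TextEncoder().encode(text).length;
}

console.log("abc".length, utf8ByteLength("abc"));       // 3 3
console.log("カキク".length, utf8ByteLength("カキク")); // 3 9
console.log("أبج".length, utf8ByteLength("أبج"));       // 3 6
```

A character counter built on something like this would show Japanese text using up the limit three times as fast as English, which is the mismatch described above.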
Hi Farzaneh,
Thanks for trying out the language id tool. It’s still in development, and it’s a long way from perfect. I haven’t updated the language models in some time, and I’m pretty sure that Persian is one of the languages that are missing.
Is your thesis online? My algorithm is very simple and doesn’t do anything language-specific at all; it just builds a simple bigram model for each sample language, builds a model of the unknown text, and then uses a vector similarity measure to compare the unknown to all the known models.
The plan is to release the thing at some point, but I’m just too busy lately.
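Until then, here is a rough sketch of the general idea, not the tool’s actual code. Two things are assumptions on top of the description above: the bigrams are taken over characters (rather than words), and cosine similarity stands in for the unspecified vector similarity measure. All function names are made up for illustration.

```javascript
// Build a character-bigram frequency model: Map from bigram to count.
// Character bigrams are an assumption for this sketch.
function bigramModel(text) {
  const chars = Array.from(text.toLowerCase()); // split by code point
  const model = new Map();
  for (let i = 0; i + 1 < chars.length; i++) {
    const bigram = chars[i] + chars[i + 1];
    model.set(bigram, (model.get(bigram) || 0) + 1);
  }
  return model;
}

// Cosine similarity between two sparse count vectors (assumed measure).
function cosineSimilarity(a, b) {
  let dot = 0;
  for (const [bigram, count] of a) {
    dot += count * (b.get(bigram) || 0);
  }
  const norm = (m) =>
    Math.sqrt([...m.values()].reduce((sum, v) => sum + v * v, 0));
  const denom = norm(a) * norm(b);
  return denom === 0 ? 0 : dot / denom;
}

// Return the label of the known model most similar to the unknown text,
// or null if there are no known models.
function identify(text, knownModels) {
  const unknown = bigramModel(text);
  let best = null;
  let bestScore = -1;
  for (const [lang, model] of knownModels) {
    const score = cosineSimilarity(unknown, model);
    if (score > bestScore) {
      best = lang;
      bestScore = score;
    }
  }
  return best;
}
```

Usage would look something like `identify(someText, new Map([["en", bigramModel(englishSample)], ["de", bigramModel(germanSample)]]))`, where the sample texts are whatever training data is at hand. With very short input the unknown model has only a handful of bigrams, which is why single twits are hard to classify this way.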
Cool thing you’re building a language recognizer and planning to release it. Here’s another implementation of the same idea, done by a friend of mine (we used this quite successfully in a class project): http://trac2.assembla.com/irproject/browser/LACrIMoSA/src/jdellert/nlp/recognizer