Google and UPenn’s Ngrams
The Google folks have updated the blog post about releasing a veritable avalanche of ngrams, which we mentioned here a while back.
Unfortunately, unless I’m mistaken, it seems the data’s all English. That’s a great thing if you’re interested in English exclusively, but not so much for us, since we aren’t. It’s also $150, which I suppose is fair enough, considering the amount of work that must have gone into spidering all that data, converting 24 gigs to UTF-8 (no mean feat, that), and then filtering out everything but English.
(That last bit is where I cry.)
Anyway, here’s hoping there will be some similarly cool multilingual content somewhere down the road.