hacklog

Google and UPenn’s Ngrams

Written by Patrick Hall, September 25th, 2006

The Google folks have updated the blog post about releasing a veritable avalanche of ngrams, which we mentioned here a while back.

Unfortunately, unless I’m mistaken, it seems the data’s all English. That’s a great thing if you’re interested in English exclusively, but not so much for us, since we aren’t. The set also costs $150, which I suppose is fair enough, considering the amount of work that must have gone into spidering all that data, converting 24 gigs to UTF-8 (no mean feat, that), and then filtering out everything but English.

(That last bit is where I cry.)

Anyway, here’s hoping there will be some similarly cool multilingual content somewhere down the road.

idle in peace, lilo

Written by Patrick Hall, September 16th, 2006