Hacklog: Blogamundo — poking holes in the language barrier since approximately 1 month from now

Wow @ Google

Written by Patrick Hall, 1 year, 11 months ago.

Frabjous day!

This isn’t directly related to Blogamundo, but, uh, it’s just too awesome not to post about:

Official Google Research Blog: All Our N-gram are Belong to You

We believe that the entire research community can benefit from access to such massive amounts of data. It will advance the state of the art, it will focus research in the promising direction of large-scale, data-driven approaches, and it will allow all research groups, no matter how large or small their computing resources, to play together. That’s why we decided to share this enormous dataset with everyone. We processed 1,011,582,453,213 words of running text and are publishing the counts for all 1,146,580,664 five-word sequences that appear at least 40 times. There are 13,653,070 unique words, after discarding words that appear less than 200 times.
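For a feel of what "counts for all five-word sequences that appear at least 40 times" means mechanically, here is a toy sketch. It is not Google's pipeline (which the announcement doesn't describe); `ngram_counts` and its parameters are names made up for illustration, and the input is assumed to be already tokenized.

```python
from collections import Counter

def ngram_counts(tokens, n=5, min_count=40):
    """Count every n-token window and keep those seen at least min_count times.

    Hypothetical helper: the defaults only mirror the numbers quoted
    in the announcement (five-word sequences, threshold of 40).
    """
    windows = (tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    counts = Counter(windows)
    return {gram: c for gram, c in counts.items() if c >= min_count}

# Tiny demo with a lowered threshold so something survives the cut.
tokens = "a b c d e f".split() * 3
frequent = ngram_counts(tokens, n=5, min_count=3)
```

At a trillion words this obviously wouldn't fit in one process's memory; the point is only the shape of the computation: slide a window, count, threshold.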

Things I’m wondering right off the top of my head:

  1. Is it gonna cost… like… a lot?
  2. Please, FSM, let it not be a trillion words of English.
  3. Many cheers for the Linguistic Data Consortium. (Perhaps someone at Google took Mark Liberman’s open letter to Microsoft to heart?)

The first project that comes to my mind would be to try to sample writing systems—what the Unicode folks call scripts. It’s a given that most text on the web is in the Latin script or its derivatives, and second place could very well be Chinese (but which encoding?)… but then things get interesting.

Just how much Ethiopic is out there? Georgian? Persian? Tifinagh?
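A rough way to start on that script census, as a sketch only: Python's standard `unicodedata` module doesn't expose the real Unicode Script property, so this takes the first word of each character's Unicode name as a stand-in (my assumption, and a crude one). `rough_script` and `script_counts` are made-up names. Note that Persian would show up as ARABIC here, since that's the script it is written in.

```python
import unicodedata
from collections import Counter

def rough_script(ch):
    """Guess a script label from the first word of the character's Unicode name.

    This is a proxy, not the official Script property: e.g. Chinese
    ideographs come back as "CJK" and Persian letters as "ARABIC".
    """
    if not ch.isalpha():
        return None  # skip digits, spaces, punctuation
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else None

def script_counts(text):
    """Tally letters per rough script label."""
    labels = (rough_script(ch) for ch in text)
    return Counter(label for label in labels if label is not None)

sample = "Hello ሰላም გამარჯობა سلام"
print(script_counts(sample).most_common())
```

Run over a sample of n-grams instead of one string, the same tally would give a first answer to "how much Ethiopic is out there", at least to the extent the released data isn't all English.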

A trillion words.

A TRILLION!

That’s even comparable to the US debt!

2 Comments for 'Wow @ Google'

  1. Comment received 1 year, 10 months ago from anonymous

    Mark Liberman could put his money where his mouth is and make the UPenn resources a lot more accessible.

  2. Comment received 1 year, 10 months ago from Patrick Hall

    I think that’s a pretty cheap shot, anonymous. It’s not like hosting is free; the LDC has to keep its head above water somehow. Yes, it would be great if huge corpora were free on the web.

    Why not criticize Google for not opening up their database to all comers?
