Hacklog: Blogamundo — poking holes in the language barrier since approximately 1 month from now

What the Heck is a Language Model? 5 Minute Answer

Written by Patrick Hall, 7 months, 3 weeks ago.

Scrabble.

Scrabble is based on a language model.

Specifically, the set of Scrabble tiles, each with a letter and a numerical value, constitutes a language model.

When you score a word in Scrabble (forget triple word scores and all that), you’re using a language model to evaluate how “good” the word is.

Now it so happens that the Scrabble language model of English is a bit odd. Namely, you get a high score in Scrabble for words that have uncommon letters. So if you look at records of professional Scrabble games, the words sometimes get so nutty they barely resemble English. (X’s and Y’s and Q’s all over the place, but nary an S.)
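To make that concrete, here is a minimal sketch of the Scrabble "model" as code, using the standard English tile values and ignoring board bonuses. The names TILE_VALUES and scrabble_score are made up for illustration.

```python
# Standard English Scrabble tile values (board bonuses ignored).
# TILE_VALUES and scrabble_score are illustrative names, not from any library.
TILE_VALUES = {
    **dict.fromkeys("AEILNORSTU", 1),
    **dict.fromkeys("DG", 2),
    **dict.fromkeys("BCMP", 3),
    **dict.fromkeys("FHVWY", 4),
    "K": 5,
    **dict.fromkeys("JX", 8),
    **dict.fromkeys("QZ", 10),
}


def scrabble_score(word):
    """Sum the tile values of a word; non-letter characters count as 0."""
    return sum(TILE_VALUES.get(ch, 0) for ch in word.upper())


print(scrabble_score("quiz"))  # rare letters, high score
print(scrabble_score("see"))   # common letters, low score
```

A word full of Q's and Z's beats a perfectly ordinary word like "see" by a wide margin, which is exactly the "weirdness" this model rewards.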

Usually in Natural Language Processing you want your model to capture “normalness,” not “weirdness.” So instead of giving a high score to the rare letters “Q” or “X”, you’d give high scores to the far more common “S” and “E”. An easy way to build a simple language model along those lines is simply to take a bunch of text in the language in question and count up each letter. The letter’s score is the letter’s frequency (maybe normalized to a score, between 0 and 100, say).
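A rough sketch of that counting approach might look like the following. The function names are hypothetical, and scaling so that the most common letter gets 100 is just one assumed way to do the "between 0 and 100" normalization.

```python
from collections import Counter


def build_letter_model(text):
    """Count letters in a sample text and scale so the commonest letter scores 100.

    The 0-100 scaling is an assumption; any consistent normalization would do.
    """
    counts = Counter(ch for ch in text.lower() if ch.isalpha())
    if not counts:
        return {}
    top = max(counts.values())
    return {letter: 100 * n / top for letter, n in counts.items()}


def score_word(word, model):
    """Score a word by summing its letters' scores; unseen letters score 0."""
    return sum(model.get(ch, 0.0) for ch in word.lower())


# Toy corpus, far too small for real use; feed it a big pile of text instead.
model = build_letter_model("see the sea")
print(score_word("see", model))
print(score_word("qx", model))
```

With a real corpus behind it, common letters like "e" and "s" end up near the top of the scale and "q" and "x" near the bottom, the reverse of Scrabble.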

Now, we’re talking about a very simple language model here, but that’s the general idea.

(And it’s not hard to think of a quick application: you could probably take the Scrabble letter distributions in lots of languages and use them as a simple language recognition tool. Score a mystery text according to each of those models, and see which one returns the lowest value. That’d be your best guess. It might suck, since identifying languages is easier if you use sequences of two or more letters, but it might work, too.)
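Here is a hedged sketch of that recognition idea. It swaps in letter-frequency models built from sample text rather than real Scrabble tile data, so the best guess is the model with the highest average score. With Scrabble point values, where rare letters cost more, you would pick the lowest instead. All names and the tiny sample sentences are made up for illustration.

```python
from collections import Counter


def letter_model(text):
    """Map each letter in a sample text to its relative frequency."""
    counts = Counter(ch for ch in text.lower() if ch.isalpha())
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()}


def average_score(text, model):
    """Average per-letter score of a text under one model (0.0 if no letters)."""
    letters = [ch for ch in text.lower() if ch.isalpha()]
    if not letters:
        return 0.0
    return sum(model.get(ch, 0.0) for ch in letters) / len(letters)


def guess_language(text, models):
    """Return the name of the model under which the text looks most 'normal'."""
    return max(models, key=lambda name: average_score(text, models[name]))


# Toy samples only; real models need much more text per language.
SAMPLES = {
    "english": "the quick brown fox jumps over the lazy dog and then sleeps",
    "spanish": "el veloz zorro salta sobre el perro perezoso y luego duerme",
}
MODELS = {name: letter_model(sample) for name, sample in SAMPLES.items()}
print(guess_language("where is the library", MODELS))
```

Averaging per letter keeps long and short mystery texts comparable. Single-letter models like this are crude, so don't expect miracles from a one-sentence sample per language.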

3 Comments for 'What the Heck is a Language Model? 5 Minute Answer'

  1. Comment received 6 months, 3 weeks ago from ReallyEvilCanine

    It won’t work because Scrabble’s scoring isn’t based just on frequency (and therefore ease). Point values (and number of tiles) don’t correlate in order to make the game more challenging. For example, there are only four “S” tiles, each worth 1 point. The infrequency prevents slapping that letter on the end of a word to make a lazy plural. In German a “Z” is so common that it takes the place of the “Y” on German keyboards, yet there’s only 1 tile worth 3 points.

    Letter frequencies can lead to fairly easy discovery of a language or language family but Scrabble distributions are designed to force more “interesting” words.

    P.S.: Your kindacatcha doesn’t work in FireFerret.

  2. Comment received 1 month, 2 weeks ago from Murray D Glynn

    I have loaded Scrabble disc and am puzzled. In many instances the words the computer comes up with bear little resemblance to English. Many cannot be found in Oxford or Websters dictionaries. Scrabble English seems to present a parallel universe with “words” so outlandish as to defy description. What’s going on?

  3. Comment received 4 days, 13 hours ago from tom

    It looks nice. I like it.
