Hacklog: Blogamundo — poking holes in the language barrier since approximately 1 month from now

Python for Linguistics 1: How to Count Letters with Python

Written by Patrick Hall, 2 years, 3 months ago.

Okay, here goes.

Some observations before you check it out:

  • It’s really hard to write a tutorial for newbies. You have to dig out all your own assumptions and figure out which ones matter.
  • This is the first time I’ve written such a thing; I hope it’s comprehensible.
  • I’ve decided to keep these things outside of the blog so that I can pretend they were perfect in the first place, er, improve them over time.

Please feel free to post comments here! Tell me where you got confused, whether you found it boring, fun, useful, useless, totally misleading, in need of a bipartisan commission to pass judgement on my sanity, whatever.

Also, I guess it’s not fair to say that it’s appropriate for total newbs: I didn’t explain what variables are, and I didn’t get down to the “click the icon to open the Python prompt” sort of directions. There’s a lot of that out there on the web already, but it depends on your OS, etc etc.

Anyway, enough prolegomenizing:

Python for Linguistics 1: How to Count Letters with Python

By the way, we Blogamundistas tend to hang out in #blogamundo on chat.freenode.net, if you do that IRC thing. If you’re trying to get through this tutorial and get stuck, someone (probably me!) will be happy to try to help you.

5 Comments for 'Python for Linguistics 1: How to Count Letters with Python'

  1. Comment received 2 years, 1 month ago from Tom

    I tried to run this python script on a dump file that is 1.3 gigs, which I realize is pretty huge. My question is, when I run the script I get a memory error. Is there a way around this that you know of?

    [******@*******]$ python forumcount.py
    Traceback (most recent call last):
      File "forumcount.py", line 4, in ?
        forumletters = list(forumtext_lower)
    MemoryError

  2. Comment received 2 years, 1 month ago from Patrick Hall

    Yeah, that’s not terribly surprising with a file that large, though I’ve not tried it myself. (I have some several-gig Wikipedia dumps I should try it on, just to watch my CPU smoke.)

    The code in that little tutorial is designed to be as simple to read as possible, and to show that even a brain-numbingly simple approach can be used on a fairly significant corpus, such as Moby Dick.

    But from a practical point of view, it doesn’t really make much sense to load such a huge file into RAM. That’s what happens here:


    mobytext = open('moby10b.txt', 'U').read() # Python just read the whole book.

    I’m not really sure what the relationship is between the amount of RAM your machine has and how much “space” there is in Python’s memory. Like I say, I’m no C hacker, so I’m not terribly clued in on things like memory management (and frankly, I don’t care too much).
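
One way around loading everything at once is to read the file a piece at a time and update the counts as you go. What follows is only a rough sketch, assuming Python 3; `read_chunks`, `count_letters`, the one-megabyte chunk size and the file name in the usage comment are all made up for illustration and are not the tutorial's code.

```python
from collections import Counter


def read_chunks(fileobj, chunk_size=1024 * 1024):
    """Yield successive pieces of an open text file, chunk_size characters at a time."""
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:          # an empty string means end of file
            break
        yield chunk


def count_letters(chunks):
    """Count letters (case-folded) across any iterable of strings."""
    counts = Counter()
    for chunk in chunks:
        # Only the current chunk is held in memory, never the whole file.
        counts.update(c for c in chunk.lower() if c.isalpha())
    return counts


# Hypothetical usage (the file name is an assumption):
# with open("forumdump.txt", encoding="utf-8", errors="replace") as f:
#     counts = count_letters(read_chunks(f))
#     print(counts.most_common(12))
```

Memory use here depends on the chunk size rather than the file size, so a 1.3 gig dump should need no more RAM than a short story does.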

    But there’s another question: when does more data become useless?

    In other words, since in this particular task we were simply counting the number of occurrences of each letter, and since we know that those ratios are pretty consistent within a language, it’s practically certain that the ratios will level off before you get to the end of your gigabyte file.

    An interesting experiment would be to come up with some way to measure the rate of change of the letter frequencies as you read more data (reading the input in chunks, rather than in one fell swoop).

    So for instance, in the first few hundred letters the proportions of frequencies will be all over the place, but as the data comes in the ranking of letters will tend toward ETAOIN SHRDLU, and they’ll stay there.

    I’m sure that there’s some formula which can be used to determine this involving lots of sigmas and whatnot, but it might be fun to try it even without looking those up…
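
Here is one naive way to try that experiment without any sigmas, assuming Python 3. After each chunk it recomputes the cumulative letter frequencies and reports how far they moved since the previous chunk, measured as the sum of absolute differences. That measure, and the names `frequencies`, `total_change` and `track_convergence`, are my own assumptions for the sake of the sketch, not an established formula.

```python
from collections import Counter


def frequencies(counts):
    """Turn raw counts into proportions that sum to 1."""
    total = sum(counts.values())
    return {letter: n / total for letter, n in counts.items()}


def total_change(old, new):
    """Sum of absolute differences between two frequency tables."""
    letters = set(old) | set(new)
    return sum(abs(new.get(l, 0.0) - old.get(l, 0.0)) for l in letters)


def track_convergence(chunks):
    """Yield (letters_seen, change, ranking) after each chunk of text."""
    counts = Counter()
    previous = {}
    for chunk in chunks:
        counts.update(c for c in chunk.lower() if c.isalpha())
        if not counts:
            continue           # nothing but punctuation or whitespace so far
        current = frequencies(counts)
        change = total_change(previous, current)
        # Letters ordered from most to least frequent so far.
        ranking = "".join(letter for letter, _ in counts.most_common())
        yield sum(counts.values()), change, ranking
        previous = current
```

If the hunch above is right, the `change` value should shrink toward zero and the start of `ranking` should stop reshuffling long before the end of a large file, at which point you could simply stop reading.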

  3. Comment received 1 year, 9 months ago from Elwood

    Good tutorial. Thanks for taking the time to make it available. LA

  4. Comment received 1 year, 9 months ago from Elwood

    and for the record, I am pretty new at this programming stuff. I’ve worked through four chapters of Guido’s tutorial and had a friend/colleague talk me through some stuff, but this is the first time I’ve done something practical with Python, that is, something related to what I intend to use it for.

  5. Comment received 4 months, 1 week ago from Mary

    I loved this tutorial, it was very helpful. I did have one question though. Is there any way to add two dictionaries? Let’s say I count Moby Dick’s letters and I count Charles Dickens. Can I add the number of ‘a’ and the number of ‘b’ and so forth?
    Thanks!
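
On Mary's question: yes, two dictionaries of counts can be added key by key. A minimal sketch follows, assuming Python 3 and assuming each dictionary maps a letter to a number; `add_counts` is a hypothetical helper, and the tiny `moby` and `dickens` dictionaries are invented numbers, not real counts.

```python
from collections import Counter


def add_counts(first, second):
    """Return a new dict whose values are the sums of the two inputs."""
    total = dict(first)                    # copy, so the inputs stay untouched
    for letter, n in second.items():
        total[letter] = total.get(letter, 0) + n
    return total


# Invented example counts, for illustration only.
moby = {"a": 3, "b": 1}
dickens = {"a": 2, "c": 5}

combined = add_counts(moby, dickens)

# The standard library's Counter does the same thing with the + operator.
combined_counter = Counter(moby) + Counter(dickens)
```

A letter that appears in only one of the two dictionaries keeps its count as it is. One thing to be aware of with `Counter`: the `+` operator drops any entry whose total is zero or negative, which does not matter for letter counts but can surprise you elsewhere.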
