Goal
For our first amazing trick, we'll learn to count letters with Python.
Input
A text file. I used the Project Gutenberg version of Moby Dick; you should go there and download the file moby10b.txt into your working directory so you can compare your results to mine.
Output
Something like this:
a 78554
c 22832
b 17059
...
y 17131
x 1135
z 640
Note that we're just counting the number of occurrences of each letter -- the frequency is not the same thing -- a distinction worth thinking about. Also, we're using a ridiculously narrow definition of "alphabet": the 26 lowercase letters from a to z, although it pains me to do so. Future iterations will improve on both those aspects.
Counting letters may seem like a mundane thing to want to do, but there's a surprising amount of utility that can be wrung out of even this simple maneuver: after a few more installments we'll try building a simple language identification tool. I'm not sure it will work.
That makes it more interesting, though, right?
Python has a small set of "data structures". A data structure is just a way to capture the relationships between different tidbits of information. We'll be using the Python data structures called lists and dictionaries.
Here's the game plan:
Python's idea of a list is pretty much what the common sense one is: a sequence of things, strung along in a specific order. We're going to build a single list with all the letters in our Moby Dick file. If you open up the file in your browser or an editor you'll see that it begins like this:
**The Project Gutenberg Etext of Moby Dick...
And it ends, several bazillion words later, with:
...text of Moby Dick, by Herman Melville
We'll create the list of letters by opening the file you downloaded, and telling Python to read it. (Unfortunately, it can't write that paper for you. Sorry.)
So, our ultimate list of letters in the file will look as if we had typed the whole file, letter by letter, into a list:
mobyletters = ['*', '*', 't', 'h', 'e', ' ', 'p', 'r', 'o', 'j', 'e', 'c', 't', ' ',
(maaany more letters here...) 'm', 'e', 'l', 'v', 'i', 'l', 'l', 'e', '\r', '\n', '\r', '\n']
Before we start slinging around a data structure with that much info, let's try creating one "by hand" at the Python prompt:
>>> mylist = ['a', 'b', 'c']
Simple enough -- a list of three letters.
Python can also tell us the length of the list (when we finally build the list mobyletters, it will be far longer of course):
>>> len(mylist) # len() tells us the length of the list.
3
Now, here's the key feature of lists:
Just as two sentences with the same words in different orders are not the same sentence, two Python lists with the same elements in a different order are not the same list:
>>> ['man', 'bites', 'dog'] == ['man', 'bites', 'dog']
True
>>> ['man', 'bites', 'dog'] == ['dog', 'bites', 'man']
False
(That == is a special equality sign; it means "Are these things exactly the same?")
Python also knows how to alphabetize:
>>> mylist = ['c', 'a', 'b']
>>> print mylist
['c', 'a', 'b']
>>> mylist.sort() # mylist, sort thyself!
>>> print mylist
['a', 'b', 'c']
Why am I going on about this "lists have an order" idea? Because it will help you to understand this:
>>> mylist = ['c', 'a', 'b']
>>> for listelement in mylist:
...     print listelement
c
a
b
You can probably deduce what's going on there. We're telling Python to go through each element of mylist and print it out. This is called "iteration," and because lists have order, when you iterate through a list's elements, you can get them out in the order they were put in.
This is a very useful thing to do, of course. Any time you face a problem where there is a sequence in a specific order, you probably want to represent the sequence as a list.
Enough theorizing -- now that we have an idea of what a list is and how to iterate through it, let's dive right in and start burning up some memory: we'll build a really big list: mobyletters, which will contain all the letters in Moby Dick.
To do so, we need to open up moby10b.txt and read its contents into one gigantic string. Here's how you read that file and stuff its contents into a variable:
>>> mobytext = open('moby10b.txt', 'U').read() # Python just read the whole book.
You don't have to worry too much about precisely what's going on there, but what it does is open up the file moby10b.txt and read its contents into the variable mobytext.
If you were to do:
>>> print mobytext
at this point, you'd be sitting around waiting for the entire novel to finish flying across the screen. So don't do that. Unless... you want to.
Let's also erase the distinction between upper case and lower case letters. We'll just lowercase everything by using .lower().
>>> mobytext_lower = mobytext.lower()
And now we'll convert it into a list of letters:
>>> mobyletters = list(mobytext_lower)
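To see what list() does to a string without waiting on a whole novel, here is a small sketch on a single sentence. It is written in Python 3 syntax (so print takes parentheses), and the names sample and letters are just mine for illustration.

```python
# A one-sentence stand-in for the novel, so this runs without moby10b.txt.
sample = "Call me Ishmael."

# Lowercase the string, then split it into a list of one-character strings.
letters = list(sample.lower())

print(letters)       # every character becomes its own list element
print(len(letters))  # spaces and punctuation count too
```

Note that the space and the period each get their own slot in the list, just like the letters do.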
Now, there are actually more characters than just a-z in that file, some of which are weird things like * and \r and \n and who knows what else. Ignoring for the moment the fact that we only really care about a-z, how many characters do we have?
>>> len(mobyletters)
1232923
One million, two hundred and thirty-two thousand, nine hundred and twenty-three characters.
Good job, Herman.
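That total includes spaces, punctuation, and line endings. If you're curious how many of the characters are actually letters, one possible approach is to walk the list and keep only a-z. This is a sketch in Python 3 syntax on a tiny stand-in string; sample_text and only_letters are names I made up, and string.ascii_lowercase is the standard library's 'abcdefghijklmnopqrstuvwxyz'.

```python
import string

# A short stand-in for the novel, so this runs without moby10b.txt.
sample_text = "Call me Ishmael. **\r\n"
characters = list(sample_text.lower())

# Keep only the characters that are one of the 26 lowercase letters.
only_letters = []
for character in characters:
    if character in string.ascii_lowercase:
        only_letters.append(character)

print(len(characters))    # every character, including spaces and line endings
print(len(only_letters))  # just a-z
```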
And now you have a touch of list-fu.
But we want to count individual letters, not a big string of characters.
To hold that information, we need another data structure. This one is a bit more complex than a list. Before looking at the code, let's look at a picture to get an idea of the sort of data we're going to be representing:
Notice that we're dealing with pairs. In Python such relationships are represented with a "dictionary." Here's a crude way to create a new Python dictionary:
>>> wordlength = {'green': 5, 'ideas': 5, 'colorless': 9}
Also, notice that the relationship this dictionary is recording is actually the length of the words, not the occurrence of letters. Dictionaries can contain any relationship at all, but it's up to you to figure out how to get the data in there.
As for the syntax, think of the colon as meaning "is related to" -- 'green' is related to 5, 'ideas' is related to 5, and 'colorless' is related to 9. In Pythonese, whatever's on the left is called the key, and whatever's on the right is called the value.
This is where the term "dictionary" comes from -- the key and the value correspond to the headword and the definition or translation in a real dictionary. But a Python dictionary is more like an online dictionary than a paper one -- the order in a paper dictionary serves only to help you look up the word... it's not part of the "content." To put it another way, if you somehow took all the entries in an English to French dictionary and shuffled them around, the information would still be there, right? It would just be hard for you to find. The order in a dictionary isn't an essential part of the relationships that the dictionary records.
That's why I drew the pairs scattered about like that. It's worth emphasizing, because this trips people up:
It's not that they're not in order -- they don't have order. If you tell a Python dictionary to sort itself, it says "nonsense!" Well, actually it says:
>>> wordlength.sort()
Traceback (most recent call last):
File "<stdin>", line 1, in ?
AttributeError: 'dict' object has no attribute 'sort'
Which means "nonsense, foolish human!"
Dictionaries are for representing relationships between pairs of things.
It's counterintuitive at first to say you can't sort a dictionary, but once you understand just what information the dictionary holds, it makes sense. If you put a bunch of little tiles like the ones above in a bag, and then asked what "order" they're in, the answer would be, "Dude, they're not in order, they're just... in the same bag."
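The dictionary itself won't sort, but you can ask Python for a sorted list built from what's in the bag. Here is a small sketch using the built-in sorted(), in Python 3 syntax; the names sorted_words and sorted_pairs are mine.

```python
wordlength = {'green': 5, 'ideas': 5, 'colorless': 9}

# sorted() on a dictionary returns a new list of its keys, in alphabetical order.
sorted_words = sorted(wordlength)

# .items() gives (key, value) pairs; sorting them orders by key first.
sorted_pairs = sorted(wordlength.items())

print(sorted_words)  # ['colorless', 'green', 'ideas']
print(sorted_pairs)  # [('colorless', 9), ('green', 5), ('ideas', 5)]
```

The results are lists, which do have order; the dictionary is left untouched. (Python 3.7 and later do remember the order in which keys were inserted, but a dictionary still has no sort method, and two dictionaries with the same pairs still compare equal whatever order they were built in.)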
Now I'll give you a peek at our final goal. Let's look at how we can represent the number of times that the letters a, b, c, d, and e occur in Moby Dick. (Just take my word on the values for the time being--you'll believe me soon enough ☺ ). We'll call the dictionary mobycount:
And here's the crude way of telling Python all that (it's crude because you have to type all that data!):
>>> mobycount = {'a': 75843, 'c': 21641, 'b': 15602, 'e': 116960, 'd': 37754}
Again, note that the order of the key : value pairs doesn't matter. If I had shuffled the pairs in the dictionary when I defined it:
>>> mobycount = { 'c': 21641, 'd': 37754, 'b': 15602, 'e': 116960, 'a': 75843 }
Then mobycount would have ended up being the exact same thing, as far as Python was concerned. No difference whatsoever.
>>> mobycount = { 'c': 21641, 'd': 37754, 'b': 15602, 'e': 116960, 'a': 75843 }
>>> mobycount2 = { 'a': 75843, 'c': 21641, 'd': 37754, 'b': 15602, 'e': 116960, }
>>> mobycount == mobycount2 # they're the same data.
True
Why, it's a dictionary. You look things up.
How many letters are in 'colorless'?
>>> wordlength['colorless']
9
Or:
How many times does 'a' occur in Moby Dick?
>>> mobycount['a']
75843
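What if you look up a key that isn't there? Square brackets raise a KeyError, while the dictionary's .get() method quietly returns None, or a fallback you supply. A quick sketch in Python 3 syntax, reusing the five-letter mobycount from above:

```python
mobycount = {'a': 75843, 'c': 21641, 'b': 15602, 'e': 116960, 'd': 37754}

print(mobycount.get('a'))     # the stored value, same as mobycount['a']
print(mobycount.get('z'))     # None: there is no 'z' entry
print(mobycount.get('z', 0))  # 0: the fallback we asked for

# Square brackets are stricter: a missing key raises KeyError.
try:
    mobycount['z']
except KeyError:
    print("no entry for 'z'")
```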
We saw in the first step that you can fill up a list with over a million letters with just a few commands. Just so, you can automatically fill up a dictionary with pairs of things, but it takes a bit more effort. A common way to do so is to go through a list, and then for each thing in the list, do something to it, and then add the thing and the something to the dictionary...
That was clear, huh? Here, look:
>>> chomsky = ['colorless', 'green', 'ideas', 'sleep', 'furiously']
>>> wordlength = {}
>>> for word in chomsky:
...     wordlength[word] = len(word)
>>> print wordlength
{'sleep': 5, 'furiously': 9, 'green': 5, 'ideas': 5, 'colorless': 9}
Here, the "thing" is each element in the list chomsky, and the "something" that you produce inside the for loop is the word's length.
When you're inside the for loop:
...     wordlength[word] = len(word)
You're just setting the value of the dictionary for the current key (which we've chosen to call word), to the length of that word. Here's a picture of the process. Imagine that you paused the for loop above right after the third word ('ideas') was added to the dictionary:
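One way to approximate that pause in code is to print the dictionary after every pass through the loop. This is only a sketch, written in Python 3 syntax (print is a function there), with the same chomsky list as before:

```python
chomsky = ['colorless', 'green', 'ideas', 'sleep', 'furiously']
wordlength = {}

for word in chomsky:
    # Store the pair: the word is the key, its length is the value.
    wordlength[word] = len(word)
    # Show how the dictionary has grown after this pass.
    print('after', word, '->', wordlength)
```

After the third pass the dictionary holds three pairs, and 'sleep' and 'furiously' are still waiting their turn in the list.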
Yes, the information we've stored is very simple, almost trivial. But figuring out how to encode relationships is really what programming is all about -- most programming consists of figuring out how to represent relationships, and then to manipulate those representations to extract new information. In Python, there are only a few simple ways to represent things, and you've already seen two of them.
Now we're approaching our original goal: to record the number of occurrences of each letter in a text. The only essential difference between our task and the previous example is that this time, inside the for loop, we're counting occurrences instead of counting word lengths.
Here's how you count the occurrences of a particular element in a list:
>>> mylist = ['a', 'b', 'c', 'a']
>>> mylist.count('a')
2
>>> mylist.count('x')
0
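Putting .count() inside a for loop that fills a dictionary gives a miniature version of the whole task. A sketch in Python 3 syntax; the word 'abracadabra' and the names letters and counts are just my choices for illustration.

```python
# A tiny "text", split into a list of letters.
letters = list('abracadabra')

counts = {}
for letter in ['a', 'b', 'c', 'x']:
    # The key is the letter; the value is how often it occurs in the list.
    counts[letter] = letters.count(letter)

print(counts)  # {'a': 5, 'b': 2, 'c': 1, 'x': 0}
```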
Congratulations, you've earned white belts in dictionary-fu and list-fu. We're ready for the grand finale. It will require:
Here's a complete Python program that does just that; see if you can follow it:
# Reading and lowercasing the contents of the file.
MobyDick = open('moby10b.txt').read()
mobytext_lower = MobyDick.lower() # make the whole novel lowercase.
# Creating a list of all the letters in the text, and another for the letters in the alphabet.
mobyletters = list(mobytext_lower)
alphabet = list('abcdefghijklmnopqrstuvwxyz')
# Creating a dictionary to hold the counts.
occurrences = {}
# Going through the alphabet, counting each letter's occurrences, and storing them in the dictionary.
for letter in alphabet:
    occurrences[letter] = mobyletters.count(letter)
# Going through the dictionary and printing out the results.
for letter in occurrences:
    print letter, occurrences[letter]
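The program above calls .count() once per letter, so it walks the million-character list 26 times. That's fine for learning, but if the wait bothers you, one alternative is collections.Counter from the standard library, which tallies every character in a single pass. This is a sketch of that swapped-in technique, not the program above: it uses Python 3 syntax and a one-sentence stand-in (sample_text, my name) so that it runs without the file.

```python
from collections import Counter
import string

# A short stand-in for the novel, so this runs without moby10b.txt.
sample_text = "Call me Ishmael."

# One pass over the text: Counter maps each character to its number of occurrences.
all_counts = Counter(sample_text.lower())

# Keep only a-z; a Counter returns 0 for characters it never saw.
occurrences = {}
for letter in string.ascii_lowercase:
    occurrences[letter] = all_counts[letter]

for letter in string.ascii_lowercase:
    print(letter, occurrences[letter])
```

To try it on the novel, you would replace sample_text with the string read from moby10b.txt.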
Here are my results -- do they match yours?
a 78554
c 22832
b 17059
e 118323
d 38558
g 21025
f 21057
i 66141
h 63232
k 8104
j 1128
m 23535
l 43174
o 70188
n 66227
q 1567
p 17499
s 64772
r 52832
u 27013
t 89003
w 22337
v 8691
y 17131
x 1135
z 640
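The raw counts are easier to read once they're ranked, and dividing each count by the total turns occurrences into relative frequencies, which is the distinction mentioned back at the start. Here's a rough sketch in Python 3 syntax using the counts listed above; the names ranked and frequency are my own.

```python
# The letter counts from the results above.
occurrences = {
    'a': 78554, 'b': 17059, 'c': 22832, 'd': 38558, 'e': 118323, 'f': 21057,
    'g': 21025, 'h': 63232, 'i': 66141, 'j': 1128, 'k': 8104, 'l': 43174,
    'm': 23535, 'n': 66227, 'o': 70188, 'p': 17499, 'q': 1567, 'r': 52832,
    's': 64772, 't': 89003, 'u': 27013, 'v': 8691, 'w': 22337, 'x': 1135,
    'y': 17131, 'z': 640,
}

# Total number of a-z letters counted.
total = sum(occurrences.values())

# Relative frequency: each letter's share of all the letters counted.
frequency = {}
for letter in occurrences:
    frequency[letter] = occurrences[letter] / total

# Rank the (letter, count) pairs from most to least common.
ranked = sorted(occurrences.items(), key=lambda pair: pair[1], reverse=True)

for letter, count in ranked:
    print(letter, count, round(frequency[letter], 4))
```

The three most common letters come out as e, t, and a.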
Please direct your lavish praise and/or hate mail here: Hacklog: Blogamundo » Blog Archive » Python for Linguistics 1: How to Count Letters with Python
