Types, Tokens, Typology
Here’s a chatlog… it’s kind of a mess but… oh well. Maybe it’s fun to read someone else’s chat logs?
Pat: hey want to hear an interesting little thing i came across?
Pat: i haven’t written it up yet
Jonas: sure
Pat: maybe i will make a hacklog post of it
Pat: ok so
Pat: have you ever heard of “isolating” versus “agglutinative” or “synthetic” languages?
Pat: basically the distinction has to do with suffixes and prefixes (collectively called “affixes”)
Pat: a language like portuguese has more affixes than english, and english has more than vietnamese
Pat: there are some more details in the distinction between “agglutinative” and “synthetic” actually
Pat: but the basic idea is, in isolating languages, words don’t have “sub parts”
Jonas: sounds complicated, but i’m listening heh
Pat: they’re just pure
Jonas: hmm
Pat: it’s not, really
Pat: like
Jonas: i see
Pat: in chinese and vietnamese, order is everything
Pat: actually interestingly enough, in brazilian portuguese, there’s a tendency towards becoming an isolating language, compared to continental portuguese
Pat: you know the joke about how a brazilian will do anything to avoid conjugating a verb?
Pat: heh
Pat: like for instance
Pat: brazilians prefer to use the 3rd person verb form whenever possible (at least in conversation)
Pat: so you get things like “a gente vai …”
Pat: instead of “nós vamos…”
Pat: you see what i mean?
Pat: like, it’s a tendency toward a reduction in the complexity of word forms
Pat: reducing the overall number of word forms
Pat: and languages sort of can be placed along a line from purely isolating to heavily affixing
Pat: at one end you have chinese, vietnamese, at the other, you have finnish, russian, latin…
Jonas: latin is filled with prefixes and suffixes, right?
Pat: yup
Pat: now, here’s the other piece of the puzzle
Pat: there’s a metric called the “type/token ratio”
Pat: which is sometimes used to measure how “difficult” a text is
Pat: for instance, if you are a teacher
Pat: you want to find texts of the appropriate complexity for your students
Pat: and so this number, the “type/token ratio” is used as a sort of crude measure of complexity
Pat: here’s how it works:
Pat: a “type” is a word in the sense we normally think of it
Pat: a “token” is an instance of a word
Pat: so take this short text:
Pat: “a wop bop a loo bop! tutti frutti!”
Jonas: 2 tokens of a type
Pat: yup.
Pat: now, in that highly literary and poetic text (thank you, http://en.wikipedia.org/wiki/Little%20Richard)
Pat: the type ‘a’ appears twice
Pat: that is to say
Pat: there are 2 tokens ‘a’
Pat: and ‘bop’ is a type, of which there are 2 tokens
Pat: pretty simple right
Pat: ‘frutti’ can be considered as a type, or as a token, of count 1 in both cases
Pat: make sense?
Jonas: yeah
Pat: ok
Pat: so
Pat: the “type/token ratio” is just that
Pat: you count all the types, and count all the tokens, and divide the former by the latter
Pat: in a text which is “difficult”
Pat: you will have a low type / token ratio
Pat: because there are lots of varied words
Pat: like that one guy on that mailing list that you were talking about…
Pat: heh
Pat: who uses too much vocab…
Pat: heh
Pat: er wait
Pat: did i get that backwards
Pat: yeah i did
Pat: he would have a high type token ratio
Pat: of all the words in a text he writes, a lot are unique
Jonas: which makes the text more difficult
Pat: yeah, at least approximately
Pat: because you have to “know” more word forms right
Pat: here’s a really simple expression of the difference in Python:
Pat:
text = u"a wop bop a loo bop! tutti frutti!"
tokens = text.split()
types = set(text.split())
Pat: oh, the punctuation is messing it up
Pat:
text = u"a wop bop a loo bop tutti frutti"
tokens = text.split()
types = set(text.split())
tokens
[u'a', u'wop', u'bop', u'a', u'loo', u'bop', u'tutti', u'frutti']
types
set([u'a', u'tutti', u'bop', u'wop', u'loo', u'frutti'])
Pat: like that.
Jonas: re.split(r'\W+', text, flags=re.U)
Pat: ah, even better. very leet :-D
Jonas: actually, let me try it
Pat: so, the type/token ratio of that text is:
Pat:
float(len(types)) / float(len(tokens))
0.75
Pat: (you have to convert those ints to floats of course, at least in python (not sure about ruby…))
Jonas: yeah
Pat: the punctuation really only mattered in this case because it was such a short text
Pat: if you have a text of sufficient length you’ll be able to get a pretty accurate type/token ratio even if you leave the punctuation in
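Stepping outside the chat for a second: the snippets above can be folded into one small function. This is only a sketch. The name type_token_ratio is made up for illustration, and lowercasing the text before counting is an assumption that the chat never discusses. It uses \w+ to pull out the words, which is one way of getting what Jonas's regex is after.

```python
import re

def type_token_ratio(text):
    """Return (types, tokens, ratio) for a text whose words are separated by spaces."""
    # \w+ matches runs of letters, digits and underscores, so punctuation drops out.
    # Lowercasing is an assumption made here so that "Bop" and "bop" count as one type.
    tokens = re.findall(r"\w+", text.lower())
    types = set(tokens)
    if not tokens:
        # An empty text has no meaningful ratio; 0.0 is an arbitrary choice.
        return types, tokens, 0.0
    return types, tokens, len(types) / len(tokens)

raw = "a wop bop a loo bop! tutti frutti!"

# Naive whitespace split: "bop" and "bop!" end up as two different types.
naive_tokens = raw.split()
naive_ratio = len(set(naive_tokens)) / len(naive_tokens)

types, tokens, ratio = type_token_ratio(raw)
print(naive_ratio)  # 0.875
print(ratio)        # 0.75
```

The gap between the two numbers is the punctuation effect Pat ran into: on an eight-word text a single stray "bop!" is enough to move the ratio.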
Pat: now, here’s my observation
Pat: you can actually use this metric
Pat: not to compare “difficulty” of texts
Pat: but rather, to compare languages
Jonas: hmmmm
Pat: along the not-affixy to highly-affixy, or chinese-like to latin-like axis
Pat: because: if a language has a lot of prefixes and suffixes and “parts” in its word forms (by the way, linguists call all those parts “morphology”)
Pat: it means that words have variant forms
Pat: so, in portugal “vamos” might show up more than in brazil, right?
Pat: well, let’s stick to portuguese vs latin, it’s easier to keep things straight
Pat: in latin, nouns had lots of endings
Pat: not just plural and masculine/feminine
Pat: so, any particular noun could show up in one of maybe 10 cases (i don’t actually know how many cases latin had…)
Pat: but in portuguese (and spanish… french… romanian… etc)
Pat: a lot of those cases merged
Pat: so, the type/token ratio of latin vs those languages “went down” over time
Pat: in theory
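A toy illustration of why merged cases should pull the ratio down, again from outside the chat. The word lists below are typed in by hand for one noun ("rose") and are not taken from any corpus; the Latin forms are written without vowel length marks, and the vocative is left out.

```python
# Inflected forms of "rose", entered by hand for illustration only.
latin_forms = [
    "rosa", "rosae", "rosae", "rosam", "rosa",       # singular: nom, gen, dat, acc, abl
    "rosae", "rosarum", "rosis", "rosas", "rosis",   # plural:   nom, gen, dat, acc, abl
]
portuguese_forms = ["rosa", "rosas"]                 # singular, plural

# Each distinct spelling is one type a running text could contain.
latin_types = set(latin_forms)
portuguese_types = set(portuguese_forms)

print(len(latin_types), len(portuguese_types))  # 6 2
```

So a single noun can contribute up to six types to a Latin text but at most two to a Portuguese one, and the same thing happens across the whole vocabulary.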
Pat: and i tested this theory
Pat: by going through the whole udhr corpus, and measuring this ratio
Pat: and it was really interesting to see
Pat: the pattern really seems to hold up:
Pat: vietnamese is way up at the top
Pat: and finnish, latin, etc, are at the other end
Pat: the most rewarding part is finding languages you’ve never heard of
Pat: but seeing where they fit
Pat: for instance, it turns out that abkhaz is toward the finnish end
Pat: and an african language called baoulé is at the chinese end
Pat: isn’t that cool?
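For the curious, the comparison Pat describes might look roughly like the sketch below. It is not the script that was actually run. The samples dictionary and its two toy strings are hypothetical stand-ins; in a real run each value would be the full text of one UDHR translation. One caveat worth stating: the ratio tends to fall as a text gets longer, so the comparison is only fair between texts of similar length and content, which is what a set of translations of the same document gives you.

```python
import re

def ttr(text):
    """Type/token ratio of a text whose words are separated by spaces."""
    tokens = re.findall(r"\w+", text.lower())
    return len(set(tokens)) / len(tokens) if tokens else 0.0

# Hypothetical stand-ins, NOT real corpus data.
samples = {
    "toy_isolating": "i go you go we go they go",  # the verb never changes form
    "toy_affixing": "vou vais vamos vão",          # every verb form is different
}

# Lowest ratio first: fewer distinct word forms per running word.
ranking = sorted(samples, key=lambda name: ttr(samples[name]))
for name in ranking:
    print(name, ttr(samples[name]))
```

With the toy strings, the sample whose verb never changes comes out with the lower ratio, which is the same direction as the Vietnamese versus Finnish pattern in the chat.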
Jonas: could this kind of information be useful to bam in any way?
Pat: well… it might play a role in stemming
Pat: because, we can tell from this info that it’s not really that useful to try to stem baoulé
Pat: i mean, i don’t know how many speakers that language has, so I don’t know if it will come up
Pat: but i do happen to know that there are a significant number of abkhaz speakers, so who knows
Pat: it might make sense to work on an abkhaz stemmer
Pat: but i think even from a simple linguistic point of view
Pat: it’s a really instructive example about what unicode allows
Pat: because you can run a really simple metric on a wide variety of languages
Pat: and despite what people say, it’s really not possible to do that if you have to fight a bazillion encodings
Jonas: yeah
Pat: in linguistics, people always talk about chinese as “the” isolating language
Pat: and maybe vietnamese
Pat: so linguistics students have this impression that “asia is isolating”
Pat: which is really baloney
Pat: it’s not scientific, it’s not a big enough sample, you know?
Pat: here’s baoulé, by the way: http://udhrinunicode.org/d/udhr_bci.html
Pat: you can almost see how isolating it is, you know? a perfectly good example of an isolating language from africa
Pat: short words, pretty clearly very little morphology
Pat: here’s abkhaz: http://udhrinunicode.org/d/udhr_abk.html
Pat: it has lots of big words which probably have suffixes…
Pat: it’s written with cyrillic, but interestingly enough, the writing system doesn’t matter for this metric
Pat: well… it has to have spaces. chinese and thai come out all wonky.
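A quick look at why the spaces matter, outside the chat once more. The two strings are just short illustrative phrases, and plain str.split() stands in for whatever tokenizer is used. When nothing separates the words, the whole phrase comes back as one token.

```python
text_en = "all human beings are born free and equal"
text_zh = "人人生而自由"  # a short Chinese phrase, written without spaces between words

tokens_en = text_en.split()
tokens_zh = text_zh.split()

print(len(tokens_en))  # 8
print(len(tokens_zh))  # 1
```

On a full text written this way, each "token" is a whole clause or sentence and almost every one is unique, so the ratio drifts toward 1 and says nothing about morphology. A language like that needs a word segmenter before the metric means anything.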
Jonas: heh, im gonna save this chat and read it all again later
Pat: :)