Hacklog: Blogamundo — poking holes in the language barrier since approximately 1 month from now

Types, Tokens, Typology

Written by Patrick Hall, 1 year, 7 months ago.

Here’s a chatlog… it’s kind of a mess but… oh well. Maybe it’s fun to read someone else’s chat logs?

Pat: hey want to hear an interesting little thing i came across?

Pat: i haven’t written it up yet

Jonas: sure

Pat: maybe i will make a hacklog post of it

Pat: ok so

Pat: have you ever heard of “isolating” versus “agglutinative” or “synthetic” languages?

Pat: basically the distinction has to do with suffixes and prefixes (collectively called “affixes”)

Pat: a language like portuguese has more affixes than english, and english has more than vietnamese

Pat: there are some more details in the distinction between “agglutinative” and “synthetic” actually

Pat: but the basic idea is, in isolating languages, words don’t have “sub parts”

Jonas: sounds complicated, but i’m listening heh

Pat: they’re just pure

Jonas: hmm

Pat: it’s not, really

Pat: like

Jonas: i see

Pat: in chinese and vietnamese, order is everything

Pat: actually interestingly enough, in brazilian portuguese, there’s a tendency towards becoming more isolating compared to continental portuguese

Pat: you know the joke about how a brazilian will do anything to avoid conjugating a verb?

Pat: heh

Pat: like for instance

Pat: brazilians prefer to use the 3rd person verb form whenever possible (at least in conversation)

Pat: so you get things like “a gente vai …”

Pat: instead of “nós vamos…”

Pat: you see what i mean?

Pat: like, it’s a tendency toward a reduction in the complexity of word forms

Pat: reducing the overall number of word forms

Pat: and languages sort of can be placed along a line from purely isolating to heavily affixing

Pat: at one end you have chinese, vietnamese, at the other, you have finnish, russian, latin…

Jonas: latin is filled with prefixes and suffixes, right?

Pat: yup

Pat: now, here’s the other piece of the puzzle

Pat: there’s a metric called the “type/token ratio”

Pat: which is sometimes used to measure how “difficult” a text is

Pat: for instance, if you are a teacher

Pat: you want to find texts of the appropriate complexity for your students

Pat: and so this number, the “type/token ratio” is used as a sort of crude measure of complexity

Pat: here’s how it works:

Pat: a “type” is a word in the sense we normally think of it

Pat: a “token” is an instance of a word

Pat: so take this short text:

Pat: “a wop bop a loo bop! tutti frutti!”

Jonas: 2 tokens of a type

Pat: yup.

Pat: now, in that highly literary and poetic text (thank you, http://en.wikipedia.org/wiki/Little%20Richard)

Pat: the type ‘a’ appears twice

Pat: that is to say

Pat: there are 2 tokens ‘a’

Pat: and ‘bop’ is a type, of which there are 2 tokens

Pat: pretty simple right

Pat: ‘frutti’ can be considered as a type, or as a token, of count 1 in both cases

Pat: make sense?

Jonas: yeah

Pat: ok

Pat: so

Pat: the “type/token ratio” is just that

Pat: you count all the types, and count all the tokens, and divide the former by the latter

Pat: in a text which is “difficult”

Pat: you will have a low type / token ratio

Pat: because there are lots of varied words

Pat: like that one guy on that mailing list that you were talking about…

Pat: heh

Pat: who uses too much vocab…

Pat: heh

Pat: er wait

Pat: did i get that backwards

Pat: yeah i did

Pat: he would have a high type token ratio

Pat: of all the words in a text he writes, a lot are unique

Jonas: which makes the text more difficult

Pat: yeah, at least approximately

Pat: because you have to “know” more word forms right

Pat: here’s a really simple expression of the difference in Python:

Pat:


text = u"a wop bop a loo bop! tutti frutti!"
tokens = text.split()
types = set(text.split())

Pat: oh, the punctuation is messing it up

Pat:


text = u"a wop bop a loo bop tutti frutti"
tokens = text.split()
types = set(text.split())
tokens
[u'a', u'wop', u'bop', u'a', u'loo', u'bop', u'tutti', u'frutti']
types
set([u'a', u'tutti', u'bop', u'wop', u'loo', u'frutti'])

Pat: like that.

Jonas: re.split(r'\W+', text, flags=re.U)

Pat: ah, even better. very leet :-D

Jonas: actually, let me try it

Pat: so, the type/token ratio of that text is:

Pat:


float(len(types)) / float(len(tokens))
0.75

Pat: (you have to convert those ints to floats of course, at least in python (not sure about ruby…))

Jonas: yeah

Pat: the punctuation really only mattered in this case because it was such a short text

Pat: if you have a text of sufficient length you’ll be able to get a pretty accurate type/token ratio even if you don’t bother stripping the punctuation
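
A small sketch of the tokenize-then-count idea from the chat, written for Python 3 (where `/` on two ints already returns a float, so the explicit float() conversion isn't needed). The function names `tokenize` and `type_token_ratio` are made up for illustration, and lowercasing before counting is an assumption that the chat itself doesn't make.

```python
import re


def tokenize(text):
    """Lowercase the text and return its runs of word characters, dropping punctuation."""
    return re.findall(r"\w+", text.lower())


def type_token_ratio(text):
    """Distinct word forms (types) divided by running words (tokens)."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0  # avoid dividing by zero on empty input
    return len(set(tokens)) / len(tokens)


lyric = "a wop bop a loo bop! tutti frutti!"
print(tokenize(lyric))
print(type_token_ratio(lyric))  # 6 types / 8 tokens = 0.75
```

Using `re.findall` on word characters sidesteps the "bop" versus "bop!" problem without having to clean the text by hand first.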

Pat: now, here’s my observation

Pat: you can actually use this metric

Pat: not to compare “difficulty” of texts

Pat: but rather, to compare languages

Jonas: hmmmm

Pat: along the not-affixy to highly-affixy, or chinese-like to latin-like axis

Pat: because: if a language has a lot of prefixes and suffixes and “parts” in its word forms (by the way, linguists call all those parts “morphology”)

Pat: it means that words have variant forms

Pat: so, in portugal “vamos” might show up more than in brazil, right?

Pat: well, let’s stick to portuguese vs latin, it’s easier to keep things straight

Pat: in latin, nouns had lots of endings

Pat: not just plural and masculine/feminine

Pat: so, any particular noun could show up in one of maybe 10 cases (i don’t actually know how many cases latin had…)

Pat: but in portuguese (and spanish… french… romanian… etc)

Pat: a lot of those cases merged

Pat: so, the type/token ratio of latin vs those languages “went down” over time

Pat: in theory

Pat: and i tested this theory

Pat: by going through the whole udhr corpus, and measuring this ratio

Pat: and it was really interesting to see

Pat: the pattern really seems to hold up:

Pat: vietnamese is way up at the top

Pat: and finnish, latin, etc, are at the other end

Pat: the most rewarding part is finding languages you’ve never heard of

Pat: but seeing where they fit

Pat: for instance, it turns out that abkhaz is toward the finnish end

Pat: and an african language called baoulé is at the chinese end

Pat: isn’t that cool?
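
A rough sketch of what ranking a set of texts by this ratio could look like. This is not the script used for the UDHR run described in the chat; `rank_by_ratio` is a hypothetical helper, and the two samples are invented English toy strings standing in for "few word forms" and "many word forms", not real corpus data.

```python
import re


def type_token_ratio(text):
    """Distinct word forms divided by running words, ignoring case and punctuation."""
    tokens = re.findall(r"\w+", text.lower())
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def rank_by_ratio(samples):
    """Return (name, ratio) pairs sorted from the lowest ratio to the highest."""
    scored = [(name, type_token_ratio(text)) for name, text in samples.items()]
    return sorted(scored, key=lambda pair: pair[1])


# Invented toy samples for illustration only (not UDHR text).
toy_samples = {
    "many_word_forms": "dogs saw cats and the cat sees a dog seeing",
    "few_word_forms": "the dog see the cat and the cat see the dog",
}

ranking = rank_by_ratio(toy_samples)
for name, ratio in ranking:
    print(name, round(ratio, 3))
```

With real data, each value in the dictionary would be the full text of the same document in a different language, so that the ratios are being compared on equal footing.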

Jonas: could this kind of information be useful to bam in any way?

Pat: well… it might play a role in stemming

Pat: because, we can tell from this info that it’s not really that useful to try to stem baoulé

Pat: i mean, i don’t know how many speakers that language has, so I don’t know if it will come up

Pat: but i do happen to know that there are a significant number of abkhaz speakers, so who knows

Pat: it might make sense to work on an abkhaz stemmer
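
To make the stemming point concrete, here is a toy sketch, assuming a deliberately crude suffix stripper. `SUFFIXES` and `crude_stem` are hypothetical and only handle a few English endings; a real stemmer for any language would need proper language-specific rules. The idea is just that stripping affixes collapses many types into few when the text has lots of variant forms, and changes nothing when it has none.

```python
def type_token_ratio(tokens):
    """Distinct word forms divided by running words, for an already tokenized text."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


# Hypothetical suffix list, for illustration only.
SUFFIXES = ("ing", "ed", "s")


def crude_stem(word):
    """Strip the first matching suffix, keeping at least three letters of the word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


affixed = "walk walks walked walking talk talks talked talking".split()
bare = "walk walk walk walk talk talk talk talk".split()

for label, tokens in (("affixed", affixed), ("bare", bare)):
    before = type_token_ratio(tokens)
    after = type_token_ratio([crude_stem(t) for t in tokens])
    print(label, before, after)
```

On the affixed toy text the ratio drops from 1.0 to 0.25 after stemming, while the bare text stays at 0.25 either way, which is the sense in which a stemmer has little to do for a language with very little morphology.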

Pat: but i think even from a simple linguistic point of view

Pat: it’s a really instructive example about what unicode allows

Pat: because you can run a really simple metric on a wide variety of languages

Pat: and despite what people say, it’s really not possible to do that if you have to fight a bazillion encodings

Jonas: yeah

Pat: in linguistics, people always talk about chinese as “the” isolating language

Pat: and maybe vietnamese

Pat: so linguistics students have this impression that “asia is isolating”

Pat: which is really baloney

Pat: it’s not scientific, it’s not a big enough sample, you know?

Pat: here’s baoulé, by the way: http://udhrinunicode.org/d/udhr_bci.html

Pat: you can almost see how isolating it is, you know? a perfectly good example of an isolating language from africa

Pat: short words, pretty clearly very little morphology

Pat: here’s abkhaz: http://udhrinunicode.org/d/udhr_abk.html

Pat: it has lots of big words which probably have suffixes…

Pat: it’s written with cyrillic, but interestingly enough, the writing system doesn’t matter for this metric

Pat: well… it has to have spaces. chinese and thai come out all wonky.

Jonas: heh, im gonna save this chat and read it all again later

Pat: :)
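
A quick sketch of why scripts written without spaces come out "wonky" under whitespace tokenization. The English and Chinese strings are just short illustrative samples, and `whitespace_ratio` is a made-up name. Getting a meaningful number for such text would need a word segmenter, which is not shown here.

```python
def whitespace_ratio(text):
    """Type/token ratio using plain whitespace splitting, as in the chat's snippets."""
    tokens = text.split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0


spaced = "all human beings are born free"
unspaced = "人人生而自由"  # a short Chinese phrase, written without spaces

print(len(spaced.split()))         # 6 tokens
print(len(unspaced.split()))       # 1 token: the whole phrase counts as one "word"
print(whitespace_ratio(unspaced))  # 1.0, which says nothing about the language
```

With no spaces to split on, every sentence (or every line) ends up as a single unique token, so the ratio drifts toward 1.0 regardless of how the language actually builds its words.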
