Hacklog: Blogamundo — poking holes in the language barrier since approximately 1 month from now

The Zero-width Space

Written by Patrick Hall, 1 year, 4 months ago.

Here’s something I need to look up, but I thought I’d blog it first, so you can share in my confusion (or alleviate it).

Several languages don’t use spaces to separate words: Thai, Chinese, Japanese, Khmer, Lao… (I’ve blogged about this elsewhere before).

But I’m pretty sure it’s safe to say that every language has “words.” You have to be able to identify words if you want to create lexicons, be they monolingual or bilingual.

Now, Unicode has this code point called U+200B ZERO WIDTH SPACE. It seems to me (and this is what I need to look up) that one could use said character to represent the divisions between words without actually “damaging” the orthography of the languages that don’t officially divide sequences of characters into words.

What I don’t know:

  • Whether that’s what this character is for
  • Whether people actually already do that in any such language when typing (somehow)
  • Whether this character would be appropriate for software that identifies word boundaries (spellcheckers, for instance) to insert automatically

I’m guessing that the answers are “Yes, No, Yes.”

(I just posted this before looking it up on the outside chance that some informed person might drop by and get an answer out there for the search engines… will update…)
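Just to make the idea concrete, here's a minimal sketch of the mechanics in Python. It assumes some segmenter has already handed back a list of words; the two-word Thai example is only illustrative, and the names `mark_boundaries` and `recover_words` are made up for this post. Whether this is an appropriate use of the character is exactly the open question above.

```python
ZWSP = "\u200b"  # U+200B ZERO WIDTH SPACE

def mark_boundaries(words):
    """Join already-segmented words with an invisible ZERO WIDTH SPACE."""
    return ZWSP.join(words)

def recover_words(marked):
    """Split the marked text back into words at each ZERO WIDTH SPACE."""
    return marked.split(ZWSP)

# Hypothetical segmenter output (illustrative only, not from a real tool).
words = ["ภาษา", "ไทย"]
marked = mark_boundaries(words)

# The marked text has one extra code point per boundary and
# gives back the original word list when split on U+200B.
print(len("".join(words)), len(marked))
print(recover_words(marked) == words)
```

Stripping the markers with `marked.replace(ZWSP, "")` gives back the original unsegmented string, so the orthography itself isn't altered.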

Types, Tokens, Typology

Written by Patrick Hall, 1 year, 4 months ago.

Here’s a chatlog… it’s kind of a mess but… oh well. Maybe it’s fun to read someone else’s chat logs?

Pat: hey want to hear an interesting little thing i came across?

Pat: i haven’t written it up yet

Jonas: sure

Pat: maybe i will make a hacklog post of it

Pat: ok so

Pat: have you ever heard of “isolating” versus “agglutinative” or “synthetic” languages?

Pat: basically the distinction has to do with suffixes and prefixes (collectively called “affixes”)

Pat: a language like portuguese has more affixes than english, and english has more than vietnamese

Pat: there are some more details in the distinction between “agglutinative” and “synthetic” actually

Pat: but the basic idea is, in isolating languages, words don’t have “sub parts”

Jonas: sounds complicated, but i’m listening heh

Pat: they’re just pure

Jonas: hmm

Pat: it’s not, really

Pat: like

Jonas: i see

Pat: in chinese and vietnamese, order is everything

Pat: actually interestingly enough, in brazilian portuguese, there’s a tendency towards becoming an isolating language compared to continental portuguese

Pat: you know the joke about how a brazilian will do anything to avoid conjugating a verb?

Pat: heh

Pat: like for instance

Pat: brazilians prefer to use the 3rd person verb form whenever possible (at least in conversation)

Pat: so you get things like “a gente vai …”

Pat: instead of “nós vamos…”

Pat: you see what i mean?

Pat: like, it’s a tendency toward a reduction in the complexity of word forms

Pat: reducing the overall number of word forms

Pat: and languages sort of can be placed along a line from purely isolating to heavily affixing

Pat: at one end you have chinese, vietnamese, at the other, you have finnish, russian, latin…

Jonas: latin is filled with prefixes and suffixes, right?

Pat: yup

Pat: now, here’s the other piece of the puzzle

Pat: there’s a metric called the “type/token ratio”

Pat: which is sometimes used to measure how “difficult” a text is

Pat: for instance, if you are a teacher

Pat: you want to find texts of the appropriate complexity for your students

Pat: and so this number, the “type/token ratio” is used as a sort of crude measure of complexity

Pat: here’s how it works:

Pat: a “type” is a word in the sense we normally think of it

Pat: a “token” is an instance of a word

Pat: so take this short text:

Pat: “a wop bop a loo bop! tutti frutti!”

Jonas: 2 tokens of a type

Pat: yup.

Pat: now, in that highly literary and poetic text (thank you, http://en.wikipedia.org/wiki/Little%20Richard)

Pat: the type ‘a’ appears twice

Pat: that is to say

Pat: there are 2 tokens ‘a’

Pat: and ‘bop’ is a type, of which there are 2 tokens

Pat: pretty simple right

Pat: ‘frutti’ can be considered as a type, or as a token, of count 1 in both cases

Pat: make sense?

Jonas: yeah

Pat: ok

Pat: so

Pat: the “type/token ratio” is just that

Pat: you count all the types, and count all the tokens, and divide the former by the latter

Pat: in a text which is “difficult”

Pat: you will have a low type / token ratio

Pat: because there are lots of varied words

Pat: like that one guy on that mailing list that you were talking about…

Pat: heh

Pat: who uses too much vocab…

Pat: heh

Pat: er wait

Pat: did i get that backwards

Pat: yeah i did

Pat: he would have a high type token ratio

Pat: of all the words in a text he writes, a lot are unique

Jonas: which makes the text more difficult

Pat: yeah, at least approximately

Pat: because you have to “know” more word forms right

Pat: here’s a really simple expression of the difference in Python:

Pat:


>>> text = u"a wop bop a loo bop! tutti frutti!"
>>> tokens = text.split()
>>> types = set(text.split())

Pat: oh, the punctuation is messing it up

Pat:


>>> text = u"a wop bop a loo bop tutti frutti"
>>> tokens = text.split()
>>> types = set(text.split())
>>> tokens
[u'a', u'wop', u'bop', u'a', u'loo', u'bop', u'tutti', u'frutti']
>>> types
set([u'a', u'tutti', u'bop', u'wop', u'loo', u'frutti'])

Pat: like that.

Jonas: re.split(r'\W+', text, flags=re.U)

Pat: ah, even better. very leet :-D

Jonas: actually, let me try it

Pat: so, the type/token ratio of that text is:

Pat:


>>> float(len(types)) / float(len(tokens))
0.75

Pat: (you have to convert those ints to floats of course, at least in python (not sure about ruby…))

Jonas: yeah

Pat: the punctuation really only mattered in this case because it was such a short text

Pat: if you have a text of sufficient length you’ll be able to get a pretty accurate type/token ratio even if you ignore punctuation

Pat: now, here’s my observation

Pat: you can actually use this metric

Pat: not to compare “difficulty” of texts

Pat: but rather, to compare languages

Jonas: hmmmm

Pat: along the not-affixy to highly-affixy, or chinese-like to latin-like axis

Pat: because: if a language has a lot of suffixes and affixes and “parts” in its word forms (by the way, linguists call all those parts “morphology”)

Pat: it means that words have variant forms

Pat: so, in portugal “vamos” might show up more than in brazil, right?

Pat: well, let’s stick to portuguese vs latin, it’s easier to keep things straight

Pat: in latin, nouns had lots of endings

Pat: not just plural and masculine/feminine

Pat: so, any particular noun could show up in one of maybe 10 cases (i don’t actually know how many cases latin had…)

Pat: but in portuguese (and spanish… french… romanian… etc)

Pat: a lot of those cases merged

Pat: so, the type/token ratio of latin vs those languages “went down” over time

Pat: in theory

Pat: and i tested this theory

Pat: by going through the whole udhr corpus, and measuring this ratio

Pat: and it was really interesting to see

Pat: the pattern really seems to hold up:

Pat: vietnamese is way up at the top

Pat: and finnish, latin, etc, are at the other end

Pat: the most rewarding part is finding languages you’ve never heard of

Pat: but seeing where they fit

Pat: for instance, it turns out that abkhaz is toward the finnish end

Pat: and an african language called baoulé is at the chinese end

Pat: isn’t that cool?

Jonas: could this kind of information be useful to bam in any way?

Pat: well… it might play a role in stemming

Pat: because, we can tell from this info that it’s not really that useful to try to stem baoulé

Pat: i mean, i don’t know how many speakers that language has, so I don’t know if it will come up

Pat: but i do happen to know that there are a significant number of abkhaz speakers, so who knows

Pat: it might make sense to work on an abkhaz stemmer

Pat: but i think even from a simple linguistic point of view

Pat: it’s a really instructive example about what unicode allows

Pat: because you can run a really simple metric on a wide variety of languages

Pat: and despite what people say, it’s really not possible to do that if you have to fight a bazillion encodings

Jonas: yeah

Pat: in linguistics, people always talk about chinese as “the” isolating language

Pat: and maybe vietnamese

Pat: so linguistics students have this impression that “asia is isolating”

Pat: which is really baloney

Pat: it’s not scientific, it’s not a big enough sample, you know?

Pat: here’s baoulé, by the way: http://udhrinunicode.org/d/udhr_bci.html

Pat: you can almost see how isolating it is, you know? a perfectly good example of an isolating language from africa

Pat: short words, pretty clearly very little morphology

Pat: here’s abkhaz: http://udhrinunicode.org/d/udhr_abk.html

Pat: it has lots of big words which probably have suffixes…

Pat: it’s written with cyrillic, but interestingly enough, the writing system doesn’t matter for this metric

Pat: well… it has to have spaces. chinese and thai come out all wonky.

Jonas: heh, im gonna save this chat and read it all again later

Pat: :)
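For anyone who wants to try this at home, here's a minimal Python 3 sketch of the type/token ratio from the chat. It folds in the idea of splitting on non-word characters so punctuation doesn't get in the way. The function names are my own, lowercasing is an extra assumption that the chat version didn't make, and `\w+` is a crude tokenizer that will stumble on some scripts (and, as noted, on anything written without spaces).

```python
import re

def tokenize(text):
    """Lowercase the text and pull out runs of word characters."""
    return re.findall(r"\w+", text.lower())

def type_token_ratio(text):
    """Distinct word forms (types) divided by word instances (tokens)."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

def rank_by_ratio(texts):
    """Sort the names in a {name: text} mapping from lowest to highest ratio."""
    return sorted(texts, key=lambda name: type_token_ratio(texts[name]))

print(type_token_ratio("a wop bop a loo bop! tutti frutti!"))  # 0.75
```

To compare languages, `rank_by_ratio` would be fed one text per language, for example the UDHR translations. The ratio depends on text length, so it only makes sense to compare texts of roughly the same size.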

Introduction to Information Retrieval book online

Written by Patrick Hall, 1 year, 5 months ago.

Via my homey Kenji’s del.icio.us links (valeu hein!)

Cool online book tip:

Fans of Chris Manning & Hinrich Schütze’s 1999 Foundations of Statistical Natural Language Processing (like me!) will be pleased to discover that what looks like a companion book is in the works:

Introduction to Information Retrieval

And even cooler, a draft version of the text is available as pdfs at the link.

The layout of the book looks quite the same as FSNLP (which I’ve always thought was nice and readable… what’s the font in the headings, anyway?).

To judge by the table of contents, this text will really fill a gap in the literature; I imagine it will end up being heavily used. Unlike other (nonetheless excellent) standard texts,* this looks to be tuned to IR in the Age of the Intarwebs:

  • Chapter 10: XML retrieval
  • Chapter 19: Web search basics
  • Chapter 20: Web crawling and indexes
  • Chapter 21: Link analysis

You can also get a good idea of the content by taking a gander at this course syllabus from last year. Nice slides, too. (Extreme compression sounds pretty hard core, eh?)

If the authors take the same approach as they did with their first book, the PDFs will eventually be taken down once the printed version is available, so lickety split. Scratch that, “the book will remain online after its publication by Cambridge UP in 2007.” Sweet!

* Go Bears. :P

Goblin

Written by Patrick Hall, 1 year, 5 months ago.

For some reason I popped перевод, which is Russian for “translate” or “translation” or something (I’m honestly not sure) into YouTube’s search engine, and I came across something pretty surprising: a guy who’s apparently a well-known film dubber. Here’s an interview from Russian MTV (in Russian) where you can get an idea of the sort of work he does, even if you don’t know Russian: YouTube - Гоблин на питерском MTV.

My rusty Russian was enough to help me figure out that the guy’s name was Гоблин (Goblin), and, via the Russian Wikipedia, that his real name is Дмитрий Юрьевич Пучков (Dmitry Yuryevich Puchkov). And lo and behold, there’s a full-length bio in the English Wikipedia: Dmitry Puchkov.

Puchkov is pretty unique as interpreters go. For one thing, he’s famous. Like, has-his-own-tshirts famous. And for another, as you can tell from the video linked above, his interpretation style is not traditional: he translates the parts of all the characters by himself, be they male, female, or orc.

New Public CLDR Mailing List

Written by Patrick Hall, 1 year, 5 months ago.

I’ve blogged before about the Common Locale Data Repository, or CLDR, which is hosted at Unicode.org/cldr/. It’s a project to collect locale data for as many languages as possible. (Here’s an example in Welsh, here’s one in French, here’s English, etc.)

I got an announcement today about a new CLDR mailing list, so I’m forwarding it along for all my things-linguistical-and-computational-loving wumungs friends:

The Unicode Consortium is starting a new public mailing list for internationalization issues related to locales, including in particular the Unicode Common Locale Data Repository (CLDR), the Locale Data Markup Language (LDML), and international identifiers (language, script, region, currency, and timezones).

Those interested in these topics can subscribe by visiting our revised mail list page:

http://www.unicode.org/consortium/distlist.html#cldr_list

Oh yeah, good times.

A list for you to think about

Written by Patrick Hall, 1 year, 5 months ago.

This is random:

I was thinking about stemming, and I got to thinking something like this:

Because affixes (prefixes and suffixes, in the case of English) have a grammatical function, and because English grammar happens to insist on marking several grammatical categories, it stands to reason that those affixes are probably more common, as a rule, than the average lexical “root.”

So here’s what I did:

I took my poor, battered (digital) copy of Moby Dick, which I have subjected to all manner of computational horrors, and read it into memory (yeah, RAM. burn, CPU, burn). Then, I chopped it up into a set of ngrams of lengths one to five. Count, sort, take the most common, and here’s what I get:


66 es
66 on␣
67 as
68 w
69 ic
69 ␣b
72 is
72 ␣in
74 ig
74 le
75 ha
75 te
78 ve
80 ␣p
83 ed
83 nt
84 ␣to␣
85 r␣
85 to␣
87 ␣to
89 ␣f
89 ␣s
90 tio
90 tion
90 to
91 it
91 ␣h
92 of␣
92 ␣of␣
93 at
94 ␣r
96 ,
96 ,␣
96 or
96 o␣
96 ␣of
97 of
98 f␣
99 ri
99 ␣i
101 v
103 y␣
106 and␣
106 en
106 ion
106 ␣and
106 ␣and␣
110 b
110 nd␣
111 and
111 t␣
112 io
114 l␣
119 the␣
119 ␣the␣
125 al
125 he␣
129 ␣an
129 ␣the
130 re
137 in
142 nd
147 er
147 the
148 ␣th
150 n␣
154 p
162 ti
167 y
169 g
169 ␣o
171 d␣
176 an
177 he
188 m
189 s␣
192 on
193 u
194 th
228 f
235 ␣a
251 ␣t
296 c
321 d
335 e␣
402 l
454 h
467 s
617 r
677 a
710 n
712 i
717 o
805 t
1059 e
1196 ␣␣␣␣␣
1287 ␣␣␣␣
1379 ␣␣␣
1471 ␣␣
3243 ␣

Random observations:

  • Obviously, ngrams of length one aren’t terribly easy to interpret: «s» actually is a suffix, but «e» isn’t. (Is it?)
  • ETAOIN SHRDLU is in there, but not contiguously. «u» wanders off into obscurity behind some bigrams.
  • Spaces complicate everything.
  • Several words show up before affixes: «on», «he», «an». But «an», at least, has the additional wrinkle of being not just a word on its own, but a substring of «and», which is also common.

Nonetheless, it seems beyond doubt that one can find a bunch of affixes this way, possibly most of them.

Anyway. Please let me know if I’m insane.
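For the curious, the chop-count-sort procedure can be sketched in a few lines of Python. This is a reconstruction under my own assumptions, not the exact script that produced the list above. The function names and the `limit` parameter are made up, and the sample string stands in for the full text of the book.

```python
from collections import Counter

def char_ngram_counts(text, min_n=1, max_n=5):
    """Count every character ngram whose length is between min_n and max_n."""
    counts = Counter()
    for n in range(min_n, max_n + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts

def most_common_ngrams(text, limit=100):
    """Return (count, ngram) pairs, least common first, with spaces shown as ␣."""
    top = char_ngram_counts(text).most_common(limit)
    return [(count, gram.replace(" ", "\u2423")) for gram, count in reversed(top)]

# Stand-in sample; the real input would be the whole book read from a file.
sample = "the whale and the sea"
for count, gram in most_common_ngrams(sample, limit=5):
    print(count, gram)
```

Holding every ngram in one `Counter` is wasteful but fine for a single novel. Showing spaces as «␣» is what makes word-initial and word-final ngrams, the likely prefixes and suffixes, easy to spot.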

And here’s another similar list for Portuguese:


3558 it
3568 ro
3584 id
3597 aç
3610 di
3644 ção␣
3670 ei
3730 io
3753 qu
3786 ca
3809 q
4001 ␣n
4121 an
4250 as␣
4267 ção
4268 çã
4304 f
4372 ia
4379 .␣
4418 ti
4458 me
4535 r␣
4569 m␣
4570 pr
4574 em
4636 tr
4696 on
4721 se
4759 ␣co
4834 is
4913 ent
4940 in
4969 do␣
5098 ar
5262 st
5360 ta
5368 ci
5405 al
5529 o␣d
5563 ão␣
5626 ri
5994 b
6081 ␣o
6125 or
6125 to
6136 os␣
6184 ␣s
6377 ç
6406 g
6466 ão
6473 ã
6574 co
6819 as
6876 ad
6933 te
7094 ␣c
7168 er
7526 da
7552 v
7617 ␣de␣
8108 os
8383 en
8571 ␣p
8693 nt
8742 re
8871 do
9100 ,␣
9103 ␣e
9234 ,
9310 de␣
9408 ra
9518 es
9862 ␣de
10237 ␣a
13947 de
15731 s␣
15844 p
16192 l
16667 a␣
17216 ␣d
17954 u
18973 e␣
20100 m
21288 c
24172 o␣
24634 .....
24842 ....
25058 ...
25276 ..
29640 n
31451 t
31941 .
35826 d
42138 r
43966 s
45972 i
60928 o
64577 a
69885 e
113666 ␣

(Sure enough, there’s «ção».)

P.s.: Oh look, I’m not insane. Someone named Harald Hammarström has a paper which seems to be quite similar to this idea: Poor Man’s Stemming: Unsupervised Recognition of Same-Stem Words (.pdf preprint). Reading the abstract while it’s printing out, it’s clear that Mr. Hammarström has thought this out a lot more than I have.

Okay, sometimes it’s funny.

Written by Patrick Hall, 1 year, 5 months ago.

I try not to get on the “make fun of Machine Translation” bandwagon—for one thing, I think MT is cool. I also think it’s already quite useful for certain tasks. And I also think that in most cases the sort of media coverage that MT gets is either mostly speculation, or just too difficult to evaluate; after all, evaluating the quality of MT is an academic project unto itself.

And then there’s my pet peeve with regard to popular evaluation of MT: “round trip translation.” It works like this: First, put some English into an English » French MT system. Then, put that output back into a French » English system. “Hahah! It doesn’t look anything like the original!”

Except, why should it? We never evaluate how good a human translator is by asking them to do round-trip translation, so what’s the point of evaluating mechanical translation in that way?

But enough hedging, this is the part where I link to the really funny, bad MT quote. ☺

Great Google Homepage translation · Geekness - with fresh and clean air

Ascii-only Languages

Written by Patrick Hall, 1 year, 5 months ago.

This is a perverse topic given this blog’s positively fanatical obsession with Unicode, but it’s fun nonetheless:

What languages have orthographies that can be written entirely in ASCII?

After a bit of messy experimenting, I have discovered that the answer seems to be “quite a few,” at least dozens, probably.

Anyway, here’s my grubby .txt file if you’d like to take a look:

ASCII-only languages

I generated the list like this: I went through all the documents in the Universal Declaration of Human Rights corpus, and for each one grabbed the set of all the letters in the document, assuming that was at least a reasonable approximation of the alphabet of that language. Then I checked to see if it was a subset of ASCII. That is all.
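In Python the check comes out roughly like this. It's a sketch with made-up function names, and the two short phrases stand in for full UDHR documents. One caveat: `str.isalpha()` skips combining marks, so an orthography that writes its diacritics as separate combining characters could slip through as "ASCII-only".

```python
def letters_in(text):
    """Collect the set of distinct letters that appear in a text."""
    return {ch for ch in text if ch.isalpha()}

def is_ascii_only(text):
    """True if every letter in the text falls inside the ASCII range."""
    return all(ord(ch) < 128 for ch in letters_in(text))

# Stand-in samples; the real input would be one UDHR document per language.
samples = {
    "sample_a": "a gente vai",
    "sample_b": "nós vamos",
}

for name, text in samples.items():
    print(name, is_ascii_only(text))
```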

(And I was disappointed to discover that Marshallese, although it does appear, no longer seems to use the ampersand sign as a vowel. Alas.)

I’d be interested in hearing about errors, being pelted with angry orthographical rants, etc, etc.