A list for you to think about
This is random:
I was thinking about stemming, and I got to thinking something like this:
Because affixes (prefixes and suffixes, in the case of English) have a grammatical function, and because English grammar happens to insist on marking several grammatical categories, it stands to reason that those affixes are probably more common, as a rule, than the average lexical “root.”
So here’s what I did:
I took my poor, battered (digital) copy of Moby Dick, which I have subjected to all manner of computational horrors, and read it into memory (yeah, RAM. burn, CPU, burn). Then I chopped it up into a set of ngrams of lengths one to five. Count, sort, take the most common, and here’s what I get (count first, then the ngram, with «␣» standing for a space):
66 es
66 on␣
67 as
68 w
69 ic
69 ␣b
72 is
72 ␣in
74 ig
74 le
75 ha
75 te
78 ve
80 ␣p
83 ed
83 nt
84 ␣to␣
85 r␣
85 to␣
87 ␣to
89 ␣f
89 ␣s
90 tio
90 tion
90 to
91 it
91 ␣h
92 of␣
92 ␣of␣
93 at
94 ␣r
96 ,
96 ,␣
96 or
96 o␣
96 ␣of
97 of
98 f␣
99 ri
99 ␣i
101 v
103 y␣
106 and␣
106 en
106 ion
106 ␣and
106 ␣and␣
110 b
110 nd␣
111 and
111 t␣
112 io
114 l␣
119 the␣
119 ␣the␣
125 al
125 he␣
129 ␣an
129 ␣the
130 re
137 in
142 nd
147 er
147 the
148 ␣th
150 n␣
154 p
162 ti
167 y
169 g
169 ␣o
171 d␣
176 an
177 he
188 m
189 s␣
192 on
193 u
194 th
228 f
235 ␣a
251 ␣t
296 c
321 d
335 e␣
402 l
454 h
467 s
617 r
677 a
710 n
712 i
717 o
805 t
1059 e
1196 ␣␣␣␣␣
1287 ␣␣␣␣
1379 ␣␣␣
1471 ␣␣
3243 ␣
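For the curious, the chop-count-sort step could look something like the sketch below. This is a guess at the procedure, not the script that produced the list above: the names `count_ngrams` and `show_top` are made up for illustration, and the sample string just stands in for the full text.

```python
from collections import Counter

def count_ngrams(text, min_n=1, max_n=5):
    """Count every character ngram of length min_n..max_n in text."""
    counts = Counter()
    for n in range(min_n, max_n + 1):
        # Slide a window of width n across the text, one character at a time.
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts

def show_top(counts, k=100):
    """Print the k most common ngrams, least common first, spaces shown as ␣."""
    for gram, freq in reversed(counts.most_common(k)):
        print(freq, gram.replace(" ", "␣"))

# Hypothetical stand-in for the novel; swap in the real text to try it.
sample = "the whale and the sea and the ship"
counts = count_ngrams(sample)
show_top(counts, k=10)
```

Note that nothing here treats spaces or punctuation specially, which is exactly why things like «␣the␣» and «,␣» turn up in the output.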
Random observations:
- Obviously, ngrams of length one aren’t terribly easy to interpret: «s» actually is a suffix, but «e» isn’t. (Is it?)
- ETAOIN SHRDLU is in there, but not contiguously. «u» wanders off into obscurity behind some bigrams.
- Spaces complicate everything.
- Several words show up before affixes: «on», «he», «an». But «an», at least, has the additional wrinkle of being not just a word on its own, but a substring of «and», which is also common.
Nonetheless, it seems beyond doubt that one can find a bunch of affixes this way, possibly most of them.
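One way to tame the space problem, and to keep whole words like «an» from posing as affixes, would be to count only letter sequences at the edges of words. The following is a rough sketch under that assumption, not something I ran for the lists in this post; `boundary_ngrams` and its `max_n` parameter are invented names, and the punctuation stripping is deliberately crude.

```python
from collections import Counter

def boundary_ngrams(text, max_n=4):
    """Count word-initial and word-final letter sequences of length 1..max_n."""
    prefixes, suffixes = Counter(), Counter()
    for word in text.lower().split():
        word = word.strip(".,;:!?\"'()")
        # Only take pieces strictly shorter than the word, so that whole
        # words such as "an" or "he" are not counted as their own affix.
        for n in range(1, min(max_n, len(word) - 1) + 1):
            prefixes[word[:n]] += 1
            suffixes[word[-n:]] += 1
    return prefixes, suffixes

# Hypothetical toy input, just to show the shape of the result.
prefixes, suffixes = boundary_ngrams("walking talked walked talking he an and")
print(suffixes.most_common(5))
```

This still can't tell a real suffix from a letter sequence that merely ends a lot of words, so it narrows the candidates rather than settling anything.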
Anyway. Please let me know if I’m insane.
And here’s another similar list for Portuguese:
3558 it
3568 ro
3584 id
3597 aç
3610 di
3644 ção␣
3670 ei
3730 io
3753 qu
3786 ca
3809 q
4001 ␣n
4121 an
4250 as␣
4267 ção
4268 çã
4304 f
4372 ia
4379 .␣
4418 ti
4458 me
4535 r␣
4569 m␣
4570 pr
4574 em
4636 tr
4696 on
4721 se
4759 ␣co
4834 is
4913 ent
4940 in
4969 do␣
5098 ar
5262 st
5360 ta
5368 ci
5405 al
5529 o␣d
5563 ão␣
5626 ri
5994 b
6081 ␣o
6125 or
6125 to
6136 os␣
6184 ␣s
6377 ç
6406 g
6466 ão
6473 ã
6574 co
6819 as
6876 ad
6933 te
7094 ␣c
7168 er
7526 da
7552 v
7617 ␣de␣
8108 os
8383 en
8571 ␣p
8693 nt
8742 re
8871 do
9100 ,␣
9103 ␣e
9234 ,
9310 de␣
9408 ra
9518 es
9862 ␣de
10237 ␣a
13947 de
15731 s␣
15844 p
16192 l
16667 a␣
17216 ␣d
17954 u
18973 e␣
20100 m
21288 c
24172 o␣
24634 .....
24842 ....
25058 ...
25276 ..
29640 n
31451 t
31941 .
35826 d
42138 r
43966 s
45972 i
60928 o
64577 a
69885 e
113666 ␣
(Sure enough, there’s «ção».)
P.S.: Oh look, I’m not insane. Someone named Harald Hammarström has a paper that seems quite similar to this idea: Poor Man’s Stemming: Unsupervised Recognition of Same-Stem Words (.pdf preprint). Reading the abstract while it’s printing out, it’s clear that Mr. Hammarström has thought this through a lot more than I have.