Hacklog: Blogamundo — poking holes in the language barrier since approximately 1 month from now

A list for you to think about

Written by Patrick Hall, 1 year, 6 months ago.

This is random:

I was thinking about stemming, and I got to thinking something like this:

Because affixes (prefixes and suffixes, in the case of English) have a grammatical function, and because English grammar happens to insist on marking several grammatical categories, it stands to reason that those affixes are probably more common, as a rule, than the average lexical “root.”

So here’s what I did:

I took my poor, battered (digital) copy of Moby Dick, which I have subjected to all manner of computational horrors, and read it into memory (yeah, RAM. burn, CPU, burn). Then I chopped it up into a set of ngrams of lengths one to five. Count, sort, take the most common, and here’s what I get (␣ stands for a space, and the most common come last):


66 es
66 on␣
67 as
68 w
69 ic
69 ␣b
72 is
72 ␣in
74 ig
74 le
75 ha
75 te
78 ve
80 ␣p
83 ed
83 nt
84 ␣to␣
85 r␣
85 to␣
87 ␣to
89 ␣f
89 ␣s
90 tio
90 tion
90 to
91 it
91 ␣h
92 of␣
92 ␣of␣
93 at
94 ␣r
96 ,
96 ,␣
96 or
96 o␣
96 ␣of
97 of
98 f␣
99 ri
99 ␣i
101 v
103 y␣
106 and␣
106 en
106 ion
106 ␣and
106 ␣and␣
110 b
110 nd␣
111 and
111 t␣
112 io
114 l␣
119 the␣
119 ␣the␣
125 al
125 he␣
129 ␣an
129 ␣the
130 re
137 in
142 nd
147 er
147 the
148 ␣th
150 n␣
154 p
162 ti
167 y
169 g
169 ␣o
171 d␣
176 an
177 he
188 m
189 s␣
192 on
193 u
194 th
228 f
235 ␣a
251 ␣t
296 c
321 d
335 e␣
402 l
454 h
467 s
617 r
677 a
710 n
712 i
717 o
805 t
1059 e
1196 ␣␣␣␣␣
1287 ␣␣␣␣
1379 ␣␣␣
1471 ␣␣
3243 ␣
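
A minimal sketch of that counting step in Python. This is not the original script, which isn't shown here; the function name `count_ngrams` and the sample sentence are made up for illustration, and it counts overlapping character ngrams straight across word boundaries, which is what the list above appears to do.

```python
from collections import Counter


def count_ngrams(text, min_n=1, max_n=5):
    """Count every overlapping character ngram of length min_n..max_n."""
    counts = Counter()
    for n in range(min_n, max_n + 1):
        # Slide a window of width n over the whole text, spaces included.
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical stand-in for the full text of the book.
    sample = "the whale and the sea and the ship"
    counts = count_ngrams(sample)
    # Take the ten most common and print them least common first,
    # like the lists in this post. repr() makes the spaces visible.
    top = sorted(counts.most_common(10), key=lambda pair: pair[1])
    for ngram, count in top:
        print(count, repr(ngram))
```

For a whole book you would read the file into `sample` instead. Holding every ngram in one `Counter` is the simple choice, not the memory-friendly one.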

Random observations:

  • Obviously, ngrams of length one aren’t terribly easy to interpret: «s» actually is a suffix, but «e» isn’t. (Is it?)
  • ETAOIN SHRDLU is in there, but not contiguously. «u» wanders off into obscurity behind some bigrams.
  • Spaces complicate everything.
  • Several words show up before affixes: «on», «he», «an». But «an», at least, has the additional wrinkle of being not just a word on its own, but a substring of «and», which is also common.

Nonetheless, it seems beyond doubt that one can find a bunch of affixes this way, possibly most of them.
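
One rough way to pull suffix candidates out of counts like these, sketched in Python. Everything here is an assumption for illustration, not something tested on the lists above: the function names, the `whole_word_ratio` cutoff of 0.5, and the idea itself, which is to keep ngrams that end in a space (like «ed␣») and drop the ones that are usually preceded by a space too (like «␣and␣»), since those are mostly whole words.

```python
from collections import Counter


def count_ngrams(text, min_n=1, max_n=5):
    """Count every overlapping character ngram of length min_n..max_n."""
    counts = Counter()
    for n in range(min_n, max_n + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def suffix_candidates(text, max_suffix_len=3, whole_word_ratio=0.5):
    """Return {letters: count} for word-final strings that are rarely whole words."""
    # We need ngrams two characters longer than the suffix:
    # one for the trailing space, one for a possible leading space.
    counts = count_ngrams(text, 1, max_suffix_len + 2)
    candidates = {}
    for ngram, count in counts.items():
        if len(ngram) > max_suffix_len + 1:
            continue
        body = ngram[:-1]
        # Keep only letters followed by a single trailing space.
        if not ngram.endswith(" ") or not body.isalpha():
            continue
        # If ' and ' is about as frequent as 'and ', then 'and' is mostly a word.
        as_whole_word = counts.get(" " + ngram, 0)
        if as_whole_word / count > whole_word_ratio:
            continue
        candidates[body] = count
    return candidates


if __name__ == "__main__":
    # Hypothetical toy text; note the trailing space after the last word.
    toy = "walked talked jumped the the the "
    for suffix, count in sorted(suffix_candidates(toy).items(), key=lambda p: p[1]):
        print(count, suffix)
```

This still lets through word endings that aren't affixes at all («he» from «the», for instance), so at best it narrows the list down.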

Anyway. Please let me know if I’m insane.

And here’s another similar list for Portuguese:


3558 it
3568 ro
3584 id
3597 aç
3610 di
3644 ção␣
3670 ei
3730 io
3753 qu
3786 ca
3809 q
4001 ␣n
4121 an
4250 as␣
4267 ção
4268 çã
4304 f
4372 ia
4379 .␣
4418 ti
4458 me
4535 r␣
4569 m␣
4570 pr
4574 em
4636 tr
4696 on
4721 se
4759 ␣co
4834 is
4913 ent
4940 in
4969 do␣
5098 ar
5262 st
5360 ta
5368 ci
5405 al
5529 o␣d
5563 ão␣
5626 ri
5994 b
6081 ␣o
6125 or
6125 to
6136 os␣
6184 ␣s
6377 ç
6406 g
6466 ão
6473 ã
6574 co
6819 as
6876 ad
6933 te
7094 ␣c
7168 er
7526 da
7552 v
7617 ␣de␣
8108 os
8383 en
8571 ␣p
8693 nt
8742 re
8871 do
9100 ,␣
9103 ␣e
9234 ,
9310 de␣
9408 ra
9518 es
9862 ␣de
10237 ␣a
13947 de
15731 s␣
15844 p
16192 l
16667 a␣
17216 ␣d
17954 u
18973 e␣
20100 m
21288 c
24172 o␣
24634 .....
24842 ....
25058 ...
25276 ..
29640 n
31451 t
31941 .
35826 d
42138 r
43966 s
45972 i
60928 o
64577 a
69885 e
113666 ␣

(Sure enough, there’s «ção».)

P.S.: Oh look, I’m not insane. Someone named Harald Hammarström has a paper which seems to be quite similar to this idea: Poor Man’s Stemming: Unsupervised Recognition of Same-Stem Words (.pdf preprint). Reading the abstract while it’s printing out, it’s clear that Mr. Hammarström has thought this through a lot more than I have.
