Hacklog: Blogamundo — poking holes in the language barrier since approximately 1 month from now

Dictionaries for Minimalists

Written by Patrick Hall, 9 months, 2 weeks ago.

And now a dictionary format I like to parse.

Because lazy can be good.

I like dictionary files like this:

house
casa

book
livro

Entries are separated by \n\n, and terms are separated by \n.
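To show how little code that takes, here is a minimal sketch of a parser for this format. The function name `parse_dictionary` is just an illustrative choice, and the sketch assumes the whole file fits in memory as one string.

```python
def parse_dictionary(text):
    """Parse blank-line-separated entries into lists of terms."""
    # Normalize Windows and old Mac line endings so splitting on "\n\n" works.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    entries = []
    for block in text.split("\n\n"):
        # Strip stray whitespace and ignore empty lines inside a block.
        terms = [line.strip() for line in block.split("\n") if line.strip()]
        if terms:
            entries.append(terms)
    return entries


sample = "house\ncasa\n\nbook\nlivro\n"
print(parse_dictionary(sample))  # [['house', 'casa'], ['book', 'livro']]
```

Each entry comes back as a list, so an entry with three or more terms parses the same way as a pair.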

I think it’s better than this, which for some reason is quite popular in lexicons on the web:

house - casa
book - livro

Or this:

house casa
book livro

(Those are supposed to be tabs, but may well not be by the time they get to you… which is why I don’t like tabs.)

Arguably, you might want to add language tags to my favorite format, which is a bit safer:

EN house
PT casa

EN book
PT livro

…because sometimes people accidentally put “casa” before “house” instead of after it, which is the convention this file is apparently following.
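With tags, the order of lines inside an entry stops mattering. A rough sketch of a parser for the tagged variant might look like this; `parse_tagged` is a made-up name, and the sketch assumes each line is a tag, some whitespace, then the term.

```python
def parse_tagged(text):
    """Parse entries whose lines look like 'EN house' into {tag: term} dicts."""
    entries = []
    for block in text.replace("\r\n", "\n").split("\n\n"):
        entry = {}
        for line in block.split("\n"):
            line = line.strip()
            if not line:
                continue
            # Split once on whitespace so multi-word terms stay in one piece.
            parts = line.split(None, 1)
            if len(parts) != 2:
                raise ValueError(f"expected 'TAG term', got {line!r}")
            tag, term = parts
            entry[tag] = term
        if entry:
            entries.append(entry)
    return entries


# The first entry has its lines swapped on purpose.
sample = "PT casa\nEN house\n\nEN book\nPT livro\n"
for entry in parse_tagged(sample):
    print(entry["EN"], "->", entry["PT"])
# house -> casa
# book -> livro
```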

But you run that risk in any format. You run that risk in XML, as a matter of fact. So you have to deal with it anyway. Which brings me back to my favorite format, which is easy to type, easy to parse, easy to store. It’s easy.

house
casa

book
livro

2 Comments for 'Dictionaries for Minimalists'

  1. Comment received 9 months, 2 weeks ago from MBM

    I’m sure the format is very easy to parse and everything, and may be appropriate for some applications, but it totally fails to capture the complicated nature of equivalence between two languages. I prefer my dictionaries complex and difficult to parse ;-)

  2. Comment received 9 months, 2 weeks ago from Patrick Hall

    Hi MBM,

    Well, that’s the “for minimalists” bit! :P

    Actually I remember reading somewhere that one of the first applications that occasioned the writing of the spec for XML (as a subset/wayward love child of SGML) was Tim Bray’s work on the Oxford English Dictionary. Surely the OED qualifies as being on the complex end of the dictionary spectrum, and XML can handle it.

    Even so, I find myself working with word lists like this all the time, and I’ve written about a bazillion little parsers to deal with them. It says nothing whatever about the nature of the relationships — for instance, is it ok to have:

    house
    casa

    house
    casinha

    domicile
    casa

    Or something (those are dumb examples), where you have the same link going in various directions, and even “lexical loops.” Of course, trying to formalize rules like that is a whole separate ball o wax.
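Formalizing those rules is a bigger job, but spotting the overlaps is not. A small sketch, assuming the entries have already been parsed into (source, target) pairs and using the hypothetical name `find_ambiguities`, could flag terms that link to more than one partner in either direction:

```python
from collections import defaultdict


def find_ambiguities(pairs):
    """Report terms that map to, or are mapped from, more than one partner."""
    forward = defaultdict(set)   # source term -> set of target terms
    backward = defaultdict(set)  # target term -> set of source terms
    for source, target in pairs:
        forward[source].add(target)
        backward[target].add(source)
    many_targets = {s: sorted(t) for s, t in forward.items() if len(t) > 1}
    many_sources = {t: sorted(s) for t, s in backward.items() if len(s) > 1}
    return many_targets, many_sources


pairs = [("house", "casa"), ("house", "casinha"), ("domicile", "casa")]
many_targets, many_sources = find_ambiguities(pairs)
print(many_targets)  # {'house': ['casa', 'casinha']}
print(many_sources)  # {'casa': ['domicile', 'house']}
```

This only reports the overlaps. Whether they are errors or legitimate multiple senses is still a judgment call.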
