Dictionaries for Minimalists
And now a dictionary format I like to parse.
Because lazy can be good.
I like dictionary files like this:
house
casa

book
livro
Entries are separated by \n\n, and terms are separated by \n.
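To show how little code that rule needs, here is a minimal sketch in Python. The language and the function name parse_entries are only illustrative choices, and the sketch assumes plain \n line endings and no stray whitespace on the blank separator lines.

```python
def parse_entries(text):
    """Split a word list into entries; each entry is a list of terms."""
    entries = []
    # Entries are separated by a blank line (\n\n).
    for block in text.strip().split("\n\n"):
        # Terms inside an entry are separated by a single newline.
        terms = [line.strip() for line in block.split("\n") if line.strip()]
        if terms:
            entries.append(terms)
    return entries


sample = "house\ncasa\n\nbook\nlivro\n"
print(parse_entries(sample))
# → [['house', 'casa'], ['book', 'livro']]
```

Each entry comes back as a plain list, so by convention the first term is the headword and the second its translation.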
I think it’s better than this, which for some reason is quite popular in lexicons on the web:
house - casa
book - livro
Or this:
house casa
book livro
(Those are supposed to be tabs, but may well not be by the time they get to you… which is why I don’t like tabs.)
Arguably, you might want to add language tag labels to my favorite format, because it’s a bit safer:
EN house
PT casa

EN book
PT livro
…because sometimes people would accidentally put “casa” preceding “house,” instead of after it, which is the convention this file is apparently following.
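With tags on every line, a parser can key each term by its language instead of trusting line order. A rough sketch in Python, assuming the tag is the first space-delimited token on each line (parse_tagged is a hypothetical name, not part of any library):

```python
def parse_tagged(text):
    """Parse 'TAG term' lines into one dict per entry, keyed by language tag."""
    entries = []
    for block in text.strip().split("\n\n"):
        entry = {}
        for line in block.split("\n"):
            line = line.strip()
            if not line:
                continue
            # Assumption: everything before the first space is the tag,
            # everything after it is the term (which may contain spaces).
            tag, _, term = line.partition(" ")
            entry[tag] = term.strip()
        if entry:
            entries.append(entry)
    return entries


# The first entry is deliberately in the "wrong" order.
sample = "PT casa\nEN house\n\nEN book\nPT livro\n"
for entry in parse_tagged(sample):
    print(entry["EN"], "->", entry["PT"])
# prints:
# house -> casa
# book -> livro
```

The swapped first entry still comes out right, because lookup is by tag rather than by position.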
But, you run that risk in any format. You run that risk in XML, as a matter of fact. So you have to deal with it anyway. Which brings me back to my favorite format, which is easy to type, easy to parse, easy to store. It’s easy.
house
casa

book
livro
2 comments.
I’m sure the format is very easy to parse and everything, and may be appropriate for some applications, but it totally fails to capture the complicated nature of equivalence between two languages. I prefer my dictionaries complex and difficult to parse ;-)
Hi MBM,
Well, that’s the “for minimalists” bit! :P
Actually, I remember reading somewhere that one of the first applications that occasioned the writing of the XML spec (as a subset/wayward love child of SGML) was Tim Bray’s work on the Oxford English Dictionary. Surely the OED qualifies as being on the complex end of the dictionary spectrum, and XML can handle it.
Even so, I find myself working with word lists like this all the time, and I’ve written about a bazillion little parsers to deal with them. The format says nothing whatever about the nature of the relationships between terms. For instance, is it OK to have:
Or something (those are dumb examples), where you have the same link going in various directions, and even “lexical loops.” Of course, trying to formalize rules like that is a whole separate ball o wax.