hacklog

New Wikipedia Parser i18n

Written by Patrick Hall, November 17th, 2007

From the Wikipedia mailing list:

[Wikipedia-l] New parser in the works - please help

On the outside chance that any reader of this blog hasn’t edited Wikipedia, what they’re talking about here is the parser that interprets the markup that’s used to write Wikipedia articles. (You can see some by clicking “edit” on any Wikipedia article.) This stuff is called “Wikitext.”

An interesting detail of the project, from our point of view, is that attention is being paid to internationalizing the parser:

Some of what some people would think of as a “stupid parser trick” is
in fact important - e.g. L”’uomo” which renders as L<i>uomo</i>
(necessary for French and Italian).

So: we need to know what MediaWiki quirks are supporting important
constructs in languages other than English (which is the language the
list is in, and is the native language of most of the participants),
and particularly in non-European languages.

This list is unlikely to implement new features, e.g. (an example
brought up by GerardM) the double-apostrophe in Neapolitan. But we
really need to know about present important features that wouldn’t be
obvious to an English-speaker going through the present parser code.

It will be interesting to follow this stuff; I do a fair amount of trying to parse up Wikitext myself.
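To make the apostrophe quirk from the quote a bit more concrete, here is a toy sketch in Python. It is emphatically not MediaWiki's actual algorithm: `render_quotes` is a hypothetical helper, it ignores bold markup entirely, and the input string `L'''uomo''` is only my guess at the kind of markup the mangled example was getting at. The rule it assumes is simply that two apostrophes toggle italics and any extras in the same run are kept as literal apostrophes.

```python
import re


def render_quotes(wikitext: str) -> str:
    """Toy renderer: '' toggles italics; extra apostrophes in a run stay literal.

    Assumption: bold (''' in real Wikitext) is deliberately not handled here.
    """
    out = []
    italic_open = False
    # Split into runs of two or more apostrophes and everything in between.
    for token in re.split(r"('{2,})", wikitext):
        if re.fullmatch(r"'{2,}", token):
            extra = "'" * (len(token) - 2)  # apostrophes beyond the italic marker
            if italic_open:
                out.append("</i>" + extra)
            else:
                out.append(extra + "<i>")
            italic_open = not italic_open
        else:
            out.append(token)
    if italic_open:
        out.append("</i>")  # close an unbalanced italic run
    return "".join(out)


print(render_quotes("L'''uomo''"))
```

Single apostrophes, as in "don't", pass through untouched, since only runs of two or more are treated as markup. A real parser has to decide between bold, italics, and literal apostrophes in the same run, which is exactly where language-specific cases like this get tricky.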

(PS. WordPress has totally borked up that Italian example in the quote above; just check it out on the original link.)

What the Heck is a Language Model? 5 Minute Answer

Written by Patrick Hall, November 13th, 2007

Loonicode U+0009

Written by Patrick Hall, November 12th, 2007

A Unicode Question: Character Decompositions

Written by Patrick Hall, November 11th, 2007

And now a one-line post

Written by Patrick Hall, November 9th, 2007

A Revolution in Linguistic Mapping?

Written by Patrick Hall, November 2nd, 2007

Dear Jeff Bezos

Written by Patrick Hall, November 1st, 2007