New Wikipedia Parser i18n
From the Wikipedia mailing list:
[Wikipedia-l] New parser in the works - please help
On the outside chance that any reader of this blog hasn’t edited Wikipedia, what they’re talking about here is the parser that interprets the markup that’s used to write Wikipedia articles. (You can see some by clicking “edit” on any Wikipedia article.) This stuff is called “Wikitext.”
An interesting detail of the project from our point of view is that attention is being paid to internationalizing the parser:
Some of what some people would think of as a “stupid parser trick” is
in fact important - e.g. L”’uomo'’ which renders as L<i>uomo</i>
(necessary for French and Italian).

So: we need to know what MediaWiki quirks are supporting important
constructs in languages other than English (which is the language the
list is in, and is the native language of most of the participants),
and particularly in non-European languages.

This list is unlikely to implement new features, e.g. (an example
brought up by GerardM) the double-apostrophe in Neapolitan. But we
really need to know about present important features that wouldn’t be
obvious to an English-speaker going through the present parser code.
It will be interesting to follow this stuff; I do a fair amount of trying to parse up Wikitext myself.
(PS. WordPress has totally borked up that Italian example in the quote above; just check it out on the original link.)
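A minimal sketch of the apostrophe quirk, assuming the intended markup is `L'''uomo''` (three apostrophes, then two) and the intended output is `L'<i>uomo</i>`. The function name `render_quotes` is made up for illustration, and this is a toy model, not MediaWiki's actual algorithm: it only handles runs of two or three apostrophes and always reinterprets the first bold run.

```python
import re


def render_quotes(line: str) -> str:
    """Toy model of one wikitext quirk (hypothetical helper, not MediaWiki code)."""
    # Split into alternating text / apostrophe-run pieces; runs sit at odd indexes.
    parts = re.split(r"('{2,})", line)
    runs = parts[1::2]
    n_italic = sum(1 for r in runs if len(r) == 2)
    n_bold = sum(1 for r in runs if len(r) == 3)

    # If both counts are odd, guess that one bold marker was really a
    # literal apostrophe followed by an italic marker (as in L'uomo).
    if n_italic % 2 == 1 and n_bold % 2 == 1:
        for i in range(1, len(parts), 2):
            if len(parts[i]) == 3:
                parts[i - 1] += "'"   # keep one apostrophe as text
                parts[i] = "''"       # treat the rest as an italic marker
                break

    out = []
    italic_open = False
    bold_open = False
    for i, piece in enumerate(parts):
        if i % 2 == 0:
            out.append(piece)                      # plain text
        elif len(piece) == 2:
            out.append("</i>" if italic_open else "<i>")
            italic_open = not italic_open
        elif len(piece) == 3:
            out.append("</b>" if bold_open else "<b>")
            bold_open = not bold_open
        else:
            out.append(piece)                      # longer runs: out of scope here
    return "".join(out)


print(render_quotes("L'''uomo''"))
```

An English-only reading might treat the three apostrophes as a stray bold marker, which is exactly the kind of behaviour that looks like a quirk until an Italian or French article depends on it.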
4 comments.
The basic problem is that MediaWiki’s wikitext is not defined anywhere except as “whatever the parser code does.” So the effort is to actually define the stuff properly, because then it can be implemented independently. The internationalisation comes in because some bits of what the present parser does may be first thought to be irrelevant quirks, but turn out to be very important in some languages …
The list is at http://lists.wikimedia.org/mailman/listinfo/wikitext-l - so far some progress is being made in writing an ANTLR grammar (more likely to work than EBNF, which several people have tried without success), and I put out the call for important quirks you quoted.
If we can free wikitext from the parser code, there are all sorts of interesting things we could do - optimised implementation in C, third-party wikitext-to-whatever translators, WYSIWYG editors …
Hi David!
I have about a bazillion ideas related to such a thing + translation. (I have no experience with parser writing myself, but I have dug my way through editing a bunch of Wikipedias, so I will try to watch the conversation and perhaps lend a comment if possible.)
Thanks for the comment!
Hey Patrick(?), just commenting re: your comments on Language Hat. I’d love to collaborate on the Nenets-Nganasan dictionary update if such a thing happens. I’ve got some basic PHP skills, decent MySQL skills, and a good understanding of Django and how sites written in such frameworks need to work. This is why I think it would be awesome to get all of their data into a database; we could do really wikkiid things with it.
Also, addendum– I think our web/language/linguistics skills might be compatible for other collaborations that might come up, if you’re interested! I’ve always been feeling the urge to get more and more into computational linguistics and making language more of a transparent thing. Keep in touch!