Yay!
Dr. Christopher Manning’s course on Natural Language Processing at Stanford* is now online with streaming video and transcripts.
Apparently the copyright issues, whatever they were, have been resolved by editing out particular parts of particular lectures.
Well, it’s still a first for online NLP education as far as I know, and you can’t beat the price.
* Go Bears
So like, are there non-English dialects of lolcat out there, somewhere?
The world needs to know.
Kthx.
Dear lazyweb,
I would like a JavaScript function to work like this:
magicalFunction('カ')
→ 'Katakana'
magicalFunction('a')
→ 'Latin'
magicalFunction('አ')
→ 'Ethiopic'
In other words, I want to be able to access the script property described in UAX #24: Script Names.
This actually exists already in Perl regular expressions, where you can just say \p{Katakana} in a regex to match Katakana characters.
Maybe such a thing could end up in the next version of JavaScript… not that I have the slightest idea where to make that suggestion. But in the meantime, it seems to me that there should be a unicodescripts.js or some such.
Any ideas on what would be an efficient programming approach to implementing such a data structure, something that might be reasonably squeezed into a .js file?
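One possible approach, as a minimal sketch: keep a sorted table of code point ranges and binary-search it. The four ranges below are a tiny hand-picked excerpt chosen to cover the examples above. A real unicodescripts.js would presumably generate the full table from the Scripts.txt data file that accompanies UAX #24. The table name and the 'Unknown' fallback are my own assumptions, not part of any existing library.

```javascript
// Illustrative excerpt only; a complete table would be generated from Scripts.txt.
// Each entry is [firstCodePoint, lastCodePoint, scriptName], sorted by firstCodePoint.
const SCRIPT_RANGES = [
  [0x0041, 0x005A, 'Latin'],    // A..Z
  [0x0061, 0x007A, 'Latin'],    // a..z
  [0x1290, 0x12B0, 'Ethiopic'], // one of several Ethiopic ranges
  [0x30A1, 0x30FA, 'Katakana'],
];

// Returns the script name for the first character of `ch`,
// or 'Unknown' if it falls outside every range in the (partial) table.
function magicalFunction(ch) {
  const cp = ch.codePointAt(0);
  if (cp === undefined) return 'Unknown'; // empty string

  // Binary search over the sorted, non-overlapping ranges.
  let lo = 0;
  let hi = SCRIPT_RANGES.length - 1;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    const [start, end, name] = SCRIPT_RANGES[mid];
    if (cp < start) hi = mid - 1;
    else if (cp > end) lo = mid + 1;
    else return name;
  }
  return 'Unknown';
}

magicalFunction('カ'); // → 'Katakana'
magicalFunction('a');  // → 'Latin'
magicalFunction('አ');  // → 'Ethiopic'
```

Storing ranges rather than one entry per character keeps the table small, since scripts mostly occupy contiguous blocks, and the lookup stays logarithmic in the number of ranges.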
Update: Longtime reader Edward O’Connor emails to suggest xregexp:
…you should check out the unicode plugin for xregexp:
http://blog.stevenlevithan.com/archives/xregexp-unicode-plugin
This does pretty much exactly what you want.
Lazyweb, the greatest programming platform in history!
In vaguely related news, rubyistas out there should check out Edward’s talk from MerbCamp.
Yippee! Angry language arguments!
Scene the first:
Kid in South Africa decides to study Zulu instead of Afrikaans, because he says Afrikaans is dying:
The Times - White schoolboy chooses to learn Zulu over ‘dying language’
Scene the second:
Inevitable “Oh yeah???” response:
The Times - Afrikaans very much alive and kicking
I have no insights on this matter.
That is all.
The article here is interesting in its own right, but I just thought I’d point out an interesting note which caught my attention:
Editor’s Note: This interview was conducted in English and Korean, and the article was written in Korean. Mr. Wales’s comments have therefore been translated from English to Korean and back into English.
[Interview] Wikipedia founder critical of real-name Internet system
In the wake of my painfully hand-constructed survey of languages on Twitter on election night, my interest is piqued: what would it take to come up with some REAL numbers, automatically?
The problem with running statistical language identification on Twitter is that twits are too short: 140 characters at most.* My simple language ID tool needs a lot more than that.
But it seems to me that there is a good solution, like this:
- Get a bunch of twits (somehow)
- Eliminate everything that does show up as English
- Get more twits from the users that posted the non-English ones (using the API, again)
- Run language ID on a bunch of those users’ twits, assuming that they twitter in just one language (it’s quite common for people to twitter in many languages, but gotta start somewhere…)
- Count.
I’ll get around to that… any minute now.
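For what it's worth, the steps above can be sketched roughly like this. Everything here is hypothetical: `fetchUserTweets` stands in for whatever API call returns a user's recent twits, `identifyLanguage` stands in for a language ID tool that returns a code such as 'en', and the `{ user, text }` shape of a twit is just an assumption for illustration.

```javascript
// sample: array of { user, text } objects (assumed shape).
// fetchUserTweets(user): hypothetical, returns an array of that user's twit texts.
// identifyLanguage(text): hypothetical, returns a language code such as 'en'.
function countLanguages(sample, fetchUserTweets, identifyLanguage) {
  // Eliminate everything that shows up as English.
  const nonEnglish = sample.filter((t) => identifyLanguage(t.text) !== 'en');

  // Collect the distinct users who posted the remaining twits.
  const users = [...new Set(nonEnglish.map((t) => t.user))];

  // Run language ID on each user's combined twits, so the identifier
  // sees far more than 140 characters, then count one vote per user.
  const counts = new Map();
  for (const user of users) {
    const combined = fetchUserTweets(user).join(' ');
    const lang = identifyLanguage(combined);
    counts.set(lang, (counts.get(lang) || 0) + 1);
  }
  return counts;
}
```

Note that a user can still come back as English in the final count if the first pass misjudged one short twit, which is sort of the point of the second pass.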
*Incidentally, do multibyte languages get their maximum length cropped on Twitter?
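I don't know how Twitter actually counts, but the distinction behind the question is easy to show: the same string has different lengths depending on whether you count characters or UTF-8 bytes. A small sketch, where the helper names are mine:

```javascript
// Number of Unicode code points in a string.
const codePointLength = (s) => [...s].length;

// Number of bytes the string occupies when encoded as UTF-8.
const utf8ByteLength = (s) => new TextEncoder().encode(s).length;

codePointLength('abcd');     // → 4
utf8ByteLength('abcd');      // → 4
codePointLength('カタカナ'); // → 4
utf8ByteLength('カタカナ');  // → 12
```

So a limit enforced in bytes would leave a Katakana writer with about a third as many characters as a limit enforced in characters.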
A while back I posted about an interesting experiment on sending fonts over the web as JavaScript. Here’s a look at a more robust solution to the same problem: web fonts.
Check out this list of Wikipedias. Do you see a lot of question marks or little meaningless boxes? The reason you get that junk is that you don’t happen to have the right fonts installed locally on your computer.
That sucks. It’s not fair. Why should English or French or Russian or Japanese speakers be treated as first-class citizens of the web, but Tibetan or Khmer or Inuktitut speakers be treated as weirdos?
When “web fonts” catch on, this regrettable state of affairs could start to fade away. Content providers will be able to host fonts on the server, right next to the content requiring those fonts, thus ensuring that readers will be able to see content in any language.
John Resig has an article that explains how all this works: An introduction to W3C Web Fonts. Happily, as John points out, it seems that browser support for web fonts is on the upswing.
I’m convinced that a lack of fonts is a major barrier to an increase in the amount of content in certain languages on the web. Web fonts would go a long way toward fulfilling the “world wide” promise of the World Wide Web.
I ran across this on the Wikipedia mailing list and thought I’d repost it for my fellow language nerds:
We’re working on an update to the Wikipedia logo, which can be used in 3-D, which will be correcting all the incorrect glyphs, and include many other scripts that are not presently in the logo.
The project page is at http://meta.wikimedia.org/wiki/Wikipedia/Logo and we’re still looking for community members to discuss, to help sort out characters, font styles and representations for the additional alphabets as well as continue discussing the current glyphs on the talk page at http://meta.wikimedia.org/wiki/Talk:Wikipedia/Logo.
Your input is greatly appreciated!
That post was from Cary Bass.
The project page has a nifty diagram of the work in progress (where you can see the borked up Katakana on the current logo, can’t believe I’d never noticed that before…):

As a famous pig once put it, “That’ll do.”