Hacklog: Blogamundo — poking holes in the language barrier since approximately 1 month from now

Unicode Script property and Javascript

Written by Patrick Hall, November 21st, 2008

Dear lazyweb,

I would like a Javascript function to work like this:

magicalFunction('カ')
→ 'Katakana'

magicalFunction('a')
→ 'Latin'

magicalFunction('አ')
→ 'Ethiopic'

In other words, I want to be able to access the script property described in UAX #24: Script Names.

This actually exists already in Perl regular expressions, where you can just say \p{Katakana} in a regex to match Katakana characters.

Maybe such a thing could end up in the next version of Javascript… not that I have the slightest idea where to make that suggestion. But in the meantime, it seems to me that there should be a unicodescripts.js or some such.
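
For what it is worth, later editions of the language did pick this up: since ECMAScript 2018, regular expressions with the `u` flag accept Unicode property escapes such as `\p{Script=Katakana}`. Here is a minimal sketch of the function above on that basis. The name `magicalFunction` comes from the post; the `CANDIDATE_SCRIPTS` list is a hand-picked assumption and would need extending to cover every script of interest.

```javascript
// Hand-picked sample of script names (an assumption for this sketch);
// each must be a valid value of the Unicode Script property.
const CANDIDATE_SCRIPTS = ['Latin', 'Katakana', 'Hiragana', 'Ethiopic', 'Cyrillic', 'Greek'];

// Precompile one anchored pattern per script, e.g. /^\p{Script=Latin}$/u.
const SCRIPT_PATTERNS = CANDIDATE_SCRIPTS.map(
  (name) => [name, new RegExp(`^\\p{Script=${name}}$`, 'u')]
);

// Return the first candidate script whose pattern matches the single
// character `ch`, or 'Unknown' if none of the candidates match.
function magicalFunction(ch) {
  for (const [name, pattern] of SCRIPT_PATTERNS) {
    if (pattern.test(ch)) return name;
  }
  return 'Unknown';
}

console.log(magicalFunction('カ')); // 'Katakana'
console.log(magicalFunction('a'));  // 'Latin'
console.log(magicalFunction('አ'));  // 'Ethiopic'
```

Trying one pattern per script is linear in the number of candidates, which is fine for a short list but would be slow if every script in the standard were included.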

Any ideas on what would be an efficient programming approach to implementing such a data structure, something that might be reasonably squeezed into a .js file?
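
One possible approach, sketched here as an assumption rather than a tested library: store the script data as a sorted array of `[first, last, name]` code point ranges and binary-search it. The `SCRIPT_RANGES` table below is a tiny hand-picked sample for illustration; a real `unicodescripts.js` would presumably generate the full table from the Unicode `Scripts.txt` data file. The names `SCRIPT_RANGES` and `lookupScript` are hypothetical.

```javascript
// [firstCodePoint, lastCodePoint, scriptName], sorted by firstCodePoint,
// with no overlaps. Sample entries only, not the full Unicode data.
const SCRIPT_RANGES = [
  [0x0041, 0x005A, 'Latin'],    // A..Z
  [0x0061, 0x007A, 'Latin'],    // a..z
  [0x03B1, 0x03C9, 'Greek'],    // lowercase alpha..omega
  [0x0410, 0x044F, 'Cyrillic'], // basic Russian alphabet
  [0x1290, 0x12B0, 'Ethiopic'],
  [0x3041, 0x3096, 'Hiragana'],
  [0x30A1, 0x30FA, 'Katakana'],
];

// Binary search for the range containing the first code point of `ch`.
function lookupScript(ch) {
  const cp = ch.codePointAt(0);
  if (cp === undefined) return 'Unknown'; // empty string
  let lo = 0;
  let hi = SCRIPT_RANGES.length - 1;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    const [first, last, name] = SCRIPT_RANGES[mid];
    if (cp < first) hi = mid - 1;
    else if (cp > last) lo = mid + 1;
    else return name;
  }
  return 'Unknown'; // not covered by the sample table
}

console.log(lookupScript('カ')); // 'Katakana'
console.log(lookupScript('አ')); // 'Ethiopic'
```

Because scripts tend to occupy contiguous blocks, a range table stays far smaller than a per-character map, and lookup cost grows only logarithmically with the number of ranges.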

Reports of Afrikaans’ Death Greatly Contested

Written by Patrick Hall, November 15th, 2008

Yippee! Angry language arguments!

Scene the first:

Kid in South Africa decides to study Zulu instead of Afrikaans, because he says Afrikaans is dying:

The Times - White schoolboy chooses to learn Zulu over ‘dying language’

Scene the second:

Inevitable “Oh yeah???” response:

The Times - Afrikaans very much alive and kicking

I have no insights on this matter.

That is all.

English → Korean → English: Jimmy Wales in Korea

Written by Patrick Hall, November 14th, 2008

The article here is interesting in its own right, but I just thought I’d point out an editor’s note that caught my attention:

Editor’s Note: This interview was conducted in English and Korean, and the article was written in Korean. Mr. Wales’s comments have therefore been translated from English to Korean and back into English.

[Interview] Wikipedia founder critical of real-name Internet system

Musings on a Language Survey of Twitter

Written by Patrick Hall, November 12th, 2008

In the wake of my painfully hand-constructed survey of languages on Twitter on election night, my interest is piqued: what would it take to come up with some REAL numbers, automatically?

The problem with running statistical language identification on Twitter is that twits are too short: 140 characters at most.* My simple language ID tool needs a lot more text than that.

But it seems to me that there is a good solution, like this:

  1. Get a bunch of twits (somehow)
  2. Eliminate everything that shows up as English
  3. Get more twits from the users that posted the non-English ones (using the API, again)
  4. Run language ID on a bunch of those users’ twits, assuming that they twitter in just one language (it’s quite common for people to twitter in many languages, but gotta start somewhere…)
  5. Count.
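
The steps above can be sketched roughly as follows. Everything here is hypothetical: `fetchUserTweets` stands in for whatever API call returns a user's recent twits, `identifyLanguage` stands in for the language ID tool, and both are passed in as parameters so that the sketch does not pretend to know the real interfaces. It also assumes each twit is an object with `user` and `text` fields.

```javascript
// tweets: array of { user, text } objects (step 1, obtained "somehow").
// fetchUserTweets(user): hypothetical helper returning an array of texts.
// identifyLanguage(text): hypothetical helper returning a code like 'es'.
function surveyLanguages(tweets, fetchUserTweets, identifyLanguage) {
  // Step 2: drop everything that shows up as English.
  const nonEnglish = tweets.filter((t) => identifyLanguage(t.text) !== 'en');

  // Step 3: collect the distinct users behind the non-English twits.
  const users = [...new Set(nonEnglish.map((t) => t.user))];

  // Step 4: run language ID on each user's combined twits,
  // assuming one language per user.
  const counts = new Map();
  for (const user of users) {
    const combined = fetchUserTweets(user).join(' ');
    const lang = identifyLanguage(combined);
    // Step 5: count users per language.
    counts.set(lang, (counts.get(lang) || 0) + 1);
  }
  return counts;
}
```

Joining a user's twits into one string is what gets around the length problem: the identifier sees a few thousand characters instead of 140.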

I’ll get around to that… any minute now.

*Incidentally, do multibyte languages get their maximum length cropped on Twitter?
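
To make the footnote concrete, here is a small sketch showing how far apart the character count and the UTF-8 byte count of the same text can be. It says nothing about which of the two Twitter actually enforces; `measureLength` is a hypothetical helper name.

```javascript
// Compare the number of code points in a string with its UTF-8 byte length.
function measureLength(text) {
  return {
    characters: [...text].length,                    // code points
    utf8Bytes: new TextEncoder().encode(text).length // bytes in UTF-8
  };
}

console.log(measureLength('abcd'));     // { characters: 4, utf8Bytes: 4 }
console.log(measureLength('カタカナ')); // { characters: 4, utf8Bytes: 12 }
```

If the limit were applied to bytes, a Japanese twit would run out of room after roughly a third as many characters as an English one.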

Web fonts and language equality

Written by Patrick Hall, November 11th, 2008

A while back I posted about an interesting experiment on sending fonts over the web as Javascript. Here’s a look at a more robust solution to the same problem: web fonts.

Check out this list of Wikipedias. Do you see a lot of question marks or little meaningless boxes? The reason you get that junk is that you don’t happen to have the right fonts installed locally on your computer.

That sucks. It’s not fair. Why should English or French or Russian or Japanese speakers be treated as first-class citizens of the web, but Tibetan or Khmer or Inuktitut speakers be treated as weirdos?

When “web fonts” catch on, this regrettable state of affairs could start to fade away. Content providers will be able to host fonts on the server, right next to the content requiring those fonts, thus ensuring that readers will be able to see content in any language.

John Resig has an article that explains how all this works: An introduction to W3C Web Fonts. Happily, as John points out, it seems that browser support for web fonts is on the upswing.

I’m convinced that a lack of fonts is a major barrier to an increase in the amount of content in certain languages on the web. Web fonts would go a long way toward fulfilling the “world wide” promise of the World Wide Web.

Wikipedia multilingual logo update

Written by Patrick Hall, November 10th, 2008

I ran across this on the Wikipedia mailing list and thought I’d repost it for my fellow language nerds:

We’re working on an update to the Wikipedia logo, which can be used in 3-D, which will be correcting all the incorrect glyphs, and include many other scripts that are not presently in the logo.

The project page is at http://meta.wikimedia.org/wiki/Wikipedia/Logo and we’re still looking for community members to discuss, to help sort out characters, font styles and representations for the additional alphabets as well as continue discussing the current glyphs on the talk page at http://meta.wikimedia.org/wiki/Talk:Wikipedia/Logo.

Your input is greatly appreciated!

That post was from Cary Bass.

The project page has a nifty diagram of the work in progress (where you can see the borked up Katakana on the current logo, can’t believe I’d never noticed that before…).

Tentative Predictions

Written by Patrick Hall, November 4th, 2008

We’re going to go out on a limb here, and call it for Spanish:

   1 cs
   1 el
   1 nl
   1 no
   1 sv
   1 th
   1 zh
   3 ro
   6 it
  13 de
  15 fr
  23 pt
  26 es
http://twitter.com/drturnuseverin/status/990376485
 http://twitter.com/memoriavirtual/status/990399354
 http://twitter.com/ToobyTweet/status/990403106
 http://twitter.com/elnuevodia/status/990442096
 http://twitter.com/emtemporeal/status/990444566
 http://twitter.com/Octa/status/990417495
 http://twitter.com/marcovicini/status/990449383
 http://twitter.com/gigold/status/990457821
 http://twitter.com/david_martos/status/990456842
 http://twitter.com/vascocampilho/status/990472707
 http://twitter.com/rafaelrodrigues/status/990473585
 http://twitter.com/piticanella/status/990477228
 http://twitter.com/enricoescalona/status/990480981
 http://twitter.com/pickupjojo/status/990482411
 http://twitter.com/metalkrim/status/990484675
 http://twitter.com/Solrak/status/990486778
 http://twitter.com/panconqueso/status/990486766
 http://twitter.com/versac/status/990486748
 http://twitter.com/cleitonkamikaze/status/990489746
 http://twitter.com/panconqueso/status/990488965
 http://twitter.com/ToobyTweet/status/990493358
 http://twitter.com/sepulveda/status/990492554
 http://twitter.com/geografosubjeti/status/990490375
 http://twitter.com/zolliker/status/990494120
 http://twitter.com/lciusa2008/status/990496903
 http://twitter.com/denispedroso/status/990495493
 http://twitter.com/yuri_music/status/990505305
 http://twitter.com/viniciuskmax/status/990503799
 http://twitter.com/Tim_Booth/status/990502126
 http://twitter.com/DerWesten/status/990499340
 http://twitter.com/ghostdog19/status/990510542
 http://twitter.com/Curvaspoliticas/status/990508418
 http://twitter.com/rdc/status/990511316
 http://twitter.com/danzflor/status/990511263
 http://twitter.com/upmarine/status/990513273
 http://twitter.com/alaovest/status/990514628
 http://www.welt.de/politik/article2676551/Barack-Obama-siegt-und-siegt-und-siegt.html#reqRSS
 http://twitter.com/fraugrasdackel/status/990517518
 http://twitter.com/TheGhost/status/990516827
 http://twitter.com/teletekst/status/990519501
 http://twitter.com/Neto/status/990519503
 http://twitter.com/Cooperativa/status/990524619
 http://twitter.com/teban/status/990522421
 http://twitter.com/gazetadopovo/status/990525502
 http://twitter.com/dsobuitenland/status/990526198
 http://twitter.com/inixia/status/990528510
 http://twitter.com/bank_xavi/status/990535142
 http://twitter.com/lilaEule/status/990540587
 http://twitter.com/notivagos/status/990539959
 http://twitter.com/touffik91/status/990541813
 http://twitter.com/marieclairee/status/990541801
 http://twitter.com/PhilippG/status/990541769
 http://twitter.com/noticiasrtp/status/990541722
 http://twitter.com/Emergent007/status/990547657
 http://twitter.com/florida_mike/status/990548476
 http://twitter.com/florida_mike/status/990548476
 http://twitter.com/ToobyTweet/status/990549906
 http://twitter.com/magicasland/status/990553319
 http://twitter.com/Emergent007/status/990554709
 http://twitter.com/pedrooliver/status/990556803
 http://twitter.com/pickupjojo/status/990558617
 http://twitter.com/juancamon/status/990558117
 http://twitter.com/alexvalente/status/990562733
 http://twitter.com/biab/status/990560773
 http://twitter.com/Wikio_LaUne/status/990564799
 http://twitter.com/cotidianul/status/990564115
 http://twitter.com/Miuxapop/status/990564110
 http://twitter.com/Wikio_LaUne/status/990564799
 http://twitter.com/Borsenalle/status/990559355
 http://twitter.com/timdream/status/990568725
 http://twitter.com/Anomalo/status/990567489
 http://twitter.com/FoggyMind/status/990572601
 http://twitter.com/deredgar/status/990572597
 http://twitter.com/ghostdog19/status/990572574
 http://twitter.com/ladylazarus/status/990575996
 http://twitter.com/beeck/status/990580824
 http://twitter.com/geografosubjeti/status/990577280
 http://twitter.com/genevenews/status/990584686
 http://twitter.com/admiyn/status/990588709
 http://twitter.com/petevalle/status/990590484
 http://twitter.com/matyasgabor/status/990592124
 http://twitter.com/TSFRadio/status/990592956
 http://twitter.com/YannickGelinas/status/990594621
 http://twitter.com/geografosubjeti/status/990596911
 http://twitter.com/news_rss/status/990599330
 http://twitter.com/juanlusanchez/status/990598555
 http://twitter.com/_tillwe_/status/990597742
 http://twitter.com/cashblog/status/990597701
 http://twitter.com/coreanomac/status/990601582
 http://twitter.com/lanacioncom/status/990615456
 http://twitter.com/DiamondLion/status/990623868
 http://twitter.com/matesola/status/990603066
 http://twitter.com/cotidianul/status/990630327

Have a bunch of links

Written by Patrick Hall, November 3rd, 2008

As a famous pig once put it, “That’ll do.”

I feel compelled to mention

Written by Patrick Hall, November 2nd, 2008

That for Halloween I was U+0310, U+0304, U+0302, U+0303, U+030A, U+0E44, and U+0940.

That is all.

Ffaaaaantastig.

Written by Patrick Hall, October 31st, 2008

Well heck, here’s one more horrifyingly Halloweeny language-related tale:

BBC NEWS | UK | Wales | E-mail error ends up on road sign

The English is clear enough to lorry drivers - but the Welsh reads “I am not in the office at the moment. Please send any work to be translated.”
