Don’t sort stuff in Unicode with Bash?
Update: Okay, duh: I shouldn’t have called it “Bash”. What I meant was, “whatever the sort utility is in my default terminal.” Which, as Bryan points out in a comment below, has nothing to do with Bash: it’s GNU Sort. More updates below.
I have a little text file with “Hello World” in lots of languages, which I often use for testing. I extracted a few lines with various scripts and saved that as helloworld.txt.
$ cat helloworld.txt
สวัสดีราคาถูก! Thai
Habari dunia! Kiswahili
Halló heimur! Icelandic
Saluton Mondo! Esperanto
Sveika, pasaule! Latvian
Привет, мир! Russian
ሠላም ዓለም! Amharic
안녕, 세상! Korean
Chào thế giới! Vietnamese
Hallo, wrâld Frisian
Hallo verden! Norwegian/Bokmal
Laba ryta, pasauli! Lithuanian
For my first amazing trick, I sort the file with the Bash shell built-in:
$ sort helloworld.txt
ሠላም ዓለም! Amharic
Chào thế giới! Vietnamese
Habari dunia! Kiswahili
Halló heimur! Icelandic
Hallo verden! Norwegian/Bokmal
Hallo, wrâld Frisian
안녕, 세상! Korean
Laba ryta, pasauli! Lithuanian
Saluton Mondo! Esperanto
Sveika, pasaule! Latvian
สวัสดีราคาถูก! Thai
Привет, мир! Russian
…which sucks. Because obviously Bash is ignoring anything fancy (Amharic, Korean, Thai) and sorting strictly by whatever ASCII shows up in the line. (Hard to say whether the «ó» in Icelandic is being considered, but shouldn’t it come after «o» anyway?)
I also installed and tried another terminal called rxvt-unicode, which supposedly has better Unicode support. I got the same results there as in Bash under gnome-terminal, which suggests to me that the problem is Bash, or somewhere deeper, and not the terminal itself. For comparison, here is the same file sorted in Python:
$ python
>>> lines = open('helloworld.txt').read().decode('utf-8').splitlines()
>>> for line in sorted(lines): print line
...
Chào thế giới! Vietnamese
Habari dunia! Kiswahili
Hallo verden! Norwegian/Bokmal
Hallo, wrâld Frisian
Halló heimur! Icelandic
Laba ryta, pasauli! Lithuanian
Saluton Mondo! Esperanto
Sveika, pasaule! Latvian
Привет, мир! Russian
สวัสดีราคาถูก! Thai
ሠላም ዓለም! Amharic
안녕, 세상! Korean
Python does better; clearly things are being sorted according to their Unicode code points. Which of course is a far cry from following UTS #10: Unicode Collation Algorithm, but that has to do with locales and all that.
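The session above is Python 2. For anyone trying this today, here is a minimal Python 3 sketch of the same code point sort, with the sample lines inlined so it runs without the file. The function names and the `sort_by_locale` helper are my own illustration, not part of the original session. The helper hands comparison to the C library via `locale.strxfrm`, so its result depends entirely on which locales are installed, and it is still not a full UTS #10 implementation.

```python
import locale

# The twelve sample lines from helloworld.txt, in file order.
HELLO_LINES = [
    "สวัสดีราคาถูก! Thai",
    "Habari dunia! Kiswahili",
    "Halló heimur! Icelandic",
    "Saluton Mondo! Esperanto",
    "Sveika, pasaule! Latvian",
    "Привет, мир! Russian",
    "ሠላም ዓለም! Amharic",
    "안녕, 세상! Korean",
    "Chào thế giới! Vietnamese",
    "Hallo, wrâld Frisian",
    "Hallo verden! Norwegian/Bokmal",
    "Laba ryta, pasauli! Lithuanian",
]


def sort_by_code_point(lines):
    """Plain sorted(): Python 3 str comparison goes code point by code point."""
    return sorted(lines)


def sort_by_locale(lines, locale_name=""):
    """Sort with the C library's collation rules for locale_name.

    An empty locale_name means "whatever the environment says".
    Returns None if the requested locale is not available.
    """
    previous = locale.setlocale(locale.LC_COLLATE)  # query only, no change
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        return None
    try:
        return sorted(lines, key=locale.strxfrm)
    finally:
        # Restore the collation setting we found on entry.
        locale.setlocale(locale.LC_COLLATE, previous)


if __name__ == "__main__":
    for line in sort_by_code_point(HELLO_LINES):
        print(line)
```

Calling `sort_by_locale(HELLO_LINES)` with the default argument should roughly mirror what the command-line sort does under the same environment, since both defer to the system's collation tables, but I would treat that as something to check on your own machine rather than a guarantee.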
In any case, I won’t be trusting Bash to sort Unicode files any more.
(I’d be interested to know what the default sort does to the initial input in various other programming languages, comments welcome.)
Update:
Bryan’s comment pointed out that I wasn’t even dealing with Bash, but rather with GNU sort. Reading through its manual, I discovered the following trick in a footnote:
$ export LC_ALL=C; sort hw.txt
Chào thế giới! Vietnamese
Habari dunia! Kiswahili
Hallo verden! Norwegian/Bokmal
Hallo, wrâld Frisian
Halló heimur! Icelandic
Laba ryta, pasauli! Lithuanian
Saluton Mondo! Esperanto
Sveika, pasaule! Latvian
Привет, мир! Russian
สวัสดีราคาถูก! Thai
ሠላም ዓለም! Amharic
안녕, 세상! Korean
Which seems to be what I was looking for.
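Why does this output match the Python listing exactly? With LC_ALL=C, sort compares raw byte values, and UTF-8 is designed so that byte order and code point order agree. The sketch below checks that on a few of the sample lines, assuming the file is UTF-8 encoded (the variable names are just for illustration):

```python
# A few lines from the sample file, deliberately out of order.
sample = [
    "안녕, 세상! Korean",
    "Halló heimur! Icelandic",
    "Привет, мир! Russian",
    "Hallo, wrâld Frisian",
    "ሠላም ዓለም! Amharic",
    "Hallo verden! Norwegian/Bokmal",
]

# What Python's sorted() does on str: compare code points.
by_code_point = sorted(sample)

# What a byte-wise sort does: compare the UTF-8 encoded bytes.
by_utf8_bytes = sorted(sample, key=lambda s: s.encode("utf-8"))

# UTF-8 preserves code point order, so the two orderings agree.
print(by_code_point == by_utf8_bytes)  # prints True
```

So the C locale is not doing anything Unicode-aware here. It is the same "sort by number" behaviour as Python's default, just arrived at through bytes.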
