Don’t sort stuff in Unicode with Bash?
Update: Okay, duh: I shouldn’t have called it “Bash”. What I meant was, “whatever the sort utility is in my default terminal.” Which, as Bryan points out in a comment below, has nothing to do with Bash: it’s GNU sort. More updates below.
I have a little text file with “Hello World” in lots of languages, which I often use for testing. I extracted a few lines with various scripts and saved that as helloworld.txt.
$ cat helloworld.txt
สวัสดีราคาถูก! Thai
Habari dunia! Kiswahili
Halló heimur! Icelandic
Saluton Mondo! Esperanto
Sveika, pasaule! Latvian
Привет, мир! Russian
ሠላም ዓለም! Amharic
안녕, 세상! Korean
Chào thế giới! Vietnamese
Hallo, wrâld Frisian
Hallo verden! Norwegian/Bokmal
Laba ryta, pasauli! Lithuanian
For my first amazing trick, I sort the file with the Bash shell built-in:
$ sort helloworld.txt
ሠላም ዓለም! Amharic
Chào thế giới! Vietnamese
Habari dunia! Kiswahili
Halló heimur! Icelandic
Hallo verden! Norwegian/Bokmal
Hallo, wrâld Frisian
안녕, 세상! Korean
Laba ryta, pasauli! Lithuanian
Saluton Mondo! Esperanto
Sveika, pasaule! Latvian
สวัสดีราคาถูก! Thai
Привет, мир! Russian
…which sucks. Because obviously Bash is ignoring anything fancy (Amharic, Korean, Thai) and sorting strictly by whatever ASCII shows up in the line. (Hard to say whether the «ó» in Icelandic is being considered, but shouldn’t it come after «o» anyway?)
I also installed and tried another terminal called rxvt-unicode, which supposedly has better Unicode support. I got the same results as in Bash under gnome-terminal, which suggests to me that the problem is Bash, or somewhere deeper, and not the terminal itself. Next I tried the same sort in Python:
$ python
>>> lines = open('helloworld.txt').read().decode('utf-8').splitlines()
>>> for line in sorted(lines): print line
...
Chào thế giới! Vietnamese
Habari dunia! Kiswahili
Hallo verden! Norwegian/Bokmal
Hallo, wrâld Frisian
Halló heimur! Icelandic
Laba ryta, pasauli! Lithuanian
Saluton Mondo! Esperanto
Sveika, pasaule! Latvian
Привет, мир! Russian
สวัสดีราคาถูก! Thai
ሠላም ዓለም! Amharic
안녕, 세상! Korean
Python does better; clearly things are being sorted according to their Unicode code points. Which of course is a far cry from following UTS #10: Unicode Collation Algorithm, but that has to do with locales and all that.
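To make the code point ordering concrete, here is a small sketch in Python 3 (the session above is Python 2). The list is a subset of helloworld.txt typed inline rather than read from the file, and the name by_code_point is only for illustration.

```python
# A few lines from helloworld.txt, inlined so the snippet is self-contained.
lines = [
    "สวัสดีราคาถูก! Thai",
    "Habari dunia! Kiswahili",
    "Halló heimur! Icelandic",
    "Привет, мир! Russian",
    "ሠላም ዓለም! Amharic",
    "안녕, 세상! Korean",
    "Hallo, wrâld Frisian",
]

# Python 3 compares str values code point by code point,
# so a plain sorted() already gives code point order.
by_code_point = sorted(lines)

for line in by_code_point:
    # Print the code point of the first character next to each line.
    print("U+%04X  %s" % (ord(line[0]), line))
```

The Latin lines come first because their code points are the smallest, followed by Cyrillic, Thai, Ethiopic and Hangul, which is the order of the Unicode blocks and says nothing about any language's alphabet.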
In any case, I won’t be trusting Bash to sort Unicode files any more.
(I’d be interested to know what the default sort does to the initial input in various other programming languages, comments welcome.)
Update:
After Bryan’s comment pointed out that it wasn’t Bash that I was even dealing with, but rather GNU sort, I read through the manual and discovered the following trick in a footnote:
$ export LC_ALL=C; sort hw.txt
Chào thế giới! Vietnamese
Habari dunia! Kiswahili
Hallo verden! Norwegian/Bokmal
Hallo, wrâld Frisian
Halló heimur! Icelandic
Laba ryta, pasauli! Lithuanian
Saluton Mondo! Esperanto
Sveika, pasaule! Latvian
Привет, мир! Russian
สวัสดีราคาถูก! Thai
ሠላም ዓለም! Amharic
안녕, 세상! Korean
Which seems to be what I was looking for.
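As far as I understand it, LC_ALL=C makes sort compare raw byte values, and UTF-8 is designed so that byte order matches code point order. That would explain why this output is identical to the Python one. A minimal Python 3 sketch of that equivalence, assuming the text is UTF-8 (the sample list and variable names are mine, not from the file):

```python
# A few lines from the file, inlined so the snippet is self-contained.
lines = [
    "안녕, 세상! Korean",
    "Hallo, wrâld Frisian",
    "Hallo verden! Norwegian/Bokmal",
    "Halló heimur! Icelandic",
    "Привет, мир! Russian",
    "ሠላም ዓለም! Amharic",
]

# Order by Unicode code points (Python's default str comparison).
by_code_point = sorted(lines)

# Order by the raw UTF-8 bytes, which is roughly what a byte-wise sort sees.
by_utf8_bytes = sorted(lines, key=lambda s: s.encode("utf-8"))

print(by_code_point == by_utf8_bytes)  # prints True
```

Note that "Hallo verden!" lands before "Hallo, wrâld" in both orderings, simply because the space (0x20) is a smaller byte than the comma (0x2C).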
3 comments.
Technorati tags: Code, Linguistic Computing
Bash cannot possibly be your problem, since sort (like most other commands you run through bash) is a separate program. Python is doing better because you are explicitly telling it you’re dealing with UTF-8 text.
I’m not sure that what you’re trying to do actually makes any sense. If there is a correct collation of two text strings in completely unrelated languages and alphabets, that’s news to me.
but… does it make any sense to sort a multi language file? what would it be useful for?
It is just as sensible to sort a multi language file as a single language file.
Such a sorted file is useful as an index, just as a single language file would be useful as an index.
If you have, for example, a word processor document that contains some English text, some Arabic text, and some Greek text, an automatically generated index will contain material in multiple scripts and will have to be sorted initially in some fashion or other. (You may then want to add material before eventually publishing, for example cross-references to equate the same name spelled in English script and in Arabic script.)
If you are sorting any set of fields in a Unicode database, then you expect the same sorting rules to apply for any characters, regardless of what languages are being used. It would not be acceptable for any characters to sort “randomly”. A list generated on Monday must be identical to a list generated on Tuesday from the identical files, different only where the data itself has changed.
Accordingly, you need some kind of generic collating algorithm, and it would be best if this algorithm made sense according to normal human perception of scripts.
See http://www.unicode.org/reports/tr10/ which Patrick Hall has already mentioned for Unicode sort rules for default collation of all Unicode characters. The DUCET table referenced in this discussion is found at http://unicode.org/Public/UCA/latest/allkeys.txt . Any pieces of text sorted according to Unicode collation rules ought to sort exactly the same on any Unicode system.
The Unicode default collation is intended as a generic collation, one in which, for example, æ always follows a. This generic collation can be tailored to fit particular languages. In an English tailoring æ would normally collate as if identical to ae except that it follows ae at the third level (for example Aesop, Æsop, Aethopian). In Scandinavian collations æ collates after z at the first level as a separate letter. But in both English and Scandinavian tailorings the generic Unicode collation rules for Arabic, Hebrew, Han, Amharic and other scripts would be unchanged.
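The two tailorings can be imitated with toy sort keys in Python 3. This is only a rough sketch: the functions english_key and scandinavian_key are hypothetical, handle only plain a-z plus æ, and use hand-made rules in place of real DUCET weights. The word "Zeno" is added to the example words to show where z falls.

```python
# Toy illustration only: real UCA sort keys are built from DUCET weights,
# not from the hand-made rules below.

def english_key(word):
    # First level: ignore case and treat æ as if it were spelled "ae".
    primary = word.lower().replace("æ", "ae")
    # Tie-break (standing in for the later level): a spelling with the
    # ligature sorts after the same word spelled with plain "ae".
    tie = tuple(1 if ch in "æÆ" else 0 for ch in word)
    return (primary, tie)

def scandinavian_key(word):
    # First level: æ is a separate letter placed after z.
    # Only letters present in this alphabet string are supported.
    alphabet = "abcdefghijklmnopqrstuvwxyzæ"
    return [alphabet.index(ch) for ch in word.lower()]

words = ["Æsop", "Aethopian", "Aesop", "Zeno"]

english = sorted(words, key=english_key)
scandinavian = sorted(words, key=scandinavian_key)

print(english)       # ['Aesop', 'Æsop', 'Aethopian', 'Zeno']
print(scandinavian)  # ['Aesop', 'Aethopian', 'Zeno', 'Æsop']
```

The same four words come out in two different orders, and only the treatment of æ differs between the two keys.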
The Unicode collation is, I think, the base collation used in OpenOffice.org for text with no language attribute. If you assign a language attribute to some text, then it will sort using a version of the Unicode Collation Algorithm tailored to that language. I believe most high-level database products follow this procedure.
Note that Microsoft, who got into this collation business before Unicode did, has its own collation algorithm which differs from the Unicode algorithm. I presume this doesn’t matter much, as long as you aren’t simultaneously sorting material with projects using the Unicode system and projects using Microsoft’s system and expecting the results to be the same.
See http://download.microsoft.com/download/c/f/3/cf382f77-0c44-4a4b-b437-58718bc6998e/Unicode%20and%20Collation%20Support%20in%20Microsoft%20SQL%20Server_c.ppt for a PowerPoint presentation discussing Unicode sorting in Microsoft SQL Server.