Python String Reform
O frabjous day!
From Guido van Rossum’s slides about the upcoming major revision of Python, the very best page of all (the Unicode part, of course!):
String Types Reform
- bytes and str instead of str and unicode
- bytes is a mutable array of int (in range(256))
- encode/decode API? bytes(s, "Latin-1")?
- bytes have some str-ish methods (e.g. b1.find(b2))
- but not others (e.g. not b.upper())
- All data is either binary or text
- all text data is represented as Unicode
- conversions happen at I/O time
- Different APIs for binary and text streams
- how to establish file encoding? (Platform decides)
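For anyone who wants to poke at these ideas, here is a rough sketch of how the bullets above look in Python 3 as it eventually shipped. It is only an illustration, not something taken from the slides, and a few details ended up different from what the slides propose: `bytes` is immutable and the mutable variant is `bytearray`, and `bytes` did get an ASCII-only `upper()`. The variable names and the sample string are my own choices.

```python
import io

# str holds Unicode text; the escapes spell "naïve café".
text = "na\u00efve caf\u00e9"

# bytes holds binary data: a sequence of ints in range(256).
data = text.encode("latin-1")
same_data = bytes(text, "latin-1")   # the constructor spelling floated on the slide
first_accent = data[2]               # indexing bytes gives an int, here 0xEF

# Some str-ish methods exist on bytes, e.g. find().
position = data.find(b"caf")

# The mutable flavour ended up being bytearray rather than bytes itself.
buf = bytearray(data)
buf[9] = ord("e")                    # overwrite the final accented byte
plain = buf.decode("latin-1")

# Text and binary data do not mix implicitly.
try:
    text + data
    mixed_ok = True
except TypeError:
    mixed_ok = False

# Conversions happen at I/O time: a text stream wraps a binary stream
# and encodes on the way out.
binary_stream = io.BytesIO()
text_stream = io.TextIOWrapper(binary_stream, encoding="utf-8")
text_stream.write(text)
text_stream.flush()
raw = binary_stream.getvalue()
```

The last part is the interesting one: the same ten characters come out as twelve bytes in UTF-8 but ten bytes in Latin-1, which is exactly why the encoding has to be decided at the stream boundary rather than inside the string.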
I think when these changes come around Python will take the lead in Unicode support among scripting languages.
(…Unless Ruby gets its Unicode act together. I’m not terribly optimistic about that, myself. Like i18n in Rails, it just doesn’t seem to be too high a priority in the zeitgeist. Maybe I’ll be proven wrong; I hope so.)
3 comments.
Well, for Ruby there is some interesting stuff coming; see, for example, Matz’s YAPC::Asia 2006 slides. The translation is there (at the end of the first post): http://www.ruby-forum.com/topic/60928
But well, I have no idea when it will be merged into the CVS tree.
Hi Scritch,
I’ve heard about Matz’s plan to “supersede” Unicode with a “character set independent” system, but frankly I don’t buy it.
He bases his argument on the premise that the following encodings are not included by default in Unicode:
Mojikyo
TRON
GB 18030
This all boils down to the whole topic of Han unification, whereby the Japanese and Chinese character sets are merged, sort of.
Yes, it’s messy. Maybe it wasn’t a good idea to begin with. And arguably, it’s not sufficient for philological work in CJK contexts. And so maybe the whole issue of character encodings has to be started over, by Matz. Because that’s what he’s talking about doing.
Now, Matz is an incredibly smart guy and quite frankly it’s possible that he, alone, can do what an entire consortium of people from around the world could not: come up with a system that will generalize away the entire concept of character encodings. I’m the first to point out that I live purely in scripting land and haven’t the foggiest clue how all this low-level stuff will work.
But UTF-8 is already the only viable choice for encoding the great majority of the world’s languages in a single document on the internet. Given that, it seems nuts to me that he wouldn’t just fix Ruby so that it can handle UTF-8 sensibly, right now.
More power to him, if he thinks he can do this CSI thing soon enough to bring Ruby up to date with most other popular languages out there with regard to text handling — Python, Perl, heck, even Javascript.
Actually, especially Javascript.
I hope so, for Ruby’s sake.
I actually wonder what you ended up using. Py3k will come at about the same time, judging from the slides, but it’s still going to have these len(whatever) calls.
As to what Python has now, it looks to me like even more bondage than what Ruby has to offer:
http://thraxil.org/users/anders/posts/2005/11/01/unicodification/