Kind of Shocking Discovery about open() in Python
So, here’s why the tutorial is rather late:
I ran across a weird… er… feature of Python’s open() command. If you’re waiting for the tutorial, this will only freak you out, so ignore this post, hehe. If you’re a Python guru, I’d like to hear what you think.
Check this out:
$ python2.4
>>> len(open('moby10b.txt').read())
1256167
That file is Moby Dick, from here. Just a normal text file.
Length: 1,256,167 bytes.
Now, here’s the same command running on Windows, with the same file:
C:\> python
>>> len(open('moby10b.txt').read())
1232923
Length: 1,232,923 bytes.
Yes, friends and neighbors, Python on Windows and Python on Linux disagree on how long a file is.
After much digging, I discovered that there is a way around this. I have no idea why, after several years of messing around with Python, I only ran across it once I went looking for a fix.
The trick is to pass in the “Universal Newline” option to open(), like this:
>>> len(open('moby10b.txt', 'U').read())
1232923
That gives the same results on Linux and Windows. Can anyone out there test this on OS X? I presume that it will have the same behavior as Linux, being based on BSD. But good grief. Does this mean that I need to use open(foo, 'U') every time I open a file if I want reliable file lengths across platforms?
To paraphrase my dad, “Beats the hell out of me, commander!”
Okay so, tutorial TOMORROW. We promises.
Or you could go and look at the current, broken version now, and confuse yourself. It will be unbroken tomorrow.
(Thanks to Won for helping me figure out the newline madness.)
9 comments.
To paraphrase Joey, “Whoa!”
It’s because on Windows a newline is the two-character sequence "\r\n" (carriage return and line feed), whereas on Linux a newline is denoted by a single line feed character ("\n").
If you ever need your script to find out what the current platform’s newline is, you can use the os.linesep variable.
http://www.ibiblio.org/g2swap/byteofpython/read/os-module.html
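A quick way to see what os.linesep holds on whatever machine you happen to be on. This is just a sketch; the variable name native is made up for the example.

```python
import os

# os.linesep is the platform's native line terminator as a string:
# '\r\n' on Windows, '\n' on Linux.
native = os.linesep

# repr() makes the otherwise invisible characters visible.
print(repr(native))
print(len(native))
```

Note that this only tells you about the platform, not about the file: a file written on Windows keeps its "\r\n" endings when you copy it to Linux.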
Hi Doug,
Thanks for the explanation. I was pretty sure it had to do with newlines, but there’s one bit that still confuses me: why is the file counted by len() as being shorter on Windows, if the Windows newline is two bytes, \r\n, while on Linux os.linesep is just one byte, \n?
Passing 'U' to open() seems to solve the problem in any case, but I still don’t understand why the numbers work out like they do.
Thanks for stopping by!
I guess this is because read() counts "\r\n" as one character on Windows and as two characters on Linux. If you open the file with "U", then it will always count it as one character.
Just a guess, though.
For me there are two different things. One is the count of characters in the file, and len(open('file').read()) does exactly that ('\r\n' counts for one or two chars depending on the platform). For counting bytes in the file there is os.path.getsize('file'), and it returns exactly the same result on both Windows and Linux.
The file already has \r\n’s in it. When it is read in text mode on Windows, the two characters are interpreted as one. This is simply the behavior of the underlying C library’s text mode showing through in Python. On Linux they are interpreted normally.
If all you want is the size, you shouldn’t be doing file.read() for efficiency anyway. Reading the data requires linear time and may require the OS to throw out cached data. Worse, file.read() will return everything in a single huge allocation that it will immediately throw away.
Better to just get the output from the stat structure. os.path.getsize() as shnur said, or os.stat() or os.fstat() directly.
But I agree this is confusing. The \r\n -> \n convention is so backward. Stupid DOS. (Did CP/M do that, too?)
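Here is a rough sketch of the "ask the OS instead of reading the file" approach from the comments above. The sample text and the temp file are made up for the example; os.path.getsize() and os.fstat() are the standard library calls mentioned in the thread.

```python
import os
import tempfile

# Hypothetical sample data with Windows-style (CRLF) line endings.
data = b"Call me Ishmael.\r\nSome years ago\r\nnever mind how long\r\n"

fd, path = tempfile.mkstemp(suffix=".txt")
try:
    # Write the bytes exactly as given (binary mode, no translation).
    with os.fdopen(fd, "wb") as f:
        f.write(data)

    # Size from the file's metadata; the contents are never read.
    size_from_getsize = os.path.getsize(path)

    # Same number via fstat() on an already-open file.
    with open(path, "rb") as f:
        size_from_fstat = os.fstat(f.fileno()).st_size
finally:
    os.remove(path)

print(size_from_getsize, size_from_fstat, len(data))
```

Both numbers should match len(data) on any platform, because neither call looks at line endings at all.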
Actually, there are two ways of getting this to work. Universal newlines is one of them, and binary mode is the other.
len(open('moby10b.txt', 'U').read()) will convert '\r\n' to '\n' on both Linux and Windows, resulting in a length of 1,232,923 characters read.
len(open('moby10b.txt', 'rb').read()) will leave '\r\n' alone (not converting it to '\n') on both Linux and Windows, resulting in a length of 1,256,167 bytes read.
Now here’s the thing. Look at this file via ls -l in Linux and DIR in Windows. You’ll see that its real length is 1,256,167 bytes.
So if you want the files you read to never be touched in any way, use the 'b' option to open(). If you want newlines to show up as '\n' in your program no matter how the file was encoded, use 'U'.
I don’t think this is specific to Python. It should be the same way in C.
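To make the binary-versus-universal-newlines difference concrete, here is a small sketch using a made-up three-line file rather than Moby Dick. One assumption to flag loudly: it is written for a current Python 3, not the Python 2.4 used in the post. In recent Python 3 releases the 'U' flag is no longer accepted, and plain text mode does the universal newline translation by default, so open(path, "r") plays the role of 'U' here.

```python
import os
import tempfile

# Hypothetical sample file with CRLF line endings, as a Windows editor
# would write it.
data = b"line one\r\nline two\r\nline three\r\n"

fd, path = tempfile.mkstemp(suffix=".txt")
try:
    with os.fdopen(fd, "wb") as f:
        f.write(data)

    # Binary mode: bytes come back untouched, '\r\n' included.
    with open(path, "rb") as f:
        raw = f.read()

    # Text mode with the default newline=None: '\r\n' is translated
    # to '\n' on every platform (the Python 3 stand-in for 'U').
    with open(path, "r", encoding="ascii") as f:
        text = f.read()
finally:
    os.remove(path)

crlf_count = raw.count(b"\r\n")
print(len(raw), len(text), crlf_count)
```

The text-mode length comes out shorter than the binary length by exactly one character per '\r\n'. The same arithmetic would explain the numbers in the post: 1,256,167 minus 1,232,923 is 23,244, which fits one dropped '\r' per line if the file has that many CRLF line endings (not verified here).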