Hacklog: Blogamundo — poking holes in the language barrier since approximately 1 month from now

Kind of Shocking Discovery about open() in Python

Written by Patrick Hall, 2 years, 3 months ago.

So, here’s why the tutorial is rather late:

I ran across a weird… er… feature of Python’s open() command. If you’re waiting for the tutorial, this will only freak you out, so ignore this post, hehe. If you’re a Python guru, I’d like to hear what you think.

Check this out:

$ python2.4
>>> len(open('moby10b.txt').read())
1256167

That file is Moby Dick, from here. Just a normal text file.

Length: 1,256,167 bytes.

Now, here’s the same command running on Windows, with the same file:

C:> python
>>> len(open('moby10b.txt').read())
1232923

Length: 1,232,923 bytes.

Yes, friends and neighbors, Python on Windows and Python on Linux disagree on how long a file is.

*head a splodes*

After much digging, I discovered that there is a way around this. I have no idea why, after several years of messing around with Python, I have only ever run across this solution when I went looking for a solution.

The trick is to pass in the “Universal Newline” option to open(), like this:

>>> len(open('moby10b.txt', 'U').read())
1232923

That gives the same results on Linux and Windows. Can anyone out there test this on OS X? I presume that it will have the same behavior as Linux, being based on BSD. But good grief. Does this mean that I need to use open(foo, 'U') every time I open a file if I want reliable file lengths across platforms?
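If you want to poke at this without downloading the whole book, here is a minimal sketch. It assumes Python 3, not the Python 2.4 used in the sessions above. In Python 3, text mode translates newlines on every platform by default, and the 'U' flag was removed in 3.11, so a plain open(path) plays the role of 'U'. The sample text and the read_lengths helper are made up for illustration.

```python
import os
import tempfile

# Hypothetical stand-in for moby10b.txt: three lines with "\r\n" endings.
SAMPLE = b"Call me Ishmael.\r\nSome years ago\r\nnever mind how long\r\n"


def read_lengths(data):
    """Write data to a temp file and return (raw_bytes, translated_chars)."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        # Binary mode: nothing is translated, so this matches the on-disk size.
        with open(path, "rb") as f:
            raw = len(f.read())
        # Text mode with the default newline=None: each "\r\n" is read as "\n"
        # on every platform, which is what the old 'U' flag asked for.
        with open(path, "r", encoding="ascii") as f:
            translated = len(f.read())
        return raw, translated
    finally:
        os.remove(path)


raw, translated = read_lengths(SAMPLE)
# The gap between the two lengths is one character per "\r\n" pair.
print(raw, translated, raw - translated)
```

The two numbers differ by exactly the number of "\r\n" pairs in the sample, which is the same kind of gap as the one between the two Moby Dick lengths.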

To paraphrase my dad, “Beats the hell out of me, commander!”

Okay so, tutorial TOMORROW. We promises.

Or you could go and look at the current, broken version now, and confuse yourself. It will be unbroken tomorrow.

(Thanks to Won for helping me figure out the newline madness.)

9 Comments for 'Kind of Shocking Discovery about open() in Python'

  1. Comment received 2 years, 3 months ago from quotidia

    To paraphrase Joey, “Whoa!”

  2. Comment received 2 years, 3 months ago from Doug @ Straw Dogs

    It’s because on Windows a newline is “\r\n” (carriage return and line feed), whereas on Linux a newline is denoted by a single linefeed character (“\n”).

    If you ever need your script to find what the current newline type is then you can use the “os.linesep” variable.

    http://www.ibiblio.org/g2swap/byteofpython/read/os-module.html

  3. Comment received 2 years, 2 months ago from Patrick Hall

    Hi Doug,

    Thanks for the explanation. I was pretty sure it had to do with newlines, but there’s one bit that still confuses me: why does len() count the file as shorter on Windows, where the newline is two bytes (\r\n), than on Linux, where os.linesep is just one byte (\n)?

    Passing 'U' to open() seems to solve the problem in any case, but I still don’t understand why the numbers work out like they do.

    Thanks for stopping by!

  4. Comment received 2 years, 2 months ago from Vincent Untz

    I guess this is because read() counts “\r\n” as one character on Windows and as two characters on Linux. If you open the file with “U”, then it will always count it as one character.

    Just a guess, though.

  5. Comment received 2 years, 2 months ago from shnur

    For me there are two different things. One is the count of characters in the file, and len(open('file').read()) does exactly that ('\r\n' counts for one or two chars depending on the platform). For counting bytes in the file there is os.path.getsize('file'), and it returns exactly the same result on both Windows and Linux.

  6. Comment received 2 years, 2 months ago from dawg

    The file already has \r\n sequences in it. When read on Windows in text mode, the two characters are interpreted as one. This is simply the behavior of the system call reflected in Python. On Linux they are interpreted normally.

  7. Comment received 2 years, 2 months ago from Scott Lamb

    If all you want is the size, you shouldn’t be doing file.read() for efficiency anyway. Reading the data requires linear time and may require the OS to throw out cached data. Worse, file.read() will return everything in a single huge allocation that it will immediately throw away.

    Better to just get the output from the stat structure. os.path.getsize() as shnur said, or os.stat() or os.fstat() directly.

    But I agree this is confusing. The \r\n -> \n convention is so backward. Stupid DOS. (Did CP/M do that, too?)

  8. Comment received 2 years, 2 months ago from Robin Munn

    Actually, there are two ways of getting this to work. Universal newlines is one of them, and binary mode is the other.

    len(open('moby10b.txt', 'U').read()) will convert '\r\n' to '\n' on both Linux and Windows, resulting in a length of 1,232,923 characters read.

    len(open('moby10b.txt', 'rb').read()) will leave '\r\n' alone (not converting it to '\n') on both Linux and Windows, resulting in a length of 1,256,167 bytes read.

    Now here’s the thing. Look at this file via ls -l in Linux and DIR in Windows. You’ll see that its real length is 1,256,167 bytes.

    So if you want the files you read to never be touched in any way, use the 'b' option to open(). If you want newlines to show up as '\n' in your program no matter how the file was encoded, use 'U'.

  9. Comment received 2 years, 2 months ago from Amit Patel

    I don’t think this is specific to Python. It should be the same way in C.
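Pulling the comments together, here is a rough sketch of the size checks mentioned above: os.path.getsize(), os.stat(), os.fstat(), and a binary-mode read. It assumes Python 3. The size_report helper and the two-line sample file are hypothetical, and the last line only repeats the subtraction on the two lengths quoted in the post.

```python
import os
import tempfile


def size_report(path):
    """Return the file size measured several ways, plus the CRLF pair count."""
    with open(path, "rb") as f:
        data = f.read()  # binary mode, so no newline translation
        fstat_size = os.fstat(f.fileno()).st_size
    return {
        "getsize": os.path.getsize(path),
        "stat": os.stat(path).st_size,
        "fstat": fstat_size,
        "bytes_read": len(data),
        "crlf_pairs": data.count(b"\r\n"),
    }


# Hypothetical sample data: two lines, each ending in carriage return + linefeed.
fd, demo_path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"one\r\ntwo\r\n")
try:
    report = size_report(demo_path)
finally:
    os.remove(demo_path)
print(report)

# The two lengths quoted in the post: raw bytes minus translated characters.
moby_difference = 1256167 - 1232923
print(moby_difference)
```

All four size figures agree because none of them translate newlines. The gap of 23,244 between the two Moby Dick lengths would be the number of '\r\n' pairs in the file, assuming every line in it ends that way.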
