gh-129005: Align FileIO.readall between _pyio and _io #129705

cmaloney · 2025-02-05T21:33:24Z

Utilize `bytearray.resize()` and `os.readinto()` to reduce copies and match the behavior of `_io.FileIO.readall()`.

There is still one extra copy, so peak memory is twice what the C `FileIO` needs, because there is currently no zero-copy path from `bytearray` to `bytes`.

On my system, reading a 2GB file with
./python -m test -M8g -uall test_largefile -m test.test_largefile.PyLargeFileTest.test_large_read -v

goes from ~2.7 seconds to ~2.2 seconds. The C `_io` implementation takes ~1.2 seconds, so a performance gap remains, but it is smaller.

Issue: Reduce copies when reading files in pyio, match behavior of _io #129005

vstinner · 2025-02-05T22:17:02Z

Lib/_pyio.py

-            result += chunk
-
+            bytes_read += n
+        result.resize(bytes_read)


result = memoryview(result)[bytes_read:] would avoid a truncation which can imply a memory copy in the worst case, no?

A shrinking resize in bytearray doesn't actually reallocate unless the buffer's capacity is more than 2x the requested size (https://github.com/python/cpython/blob/main/Objects/bytearrayobject.c#L201-L214). Otherwise it just updates its internal "this is how long the bytes is" counter. For things like a full-file readall with a known size, the buffer should already be just one byte over the right size.
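A small sketch of how that shrink behavior can be observed. It assumes CPython, where `sys.getsizeof()` on a bytearray includes the allocated capacity, and it uses slice deletion instead of `bytearray.resize()` so it also runs on versions that predate `resize()`; as far as I can tell both go through the same C resize routine, but treat that as an assumption.

```python
import sys

buf = bytearray(1000)
full = sys.getsizeof(buf)        # on CPython this includes allocated capacity

del buf[900:]                    # minor shrink: 900 is over half the capacity
after_minor = sys.getsizeof(buf)

del buf[100:]                    # major shrink: 100 is under half the capacity
after_major = sys.getsizeof(buf)

# Expectation on CPython: the first shrink keeps the allocation,
# the second one reallocates down to a smaller buffer.
print(full, after_minor, after_major)
```

The exact byte counts depend on the interpreter build, so only the relative sizes are meaningful here.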

My current plan is to make it so that neither bytes(bytearray(10)) nor bytearray(b'\0' * 10) copies (ongoing discussion in https://discuss.python.org/t/add-zero-copy-conversion-of-bytearray-to-bytes-by-providing-bytes/79164). Having a memoryview would mean there is more than one reference to the bytearray, so I couldn't use that optimization.
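For context, a minimal illustration of the current behavior this plan is working around: `bytes(bytearray)` produces an independent copy, and a live memoryview export keeps the bytearray from changing size. This only shows today's semantics; it does not implement or predict the proposed zero-copy path.

```python
buf = bytearray(b"abc")

snapshot = bytes(buf)      # today this copies the data
buf[0] = ord("X")          # mutating the bytearray leaves the snapshot alone

view = memoryview(buf)     # an exported buffer pins the bytearray's size
try:
    buf.extend(b"!")
    resized = True
except BufferError:
    resized = False        # resizing is refused while the view is alive
finally:
    view.release()

buf.extend(b"!")           # fine once the view has been released
```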

Ok, I'm fine with using result.resize() here.

Co-authored-by: Victor Stinner <[email protected]>

cmaloney · 2025-02-06T00:23:25Z

The Hypothesis test failure in binascii is, I'm pretty sure, unrelated.

vstinner · 2025-02-06T09:54:07Z

Lib/_pyio.py

-                bufsize += max(bufsize, DEFAULT_BUFFER_SIZE)
-            n = bufsize - len(result)
+            if bytes_read >= bufsize:
+                # Parallels _io/fileio.c new_buffersize


In the C code, new_buffersize() argument is bytes_read, not bufsize. You may keep new_buffersize() as a private module-level function.
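A sketch of what such a module-level helper could look like when written in terms of `bytes_read`. The constants and the growth rule below are my reading of `new_buffersize()` in `Modules/_io/fileio.c` and should be treated as assumptions, not as the code that was merged.

```python
# Assumed constants, based on my reading of Modules/_io/fileio.c.
SMALL_CHUNK = 8 * 1024           # minimum growth step (assumption)
LARGE_BUFFER_CUTOFF = 65536      # above this, grow by 1/8 (assumption)


def new_buffersize(bytes_read):
    """Return the next buffer size given how many bytes were read so far."""
    if bytes_read > LARGE_BUFFER_CUTOFF:
        addend = bytes_read >> 3     # large buffers: grow by 12.5%
    else:
        addend = 256 + bytes_read    # small buffers: roughly double
    # Avoid tiny read() calls by growing at least SMALL_CHUNK bytes.
    return bytes_read + max(addend, SMALL_CHUNK)
```

Growing proportionally to the current size keeps the total work amortized linear, while the smaller factor for large buffers limits over-allocation.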

Updated the loop to no longer use bufsize at all. This was the only line that used it, and it feels more Pythonic to me to just use len(result).

That enables rewriting to:

    try:
        # Read until EOF (n == 0)
        while n := os.readinto(self._fd, memoryview(result)[bytes_read:]):
            bytes_read += n
            if bytes_read >= len(result):
                result.resize(_new_buffersize(bytes_read))
    except BlockingIOError:
        if not bytes_read:
            return None

    assert len(result) - bytes_read >= 1, \
        "os.readinto buffer size 0 will result in erroneous EOF / returns 0"
    result.resize(bytes_read)
    return bytes(result)

which feels cleaner, but also starts changing structure relative to _io version.

Decided to refactor to this. The control flow feels a lot simpler to me and a lot more readable than the branches and breaks.
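A self-contained sketch of the same loop shape operating on a plain file descriptor, for trying the idea outside `_pyio`. The helpers `_readinto` and `_resize` are hypothetical shims that fall back to older APIs when `os.readinto()` or `bytearray.resize()` are unavailable, and the doubling growth policy is a simplification rather than the `_new_buffersize` used in the PR.

```python
import os
import tempfile


def _readinto(fd, buf):
    """Read into a writable buffer; return the byte count (0 means EOF)."""
    if hasattr(os, "readinto"):
        return os.readinto(fd, buf)
    data = os.read(fd, len(buf))     # fallback: costs one extra copy
    buf[:len(data)] = data
    return len(data)


def _resize(buf, size):
    """Resize a bytearray in place, zero-filling when it grows."""
    if hasattr(buf, "resize"):
        buf.resize(size)
    elif size < len(buf):
        del buf[size:]
    else:
        buf.extend(bytes(size - len(buf)))


def readall(fd, bufsize=8192):
    """Read fd until EOF; return bytes, or None if nothing was available."""
    result = bytearray(bufsize)
    bytes_read = 0
    try:
        # The temporary memoryview is released after each call, so the
        # bytearray is free to be resized inside the loop.
        while n := _readinto(fd, memoryview(result)[bytes_read:]):
            bytes_read += n
            if bytes_read >= len(result):
                _resize(result, 2 * len(result))   # simplified growth policy
    except BlockingIOError:
        if not bytes_read:
            return None
    _resize(result, bytes_read)
    return bytes(result)             # the remaining copy discussed in the PR


def demo():
    """Round-trip random data through a temporary file."""
    payload = os.urandom(50_000)
    with tempfile.TemporaryFile() as f:
        f.write(payload)
        f.flush()
        f.seek(0)
        return readall(f.fileno()) == payload
```

Calling `demo()` should return `True` if the loop reassembles the file contents correctly.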

cmaloney · 2025-02-06T22:17:19Z

The Tests / Windows / build and test (Win32) failure is a urllib.error.HTTPError : HTTP error 504: Gateway Timeout [D:\a\cpython\cpython\PCbuild\pythoncore.vcxproj]. I believe it is unrelated.

vstinner

LGTM

vstinner · 2025-02-07T11:06:18Z

Merged, thank you.

bedevere-app bot added the awaiting review label Feb 5, 2025

cmaloney changed the title ~~gh-12005: Align FileIO.readall between _pyio and _io~~ gh-129005: Align FileIO.readall between _pyio and _io Feb 5, 2025

bedevere-app bot mentioned this pull request Feb 5, 2025

Reduce copies when reading files in pyio, match behavior of _io #129005

Open

vstinner reviewed Feb 5, 2025

Update Lib/_pyio.py

4520ecc

Co-authored-by: Victor Stinner <[email protected]>

vstinner reviewed Feb 6, 2025

Use len(result) rather than bufsize, _new_buffersize, make in terms of bytes_read

6c3ac57

cmaloney and others added 4 commits February 6, 2025 14:41

Merge branch 'main' into fileio_readall

0e15dee

Simplify control structure

b50fb66

Fix whitespace

4da4e31

Merge branch 'main' into fileio_readall

09b52ed

vstinner approved these changes Feb 7, 2025

bedevere-app bot added awaiting merge and removed awaiting review labels Feb 7, 2025

vstinner merged commit a3d5aab into python:main Feb 7, 2025
43 checks passed

bedevere-app bot removed the awaiting merge label Feb 7, 2025

cmaloney deleted the fileio_readall branch February 7, 2025 18:46
