Next TAR header offset recomputation is wrong for GNU sparse 1.0 file combined with 'size' PAX header key #136602

@mxmlnkn

Bug report

Bug description:

For a more detailed description, please see #136601.

I found a bug that causes TAR file parsing to end prematurely for very large sparse files: the computed offset of the next TAR header is off by one 512 B block.

The cause is the way the offset of the next TAR header is recomputed when the PAX header contains a size key that overrides the overflowed (> 8 GiB) size field of the TAR header:

cpython/Lib/tarfile.py

Lines 1562 to 1569 in 47b01da

if "size" in pax_headers:
    # If the extended header replaces the size field,
    # we need to recalculate the offset where the next
    # header starts.
    offset = next.offset_data
    if next.isreg() or next.type not in SUPPORTED_TYPES:
        offset += next._block(next.size)
    tarfile.offset = offset

The problem is that next.offset_data is used for this recomputation even though next.offset_data gets overwritten in _proc_gnusparse_10:

next.offset_data = tarfile.fileobj.tell()

As a result, the computed offset of the next TAR header is too large by the number of blocks it takes to store the sparse map.
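To make the arithmetic concrete, here is a minimal sketch that uses the offset_data and size values printed by the reproducer further down. The block function mirrors the rounding done by TarInfo._block. Two things are assumptions on my side and not taken from tarfile itself: that the printed size is the stored size including the sparse map, and that the map occupies exactly one block in this archive. The variable names are made up for illustration.

```python
BLOCKSIZE = 512  # same value as tarfile.BLOCKSIZE


def block(count):
    """Round count up to a multiple of BLOCKSIZE, like TarInfo._block."""
    blocks, remainder = divmod(count, BLOCKSIZE)
    if remainder:
        blocks += 1
    return blocks * BLOCKSIZE


# Values printed by the reproducer for the member "sparse".
offset_data_after_map = 2048  # tarInfo.offset_data, already moved past the map
stored_size = 9653191172      # tarInfo.size

# Assumption: the sparse map fits into a single 512 B block.
map_bytes = BLOCKSIZE
offset_data_before_map = offset_data_after_map - map_bytes

# What the recomputation does: start from the overwritten offset_data.
computed = offset_data_after_map + block(stored_size)
# What it would need: start from where the member data really begins.
expected = offset_data_before_map + block(stored_size)

print(computed - expected)
```

Under these assumptions the difference is exactly one block, which matches the observation that the next header is missed by 512 B.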

But maybe I am wrong and have overlooked something. I can say that this change fixes my test case:

diff --git a/Lib/tarfile.py b/Lib/tarfile.py
index 068aa13ed7..7f3e62f5a2 100644
--- a/Lib/tarfile.py
+++ b/Lib/tarfile.py
@@ -1565,7 +1565,7 @@ def _proc_pax(self, tarfile):
                 # header starts.
                 offset = next.offset_data
                 if next.isreg() or next.type not in SUPPORTED_TYPES:
-                    offset += next._block(next.size)
+                    offset += next._block(next.size) - BLOCKSIZE
                 tarfile.offset = offset
 
         return next

Minimal reproducer (tested on EXT4 with GNU tar 1.35):

echo bar > foo
echo bar > sparse
fallocate -l 9G sparse
echo bar >> sparse
fallocate --punch-hole -o 1G -l 10M sparse
tar --numeric-owner --format=pax --sparse-version=1.0 -cSf sparse.tar sparse foo
ls -la sparse.tar
# -rw-rw-r-- 1 user user 9663682560 Jul 13 14:14 sparse.tar
tar tvlf sparse.tar
# -rw-rw-r-- 1000/1000 9663676420 2025-07-13 14:13 sparse
# -rw-rw-r-- 1000/1000          4 2025-07-13 14:11 foo
python3 -c 'import sys, tarfile;
[print(tarInfo.sparse, tarInfo.offset, tarInfo.offset_data, tarInfo.size, tarInfo.name)
for tarInfo in tarfile.open(sys.argv[1])]' sparse.tar
# [(0, 1073741824), (1084227584, 8579448836), (9663676420, 0)] 0 2048 9653191172 sparse
#  -> foo is missing!
cat sparse.tar | xz -9 | zstd -19 | base64

The reproducer file sparse-file-larger-than-8GiB-followed-by-normal-file.tar.xz.zst, base64-encoded:

cat <<EOF | base64 -d | zstd -d > sparse-file-larger-than-8GiB-followed-by-normal-file.tar.xz
KLUv/QRojBIA1CP9N3pYWgAABObWtEYCACEBHAAAABDPWMz//5wCcV0AFwvGh5JaO6ePxyUOuA/z
XtE/5U/vyT1WUwqPhMr1HTeZeJyWILwrrtDwH0eKx6KKGcU7D2aYidf/9bCtFMcWp8KxDA1FLF58
w9bO4J+eDKd9QfIZFPCutpNB91dMk9bSVazx9pUcWEWn2r0SWsv1BtSYmVDmdKaMdGC/Epx8bcRA
nm5Joy2Tgi3O7VouoCAqha+1YYNOQyyB4sG+tDbfLGdW6fyZMztJ/lRFQwtlFpDLHGFpia92kkke
+2a/mwMvPc58aiT5X56QuH2mw1OhsrBKnbYYnT89BJjyAh2GTOeDbtZ/lLDGwhvxkXlnCm/M8Qiq
fUGfqAjnBeikNY2nodSBFo8YQh+636fk9xfuTQ3kKQ8qEWa613HftzHJ/X/ha1bKD91T/SPTCgd/
rhyvFtn8FBBiUS7UayidinQBNmGebczIaRsKUQKoffUTC9EbCrRXDQjQMjfDyo7N/eDIxD7jBImH
Dv8Qk/hxeFn4C83/lShGD6n8fN77mjAuVsCPhfODgcBlxCVT+PWRNjEFpbDub8FwTUcM0ZERqq1g
HbrOsScYXFmG6WZSWL7pdqxZ5OVbBQj5x9qt/PtSK3TNHlsgQvndUz34KWQJO4DLKmzftTvwxL0u
X6oPPktmQpAT+5I61gCf/xABKwDsc1On/b6ufDEan7eNMW5wnqcjX+woy4XRlZiKfiqR8id19xnA
BphNmP3Yr9WQD1EPP7IEADz8NCsncIBOR5aC/hM+FaZUAAAAAQD9/778uQYCRAAAAAEA/f85AAJE
AAAAAQD9/zkAAkQAAAABAP3/OQACRAAAAAEA/f85AAJEAAAAAQD9/zkAAkQAAAABAP3/OQACRAAA
AAEA/f85AAJEAAAAAQD9/zkAAh0GAIQLkODIaNUNHQG+Ib0xB5201x9u5Typk+S1zSY18D/tc2o+
BXKM/RM9v6MTQoFntxwNm0So6CELgft8dinBPFJBg583tJn+q69PwBnThZQjYTzvNhv0fkxX4Gjm
SgwnOrb7GU5pc2qtcrNCcHrPaNkQicmkdyzESbMAA8S2zfCiJIzpnN25EroA08/3fFWQ44Jfrake
AIiPdXLNPTRNAAGY3VWAsID7IwAAdKr/2BQXOzADAAAAAARZWgIAG0DNWVsOgERj+N4=
EOF
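A small check for the symptom could look like the sketch below. It only uses tarfile.open and TarFile.getnames; member_names is a hypothetical helper name, and the archive path is passed on the command line. With the default read mode, tarfile detects xz compression transparently, so the decoded .tar.xz should be usable directly.

```python
import sys
import tarfile


def member_names(path):
    """Return the names of all members that tarfile finds in the archive."""
    with tarfile.open(path) as archive:
        return archive.getnames()


if __name__ == "__main__" and len(sys.argv) > 1:
    names = member_names(sys.argv[1])
    print(names)
    # The reproducer archive contains "sparse" followed by "foo".
    if "foo" not in names:
        print("foo is missing: parsing stopped after the sparse member")
```

On an affected version I would expect the list to contain only sparse, as in the one-liner output above.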

CPython versions tested on:

CPython main branch

Operating systems tested on:

Linux
