tarfile indeterminate TarInfo.size when PAX headers contain size and GNU.sparse.realsize keys at the same time #136601

@mxmlnkn

Bug report

Bug description:

Hello,

I am currently debugging this issue.

I have noticed that the bug can be reproduced when the problematic file is truncated to 9 GiB, but it does not happen when the file is truncated to 8 GiB.

The problem seems to be that the offset of the next member is computed incorrectly. It seems to point 512 B past the correct TAR header, which, in this case, lands in the data for the extended attributes, such as 30 mtime=1752348[...].
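To compare where tarfile believes each header and data region starts, a small inspection helper can be useful. This is a minimal sketch, not part of the original report: `list_member_offsets` and `build_demo_archive` are hypothetical names, and the demo archive is tiny stand-in data rather than the multi-GiB sparse file discussed here.

```python
import io
import tarfile


def list_member_offsets(fileobj):
    """Return (name, header offset, data offset, size, pax_headers) per member."""
    rows = []
    with tarfile.open(fileobj=fileobj, mode="r:") as archive:
        for info in archive:
            rows.append((info.name, info.offset, info.offset_data,
                         info.size, dict(info.pax_headers)))
    return rows


def build_demo_archive():
    """Build a tiny in-memory pax archive (hypothetical stand-in data)."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w",
                      format=tarfile.PAX_FORMAT) as archive:
        for name, payload in (("a.txt", b"x" * 700), ("b.txt", b"y" * 10)):
            info = tarfile.TarInfo(name)
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    for row in list_member_offsets(build_demo_archive()):
        print(row)
```

Running the same helper on the failing archive (opened in binary mode) should show whether a member's header offset is consistent with the previous member's data offset plus its block-rounded size.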

One of the differences seems to be this code part, which is not hit for the working case:

cpython/Lib/tarfile.py

Lines 1562 to 1569 in 47b01da

if "size" in pax_headers:
    # If the extended header replaces the size field,
    # we need to recalculate the offset where the next
    # header starts.
    offset = next.offset_data
    if next.isreg() or next.type not in SUPPORTED_TYPES:
        offset += next._block(next.size)
    tarfile.offset = offset
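The recalculation above can be sketched in isolation. The helper names below are hypothetical, `round_up_to_block` mirrors my understanding of what `_block` does (rounding up to 512-byte blocks), and the data offset of 1536 is an assumed value chosen only for illustration. The two sizes are the ones from the failing archive.

```python
BLOCKSIZE = 512  # tar archives are organized in 512-byte blocks


def round_up_to_block(count):
    """Round a byte count up to the next multiple of BLOCKSIZE."""
    blocks, remainder = divmod(count, BLOCKSIZE)
    if remainder:
        blocks += 1
    return blocks * BLOCKSIZE


def next_header_offset(offset_data, size):
    """Offset at which the reader will look for the next header."""
    return offset_data + round_up_to_block(size)


# Assumed data offset; the sizes come from the failing PAX headers.
OFFSET_DATA = 1536
from_size = next_header_offset(OFFSET_DATA, 9602318848)      # 'size'
from_realsize = next_header_offset(OFFSET_DATA, 9663676416)  # 'GNU.sparse.realsize'

if __name__ == "__main__":
    print(from_size, from_realsize, from_realsize - from_size)
```

Whichever value ends up in `next.size` determines where the next header is expected, so the two keys lead to different offsets.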

While looking into the code above, in particular into _apply_pax_info, I noticed that there is no defined order in which the size is applied, even though several keys can set it!

cpython/Lib/tarfile.py

Lines 1615 to 1634 in 47b01da

def _apply_pax_info(self, pax_headers, encoding, errors):
    """Replace fields with supplemental information from a previous
       pax extended or global header.
    """
    for keyword, value in pax_headers.items():
        if keyword == "GNU.sparse.name":
            setattr(self, "path", value)
        elif keyword == "GNU.sparse.size":
            setattr(self, "size", int(value))
        elif keyword == "GNU.sparse.realsize":
            setattr(self, "size", int(value))
        elif keyword in PAX_FIELDS:
            if keyword in PAX_NUMBER_FIELDS:
                try:
                    value = PAX_NUMBER_FIELDS[keyword](value)
                except ValueError:
                    value = 0
            if keyword == "path":
                value = value.rstrip("/")
            setattr(self, keyword, value)

In the non-working case, the PAX headers look like this:

{'GNU.sparse.major': '1',
 'GNU.sparse.minor': '0',
 'GNU.sparse.name': 'userdata',
 'GNU.sparse.realsize': '9663676416',
 'atime': '1752349406.975921575',
 'ctime': '1752349534.57652562',
 'mtime': '1752349534.57652562',
 'size': '9602318848'}

I.e., the size member first gets set to the value of GNU.sparse.realsize and then to the value of size. The debug output looks like this:

[_apply_pax_info] SET SIZE to: 9663676416 from key: GNU.sparse.realsize
[_apply_pax_info] SET SIZE to: 9602318848 from key: size
[_apply_pax_info] SET key to: 1752349534.5765257 from key: mtime

Is it specified that the PAX headers must always appear in this order? Otherwise, one might just as well encounter them like this:

{'atime': '1752349406.975921575',
 'ctime': '1752349534.57652562',
 'mtime': '1752349534.57652562',
 'size': '9602318848',
 'GNU.sparse.major': '1',
 'GNU.sparse.minor': '0',
 'GNU.sparse.name': 'userdata',
 'GNU.sparse.realsize': '9663676416'}

and either one of these orders would be a bug.
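The order dependence can be reproduced without an archive. The following is a simplified stand-in for the size-related branches of `_apply_pax_info`, not the real method: `apply_size_keys` is a hypothetical name, and it relies only on the fact that Python dicts iterate in insertion order.

```python
from types import SimpleNamespace


def apply_size_keys(pax_headers):
    """Apply only the size-related keys, last writer wins (simplified)."""
    member = SimpleNamespace(size=0)
    for keyword, value in pax_headers.items():
        if keyword in ("GNU.sparse.size", "GNU.sparse.realsize", "size"):
            member.size = int(value)
    return member.size


# Same keys and values, different iteration order.
sparse_first = {"GNU.sparse.realsize": "9663676416", "size": "9602318848"}
size_first = {"size": "9602318848", "GNU.sparse.realsize": "9663676416"}

if __name__ == "__main__":
    print(apply_size_keys(sparse_first))
    print(apply_size_keys(size_first))
```

With `sparse_first` the final size is 9602318848, and with `size_first` it is 9663676416, so identical header contents yield different `size` values depending on their order.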

The working case does not have this ambiguity:

{'GNU.sparse.major': '1',
 'GNU.sparse.minor': '0',
 'GNU.sparse.name': 'userdata',
 'GNU.sparse.realsize': '8589934592',
 'atime': '1752349538.445543898',
 'ctime': '1752351104.53673501',
 'mtime': '1752351104.53673501'}

The debug output looks like this:

[_apply_pax_info] SET SIZE to: 8589934592 from key: GNU.sparse.realsize
[_apply_pax_info] SET key to: 1752351104.536735 from key: mtime

I.e., even if there is no ordering problem, the TarInfo.size member already has different semantics in the two cases: in one it contains GNU.sparse.realsize, and in the other it contains [PAXHeader.]size.

CPython versions tested on:

CPython main branch

Operating systems tested on:

Linux
