Tarfile is unnecessarily slow #121267

Closed
jforberg opened this issue Jul 2, 2024 · 3 comments
Labels
performance (Performance or resource usage), stdlib (Python modules in the Lib dir), type-feature (A feature request or enhancement)

Comments

@jforberg
Contributor

jforberg commented Jul 2, 2024

Bug report

Bug description:

There is room for improvement in tarfile write performance. In a simple benchmark I find that tarfile spends most of its time doing repeated user name/group name queries.
https://gist.github.com/jforberg/86af759c796199740c31547ae828aef2

CPython versions tested on:

CPython main branch

Operating systems tested on:

Linux

Linked PRs

jforberg added the type-bug (An unexpected behavior, bug, or error) label Jul 2, 2024
jforberg added a commit to jforberg/cpython that referenced this issue Jul 2, 2024
Tarfile in the default write mode spends much of its time resolving UIDs
into usernames and GIDs into group names. By caching these mappings, a
significant speedup can be achieved.

In my simple benchmark[1], this extra caching speeds up tarfile by 8x.

[1] https://gist.github.com/jforberg/86af759c796199740c31547ae828aef2
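
To illustrate the idea in the commit message, here is a minimal sketch of caching ID-to-name lookups. It is not the code from the linked commit: the helper `make_cached_lookup` and the names `uname_for` and `gname_for` are hypothetical, and the sketch assumes a Unix system where the `pwd` and `grp` modules are available.

```python
import grp
import pwd


def make_cached_lookup(lookup):
    """Wrap an ID -> name lookup so each ID is resolved at most once."""
    cache = {}

    def cached(key):
        if key not in cache:
            try:
                cache[key] = lookup(key)
            except KeyError:
                # pwd/grp raise KeyError for unknown IDs; remember the miss
                # as an empty name so it is not queried again.
                cache[key] = ""
        return cache[key]

    return cached


# Hypothetical helpers: resolve names through the system databases,
# but only once per distinct UID/GID.
uname_for = make_cached_lookup(lambda uid: pwd.getpwuid(uid).pw_name)
gname_for = make_cached_lookup(lambda gid: grp.getgrgid(gid).gr_name)
```

When an archive contains many files owned by the same few users and groups, almost every lookup after the first becomes a dictionary hit. The trade-off is that a name changed while the cache is alive would not be picked up.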
@gaogaotiantian
Member

Could you paste your cProfile output? I was a bit surprised that most of the time is spent on reading the pwd and grp.
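
For readers who want to produce a comparable profile themselves, the following is a rough sketch. It is not the reporter's benchmark from the gist: the function name `profile_tar_write`, the file count, and the file contents are assumptions chosen for illustration only.

```python
import cProfile
import io
import os
import pstats
import tarfile
import tempfile


def profile_tar_write(n_files=200):
    """Profile adding n_files tiny files to an uncompressed tar archive."""
    with tempfile.TemporaryDirectory() as tmp:
        paths = []
        for i in range(n_files):
            path = os.path.join(tmp, f"f{i}.txt")
            with open(path, "w") as f:
                f.write("x")
            paths.append(path)

        profiler = cProfile.Profile()
        profiler.enable()
        with tarfile.open(os.path.join(tmp, "out.tar"), "w") as tar:
            for path in paths:
                tar.add(path, arcname=os.path.basename(path))
        profiler.disable()

    # Render the 15 entries with the highest cumulative time as text.
    out = io.StringIO()
    pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(15)
    return out.getvalue()
```

Calling `print(profile_tar_write())` shows where the time goes on a given machine. How much of it is attributed to user and group name lookups will depend on the Python version and on how the system resolves names.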

@jforberg
Contributor Author

jforberg commented Jul 2, 2024

@gaogaotiantian Here is the output:

cprofile.txt

jforberg added a commit to jforberg/cpython that referenced this issue Jul 3, 2024
hauntsaninja added a commit that referenced this issue Oct 30, 2024
Tarfile in the default write mode spends much of its time resolving UIDs
into usernames and GIDs into group names. By caching these mappings, a
significant speedup can be achieved.

In my simple benchmark[1], this extra caching speeds up tarfile by 8x.

[1] https://gist.github.com/jforberg/86af759c796199740c31547ae828aef2

---------

Co-authored-by: Tian Gao <[email protected]>
Co-authored-by: Bénédikt Tran <[email protected]>
Co-authored-by: Shantanu <[email protected]>
picnixz added the type-feature (A feature request or enhancement), performance (Performance or resource usage), and stdlib (Python modules in the Lib dir) labels and removed the type-bug (An unexpected behavior, bug, or error) label Oct 30, 2024
@picnixz
Member

picnixz commented Oct 30, 2024

Recategorizing this issue as per #121269 (comment).

picnixz added a commit to picnixz/cpython that referenced this issue Dec 8, 2024
…on#121269)

ebonnal pushed a commit to ebonnal/cpython that referenced this issue Jan 12, 2025
…on#121269)

4 participants