Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gh-118761: email.quoprimime removing re import #132046

Open
wants to merge 11 commits into
base: main
Choose a base branch
from

Conversation

Marius-Juston
Copy link
Contributor

@Marius-Juston Marius-Juston commented Apr 3, 2025

This pull request removes the re module from the email.quoprimime, thus increasing the import speed from 5676 us to 3669 us (a 60% import speed increase );

From

marius@DESKTOP-IOUM5DH:~/cpython$ ./python -X importtime -c "import email.quoprimime"
import time: self [us] | cumulative | imported package
import time:        88 |         88 |   _io
import time:        19 |         19 |   marshal
import time:       143 |        143 |   posix
import time:       332 |        580 | _frozen_importlib_external
import time:        42 |         42 |   time
import time:       125 |        166 | zipimport
import time:        25 |         25 |     _codecs
import time:       290 |        315 |   codecs
import time:       190 |        190 |   encodings.aliases
import time:       417 |        921 | encodings
import time:        90 |         90 | encodings.utf_8
import time:        44 |         44 | _signal
import time:        22 |         22 |     _abc
import time:        93 |        114 |   abc
import time:       484 |        484 |   _collections_abc
import time:       136 |        733 | io
import time:        22 |         22 |       _stat
import time:        59 |         80 |     stat
import time:        31 |         31 |       errno
import time:        43 |         43 |       genericpath
import time:        87 |        160 |     posixpath
import time:       283 |        523 |   os
import time:        50 |         50 |   _sitebuiltins
import time:        86 |         86 |   sitecustomize
import time:        30 |         30 |   usercustomize
import time:       216 |        902 | site
import time:       114 |        114 | linecache
import time:       203 |        203 |   email
import time:        22 |         22 |     _string
import time:       140 |        140 |         types
import time:       795 |        935 |       enum
import time:        34 |         34 |         _sre
import time:       130 |        130 |           re._constants
import time:       181 |        311 |         re._parser
import time:        49 |         49 |         re._casefix
import time:       198 |        591 |       re._compiler
import time:        57 |         57 |           itertools
import time:        75 |         75 |           keyword
import time:        41 |         41 |             _operator
import time:       153 |        194 |           operator
import time:        98 |         98 |           reprlib
import time:        32 |         32 |           _collections
import time:       553 |       1006 |         collections
import time:        30 |         30 |         _functools
import time:       346 |       1381 |       functools
import time:        95 |         95 |       copyreg
import time:       311 |       3311 |     re
import time:       365 |       3697 |   string
import time:      1777 |       5676 | email.quoprimime

To

marius@DESKTOP-IOUM5DH:~/cpython$ ./python -X importtime -c "import email.quoprimime"
import time: self [us] | cumulative | imported package
import time:        89 |         89 |   _io
import time:        18 |         18 |   marshal
import time:       130 |        130 |   posix
import time:       305 |        541 | _frozen_importlib_external
import time:        37 |         37 |   time
import time:       115 |        152 | zipimport
import time:        24 |         24 |     _codecs
import time:       273 |        296 |   codecs
import time:       175 |        175 |   encodings.aliases
import time:       387 |        857 | encodings
import time:        83 |         83 | encodings.utf_8
import time:        40 |         40 | _signal
import time:        16 |         16 |     _abc
import time:        88 |        103 |   abc
import time:       422 |        422 |   _collections_abc
import time:       125 |        649 | io
import time:        20 |         20 |       _stat
import time:        54 |         73 |     stat
import time:        29 |         29 |       errno
import time:        39 |         39 |       genericpath
import time:        81 |        148 |     posixpath
import time:       269 |        490 |   os
import time:        48 |         48 |   _sitebuiltins
import time:        80 |         80 |   sitecustomize
import time:        28 |         28 |   usercustomize
import time:       199 |        842 | site
import time:       106 |        106 | linecache
import time:       189 |        189 |   email
import time:        18 |         18 |     _string
import time:       124 |        124 |         types
import time:       667 |        791 |       enum
import time:        33 |         33 |         _sre
import time:       130 |        130 |           re._constants
import time:       177 |        307 |         re._parser
import time:        49 |         49 |         re._casefix
import time:       179 |        566 |       re._compiler
import time:        58 |         58 |           itertools
import time:        74 |         74 |           keyword
import time:        36 |         36 |             _operator
import time:       148 |        183 |           operator
import time:        96 |         96 |           reprlib
import time:        31 |         31 |           _collections
import time:       494 |        934 |         collections
import time:        27 |         27 |         _functools
import time:       315 |       1274 |       functools
import time:        87 |         87 |       copyreg
import time:       284 |       3000 |     re
import time:       297 |       3314 |   string
import time:       168 |       3669 | email.quoprimime

however, the new implementation does increase the compute time

TEST_CASES = {
    "empty": "Dracula",
    "empty_medium": "Dracula"* 10,
    "empty_long": "Dracula"* 100,
    "short": "Hello=20World=21",
    "medium": "This_is_a_test=3F=3D=2E" * 10,
    "long": "Some_long_text_with_encoding=20" * 100,
    "mixed": "A=2Equick=20brown=5Ffox=21=3F" * 50,
    "edge_case_short": "=20=21=3F=2E=5F",
    "edge_case_long": "=20=21=3F=2E=5F" * 200
}
Benchmark regex non_regex
empty 284 ns 382 ns: 1.34x slower
empty_medium 302 ns 2.99 us: 9.91x slower
empty_long 371 ns 28.6 us: 77.20x slower
short 731 ns 902 ns: 1.23x slower
medium 6.24 us 11.8 us: 1.89x slower
long 25.5 us 137 us: 5.37x slower
mixed 57.0 us 71.5 us: 1.25x slower
edge_case_short 1.36 us 916 ns: 1.48x faster
edge_case_long 178 us 160 us: 1.11x faster
Geometric mean (ref) 2.78x slower

So it is very possible that this is not worth it.

Issues:

@Marius-Juston Marius-Juston requested a review from a team as a code owner April 3, 2025 10:22
@Marius-Juston Marius-Juston changed the title gh-118761: Quoprimime removing re import gh-118761: email.quoprimime removing re import Apr 3, 2025
@Marius-Juston
Copy link
Contributor Author

Marius-Juston commented Apr 3, 2025

The PR:

will probably drastically improve the speed as well as once string lazy imports re it will drastically speed up the string import and this module only uses the string module to import constants from string import ascii_letters, digits, hexdigits

@Marius-Juston
Copy link
Contributor Author

I did not notice that the warmup needed for ./python -X importtime -c 'import email.quoprimime' and so the more accurate timings are actually:

regex: 153.9974 ± 35.97 (103 to 1778; n=10000)
non_regex: 148.4565 ± 25.48 (125 to 991; n=10000)

@Marius-Juston
Copy link
Contributor Author

( the new _HEX_TO_CHAR cache could also be used for the decode function as well afterwards since it checks for more or less the same thing)

# Decode if in form =AB
elif i+2 < n and line[i+1] in hexdigits and line[i+2] in hexdigits:
     decoded += unquote(line[i:i+3])

@Marius-Juston
Copy link
Contributor Author

Benchmark regex non_regex_2
empty 288 ns 259 ns: 1.11x faster
empty_medium 299 ns 1.74 us: 5.81x slower
empty_long 375 ns 16.3 us: 43.61x slower
short 725 ns 714 ns: 1.01x faster
medium 6.22 us 7.97 us: 1.28x slower
long 22.0 us 85.9 us: 3.91x slower
mixed 49.5 us 56.3 us: 1.14x slower
edge_case_short 1.26 us 744 ns: 1.69x faster
edge_case_long 177 us 125 us: 1.41x faster
Geometric mean (ref) 2.01x slower

Slightly faster

@Marius-Juston
Copy link
Contributor Author

Adding the '=' check now speeds things up:

Benchmark regex non_regex
empty 288 ns 53.6 ns: 5.37x faster
empty_medium 299 ns 54.1 ns: 5.53x faster
empty_long 375 ns 62.5 ns: 6.00x faster
short 725 ns 722 ns: 1.00x faster
medium 6.22 us 8.09 us: 1.30x slower
long 22.0 us 86.7 us: 3.94x slower
mixed 49.5 us 58.6 us: 1.18x slower
edge_case_short 1.26 us 767 ns: 1.64x faster
edge_case_long 177 us 127 us: 1.39x faster
Geometric mean (ref) 1.60x faster

@Marius-Juston
Copy link
Contributor Author

Marius-Juston commented Apr 3, 2025

As a comparison (if you compile the regex for the function + add early exit)

c = re.compile("=[a-fA-F0-9]{2}", flags=re.ASCII)

def header_decode_re(s):
    """Decode a string using regex."""
    s = s.replace('_', ' ')  # Replace underscores with spaces
    if '=' in s:
        return c.sub(_unquote_match, s)
    return s
Benchmark regex regex2 non_regex
empty 288 ns 51.4 ns: 5.60x faster 53.6 ns: 5.37x faster
empty_medium 299 ns 52.0 ns: 5.76x faster 54.1 ns: 5.53x faster
empty_long 375 ns 61.0 ns: 6.15x faster 62.5 ns: 6.00x faster
short 725 ns 560 ns: 1.29x faster 722 ns: 1.00x faster
medium 6.22 us 6.44 us: 1.04x slower 8.09 us: 1.30x slower
long 22.0 us 22.9 us: 1.04x slower 86.7 us: 3.94x slower
mixed 49.5 us 52.6 us: 1.06x slower 58.6 us: 1.18x slower
edge_case_short 1.26 us 1.12 us: 1.13x faster 767 ns: 1.64x faster
edge_case_long 177 us 189 us: 1.07x slower 127 us: 1.39x faster
Geometric mean (ref) 1.83x faster 1.60x faster

@Marius-Juston
Copy link
Contributor Author

@AA-Turner, what's your opinion on replacing this regex expression (even though it sometimes makes the algorithm slower)?

@Marius-Juston
Copy link
Contributor Author

Very slight improvement (mainly on the edge_case_short and short where string concatenation is faster than using "".join()

Benchmark regex regex2 non_regex non_regex_add
empty 288 ns 51.4 ns: 5.60x faster 53.6 ns: 5.37x faster 51.6 ns: 5.58x faster
empty_medium 299 ns 52.0 ns: 5.76x faster 54.1 ns: 5.53x faster 51.6 ns: 5.80x faster
empty_long 375 ns 61.0 ns: 6.15x faster 62.5 ns: 6.00x faster 59.9 ns: 6.26x faster
short 725 ns 560 ns: 1.29x faster 722 ns: 1.00x faster 674 ns: 1.08x faster
medium 6.22 us 6.44 us: 1.04x slower 8.09 us: 1.30x slower 7.82 us: 1.26x slower
long 22.0 us 22.9 us: 1.04x slower 86.7 us: 3.94x slower 83.5 us: 3.79x slower
mixed 49.5 us 52.6 us: 1.06x slower 58.6 us: 1.18x slower 60.6 us: 1.22x slower
edge_case_short 1.26 us 1.12 us: 1.13x faster 767 ns: 1.64x faster 699 ns: 1.80x faster
edge_case_long 177 us 189 us: 1.07x slower 127 us: 1.39x faster 127 us: 1.39x faster
Geometric mean (ref) 1.83x faster 1.60x faster 1.66x faster

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants