Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gh-132121: Always escape non-printable characters in pygettext #132122

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

tomasr8
Copy link
Member

@tomasr8 tomasr8 commented Apr 5, 2025

Pretty much the title, all non-printable characters are always escaped regardless whether --escape is passed or not.

Copy link
Member

@serhiy-storchaka serhiy-storchaka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please measure performance of escape_ascii() for non-ASCII strings of different length (10 to 100000)?

@tomasr8
Copy link
Member Author

tomasr8 commented Apr 5, 2025

This is on a PGO build:

main

n=10
        only printable: 0.0015040868893265724
        only non-printable: 0.001876825001090765
        mix (90% printable, 10% non-printable): 0.001469448208808899
        mix (50% printable, 50% non-printable): 0.0016370960511267185
n=100
        only printable: 0.010028427001088858
        only non-printable: 0.015033604810014367
        mix (90% printable, 10% non-printable): 0.010826029814779758
        mix (50% printable, 50% non-printable): 0.012500355951488018
n=1000
        only printable: 0.09287006920203567
        only non-printable: 0.14058534195646644
        mix (90% printable, 10% non-printable): 0.09682066598907113
        mix (50% printable, 50% non-printable): 0.11390575394034386
n=10000
        only printable: 0.9122045831754804
        only non-printable: 1.4113427978008986
        mix (90% printable, 10% non-printable): 0.9659982230514288
        mix (50% printable, 50% non-printable): 1.1028782678768039
n=100000
        only printable: 12.202279687859118
        only non-printable: 13.532787967007607
        mix (90% printable, 10% non-printable): 12.229883643100038
        mix (50% printable, 50% non-printable): 12.140505719929934

PR

n=10
        only printable: 0.0016840400639921427
        only non-printable: 0.008518028073012829
        mix (90% printable, 10% non-printable): 0.002215402899309993
        mix (50% printable, 50% non-printable): 0.0051144990138709545
n=100
        only printable: 0.012747738044708967
        only non-printable: 0.08456205087713897
        mix (90% printable, 10% non-printable): 0.02049739588983357
        mix (50% printable, 50% non-printable): 0.050314649008214474
n=1000
        only printable: 0.12691838503815234
        only non-printable: 0.8395978400949389
        mix (90% printable, 10% non-printable): 0.20092108193784952
        mix (50% printable, 50% non-printable): 0.4972433662042022
n=10000
        only printable: 1.2568156099878252
        only non-printable: 8.38266294496134
        mix (90% printable, 10% non-printable): 2.0531771059613675
        mix (50% printable, 50% non-printable): 4.823235515970737
n=100000
        only printable: 12.481141783064231
        only non-printable: 82.42004368198104
        mix (90% printable, 10% non-printable): 20.128975569969043
        mix (50% printable, 50% non-printable): 49.69893314293586

test code:

def test_escape_ascii(string):
    for _ in range(100):
        escape_ascii(string, 'utf-8')

make_escapes(True)

for n in (10, 100, 1_000, 10_000, 100_000):
    print(f"{n=}")
    string = 'α' * n
    print(f"\tonly printable: {timeit.timeit(partial(test_escape_ascii, string), number=10)}")
    string = '\302\200' * n
    print(f"\tonly non-printable: {timeit.timeit(partial(test_escape_ascii, string), number=10)}")
    string = 'α' * ((9*n)//10) + '\302\200' * (n//10)
    print(f"\tmix (90% printable, 10% non-printable): {timeit.timeit(partial(test_escape_ascii, string), number=10)}")
    string = 'α' * (n//2) + '\302\200' * (n//2)
    print(f"\tmix (50% printable, 50% non-printable): {timeit.timeit(partial(test_escape_ascii, string), number=10)}")

Strings that only contain non-printable characters are much slower with this PR, though they should be quite uncommon. I also tested mixed strings (50% printable/non-printable and 90%/10% printable/non-printable) where the numbers look more reasonable but maybe there's a way to optimize the function?

@serhiy-storchaka
Copy link
Member

Thanks. It is expected that non-printable characters should be slower, but non-ASCII non-printable characters should be extremely rare. I worried about printable characters. How is this in comparison with the current code?

The time grow looks linear, but this may be CPython specific optimization, on other implementations it may be quadratic.

I am asking because you changed the implementation to use += in a loop instead of str.join() with a generator -- this may be slower or faster, depending on data and Python implementation. It does not matter if the total time is instant, but if it is increased from seconds to minutes, we have a problem.

@serhiy-storchaka
Copy link
Member

Ah, I see, you already tested it against main and PR. The difference is insignificant. Could you please test it on PyPy? Just to be sure that the worst case is not so bad.

@tomasr8
Copy link
Member Author

tomasr8 commented Apr 5, 2025

Could you please test it on PyPy? Just to be sure that the worst case is not so bad.

Will do, either today or tomorrow :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants