
Add checking if both strings are the same in SequenceMatcher.find_longest_match #132166

Open
jurgenwigg opened this issue Apr 6, 2025 · 12 comments
Labels
performance Performance or resource usage stdlib Python modules in the Lib dir type-feature A feature request or enhancement

Comments

@jurgenwigg

jurgenwigg commented Apr 6, 2025

Feature or enhancement

Proposal:

Add a check at the beginning of the method find_longest_match in SequenceMatcher that returns early when the sequences a[alo:ahi] and b[blo:bhi] are identical. For identical sequences there is no reason to run the whole matching logic when a simple comparison suffices. It appears to fix an issue where comparing two slightly different strings waits forever for the result. This solves the problem reported in pytest-dev/pytest#8998

Proposed fix:

if a[alo:ahi] == b[blo:bhi]:
    return Match(alo, blo, len(a[alo:ahi]))
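
In context, the proposed short-circuit can be tried as a wrapper without patching difflib (a sketch only; find_longest_match_fast is a hypothetical name, while Match is the namedtuple that the real method already returns):

```python
from difflib import Match, SequenceMatcher

def find_longest_match_fast(sm, alo=0, ahi=None, blo=0, bhi=None):
    """Hypothetical wrapper illustrating the proposed early return."""
    a, b = sm.a, sm.b
    if ahi is None:
        ahi = len(a)
    if bhi is None:
        bhi = len(b)
    # proposed check: identical subsequences need no search at all;
    # with in-range bounds, ahi - alo equals len(a[alo:ahi])
    if a[alo:ahi] == b[blo:bhi]:
        return Match(alo, blo, ahi - alo)
    return sm.find_longest_match(alo, ahi, blo, bhi)

sm = SequenceMatcher(None, "abcdef", "abcdef")
print(find_longest_match_fast(sm))  # Match(a=0, b=0, size=6)
```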

Has this already been discussed elsewhere?

No response given

Links to previous discussion of this feature:

No response

Linked PRs

@sobolevn
Member

sobolevn commented Apr 6, 2025

a[alo:ahi] and b[blo:bhi] will create two new slices of possibly big strings. How does it affect both timings and memory usage?
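
One rough way to gauge that cost (a sketch; sizes are CPython-specific and assume ASCII strings):

```python
import sys
import timeit

a = "x" * 1_000_000
b = " " + a[:-1]

# a[alo:ahi] materializes a new str object whenever the slice is not the
# whole string, so the proposed equality check costs O(n) time and memory
print(sys.getsizeof(a[1:]))                               # roughly 1 MB copy
print(timeit.timeit(lambda: a[1:] == b[1:], number=100))  # seconds for 100 checks
```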

@jurgenwigg
Author

Good question - I didn't measure it. Keep in mind that without this change, the assert used in pytest takes forever. Nevertheless, it is worth checking how the change affects memory usage. It certainly doesn't hurt overall performance (test execution time stays short), but it would be good to profile it.

@picnixz picnixz added stdlib Python modules in the Lib dir performance Performance or resource usage labels Apr 6, 2025
@picnixz picnixz changed the title Add checking if both strings are the same in SequenceMatcher().find_longest_match Add checking if both strings are the same in SequenceMatcher.find_longest_match Apr 6, 2025
@gaogaotiantian
Member

gaogaotiantian commented Apr 7, 2025

Keep in mind, that without this change, assert used in pytest takes forever

It seems like this is a pytest issue, not a CPython one. If this is really a critical matter for pytest, it should be easy enough for them to add an extra check for equality.

I think this improvement is too limited and comes with a potentially observable performance impact. A similar discussion took place in #106865 (#106877 (comment) for specifics) - the problem is probably caused by not using the library correctly.

@jurgenwigg
Author

I've done some debugging on the pytest side. In this line: https://github.com/pytest-dev/pytest/blob/18439f52fd00bfa42a7ad1d66ca862bc62e36c59/src/_pytest/assertion/util.py#L328 pytest's assert executes a function that prepares the explanation message. It is a list comprehension in which ndiff is called with perfectly valid strings as arguments. ndiff is a wrapper for Differ().compare(), which calls SequenceMatcher().find_longest_match. Without this simple check the test fails (I will commit it soon) with a very long execution time, which can be taken as a timeout.

From a performance point of view, for cases where both sequences are identical, this change is a huge improvement. The check operates on already initialised variables, i.e. on memory that is already declared and reserved. So in this case we are doing a very simple operation on known, initialised data in memory. Correct me if I'm wrong here, but this is my understanding of how things work in this situation.

The biggest question I don't understand (as of now) is why adding this check, which is transparent for most cases, makes the code execute faster (timeout vs expected result).

@jurgenwigg
Author

I've gone deeper and it looks like the issue occurs in SequenceMatcher().get_matching_blocks() at the very end of the method:

        self.matching_blocks = list(map(Match._make, non_adjacent))
        return self.matching_blocks

If I place breakpoint() between these lines, pdb stops correctly, but if I continue execution it hangs. get_matching_blocks() is called by the ratio() method, and while debugging the issue, executing the line matches = sum(triple[-1] for triple in self.get_matching_blocks()) triggers the issue. Even calling self.get_matching_blocks() alone gives the same result - waiting forever. Maybe the root cause lies somewhere else, but at first glance something goes wrong while executing return self.matching_blocks in get_matching_blocks(). I'll try to dig down to the root cause.
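
For orientation, on a small input get_matching_blocks() returns a list of Match triples plus a zero-length sentinel (a minimal illustrative example, not the failing case):

```python
from difflib import SequenceMatcher

sm = SequenceMatcher(None, "abxcd", "abcd")
# each Match is (index_in_a, index_in_b, size); the last entry is a sentinel
print(sm.get_matching_blocks())
# [Match(a=0, b=0, size=2), Match(a=3, b=2, size=2), Match(a=5, b=4, size=0)]
```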

@sobolevn
Member

sobolevn commented Apr 7, 2025

Try adding an iter_matching_blocks method that does the same, but without the list() call on map. Then get_matching_blocks would return list(self.iter_matching_blocks()), and the proper places would be refactored to use iter_matching_blocks instead of get_matching_blocks.

If this solves the problem - we can add this method :)
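
A consumer-side sketch of what this would buy: ratio() could sum over an iterator of blocks rather than a materialized list. Here iter() over the existing list stands in for the hypothetical iter_matching_blocks():

```python
from difflib import SequenceMatcher

def ratio_from_blocks(sm):
    # sums sizes from an iterator of (i, j, size) triples; a real
    # iter_matching_blocks() would let this run without building a list first
    matches = sum(size for _i, _j, size in iter(sm.get_matching_blocks()))
    return 2.0 * matches / (len(sm.a) + len(sm.b))

sm = SequenceMatcher(None, "abcd", "abxd")
print(ratio_from_blocks(sm), sm.ratio())  # both print 0.75
```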

@jurgenwigg
Author

I've tried that - the issue still exists. I've also tried converting to a list comprehension and replacing the Match namedtuple with a dataclass; nothing helped.

It loops here:

-> for line in ndiff(right.splitlines(keepends), left.splitlines(keepends))
  /home/jurgen/cpython/Lib/difflib.py(886)compare()
-> yield from g
  /home/jurgen/cpython/Lib/difflib.py(959)_fancy_replace()
-> and cr() > best_ratio):
  /home/jurgen/cpython/Lib/difflib.py(632)ratio()
-> matches = sum(triple[-1] for triple in self.get_matching_blocks())
  /home/jurgen/cpython/Lib/difflib.py(466)get_matching_blocks()
-> i, j, k = x = self.find_longest_match(alo, ahi, blo, bhi)
> /home/jurgen/cpython/Lib/difflib.py(414)find_longest_match()
-> breakpoint()
(Pdb)

@jurgenwigg
Author

jurgenwigg commented Apr 7, 2025

Ok, got the place of the issue in SequenceMatcher().get_matching_blocks():

        while queue:
            alo, ahi, blo, bhi = queue.pop()
            i, j, k = x = self.find_longest_match(alo, ahi, blo, bhi)
            # a[alo:i] vs b[blo:j] unknown
            # a[i:i+k] same as b[j:j+k]
            # a[i+k:ahi] vs b[j+k:bhi] unknown
            if k:   # if k is 0, there was no matching block
                matching_blocks.append(x)
                if alo < i and blo < j:
                    queue.append((alo, i, blo, j))
                if i+k < ahi and j+k < bhi:
                    queue.append((i+k, ahi, j+k, bhi))

For my test, where both sequences differ only by one character (e.g. a='bb', b=' bb'), it iterates over the whole big sequence

# get_matching_blocks()
(Pdb) la
79647
(Pdb) lb
79648

in blocks of 4 characters. That's why it takes so much time, I think.

Why 4? Because my test strings are:

# get_matching_blocks()
(Pdb) self.a[:20]
'[10, 10, 10, 10, 10,'
(Pdb) self.b[:20]
' [10, 10, 10, 10, 10'
(Pdb)

so by a "block" I mean "10, ".

The test I'm using to debug the issue is taken from the linked pytest issue. The question is whether this behaviour is expected or not.
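
The behaviour can be illustrated without pytest. With the default autojunk heuristic, every character of a long string over a small alphabet occurs often enough to be classified as "popular" junk, which cripples matching entirely (a minimal sketch, not the original reproducer):

```python
from difflib import SequenceMatcher

a = "[10, " * 200   # 1000 chars drawn from a 5-character alphabet
b = " " + a[:-1]    # same content shifted by one leading space

# With autojunk (the default), any character occurring in more than 1% of b
# is discarded once len(b) >= 200 -- here that is every character.
print(SequenceMatcher(None, a, b).ratio())                   # 0.0
print(SequenceMatcher(None, a, b, autojunk=False).ratio())   # ~0.999
```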

@jurgenwigg
Author

While trying to solve this issue, I tried implementing find_longest_match in a different way. I ended up with something like this:

def compare_strings(shorter, longer):
    # Slide `shorter` along `longer` and record runs of matching characters
    # as [index_in_shorter, index_in_longer, run_length].
    len_diff = len(longer) - len(shorter) + 1
    shorter_len = len(shorter)
    res = []
    for offset in range(len_diff):
        prev = False
        for i in range(shorter_len):
            if shorter[i] == longer[offset + i]:
                if prev:
                    res[-1][-1] += 1  # extend the current run
                else:
                    res.append([i, offset + i, 1])  # start a new run
                    prev = True
            else:
                prev = False
    return res


def lcs(string_1, string_2):
    if len(string_1) < len(string_2):
        res = compare_strings(shorter=string_1, longer=string_2)
    else:
        # compare_strings reports [index_in_shorter, index_in_longer, k], so
        # swap the indices back to [index_in_string_1, index_in_string_2, k]
        res = [[j, i, k] for i, j, k in
               compare_strings(shorter=string_2, longer=string_1)]
    res = [item for item in res if item[-1] > 1]  # drop single-char runs
    return [*res, [len(string_1), len(string_2), 0]]  # sentinel, as in difflib


def find_longest_match(a="", b="", alo=0, ahi=None, blo=0, bhi=None):
    if ahi is None:
        ahi = len(a)
    if bhi is None:
        bhi = len(b)
    runs = lcs(string_1=a[alo:ahi], string_2=b[blo:bhi])
    return max(runs, key=lambda x: x[-1])

It's not pretty, but the main point is to present the idea.

It does the same job as get_matching_blocks() and find_longest_match() combined. No autojunk or other functionality is implemented, due to its PoC nature. I've profiled it with python3.13 -m cProfile and it looks promising:

# Test fragments
[[find_longest_match(*item) for item in test_set] for _ in range(1000)]
print("LCS finished")
[
    [difflib.SequenceMatcher(*item).find_longest_match() for item in test_set]
    for _ in range(1000)
]
print("difflib finished")
$ python3.13 -m cProfile lcs.py
LCS finished
difflib finished
         79798045 function calls (79798018 primitive calls) in 27.755 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      3/1    0.000    0.000   27.755   27.755 {built-in method builtins.exec}
        1    0.004    0.004   27.755   27.755 lcs.py:1(<module>)
     3000   12.752    0.004   18.611    0.006 difflib.py:305(find_longest_match)
     3000    0.001    0.000    9.131    0.003 lcs.py:41(find_longest_match)
     3000    0.005    0.000    9.127    0.003 lcs.py:31(lcs)
     3000    9.119    0.003    9.121    0.003 lcs.py:13(compare_strings)
 79689012    5.857    0.000    5.857    0.000 {method 'get' of 'dict' objects}
     3000    0.001    0.000    0.007    0.000 difflib.py:120(__init__)

test_set includes two 80k-character strings to compare. Maybe this implementation could also address the performance issues with ratio()?

@gaogaotiantian
Member

Please take a look at the discussion at #106877 (comment). I think the whole point of that discussion is that find_longest_match is not designed for long sequences with a small alphabet (for example, a huge string). When you are using the wrong tool to solve a problem, you should consider switching to a different tool, or viewing the problem from a different angle (for example, splitting the string into lines if applicable), not modifying the tool to fit the issue.

A few responses to your thoughts:

  1. Unfortunately, Python's slice (a[:]) does create a new array which takes time and memory, even when a is initialized.
  2. A prototype of a method that does not reproduce the existing behavior is not super helpful, because it's impossible for us to use it as a replacement. As far as I remember, autojunk actually has a significant impact on performance (I might be wrong here).
  3. There are cases where we add something to a function and the problem magically disappears, but that's not a reason to do it. We need to dig into the root problem and see how generic the solution is, as well as its potential side effects. Without knowing much about the use cases, your initial solution seems too narrow - the problem it solves is so specific that it would hardly be hit in real scenarios.
  4. It would be very helpful if you could come up with a minimal reproducible example when we discuss this issue. A minimal reproducible example is a small piece of code (which preferably requires only CPython, not any 3rd-party libraries) that we can copy & paste to demonstrate the problem you mentioned. This way we can get a much better understanding of the issue and provide more helpful feedback.

@jurgenwigg
Author

Makes sense, thank you for the answer.

So, to fix the problem in pytest when comparing long sequences (e.g. a long bytes sequence), which tool from the stdlib could/should be used?

I've read that discussion and that particular comment, and this issue is almost the same.

Regarding point 1:
I thought that taking a slice a[:] in Python just gives a pointer to the value in already initialised memory. Where can I read more about how such things work under the hood?

@gaogaotiantian
Member

So for fixing the problem in the pytest for comparing long sequences, e.g. long bytes sequence, which tool from the stdlib could/should be used?

I don't think it's a trivial problem to generically "diff" two long sequences over a small alphabet. Consider two random strings of 30k characters: how do you "diff" them? I don't think there's a utility in the stdlib specifically designed for it. I think the correct way to solve this is that pytest should not do such a thing. A diff between two super large bytes/strings does not help anyone, does it? I don't believe pytest actually prints the diff for two 30k-long strings, and if it does not print it, why calculate it?

I thought that while accessing a slice in Python a[:] is just moving/taking a pointer to the value in already initialised memory. Where I can read more about how such things works under the hood?

That's just how Python works. For example, if you do a[:][0] = 1, a won't be affected. I don't remember where I first read it, but you'll get used to it as you use Python more (I do believe it's somewhere in the documentation).
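
The list example above can be checked directly (note that for immutable types like str, CPython may return the original object for a full slice, but the observable copy semantics for mutable sequences are):

```python
a = [0, 1, 2]
a[:][0] = 99   # mutates only the fresh copy created by a[:]
print(a)       # [0, 1, 2] -- unchanged

b = a[:]       # slicing copies the list
b[0] = 99
print(a, b)    # [0, 1, 2] [99, 1, 2]
```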
