-
-
Notifications
You must be signed in to change notification settings - Fork 31.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add checking if both strings are the same in SequenceMatcher.find_longest_match
#132166
Comments
|
Good question - I didn't measure it. Keep in mind, that without this change, |
SequenceMatcher.find_longest_match
It seems like this is a pytest issue, not a CPython one. If this is really a critical matter for pytest, it should be easy enough for them to add an extra check for equality. I think this improvement is too limited and with some potentially observable performance impact. A similar discussion was done in #106865 (#106877 (comment) for specifics) - the problem probably is caused by not correctly using the library. |
I've done some debug on the From performance point of view this change, for cases where both sequences are identical, it is a huge improvement. This check is operating on already initialised variables which should be already a declared and reserved memory. So in this case we are doing very simple operation on known and initialised data in the memory. Correct me if I'm wrong here, but this is my understanding of how things work in this situation. The biggest question which I don't understand (as for now) is why adding this transparent check for most cases, causes code to execute faster (timeout vs expected result). |
I've gone deeper and it looks like the issue occurs in self.matching_blocks = list(map(Match._make, non_adjacent))
return self.matching_blocks If I place |
Try adding If this solves the problem - we can add this method :) |
I've tried that - issue still exists. I've tried conversion to list comprehension, replacing It loops here: -> for line in ndiff(right.splitlines(keepends), left.splitlines(keepends))
/home/jurgen/cpython/Lib/difflib.py(886)compare()
-> yield from g
/home/jurgen/cpython/Lib/difflib.py(959)_fancy_replace()
-> and cr() > best_ratio):
/home/jurgen/cpython/Lib/difflib.py(632)ratio()
-> matches = sum(triple[-1] for triple in self.get_matching_blocks())
/home/jurgen/cpython/Lib/difflib.py(466)get_matching_blocks()
-> i, j, k = x = self.find_longest_match(alo, ahi, blo, bhi)
> /home/jurgen/cpython/Lib/difflib.py(414)find_longest_match()
-> breakpoint()
(Pdb) |
Ok, got the place of the issue in while queue:
alo, ahi, blo, bhi = queue.pop()
i, j, k = x = self.find_longest_match(alo, ahi, blo, bhi)
# a[alo:i] vs b[blo:j] unknown
# a[i:i+k] same as b[j:j+k]
# a[i+k:ahi] vs b[j+k:bhi] unknown
if k: # if k is 0, there was no matching block
matching_blocks.append(x)
if alo < i and blo < j:
queue.append((alo, i, blo, j))
if i+k < ahi and j+k < bhi:
queue.append((i+k, ahi, j+k, bhi)) For my test, where both sequences differ only by one character (e.g.
by 4 blocks. That's why it takes so much time I think. Why 4? Because my test strings are: # get_matching_blocks()
(Pdb) self.a[:20]
'[10, 10, 10, 10, 10,'
(Pdb) self.b[:20]
' [10, 10, 10, 10, 10'
(Pdb) so by the block means Test which I'm using to debug the issue is taken from linked pytest issue. Question is if this behaviour is expected or not? |
While trying to solve this issue, I've tried to implement def compare_strings(shorter, longer):
len_diff = len(longer) - len(shorter) + 1
shorter_len = len(shorter)
res = []
for offset in range(len_diff):
prev = False
for i in range(shorter_len):
if shorter[i] == longer[offset + i]:
if prev:
res[-1][-1] += 1
else:
res.append([i, offset + i, 1])
prev = True
else:
prev = False
return res
def lcs(string_1, string_2):
len_diff = len(string_1) - len(string_2)
if len_diff < 0:
res = compare_strings(shorter=string_1, longer=string_2)
else:
res = compare_strings(shorter=string_2, longer=string_1)
res = list(filter(lambda item: item[-1] > 1, res))
return [*res, [len(string_1), len(string_2), 0]]
def find_longest_match(a="", b="", alo=0, ahi=None, blo=0, bhi=None):
if ahi is None:
ahi = len(a)
if bhi is None:
bhi = len(b)
temp = lcs(string_1=a[alo:ahi], string_2=b[blo:bhi])
return sorted(temp, key=lambda x: x[-1], reverse=True)[0] It's not pretty but main thing is to present the idea. It is doing the same things as # Test fragments
[[find_longest_match(*item) for item in test_set] for _ in range(1000)]
print("LCS finished")
[
[difflib.SequenceMatcher(*item).find_longest_match() for item in test_set]
for _ in range(1000)
]
print("difflib finished") $ python3.13 -m cProfile lcs.py 1 ↵
LCS finished
difflib finished
79798045 function calls (79798018 primitive calls) in 27.755 seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
3/1 0.000 0.000 27.755 27.755 {built-in method builtins.exec}
1 0.004 0.004 27.755 27.755 lcs.py:1(<module>)
3000 12.752 0.004 18.611 0.006 difflib.py:305(find_longest_match)
3000 0.001 0.000 9.131 0.003 lcs.py:41(find_longest_match)
3000 0.005 0.000 9.127 0.003 lcs.py:31(lcs)
3000 9.119 0.003 9.121 0.003 lcs.py:13(compare_strings)
79689012 5.857 0.000 5.857 0.000 {method 'get' of 'dict' objects}
3000 0.001 0.000 0.007 0.000 difflib.py:120(__init__)
|
Please take a look at the discussion at #106877 (comment) . I think the whole point from that discussion is - A few responses to your thoughts:
|
Makes sense, thank you for the answer. So for fixing the problem in the I've read that discussion and that particular comment, and this issue here is almost the same Ad.1 |
I don't think it's a trivial problem to generically "diff" between two long sequences with a small alphabet. Consider two random strings with 30k characters, how do you "diff" them? I don't think there's a utility in stdlib that's specifically designed for it. I think the correct way to solve this is that pytest should not do such a thing. A diff between two super large bytes/strings does not help anyone does it? I don't believe pytest actually prints the diff for two 30k long strings. If it does not print it, why calculate it?
That's just how Python works. For example, if you do |
Feature or enhancement
Proposal:
Add checking if sequences
a[alo:ahi]
andb[blo:bhi]
are the same on the beginning of the methodfind_longest_match
inSequenceMatcher
. For identical sequences there is no reason to run whole logic when simple check can be done. It appears to fix issue when comparing two slightly different strings ends up with waiting forever for the result. This solves problem reported here pytest-dev/pytest#8998Proposed fix:
Has this already been discussed elsewhere?
No response given
Links to previous discussion of this feature:
No response
Linked PRs
The text was updated successfully, but these errors were encountered: