Skip to content

Span merge review #3

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

pascalkuthe
Copy link

I did a pretty comprehensive review of the SpanIter from helix-editor#3695.
I found a couple of bugs, and some opportunities for performance improvement.
The changes are complex and quite a lot of the so I did not want to suggest them trough the GitHub web UI.
See my comments in helix-editor#3695 for more details

This bug occurs if:
* A span needs to be subliced at the end of an in-progress span
* Another shorter span exists at the same poisitiotion
* That span ends before the in-progress span does

An exammple would look roughly like this:

     D
   |--|
       C
   |-------|
        B
   |----------|
     A
|---------|

Because spans at the same position are sorted in the descending order,
the largest span (B) is processed first and subslicing  is started.
The current code then moves the start of all subslices at the
same position to the end of the of the in-progress span (A).
This is correctt in many cases (like C) where the spans at the same poistion
extend past the end of the in-progress span (A).
However if any spans (C) are shorted than the in-progress span (A),
the start of these slices will be moved past their end.

To fix this, the code now first splits all spans that end before
the end of the in-progress span (A) (in this case B and C) into subslices.
This is identical to the previous behaviour except that the subslicing
stops as soon as the first span whose end starts before the end of the
in-progress span (A) is encountered.

Afterwards the span list is resorted (unchanged) to ensure the new
subslices are at the correct posistion.

Finally the spans that are shorter than the in-progress span (A)
are processed normally (no subslicing).
By sorting regends in descending order, the ranges that end before
the next span can simply be popped off the end.

This allows turning two inner loops (final loop and retain) into
simpel if conditions that return early.
This presents a nice peformance win in two ways:

* Do no iterate the entire `range_ends` vector during retain
* No need to move all elements in range_ends just to remove front of elements
* Many events can be returned directly instead of pushing to the VecDeque
@the-mikedavis
Copy link
Owner

Sweet, thanks for taking a look at this in such depth! Having another pair of eyes on this should hopefully help iron out those last remaining bugs. It might take me a while to look into the changes properly but a few notes off the top of my head:

  • property testing: I thought this might be useful after writing a few unit tests manually; thanks for setting it up 😀
  • sorting range ends descending: I had a commit with this change for a while but I dropped it because it exposed more edge cases than sorting ascending. It looks like your changes may have covered the edge cases I found. I will mostly be looking at this while reviewing.

@pascalkuthe
Copy link
Author

pascalkuthe commented Sep 25, 2022

Sorry about the failing tests somehow I made a small mistake while factoring out a function and forgot to save before running cargo test (still used to autosaving from nvim) :D
It should be fixed now

sorting range ends descending: I had a commit with this change for a while but I dropped it because it exposed more edge cases than sorting ascending. It looks like your changes may have covered the edge cases I found. I will mostly be looking at this while reviewing.

Ah that makes sense so it was intentional. I wasn't quite sure because the documentation of the span_iter function still claimed that spans should be ordered in descending order while elsewhere (and one testcase) used ascending order.

I guess you probably found similar edge cases to me. The edge cases I fixed here actually also occur for ascending order but instead of completely nonsensical spans (end past the start) this leads to subtlety broken highlighting which none of the previous tests would have picked up.
I updated my comment to reflect that (also with some more details). I had that mixed up because I mostly looked at descending order most of the time.
My fix wouldn't actually work for ascending order (and an alternative fix for ascending order would be more complex I think).

Note that I have removed the empty_span_at_sublice_start testcase. This was the minimal reproducible testcase the proptest gave me but its actually a poor example and relied on empty spans to introduce the bug (and therefore doesn't actually reach the edgecase anymore, because empty spans are now ignored)

@pascalkuthe
Copy link
Author

It seems that dropping empty spans can cause a mismatched number of HiglightStart/HighlightEnd events. I am not too sure why that is. I don't have time too dig into that right now so I just dropped that one commit since it was pretty trivial

Copy link
Owner

@the-mikedavis the-mikedavis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was only able to look at this superficially so far but I'm liking what I see 😀

Comment on lines 130 to 137
while let Some(span) = self.spans.get(i) {
if span <= &intersect_pos {
after = Some(i);
i += 1;
} else {
break;
}
}
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I opted for the loop { match { Some(range) if .. { .. }, _ => break, } } pattern since I thought the while let version was less clear that we're only looking for exactly one pattern. What do you think, is the while let version more idiomatic?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I personally like the while loop better, because it looks less cluttered.
I personally always prefer if to match if there is just one case that I actually care about.
Its also closer to my ideal solution which would be a let_chain whenever that becomes available on a rust version we can use: while let Some(span) = self.spans.get(i) && span <= &intersect_pos

The while loop is just a standard vec iterator so when scanning the code its immediately) obvious what it does and I can focus on the content of the loop (where anything interesting is happening).

However thinking about this more this can actually be solved with an iterator which makes it much clearer.
This also inspired me to rename some variables to make the behaviour of partition_spans_at much clearer.
These changes are in 30563c4. Let me know what you think

// the runtime is even better since we only scan from `self.index` to
// the first element of the Vec with a `range.start` after this range.
let mut after = None;
let intersect_pos = Span {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this would read clearer as intersect_span

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed in 2030231.

This made me notice that the term range is kind of used randomly instead of span. I think it might be nice to consistently use the term span instead. I did that in 7de6f2a and also expanded on the documentation of partition_spans_at. Do you think the renaming is better or should I revert that commit (except for the documentation)?

@the-mikedavis
Copy link
Owner

From the upstream PR review:

I haven't looked at the merge implementation yet, I think that will also be a good fit for proptesting

I'm much more confident about that implementation: the procedure is much simpler - basically the same problem as merging two lists which are already sorted - I haven't found any bugs in it yet (compared to span_iter which has been quite a headache), and it's essentially the same as the original implementation of Merge.

That being said, I agree it's a good place for more proptesting 👍

@pascalkuthe
Copy link
Author

pascalkuthe commented Sep 26, 2022

It seems that dropping empty spans can cause a mismatched number of HiglightStart/HighlightEnd events. I am not too sure why that is. I don't have time too dig into that right now so I just dropped that one commit since it was pretty trivial

I have fixed this now in 8f995d6. I simply dropped the empty spans too late where they already had some influence on the state of the iterator and ignoring them left the iterator in an invalid state.

I now drop the right at the start of the iterator before any state has been modified.

Protesting is really invaluable for this specific task. Whenever I make an oversight it is always caught (even tough none of the existing testcases found it). While I am not 100% sure about this I am reasonable sure that the proptest can reach 100% coverage on all permutations of the algorithm. I will look into improving the prop testing even further (and the applying it to merge)

@pascalkuthe pascalkuthe force-pushed the span-merge-review branch 2 times, most recently from dbc5c28 to 31c8931 Compare September 27, 2022 21:23
@pascalkuthe
Copy link
Author

pascalkuthe commented Sep 27, 2022

Another update from me (sorry for the spam).

I have significantly improved the proptesting infrastructure.
I have created a datastructure that allows me to represent the highlight of each character as a bitset.
This datastructure can be easily constructed from both highlight events and spans directly.
This allows me to compare the highlighting produced by the span_iter to that of the original evenets.

This has turned up two minor bugs (which are just extensions of what I already worked on):

  • empty spans must always be removed because they can actually mess up the highlight of other spans, in particular span partioning can create new empty spans by moving the start of a span to its end
  • previously the spans vector was resorted using rotate_left this was a very nice optimization but sadly only works if the spans were only sorted by their start. I already noticed during my initial review that this is not correct because span ends must also stay in descending order. However my suggested fix was wrong: I simply also rotate the spans that had a larger end to the front. However the span ends are not actually sorted (or rather there are two distincs sorted lists) and therefore if that special case occurs, a (very small) part of the spans vec needs a full call to sort_unstable

I have pushed the fixes. Right now I am letting my PC run a billion proptestcases to ensure there are no edgecases left but I am already at a couple hundred million so I doubt any more will turn up. The proptests now cover the entire output of the algorithm so we have realistically 100% coverage once that tests concludes.

I will start looking at the merge implementation now.

@pascalkuthe
Copy link
Author

(closed in favor of helix-editor#4042)

@pascalkuthe pascalkuthe closed this Oct 1, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants