Commit ff083d0

New algorithm for File.match? (#15607)
This is a complete rewrite of `File.match?`, necessary to fix some serious shortcomings. The implementation is ported from the Rust crate https://github.com/oxc-project/fast-glob. It is based on the same [linear-time algorithm](https://research.swtch.com/glob) as our old implementation, but implements backtracking, which is necessary to properly handle patterns with multiple wildcards or globstars (#15319).

There is some divergence from the current implementation, of course. But I've made sure to keep it minimal.

### Changes from previous Crystal implementation

Most impactful is fixing the incorrect behaviour, which of course leads to differences. This part is inevitable.

This new implementation fixes several bugs in pattern matching. This may break use cases which (perhaps inadvertently) relied on the buggy behaviour.

`a**` no longer matches `ab/c`. The documentation already claimed this to be the case, and it is in accordance with most other libraries (see the comparison table in #15319 (comment)). I believe this is generally the more sensible and expected behaviour. So it seems to be the right thing to do, but it still breaks existing behaviour.

The new algorithm is more lenient when parsing character ranges. For example, it accepts `[-]` as a character set. Previously, this was considered a pattern error (incomplete range) and raised an exception. So we're going from an error to defined behaviour. This is a breaking change, but I believe it is a good one: the result is more sensible and matches that of other implementations.

I'm not aware of any other breaking changes in the new implementation. All known behaviour changes are documented in the spec diff. See #15319 (comment) for more details about the changes.

### Changes from fast-glob

The fast-glob crate does not support UTF-8 decoding, which our current implementation does. So I integrated that into the new algorithm.
UTF-8 awareness is only necessary for single-character (`?`) and character class (`[a]`, `[a-b]`) patterns. In all other situations, it doesn't matter whether a byte is part of a multibyte character or not. So the matching is generally byte-oriented and only decodes UTF-8 when necessary.

I have explicitly disabled some features of the original Rust implementation because they are breaking changes in the pattern language with respect to our previous implementation:

* Pattern negation: A pattern starting with `!` would be negated.
* Character set negation with `!` or `^` (only the latter is supported in Crystal).

We can consider adding these back later on. The implementation is trivial. But it should be a separate decision to do that.
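
For illustration only, here is a minimal Python sketch of the byte-oriented matching idea described above. It is not the Crystal code in this commit and not the fast-glob code. It follows the single-restart-point scheme from the linked linear-time glob article, supports only literals, `*` and `?`, and leaves out globstars and character classes. The names `glob_match` and `utf8_len` are hypothetical, and the choice to let `?` reject UTF-8 continuation bytes is an assumption of this sketch.

```python
SLASH = ord("/")


def utf8_len(lead: int) -> int:
    """Byte length of the UTF-8 sequence that starts with lead byte `lead`."""
    if lead < 0x80:
        return 1
    if lead >> 5 == 0b110:
        return 2
    if lead >> 4 == 0b1110:
        return 3
    if lead >> 3 == 0b11110:
        return 4
    return 1  # invalid lead byte: treat it as a single byte


def glob_match(pattern: str, path: str) -> bool:
    """Hypothetical sketch: match `path` against `pattern` byte by byte."""
    pat = pattern.encode("utf-8")
    name = path.encode("utf-8")
    px = nx = 0
    # Restart point for the most recent `*`; next_nx == 0 means "none yet".
    next_px = next_nx = 0
    while px < len(pat) or nx < len(name):
        if px < len(pat):
            c = pat[px]
            if c == ord("*"):
                # First try matching zero bytes. On a later mismatch,
                # restart here with the `*` consuming one more byte.
                next_px, next_nx = px, nx + 1
                px += 1
                continue
            if c == ord("?"):
                # The only place that needs UTF-8 decoding: consume one
                # whole character. Continuation bytes (0b10xxxxxx) are
                # rejected so a restart cannot land `?` mid-character.
                if (nx < len(name) and name[nx] != SLASH
                        and (name[nx] & 0xC0) != 0x80):
                    nx += utf8_len(name[nx])
                    px += 1
                    continue
            elif nx < len(name) and name[nx] == c:
                # Literal bytes compare directly, multibyte or not.
                px += 1
                nx += 1
                continue
        # Mismatch: backtrack to the restart point, unless the byte the
        # `*` would have to swallow is a path separator.
        if 0 < next_nx <= len(name) and name[next_nx - 1] != SLASH:
            px, nx = next_px, next_nx
            continue
        return False
    return True
```

In this sketch, `glob_match("*.cr", "file.cr")` returns `True`, while `glob_match("*.cr", "src/file.cr")` returns `False` because `*` never consumes a `/`. Literal multibyte characters in the pattern need no decoding, since their encoded bytes are compared one by one.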
1 parent 1659675 commit ff083d0

4 files changed: +580 -331 lines changed
