-
-
Notifications
You must be signed in to change notification settings - Fork 36
Partial match support #13
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
New methods: Regex::is_partial_match Regex::is_partial_match_at
@BurntSushi Can we have this merged and published? This is the only unpublished change that blocks publishing fclones to crates.io. Thank you! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! I've left some comments. In the future, it's okay to ping me every few months. If a PR has gone this long without getting my attention, it's probably safe to say that it has completely dropped off my radar. I maintain many projects and things slip by.
As for this patch, I think mechanically this mostly looks good, subject to my comments. However, I do have a concern: this only exposes the ability to do "hard" partial matching and leaves no way to do "soft" partial matching. I suppose "soft" could be enabled via an option on the builder and added to the crate at a later date. (And this option would only be used when performing a partial match.)
Moreover, this doesn't seem to expose other information. For example, when a partial match occurs, is it possible to get the position of that partial match? If so, then the binary methods exposed here might not be enough.
Popping up a level, I can understand that you may not need this additional functionality and therefore do not see a reason to add it. If so, I think I'm okay with that and folks can just add things later as they need it. This strategy risks converging to a local optimum in API design though. But I think I'm okay with that for now.
src/bytes.rs
Outdated
@@ -427,6 +427,25 @@ impl Regex { | |||
self.is_match_at(subject, 0) | |||
} | |||
|
|||
/// Returns true if and only if the regex fully or partially matches the subject string given. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code and comments should be wrapped to 79 columns (inclusive).
/// | ||
/// # Example | ||
/// | ||
/// Test if given string can be a beginning of a valid telephone number: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add a blank line before the code block.
// `disable_utf_check` method, which propagates the safety contract to | ||
// the caller. | ||
Ok(unsafe { match_data.find(&self.code, subject, start, options)? }) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks to me like this code might benefit from a slight refactor. Namely, is_match_at
and is_partial_match_at
appear to only differ in implementation by a single thing: whether PCRE2_PARTIAL_HARD
is set or not. A little code repetition doesn't normally bother me, but in this case, we can reduce the number of unsafe
usages by 1 since I don't believe partial matching has any relevance to safety.
So I'd say, create a private is_match_at_imp
routine with options
as a parameter, and then call it from each of these functions with the appropriate initial condition.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, will do.
src/ffi.rs
Outdated
@@ -93,7 +93,7 @@ impl Code { | |||
/// an error. | |||
pub fn jit_compile(&mut self) -> Result<(), Error> { | |||
let error_code = unsafe { | |||
pcre2_jit_compile_8(self.code, PCRE2_JIT_COMPLETE) | |||
pcre2_jit_compile_8(self.code, PCRE2_JIT_COMPLETE | PCRE2_JIT_PARTIAL_HARD) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this enable partial matching unconditionally when using the jit? If not, please add a comment and a test verifying that it doesn't. If it does, then we'll need to rethink how we do this because it implies that partial matching isn't just a search-time option, but also a compile-time option.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The documentation says:
If you want to use partial matching with just-in-time optimized code, as well as setting a partial match option for the matching function, you must also call pcre2_jit_compile() with one or both of these options:
PCRE2_JIT_PARTIAL_HARD
PCRE2_JIT_PARTIAL_SOFT
PCRE2_JIT_COMPLETE should also be set if you are going to run non-partial matches on the same pattern. Separate code is compiled for each mode. If the appropriate JIT mode has not been compiled, interpretive matching code is used.
So I read it, as indeed, this would cause more compile-time overhead, and it is good to give users control over which modes they want to enable.
What about changing this to:
pub fn jit_compile(&mut self, &JitOptions options)
but then this would break backward compatibility :(
I'm open to ideas here.
@@ -427,6 +427,9 @@ impl MatchData { | |||
); | |||
if rc == PCRE2_ERROR_NOMATCH { | |||
Ok(false) | |||
} else if rc == PCRE2_ERROR_PARTIAL && | |||
options & (PCRE2_PARTIAL_HARD | PCRE2_PARTIAL_SOFT) != 0 { | |||
Ok(true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose this behavior should be documented in this function's contract.
As for adding support for both HARD and SOFT partial matching, this semantic difference would be only for methods like |
Yes, but for this I guess we need to modify additional methods - like enum PartialMatchOptions {
Complete, PartialSoft, PartialHard
} |
If you don't need these, how about we punt for now? I would probably lean towards passing an "options" type, but would suspect that pcre2 has more things we would want to consider there in addition to just partial matching. So that API will take some design work. |
Makes sense! |
This is ok. I don't want to change too much. What about jit_compile - should I make another method, |
I don't believe it's part of the public API, so I suppose just accepting an option param is fine. |
Improved formatting, removed duplicate code, PCRE2_JIT_PARTIAL_HARD option is now configurable and disabled by default.
Hey @BurntSushi, did you have a chance to take a look at my update? |
This feature is super important for my application. Thanks for the PR @pkolaczk; I'll probably use this branch in the meantime. |
This seems wrong. If there is a complete match and no partial match (i.e., if Otherwise there's no way for an application to know, given that the subject string already contains a match, whether or not that match might change upon adding more text to the end of the string. The circumstances are few and far between where it would be desirable for a caller to handle partial matches the same way as complete matches. |
New methods:
Regex::is_partial_match
Regex::is_partial_match_at