Skip to content

Partial match support #13

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 3 commits into
base: master
Choose a base branch
from
Open

Conversation

pkolaczk
Copy link

@pkolaczk pkolaczk commented May 9, 2020

New methods:
Regex::is_partial_match
Regex::is_partial_match_at

@pkolaczk pkolaczk mentioned this pull request May 9, 2020
New methods:
Regex::is_partial_match
Regex::is_partial_match_at
@pkolaczk
Copy link
Author

pkolaczk commented Feb 21, 2021

@BurntSushi Can we have this merged and published? This is the only unpublished change that blocks publishing fclones to crates.io. Thank you!

Copy link
Owner

@BurntSushi BurntSushi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I've left some comments. In the future, it's okay to ping me every few months. If a PR has gone this long without getting my attention, it's probably safe to say that it has completely dropped off my radar. I maintain many projects and things slip by.

As for this patch, I think mechanically this mostly looks good, subject to my comments. However, I do have a concern: this only exposes the ability to do "hard" partial matching and leaves no way to do "soft" partial matching. I suppose "soft" could be enabled via an option on the builder and added to the crate at a later date. (And this option would only be used when performing a partial match.)

Moreover, this doesn't seem to expose other information. For example, when a partial match occurs, is it possible to get the position of that partial match? If so, then the binary methods exposed here might not be enough.

Popping up a level, I can understand that you may not need this additional functionality and therefore do not see a reason to add it. If so, I think I'm okay with that and folks can just add things later as they need it. This strategy risks converging to a local optimum in API design though. But I think I'm okay with that for now.

src/bytes.rs Outdated
@@ -427,6 +427,25 @@ impl Regex {
self.is_match_at(subject, 0)
}

/// Returns true if and only if the regex fully or partially matches the subject string given.
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code and comments should be wrapped to 79 columns (inclusive).

///
/// # Example
///
/// Test if given string can be a beginning of a valid telephone number:
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a blank line before the code block.

// `disable_utf_check` method, which propagates the safety contract to
// the caller.
Ok(unsafe { match_data.find(&self.code, subject, start, options)? })
}
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks to me like this code might benefit from a slight refactor. Namely, is_match_at and is_partial_match_at appear to only differ in implementation by a single thing: whether PCRE2_PARTIAL_HARD is set or not. A little code repetition doesn't normally bother me, but in this case, we can reduce the number of unsafe usages by 1 since I don't believe partial matching has any relevance to safety.

So I'd say, create a private is_match_at_imp routine with options as a parameter, and then call it from each of these functions with the appropriate initial condition.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, will do.

src/ffi.rs Outdated
@@ -93,7 +93,7 @@ impl Code {
/// an error.
pub fn jit_compile(&mut self) -> Result<(), Error> {
let error_code = unsafe {
pcre2_jit_compile_8(self.code, PCRE2_JIT_COMPLETE)
pcre2_jit_compile_8(self.code, PCRE2_JIT_COMPLETE | PCRE2_JIT_PARTIAL_HARD)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this enable partial matching unconditionally when using the jit? If not, please add a comment and a test verifying that it doesn't. If it does, then we'll need to rethink how we do this because it implies that partial matching isn't just a search-time option, but also a compile-time option.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The documentation says:

 If you want to use partial matching with just-in-time optimized code, as well as setting a partial match option for the matching function, you must also call pcre2_jit_compile() with one or both of these options:

  PCRE2_JIT_PARTIAL_HARD
  PCRE2_JIT_PARTIAL_SOFT

PCRE2_JIT_COMPLETE should also be set if you are going to run non-partial matches on the same pattern. Separate code is compiled for each mode. If the appropriate JIT mode has not been compiled, interpretive matching code is used. 

So I read it, as indeed, this would cause more compile-time overhead, and it is good to give users control over which modes they want to enable.

What about changing this to:

pub fn jit_compile(&mut self, &JitOptions options)

but then this would break backward compatibility :(
I'm open to ideas here.

@@ -427,6 +427,9 @@ impl MatchData {
);
if rc == PCRE2_ERROR_NOMATCH {
Ok(false)
} else if rc == PCRE2_ERROR_PARTIAL &&
options & (PCRE2_PARTIAL_HARD | PCRE2_PARTIAL_SOFT) != 0 {
Ok(true)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose this behavior should be documented in this function's contract.

@pkolaczk
Copy link
Author

As for adding support for both HARD and SOFT partial matching, this semantic difference would be only for methods like find, where information on the match position is returned. In case both complete and partial match exists, HARD prefers to return a partial match, where SOFT prefers a complete match. However, in either case, is_partial_match would return true.

@pkolaczk
Copy link
Author

pkolaczk commented Feb 21, 2021

Moreover, this doesn't seem to expose other information. For example, when a partial match occurs, is it possible to get the position of that partial match? If so, then the binary methods exposed here might not be enough.

Yes, but for this I guess we need to modify additional methods - like find.
Do you prefer to add more methods like find_partial_soft or find_partial_hard or would it be actually cleaner to introduce a parameter to existing methods:

enum PartialMatchOptions {
  Complete, PartialSoft, PartialHard
}

@BurntSushi
Copy link
Owner

Yes, but for this I guess we need to modify additional methods - like find.
Do you prefer to add more methods like find_partial_soft or find_partial_hard or would it be actually cleaner to introduce a parameter to existing methods:

If you don't need these, how about we punt for now? I would probably lean towards passing an "options" type, but would suspect that pcre2 has more things we would want to consider there in addition to just partial matching. So that API will take some design work.

@BurntSushi
Copy link
Owner

As for adding support for both HARD and SOFT partial matching, this semantic difference would be only for methods like find, where information on the match position is returned.

Makes sense!

@pkolaczk
Copy link
Author

pkolaczk commented Feb 21, 2021

If you don't need these, how about we punt for now?

This is ok. I don't want to change too much.

What about jit_compile - should I make another method, jit_compile_with_partial or enhance the existing one with an option argument?

@BurntSushi
Copy link
Owner

What about jit_compile - should I make another method, jit_compile_with_partial or enhance the existing one with an option argument?

I don't believe it's part of the public API, so I suppose just accepting an option param is fine.

Improved formatting, removed duplicate code,
PCRE2_JIT_PARTIAL_HARD option is now
configurable and disabled by default.
@pkolaczk
Copy link
Author

Hey @BurntSushi, did you have a chance to take a look at my update?

@maxbrunsfeld
Copy link

This feature is super important for my application. Thanks for the PR @pkolaczk; I'll probably use this branch in the meantime.

@bemoody
Copy link

bemoody commented Dec 11, 2022

This seems wrong. If there is a complete match and no partial match (i.e., if pcre2_match_8 is called with PCRE2_PARTIAL_HARD and returns > 0), then is_partial_match should return false, not true.

Otherwise there's no way for an application to know, given that the subject string already contains a match, whether or not that match might change upon adding more text to the end of the string.

The circumstances are few and far between where it would be desirable for a caller to handle partial matches the same way as complete matches.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants