Added tokenize keyword arguments to feature extraction pipeline #19382
Conversation
The documentation is not available anymore as the PR was closed or merged.
Thanks for this.
Thanks for the tests too!
I added a remark to keep from breaking anything in user code.
```python
def _sanitize_parameters(self, tokenize_kwargs=None, **kwargs):
    preprocess_params = {}
    if truncation is not None:
        preprocess_params["truncation"] = truncation
```
Unfortunately we have to keep `truncation` in there, since it was already allowed, in order to NOT break anything.
Something like:

```python
def _sanitize_parameters(self, truncation=None, tokenize_kwargs=None, **kwargs):
    # Handle tokenize_kwargs first
    if tokenize_kwargs is None:
        tokenize_kwargs = {}
    if truncation is not None:
        if "truncation" in tokenize_kwargs:
            raise ValueError(
                "`truncation` is defined twice (as a keyword argument and within `tokenize_kwargs`)"
            )
        tokenize_kwargs["truncation"] = truncation
```
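As a standalone sketch of that merging rule (a hypothetical helper written for illustration, not part of `transformers`, assuming only that the legacy top-level `truncation` argument must fold into the new `tokenize_kwargs` dict):

```python
def merge_truncation(truncation=None, tokenize_kwargs=None):
    # Hypothetical helper mirroring the suggested _sanitize_parameters logic:
    # fold the legacy top-level `truncation` flag into `tokenize_kwargs`,
    # rejecting calls that specify it in both places.
    if tokenize_kwargs is None:
        tokenize_kwargs = {}
    else:
        tokenize_kwargs = dict(tokenize_kwargs)  # avoid mutating the caller's dict
    if truncation is not None:
        if "truncation" in tokenize_kwargs:
            raise ValueError("`truncation` is defined twice")
        tokenize_kwargs["truncation"] = truncation
    return tokenize_kwargs

print(merge_truncation(truncation=True))                        # {'truncation': True}
print(merge_truncation(tokenize_kwargs={"truncation": False}))  # {'truncation': False}
```

The copy via `dict(...)` is a small extra precaution so the caller's `tokenize_kwargs` is not mutated; the in-place version suggested above behaves the same for typical pipeline calls.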
@Narsil But the problem is that `truncation` is already a parameter of the tokenizer, so why should it be kept separately?
Because it was there before, users might (and probably HAVE) started using it.
And since we cannot break user code, we have to keep backward compatibility with it.
After checking, the broken tests are broken exactly by the lack of the `truncation` argument. Also, for quality, you should be able to run the repo's code-quality checks.
Cheers.
@Narsil I made the changes you indicated.
LGTM.
A couple of nits, but LGTM otherwise! Thanks a lot for working on this!
@sgugger I have moved the import to the top.
Thanks a lot!
What does this PR do?
The PR adds keyword arguments for the tokenizer to the feature extraction pipeline. Fixes #19374
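To illustrate the intended plumbing end to end (with stub classes and hypothetical names, not the real `transformers` implementations), the following sketch shows `tokenize_kwargs` flowing from the call site through `_sanitize_parameters` into the tokenizer, with the legacy `truncation` argument still accepted:

```python
class StubTokenizer:
    # Stand-in for a transformers tokenizer: records the kwargs it was
    # called with so we can observe what the pipeline forwarded.
    def __call__(self, text, **kwargs):
        self.last_kwargs = kwargs
        return {"input_ids": list(range(len(text.split())))}


class StubFeatureExtractionPipeline:
    # Hypothetical, minimal version of the pipeline's parameter plumbing.
    def __init__(self, tokenizer):
        self.tokenizer = tokenizer

    def _sanitize_parameters(self, truncation=None, tokenize_kwargs=None, **kwargs):
        if tokenize_kwargs is None:
            tokenize_kwargs = {}
        if truncation is not None:
            if "truncation" in tokenize_kwargs:
                raise ValueError("`truncation` is defined twice")
            tokenize_kwargs["truncation"] = truncation
        # Mirror the (preprocess, forward, postprocess) parameter split.
        return {"tokenize_kwargs": tokenize_kwargs}, {}, {}

    def preprocess(self, inputs, tokenize_kwargs=None):
        return self.tokenizer(inputs, **(tokenize_kwargs or {}))

    def __call__(self, inputs, **kwargs):
        preprocess_params, _, _ = self._sanitize_parameters(**kwargs)
        return self.preprocess(inputs, **preprocess_params)


tok = StubTokenizer()
pipe = StubFeatureExtractionPipeline(tok)
pipe("some long input text", tokenize_kwargs={"truncation": True, "max_length": 4})
print(tok.last_kwargs)  # {'truncation': True, 'max_length': 4}
```

With the real pipeline the call would look the same, e.g. `extractor(text, tokenize_kwargs={"truncation": True})`, while the old `extractor(text, truncation=True)` style keeps working.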
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
@LysandreJik
@Narsil