Added tokenize keyword arguments to feature extraction pipeline #19382
Conversation
The documentation is not available anymore as the PR was closed or merged.
Thanks for this.
Thanks for the tests too!
I added a remark to keep from breaking anything in user code.
```python
def _sanitize_parameters(self, tokenize_kwargs=None, **kwargs):
    preprocess_params = {}
    if truncation is not None:
        preprocess_params["truncation"] = truncation
```
Unfortunately we have to keep `truncation` in there, since it was already allowed, in order to NOT break anything.
Something like:

```python
def _sanitize_parameters(self, truncation=None, tokenize_kwargs=None, **kwargs):
    # Handle tokenize_kwargs first
    if tokenize_kwargs is None:
        tokenize_kwargs = {}
    if truncation is not None:
        if "truncation" in tokenize_kwargs:
            raise ValueError(
                "`truncation` is defined twice (as a keyword argument and within `tokenize_kwargs`)"
            )
        tokenize_kwargs["truncation"] = truncation
```
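As a standalone sketch of that merging rule (a hypothetical helper written for illustration, not part of `transformers`, assuming only that the legacy top-level `truncation` argument must fold into the new `tokenize_kwargs` dict):

```python
def merge_truncation(truncation=None, tokenize_kwargs=None):
    # Hypothetical helper mirroring the suggested _sanitize_parameters logic:
    # fold the legacy top-level `truncation` flag into `tokenize_kwargs`,
    # rejecting calls that specify it in both places.
    if tokenize_kwargs is None:
        tokenize_kwargs = {}
    else:
        tokenize_kwargs = dict(tokenize_kwargs)  # avoid mutating the caller's dict
    if truncation is not None:
        if "truncation" in tokenize_kwargs:
            raise ValueError("`truncation` is defined twice")
        tokenize_kwargs["truncation"] = truncation
    return tokenize_kwargs

print(merge_truncation(truncation=True))                        # {'truncation': True}
print(merge_truncation(tokenize_kwargs={"truncation": False}))  # {'truncation': False}
```

The copy via `dict(...)` is a small extra precaution so the caller's `tokenize_kwargs` is not mutated; the in-place version suggested above behaves the same for typical pipeline calls.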
@Narsil But the problem is that `truncation` is already a parameter of the tokenizer, so why should it be kept separately?
Because it was there before, users might (and probably HAVE) started using it.
And since we cannot break user code, we have to keep backward compatibility with it.
After checking, the broken tests are broken exactly by the lack of the `truncation` argument. Also, for quality, you should be able to run the repo's code-quality checks.
Cheers.
@Narsil I made the changes you indicated.
LGTM.
A couple of nits, but LGTM otherwise! Thanks a lot for working on this!
@sgugger I have moved the import to the top.
Thanks a lot!
What does this PR do?
The PR adds keyword arguments for the tokenizer to the feature extraction pipeline. Fixes #19374
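To illustrate the intended plumbing end to end (with stub classes and hypothetical names, not the real `transformers` implementations), the following sketch shows `tokenize_kwargs` flowing from the call site through `_sanitize_parameters` into the tokenizer, with the legacy `truncation` argument still accepted:

```python
class StubTokenizer:
    # Stand-in for a transformers tokenizer: records the kwargs it was
    # called with so we can observe what the pipeline forwarded.
    def __call__(self, text, **kwargs):
        self.last_kwargs = kwargs
        return {"input_ids": list(range(len(text.split())))}


class StubFeatureExtractionPipeline:
    # Hypothetical, minimal version of the pipeline's parameter plumbing.
    def __init__(self, tokenizer):
        self.tokenizer = tokenizer

    def _sanitize_parameters(self, truncation=None, tokenize_kwargs=None, **kwargs):
        if tokenize_kwargs is None:
            tokenize_kwargs = {}
        if truncation is not None:
            if "truncation" in tokenize_kwargs:
                raise ValueError("`truncation` is defined twice")
            tokenize_kwargs["truncation"] = truncation
        # Mirror the (preprocess, forward, postprocess) parameter split.
        return {"tokenize_kwargs": tokenize_kwargs}, {}, {}

    def preprocess(self, inputs, tokenize_kwargs=None):
        return self.tokenizer(inputs, **(tokenize_kwargs or {}))

    def __call__(self, inputs, **kwargs):
        preprocess_params, _, _ = self._sanitize_parameters(**kwargs)
        return self.preprocess(inputs, **preprocess_params)


tok = StubTokenizer()
pipe = StubFeatureExtractionPipeline(tok)
pipe("some long input text", tokenize_kwargs={"truncation": True, "max_length": 4})
print(tok.last_kwargs)  # {'truncation': True, 'max_length': 4}
```

With the real pipeline the call would look the same, e.g. `extractor(text, tokenize_kwargs={"truncation": True})`, while the old `extractor(text, truncation=True)` style keeps working.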
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
@LysandreJik
@Narsil