Tokenizer returns float32 tensor for empty string input instead of long dtype #38417
Comments
Hi! I'd like to work on this issue as my first contribution. I've read through the codebase and identified the likely cause. I'm planning to submit a fix along with a simple test to ensure correct behavior. Please let me know if it's okay for me to take this on.

Hi! I've submitted a PR to fix this issue: #38421.

Thank you for your prompt response and for providing a fix.

Hi @Goer17 and @Flink-ddd, thanks for the fix, but we actually have a PR for this already that fixes it for all tokenizers! See #36555.
Hi team,
I found unexpected behavior when using the HuggingFace Transformers tokenizer with an empty string as input: tokenizing an empty string with return_tensors="pt" produces an input_ids tensor with dtype torch.float32 instead of torch.long (see the Reproduction section below).
Potential impact:
For example, if this float32 tensor is concatenated with other torch.long tensors using torch.cat, the result will be promoted to float32, causing all token IDs to become floats. This can break downstream code that expects integer token IDs.
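A small illustration of the hazard just described (not taken from the original report): concatenating the float32 result with ordinary long token IDs silently promotes everything to float32.

```python
import torch

empty_ids = torch.tensor([])  # float32, as the tokenizer returns for ""
real_ids = torch.tensor([15496, 995], dtype=torch.long)  # ordinary token IDs
merged = torch.cat([empty_ids, real_ids])
print(merged.dtype)  # torch.float32 -- the IDs are no longer valid long indices
```

A float tensor like this will then raise an error when used as indices, for example in an embedding lookup, which requires long dtype.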
Could you please take a look? Thank you!
Who can help?
No response
Information

Tasks

- An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)

Reproduction
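The original code snippet did not survive the page extraction; the following is a minimal sketch of the reported behavior. The gpt2 checkpoint is an assumption here — any tokenizer that produces zero tokens for an empty string should show the same dtype.

```python
# Hypothetical reconstruction -- the original snippet was lost in extraction.
# Assumes the gpt2 checkpoint, whose tokenizer adds no special tokens by
# default, so an empty string yields an empty list of token IDs.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Non-empty input: input_ids has the expected integer dtype.
encoded = tokenizer("hello world", return_tensors="pt")
print(encoded["input_ids"].dtype)  # torch.int64

# Empty input: the empty ID list is converted via torch.tensor([]),
# which defaults to torch.float32 -- the reported bug.
encoded_empty = tokenizer("", return_tensors="pt")
print(encoded_empty["input_ids"].dtype)  # torch.float32
```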
Expected behavior
The tokenizer should always return torch.long tensors for input_ids, regardless of the input content.
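For reference, a minimal check of this expected behavior might look like the sketch below — a hypothetical regression test, not the one from the linked PRs, again assuming the gpt2 checkpoint.

```python
import torch
from transformers import AutoTokenizer

def test_empty_string_returns_long_input_ids():
    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    encoded = tokenizer("", return_tensors="pt")
    # dtype should stay torch.long even when no tokens are produced
    assert encoded["input_ids"].dtype == torch.long
```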