Tokenizer returns float32 tensor for empty string input instead of long dtype #38417
Open
Goer17 opened this issue May 28, 2025 · 4 comments · May be fixed by #38421

Goer17 commented May 28, 2025

Hi team,

I found unexpected behavior when using the Hugging Face Transformers tokenizer with an empty string as input. When I run the following code:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B", trust_remote_code=True)
input_ids = tokenizer("", return_tensors="pt")['input_ids']
print(input_ids, input_ids.dtype)
# output:
# tensor([], size=(1, 0)) torch.float32

Potential impact:
If this float32 tensor is concatenated with other torch.long tensors using torch.cat, the result is promoted to float32, turning all token IDs into floats. This can break downstream code that expects integer token IDs.
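To make the promotion hazard concrete, here is a minimal sketch (the tensor values are made up for illustration; only the dtypes matter) of what happens when the empty float32 tensor meets real token IDs:

```python
import torch

# The shape/dtype the tokenizer currently returns for "" input:
empty_ids = torch.empty((1, 0), dtype=torch.float32)

# Some ordinary integer token IDs (values are arbitrary examples):
token_ids = torch.tensor([[101, 2023, 102]], dtype=torch.long)

# torch.cat applies type promotion: float32 wins over int64,
# so every token ID in the result becomes a float.
merged = torch.cat([empty_ids, token_ids], dim=1)
print(merged.dtype)  # torch.float32

# Defensive workaround: cast the empty tensor before concatenating.
safe = torch.cat([empty_ids.long(), token_ids], dim=1)
print(safe.dtype)  # torch.int64
```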

Could you please take a look? Thank you!

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B", trust_remote_code=True)

input_ids = tokenizer("", return_tensors="pt")['input_ids']
print(input_ids, input_ids.dtype, sep="\n")

# output:
# tensor([], size=(1, 0))
# torch.float32

Expected behavior

The tokenizer should always return torch.long tensors for input_ids, regardless of the input content.
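Until the upstream fix lands, callers can guard against the wrong dtype themselves. A minimal sketch (the helper name `as_token_ids` is hypothetical, not part of any library):

```python
import torch

def as_token_ids(t: torch.Tensor) -> torch.Tensor:
    """Hypothetical guard: coerce token-ID tensors back to torch.long."""
    return t if t.dtype == torch.long else t.to(torch.long)

# Mimics what tokenizer("") returns today: an empty float32 tensor.
empty = torch.empty((1, 0))
print(as_token_ids(empty).dtype)  # torch.int64
```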

@Goer17 Goer17 added the bug label May 28, 2025
@Flink-ddd

Hi! I'd like to work on this issue as my first contribution.

I've read through the codebase and identified the likely cause. I'm planning to submit a fix along with a simple test to ensure input_ids is returned as torch.long even for empty input.

Please let me know if it's okay for me to take this on.

@Flink-ddd

Hi! I've submitted a PR to fix this issue: #38421.
It ensures that input_ids is returned as a torch.long tensor even for empty string inputs.
Looking forward to your feedback. Thanks! ❤️

@Goer17

Goer17 commented May 28, 2025

Thank you for your prompt response and for providing a fix.
♥️♥️♥️

@Rocketknight1
Member

Hi @Goer17 and @Flink-ddd, thanks for the fix, but we actually have a PR for this already that fixes it for all tokenizers! See #36555
