Tokenizer returns float32 tensor for empty string input instead of long dtype #38417
Open
Goer17 opened this issue May 28, 2025 · 4 comments · May be fixed by #38421

Goer17 commented May 28, 2025

Hi team,

I found unexpected behavior when using the Hugging Face Transformers tokenizer with an empty string as input. When I run the following code:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B", trust_remote_code=True)
input_ids = tokenizer("", return_tensors="pt")['input_ids']
print(input_ids, input_ids.dtype)
# output:
# tensor([], size=(1, 0)) torch.float32

Potential impact:
If this float32 tensor is concatenated with other torch.long tensors using torch.cat, the result is promoted to float32, turning all token IDs into floats. This can break downstream code that expects integer token IDs.
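To make the promotion hazard concrete, here is a minimal sketch (the tensor values are made up for illustration; only the dtypes matter) of what happens when the empty float32 tensor meets real token IDs:

```python
import torch

# The shape/dtype the tokenizer currently returns for "" input:
empty_ids = torch.empty((1, 0), dtype=torch.float32)

# Some ordinary integer token IDs (values are arbitrary examples):
token_ids = torch.tensor([[101, 2023, 102]], dtype=torch.long)

# torch.cat applies type promotion: float32 wins over int64,
# so every token ID in the result becomes a float.
merged = torch.cat([empty_ids, token_ids], dim=1)
print(merged.dtype)  # torch.float32

# Defensive workaround: cast the empty tensor before concatenating.
safe = torch.cat([empty_ids.long(), token_ids], dim=1)
print(safe.dtype)  # torch.int64
```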

Could you please take a look? Thank you!

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B", trust_remote_code=True)

input_ids = tokenizer("", return_tensors="pt")['input_ids']
print(input_ids, input_ids.dtype, sep="\n")

# output:
# tensor([], size=(1, 0))
# torch.float32

Expected behavior

The tokenizer should always return torch.long tensors for input_ids, regardless of the input content.
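Until the upstream fix lands, callers can guard against the wrong dtype themselves. A minimal sketch (the helper name `as_token_ids` is hypothetical, not part of any library):

```python
import torch

def as_token_ids(t: torch.Tensor) -> torch.Tensor:
    """Hypothetical guard: coerce token-ID tensors back to torch.long."""
    return t if t.dtype == torch.long else t.to(torch.long)

# Mimics what tokenizer("") returns today: an empty float32 tensor.
empty = torch.empty((1, 0))
print(as_token_ids(empty).dtype)  # torch.int64
```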

@Goer17 Goer17 added the bug label May 28, 2025
@Flink-ddd

Hi! I'd like to work on this issue as my first contribution.

I've read through the codebase and identified the likely cause. I'm planning to submit a fix along with a simple test to ensure input_ids is returned as torch.long even for empty input.

Please let me know if it's okay for me to take this on.

@Flink-ddd

Hi! I've submitted a PR to fix this issue: #38421.
It ensures that input_ids is returned as a torch.long tensor even for empty string inputs.
Looking forward to your feedback. Thanks! ❤️

@Goer17

Goer17 commented May 28, 2025

Thank you for your prompt response and for providing a fix.
♥️♥️♥️

@Rocketknight1
Member

Hi @Goer17 and @Flink-ddd, thanks for the fix, but we actually have a PR for this already that fixes it for all tokenizers! See #36555
