[smolvlm] fix video inference #39147


Merged
merged 4 commits on Jul 2, 2025

Conversation

zucchini-nlp
Member

What does this PR do?

Fixes #39006. The model actually had default values for sampling, so the flag has to be set to True for backward compatibility (BC).

Added a small test; we had no video tests, which is why the bug went unnoticed.
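
For context, the kind of video inference the new test exercises looks roughly like the sketch below; the checkpoint name, prompt, and generation settings are illustrative and not taken from this PR:

    import torch
    from transformers import AutoModelForImageTextToText, AutoProcessor

    # Illustrative checkpoint; the actual test may use a different one.
    model_id = "HuggingFaceTB/SmolVLM2-2.2B-Instruct"
    processor = AutoProcessor.from_pretrained(model_id)
    model = AutoModelForImageTextToText.from_pretrained(model_id, torch_dtype=torch.bfloat16)

    messages = [
        {
            "role": "user",
            "content": [
                {"type": "video", "path": "sample_video.mp4"},  # local path or URL
                {"type": "text", "text": "Describe this video."},
            ],
        }
    ]

    inputs = processor.apply_chat_template(
        messages,
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
    )
    output_ids = model.generate(**inputs, do_sample=False, max_new_tokens=64)
    print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])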

@zucchini-nlp zucchini-nlp requested review from pcuenca and qubvel July 1, 2025 11:36
@pcuenca
Member

pcuenca commented Jul 1, 2025

Nice! Generation works now, although greedy generation does not match the results from v4.51.3. I assume this may be because we are now defaulting to bicubic resampling in the fast processor, whereas we were using Lanczos resampling before?

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@pcuenca
Member

pcuenca commented Jul 1, 2025

We still read all the video frames here (shape using the repro above: (559, 730, 1920, 3)). I think that was not the case in previous versions.

@pcuenca
Member

pcuenca commented Jul 1, 2025

Shape in the previous (v4.51.3) read_video_pyav (which lived in image_utils.py): (9, 730, 1920, 3).

@zucchini-nlp
Member Author

We still read all the video frames here (shape using the repro above: (559, 730, 1920, 3)). I think that was not the case in previous versions.

Yes, this is expected, though I didn't think about the extra RAM usage when we decode the whole video. I will open a follow-up PR allowing video processors to accept URLs/paths directly as input and decode/sample videos before preprocessing.
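
For illustration, sampling frame indices before decoding (rather than decoding all 559 frames and filtering afterwards) could look roughly like this PyAV sketch; it is not the actual transformers implementation, and the frame count is just an example:

    import av
    import numpy as np

    def read_sampled_frames(path, num_frames=9):
        # Decode only `num_frames` evenly spaced frames instead of the whole clip.
        container = av.open(path)
        stream = container.streams.video[0]
        total = stream.frames  # note: can be 0 for some containers/codecs
        indices = set(np.linspace(0, max(total - 1, 0), num_frames).astype(int).tolist())
        frames = []
        for i, frame in enumerate(container.decode(stream)):
            if i in indices:
                frames.append(frame.to_ndarray(format="rgb24"))
        container.close()
        return np.stack(frames)  # e.g. (9, 730, 1920, 3) instead of (559, 730, 1920, 3)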

@pcuenca
Member

pcuenca commented Jul 1, 2025

Also unrelated to this PR: even when we use use_fast=False, we get the following warning from here:

            logger.warning_once(
                "You have used fast image processor with LANCZOS resample which not yet supported for torch.Tensor. "
                "BICUBIC resample will be used as an alternative. Please fall back to image processor if you "
                "want full consistency with the original model."
            )

I'm not sure this is actionable: how can the user fall back to using the image processor instead?

@pcuenca
Member

pcuenca commented Jul 1, 2025

this is expected

Noting that this broke in v4.52. Previously, we passed the sampling function here, and then frames were filtered.

@zucchini-nlp zucchini-nlp added the for patch Tag issues / labels that should be included in the next patch label Jul 1, 2025
@qubvel
Member

qubvel commented Jul 1, 2025

logger.warning_once(
"You have used fast image processor with LANCZOS resample which not yet supported for torch.Tensor. "
"BICUBIC resample will be used as an alternative. Please fall back to image processor if you "
"want full consistency with the original model."
)

Does it mean we should use the slow ImageProcessor instead of the fast VideoProcessor? A code snippet in the warning would probably clarify it, e.g.:

    processor = *VideoProcessor.from_pretrained(...)  ->  processor = *ImageProcessor.from_pretrained(..., use_fast=False)

@zucchini-nlp
Member Author

I'm not sure if this is actionable, how can the user fall back to using the image processor instead?

Yeah, this is one thing that is not really actionable, since videos have no slow processing option and we'll gradually be defaulting to only fast processors. Maybe we can delete that warning; I remember @yonigozlan wanted to ask the SmolVLM team about defaulting to bilinear interpolation for these models.

@pcuenca
Member

pcuenca commented Jul 1, 2025

Does it mean we should use slow ImageProcessor instead of fast VideoProcessor? Probably a code snippet in the warning should clarify it.

How? The only solution would be to instantiate the image processor and tokenizer manually, and then build SmolVLMProcessor with them. Is there another way? This is probably far from what the user was doing (AutoProcessor.from_pretrained, in my case).
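
Something like this untested sketch, I suppose (the exact SmolVLMProcessor constructor arguments may differ, and the checkpoint name is illustrative):

    from transformers import AutoImageProcessor, AutoTokenizer, SmolVLMProcessor

    model_id = "HuggingFaceTB/SmolVLM2-2.2B-Instruct"  # illustrative checkpoint
    # The slow (PIL-based) image processor keeps LANCZOS resampling.
    image_processor = AutoImageProcessor.from_pretrained(model_id, use_fast=False)
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    # Building the processor manually; additional arguments (e.g. a video processor
    # or chat template) may be needed depending on the use case.
    processor = SmolVLMProcessor(image_processor=image_processor, tokenizer=tokenizer)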

@zucchini-nlp
Member Author

The only solution would be to instantiate the image processor and tokenizer manually, and then build SmolVLMProcessor with them. Is there another way?

+1, I think we either just remove it (including in the image processor) or add a not-very-efficient resizing path by forcing PIL-based Lanczos. Personally, I prefer the first option, as we have only 3 such models (SmolVLM, Flava, Chameleon), though that means the models will not fully match their original implementations.

@pcuenca
Member

pcuenca commented Jul 1, 2025

cc @andimarafioti for info on lanczos vs bicubic.

(This doesn't have to be decided for this PR in my opinion)

Member

@pcuenca pcuenca left a comment


Looks good to merge and fix inference!

Remaining work for follow-up PRs could include:

  • Go back to using less memory when reading videos.
  • Reword the bicubic vs. Lanczos warning (it's not easily actionable), or provide a way to fall back to Lanczos.

Contributor

github-actions bot commented Jul 2, 2025

[For maintainers] Suggested jobs to run (before merge)

run-slow: smolvlm

@zucchini-nlp
Member Author

run-slow: smolvlm

Contributor

github-actions bot commented Jul 2, 2025

This comment contains run-slow, running the specified jobs:

models: ['models/smolvlm']
quantizations: [] ...

@zucchini-nlp zucchini-nlp merged commit 4d5822e into huggingface:main Jul 2, 2025
21 checks passed
Cyrilvallez pushed a commit that referenced this pull request Jul 4, 2025
* fix smolvlm

* better do as before, set sampling params in overwritten `apply_chat_template`

* style

* update with `setdefault`
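
For reference, the `setdefault` pattern the last commit refers to is roughly the following; the parameter names are illustrative assumptions, not the actual SmolVLM keyword arguments:

    def apply_chat_template_defaults(**processor_kwargs):
        # Fill defaults only when the caller did not provide them, so existing
        # call sites keep working (backward compatible).
        processor_kwargs.setdefault("do_sample_frames", True)  # assumed flag name
        processor_kwargs.setdefault("fps", 1)                   # assumed default rate
        return processor_kwargs

    print(apply_chat_template_defaults(fps=2))
    # {'fps': 2, 'do_sample_frames': True}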