[smolvlm] fix video inference #39147


Merged
merged 4 commits on Jul 2, 2025

Conversation

zucchini-nlp
Member

What does this PR do?

Fixes #39006. The model actually had default values for sampling, so the flag has to be set to True for backward compatibility (BC).

Added a small test; we had no video tests, which is why the bug went unnoticed.
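
For context, the kind of video inference the new test exercises looks roughly like the sketch below; the checkpoint name, prompt, and generation settings are illustrative and not taken from this PR:

    import torch
    from transformers import AutoModelForImageTextToText, AutoProcessor

    # Illustrative checkpoint; the actual test may use a different one.
    model_id = "HuggingFaceTB/SmolVLM2-2.2B-Instruct"
    processor = AutoProcessor.from_pretrained(model_id)
    model = AutoModelForImageTextToText.from_pretrained(model_id, torch_dtype=torch.bfloat16)

    messages = [
        {
            "role": "user",
            "content": [
                {"type": "video", "path": "sample_video.mp4"},  # local path or URL
                {"type": "text", "text": "Describe this video."},
            ],
        }
    ]

    inputs = processor.apply_chat_template(
        messages,
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
    )
    output_ids = model.generate(**inputs, do_sample=False, max_new_tokens=64)
    print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])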

@zucchini-nlp zucchini-nlp requested review from pcuenca and qubvel July 1, 2025 11:36
@pcuenca
Member

pcuenca commented Jul 1, 2025

Nice! Generation works now, although greedy generation does not match the results from v4.51.3. I assume this may be because we are now defaulting to bicubic resampling in the fast processor, whereas we were using Lanczos resampling before?

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@pcuenca
Member

pcuenca commented Jul 1, 2025

We still read all the video frames here (shape using the repro above: (559, 730, 1920, 3)). I think that was not the case in previous versions.

@pcuenca
Member

pcuenca commented Jul 1, 2025

Shape in the previous (v4.51.3) read_video_pyav (which lived in image_utils.py): (9, 730, 1920, 3).

@zucchini-nlp
Member Author

We still read all the video frames here (shape using the repro above: (559, 730, 1920, 3)). I think that was not the case in previous versions.

Yes, this is expected, though I didn't think about the extra RAM usage when we decode the whole video. I will open a follow-up PR allowing video processors to accept URLs/paths directly as input and decode/sample videos before preprocessing.
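
For illustration, sampling frame indices before decoding (rather than decoding all 559 frames and filtering afterwards) could look roughly like this PyAV sketch; it is not the actual transformers implementation, and the frame count is just an example:

    import av
    import numpy as np

    def read_sampled_frames(path, num_frames=9):
        # Decode only `num_frames` evenly spaced frames instead of the whole clip.
        container = av.open(path)
        stream = container.streams.video[0]
        total = stream.frames  # note: can be 0 for some containers/codecs
        indices = set(np.linspace(0, max(total - 1, 0), num_frames).astype(int).tolist())
        frames = []
        for i, frame in enumerate(container.decode(stream)):
            if i in indices:
                frames.append(frame.to_ndarray(format="rgb24"))
        container.close()
        return np.stack(frames)  # e.g. (9, 730, 1920, 3) instead of (559, 730, 1920, 3)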

@pcuenca
Member

pcuenca commented Jul 1, 2025

Also unrelated to this PR: even when we use use_fast=False, we get the following warning from here:

            logger.warning_once(
                "You have used fast image processor with LANCZOS resample which not yet supported for torch.Tensor. "
                "BICUBIC resample will be used as an alternative. Please fall back to image processor if you "
                "want full consistency with the original model."
            )

I'm not sure this is actionable: how can the user fall back to using the image processor instead?

@pcuenca
Member

pcuenca commented Jul 1, 2025

this is expected

Noting that this broke in v4.52. Previously, we passed the sampling function here, and then frames were filtered.

@zucchini-nlp zucchini-nlp added the for patch Tag issues / labels that should be included in the next patch label Jul 1, 2025
@qubvel
Member

qubvel commented Jul 1, 2025

logger.warning_once(
"You have used fast image processor with LANCZOS resample which not yet supported for torch.Tensor. "
"BICUBIC resample will be used as an alternative. Please fall back to image processor if you "
"want full consistency with the original model."
)

Does it mean we should use the slow ImageProcessor instead of the fast VideoProcessor? A code snippet in the warning would probably clarify it, e.g.:

    processor = *VideoProcessor.from_pretrained(...)  ->  processor = *ImageProcessor.from_pretrained(..., use_fast=False)

@zucchini-nlp
Member Author

I'm not sure if this is actionable, how can the user fall back to using the image processor instead?

Yeah, this is one thing that is not really actionable, since videos have no slow processing option and we'll gradually be defaulting to only fast processors. Maybe we can delete that warning; I remember @yonigozlan wanted to ask the SmolVLM team about defaulting to bilinear interpolation for these models.

@pcuenca
Member

pcuenca commented Jul 1, 2025

Does it mean we should use slow ImageProcessor instead of fast VideoProcessor? Probably a code snippet in the warning should clarify it.

How? The only solution would be to instantiate the image processor and tokenizer manually, and then build SmolVLMProcessor with them. Is there another way? This is probably far from what the user was doing (AutoProcessor.from_pretrained, in my case).
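
Something like this untested sketch, I suppose (the exact SmolVLMProcessor constructor arguments may differ, and the checkpoint name is illustrative):

    from transformers import AutoImageProcessor, AutoTokenizer, SmolVLMProcessor

    model_id = "HuggingFaceTB/SmolVLM2-2.2B-Instruct"  # illustrative checkpoint
    # The slow (PIL-based) image processor keeps LANCZOS resampling.
    image_processor = AutoImageProcessor.from_pretrained(model_id, use_fast=False)
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    # Building the processor manually; additional arguments (e.g. a video processor
    # or chat template) may be needed depending on the use case.
    processor = SmolVLMProcessor(image_processor=image_processor, tokenizer=tokenizer)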

@zucchini-nlp
Member Author

The only solution would be to instantiate the image processor and tokenizer manually, and then build SmolVLMProcessor with them. Is there another way?

+1, I think we either just remove it (including in the image processor) or add a not-very-efficient resizing path by forcing PIL-based Lanczos. Personally, I prefer the first option, as we have only 3 such models (SmolVLM, Flava, Chameleon), though that means the models will not fully match their original implementations.

@pcuenca
Member

pcuenca commented Jul 1, 2025

cc @andimarafioti for info on lanczos vs bicubic.

(This doesn't have to be decided for this PR in my opinion)

Member

@pcuenca pcuenca left a comment


Looks good to merge and fix inference!

Remaining work for follow-up PRs could include:

  • Go back to using less memory when reading videos.
  • Reword the bicubic vs. Lanczos warning (it's not easily actionable), or provide a way to fall back to Lanczos.

Contributor

github-actions bot commented Jul 2, 2025

[For maintainers] Suggested jobs to run (before merge)

run-slow: smolvlm

@zucchini-nlp
Member Author

run-slow: smolvlm

Contributor

github-actions bot commented Jul 2, 2025

This comment contains run-slow, running the specified jobs:

models: ['models/smolvlm']
quantizations: [] ...

@zucchini-nlp zucchini-nlp merged commit 4d5822e into huggingface:main Jul 2, 2025
21 checks passed
Cyrilvallez pushed a commit that referenced this pull request Jul 4, 2025
* fix smolvlm

* better do as before, set sampling params in overwritten `apply_chat_template`

* style

* update with `setdefault`
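
For reference, the `setdefault` pattern the last commit refers to is roughly the following; the parameter names are illustrative assumptions, not the actual SmolVLM keyword arguments:

    def apply_chat_template_defaults(**processor_kwargs):
        # Fill defaults only when the caller did not provide them, so existing
        # call sites keep working (backward compatible).
        processor_kwargs.setdefault("do_sample_frames", True)  # assumed flag name
        processor_kwargs.setdefault("fps", 1)                   # assumed default rate
        return processor_kwargs

    print(apply_chat_template_defaults(fps=2))
    # {'fps': 2, 'do_sample_frames': True}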