[smolvlm] fix video inference #39147
Conversation
Nice! Generation works now, although greedy generation does not match the results from
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
We still read all the video frames here (shape using the repro above vs. the shape in the previous version, v4.51.3).
Yes, this is expected, though I didn't think about the extra RAM usage when we decode the whole video. I will open a subsequent PR allowing video processors to directly accept URLs/paths as input and decode/sample videos before preprocessing.
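A rough sketch of that follow-up idea, assuming a PyAV-based decoder: sample frames while streaming so the full clip is never materialized in RAM. The helper name, the uniform-sampling strategy, and the reliance on the container's frame count are illustrative assumptions, not the planned transformers API.

```python
import av
import numpy as np


def stream_sample_frames(path: str, num_frames: int = 8) -> np.ndarray:
    """Decode a video but keep only `num_frames` uniformly spaced frames."""
    with av.open(path) as container:
        stream = container.streams.video[0]
        total = stream.frames  # assumes the container reports a frame count
        keep = set(np.linspace(0, total - 1, num_frames).astype(int).tolist())
        frames = [
            frame.to_ndarray(format="rgb24")
            for i, frame in enumerate(container.decode(stream))
            if i in keep  # only sampled frames are kept in memory
        ]
    return np.stack(frames)  # (num_frames, H, W, 3), ready for the video processor
```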
Also unrelated to this PR: even when we use
I'm not sure this is actionable; how could the user fall back to using the image processor instead?
Noting that this broke in v4.52. Previously, we passed the sampling function here, and then frames were filtered.
Does it mean we should use slow processing?
Yeah, this is one thing that is not really actionable, since videos have no slow processing option and we'll gradually be defaulting to fast processors only. Maybe we can delete that; I remember @yonigozlan wanted to ask the smolVLM team about defaulting to bilinear interpolation for models.
How? The only solution would be to instantiate the image processor and tokenizer manually, and then build the processor from them (roughly as in the sketch below).
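A minimal sketch of that manual route, assuming a slow (PIL-based) SmolVLM image processor exists and that `SmolVLMProcessor` accepts these keyword arguments; the checkpoint name is also only illustrative.

```python
from transformers import AutoImageProcessor, AutoTokenizer, SmolVLMProcessor

model_id = "HuggingFaceTB/SmolVLM2-2.2B-Instruct"  # illustrative checkpoint

# use_fast=False requests the slow image processor, if one is available
image_processor = AutoImageProcessor.from_pretrained(model_id, use_fast=False)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Build the processor by hand instead of using AutoProcessor.from_pretrained
processor = SmolVLMProcessor(image_processor=image_processor, tokenizer=tokenizer)
```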
+1, I think we either just remove it (including in the image processor) or create a not-very-efficient fallback by forcing PIL-based resizing.
cc @andimarafioti for info on Lanczos vs. bicubic. (This doesn't have to be decided for this PR, in my opinion.)
Looks good to merge, and it fixes inference!
Remaining work for follow-up PRs could include:
- Go back to using less memory when reading videos.
- Reword the bicubic vs. Lanczos warning (it's not easily actionable), or provide a way to fall back to Lanczos.
[For maintainers] Suggested jobs to run (before merge): run-slow: smolvlm
run-slow: smolvlm
This comment contains run-slow, running the specified jobs: models: ['models/smolvlm']
* fix smolvlm
* better do as before, set sampling params in overwritten `apply_chat_template`
* style
* update with `setdefault`
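A hedged sketch of the approach those commits describe: the processor's `apply_chat_template` is overridden so that video-sampling defaults are filled in with `setdefault`, leaving caller-supplied values untouched. The base class, parameter names, and default values below are stand-ins for illustration, not the actual implementation.

```python
class _BaseProcessor:
    def apply_chat_template(self, conversation, **kwargs):
        # Stand-in for the real ProcessorMixin.apply_chat_template.
        return conversation, kwargs


class SmolVLMProcessorSketch(_BaseProcessor):
    def apply_chat_template(self, conversation, **kwargs):
        # setdefault only fills missing keys, so explicit user overrides
        # still take precedence over the model's sampling defaults.
        kwargs.setdefault("do_sample_frames", True)  # assumed flag name
        kwargs.setdefault("fps", 1)                  # assumed default value
        return super().apply_chat_template(conversation, **kwargs)
```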
What does this PR do?
Fixes #39006. The model actually had default values for sampling, and thus the flag has to be set to `True` for BC.
Added a small test; we had no video tests, so the bug went unnoticed.
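For reference, a minimal video-inference sketch in the spirit of the new test; the checkpoint name, clip path, and generation settings are assumptions for illustration. The point is that with the fix, the frame-sampling defaults are applied inside `apply_chat_template` without the user setting them explicitly.

```python
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "HuggingFaceTB/SmolVLM2-2.2B-Instruct"  # illustrative checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(model_id, torch_dtype=torch.bfloat16)

messages = [{
    "role": "user",
    "content": [
        {"type": "video", "path": "sample_video.mp4"},  # any local clip
        {"type": "text", "text": "Describe this video."},
    ],
}]

# Frame-sampling defaults are applied inside apply_chat_template,
# so no extra sampling kwargs are needed for the pre-4.52 behavior.
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device, dtype=torch.bfloat16)

generated = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(generated, skip_special_tokens=True)[0])
```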