Open
Description
The EgoVideo technical report for Ego4D NLQ says that the EgoVideo ViT-1B is used to extract a video feature for each snippet, where a snippet contains s = 16 consecutive frames with stride = 16. However, as far as I can tell, the released 4-frame model does not encode 16 frames. Could you elaborate on how exactly the feature extraction was done?
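To make the question concrete, here is a minimal Python sketch of how I read the snippet layout (s = 16, stride = 16). The function names are hypothetical and not taken from the EgoVideo code. The `subsample` step is only my guess at one way a 16-frame snippet could be fed to a 4-frame model; I don't know whether this is what was actually done.

```python
def snippet_windows(num_frames, size=16, stride=16):
    """Return (start, end) frame-index pairs for every full snippet.

    With size == stride the snippets are non-overlapping.
    """
    return [(s, s + size) for s in range(0, num_frames - size + 1, stride)]


def subsample(start, end, k=4):
    """Pick k evenly spaced frame indices from [start, end).

    ASSUMPTION: uniform sampling at segment centers. This is a guess,
    not the method confirmed by the report or the repository.
    """
    step = (end - start) / k
    return [start + int(step * i + step / 2) for i in range(k)]


if __name__ == "__main__":
    for start, end in snippet_windows(48):
        print((start, end), subsample(start, end))
```

If the real pipeline instead interpolates the temporal position embeddings so that all 16 frames are encoded, or uses a different sampling rule, that is exactly the detail I am asking about.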