
Feature Extraction for EgoVideo Ego4d NLQ #16

Open
@RainbowMan1

The technical report for EgoVideo on Ego4d NLQ says that the ViT-1B of EgoVideo is used to extract a video feature for each snippet, where each snippet contains s = 16 consecutive frames with stride = 16. However, as far as I can tell, the released 4-frame model does not encode 16 frames. Could you elaborate on exactly how the feature extraction was done?
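
To make the question concrete, here is a minimal Python sketch of the snippet layout described above (16 consecutive frames, stride 16). The helper names `snippet_windows` and `subsample_indices` are hypothetical. The uniform subsampling down to 4 frames is only an assumption about one possible way a 16-frame snippet could be fed to a 4-frame model. It is not taken from the report or the released code.

```python
def snippet_windows(num_frames, snippet_len=16, stride=16):
    """Return (start, end) frame index pairs (end exclusive) for each full snippet.

    With snippet_len == stride == 16 the snippets do not overlap.
    Trailing frames that do not fill a whole snippet are dropped.
    """
    last_start = num_frames - snippet_len
    return [(s, s + snippet_len) for s in range(0, last_start + 1, stride)]


def subsample_indices(start, end, num_samples=4):
    """ASSUMPTION: pick num_samples evenly spaced frame indices from [start, end).

    This is one guess at how a 16-frame snippet might be reduced to the
    4 frames the released model accepts; the actual procedure is unknown.
    """
    step = (end - start) / num_samples
    # Take the frame nearest the centre of each of the num_samples equal segments.
    return [start + int(step * i + step / 2) for i in range(num_samples)]


if __name__ == "__main__":
    for start, end in snippet_windows(num_frames=48):
        print((start, end), subsample_indices(start, end))
```

If the actual pipeline differs from this (for example, a separate 16-frame checkpoint, or averaging features over four 4-frame chunks), knowing which variant was used would answer the question.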
