Question about Cross-modal Query #48

Open
@cokeshao

Description

Hello, thanks for your nice work.
When running similar experiments on LLaVA-OV, I found that the method couldn't find the frames of interest.

`LongVU/longvu/cambrian_arch.py`, lines 1411 to 1418 at commit `1ca4286`:

```python
sim = torch.matmul(visual_emb, text_emb.transpose(0, 1)).mean(
    dim=-1
)
sim_frame = sim.reshape(
    frame_split_sizes[cur_image_idx], -1
).mean(dim=-1)
highres_num = min(highres_num, sim_frame.shape[0])
top_values, top_indices = torch.topk(sim_frame, highres_num)
```

When I printed the norms of the vision features, I found that they vary over a wide range:

```
Visual features norm: tensor(18.2188, device='cuda:0', dtype=torch.float16) tensor(296.5000, device='cuda:0', dtype=torch.float16) tensor(75.7500, device='cuda:0', dtype=torch.float16)
Text features norm: tensor(0.2739, device='cuda:0', dtype=torch.float16) tensor(0.8486, device='cuda:0', dtype=torch.float16) tensor(0.7085, device='cuda:0', dtype=torch.float16)
```

That means a frame whose features have larger magnitudes gets a larger cross-modal similarity score than other frames, so it will be chosen; but the magnitude mostly depends on the vision_tower, not on relevance to the query.
So I think it may not be proper to compare the raw cross-modal similarity scores between frames. I hope I'm not missing something important. What do you think about it? I would be glad to receive your reply.
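For illustration, here is a minimal sketch of how L2-normalizing both sides before the dot product would make the per-frame scores depend on direction (cosine similarity) rather than raw feature norms. The function name `frame_scores` and the toy shapes are my own for this sketch; only the matmul/reshape/mean pattern follows the snippet above, and the normalization is a suggestion, not LongVU's actual code.

```python
import torch
import torch.nn.functional as F


def frame_scores(visual_emb, text_emb, num_frames):
    # visual_emb: (num_frames * tokens_per_frame, dim)
    # text_emb:   (num_text_tokens, dim)
    # L2-normalize both sides so each token pair contributes a cosine
    # similarity in [-1, 1], independent of its raw feature norm.
    v = F.normalize(visual_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    sim = torch.matmul(v, t.transpose(0, 1)).mean(dim=-1)  # per visual token
    return sim.reshape(num_frames, -1).mean(dim=-1)        # per frame


# Toy check: frame 1 is a 10x-scaled copy of frame 0's tokens.
# With normalization, both frames get the same score, so scale
# alone no longer decides which frame is selected by topk.
torch.manual_seed(0)
base = torch.randn(4, 8)
visual = torch.cat([base, base * 10.0], dim=0)  # 2 frames, 4 tokens each
text = torch.randn(3, 8)
scores = frame_scores(visual, text, num_frames=2)
assert torch.allclose(scores[0], scores[1], atol=1e-5)
```

Without the `F.normalize` calls, the second frame's score would be 10x the first's in this toy example, which is exactly the effect described above.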

Best regards.
