Description
Hello, thanks for your nice work.
When running similar experiments on LLaVA-OV, I found that the method could not select the frames of interest.
The relevant code is in LongVU/longvu/cambrian_arch.py, lines 1411 to 1418 at commit 1ca4286.
When I print the norms of the visual features, I find that they vary widely, while the norms of the text features are much smaller:
Visual features norm: tensor(18.2188, device='cuda:0', dtype=torch.float16) tensor(296.5000, device='cuda:0', dtype=torch.float16) tensor(75.7500, device='cuda:0', dtype=torch.float16)
Text features norm: tensor(0.2739, device='cuda:0', dtype=torch.float16) tensor(0.8486, device='cuda:0', dtype=torch.float16) tensor(0.7085, device='cuda:0', dtype=torch.float16)
This means that a frame whose features have larger magnitudes gets a larger cross-modal attention score than the other frames, so that frame is the one selected. However, those magnitudes depend mostly on the vision_tower rather than on how relevant the frame is to the query.
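To illustrate the concern, here is a minimal sketch in plain Python. It is not the LongVU code: the scoring function (a mean of text-to-visual dot products), the helper names, and the toy vectors are all assumptions made for illustration, and the real scoring may include scaling or a softmax. It only shows that an unnormalized dot-product score can prefer a large-norm frame, while a cosine-normalized score prefers the better-aligned one.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

def frame_score(frame_tokens, text_tokens, normalize):
    """Mean text-to-visual similarity for one frame (hypothetical scoring)."""
    total = 0.0
    for v in frame_tokens:
        for t in text_tokens:
            s = dot(v, t)
            if normalize:
                # Cosine similarity: divide out both vector magnitudes.
                s /= norm(v) * norm(t)
            total += s
    return total / (len(frame_tokens) * len(text_tokens))

# Toy data (made up): small-norm text tokens, as in the printed norms.
text = [[0.6, 0.0], [0.5, 0.1]]
# Frame A points in nearly the same direction as the text, with a small norm.
frame_a = [[18.0, 1.0], [17.0, 0.0]]
# Frame B is poorly aligned with the text but has a much larger norm.
frame_b = [[150.0, 250.0], [100.0, 280.0]]

raw_a = frame_score(frame_a, text, normalize=False)
raw_b = frame_score(frame_b, text, normalize=False)
cos_a = frame_score(frame_a, text, normalize=True)
cos_b = frame_score(frame_b, text, normalize=True)

print("raw dot-product scores:", raw_a, raw_b)
print("cosine scores:", cos_a, cos_b)
```

With these toy vectors, the raw score ranks frame B first purely because of its magnitude, while the cosine score ranks frame A first. Whether L2-normalizing the features before scoring is the right fix for this method is an open question.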
So I think it may not be appropriate to compare cross-modal attention scores directly across frames. I hope I am not missing something important. What do you think? I would be glad to hear from you.
Best regards.