[Benchmark] Support CVQA #1176


Open · wants to merge 2 commits into main
Conversation

timothycdc

Adds support for the Cultural VQA (CVQA) benchmark.
Despite 'VQA' being in the name, CVQA is actually multiple-choice.

[Screenshot attached, 2025-07-25]

I used the dataset's region+language columns (e.g. Japan, Japanese) as the l2-category split.

Original dataset link
TSV converted version
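For context, the l2-category construction described above could look roughly like the following sketch. The column names (`language`, `region`, `l2-category`) are illustrative assumptions, not the exact CVQA/TSV schema:

```python
import pandas as pd

# Hypothetical sketch: derive an l2-category split from the dataset's
# language and region fields, mirroring the "('Japanese', 'Japan')"
# split labels shown in the results below. Column names are assumed.
def add_l2_category(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    df["l2-category"] = df.apply(
        lambda r: f"('{r['language']}', '{r['region']}')", axis=1
    )
    return df

df = pd.DataFrame(
    {"language": ["Japanese", "Spanish"], "region": ["Japan", "Spain"]}
)
df = add_l2_category(df)
```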

@FangXinyu-0913 (Collaborator)

Hi @timothycdc, thanks for your contribution to our community. Here are the results of the test on our side for your reference. If this looks okay to you, we will merge this PR.


| split | test |
| --- | --- |
| Overall | 0.3877964141122036 |
| ('Amharic', 'Ethiopia') | 0.24786324786324787 |
| ('Bengali', 'India') | 0.18181818181818182 |
| ('Breton', 'France') | 0.2962962962962963 |
| ('Bulgarian', 'Bulgaria') | 0.3692722371967655 |
| ('Chinese', 'China') | 0.4212218649517685 |
| ('Chinese', 'Singapore') | 0.3915094339622642 |
| ('Egyptian_Arabic', 'Egypt') | 0.1724137931034483 |
| ('Filipino', 'Philippines') | 0.4482758620689655 |
| ('Hindi', 'India') | 0.3283582089552239 |
| ('Igbo', 'Nigeria') | 0.325 |
| ('Indonesian', 'Indonesia') | 0.4029126213592233 |
| ('Irish', 'Ireland') | 0.4110429447852761 |
| ('Japanese', 'Japan') | 0.3103448275862069 |
| ('Javanese', 'Indonesia') | 0.3602693602693603 |
| ('Kinyarwanda', 'Rwanda') | 0.33617021276595743 |
| ('Korean', 'South Korea') | 0.3724137931034483 |
| ('Malay', 'Malaysia') | 0.473015873015873 |
| ('Marathi', 'India') | 0.24257425742574257 |
| ('Minangkabau', 'Indonesia') | 0.350597609561753 |
| ('Mongolian', 'Mongolia') | 0.22115384615384615 |
| ('Norwegian', 'Norway') | 0.5418060200668896 |
| ('Oromo', 'Ethiopia') | 0.38317757009345793 |
| ('Portuguese', 'Brazil') | 0.5985915492957746 |
| ('Romanian', 'Romania') | 0.5198675496688742 |
| ('Russian', 'Russia') | 0.55 |
| ('Sinhala', 'Sri_Lanka') | 0.21777777777777776 |
| ('Spanish', 'Argentina') | 0.5471698113207547 |
| ('Spanish', 'Chile') | 0.5897435897435898 |
| ('Spanish', 'Colombia') | 0.5311203319502075 |
| ('Spanish', 'Ecuador') | 0.5552486187845304 |
| ('Spanish', 'Mexico') | 0.4613003095975232 |
| ('Spanish', 'Spain') | 0.660377358490566 |
| ('Spanish', 'Uruguay') | 0.44126984126984126 |
| ('Sundanese', 'Indonesia') | 0.35 |
| ('Swahili', 'Kenya') | 0.37362637362637363 |
| ('Tamil', 'India') | 0.18691588785046728 |
| ('Telugu', 'India') | 0.27 |
| ('Urdu', 'India') | 0.2 |
| ('Urdu', 'Pakistan') | 0.10648148148148148 |
| Brands / products / companies | 0.3910355486862442 |
| Cooking and food | 0.35877862595419846 |
| Geography / buildings / landmarks | 0.4 |
| Objects / materials / clothing | 0.3337531486146096 |
| People and everyday life | 0.4195624195624196 |
| Plants and animal | 0.36418816388467373 |
| Public Figure and pop culture | 0.41467576791808874 |
| Sports and recreation | 0.41372141372141374 |
| Traditions / art / history | 0.39231738035264485 |
| Vehicles and Transportation | 0.4074074074074074 |


BTW, the SYSPROMPT in the PR is not used in the actual evaluation. If you want it to be applied for all models, you can write a build_prompt function that incorporates this prompt and put it under the class CVQA.
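A minimal sketch of what that could look like. The base class here is a stand-in for the toolkit's multiple-choice dataset class, and the message format (a list of `{"type": ..., "value": ...}` dicts) and method names follow the convention suggested above but are not verified against this PR:

```python
SYSPROMPT = "Answer with the option's letter from the given choices directly."

class _BaseMCQ:
    # Stand-in for the toolkit's multiple-choice dataset base class,
    # included only so this sketch is self-contained and runnable.
    def build_prompt(self, line):
        return [
            {"type": "image", "value": line["image_path"]},
            {"type": "text", "value": line["question"]},
        ]

class CVQA(_BaseMCQ):
    def build_prompt(self, line):
        # Build the default image + question messages, then prepend the
        # system-style instruction to the text portion so every model sees it.
        msgs = super().build_prompt(line)
        for msg in msgs:
            if msg["type"] == "text":
                msg["value"] = f"{SYSPROMPT}\n{msg['value']}"
        return msgs
```

Overriding `build_prompt` on the dataset class keeps the instruction model-agnostic, which is the point of the suggestion: no per-model prompt wiring is needed.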

@FangXinyu-0913 FangXinyu-0913 self-assigned this Jul 29, 2025
timothycdc (Author) commented Jul 29, 2025

Thanks @FangXinyu-0913. Which model did you use for this? The score seems a bit lower than normal.

I will double-check the accuracy calculation code and fix the SYSPROMPT.
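For reference, the per-split numbers in the table above are consistent with a plain mean over a 0/1 hit column grouped by split. This is an illustrative sketch only; the column names are assumptions, not the exact schema of the evaluation code being double-checked:

```python
import pandas as pd

# Illustrative per-split accuracy: mean of a 0/1 "hit" column per split,
# plus an overall mean. Column names ("split", "hit") are assumed.
df = pd.DataFrame({
    "split": ["('Japanese', 'Japan')", "('Japanese', 'Japan')",
              "('Spanish', 'Spain')", "('Spanish', 'Spain')"],
    "hit": [1, 0, 1, 1],  # 1 = model's choice matched the ground truth
})
per_split = df.groupby("split")["hit"].mean()
overall = df["hit"].mean()
```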

@FangXinyu-0913 (Collaborator)

> Thanks @FangXinyu-0913. Which model did you use for this? The score seems a bit lower than normal.
>
> I will double-check the accuracy code and fix the SYSPROMPT.

I used the llava_v1.5_7b model and got this result.
