[Benchmark] Support CVQA #1176


Open · wants to merge 2 commits into main
Conversation

timothycdc

Adds support for the Cultural VQA (CVQA) benchmark.
Despite 'VQA' being in the name, CVQA is actually multiple-choice.

[Screenshot attached, 2025-07-25]

I used the dataset's region+language columns (e.g. Japan, Japanese) as the l2-category split.

Original dataset link
TSV converted version
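For context, the l2-category construction described above could look roughly like the following sketch. The column names (`language`, `region`, `l2-category`) are illustrative assumptions, not the exact CVQA/TSV schema:

```python
import pandas as pd

# Hypothetical sketch: derive an l2-category split from the dataset's
# language and region fields, mirroring the "('Japanese', 'Japan')"
# split labels shown in the results below. Column names are assumed.
def add_l2_category(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    df["l2-category"] = df.apply(
        lambda r: f"('{r['language']}', '{r['region']}')", axis=1
    )
    return df

df = pd.DataFrame(
    {"language": ["Japanese", "Spanish"], "region": ["Japan", "Spain"]}
)
df = add_l2_category(df)
```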

@FangXinyu-0913 (Collaborator)

Hi @timothycdc, thanks for your contribution to our community. Here are the results of the test on our side for your reference. If this looks okay to you, we will merge this PR.


| split | test |
| --- | --- |
| Overall | 0.3877964141122036 |
| ('Amharic', 'Ethiopia') | 0.24786324786324787 |
| ('Bengali', 'India') | 0.18181818181818182 |
| ('Breton', 'France') | 0.2962962962962963 |
| ('Bulgarian', 'Bulgaria') | 0.3692722371967655 |
| ('Chinese', 'China') | 0.4212218649517685 |
| ('Chinese', 'Singapore') | 0.3915094339622642 |
| ('Egyptian_Arabic', 'Egypt') | 0.1724137931034483 |
| ('Filipino', 'Philippines') | 0.4482758620689655 |
| ('Hindi', 'India') | 0.3283582089552239 |
| ('Igbo', 'Nigeria') | 0.325 |
| ('Indonesian', 'Indonesia') | 0.4029126213592233 |
| ('Irish', 'Ireland') | 0.4110429447852761 |
| ('Japanese', 'Japan') | 0.3103448275862069 |
| ('Javanese', 'Indonesia') | 0.3602693602693603 |
| ('Kinyarwanda', 'Rwanda') | 0.33617021276595743 |
| ('Korean', 'South Korea') | 0.3724137931034483 |
| ('Malay', 'Malaysia') | 0.473015873015873 |
| ('Marathi', 'India') | 0.24257425742574257 |
| ('Minangkabau', 'Indonesia') | 0.350597609561753 |
| ('Mongolian', 'Mongolia') | 0.22115384615384615 |
| ('Norwegian', 'Norway') | 0.5418060200668896 |
| ('Oromo', 'Ethiopia') | 0.38317757009345793 |
| ('Portuguese', 'Brazil') | 0.5985915492957746 |
| ('Romanian', 'Romania') | 0.5198675496688742 |
| ('Russian', 'Russia') | 0.55 |
| ('Sinhala', 'Sri_Lanka') | 0.21777777777777776 |
| ('Spanish', 'Argentina') | 0.5471698113207547 |
| ('Spanish', 'Chile') | 0.5897435897435898 |
| ('Spanish', 'Colombia') | 0.5311203319502075 |
| ('Spanish', 'Ecuador') | 0.5552486187845304 |
| ('Spanish', 'Mexico') | 0.4613003095975232 |
| ('Spanish', 'Spain') | 0.660377358490566 |
| ('Spanish', 'Uruguay') | 0.44126984126984126 |
| ('Sundanese', 'Indonesia') | 0.35 |
| ('Swahili', 'Kenya') | 0.37362637362637363 |
| ('Tamil', 'India') | 0.18691588785046728 |
| ('Telugu', 'India') | 0.27 |
| ('Urdu', 'India') | 0.2 |
| ('Urdu', 'Pakistan') | 0.10648148148148148 |
| Brands / products / companies | 0.3910355486862442 |
| Cooking and food | 0.35877862595419846 |
| Geography / buildings / landmarks | 0.4 |
| Objects / materials / clothing | 0.3337531486146096 |
| People and everyday life | 0.4195624195624196 |
| Plants and animal | 0.36418816388467373 |
| Public Figure and pop culture | 0.41467576791808874 |
| Sports and recreation | 0.41372141372141374 |
| Traditions / art / history | 0.39231738035264485 |
| Vehicles and Transportation | 0.4074074074074074 |


BTW, the SYSPROMPT in the PR is not used in the actual evaluation. If you want it to be applied for all models, you can write a build_prompt function that incorporates this prompt and put it under the class CVQA.
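A minimal sketch of what that could look like. The base class here is a stand-in for the toolkit's multiple-choice dataset class, and the message format (a list of `{"type": ..., "value": ...}` dicts) and method names follow the convention suggested above but are not verified against this PR:

```python
SYSPROMPT = "Answer with the option's letter from the given choices directly."

class _BaseMCQ:
    # Stand-in for the toolkit's multiple-choice dataset base class,
    # included only so this sketch is self-contained and runnable.
    def build_prompt(self, line):
        return [
            {"type": "image", "value": line["image_path"]},
            {"type": "text", "value": line["question"]},
        ]

class CVQA(_BaseMCQ):
    def build_prompt(self, line):
        # Build the default image + question messages, then prepend the
        # system-style instruction to the text portion so every model sees it.
        msgs = super().build_prompt(line)
        for msg in msgs:
            if msg["type"] == "text":
                msg["value"] = f"{SYSPROMPT}\n{msg['value']}"
        return msgs
```

Overriding `build_prompt` on the dataset class keeps the instruction model-agnostic, which is the point of the suggestion: no per-model prompt wiring is needed.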

@FangXinyu-0913 FangXinyu-0913 self-assigned this Jul 29, 2025
timothycdc (Author) commented Jul 29, 2025

Thanks @FangXinyu-0913. Which model did you use for this? The score seems a bit lower than normal.

I will double-check the accuracy calculation code and fix the SYSPROMPT.
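For reference, the per-split numbers in the table above are consistent with a plain mean over a 0/1 hit column grouped by split. This is an illustrative sketch only; the column names are assumptions, not the exact schema of the evaluation code being double-checked:

```python
import pandas as pd

# Illustrative per-split accuracy: mean of a 0/1 "hit" column per split,
# plus an overall mean. Column names ("split", "hit") are assumed.
df = pd.DataFrame({
    "split": ["('Japanese', 'Japan')", "('Japanese', 'Japan')",
              "('Spanish', 'Spain')", "('Spanish', 'Spain')"],
    "hit": [1, 0, 1, 1],  # 1 = model's choice matched the ground truth
})
per_split = df.groupby("split")["hit"].mean()
overall = df["hit"].mean()
```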

@FangXinyu-0913 (Collaborator)

> Thanks @FangXinyu-0913. Which model did you use for this? The score seems a bit lower than normal.
>
> I will double-check the accuracy code and fix the SYSPROMPT.

I used the llava_v1.5_7b model and got this result.
