Many models are trained on benchmark questions, so if you use those same questions for performance testing, 7B and 14B thinking models can compete with much larger models (200B+) on many of them. Have you thought about a better way to evaluate LLM performance?

Some people have created private test sets, but then we have to trust the individuals running them (they could be biased or paid off). On the other hand, if we open-source the questions/tests, people can, and already do, train on them to score higher on each benchmark. So I was wondering: is there a better way to really evaluate LLMs?
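To make the contamination concern concrete, here is a rough, purely illustrative sketch of one way to flag benchmark questions whose n-grams already appear in a training corpus. The corpus text, sample questions, n-gram size, and 0.5 threshold below are all made up for the example; real contamination checks are more involved.

```python
from typing import Set


def ngrams(text: str, n: int = 8) -> Set[str]:
    """Return the set of lowercase word n-grams in a piece of text."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def contamination_score(question: str, corpus_grams: Set[str], n: int = 8) -> float:
    """Fraction of the question's n-grams that also occur in the training corpus."""
    q_grams = ngrams(question, n)
    if not q_grams:
        return 0.0
    return len(q_grams & corpus_grams) / len(q_grams)


if __name__ == "__main__":
    # Hypothetical inputs for illustration only.
    training_corpus = "if a train travels 60 km in 45 minutes what is its average speed ..."
    benchmark_questions = [
        "What is the capital of France?",
        "If a train travels 60 km in 45 minutes, what is its average speed?",
    ]

    corpus_grams = ngrams(training_corpus, n=8)
    for q in benchmark_questions:
        score = contamination_score(q, corpus_grams, n=8)
        flag = "possibly contaminated" if score > 0.5 else "likely clean"
        print(f"{score:.2f}  {flag}  {q}")
```

Even a crude check like this shows why open benchmarks leak so easily: once a question string is on the web, it tends to end up in the next crawl.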