Many models are trained on benchmark questions, so if you use those same questions for performance testing, 7B and 14B thinking models can compete with much larger models (200B+) on many of them. Have you thought about a better way to evaluate LLM performance?

Some people have created private test sets, but then we have to trust the individuals running them (they could be biased or paid off). On the other hand, if we open-source the questions/tests, people can, and already do, train on them to score higher on each benchmark. So I was wondering: is there a better way to really evaluate LLMs?
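To make the contamination concern concrete, here is a rough, purely illustrative sketch of one way to flag benchmark questions whose n-grams already appear in a training corpus. The corpus text, sample questions, n-gram size, and 0.5 threshold below are all made up for the example; real contamination checks are more involved.

```python
from typing import Set


def ngrams(text: str, n: int = 8) -> Set[str]:
    """Return the set of lowercase word n-grams in a piece of text."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def contamination_score(question: str, corpus_grams: Set[str], n: int = 8) -> float:
    """Fraction of the question's n-grams that also occur in the training corpus."""
    q_grams = ngrams(question, n)
    if not q_grams:
        return 0.0
    return len(q_grams & corpus_grams) / len(q_grams)


if __name__ == "__main__":
    # Hypothetical inputs for illustration only.
    training_corpus = "if a train travels 60 km in 45 minutes what is its average speed ..."
    benchmark_questions = [
        "What is the capital of France?",
        "If a train travels 60 km in 45 minutes, what is its average speed?",
    ]

    corpus_grams = ngrams(training_corpus, n=8)
    for q in benchmark_questions:
        score = contamination_score(q, corpus_grams, n=8)
        flag = "possibly contaminated" if score > 0.5 else "likely clean"
        print(f"{score:.2f}  {flag}  {q}")
```

Even a crude check like this shows why open benchmarks leak so easily: once a question string is on the web, it tends to end up in the next crawl.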