pass@k results silently wrong when n<k #58
daniel-vainsencher
started this conversation in
General
Replies: 1 comment
-
This is a partial solution to this problem. The script to calculate pass@k now prints the minimum and maximum number of completions per row: https://github.com/nuprl/MultiPL-E/blob/dev/pass_k.py#L53

For the informed user, MinCompletions < k means that the number in that row is unreliable. The gold standard is [...]. But when operating at scale, it helps to look at intermediate results.
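The check described above can be sketched roughly as follows. This is a minimal illustration, not the code in `pass_k.py`: the function name `completion_summary`, its input shape (a list of per-problem completion counts), and the default `ks` are all assumptions made for the example.

```python
def completion_summary(completion_counts, ks=(1, 10, 100)):
    """Report min/max completions and the k values that cannot be trusted.

    completion_counts: hypothetical input, one integer per problem giving
    the number of completions sampled for that problem.
    """
    if not completion_counts:
        raise ValueError("no completion counts given")
    min_completions = min(completion_counts)
    max_completions = max(completion_counts)
    # pass@k is only meaningful when every problem has at least k completions.
    unreliable_ks = [k for k in ks if min_completions < k]
    return {
        "min_completions": min_completions,
        "max_completions": max_completions,
        "unreliable_ks": unreliable_ks,
    }


if __name__ == "__main__":
    print(completion_summary([20, 99, 200]))
```

Printing the summary leaves the judgment to the reader; a stricter variant could raise an error or drop the affected columns instead.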
-
pass@k = 1 should be evidence that in k generations by this model, at least 1 is very likely to pass the test.
However, the definition of the estimator returns 1 even when there are 0 passes among 99 tries if `k`=100. Nothing in the callers prevents using too small an `n`; in fact, someone in a hurry is quite likely to use a small `n` (as I did in the original issue, oops).

Note in contrast how huggingface/evaluate does deal correctly with the n<k case: if that happens for any result, pass@k for that `k` is elided from the dictionary.

Originally posted by @daniel-vainsencher in #31 (comment)
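The failure mode described in this post can be reproduced with a small sketch. It assumes the estimator has the commonly used unbiased form 1 - C(n-c, k) / C(n, k) with an early return of 1.0 when `n - c < k`; that is consistent with the behavior reported here but is not copied from the repository. The names `pass_at_k` and `pass_at_k_table` are hypothetical, and the second function only imitates the "elide that `k`" policy attributed to huggingface/evaluate.

```python
def pass_at_k(n, c, k):
    """Estimate pass@k from n samples of which c passed (assumed form)."""
    if n - c < k:
        # Fewer than k failures. For n >= k, every k-subset contains a pass.
        # For n < k this branch is also taken, even when c == 0.
        return 1.0
    prob_all_fail = 1.0
    for i in range(n - c + 1, n + 1):
        prob_all_fail *= 1.0 - k / i
    return 1.0 - prob_all_fail


def pass_at_k_table(results, ks=(1, 10, 100)):
    """Average pass@k over (n, c) pairs, omitting any k with some n < k."""
    if not results:
        raise ValueError("no results given")
    table = {}
    for k in ks:
        if any(n < k for n, _ in results):
            continue  # too few samples for this k: leave it out entirely
        total = sum(pass_at_k(n, c, k) for n, c in results)
        table[f"pass@{k}"] = total / len(results)
    return table


if __name__ == "__main__":
    print(pass_at_k(99, 0, 100))                 # unguarded estimator
    print(pass_at_k_table([(99, 0), (99, 3)]))   # pass@100 is omitted
```

Omitting the key, rather than returning a number, forces callers to notice that pass@100 was never actually measured.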