More stable MBPP evaluation #2111
Open
This PR improves the stability of the evaluation process for the MBPP dataset.
Motivation
There were two key issues affecting the robustness and reliability of the evaluation:
Regex performance:
The regex pattern [r'(.*)\s*```.*'](https://github.com/open-compass/opencompass/blob/aa2b89b6f8b7c5448e47ed1aa3f12b04da1ff123/opencompass/datasets/mbpp.py#L327) occasionally hangs when processing predictions containing excessive trailing whitespace. This is likely due to catastrophic backtracking. To mitigate this, I replaced the standard `re` module with the `regex` module, which supports timeouts, and set a hard timeout of 10 seconds.
Multiprocessing issue:
The following error was encountered:
AttributeError: Can't pickle local object 'execution.<locals>._execution'
This occurs because `ProcessPoolExecutor` must pickle the callable it sends to worker processes, and local (non-global) functions cannot be pickled. The fix is to move the `_execution` function to the global scope so it can be properly serialized.
Modifications
Replaced `re` with `regex` and added a 10-second timeout to [regex.search](https://github.com/open-compass/opencompass/blob/aa2b89b6f8b7c5448e47ed1aa3f12b04da1ff123/opencompass/datasets/mbpp.py#L333).
Moved the `_execution` function [from here](https://github.com/open-compass/opencompass/blob/aa2b89b6f8b7c5448e47ed1aa3f12b04da1ff123/opencompass/datasets/mbpp.py#L403C1-L418C33) to the global scope to resolve the pickling issue.
Backward Compatibility
This change is not breaking. However, note that introducing a regex timeout may cause differences in evaluation results for cases where the timeout is triggered.
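For reviewers, the two fixes can be illustrated with a minimal, self-contained sketch. This is not the code from this PR: `make_local_worker`, `is_picklable`, and `search_with_timeout` are hypothetical helper names, the `_execution` body is a stand-in, and returning `None` on timeout is only one assumed way to handle the error. The sketch also falls back to the standard `re` module (without any timeout) if the third-party `regex` package is not installed.

```python
import pickle
import re

try:
    import regex  # third-party package; its matching functions accept `timeout`
except ImportError:
    regex = None  # assumption for this sketch: fall back to `re`, no timeout


def _execution(program: str) -> str:
    """Stand-in worker defined at module level, so pickle can find it by name."""
    return f"ran {len(program)} chars"


def make_local_worker():
    """Return a nested function, mimicking a locally defined `_execution`."""
    def _local_execution(program: str) -> str:
        return f"ran {len(program)} chars"
    return _local_execution


def is_picklable(obj) -> bool:
    """Report whether `obj` can be pickled, as a process pool requires."""
    try:
        pickle.dumps(obj)
        return True
    except (AttributeError, pickle.PicklingError):
        return False


def search_with_timeout(pattern: str, text: str, timeout: float = 10.0):
    """Search `text` with DOTALL; a timed-out search is treated as no match."""
    if regex is None:
        return re.search(pattern, text, re.DOTALL)  # no timeout available
    try:
        return regex.search(pattern, text, regex.DOTALL, timeout=timeout)
    except TimeoutError:
        return None


if __name__ == "__main__":
    fence = "`" * 3  # built programmatically to keep this example readable
    pattern = r"(.*)\s*" + fence + ".*"
    print(is_picklable(_execution))
    print(is_picklable(make_local_worker()))
    match = search_with_timeout(pattern, "def f():\n    return 1\n" + fence)
    print(match.group(1) if match else None)
```

Because a timed-out search is treated here as no match, a prediction that previously hung would instead be handled by whatever fallback the caller applies, which is the kind of result difference noted above.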