Description
Making some trivial changes to the code brings a dramatic difference in the performance of Qwen-2.5-7B-coder-instruct on the HumanEval benchmark.
The first point is line 32 in file bigcode_eval/tasks/humaneval.py, in function create_all_tasks: when True is passed to create_task, the fixed HumanEval prompt from the dataset is modified, specifically the trailing '\n' is removed. As a result, the last token of the prompt changes, which leads to dramatically different output from the model.
The second point is line 52 of the same file, the stop_words list. The stop words \n# and \nif may ruin the code generated by the model, since it likes to append comments and an if __name__ == '__main__': block at the end of the code. Removing these two entries from the list and adding some post-processing to the generated code may fix it, for example by adding the code below at line 97 of bigcode_eval/evaluator.py:
for i, gen_list in enumerate(generations):
    for j, gen in enumerate(gen_list):
        # Truncate each generation at the __main__ guard the model tends to append,
        # handling both single- and double-quoted variants.
        gen = gen.split("\nif __name__ == '__main__':")[0]
        generations[i][j] = gen.split('\nif __name__ == "__main__":')[0]
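For completeness, a hedged sketch of the trimmed stop-word list; the original entries shown in the comment are approximate, so check them against line 52 of your copy of bigcode_eval/tasks/humaneval.py before editing:

# Approximate original list:
# stop_words = ["\nclass", "\ndef", "\n#", "\n@", "\nprint", "\nif", "\n```", "<file_sep>"]
# Dropping "\n#" and "\nif" stops comments and the __main__ guard from truncating
# generation mid-solution; the post-processing above then cuts off the guard itself.
stop_words = ["\nclass", "\ndef", "\n@", "\nprint", "\n```", "<file_sep>"]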