Testing HumanEval of Qwen-2.5-7B-coder-instruct #308

@zxiangx

Description

Making some trivial changes to the code brings a dramatic difference in the performance of Qwen-2.5-7B-coder-instruct on the HumanEval benchmark.

The first point is line 32 of bigcode_eval/tasks/humaneval.py, in the function create_all_tasks: when True is passed to create_task, the fixed HumanEval prompt from the dataset is modified, specifically by removing the trailing '\n'. The last token of the prompt therefore changes, which dramatically alters the model's output.
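As an illustration (the prompt text here is a made-up stand-in, not the actual HumanEval prompt, and `rstrip` stands in for whatever stripping the harness does), removing the trailing newline changes the suffix the model conditions on, so the first generated token can differ:

```python
# A toy prompt in the HumanEval style (hypothetical, for illustration only).
prompt = 'def add(a, b):\n    """Return a + b."""\n'

# Dropping the trailing newline changes the final characters of the prompt.
trimmed = prompt.rstrip("\n")

print(repr(prompt[-5:]))   # the original prompt ends with '\n'
print(repr(trimmed[-5:]))  # the trimmed prompt ends mid-docstring-line
```

Since most tokenizers merge the newline into the last token (or emit it as its own token), the two prompts end in different token IDs, and greedy decoding can diverge from the very first generated token.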

The second point is line 52 of the same file, the stop_words list. The stop words \n# and \nif can truncate the code generated by the model, since it likes to append comments and an if __name__ == '__main__': block to the end of its code. Removing these two entries from the list and adding some post-processing to the generated code can fix this, for example by inserting the following at line 97 of bigcode_eval/evaluator.py:

for i, gen_list in enumerate(generations):
    for j, gen in enumerate(gen_list):
        gen = gen.split("\nif __name__ == '__main__':")[0]
        generations[i][j] = gen.split('\nif __name__ == "__main__":')[0]
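The two splits above can also be folded into a single regex that handles both quote styles (and optional whitespace around ==) in one pass; `strip_main_guard` is a hypothetical helper name, not part of the harness:

```python
import re

# Matches a top-level main guard with either quote style, e.g.
# "\nif __name__ == '__main__':" or '\nif __name__ == "__main__":'.
MAIN_GUARD = re.compile(r"\nif __name__\s*==\s*['\"]__main__['\"]\s*:")

def strip_main_guard(code: str) -> str:
    # Keep only the text before the first main guard, if any.
    return MAIN_GUARD.split(code, maxsplit=1)[0]

generations = [["def f():\n    return 1\nif __name__ == '__main__':\n    f()"]]
generations = [[strip_main_guard(g) for g in gens] for gens in generations]
print(generations[0][0])  # the guard and everything after it are removed
```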
