Testing HumanEval of Qwen-2.5-7B-coder-instruct #308

@zxiangx

Description

Making some trivial changes to the code brings a dramatic difference in the performance of Qwen-2.5-7B-coder-instruct on the HumanEval benchmark.

The first point is line 32 of bigcode_eval/tasks/humaneval.py, in the function create_all_tasks: when True is passed to create_task, the fixed HumanEval prompt from the dataset is modified, specifically by removing the trailing '\n'. The last token of the prompt therefore changes, which dramatically alters the model's output.
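As an illustration (the prompt text here is a made-up stand-in, not the actual HumanEval prompt, and `rstrip` stands in for whatever stripping the harness does), removing the trailing newline changes the suffix the model conditions on, so the first generated token can differ:

```python
# A toy prompt in the HumanEval style (hypothetical, for illustration only).
prompt = 'def add(a, b):\n    """Return a + b."""\n'

# Dropping the trailing newline changes the final characters of the prompt.
trimmed = prompt.rstrip("\n")

print(repr(prompt[-5:]))   # the original prompt ends with '\n'
print(repr(trimmed[-5:]))  # the trimmed prompt ends mid-docstring-line
```

Since most tokenizers merge the newline into the last token (or emit it as its own token), the two prompts end in different token IDs, and greedy decoding can diverge from the very first generated token.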

The second point is line 52 of the same file, the stop_words list. The stop words \n# and \nif can truncate the code generated by the model, since it likes to append comments and an if __name__ == '__main__': block to the end of its code. Removing these two entries from the list and adding some post-processing to the generated code can fix this, for example by inserting the following at line 97 of bigcode_eval/evaluator.py:

for i, gen_list in enumerate(generations):
    for j, gen in enumerate(gen_list):
        gen = gen.split("\nif __name__ == '__main__':")[0]
        generations[i][j] = gen.split('\nif __name__ == "__main__":')[0]
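The two splits above can also be folded into a single regex that handles both quote styles (and optional whitespace around ==) in one pass; `strip_main_guard` is a hypothetical helper name, not part of the harness:

```python
import re

# Matches a top-level main guard with either quote style, e.g.
# "\nif __name__ == '__main__':" or '\nif __name__ == "__main__":'.
MAIN_GUARD = re.compile(r"\nif __name__\s*==\s*['\"]__main__['\"]\s*:")

def strip_main_guard(code: str) -> str:
    # Keep only the text before the first main guard, if any.
    return MAIN_GUARD.split(code, maxsplit=1)[0]

generations = [["def f():\n    return 1\nif __name__ == '__main__':\n    f()"]]
generations = [[strip_main_guard(g) for g in gens] for gens in generations]
print(generations[0][0])  # the guard and everything after it are removed
```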
