
Commit a6ba6c5

xingyaoww and neubig authored

Add SWEBench-docker eval (#2085)

* add initial version of swebench-docker eval
* update the branch of git repo
* add poetry run
* download dev set too and pre-load f2p and p2p
* update eval infer script
* increase timeout
* add poetry run
* install swebench from our fork
* update script
* update loc
* support single instance debug
* replace \r\n from model patch
* replace eval docker from namespace xingyaoww
* update script to auto detect swe-bench format jsonl
* support eval infer on single instance id
* change log output dir to logs
* update summarise result script
* update README
* update readme
* tweak branch
* Update evaluation/swe_bench/scripts/eval/prep_eval.sh

Co-authored-by: Graham Neubig <[email protected]>

---------

Co-authored-by: Graham Neubig <[email protected]>

1 parent 9605106 · commit a6ba6c5

10 files changed: +273 −349 lines changed

evaluation/swe_bench/EVAL_PATCH.md (−256)

This file was deleted.

evaluation/swe_bench/README.md (+16 −51)
@@ -127,6 +127,12 @@ If you want to evaluate existing results, you should first run this to clone exi
 git clone https://huggingface.co/spaces/OpenDevin/evaluation evaluation/evaluation_outputs
 ```
 
+To prepare for swe-bench evaluation, you should pull evaluation docker from [OpenDevin/SWE-bench-docker](https://github.com/OpenDevin/SWE-bench-docker) and download swe-bench data by running:
+
+```bash
+evaluation/swe_bench/scripts/eval/prep_eval.sh
+```
+
 Then you can run the following:
 
 ```bash
@@ -135,55 +141,14 @@ Then you can run the following:
 ./evaluation/swe_bench/scripts/eval_infer.sh evaluation/evaluation_outputs/outputs/swe_bench/CodeActAgent/gpt-4-1106-preview_maxiter_50_N_v1.0/output.jsonl
 ```
 
-The final results will be saved to `evaluation/evaluation_outputs/outputs/swe_bench/CodeActAgent/gpt-4-1106-preview_maxiter_50_N_v1.0/output.merged.jsonl`.
-
-It will contain an additional field `fine_grained_report` (see example below) compared to the `output.jsonl` from the previous inference stage.
-
-```json
-"fine_grained_report": {
-    "gold_tests": {
-        "FAIL_TO_PASS": "[\"tests/test_ext_viewcode.py::test_viewcode_epub_default\"]",
-        "PASS_TO_PASS": "[\"tests/test_ext_viewcode.py::test_viewcode_epub_enabled\", \"tests/test_ext_viewcode.py::test_linkcode\", \"tests/test_ext_viewcode.py::test_local_source_files\"]"
-    },
-    "generated": true,
-    "with_logs": true,
-    "applied": true,
-    "test_errored": false,
-    "test_timeout": false,
-    "resolved": true,
-    "log_parse": {
-        "tests/test_ext_viewcode.py::test_viewcode_epub_default": "PASSED",
-        "tests/test_ext_viewcode.py::test_viewcode_epub_enabled": "PASSED",
-        "tests/test_ext_viewcode.py::test_linkcode": "PASSED",
-        "tests/test_ext_viewcode.py::test_local_source_files": "PASSED",
-        "tests/test_ext_viewcode.py::test_viewcode": "FAILED"
-    },
-    "eval_report": {
-        "FAIL_TO_PASS": {
-            "success": [
-                "tests/test_ext_viewcode.py::test_viewcode_epub_default"
-            ],
-            "failure": []
-        },
-        "PASS_TO_PASS": {
-            "success": [
-                "tests/test_ext_viewcode.py::test_viewcode_epub_enabled",
-                "tests/test_ext_viewcode.py::test_linkcode",
-                "tests/test_ext_viewcode.py::test_local_source_files"
-            ],
-            "failure": []
-        },
-        "FAIL_TO_FAIL": {
-            "success": [],
-            "failure": []
-        },
-        "PASS_TO_FAIL": {
-            "success": [],
-            "failure": []
-        }
-    }
-}
-```
+PS: You can also pass in a JSONL with [SWE-Bench format](https://github.com/princeton-nlp/SWE-bench/blob/main/tutorials/evaluation.md#-creating-predictions) to `./evaluation/swe_bench/scripts/eval_infer.sh`, where each line is a JSON of `{"model_patch": "XXX", "model_name_or_path": "YYY", "instance_id": "ZZZ"}`.
+
+The final results will be saved to `evaluation/evaluation_outputs/outputs/swe_bench/CodeActAgent/gpt-4-1106-preview_maxiter_50_N_v1.0/` with the following files/directory (following format of [SWE-bench-docker](https://github.com/aorwall/SWE-bench-docker/tree/main/evaluations/SWE-bench_Lite_golden)):
+
+- `README.md`: a report showing what are the instances that passed, failed, etc.
+- `logs/`: a directory of test logs
+- `report.json`: a JSON file that contains keys like `"resolved"` pointing to instance IDs that are resolved by the agent.
+- `summary.json`: a JSON file contains more fine-grained information for each test instance.
 
 Please refer to [EVAL_PATCH.md](./EVAL_PATCH.md) if you want to learn more about how to evaluate patches that are already generated (e.g., not by OpenDevin).
 
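For reference, the SWE-Bench prediction format mentioned in the added "PS:" line above is one JSON object per line. Below is a minimal sketch of emitting such a line; the three field names come from the README hunk, while the concrete instance ID, patch text, model name, and output filename are placeholders, not values from this commit.

```python
import json

# Hypothetical prediction for a single instance; the field names follow the
# {"model_patch", "model_name_or_path", "instance_id"} schema quoted above.
prediction = {
    "instance_id": "example__repo-1234",                  # placeholder instance ID
    "model_patch": "diff --git a/foo.py b/foo.py\n...",   # placeholder patch text
    "model_name_or_path": "CodeActAgent-example",         # placeholder model name
}

# Each prediction is appended as one line of a .jsonl file (placeholder filename).
with open("predictions.swebench.jsonl", "a") as f:
    f.write(json.dumps(prediction) + "\n")
```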

@@ -192,8 +157,8 @@ Please refer to [EVAL_PATCH.md](./EVAL_PATCH.md) if you want to learn more about
 If you just want to know the resolve rate, and/or a summary of what tests pass and what don't, you could run
 
 ```bash
-poetry run python ./evaluation/swe_bench/scripts/summarise_results.py <path_to_output_merged_jsonl_file>
-# e.g. poetry run python ./evaluation/swe_bench/scripts/summarise_results.py ./evaluation/evaluation_outputs/outputs/swe_bench_lite/CodeActSWEAgent/gpt-4o-2024-05-13_maxiter_50_N_v1.5-no-hint/output.merged.jsonl
+poetry run python ./evaluation/swe_bench/scripts/summarise_results.py <path_to_report_json_file>
+# e.g. poetry run python ./evaluation/swe_bench/scripts/summarise_results.py ./evaluation/evaluation_outputs/outputs/swe_bench_lite/CodeActSWEAgent/gpt-4o-2024-05-13_maxiter_50_N_v1.5-no-hint/report.json
 ```
 
 ## Submit your evaluation results
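As a rough sketch of what the updated summary step now consumes, `report.json` can also be inspected directly. The README hunks above only state that the file lives in the output directory and that a `"resolved"` key points to resolved instance IDs; the assumption that the key holds a plain list, and the exact path below, are illustrative.

```python
import json

# Path follows the output directory named in the README hunk above; substitute your own run.
report_path = (
    "evaluation/evaluation_outputs/outputs/swe_bench/"
    "CodeActAgent/gpt-4-1106-preview_maxiter_50_N_v1.0/report.json"
)

with open(report_path) as f:
    report = json.load(f)

# Per the README, "resolved" points to the instance IDs the agent resolved
# (assumed here to be a list).
resolved = report.get("resolved", [])
print(f"{len(resolved)} instances resolved")
for instance_id in resolved:
    print(instance_id)
```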
evaluation/swe_bench/scripts/eval/prep_eval.sh (new file, +7)

@@ -0,0 +1,7 @@
+#!/bin/bash
+
+mkdir evaluation/swe_bench/eval_workspace
+pushd evaluation/swe_bench/eval_workspace
+git clone https://github.com/OpenDevin/SWE-bench-docker.git
+cd SWE-bench-docker
+scripts/pull_docker_images.sh docker/ xingyaoww
New file (+26)

@@ -0,0 +1,26 @@
+import argparse
+import os
+
+import pandas as pd
+
+parser = argparse.ArgumentParser()
+parser.add_argument('od_output_file', type=str)
+args = parser.parse_args()
+output_filepath = args.od_output_file.replace('.jsonl', '.swebench.jsonl')
+print(f'Converting {args.od_output_file} to {output_filepath}')
+
+od_format = pd.read_json(args.od_output_file, orient='records', lines=True)
+# model name is the folder name of od_output_file
+model_name = os.path.basename(os.path.dirname(args.od_output_file))
+
+
+def convert_row_to_swebench_format(row):
+    return {
+        'instance_id': row['instance_id'],
+        'model_patch': row['git_patch'].replace('\r\n', '\n'),
+        'model_name_or_path': model_name,
+    }
+
+
+swebench_format = od_format.apply(convert_row_to_swebench_format, axis=1)
+swebench_format.to_json(output_filepath, lines=True, orient='records')
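As a quick sanity check of the conversion above, the generated `.swebench.jsonl` file can be read back with pandas to confirm each line carries the three expected keys. This is a sketch only; the path below is a placeholder for whatever file the script writes.

```python
import pandas as pd

# Hypothetical converted file produced by the script above
# (the original name with '.jsonl' replaced by '.swebench.jsonl').
converted_path = 'output.swebench.jsonl'

preds = pd.read_json(converted_path, orient='records', lines=True)

# Every prediction line should expose exactly these fields.
expected = {'instance_id', 'model_patch', 'model_name_or_path'}
assert expected <= set(preds.columns), f'missing keys: {expected - set(preds.columns)}'

print(preds[['instance_id', 'model_name_or_path']].head())
```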
