* add initial version of swebench-docker eval
* update the branch of git repo
* add poetry run
* download dev set too and pre-load f2p and p2p
* update eval infer script
* increase timeout
* add poetry run
* install swebench from our fork
* update script
* update loc
* support single instance debug
* replace \r\n from model patch
* replace eval docker from namespace xingyaoww
* update script to auto detect swe-bench format jsonl
* support eval infer on single instance id
* change log output dir to logs
* update summarise result script
* update README
* update readme
* tweak branch
* Update evaluation/swe_bench/scripts/eval/prep_eval.sh
Co-authored-by: Graham Neubig <[email protected]>
---------
Co-authored-by: Graham Neubig <[email protected]>
To prepare for SWE-bench evaluation, you should pull the evaluation Docker images from [OpenDevin/SWE-bench-docker](https://github.com/OpenDevin/SWE-bench-docker) and download the SWE-bench data by running:

```bash
evaluation/swe_bench/scripts/eval/prep_eval.sh
```
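
If you want to sanity-check the downloaded data yourself, here is a minimal sketch using the Hugging Face `datasets` library. The `princeton-nlp/SWE-bench_Lite` dataset name and `test` split are assumptions for illustration and may not match exactly what `prep_eval.sh` fetches:

```python
# Minimal sketch: peek at the SWE-bench data the prep step downloads.
# Assumption: the public princeton-nlp/SWE-bench_Lite dataset and its
# "test" split; prep_eval.sh may fetch additional splits (e.g. dev).
from datasets import load_dataset

swe_bench = load_dataset("princeton-nlp/SWE-bench_Lite", split="test")
print(f"{len(swe_bench)} instances")
print(swe_bench[0]["instance_id"])
```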
Then you can run the following:

```bash
./evaluation/swe_bench/scripts/eval_infer.sh <path_to_output_jsonl_file>
```
PS: You can also pass in a JSONL in [SWE-Bench format](https://github.com/princeton-nlp/SWE-bench/blob/main/tutorials/evaluation.md#-creating-predictions) to `./evaluation/swe_bench/scripts/eval_infer.sh`, where each line is a JSON object of the form `{"model_patch": "XXX", "model_name_or_path": "YYY", "instance_id": "ZZZ"}`.
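
For illustration, a minimal sketch of writing such a file (the patch string, model name, instance ID, and output path below are placeholders, not values from this repository) could look like:

```python
# Minimal sketch: write a SWE-bench-format predictions JSONL for eval_infer.sh.
# The patch content, model name, instance ID, and output path are placeholders.
import json

predictions = [
    {
        "model_patch": "diff --git a/foo.py b/foo.py\n...",
        "model_name_or_path": "my-model",
        "instance_id": "django__django-11099",
    },
]

with open("predictions.swebench.jsonl", "w") as f:
    for pred in predictions:
        f.write(json.dumps(pred) + "\n")
```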
The final results will be saved to `evaluation/evaluation_outputs/outputs/swe_bench/CodeActAgent/gpt-4-1106-preview_maxiter_50_N_v1.0/` with the following files/directories (following the format of [SWE-bench-docker](https://github.com/aorwall/SWE-bench-docker/tree/main/evaluations/SWE-bench_Lite_golden)):

- `README.md`: a report showing which instances passed, failed, etc.
- `logs/`: a directory of test logs
- `report.json`: a JSON file with keys like `"resolved"` pointing to the instance IDs that were resolved by the agent.
- `summary.json`: a JSON file with more fine-grained information for each test instance.
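
As a rough illustration, a minimal sketch of reading `report.json` (assuming only the documented `"resolved"` key; the full schema may differ) could be:

```python
# Minimal sketch: count resolved instances from report.json.
# Assumes the documented "resolved" key holds a list of instance IDs;
# the full schema may differ.
import json

report_path = (
    "evaluation/evaluation_outputs/outputs/swe_bench/CodeActAgent/"
    "gpt-4-1106-preview_maxiter_50_N_v1.0/report.json"
)
with open(report_path) as f:
    report = json.load(f)

resolved_ids = report.get("resolved", [])
print(f"{len(resolved_ids)} instances resolved")
```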
Please refer to [EVAL_PATCH.md](./EVAL_PATCH.md) if you want to learn more about how to evaluate patches that are already generated (e.g., not by OpenDevin).
If you just want to know the resolve rate, and/or a summary of which tests pass and which don't, you can run:
```bash
poetry run python ./evaluation/swe_bench/scripts/summarise_results.py <path_to_report_json_file>
# e.g. poetry run python ./evaluation/swe_bench/scripts/summarise_results.py ./evaluation/evaluation_outputs/outputs/swe_bench_lite/CodeActSWEAgent/gpt-4o-2024-05-13_maxiter_50_N_v1.5-no-hint/report.json
```