Commit a93b045
feat(eval): Support evaluation on SWE-bench-Live (#9137)
1 parent 98e0f55

File tree: 7 files changed, +286 -12 lines

evaluation/benchmarks/swe_bench/README.md

Lines changed: 2 additions & 0 deletions
```diff
@@ -2,6 +2,8 @@
 
 This folder contains the evaluation harness that we built on top of the original [SWE-Bench benchmark](https://www.swebench.com/) ([paper](https://arxiv.org/abs/2310.06770)).
 
+**UPDATE (6/15/2025): We now support running SWE-bench-Live evaluation (see the paper [here](https://arxiv.org/abs/2505.23419))! For how to run it, check out [this README](./SWE-bench-Live.md).**
+
 **UPDATE (5/26/2025): We now support running interactive SWE-Bench evaluation (see the paper [here](https://arxiv.org/abs/2502.13069))! For how to run it, check out [this README](./SWE-Interact.md).**
 
 **UPDATE (4/8/2025): We now support running SWT-Bench evaluation! For more details, check out [the corresponding section](#SWT-Bench-Evaluation).**
```
evaluation/benchmarks/swe_bench/SWE-bench-Live.md

Lines changed: 65 additions & 0 deletions
# SWE-bench-Live

<p align="center">
  <a href="https://arxiv.org/abs/2505.23419">📃 Paper</a>
  •
  <a href="https://huggingface.co/SWE-bench-Live">🤗 HuggingFace</a>
  •
  <a href="https://SWE-bench-Live.github.io">📊 Leaderboard</a>
</p>
SWE-bench-Live is a live benchmark for issue resolution whose dataset is continuously refreshed with the latest issue tasks. This document explains how to evaluate OpenHands on SWE-bench-Live.

Since SWE-bench-Live mirrors the SWE-bench setup almost exactly, you only need to change the dataset name to `SWE-bench-Live/SWE-bench-Live`; everything else works the same as running on SWE-bench.
## Setting Up

Set up the development environment and configure your LLM provider by following the [README](README.md).
## Running Inference

Use the same script, but change the dataset name to `SWE-bench-Live/SWE-bench-Live` and select a split (either `lite` or `full`). The `lite` split contains 300 instances from the past six months, while the `full` split includes 1,319 instances created after 2024.

```shell
./evaluation/benchmarks/swe_bench/scripts/run_infer.sh [model_config] [git-version] [agent] [eval_limit] [max_iter] [num_workers] [dataset] [dataset_split]
```
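
If you want to sanity-check the dataset and splits before launching a run, here is a minimal sketch using the Hugging Face `datasets` library (the split sizes are those quoted above and may drift as the benchmark is updated):

```python
from datasets import load_dataset

# Dataset name and split names as used in this README.
lite = load_dataset('SWE-bench-Live/SWE-bench-Live', split='lite')
full = load_dataset('SWE-bench-Live/SWE-bench-Live', split='full')
print(len(lite), len(full))  # roughly 300 and 1319 at the time of this commit
```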
In the original SWE-bench-Live paper, `max_iterations` is set to 100. For example:

```shell
./evaluation/benchmarks/swe_bench/scripts/run_infer.sh llm.your_llm HEAD CodeActAgent 300 100 3 SWE-bench-Live/SWE-bench-Live lite
```
## Evaluating Results

After OpenHands generates a patch for each issue, we evaluate the results using the [SWE-bench-Live evaluation harness](https://github.com/microsoft/SWE-bench-Live).

First, convert the output to the prediction format expected by SWE-bench-style harnesses:

```shell
# You can find output.jsonl in evaluation/evaluation_outputs
python evaluation/benchmarks/swe_bench/scripts/live/convert.py --output_jsonl [path/to/evaluation/output.jsonl] > preds.jsonl
```
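
Each line of `preds.jsonl` is a JSON object with the three fields the harness consumes: `instance_id`, `model_name_or_path`, and `model_patch` (see `convert.py` below). A quick sanity check, assuming the paths used above:

```python
import json

# Minimal sketch: verify every prediction line carries the expected fields.
with open('preds.jsonl') as f:
    for i, line in enumerate(f, 1):
        pred = json.loads(line)
        missing = {'instance_id', 'model_name_or_path', 'model_patch'} - pred.keys()
        assert not missing, f'line {i} is missing {missing}'
print('preds.jsonl looks well-formed')
```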
Then refer to the original [SWE-bench-Live repository](https://github.com/microsoft/SWE-bench-Live) to set up the evaluation harness, and use its scripts to generate the evaluation report:

```shell
python -m swebench.harness.run_evaluation \
    --dataset_name SWE-bench-Live/SWE-bench-Live \
    --split lite \
    --namespace starryzhang \
    --predictions_path preds.jsonl \
    --max_workers 10 \
    --run_id openhands
```
## Citation

```bibtex
@article{zhang2025swebenchgoeslive,
  title={SWE-bench Goes Live!},
  author={Linghao Zhang and Shilin He and Chaoyun Zhang and Yu Kang and Bowen Li and Chengxing Xie and Junhao Wang and Maoquan Wang and Yufan Huang and Shengyu Fu and Elsie Nallipogu and Qingwei Lin and Yingnong Dang and Saravan Rajmohan and Dongmei Zhang},
  journal={arXiv preprint arXiv:2505.23419},
  year={2025}
}
```
evaluation/benchmarks/swe_bench/live_utils.py

Lines changed: 80 additions & 0 deletions
```python
from typing import Any

import pandas as pd

from evaluation.utils.shared import assert_and_raise
from openhands.core.logger import openhands_logger as logger
from openhands.events.action import CmdRunAction
from openhands.events.observation import (
    CmdOutputObservation,
    ErrorObservation,
)
from openhands.runtime.base import Runtime
from openhands.utils.shutdown_listener import sleep_if_should_continue


def complete_runtime(
    runtime: Runtime,
    instance: pd.Series,
) -> dict[str, Any]:
    """Complete the runtime and export the git patch for SWE-bench-Live."""
    logger.info('-' * 30)
    logger.info('BEGIN Runtime Completion Fn')
    logger.info('-' * 30)
    obs: CmdOutputObservation
    workspace_dir_name = instance.instance_id

    # Enter the instance workspace (named after the instance ID for SWE-bench-Live)
    action = CmdRunAction(command=f'cd /workspace/{workspace_dir_name}')
    action.set_hard_timeout(600)
    logger.info(action, extra={'msg_type': 'ACTION'})
    obs = runtime.run_action(action)
    logger.info(obs, extra={'msg_type': 'OBSERVATION'})
    assert_and_raise(
        isinstance(obs, CmdOutputObservation) and obs.exit_code == 0,
        f'Failed to cd to /workspace/{workspace_dir_name}: {str(obs)}',
    )

    # Disable the git pager so diff output is never paginated
    action = CmdRunAction(command='git config --global core.pager ""')
    action.set_hard_timeout(600)
    logger.info(action, extra={'msg_type': 'ACTION'})
    obs = runtime.run_action(action)
    logger.info(obs, extra={'msg_type': 'OBSERVATION'})
    assert_and_raise(
        isinstance(obs, CmdOutputObservation) and obs.exit_code == 0,
        f'Failed to git config --global core.pager "": {str(obs)}',
    )

    # Stage all changes so untracked files are included in the diff
    action = CmdRunAction(command='git add -A')
    action.set_hard_timeout(600)
    logger.info(action, extra={'msg_type': 'ACTION'})
    obs = runtime.run_action(action)
    logger.info(obs, extra={'msg_type': 'OBSERVATION'})
    assert_and_raise(
        isinstance(obs, CmdOutputObservation) and obs.exit_code == 0,
        f'Failed to git add -A: {str(obs)}',
    )

    # Diff the staged tree against the base commit, retrying up to five times
    n_retries = 0
    git_patch = None
    while n_retries < 5:
        action = CmdRunAction(
            command=f'git diff --no-color --cached {instance["base_commit"]}',
        )
        action.set_hard_timeout(100 + 10 * n_retries)
        logger.info(action, extra={'msg_type': 'ACTION'})
        obs = runtime.run_action(action)
        logger.info(obs, extra={'msg_type': 'OBSERVATION'})
        n_retries += 1
        if isinstance(obs, CmdOutputObservation):
            if obs.exit_code == 0:
                git_patch = obs.content.strip()
                break
            else:
                logger.info('Failed to get git diff, retrying...')
                sleep_if_should_continue(10)
        elif isinstance(obs, ErrorObservation):
            logger.error(f'Error occurred: {obs.content}. Retrying...')
            sleep_if_should_continue(10)
        else:
            assert_and_raise(False, f'Unexpected observation type: {str(obs)}')

    assert_and_raise(git_patch is not None, 'Failed to get git diff (None)')
    logger.info('-' * 30)
    logger.info('END Runtime Completion Fn')
    logger.info('-' * 30)
    return {'git_patch': git_patch}
```

evaluation/benchmarks/swe_bench/run_infer.py

Lines changed: 53 additions & 11 deletions
```diff
@@ -66,14 +66,37 @@
 ENABLE_LLM_EDITOR = os.environ.get('ENABLE_LLM_EDITOR', 'false').lower() == 'true'
 BenchMode = Literal['swe', 'swt', 'swt-ci']
 
+# Global variable to track dataset type
+DATASET_TYPE = 'SWE-bench'
+
+
+def set_dataset_type(dataset_name: str) -> None:
+    """Set dataset type based on dataset name."""
+    global DATASET_TYPE
+    name_lower = dataset_name.lower()
+
+    if 'swe-gym' in name_lower:
+        DATASET_TYPE = 'SWE-Gym'
+    elif 'swe-bench-live' in name_lower:
+        DATASET_TYPE = 'SWE-bench-Live'
+    elif 'multimodal' in name_lower:
+        DATASET_TYPE = 'Multimodal'
+    else:
+        DATASET_TYPE = 'SWE-bench'
+
+    logger.info(f'Dataset type set to: {DATASET_TYPE}')
+
 
 AGENT_CLS_TO_FAKE_USER_RESPONSE_FN = {
     'CodeActAgent': codeact_user_response,
 }
 
 
 def _get_swebench_workspace_dir_name(instance: pd.Series) -> str:
-    return f'{instance.repo}__{instance.version}'.replace('/', '__')
+    if DATASET_TYPE == 'SWE-bench-Live':
+        return instance.instance_id
+    else:
+        return f'{instance.repo}__{instance.version}'.replace('/', '__')
 
 
 def get_instruction(instance: pd.Series, metadata: EvalMetadata) -> MessageAction:
```
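
For reference, a standalone sketch of how the classifier above behaves (the non-Live dataset names below are illustrative, not taken from this commit):

```python
def classify(dataset_name: str) -> str:
    # Mirrors the matching order of set_dataset_type: SWE-Gym first,
    # then SWE-bench-Live, then Multimodal, defaulting to SWE-bench.
    name_lower = dataset_name.lower()
    if 'swe-gym' in name_lower:
        return 'SWE-Gym'
    elif 'swe-bench-live' in name_lower:
        return 'SWE-bench-Live'
    elif 'multimodal' in name_lower:
        return 'Multimodal'
    return 'SWE-bench'


assert classify('SWE-bench-Live/SWE-bench-Live') == 'SWE-bench-Live'
assert classify('princeton-nlp/SWE-bench_Lite') == 'SWE-bench'  # hypothetical input
```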
```diff
@@ -153,9 +176,13 @@ def get_instance_docker_image(
     if swebench_official_image:
         # Official SWE-Bench image
         # swebench/sweb.eval.x86_64.django_1776_django-11333:v1
-        docker_image_prefix = 'docker.io/swebench/'
+        # SWE-bench-Live uses the same naming convention as SWE-Bench
+        if DATASET_TYPE == 'SWE-bench-Live':
+            docker_image_prefix = 'docker.io/starryzhang/'
+        elif DATASET_TYPE == 'SWE-bench':
+            docker_image_prefix = 'docker.io/swebench/'
         repo, name = instance_id.split('__')
-        image_name = f'swebench/sweb.eval.x86_64.{repo}_1776_{name}:latest'.lower()
+        image_name = f'{docker_image_prefix.rstrip("/")}/sweb.eval.x86_64.{repo}_1776_{name}:latest'.lower()
         logger.debug(f'Using official SWE-Bench image: {image_name}')
         return image_name
     else:
```
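
Concretely, the new prefix changes how the image name resolves for SWE-bench-Live instances. A small sketch of the naming logic above, using a hypothetical instance ID:

```python
# Hypothetical instance ID; the f-string is copied from the diff above.
docker_image_prefix = 'docker.io/starryzhang/'
instance_id = 'pytest-dev__pytest-9999'
repo, name = instance_id.split('__')
image_name = f'{docker_image_prefix.rstrip("/")}/sweb.eval.x86_64.{repo}_1776_{name}:latest'.lower()
print(image_name)
# docker.io/starryzhang/sweb.eval.x86_64.pytest-dev_1776_pytest-9999:latest
```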
```diff
@@ -173,7 +200,8 @@ def get_config(
     metadata: EvalMetadata,
 ) -> OpenHandsConfig:
     # We use a different instance image for each instance of swe-bench eval
-    use_swebench_official_image = 'swe-gym' not in metadata.dataset.lower()
+    use_swebench_official_image = DATASET_TYPE != 'SWE-Gym'
+
     base_container_image = get_instance_docker_image(
         instance['instance_id'],
         swebench_official_image=use_swebench_official_image,
@@ -290,8 +318,12 @@ def initialize_runtime(
     runtime.copy_to(temp_file_path, '/swe_util/eval_data/instances/')
 
     # inject the instance swe entry
+    if DATASET_TYPE == 'SWE-bench-Live':
+        entry_script_path = 'instance_swe_entry_live.sh'
+    else:
+        entry_script_path = 'instance_swe_entry.sh'
     runtime.copy_to(
-        str(os.path.join(script_dir, 'scripts/setup/instance_swe_entry.sh')),
+        str(os.path.join(script_dir, f'scripts/setup/{entry_script_path}')),
         '/swe_util/',
     )
 
@@ -311,14 +343,14 @@
         logger.error(f'Failed to source ~/.bashrc: {str(obs)}')
     assert_and_raise(obs.exit_code == 0, f'Failed to source ~/.bashrc: {str(obs)}')
 
-    action = CmdRunAction(command='source /swe_util/instance_swe_entry.sh')
+    action = CmdRunAction(command=f'source /swe_util/{entry_script_path}')
     action.set_hard_timeout(600)
     logger.info(action, extra={'msg_type': 'ACTION'})
     obs = runtime.run_action(action)
     logger.info(obs, extra={'msg_type': 'OBSERVATION'})
     assert_and_raise(
         obs.exit_code == 0,
-        f'Failed to source /swe_util/instance_swe_entry.sh: {str(obs)}',
+        f'Failed to source /swe_util/{entry_script_path}: {str(obs)}',
     )
 
     action = CmdRunAction(command=f'cd /workspace/{workspace_dir_name}')
@@ -371,9 +403,9 @@
     obs = runtime.run_action(action)
     logger.info(obs, extra={'msg_type': 'OBSERVATION'})
 
-    if 'multimodal' not in metadata.dataset.lower():
+    if DATASET_TYPE != 'Multimodal' and DATASET_TYPE != 'SWE-bench-Live':
         # Only for non-multimodal datasets, we need to activate the testbed environment for Python
-        # SWE-Bench multimodal datasets are not using the testbed environment
+        # SWE-Bench multimodal datasets and SWE-bench-Live do not use the testbed environment
         action = CmdRunAction(command='which python')
         action.set_hard_timeout(600)
         logger.info(action, extra={'msg_type': 'ACTION'})
@@ -615,7 +647,13 @@ def process_instance(
 
     # ======= THIS IS SWE-Bench specific =======
     # Get git patch
-    return_val = complete_runtime(runtime, instance)
+    if DATASET_TYPE == 'SWE-bench-Live':
+        from evaluation.benchmarks.swe_bench.live_utils import (
+            complete_runtime as complete_runtime_fn,
+        )
+    else:
+        complete_runtime_fn = complete_runtime
+    return_val = complete_runtime_fn(runtime, instance)
     git_patch = return_val['git_patch']
     logger.info(
         f'Got git diff for instance {instance.instance_id}:\n--------\n{git_patch}\n--------'
@@ -720,11 +758,15 @@ def filter_dataset(dataset: pd.DataFrame, filter_column: str) -> pd.DataFrame:
     # NOTE: It is preferable to load datasets from huggingface datasets and perform post-processing
     # so we don't need to manage file uploading to OpenHands's repo
     dataset = load_dataset(args.dataset, split=args.split)
+
+    # Set the global dataset type based on the dataset name
+    set_dataset_type(args.dataset)
+
     swe_bench_tests = filter_dataset(dataset.to_pandas(), 'instance_id')
     logger.info(
         f'Loaded dataset {args.dataset} with split {args.split}: {len(swe_bench_tests)} tasks'
     )
-    if 'SWE-Gym' in args.dataset:
+    if DATASET_TYPE == 'SWE-Gym':
         with open(
             os.path.join(
                 os.path.dirname(os.path.abspath(__file__)),
```
evaluation/benchmarks/swe_bench/scripts/live/convert.py

Lines changed: 33 additions & 0 deletions
```python
import argparse
import json
import sys


def main(output_jsonl: str):
    with open(output_jsonl, 'r') as f:
        for line in f:
            output = {}
            try:
                output = json.loads(line)
                pred = {
                    'instance_id': output['instance_id'],
                    'model_name_or_path': output['metadata']['llm_config']['model'],
                    'model_patch': output['test_result']['git_patch'],
                }
            except Exception as e:
                # Report on stderr so errors do not end up in the redirected
                # preds.jsonl, and skip the malformed line.
                print(
                    f'Error while reading output of instance {output.get("instance_id", "<unknown>")}: {e}',
                    file=sys.stderr,
                )
                continue

            print(json.dumps(pred))


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument(
        '--output_jsonl',
        type=str,
        required=True,
        help='Path to the prediction file (.../outputs.jsonl)',
    )
    args = parser.parse_args()

    main(args.output_jsonl)
```
evaluation/benchmarks/swe_bench/scripts/setup/instance_swe_entry_live.sh

Lines changed: 41 additions & 0 deletions
```bash
#!/usr/bin/env bash

source ~/.bashrc
SWEUTIL_DIR=/swe_util

# FIXME: Cannot read SWE_INSTANCE_ID from the environment variable
# SWE_INSTANCE_ID=django__django-11099
if [ -z "$SWE_INSTANCE_ID" ]; then
    echo "Error: SWE_INSTANCE_ID is not set." >&2
    exit 1
fi

# Read swe-bench-instance.json and extract the item matching the instance_id
item=$(jq --arg INSTANCE_ID "$SWE_INSTANCE_ID" '.[] | select(.instance_id == $INSTANCE_ID)' $SWEUTIL_DIR/eval_data/instances/swe-bench-instance.json)

if [[ -z "$item" ]]; then
    echo "No item found for the provided instance ID."
    exit 1
fi

echo "WORKSPACE_NAME: $SWE_INSTANCE_ID"

# Clear the workspace
if [ -d /workspace ]; then
    rm -rf /workspace/*
else
    mkdir /workspace
fi

# Copy the repo to the workspace
if [ -d /workspace/$SWE_INSTANCE_ID ]; then
    rm -rf /workspace/$SWE_INSTANCE_ID
fi
mkdir -p /workspace
cp -r /testbed /workspace/$SWE_INSTANCE_ID

# SWE-bench-Live does not use conda to manage Python
# if [ -d /opt/miniconda3 ]; then
#     . /opt/miniconda3/etc/profile.d/conda.sh
#     conda activate testbed
# fi
```

evaluation/utils/shared.py

Lines changed: 12 additions & 1 deletion
```diff
@@ -263,8 +263,19 @@ def prepare_dataset(
         f'Randomly sampling {eval_n_limit} unique instances with random seed 42.'
     )
 
+    def make_serializable(instance: pd.Series) -> dict:
+        import numpy as np
+
+        instance_dict = instance.to_dict()
+        for k, v in instance_dict.items():
+            if isinstance(v, np.ndarray):
+                instance_dict[k] = v.tolist()
+            elif isinstance(v, pd.Timestamp):
+                instance_dict[k] = str(v)
+        return instance_dict
+
     new_dataset = [
-        instance
+        make_serializable(instance)
         for _, instance in dataset.iterrows()
         if str(instance[id_column]) not in finished_ids
     ]
```
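
The helper exists because rows loaded from Hugging Face datasets may contain `numpy` arrays and `pandas` timestamps, which `json.dumps` rejects. A self-contained sketch of the same conversion, with made-up values:

```python
import json

import numpy as np
import pandas as pd

# Made-up row containing the two problematic types handled by make_serializable.
row = pd.Series({
    'instance_id': 'demo__demo-1',
    'created_at': pd.Timestamp('2025-01-01'),
    'scores': np.array([0.1, 0.2]),
})

converted = row.to_dict()
for k, v in converted.items():
    if isinstance(v, np.ndarray):
        converted[k] = v.tolist()  # ndarray -> plain list
    elif isinstance(v, pd.Timestamp):
        converted[k] = str(v)  # Timestamp -> string

print(json.dumps(converted))  # now serializes cleanly
```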
