Commit 7fc5765

tangxiangru, lilbillybiscuit, and yufansong authored
BioCoder integration (#2076)
* prepare execution and inference
* Create README.md
* Update README.md
* Update evaluation/biocoder/README.md
* Update evaluation/swe_bench/swe_env_box.py
* switch to biocoder docker container and test-specific code
* code for copying and running test files into container
* add metrics
* add readme
* Biocoder evaluation code finished (rewrite testing infrastructure, prompt tuning, and bug fixes)
* Update README.md

---------

Co-authored-by: lilbillybiscuit <[email protected]>
Co-authored-by: Yufan Song <[email protected]>
Co-authored-by: yufansong <[email protected]>
1 parent 91ddd93 commit 7fc5765

5 files changed

+886
-1
lines changed

evaluation/biocoder/README.md

+59
@@ -0,0 +1,59 @@
# BioCoder Evaluation with OpenDevin

Implements evaluation of agents on the BioCoder benchmark, introduced in [BioCoder: A Benchmark for Bioinformatics Code Generation with Large Language Models](https://arxiv.org/abs/2308.16458). Please see [here](https://github.com/bigcode-project/bigcode-evaluation-harness/blob/main/bigcode_eval/tasks/humanevalpack.py) for the reference implementation used in the paper.

## Setup Environment

Please follow [this document](https://github.com/OpenDevin/OpenDevin/blob/main/Development.md) to set up a local development environment for OpenDevin.

## Configure OpenDevin and your LLM

Create a `config.toml` file at the root of the workspace if it does not already exist. Please check [README.md](../../README.md) for how to set this up.
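Before launching a long evaluation, it can help to confirm that the config group you plan to pass to the evaluation script is actually present in `config.toml`. The following is a minimal sketch, assuming the group appears as a TOML table header on its own line (for example `[eval_gpt4_1106_preview]`). The helper name `has_config_group` is hypothetical and is not part of OpenDevin.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of OpenDevin): succeed if the given file
# contains a line that is exactly the TOML table header "[<group>]".
has_config_group() {
  local config_file="$1"
  local group="$2"
  # -F: fixed string, -x: match the whole line, -q: quiet (exit status only)
  grep -Fxq "[${group}]" "$config_file"
}
```

For example, `has_config_group config.toml eval_gpt4_1106_preview || echo "group not found"` prints a warning when the header is missing. The check is deliberately strict: a header with leading whitespace or a trailing comment would not match.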

## BioCoder Docker Image

In the opendevin branch of the BioCoder repository, we have slightly modified our original Docker image to work with the OpenDevin environment. The Docker image contains testing scripts (`/testing/start_test_opendevin.py` and auxiliary files in `/testing_files/`) to assist with evaluation. We have also installed all dependencies, including OpenJDK, mamba (with Python 3.6), and many system libraries. Notably, we have **not** packaged all repositories into the image, so they are downloaded at runtime.

**Before first execution, pull our Docker image with the following command:**

```bash
docker pull public.ecr.aws/i5g0m1f6/eval_biocoder:v1.0
```

To reproduce this image, please see `Dockerfile_Opendevin` in the `biocoder` repository.
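If the evaluation is run repeatedly, the pull can be skipped when the image is already available locally. The sketch below wraps that check in a hypothetical helper, `ensure_image`, which is not part of the BioCoder or OpenDevin tooling. It relies only on `docker image inspect`, which exits with a non-zero status when the image is absent, and `docker pull`.

```bash
#!/usr/bin/env bash
IMAGE="public.ecr.aws/i5g0m1f6/eval_biocoder:v1.0"

# Hypothetical helper: pull the image only if it is not present locally.
ensure_image() {
  local image="$1"
  if docker image inspect "$image" >/dev/null 2>&1; then
    echo "present: $image"
  else
    docker pull "$image"
  fi
}
```

Calling `ensure_image "$IMAGE"` then either reports that the image is present or pulls it.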

## Start the evaluation

```bash
./evaluation/biocoder/scripts/run_infer.sh [model_config] [agent] [eval_limit]
```

where `model_config` is mandatory, while `agent` and `eval_limit` are optional.

- `model_config`, e.g. `eval_gpt4_1106_preview`, is the config group name for your LLM settings, as defined in your `config.toml`.

- `agent`, e.g. `CodeActAgent`, is the name of the agent for benchmarks, defaulting to `CodeActAgent`.

- `eval_limit`, e.g. `10`, limits the evaluation to the first `eval_limit` instances. By default, it infers all instances.
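The argument rules above (one mandatory argument, two optional ones, one with a default) can be sketched with ordinary Bash parameter expansion. This is an illustration only and is not the contents of `run_infer.sh`. The function names `parse_args` and `describe_run` and the variable names are assumptions made for the example.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of the positional-argument handling described above.
parse_args() {
  if [ -z "${1:-}" ]; then
    echo "usage: run_infer.sh [model_config] [agent] [eval_limit]" >&2
    return 1
  fi
  MODEL_CONFIG="$1"            # mandatory: config group name from config.toml
  AGENT="${2:-CodeActAgent}"   # optional: falls back to the documented default
  EVAL_LIMIT="${3:-}"          # optional: empty means "all instances"
}

# Print a one-line summary of the settings chosen by parse_args.
describe_run() {
  if [ -n "$EVAL_LIMIT" ]; then
    echo "model_config=$MODEL_CONFIG agent=$AGENT eval_limit=$EVAL_LIMIT"
  else
    echo "model_config=$MODEL_CONFIG agent=$AGENT eval_limit=all"
  fi
}
```

With this sketch, `parse_args eval_gpt4_1106_preview && describe_run` reports `CodeActAgent` as the agent and no instance limit, while calling `parse_args` with no arguments prints the usage line and returns a non-zero status.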

## Examples

Let's say you'd like to run 1 instance using `eval_gpt4o_2024_05_13` and CodeActAgent, then your command would be:

```bash
./evaluation/biocoder/scripts/run_infer.sh eval_gpt4o_2024_05_13 CodeActAgent 1
```

## Reference
```
@misc{tang2024biocoder,
      title={BioCoder: A Benchmark for Bioinformatics Code Generation with Large Language Models},
      author={Xiangru Tang and Bill Qian and Rick Gao and Jiakang Chen and Xinyun Chen and Mark Gerstein},
      year={2024},
      eprint={2308.16458},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}
```
