BioCoder integration #2076

Merged
merged 16 commits into from
Jun 10, 2024

tangxiangru
Contributor

Integrates BioCoder from https://arxiv.org/abs/2308.16458

@lilbillybiscuit
Contributor

Description:

This pull request introduces an evaluation on the BioCoder dataset using agents on OpenDevin. The BioCoder benchmark aims to assess the performance of LLMs on entire repositories related to bioinformatics. It is a challenging full-repository-level benchmark that requires both an understanding of long-range, inter-class dependencies and domain knowledge.

By evaluating on this dataset, we can gain further insight into the performance of LLMs on closed-domain (bioinformatics) tasks, as well as their ability to retrieve information and relevant code from a large context window or an entire repository.

Key changes:

Added evaluation/biocoder/run_infer.py: This file handles execution of the entire BioCoder benchmark using OpenDevin. It loads datasets, runs the agent on each task, and saves the evaluation outputs. Note that we are still
Added evaluation/biocoder/biocoder_env_box.py: This file sets up the BioCoder Docker environment (which contains many dependencies, libraries, and environments shared across the repositories in BioCoder), including downloading repository files and metadata.

Updated README.md: We have updated the README.md to include information about BioCoder, such as the design, challenges, and objectives of the benchmark. Furthermore, we include documentation on executing the framework, an example of the evaluation, and instructions for contributing and customization.
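
The flow described for run_infer.py (load tasks, run the agent on each one, save the outputs) can be sketched roughly as below. This is an illustrative outline only, not the code in this PR: `run_benchmark`, `run_agent`, the `instance_id` field, and the one-JSON-line-per-task output format are all assumptions made for the example.

```python
import io
import json
from typing import Callable, Iterable, TextIO


def run_benchmark(
    tasks: Iterable[dict],
    run_agent: Callable[[dict], str],
    out: TextIO,
) -> int:
    """Run `run_agent` on each task and write one JSON line per task to `out`.

    Returns the number of tasks processed. A failure on one task is
    recorded in its output line instead of aborting the whole run.
    """
    processed = 0
    for task in tasks:
        # `instance_id` is a hypothetical field name, not taken from the PR.
        record = {"instance_id": task["instance_id"], "output": None, "error": None}
        try:
            record["output"] = run_agent(task)
        except Exception as exc:  # keep going so one bad task does not lose the rest
            record["error"] = str(exc)
        out.write(json.dumps(record) + "\n")
        processed += 1
    return processed


if __name__ == "__main__":
    # Hypothetical stand-ins for the dataset and the agent.
    demo_tasks = [{"instance_id": "task-1"}, {"instance_id": "task-2"}]
    buffer = io.StringIO()
    run_benchmark(demo_tasks, lambda task: "generated code for " + task["instance_id"], buffer)
    print(buffer.getvalue(), end="")
```

Writing each record as soon as its task finishes means a crash partway through still leaves the earlier results on disk, which matters for long benchmark runs.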

Related papers:

BioCoder: A Benchmark for Bioinformatics Code Generation with Large Language Models (https://arxiv.org/pdf/2308.16458)

Collaborator

@yufansong yufansong left a comment

I am a little confused: what is BiocoderSSHBox for? It does not seem to be called in run_infer.py.

Comment on lines 10 to 14
## Configure OpenDevin and your LLM


## Run Inference

Collaborator

Can you add more details about how to run your benchmark and also add run_infer.sh?

@xingyaoww xingyaoww added the evaluation Related to running evaluations with OpenHands label May 27, 2024
def get_box_for_instance(
    cls,
    instance,
    n_tries=5,
Contributor

It seems that n_tries is not being used. If you need to retry, you can use the retry decorator from the tenacity library, for example:

@retry(stop=stop_after_attempt(5), wait=wait_fixed(5))
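
If adding a dependency is not desirable, the unused n_tries parameter could instead be honored with a small hand-written loop. The following is a stdlib-only sketch of that idea, not the tenacity decorator above and not code from this PR; `call_with_retries`, `create`, and `wait_seconds` are hypothetical names chosen for illustration.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def call_with_retries(
    create: Callable[[], T],
    n_tries: int = 5,
    wait_seconds: float = 5.0,
) -> T:
    """Call `create` up to `n_tries` times, sleeping between failed attempts."""
    if n_tries < 1:
        raise ValueError("n_tries must be at least 1")
    last_error: Optional[Exception] = None
    for attempt in range(1, n_tries + 1):
        try:
            return create()
        except Exception as exc:
            last_error = exc
            # Do not sleep after the final attempt; there is nothing left to wait for.
            if attempt < n_tries:
                time.sleep(wait_seconds)
    raise RuntimeError(f"failed after {n_tries} attempts") from last_error
```

A caller would wrap whatever step can fail transiently, for instance `call_with_retries(lambda: start_container(instance), n_tries=n_tries)`, where `start_container` is again a placeholder. Chaining the last exception keeps the original failure visible in the traceback.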

@neubig
Contributor

neubig commented May 31, 2024

@tangxiangru and @lilbillybiscuit, thanks a lot for contributing this! Could you ping us again when the PR is finished and ready for review (including the documentation of how to run it)? I'll change it to a draft PR in the meantime.

@neubig neubig marked this pull request as draft May 31, 2024 19:19
Contributor

@neubig neubig left a comment

Hi @tangxiangru, @lilbillybiscuit, and @li-boxuan -- now that we're done with the main evals for the paper, it'd be great to get this merged if you have a moment!

@yufansong yufansong marked this pull request as ready for review June 8, 2024 17:45
@yufansong yufansong marked this pull request as draft June 8, 2024 17:48
Collaborator

@yufansong yufansong left a comment

@tangxiangru @lilbillybiscuit I tried to help by addressing some of the review comments and adding a shell script and README file. But when I tried to run your code locally, I ran into problems around the instance and BiocoderData. Each time I fixed one, follow-up problems appeared. So I guess you have not pushed the final version of the evaluation code here.

So please ping me to review when you finish the code. Thanks.

@lilbillybiscuit
Contributor

Hi @yufansong and @tangxiangru, I am sorry for the delay, but the final version has been pushed. The script can be run with the same arguments as SWE-bench. I am currently writing a more polished README, which will be pushed shortly.

To answer the previous question, BiocoderSSHBox is a modified version of SWEBenchSSHBox that runs our Docker image and all the dependencies (such as fetching the repository archive, testing caches, etc.).

@yufansong yufansong marked this pull request as ready for review June 10, 2024 03:11
@yufansong yufansong merged commit 7fc5765 into All-Hands-AI:main Jun 10, 2024
2 checks passed