Code for "RepoST: Scalable Repository-Level Coding Environment Construction with Sandbox Testing" (arXiv, Website)
If you find our paper or code useful, please cite the paper:
@misc{xie2025repost,
title={RepoST: Scalable Repository-Level Coding Environment Construction with Sandbox Testing},
author={Yiqing Xie and Alex Xie and Divyanshu Sheth and Pengfei Liu and Daniel Fried and Carolyn Rose},
year={2025},
eprint={2503.07358},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2503.07358},
}
- Format of the RepoST-Train and RepoST-Eval data: The structure of our datasets.
- Evaluation on RepoST-Eval: how to run evaluation on RepoST-Eval. The code includes:
- (1) loading the RepoST-Eval docker image for execution,
- (2) running code generation for RepoST-Eval examples, and
- (3) executing our evaluation scripts in the docker to compute Pass@k scores.
We follow R2E and provide the "Ground-Truth" code context, which is constructed by extracting the modules directly or indirectly called by the target function. In principle, you can also clone the repos and retrieve context or apply agent frameworks.
- Training with RepoST-Train (Vanilla SFT): the process of setting up our training environment, RepoST-Train, and using it to construct training data. We provide the code to run vanilla SFT (with ground truth functions as targets) using the LlamaFactory framework, where you do not need to set up dockers.
- Training with RepoST-Train (Rejection Sampling): We also provide the code to run rejection sampling, which includes
- (1) loading the RepoST-Train docker image,
- (2) generating candidate solutions on training examples,
- (3) executing the candidate solutions,
- (4, optional) debugging failed candidate solutions, and
- (5) running model training using LlamaFactory.
- Evaluation on Public Datasets: After training the model, you can use our code to run evaluation on HumanEval and RepoEval-func.
- Build Your Own Executable Code Generation Environments: You can also build your own executable environments using RepoST. The steps include:
- (1) Curating GitHub Repos, Functions, and Context
- (2) Sandboxing
- (3) Test Generation
- (4) Iterative Execution & Debugging
- (5) Post-Checking
The RepoST-Train and RepoST-Eval datasets have the same structure. They both consist of (1) a set of GitHub repos (each pinned to a specific commit id), (2) a set of target functions inside each repo, and (3) an evaluation script for each target function that can be executed in our docker environment.
We provide the data files in data/ExecTrain/checked_test_set_final.json (and data/ExecEval/checked_test_set_final.json). Each file is a list of dicts. Each dict represents one example and is in the following format:
{
"func_name": "VNetwork.forward", # the (class name and) function name of the target function
"idx": "708", # the idx of the example
"repo_name": "seakers___sacTransformerEOS", # the repo name
"commit_id": "669fb18ee47ae5947b578e392be25c1517f6d73b", # the commit id
"func_path": "scripts/sac.py", # the local file path that contains the target function
"orig_context": "```python\n## scripts/sac.py\nimport torch ...", # the "ground truth" code context extracted from the repo
"eval_script": "## scripts/sac.py\nimport torch\n\nimport torch.nn as nn ...", # the evaluation script generated in the RepoST framework, used for execution-based evaluation
"coverage_rate": 1.0, # coverage rate
"sandbox_ast_check": true, # the following three are post-check results
"sandbox_functionality_check": ...,
"test_correctness_check": ...,
}
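As a quick sanity check, the snippet below (a minimal sketch assuming the default data paths above) loads RepoST-Eval and inspects one example:
import json

# Load RepoST-Eval; use data/ExecTrain/checked_test_set_final.json for RepoST-Train.
with open("data/ExecEval/checked_test_set_final.json") as f:
    examples = json.load(f)  # a list of dicts, one per example

print(len(examples), "examples")
ex = examples[0]
print(ex["func_name"], ex["repo_name"], ex["commit_id"])
print(ex["eval_script"][:300])  # the generated evaluation script (truncated)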
Create a new environment:
conda create -n repost python=3.11 -y
conda activate repost
Set up the environment for code generation and RepoST environment construction (Note that you probably need a separate environment for model training):
pip install -r requirements.txt
And specify code and data paths:
source setup_eval.sh
We provide the RepoST-Eval docker image. Load the image with:
docker load -i repost_eval_image.tar
and create a docker container that mounts the data folders (specified in setup_eval.sh):
docker run --name=execeval -v ${final_dataset_DIR}:${docker_final_dataset_DIR} -it repost_eval:v0
We provide the code for running inference under generation/.
We use vllm to run code generation for open-source models and use litellm for API-based models.
Here are the example scripts, where we evaluate Pass@1 for GPT-4o and QwenCoder-7B.
cd generation
bash scripts/ExecEval_gpt4o_generation.sh
bash scripts/ExecEval_qwen_generation.sh
The results will be saved under data/ExecEval/results/, for example, data/ExecEval/results/ExecEval_gpt4o_context_top1.json.
Our code will automatically copy the generated function into the evaluation script (with test cases) to check the correctness of the generation results; the resulting script is stored under the generation_script key.
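For illustration, this splicing step is conceptually similar to the sketch below, which replaces the target function in the evaluation script with the model-generated one. splice_function is a hypothetical helper written for this README (not the repo's actual implementation) and ignores corner cases such as decorators or pre-indented generations:
import ast

def splice_function(eval_script: str, generated_func: str, func_name: str) -> str:
    """Replace the definition of `func_name` in `eval_script` with `generated_func`."""
    simple_name = func_name.split(".")[-1]  # e.g. "VNetwork.forward" -> "forward"
    lines = eval_script.splitlines()
    for node in ast.walk(ast.parse(eval_script)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)) and node.name == simple_name:
            indent = " " * node.col_offset  # re-indent an unindented generation (e.g. for class methods)
            new_lines = [indent + l if l.strip() else l for l in generated_func.splitlines()]
            return "\n".join(lines[:node.lineno - 1] + new_lines + lines[node.end_lineno:])
    raise ValueError(f"{simple_name} not found in the evaluation script")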
After the previous step, you should see the generation file under ${docker_final_dataset_DIR} (specified in setup_eval.sh) inside the docker container.
Assuming your generation file is data/ExecEval/results/ExecEval_gpt4o_context_top1.json and the dict key that stores the scripts is generation_script, you can run execution with:
# in the docker container
cd RepoST
source setup_eval.sh
python RepoST/execution.py --input_file ../ExecEval/results/ExecEval_gpt4o_context_top1.json --script_key generation_script
The execution results for each evaluation example will be stored in data/ExecEval/results/ExecEval_gpt4o_context_top1_exec.json.
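For reference, Pass@k is usually reported with the standard unbiased estimator below (a generic reference implementation, not necessarily the exact code inside our evaluation scripts), where n candidates are generated per example and c of them pass:
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimator: n generated samples, c passing samples, k <= n."""
    if n - c < k:
        return 1.0
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example: 5 candidates, 2 of them pass -> Pass@1 = 0.4
print(pass_at_k(n=5, c=2, k=1))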
We provide the vanilla SFT data we used in our paper in LLaMA-Factory/data/ExecTrain_exec_sft_data.json and provide the code for model training under LLaMA-Factory.
If you want to process the data on your own, we provide the code for processing training data under get_training_data, which processes data/ExecTrain/checked_test_set_final.json into the training data format.
Create a new environment:
conda create -n llama python=3.11 -y
conda activate llama
Set up the environment for model training (Note that you probably need a separate environment for code generation and RepoST environment construction):
cd LLaMA-Factory
pip install -e .
And specify code and data paths:
source setup_train.sh
We provide the vanilla SFT data in LLaMA-Factory/data/ExecTrain_exec_sft_data.json.
You can also process the data again by running:
cd get_training_data
bash scripts/get_full_sft_data.sh
The data file will be generated under LLaMA-Factory/data.
Before training, you first need to add the data information to LLaMA-Factory/data/dataset_info.json. For example, add an entry:
"ExecTrain_exec_sft": {
"file_name": "ExecTrain_exec_sft_data.json"
},
After that, write a config file. We provide an example in LLaMA-Factory/examples/train_full/qwencoder_ExecTrain_exec_sft.yaml.
Finally, run training by calling llamafactory-cli train. We provide an example here:
cd LLaMA-Factory
bash scripts/sft_ExecTrain_exec_qwencoder.sh
Similar to vanilla SFT, we provide the rejection sampling (distill) data we used in our paper in LLaMA-Factory/data/ExecTrain_rejection_sampling_claudegpt_debug_replay_data.json and provide the code for model training under LLaMA-Factory. Please refer directly to the Model Training Instructions if you want to train with the provided data.
If you want to run rejection sampling on your own, we provide the code for generating candidate solutions under generation/, the code for execution in RepoST/execution.py, and the code to process training data under get_training_data/.
We recommend using different environments for code generation and for model training. Specifically:
For code generation:
conda create -n repost python=3.11 -y
conda activate repost
pip install -r requirements.txt
For model training:
conda create -n llama python=3.11 -y
conda activate llama
cd LLaMA-Factory
pip install -e .
Finally, specify code and data paths:
source setup_train.sh
We provide the RepoST-Train docker image. Load the image with:
docker load -i repost_train_image.tar
and create a docker container that mounts the data folders (specified in setup_train.sh):
docker run --name=exectrain -v ${final_dataset_DIR}:${docker_final_dataset_DIR} -it repost_train:v0
We provide the code for generating candidate solutions under generation/.
We use vllm to run code generation for open-source models and use litellm for API-based models.
Here is an example script, where we generate 5 candidate solutions for each training example using GPT-4o.
cd generation
bash scripts/ExecTrain_gpt4o_generation.sh
The results will be saved under data/ExecTrain/results/, for example, data/ExecTrain/rejection_sampling/ExecTrain_gpt4o_top5.json.
Our code will automatically copy the generated function into the evaluation script (with test cases) to check the correctness of the generation results; the resulting script is stored under the generation_script key.
After the previous step, you should see the generation file under ${docker_final_dataset_DIR} (specified in setup_train.sh) inside the docker container.
Assuming your generation file is data/ExecTrain/rejection_sampling/ExecTrain_gpt4o_top5.json and the dict key that stores the scripts is generation_script, you can run execution with:
# in the docker container
cd RepoST
source setup_train.sh
python RepoST/execution.py --input_file ../ExecTrain/rejection_sampling/ExecTrain_gpt4o_top5.json --script_key generation_script
The execution results for each training example will be stored in data/ExecTrain/rejection_sampling/ExecTrain_gpt4o_top5_exec.json.
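Conceptually, rejection sampling then keeps only the candidates that execute successfully and turns them into SFT pairs. The sketch below only illustrates the idea; the field names "generations" and "exec_results" are placeholders (check the keys actually written by RepoST/execution.py), and the real logic lives in the get_training_data/ scripts:
import json

# Illustrative only: field names below are placeholders, not the actual schema.
with open("data/ExecTrain/rejection_sampling/ExecTrain_gpt4o_top5_exec.json") as f:
    examples = json.load(f)

sft_pairs = []
for ex in examples:
    for candidate, result in zip(ex.get("generations", []), ex.get("exec_results", [])):
        if result == "success":  # keep only candidates that pass all tests
            sft_pairs.append({"prompt": ex["orig_context"], "response": candidate})

print(f"kept {len(sft_pairs)} execution-verified training pairs")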
To obtain more training instances, you can also call the model to debug failed solutions.
Here is an example script, where we prompt GPT-4o to generate one debugged solution for each failed candidate solution.
cd generation
bash scripts/ExecTrain_gpt4o_debug.sh
Finally, process the training data under get_training_data.
We provide the example scripts here:
cd get_training_data
bash scripts/get_rej_sampling_data.sh
and
cd get_training_data
bash scripts/get_rej_sampling_debug_data.sh
The processed training data files will be saved to LLaMA-Factory/data/.
Before training, you first need to add the data information to LLaMA-Factory/data/dataset_info.json. For example, add an entry:
"ExecTrain_gpt_debug_replay": {
"file_name": "ExecTrain_rejection_sampling_gpt_debug_replay_data.json"
},
After that, write a config file with the data information. For example, create LLaMA-Factory/examples/train_full/qwencoder_ExecTrain_exec_rej_sampling_gpt_debug_replay.yaml.
Finally, run training by calling llamafactory-cli train with your config. For example:
cd LLaMA-Factory
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 llamafactory-cli train examples/train_full/qwencoder_ExecTrain_exec_rej_sampling_gpt_debug_replay.yaml
We provide the code for evaluating on public benchmarks under CodeRAGBench_evaluation, where we borrow and modify the code of CodeRAGBench. Please refer to the instructions in CodeRAGBench_evaluation/README.md.
We provide the code for scraping GitHub repos, functions, and context under get_github_input and the code for constructing the evaluation scripts under RepoST.
You will need to build a docker to execute the evaluation scripts.
Create a new environment:
conda create -n repost python=3.11 -y
conda activate repost
Set up the environment for code generation and RepoST environment construction (Note that you probably need a separate environment for model training):
pip install -r requirements.txt
Finally, specify the code and data paths in your own setup file. We provide get_github_input/r2e/ExecEval_config.yml as an example.
Load the setup file with:
source setup_eval.sh
Start by creating an empty docker image based on the Dockerfile:
docker build --tag exec_eval .
And run a docker container:
docker run --name=exec_eval_test -v ${dataset_generation_DIR}:${docker_dataset_generation_DIR} -it exec_eval
Inside the docker container, you may want to create a restricted user that can only access the data folder and the tmp folder:
adduser --disabled-password --gecos "" restricted_user
chown restricted_user:restricted_user /home/user/ExecEval
chown restricted_user:restricted_user /home/user/tmp
You first need to write a config file under get_github_input/, where we provide two examples: get_github_input/r2e/ExecTrain_config.yml and get_github_input/r2e/ExecEval_config.yml.
After that, specify the path to the config file in get_github_input/setup.sh.
To scrape GitHub repos, you can follow the example script used for creating RepoST-Eval: get_github_input/scripts/ExecEval_scrape_repos.sh.
In this script, we scrape 1,000 repos created between 2024-09-01 and 2024-12-31, and randomly sample 300. We clone the selected repos and remove the ones with storage sizes larger than 10M.
cd get_github_input
bash scripts/ExecEval_scrape_repos.sh
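The scraping step is conceptually a GitHub search with a creation-date filter followed by random sampling, as in the simplified sketch below (an illustration only; the actual logic, including cloning and the on-disk size filter, is in get_github_input/):
import random
import requests

# Simplified illustration: search Python repos created in the date window,
# then randomly sample 300 of them (the search API returns at most 1,000 hits).
headers = {"Accept": "application/vnd.github+json"}  # add an Authorization token for higher rate limits
repos = []
for page in range(1, 11):
    resp = requests.get(
        "https://api.github.com/search/repositories",
        params={"q": "language:python created:2024-09-01..2024-12-31",
                "per_page": 100, "page": page},
        headers=headers, timeout=30,
    )
    resp.raise_for_status()
    repos.extend(resp.json()["items"])

sampled = random.sample(repos, k=min(300, len(repos)))
print(len(sampled), "repos sampled; next, clone them and drop oversized ones")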
To sample functions and extract ground truth context via the call graph, you can follow the example script used for creating RepoST-Eval: get_github_input/scripts/ExecEval_obtain_functions.sh.
In this script, we extract all the functions with docstrings, sample at most 15 standalone functions and 15 class methods for each repo, and extract context with a maximum block size of 8,000 tokens.
cd get_github_input
bash scripts/ExecEval_obtain_functions.sh
After running this script, the file with the initial repos, functions, and context will be saved under data/, for example, data/ExecEval/cleaned_python_test.json.
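The function-sampling step follows the usual pattern of walking each file's AST and keeping functions that carry docstrings, roughly as in the simplified sketch below (the actual code under get_github_input/ additionally builds the call graph and enforces the per-repo and context-size limits):
import ast

def functions_with_docstrings(path: str):
    """Return (name, source) pairs for documented functions/methods in one Python file."""
    source = open(path, encoding="utf-8").read()
    results = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)) and ast.get_docstring(node):
            results.append((node.name, ast.get_source_segment(source, node)))
    return results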
The second step is to call an LLM to aggregate the target function and its context (both in-file and cross-file) into a separate script.
cd RepoST
python sandboxing.py --input_file cleaned_python_test.json --model_name "claude-3-5-sonnet-20240620"
If you use our default setup files, the results will be saved to data/ExecEval/cleaned_python_test_sandboxed.json.
The third step is to call an LLM to create tests for the target function in the sandboxed script.
cd RepoST
python test_generation.py --input_file cleaned_python_test_sandboxed.json --model_name "claude-3-5-sonnet-20240620"
If you use our default setup files, the results will be saved to data/ExecEval/cleaned_python_test_tests.json.
In principle, the ground truth target function should pass all the generated tests in the sandboxed script with no errors. We hence execute the sandboxed script and use an LLM to debug the script (if needed):
Run execution inside the docker:
# inside the docker
python RepoST/RepoST/execution.py --input_file /home/user/ExecEval/cleaned_python_test_tests.json --script_key full_script --test_gt
python RepoST/RepoST/execution.py --input_file /home/user/ExecEval/cleaned_python_test_tests_exec.json --script_key full_script --test_gt --check_coverage_mode
Alternatively, you can run execution under the restricted_user:
# inside the docker
su -c "python RepoST/RepoST/execution.py --input_file /home/user/ExecEval/cleaned_python_test_tests.json --script_key full_script --test_gt" restricted_user
su -c "python RepoST/RepoST/execution.py --input_file /home/user/ExecEval/cleaned_python_test_tests_exec.json --script_key full_script --test_gt --check_coverage_mode" restricted_user
If you use our default setup files, the execution results will be saved to data/ExecEval/cleaned_python_test_tests_exec_coverage.json.
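Under the hood, executing a sandboxed script boils down to running it (with its tests) in a subprocess and checking the exit code and coverage, along the lines of the sketch below (a simplified stand-in for RepoST/execution.py; timeout handling, sandboxing, and the output format differ in the real script):
import json
import subprocess

def run_script(script_path: str, timeout: int = 120):
    """Run one sandboxed evaluation script; report pass/fail and branch coverage."""
    proc = subprocess.run(
        ["coverage", "run", "--branch", script_path],
        capture_output=True, text=True, timeout=timeout,
    )
    passed = proc.returncode == 0  # non-zero return code means a test failed or the script crashed
    coverage = None
    if passed:
        subprocess.run(["coverage", "json", "-o", "cov.json"], check=True)
        with open("cov.json") as f:
            coverage = json.load(f)["totals"]["percent_covered"] / 100.0
    return passed, coverage, proc.stderr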
Debug with an LLM:
cd RepoST
python debug.py --input_file cleaned_python_test_tests_exec_coverage.json --model_name "claude-3-5-sonnet-20240620"
If you use our default setup files, the results will be saved to data/ExecEval/cleaned_python_test_debug_round0.json.
You can iteratively run execution and debugging to improve the success rate of the RepoST method.
We also provide the code to improve the test coverage if the generated tests are correct but miss some important branches:
cd RepoST
python improve_coverage.py --input_file cleaned_python_test_tests.json --model_name "claude-3-5-sonnet-20240620"
Assuming the result file you finally obtain (with execution results) is called data/ExecEval/cleaned_python_test_debug_round0_exec_coverage.json, you can run post-processing with:
cd RepoST
python post_processing.py --input_file cleaned_python_test_debug_round0_exec_coverage.json --script_key full_script_debug_round0 --output_file test_set_final.json
Then you can call LLM-checkers to check (1) the functional equivalence between the sandboxed target function and the original function, and (2) the correctness of the test cases:
cd RepoST
python llm_check.py --input_file test_set_final.json --model_name gpt-4o --mode sandbox
python llm_check.py --input_file test_set_final.json --model_name gpt-4o --mode test
python llm_check.py --input_file test_set_final.json --model_name gpt-4o --mode combine
If you use our default setup files, the file with examples that pass all tests will be saved as data/ExecEval/checked_test_set_final.json.
Please refer to the data format section above to see the format of the resulting data file.