Code for "RepoST: Scalable Repository-Level Coding Environment Construction with Sandbox Testing" (arXiv, Website)
If you find our paper or code useful, please cite the paper:
@misc{xie2025repost,
title={RepoST: Scalable Repository-Level Coding Environment Construction with Sandbox Testing},
author={Yiqing Xie and Alex Xie and Divyanshu Sheth and Pengfei Liu and Daniel Fried and Carolyn Rose},
year={2025},
eprint={2503.07358},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2503.07358},
}
- Format of the RepoST-Train and RepoST-Eval data: The structure of our datasets.
- Evaluation on RepoST-Eval: how to run evaluation on RepoST-Eval. The code includes:
- (1) loading the RepoST-Eval docker image for execution,
- (2) running code generation for RepoST-Eval examples, and
- (3) executing our evaluation scripts in the docker to compute Pass@k scores.
We follow R2E and provide the "Ground-Truth" code context, which is constructed by extracting the modules directly or indirectly called by the target function. In principle, you can also clone the repos and retrieve context or apply agent frameworks.
- Training with RepoST-Train (Vanilla SFT): the process of setting up our training environment, RepoST-Train, and using it to construct training data. We provide the code to run vanilla SFT (with ground truth functions as targets) using the LlamaFactory framework, where you do not need to set up dockers.
- Training with RepoST-Train (Rejection Sampling): We also provide the code to run rejection sampling, which includes
- (1) loading the RepoST-Train docker image,
- (2) generating candidate solutions on training examples,
- (3) executing the candidate solutions,
- (4, optional) debugging failed candidate solutions, and
- (5) running model training using LlamaFactory.
- Evaluation on Public Datasets: After training the model, you can use our code to run evaluation on HumanEval and RepoEval-func.
- Build Your Own Executable Code Generation Environments: You can also build your own executable environments using RepoST. The steps include:
- (1) Curating GitHub Repos, Functions, and Context
- (2) Sandboxing
- (3) Test Generation
- (4) Iterative Execution & Debugging
- (5) Post-Checking
The RepoST-Train and RepoST-Eval datasets have the same structure. They both consist of (1) a set of GitHub repos (each pinned to a specific commit id), (2) a set of target functions inside each repo, and (3) an evaluation script for each target function that can be executed in our docker environment.
We provide the data files in data/ExecTrain/checked_test_set_final.json (and data/ExecEval/checked_test_set_final.json). Each file is a list of dicts. Each dict represents one example and is in the following format:
{
"func_name": "VNetwork.forward", # the (class name and) function name of the target function
"idx": "708", # the idx of the example
"repo_name": "seakers___sacTransformerEOS", # the repo name
"commit_id": "669fb18ee47ae5947b578e392be25c1517f6d73b", # the commit id
"func_path": "scripts/sac.py", # the local file path that contains the target function
"orig_context": "```python\n## scripts/sac.py\nimport torch ...", # the "ground truth" code context extracted from the repo
"eval_script": "## scripts/sac.py\nimport torch\n\nimport torch.nn as nn ...", # the evaluation script generated in the RepoST framework, used for execution-based evaluation
"coverage_rate": 1.0, # coverage rate
"sandbox_ast_check": true, # the following three are post-check results
"sandbox_functionality_check": ...,
"test_correctness_check": ...,
}
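As a quick sanity check, the snippet below (a minimal sketch assuming the default data paths above) loads RepoST-Eval and inspects one example:
import json

# Load RepoST-Eval; use data/ExecTrain/checked_test_set_final.json for RepoST-Train.
with open("data/ExecEval/checked_test_set_final.json") as f:
    examples = json.load(f)  # a list of dicts, one per example

print(len(examples), "examples")
ex = examples[0]
print(ex["func_name"], ex["repo_name"], ex["commit_id"])
print(ex["eval_script"][:300])  # the generated evaluation script (truncated)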
Create a new environment:
conda create -n repost python=3.11 -y
conda activate repost
Set up the environment for code generation and RepoST environment construction (Note that you probably need a separate environment for model training):
pip install -r requirements.txt
And specify code and data paths:
source setup_eval.sh
We provide the RepoST-Eval docker image. Load the image with:
docker load -i repost_eval_image.tar
and create a docker container that mounts the data folders (specified in setup_eval.sh):
docker run --name=execeval -v ${final_dataset_DIR}:${docker_final_dataset_DIR} -it repost_eval:v0
We provide the code for running inference under generation/.
We use vllm to run code generation for open-source models and use litellm for API-based models.
Here are the example scripts, where we evaluate Pass@1 for GPT-4o and QwenCoder-7B.
cd generation
bash scripts/ExecEval_gpt4o_generation.sh
bash scripts/ExecEval_qwen_generation.sh
The results will be saved under data/ExecEval/results/, for example, data/ExecEval/results/ExecEval_gpt4o_context_top1.json.
Our code will automatically copy the generated function into the evaluation script (with test cases) to check the correctness of the generation results; the resulting script is stored under the generation_script key.
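For illustration, this splicing step is conceptually similar to the sketch below, which replaces the target function in the evaluation script with the model-generated one. splice_function is a hypothetical helper written for this README (not the repo's actual implementation) and ignores corner cases such as decorators or pre-indented generations:
import ast

def splice_function(eval_script: str, generated_func: str, func_name: str) -> str:
    """Replace the definition of `func_name` in `eval_script` with `generated_func`."""
    simple_name = func_name.split(".")[-1]  # e.g. "VNetwork.forward" -> "forward"
    lines = eval_script.splitlines()
    for node in ast.walk(ast.parse(eval_script)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)) and node.name == simple_name:
            indent = " " * node.col_offset  # re-indent an unindented generation (e.g. for class methods)
            new_lines = [indent + l if l.strip() else l for l in generated_func.splitlines()]
            return "\n".join(lines[:node.lineno - 1] + new_lines + lines[node.end_lineno:])
    raise ValueError(f"{simple_name} not found in the evaluation script")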
After the previous step, you should see the generation file under ${docker_final_dataset_DIR} (specified in setup_eval.sh) inside the docker container.
Assuming your generation file is data/ExecEval/results/ExecEval_gpt4o_context_top1.json and the dict key that stores the scripts is generation_script, you can run execution with:
# in the docker container
cd RepoST
source setup_eval.sh
python RepoST/execution.py --input_file ../ExecEval/results/ExecEval_gpt4o_context_top1.json --script_key generation_script
The execution results for each evaluation example will be stored in data/ExecEval/results/ExecEval_gpt4o_context_top1_exec.json.
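For reference, Pass@k is usually reported with the standard unbiased estimator below (a generic reference implementation, not necessarily the exact code inside our evaluation scripts), where n candidates are generated per example and c of them pass:
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimator: n generated samples, c passing samples, k <= n."""
    if n - c < k:
        return 1.0
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example: 5 candidates, 2 of them pass -> Pass@1 = 0.4
print(pass_at_k(n=5, c=2, k=1))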
We provide the vanilla SFT data we used in our paper in LLaMA-Factory/data/ExecTrain_exec_sft_data.json and provide the code for model training under LLaMA-Factory.
If you want to process the data on your own, we provide the code for processing training data under get_training_data, which processes data/ExecTrain/checked_test_set_final.json into the training data format.
Create a new environment:
conda create -n llama python=3.11 -y
conda activate llama
Set up the environment for model training (Note that you probably need a separate environment for code generation and RepoST environment construction):
cd LLaMA-Factory
pip install -e .
And specify code and data paths:
source setup_train.sh
We provide the vanilla SFT data in LLaMA-Factory/data/ExecTrain_exec_sft_data.json.
You can also process the data again by running:
cd get_training_data
bash scripts/get_full_sft_data.sh
The data file will be generated under LLaMA-Factory/data.
Before training, you first need to add the data information to LLaMA-Factory/data/dataset_info.json. For example, add an entry:
"ExecTrain_exec_sft": {
"file_name": "ExecTrain_exec_sft_data.json"
},
After that, write a config file. We provide an example in LLaMA-Factory/examples/train_full/qwencoder_ExecTrain_exec_sft.yaml.
Finally, run training by calling llamafactory-cli train. We provide an example here:
cd LLaMA-Factory
bash scripts/sft_ExecTrain_exec_qwencoder.sh
Similar to vanilla SFT, we provide the rejection sampling (distill) data we used in our paper in LLaMA-Factory/data/ExecTrain_rejection_sampling_claudegpt_debug_replay_data.json and provide the code for model training under LLaMA-Factory. Please refer directly to the Model Training Instructions if you want to train with the provided data.
If you want to run rejection sampling on your own, we provide the code for generating candidate solutions under generation/, the code for execution in RepoST/execution.py, and the code to process training data under get_training_data/.
We recommend using different environments for code generation and for model training. Specifically:
For code generation:
conda create -n repost python=3.11 -y
conda activate repost
pip install -r requirements.txt
For model training:
conda create -n llama python=3.11 -y
conda activate llama
cd LLaMA-Factory
pip install -e .
Finally, specify code and data paths:
source setup_train.sh
We provide the RepoST-Train docker image. Load the image with:
docker load -i repost_train_image.tar
and create a docker container that mounts the data folders (specified in setup_train.sh):
docker run --name=exectrain -v ${final_dataset_DIR}:${docker_final_dataset_DIR} -it repost_train:v0
We provide the code for generating candidate solutions under generation/.
We use vllm to run code generation for open-source models and use litellm for API-based models.
Here is an example script, where we generate 5 candidate solutions for each training example using GPT-4o.
cd generation
bash scripts/ExecTrain_gpt4o_generation.sh
The results will be saved under data/ExecTrain/results/, for example, data/ExecTrain/rejection_sampling/ExecTrain_gpt4o_top5.json.
Our code will automatically copy the generated function into the evaluation script (with test cases) to check the correctness of the generation results; the resulting script is stored under the generation_script key.
After the previous step, you should see the generation file under ${docker_final_dataset_DIR} (specified in setup_train.sh) inside the docker container.
Assuming your generation file is data/ExecTrain/rejection_sampling/ExecTrain_gpt4o_top5.json and the dict key that stores the scripts is generation_script, you can run execution with:
# in the docker container
cd RepoST
source setup_train.sh
python RepoST/execution.py --input_file ../ExecTrain/rejection_sampling/ExecTrain_gpt4o_top5.json --script_key generation_script
The execution results for each training example will be stored in data/ExecTrain/rejection_sampling/ExecTrain_gpt4o_top5_exec.json.
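Conceptually, rejection sampling then keeps only the candidates that execute successfully and turns them into SFT pairs. The sketch below only illustrates the idea; the field names "generations" and "exec_results" are placeholders (check the keys actually written by RepoST/execution.py), and the real logic lives in the get_training_data/ scripts:
import json

# Illustrative only: field names below are placeholders, not the actual schema.
with open("data/ExecTrain/rejection_sampling/ExecTrain_gpt4o_top5_exec.json") as f:
    examples = json.load(f)

sft_pairs = []
for ex in examples:
    for candidate, result in zip(ex.get("generations", []), ex.get("exec_results", [])):
        if result == "success":  # keep only candidates that pass all tests
            sft_pairs.append({"prompt": ex["orig_context"], "response": candidate})

print(f"kept {len(sft_pairs)} execution-verified training pairs")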
To obtain more training instances, you can also call the model to debug failed solutions.
Here is an example script, where we prompt GPT-4o to generate one debugged solution for each failed candidate solution.
cd generation
bash scripts/ExecTrain_gpt4o_debug.sh
Finally, process the training data under get_training_data.
We provide the example scripts here:
cd get_training_data
bash scripts/get_rej_sampling_data.sh
and
cd get_training_data
bash scripts/get_rej_sampling_debug_data.sh
The processed training data files will be saved to LLaMA-Factory/data/.
Before training, you first need to add the data information to LLaMA-Factory/data/dataset_info.json. For example, add an entry:
"ExecTrain_gpt_debug_replay": {
"file_name": "ExecTrain_rejection_sampling_gpt_debug_replay_data.json"
},
After that, write a config file with the data information. For example, create LLaMA-Factory/examples/train_full/qwencoder_ExecTrain_exec_rej_sampling_gpt_debug_replay.yaml.
Finally, run training by calling llamafactory-cli train with your config. For example:
cd LLaMA-Factory
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 llamafactory-cli train examples/train_full/qwencoder_ExecTrain_exec_rej_sampling_gpt_debug_replay.yaml
We provide the code for evaluating on public benchmarks under CodeRAGBench_evaluation, where we borrow and modify the code of CodeRAGBench. Please refer to the instructions in CodeRAGBench_evaluation/README.md.
We provide the code for scraping GitHub repos, functions, and context under get_github_input and the code for constructing the evaluation scripts under RepoST.
You will need to build a docker to execute the evaluation scripts.
Create a new environment:
conda create -n repost python=3.11 -y
conda activate repost
Set up the environment for code generation and RepoST environment construction (Note that you probably need a separate environment for model training):
pip install -r requirements.txt
Finally, specify the code and data paths in your own setup file. We provide get_github_input/r2e/ExecEval_config.yml as an example.
Load the setup file with:
source setup_eval.sh
Start by creating an empty docker image based on the Dockerfile:
docker build --tag exec_eval .
And run a docker container:
docker run --name=exec_eval_test -v ${dataset_generation_DIR}:${docker_dataset_generation_DIR} -it exec_eval
Inside the docker container, you may want to create a restricted user that can only access the data folder and the tmp folder:
adduser --disabled-password --gecos "" restricted_user
chown restricted_user:restricted_user /home/user/ExecEval
chown restricted_user:restricted_user /home/user/tmp
You first need to write a config file under get_github_input/, where we provide two examples: get_github_input/r2e/ExecTrain_config.yml and get_github_input/r2e/ExecEval_config.yml.
After that, specify the path to the config file in get_github_input/setup.sh.
To scrape GitHub repos, you can follow the example script used for creating RepoST-Eval: get_github_input/scripts/ExecEval_scrape_repos.sh.
In this script, we scrape 1,000 repos created between 2024-09-01 and 2024-12-31, and randomly sample 300. We clone the selected repos and remove the ones with storage sizes larger than 10M.
cd get_github_input
bash scripts/ExecEval_scrape_repos.sh
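The scraping step is conceptually a GitHub search with a creation-date filter followed by random sampling, as in the simplified sketch below (an illustration only; the actual logic, including cloning and the on-disk size filter, is in get_github_input/):
import random
import requests

# Simplified illustration: search Python repos created in the date window,
# then randomly sample 300 of them (the search API returns at most 1,000 hits).
headers = {"Accept": "application/vnd.github+json"}  # add an Authorization token for higher rate limits
repos = []
for page in range(1, 11):
    resp = requests.get(
        "https://api.github.com/search/repositories",
        params={"q": "language:python created:2024-09-01..2024-12-31",
                "per_page": 100, "page": page},
        headers=headers, timeout=30,
    )
    resp.raise_for_status()
    repos.extend(resp.json()["items"])

sampled = random.sample(repos, k=min(300, len(repos)))
print(len(sampled), "repos sampled; next, clone them and drop oversized ones")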
To sample functions and extract ground truth context via the call graph, you can follow the example script used for creating RepoST-Eval: get_github_input/scripts/ExecEval_obtain_functions.sh.
In this script, we extract all the functions with docstrings, sample at most 15 standalone functions and 15 class methods for each repo, and extract context with a maximum block size of 8,000 tokens.
cd get_github_input
bash scripts/ExecEval_obtain_functions.sh
After running this script, the file with the initial repos, functions, and context will be saved under data/, for example, data/ExecEval/cleaned_python_test.json.
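The function-sampling step follows the usual pattern of walking each file's AST and keeping functions that carry docstrings, roughly as in the simplified sketch below (the actual code under get_github_input/ additionally builds the call graph and enforces the per-repo and context-size limits):
import ast

def functions_with_docstrings(path: str):
    """Return (name, source) pairs for documented functions/methods in one Python file."""
    source = open(path, encoding="utf-8").read()
    results = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)) and ast.get_docstring(node):
            results.append((node.name, ast.get_source_segment(source, node)))
    return results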
The second step is to call an LLM to aggregate the target function and its context (both in-file and cross-file) into a separate script.
cd RepoST
python sandboxing.py --input_file cleaned_python_test.json --model_name "claude-3-5-sonnet-20240620"
If you use our default setup files, the results will be saved to data/ExecEval/cleaned_python_test_sandboxed.json.
The third step is to call an LLM to create tests for the target function in the sandboxed script.
cd RepoST
python test_generation.py --input_file cleaned_python_test_sandboxed.json --model_name "claude-3-5-sonnet-20240620"
If you use our default setup files, the results will be saved to data/ExecEval/cleaned_python_test_tests.json.
In principle, the ground truth target function should pass all the generated tests in the sandboxed script with no errors. We hence execute the sandboxed script and use an LLM to debug the script (if needed):
Run execution inside the docker:
# inside the docker
python RepoST/RepoST/execution.py --input_file /home/user/ExecEval/cleaned_python_test_tests.json --script_key full_script --test_gt
python RepoST/RepoST/execution.py --input_file /home/user/ExecEval/cleaned_python_test_tests_exec.json --script_key full_script --test_gt --check_coverage_mode
Alternatively, you can run execution under the restricted_user:
# inside the docker
su -c "python RepoST/RepoST/execution.py --input_file /home/user/ExecEval/cleaned_python_test_tests.json --script_key full_script --test_gt" restricted_user
su -c "python RepoST/RepoST/execution.py --input_file /home/user/ExecEval/cleaned_python_test_tests_exec.json --script_key full_script --test_gt --check_coverage_mode" restricted_user
If you use our default setup files, the execution results will be saved to data/ExecEval/cleaned_python_test_tests_exec_coverage.json.
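Under the hood, executing a sandboxed script boils down to running it (with its tests) in a subprocess and checking the exit code and coverage, along the lines of the sketch below (a simplified stand-in for RepoST/execution.py; timeout handling, sandboxing, and the output format differ in the real script):
import json
import subprocess

def run_script(script_path: str, timeout: int = 120):
    """Run one sandboxed evaluation script; report pass/fail and branch coverage."""
    proc = subprocess.run(
        ["coverage", "run", "--branch", script_path],
        capture_output=True, text=True, timeout=timeout,
    )
    passed = proc.returncode == 0  # non-zero return code means a test failed or the script crashed
    coverage = None
    if passed:
        subprocess.run(["coverage", "json", "-o", "cov.json"], check=True)
        with open("cov.json") as f:
            coverage = json.load(f)["totals"]["percent_covered"] / 100.0
    return passed, coverage, proc.stderr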
Debug with an LLM:
cd RepoST
python debug.py --input_file cleaned_python_test_tests_exec_coverage.json --model_name "claude-3-5-sonnet-20240620"
If you use our default setup files, the results will be saved to data/ExecEval/cleaned_python_test_debug_round0.json.
You can iteratively run execution and debugging to improve the success rate of the RepoST method.
We also provide the code to improve the test coverage if the generated tests are correct but miss some important branches:
cd RepoST
python improve_coverage.py --input_file cleaned_python_test_tests.json --model_name "claude-3-5-sonnet-20240620"
Assuming the result file you finally obtain (with execution results) is called data/ExecEval/cleaned_python_test_debug_round0_exec_coverage.json, you can run post-processing with:
cd RepoST
python post_processing.py --input_file cleaned_python_test_debug_round0_exec_coverage.json --script_key full_script_debug_round0 --output_file test_set_final.json
Then you can call LLM-checkers to check (1) the functional equivalence between the sandboxed target function and the original function, and (2) the correctness of the test cases:
cd RepoST
python llm_check.py --input_file test_set_final.json --model_name gpt-4o --mode sandbox
python llm_check.py --input_file test_set_final.json --model_name gpt-4o --mode test
python llm_check.py --input_file test_set_final.json --model_name gpt-4o --mode combine
If you use our default setup files, the file with examples that pass all tests will be saved as data/ExecEval/checked_test_set_final.json.
Please refer to the data format section above to see the format of the resulting data file.