DevBench is structured around five key tasks, each focusing on a critical phase of the software development lifecycle: software design, environment setup, implementation, acceptance testing, and unit testing. Our modular design allows each task to be tested and evaluated independently.
Task Specification (Software Design): The model is provided with the Product Requirements Document (PRD) and asked to create:
- Unified Modeling Language (UML) class and sequence diagrams in Mermaid, and
- Architectural designs using hierarchical file-tree structures.
The class diagram is a fundamental modeling artifact, illustrating classes, their attributes, and their interrelations. Sequence diagrams detail the processes and objects involved and the order of messages exchanged to realize the software's functionality. The architectural design lays out a structured hierarchy of source code, build scripts, and auxiliary files needed for implementation.
Evaluation: LLM-as-a-judge. The evaluation is anchored by two principal metrics, general principles and faithfulness (a judging sketch follows the list below):
- General principles include cohesion and decoupling, and practicability.
- Faithfulness gauges the extent to which models adhere to specified instructions.
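For concreteness, here is a minimal sketch of how such a judging query might be assembled for the software design task. The call_llm callable, the 1-to-5 scale, and the JSON reply format are hypothetical stand-ins for whatever judge model and rubric the evaluation harness actually uses.

```python
import json

def build_judge_prompt(prd: str, generated_design: str) -> str:
    """Assemble a judging prompt covering the two metrics above."""
    return (
        "You are reviewing a software design produced from the PRD below.\n"
        f"--- PRD ---\n{prd}\n"
        "--- Generated design (UML diagrams + architecture) ---\n"
        f"{generated_design}\n"
        "Score the design from 1 to 5 on each of:\n"
        "1. General principles (cohesion and decoupling, practicability)\n"
        "2. Faithfulness to the PRD and the given instructions\n"
        'Reply with JSON: {"general_principles": <int>, "faithfulness": <int>}'
    )

def judge_design(prd: str, generated_design: str, call_llm) -> dict:
    """call_llm is a hypothetical callable mapping a prompt string to the
    judge model's text response."""
    return json.loads(call_llm(build_judge_prompt(prd, generated_design)))
```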
Task Specification (Environment Setup): Models are provided with the PRD, UML diagrams, and architecture design to generate a dependency file required to initialize the development environment (Conda for Python, Gradle for Java, and npm for JavaScript). Note that our evaluation framework does not contain an Environment Setup task for C/C++, due to the absence of a universally acknowledged, user-friendly dependency management system for these languages.
Evaluation: The dependency file for each programming language is executed within a predetermined base environment defined in a Dockerfile, followed by execution of the repository's example usage code. The principal metric for this task is the success rate of the executed example code.
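A minimal sketch of this check, assuming a Python repo and taking the install command as a parameter (e.g. a conda or pip invocation over the generated dependency file); the function, parameter names, and example paths are illustrative, not the benchmark's actual harness:

```python
import subprocess

def example_success_rate(install_cmd: list[str], example_scripts: list[str]) -> float:
    """Install the generated dependency file inside the base environment,
    then run each usage example and return the fraction that exit cleanly."""
    if subprocess.run(install_cmd).returncode != 0:
        return 0.0  # the environment could not be set up at all
    if not example_scripts:
        return 0.0
    passed = sum(
        subprocess.run(["python", script]).returncode == 0
        for script in example_scripts
    )
    return passed / len(example_scripts)

# Example (hypothetical paths):
# example_success_rate(["conda", "env", "update", "-f", "environment.yml"],
#                      ["examples/demo.py"])
```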
Task Specification (Implementation): Models are provided with the PRD, UML diagrams, and architecture design, and are instructed to develop code for each source file specified in the architecture design. The generation process follows a partial order derived from a predefined directed acyclic graph (DAG) over the code files, ensuring structured and logical code development.
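The partial order can be recovered with a standard topological sort over the code_file_DAG mapping shown later in repo_config.json. Below is a minimal sketch using Python's graphlib, assuming each file maps to the files it depends on, so that dependencies are generated before the files that use them:

```python
from graphlib import TopologicalSorter

# DAG in the repo_config.json format: each file maps to the files it
# depends on, which must be generated first.
code_file_dag = {
    "main.py": ["a.py", "b.py", "c.py", "d.py"],
    "a.py": ["a1.py", "a2.py"],
}

# static_order() yields predecessors before their dependents, giving one
# valid generation order for the model to follow.
generation_order = list(TopologicalSorter(code_file_dag).static_order())
print(generation_order)
# e.g. ['a1.py', 'a2.py', 'b.py', 'c.py', 'd.py', 'a.py', 'main.py']
```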
Evaluation: An automated testing framework, tailored to the programming language in use, integrates PyTest for Python, GTest for C++, JUnit for Java, and Jest for JavaScript. The reference acceptance and unit tests are executed within a predefined reference environment, and the evaluation metric is the pass rate of these tests.
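For the Python repos, the pass rate can be read from the pytest-json-report output produced by the batch test commands in repo_config.json (shown later in this document); a minimal sketch, assuming the report file name used in that config:

```python
import json

def pass_rate(report_path: str = "unit_test_report.json") -> float:
    """Compute the pass rate from a pytest-json-report summary, as written
    by the --json-report flags in unit_test_script / acceptance_test_script."""
    with open(report_path) as f:
        summary = json.load(f)["summary"]
    total = summary.get("total", 0)
    return summary.get("passed", 0) / total if total else 0.0
```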
Task Specification (Acceptance Testing): Models are provided with the PRD, UML diagrams, and architecture design, with the option to include the implementation source code, and are asked to generate acceptance test code. For libraries, acceptance tests are implemented as code that invokes the library's API and asserts on its responses.
Evaluation: The generated acceptance tests are run against the benchmark implementation code, and the pass rate of the tests is reported.
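As an illustration of the library-style acceptance tests described above, here is a minimal pytest sketch; the mylib module, the word_count function, and the expected values are hypothetical placeholders, not part of any benchmark repo.

```python
import pytest
from mylib import word_count  # hypothetical library under test

def test_word_count_on_simple_sentence():
    # Invoke the public API end to end and assert on the response.
    assert word_count("the quick brown fox") == 4

def test_word_count_rejects_non_string_input():
    with pytest.raises(TypeError):
        word_count(42)
```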
Task Specification (Unit Testing): The PRD, UML diagrams, and architecture design are provided to the models to generate unit tests.
Evaluation: The LLM-generated unit tests are run against the reference source code using the testing framework described above. In addition to the pass rate, statement coverage is reported, providing a quantitative measure of test comprehensiveness.
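For Python repos, statement coverage can be taken from the coverage.py JSON report written by the --cov-report=json option in the test commands later in this document; a minimal sketch, assuming that output file name:

```python
import json

def statement_coverage(cov_path: str = "unit_test_cov.json") -> float:
    """Return overall statement coverage (percent) from a coverage.py
    JSON report, as produced by --cov-report=json in unit_test_script."""
    with open(cov_path) as f:
        totals = json.load(f)["totals"]
    return totals["percent_covered"]
```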
DevBench comprises a curated collection of repos organized by programming language.
Repo | Language | Domain | #code files | #code lines | #code tokens | #acceptance tests | #unit tests | Unit test coverage (%) |
---|---|---|---|---|---|---|---|---|
TextCNN | Python | DL, NLP | 5 | 403 | 1566 | 1 | 10 | 99 |
ArXiv digest | Python | SE, API | 1 | 198 | 901 | 4 | 38 | 94 |
chakin | Python | NLP | 1 | 162 | 225 | 1 | 1 | 86 |
readtime | Python | ALGO | 3 | 284 | 920 | 4 | 8 | 95 |
hone | Python | SE | 4 | 274 | 844 | 5 | 7 | 90 |
Stocktrends | Python | ALGO | 1 | 384 | 1350 | 2 | 7 | 85 |
GeoText | Python | NLP | 2 | 470 | 1701 | 5 | 4 | 98 |
lice | Python | SE | 2 | 376 | 1329 | 6 | 25 | 88 |
PSO | Python | ALGO | 2 | 168 | 578 | 1 | 5 | 93 |
hybrid images | Python | ALGO, CV | 1 | 144 | 746 | 1 | 19 | 90 |
Actor Relationship Game | Java | ALGO, API | 8 | 493 | 1453 | 4 | 16 | 64.32 |
Leftist Trees and Fibonacci Heaps Comparison | Java | ALGO | 3 | 632 | 2009 | 2 | 2 | 45.32 |
Redis | Java | SE, DB | 9 | 779 | 2546 | 1 | 17 | 78.6 |
idcenter | Java | SE | 4 | 333 | 1140 | 3 | 4 | 54.2 |
image similarity | Java | SE, CV | 3 | 382 | 1397 | 2 | 2 | 71.34 |
xlsx2csv | C/C++ | SE | 10 | 476 | 1440 | 5 | 8 | 95.17 |
people management | C/C++ | SE, DB | 6 | 540 | 2043 | 7 | 9 | 95.14 |
Area Calculation | C/C++ | SE | 7 | 162 | 307 | 3 | 3 | 90.48 |
Graph BFS DFS | C/C++ | ALGO | 5 | 667 | 2828 | 5 | 22 | / |
Logistic Management System | C/C++ | SE | 7 | 630 | 2007 | 7 | 17 | 99.11 |
listen-now-frontend | JS | Web | 6 | 232 | 492 | 1 | 0 | / |
register | JS | Web | 6 | 223 | 741 | 3 | 0 | / |
Repos in benchmark_data are organized by programming language. The typical repo structure includes:
repo
├── acceptance_tests            # Acceptance test scripts
├── docs
│   ├── PRD.md                  # Product requirement document
│   ├── UML_class.md            # UML class diagrams
│   ├── UML_sequence.md         # UML sequence diagrams
│   ├── architecture_design.md  # Architecture design of the repo
│   └── requirements.txt        # Dependencies needed for the repo
├── examples                    # Usage examples
├── unit_tests                  # Unit test scripts
├── src                         # Source code
├── repo_config.json            # Repo configuration file
└── setup_shell_script.sh       # Script to set up the runtime environment
The repo_config.json file is crucial for defining the scope and requirements of each task, ensuring a seamless evaluation process. Below is an example configuration snippet (a short usage sketch follows it):
{
# I: SoftwareDesign
# II: EnvironmentSetup
# III: Implementation
# IV: AcceptanceTesting
# V: UnitTesting
# docs
"PRD": "{path_to_prd}", # I, II, III, IV, V
"UML_class": "{path_to_UML_class}", # II, III, IV, V
"UML_sequence": "{path_to_UML_sequence}", #II, III, IV, V
"dependencies": "{path_to_dependices_file}", #II, III, IV, V
"architecture_design": "{path_to_architecture_design_file}", #II, III, IV, V
"language": "{language}", #II, III, IV, V
# path to usage examples
"usage_examples": "path_to_usage_examples", # file or dir; II
# test_path
"unit_tests": "{unit_tests_path}", # III: evaluation; V: only dir is needed to host the generated test code
"acceptance_tests": "{acceptance_tests_path}", # III: evaluation; VI: only dir is needed to host the generated test code
# setup files
"required_files": [
"requirements.txt",
"a.cpp",
"a.h"
], # III
# setup shell script
"setup_shell_script": "path_to_script", # III
# DAG for code files so that LLMs can generate code in order
"code_file_DAG": {
"main.py": ["a.py", "b.py", "c.py", "d.py"],
"a.py": ["a1.py", "a2.py"]
}, # III, IV, V
# unit test linking and scripts for debugging
"unit_tests_linking": {
"unit_tests/test_1.py": ["src/module1.py", "src/module2.py"],
# unit test: corresponding code files
...
}, # V
# test in batch
"unit_test_script": "pytest --cov=. --cov-report=json:unit_test_cov.json --json-report --json-report-file=unit_test_report.json unit_tests", # III, IV
"acceptance_test_script": "pytest --cov=. --cov-report=json:acceptance_test_cov.json --json-report --json-report-file=acceptance_test_report.json acceptance_tests", # III, V
# prompts for test generation
"fine_unit_test_prompt": {
"path/to/unittest1": "prompt",
"path/to/unittest2": "prompt",
...
}, # V
"fine_acceptance_test_prompt": {
"path/to/acceptancetest1": "prompt",
"path/to/acceptancetest2": "prompt",
...
} # IV
}
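As a usage sketch, the snippet below loads a repo_config.json and runs the batch test commands it defines; it assumes the real configuration files are plain JSON without the explanatory comments shown above, and the helper name is illustrative.

```python
import json
import subprocess

def run_batch_tests(config_path: str = "repo_config.json") -> None:
    """Run the batch test commands defined in repo_config.json; the JSON
    reports they write (e.g. unit_test_report.json, unit_test_cov.json)
    can then be parsed for pass rate and coverage as sketched earlier."""
    with open(config_path) as f:
        config = json.load(f)
    for key in ("unit_test_script", "acceptance_test_script"):
        command = config.get(key)
        if command:
            # The scripts are stored as full shell command lines.
            subprocess.run(command, shell=True, check=False)
```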