EPAM AI/RUN™ Engineering Benchmark 🚀

Evaluation of AI-powered solutions in enterprise software development.


🎯 What We Solve

In the rapidly evolving landscape of LLMs and AI-powered developer tools, we identified a critical need for a comprehensive benchmark tailored to enterprise software development. Our AI/RUN™ Engineering Benchmark addresses this gap by providing:

  • Holistic evaluation and comparison of AI Assistants and LLMs
  • Focus on real-world enterprise development scenarios
  • Comprehensive coverage of software engineering tasks

Before starting work on our benchmark, our team explored the ones already available on the market. We realized that they are mostly designed around tasks that differ significantly from the everyday challenges faced by developers in enterprise settings. See here for more details on why the available Code Assistants benchmarks would not produce realistic scores.

🏆 Leaderboards

Our team runs the EPAM AI/RUN™ Engineering Benchmark on the most popular Code Assistants and LLMs, and the results are published on the corresponding leaderboards, which we update regularly.

View the current rankings and results of the LLMs benchmark on our LLMs Leaderboard and the Code Assistants results on the Code Assistants Leaderboard.

🔬 Our Approach

We've developed a multi-faceted approach to assess the capabilities and performance of AI Assistants and LLMs:

Code Assistants Benchmark 💻

  • Evaluates day-to-day coding tasks
  • Covers 11 key categories of software development activities:
    • Non-Proprietary Solution/Component Generation
    • Tests Creation
    • Algorithm Development
    • Data Generation
    • Code Explanation
    • Code Bug Fixing
    • Code Refactoring
    • Solution Documentation
    • Code Optimizations
    • DevOps
    • Solution Migration
  • Utilizes both code completion and chat-based scenarios

For more detailed information about our Code Assistants benchmark methodology, please refer to the Code Assistants page.

LLMs Benchmark 🧠

  • Assesses core LLM capabilities in software engineering tasks
  • Covers 4 main categories: Code Generation, Code Transformation, Documentation Generation, and Large Context Instructions Following
  • Applies a sophisticated scoring system that balances four metrics: Accuracy, Completeness, Generation Rate, and Number of Attempts (an illustrative scoring sketch follows this list)
  • Uses LLMs for automated results evaluation based on prepared acceptance criteria
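
As a purely illustrative example, the sketch below shows one way such metrics could be combined into a single per-task score. The weights, scales, and handling of Generation Rate and Number of Attempts are assumptions made for illustration, not the benchmark's actual scoring formula.

```python
# Hypothetical scoring sketch; the benchmark's real formula and weights are not published here.
from dataclasses import dataclass

@dataclass
class TaskResult:
    accuracy: float      # 0..1, correctness against the acceptance criteria
    completeness: float  # 0..1, how much of the requested work was delivered
    generated: bool      # whether a usable answer was produced at all (feeds Generation Rate)
    attempts: int        # number of attempts needed to obtain the answer

def task_score(r: TaskResult, max_attempts: int = 3) -> float:
    """Combine the metrics into a 0..1 score using illustrative weights."""
    if not r.generated:          # failed generations score zero
        return 0.0
    # Fewer attempts earn a higher bonus: 1 attempt -> 1.0, max_attempts -> 1/max_attempts.
    attempt_bonus = (max_attempts - min(r.attempts, max_attempts) + 1) / max_attempts
    return round(0.5 * r.accuracy + 0.3 * r.completeness + 0.2 * attempt_bonus, 3)

# Example: a mostly correct, slightly incomplete answer produced on the second attempt.
print(task_score(TaskResult(accuracy=0.9, completeness=0.8, generated=True, attempts=2)))  # 0.823
```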

For more detailed information about our benchmark methodology, please refer to the LLMs page.

🛠 Technologies

  • VS Code: Primary IDE for AI Assistants testing
  • Python/LangChain: Script automation for executing LLMs Benchmark tests and processing results
  • LLM-based evaluation: Results are graded automatically by a Large Language Model; grading can be done with any model that exposes log probabilities, and GPT-4o was used to evaluate all results for the current benchmark version (a minimal grading sketch follows this list)
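
To show what log-probability grading can look like, here is a minimal, hypothetical sketch using the OpenAI Python SDK. The prompt, acceptance-criterion format, and score definition are assumptions, not the benchmark's actual evaluation code: a grader model is asked a yes/no question and the probability mass it places on "yes" is read from the returned log probabilities.

```python
# Hypothetical log-probability grading sketch (not the benchmark's real evaluator).
import math
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def grade(criterion: str, answer: str, model: str = "gpt-4o") -> float:
    """Return a score in [0, 1]: the probability mass the grader puts on 'yes'."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system",
             "content": "You are a strict grader. Reply with a single word: yes or no."},
            {"role": "user",
             "content": f"Acceptance criterion:\n{criterion}\n\n"
                        f"Candidate answer:\n{answer}\n\n"
                        "Does the answer satisfy the criterion?"},
        ],
        max_tokens=1,
        logprobs=True,    # request per-token log probabilities
        top_logprobs=5,   # include alternatives so the 'yes' vs 'no' mass is visible
        temperature=0,
    )
    top = response.choices[0].logprobs.content[0].top_logprobs
    return sum(math.exp(t.logprob) for t in top if t.token.strip().lower() == "yes")
```

Reading the probability rather than only the text answer yields a graded score instead of a binary pass/fail, which is why a model with log-probability support is needed for this style of grading.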

🗂 Repositories Structure

LLM Benchmarking

  • AIRUN LLM Benchmark: This repository contains scenarios with instructions for LLMs on how to execute different software engineering tasks, evaluation criteria, and utility scripts for automated benchmark execution and evaluation.

  • AIRUN LLM Benchmark Results: This repository holds all LLM benchmark run results and reports of LLM evaluations.

LLM Evaluation Framework

  • AIRUN Evaluation Framework: This repository contains a tool for automatically evaluating and grading Large Language Model (LLM) answers using another LLM as an evaluator.

AI Code Assistants Benchmarking

🤝 Contributing

We appreciate all contributions to improve the AI/RUN™ Engineering Benchmark. Please see our Contribution Guidelines for more information on how to get involved.

If you have suggestions for new benchmark scenarios or improvements to existing ones, please open an issue or submit a pull request.

📝 Submission

The list of LLMs and Code Assistants on the Leaderboards is built based on the number of requests we receive from developers. We encourage you to submit your LLM or Code Assistant for evaluation: our team will run the corresponding benchmark tests and add it to the appropriate leaderboard.

To submit your LLM or Code Assistant for evaluation and inclusion in our Leaderboards, please follow these steps:

  1. Review the submission requirements in the respective submission guidelines.

  2. Create a new branch in this repository.

  3. Add a new Markdown (.md) file with information about your model or code assistant. The file should be named according to your submission (e.g., your-model-name.md or your-assistant-name.md) and placed in the appropriate directory, next to the instruction file.

  4. Fill out the Markdown file following the format and requirements specified in the respective submission guidelines linked above. Ensure all required information is included and properly formatted.

  5. Create a Pull Request (PR) with your changes and a description.

Our team will review your PR, run benchmark tests, and add the LLM or Code Assistant to the appropriate leaderboard upon approval.

📄 License

The code base and documentation, including our benchmark approach, test instructions, and evaluation criteria, are licensed under Apache 2.0.

The leaderboards, detailed reports, and analysis are licensed under CC BY-SA 4.0.

Our tests use third-party open-source repositories (licensed under MIT, BSD 3-Clause, Apache 2.0, Eclipse Public License 1.0, Eclipse Public License 2.0, MPL 2.0, and ISC). Please refer to the list here for more details.


© 2024 EPAM Systems, Inc. All Rights Reserved.
EPAM, EPAM AI/RUN™ and the EPAM logo are registered trademarks of EPAM Systems, Inc.
All reports in this repository are licensed under CC BY-SA 4.0
