Implements the evaluation of agents on the GPQA benchmark introduced in [GPQA: A Graduate-Level Google-Proof Q&A Benchmark](https://arxiv.org/pdf/2311.12022).

This code implements the evaluation of agents on the GPQA benchmark in the open-book setting.
- The benchmark consists of 448 high-quality and extremely difficult multiple-choice questions in the domains of biology, physics, and chemistry. The questions are intentionally designed to be "Google-proof," meaning that even highly skilled non-expert validators achieve only 34% accuracy despite unrestricted access to the web.
- Even experts in the corresponding domains achieve only 65% accuracy.
- State-of-the-art AI systems achieve only 39% accuracy on this challenging dataset.
Note: Accurately solving these graduate-level questions requires both tool use (e.g., Python for calculations) and web search for finding related facts, since the information required by the questions might not be part of the LLM's knowledge / training data.
Further references:
- https://arxiv.org/pdf/2311.12022
- https://paperswithcode.com/dataset/gpqa
- https://github.com/idavidrein/gpqa
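If you want a quick look at the underlying data before running the benchmark, here is a minimal sketch, assuming you use the Hugging Face `datasets` library and have access to the gated `Idavidrein/gpqa` dataset on the hub (you may need to run `huggingface-cli login` first):

```python
# Minimal sketch: inspect the GPQA splits via the Hugging Face `datasets` library.
# Assumes `pip install datasets` and access to the gated Idavidrein/gpqa dataset.
from datasets import load_dataset

# Available configs: gpqa_main, gpqa_diamond, gpqa_experts, gpqa_extended
dataset = load_dataset("Idavidrein/gpqa", "gpqa_main")
print(len(dataset["train"]))            # number of questions in the split
print(dataset["train"][0]["Question"])  # the raw question text
```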
Please follow this document to set up a local development environment for OpenDevin.
Create a `config.toml` file (you can copy from `config.template.toml`) if it does not exist at the root of the workspace. Add the following configurations:
```toml
# TODO: Change these to the model you want to evaluate
[llm.eval_gpt4_1106_preview]
model = "gpt-4-1106-preview"
api_key = "XXX"
temperature = 0.0

[llm.eval_azure_openai_compatible_model]
model = "AZURE_OPENAI_EXACT_DEPLOYMENT_MODEL_NAME"
base_url = "AZURE_OPENAI_ENDPOINT"
api_key = "AZURE_ENDPOINT_API_KEY"
temperature = 0.0
```
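If you want to sanity-check the file before starting a run, a quick hypothetical check (not part of OpenDevin) using Python's standard-library TOML parser might look like this:

```python
# Hypothetical sanity check (not part of OpenDevin): confirm the config.toml
# entry parses and contains the keys the evaluation expects.
import tomllib  # standard library in Python 3.11+

with open("config.toml", "rb") as f:
    cfg = tomllib.load(f)

llm_cfg = cfg["llm"]["eval_gpt4_1106_preview"]  # the section name from above
assert {"model", "api_key", "temperature"} <= llm_cfg.keys()
print(f"Evaluating with model: {llm_cfg['model']}")
```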
From the root of the OpenDevin repo, run the following command:

```bash
./evaluation/gpqa/scripts/run_infer.sh [model_config_name] [git-version] [num_samples_eval] [data_split] [AgentClass]
```

You can replace `model_config_name` with any model you set up in `config.toml`; a concrete example invocation is shown after the argument list below.

- `model_config_name`: The model configuration name from `config.toml` that you want to evaluate.
- `git-version`: The git commit hash of the OpenDevin version you would like to evaluate, e.g. `HEAD`. It could also be a release tag like `0.6.2`.
- `num_samples_eval`: Number of samples to evaluate (useful for testing and debugging).
- `data_split`: The data split to evaluate on. Must be one of `gpqa_main`, `gqpa_diamond`, `gpqa_experts`, `gpqa_extended`. Defaults to `gpqa_diamond` as done in the paper.
- `AgentClass`: The agent class to use for evaluation. Currently only supports `CodeActAgent`.
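For instance, a hypothetical run evaluating 5 samples of the diamond split with the `eval_gpt4_1106_preview` config defined earlier might look like this (the exact accepted values come from the script itself):

```bash
# Hypothetical example: 5 samples of the diamond split with CodeActAgent,
# using the eval_gpt4_1106_preview config from config.toml above.
./evaluation/gpqa/scripts/run_infer.sh eval_gpt4_1106_preview HEAD 5 gpqa_diamond CodeActAgent
```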
- [ ] TODO: Finish the evaluation run across the entire benchmark and compile the results.