feat: tool calling benchmark unified across types and prompts variety #620

Merged · 32 commits · Jul 1, 2025

Commits
bd7ccc2
feat: add new levels for prompts and system prompts
jmatejcz May 26, 2025
71e6b71
feat: adjust basic tasks to new levels
jmatejcz May 26, 2025
b5f115f
feat: manipulation tasks adjusted to new levels
jmatejcz May 26, 2025
7bb77c8
feat: adjust navigation tasks to new levels
jmatejcz May 26, 2025
295f18e
refactor: grouped args to task into pydantic model
jmatejcz May 27, 2025
e8f1631
feat: adjust custom interfaces tasks to new levels
jmatejcz May 27, 2025
81983e4
feat: adjust spatial tasks to new levels
jmatejcz May 27, 2025
2690ed6
feat: merged topics mocks in basic tasks
jmatejcz May 27, 2025
4198d6d
feat: adjusted examples and result saving
jmatejcz Jun 2, 2025
08be22a
feat: adjusted visualisation to new levels
jmatejcz Jun 2, 2025
99b5c77
feat: separate file for mocks, merged mocks from different types
jmatejcz Jun 3, 2025
74b8ec2
feat: defined more basic tasks
jmatejcz Jun 3, 2025
e3ae1e5
refactor: split predefined tasks into files
jmatejcz Jun 3, 2025
ee81ec0
feat: extra tool calls as list
jmatejcz Jun 4, 2025
eb90175
docs: adjusted docs to new changes
jmatejcz Jun 4, 2025
70cd604
style: format changes
jmatejcz Jun 4, 2025
0c8cb9d
chore: reduce the computation in example benchmarking
jmatejcz Jun 4, 2025
0b1406f
feat: task prompts more like guidance
jmatejcz Jun 5, 2025
ff851f2
feat: added Task's base prompt for result processing
jmatejcz Jun 5, 2025
b4c9e9e
feat: saving base prompt to results
jmatejcz Jun 5, 2025
14d4f9d
fix: labels in task plots
jmatejcz Jun 5, 2025
d532672
fix: passing prompt levels from user
jmatejcz Jun 6, 2025
7d01b2b
style: adjust docs tutorial
jmatejcz Jun 12, 2025
761ba5c
chore: version bump
jmatejcz Jun 12, 2025
d0a2c5f
docs: typos in docs
jmatejcz Jun 24, 2025
67d2268
refactor: removed duplicate check
jmatejcz Jun 24, 2025
308c3ce
style: change config name
jmatejcz Jun 30, 2025
c03f1cf
refactor: removed moderate level of prompt detail
jmatejcz Jul 1, 2025
8dae02c
docs: added docstrings
jmatejcz Jul 1, 2025
80e5ad2
docs: added examples and more descriptions to docs
jmatejcz Jul 1, 2025
e9ea171
docs: linked the main ToolCallingAgentBenchmarkConfig docstring in o…
jmatejcz Jul 1, 2025
3e64387
docs: updated docs and linked
jmatejcz Jul 1, 2025
25 changes: 20 additions & 5 deletions docs/simulation_and_benchmarking/rai_bench.md
@@ -109,7 +109,7 @@ The `Validator` class can combine single or multiple subtasks to create a single

### Task

A Task represents a specific prompt and set of tools available. A list of validators is assigned to validate the performance.
A Task represents specific prompts and a set of available tools. A list of validators is assigned to validate the performance.

??? info "Task class definition"

@@ -123,14 +123,29 @@ The ToolCallingAgentBenchmark class manages the execution of tasks and collects

### Available Tasks

Tasks of this benchmark are grouped by type:
There are predefined Tasks available, grouped into the following categories:

- Basic - basic usage of tools
- Basic - requires retrieving info from certain topics
- Navigation
- Spatial reasoning - questions about surroundings with images attached
- Manipulation
- Custom Interfaces - requires using messages with custom interfaces

If you want to know details about every task, visit `rai_bench/tool_calling_agent/tasks`
Every Task has an assigned `complexity`, which reflects its difficulty.

When creating a Task, you can define a few params:

```python
class TaskArgs(BaseModel):
    """Holds the configurations specified by user"""

    extra_tool_calls: int = 0
    prompt_detail: Literal["brief", "moderate", "descriptive"] = "brief"
    examples_in_system_prompt: Literal[0, 2, 5] = 0
```

- examples_in_system_prompt - How many examples are included in the system prompt.
- prompt_detail - How descriptive the Task prompt should be.
- extra_tool_calls - How many extra tool calls an agent can make and still pass the Task.
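
For example, a minimal sketch of combining these params (assuming `TaskArgs` is imported from `rai_bench.tool_calling_agent.interfaces`, as in the custom-task tutorial):

```python
from rai_bench.tool_calling_agent.interfaces import TaskArgs

# Allow 2 extra tool calls, ask for a descriptive task prompt,
# and include 2 examples in the system prompt.
args = TaskArgs(
    extra_tool_calls=2,
    prompt_detail="descriptive",
    examples_in_system_prompt=2,
)
```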

If you want to know details about every task, visit `rai_bench/tool_calling_agent/tasks`
55 changes: 38 additions & 17 deletions docs/tutorials/benchmarking.md
Original file line number Diff line number Diff line change
@@ -53,12 +53,12 @@ If your goal is creating custom tasks and scenarios, visit [Creating Custom Task
This benchmark does not require any additional setup besides the main one [Basic Setup](../setup/install.md), just run:

```bash
python src/rai_bench/rai_bench/examples/tool_calling_agent.py --model-name <model-name> --vendor <vendor> --extra-tool-calls <5> --task-types <basic> --out-dir <out_dir>
python src/rai_bench/rai_bench/examples/tool_calling_agent.py --model-name <qwen2.5:7b> --vendor <ollama> --extra-tool-calls <0 5> --task-types basic --n-shots <0 2> --prompt-detail <brief descriptive> --complexities <easy medium hard> --out-dir <out_dir>
```

!!! note

This Benchmark is significantly faster, but still if just trying out, we recommend choosing just one task-type.
This Benchmark is significantly faster, but if you are just trying it out, we still recommend choosing just one value per flag, as every combination of parameters creates additional tasks (see the sketch below).
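
As a rough illustration (not the benchmark's actual task-generation code) of how the flag values above multiply into task variants:

```python
from itertools import product

# Example flag values, mirroring the command above.
extra_tool_calls = [0, 5]
n_shots = [0, 2]
prompt_detail = ["brief", "descriptive"]

# Every base task is expanded once per parameter combination.
combinations = list(product(extra_tool_calls, n_shots, prompt_detail))
print(len(combinations))  # 8 variants per base task
```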

## Testing Models

@@ -90,12 +90,18 @@ if __name__ == "__main__":
],
repeats=1, # how many times to repeat
)
tool_conf = ToolCallingAgentBenchmarkConfig(
extra_tool_calls=5, # how many extra tool calls allowed to still pass
tool_conf = ToolCallingAgentBenchmarkConfig(
extra_tool_calls=[0, 5], # how many extra tool calls allowed to still pass
task_types=[ # what types of tasks to include
"basic",
"spatial_reasoning",
"manipulation",
"custom_interfaces",
],
N_shots=[0, 2], # examples in system prompt
prompt_detail=[ # how descriptive should task prompt be
"brief",
"moderate",
"descriptive",
],
repeats=1,
)
@@ -222,6 +228,21 @@ class ThrowObjectsOffTableTask(ManipulationTask):

incorrect: int = len(selected_type_objects) - correct
return correct, incorrect

# configure existing Task with different params
target_coords = (0.1, 0.1)
disp = 0.1
task = PlaceObjectAtCoordTask(
    obj_type="apple",
    target_position=target_coords,
    allowable_displacement=disp,
)

Scenario(
    task=task,
    scene_config=scene_config,
    scene_config_path=path_to_your_config
)
```

As `obj_type` is parameterizable, it enables various variants of this Task. Combined with the many simulation configs available, a single Task can provide dozens of scenarios.

@@ -240,23 +261,14 @@ from rai_bench.tool_calling_agent.subtasks import (
Expand All @@ -240,23 +261,14 @@ from rai_bench.tool_calling_agent.subtasks import (
from rai_bench.tool_calling_agent.validators import (
OrderedCallsValidator,
)
from rai_bench.tool_calling_agent.tasks.basic import BasicTask
from rai_bench.tool_calling_agent.mocked_tools import (
MockGetROS2TopicsNamesAndTypesTool,
)
from rai_bench.tool_calling_agent.interfaces import Task, TaskArgs
from langchain_core.tools import BaseTool
from typing import List

# configure existing Task with different params
target_coords = (0.1, 0.1)
disp = 0.1
task = PlaceObjectAtCoordTask(
obj_type="apple",
target_position=target_coords,
allowable_displacement=disp,
)

Scenario(task=task, scene_config=scene_config, scene_config_path=path_to_your_config)

# define subtask that requires
receive_robot_pos_subtask = CheckArgsToolCallSubTask(
@@ -270,7 +282,7 @@ receive_robot_pos_subtask = CheckArgsToolCallSubTask(
topics_ord_val = OrderedCallsValidator(subtasks=[receive_robot_pos_subtask])


class GetROS2RobotPositionTask(BasicTask):
class GetROS2RobotPositionTask(Task):
complexity = "easy"

@property
@@ -287,9 +299,18 @@ class GetROS2RobotPositionTask(BasicTask):
            ),
        ]

    def get_system_prompt(self) -> str:
        return "You are a ROS 2 expert that wants to solve tasks. You have access to various tools that allow you to query the ROS 2 system."

    def get_prompt(self) -> str:
        return "Get the position of the robot."

    @property
    def optional_tool_calls_number(self) -> int:
        # Listing topics before getting any message
        return 1

# optionally pass number of extra tool calls
task = GetROS2RobotPositionTask(validators=[topics_ord_val], extra_tool_calls=1)
args = TaskArgs(extra_tool_calls=0)
task = GetROS2RobotPositionTask(validators=[topics_ord_val], task_args=args)
```
2 changes: 1 addition & 1 deletion src/rai_bench/pyproject.toml
@@ -1,6 +1,6 @@
[tool.poetry]
name = "rai-bench"
version = "0.1.0"
version = "0.2.0"
description = "Package for running and creating benchmarks."
authors = ["Jakub Matejczyk <[email protected]>", "Magdalena Kotynia <[email protected]>"]
readme = "README.md"
18 changes: 13 additions & 5 deletions src/rai_bench/rai_bench/examples/benchmarking_models.py
@@ -20,31 +20,39 @@

if __name__ == "__main__":
# Define models you want to benchmark
model_names = ["qwen2.5:7b", "llama3.2:3b"]
vendors = ["ollama", "ollama"]
model_names = ["qwen2.5:7b"]
vendors = ["ollama"]

# Define benchmarks that will be used
man_conf = ManipulationO3DEBenchmarkConfig(
mani_conf = ManipulationO3DEBenchmarkConfig(
o3de_config_path="src/rai_bench/rai_bench/manipulation_o3de/predefined/configs/o3de_config.yaml", # path to your o3de config
levels=[ # define what difficulty of tasks to include in benchmark
"trivial",
],
repeats=1, # how many times to repeat
)
tool_conf = ToolCallingAgentBenchmarkConfig(
extra_tool_calls=5, # how many extra tool calls allowed to still pass
extra_tool_calls=[0], # how many extra tool calls allowed to still pass
task_types=[ # what types of tasks to include
"basic",
"spatial_reasoning",
# "navigation",
"custom_interfaces",
"manipulation",
],
N_shots=[2], # examples in system prompt
prompt_detail=[ # how descriptive should task prompt be
"brief",
# "moderate",
"descriptive",
],
repeats=1,
)

out_dir = "src/rai_bench/rai_bench/experiments"
test_models(
model_names=model_names,
vendors=vendors,
benchmark_configs=[man_conf, tool_conf],
benchmark_configs=[tool_conf],
out_dir=out_dir,
)
2 changes: 2 additions & 0 deletions src/rai_bench/rai_bench/examples/tool_calling_agent.py
@@ -34,6 +34,8 @@
extra_tool_calls=args.extra_tool_calls,
complexities=args.complexities,
task_types=args.task_types,
n_shots=args.n_shots,
prompt_detail=args.prompt_detail,
)
for task in tasks:
task.set_logger(bench_logger)
28 changes: 12 additions & 16 deletions src/rai_bench/rai_bench/results_processing/data_loading.py
@@ -70,20 +70,19 @@ def convert_row_to_task_result(row: pd.Series) -> TaskResult:
)
validator_results.append(validator_result)

return TaskResult(
task_prompt=row["task_prompt"],
system_prompt=row["system_prompt"],
complexity=row["complexity"],
type=row["type"],
model_name=row["model_name"],
validation_info=validator_results,
extra_tool_calls=int(row["extra_tool_calls"]),
extra_tool_calls_used=int(row["extra_tool_calls_used"]),
score=float(row["score"]),
total_time=float(row["total_time"]),
run_id=uuid.UUID(row["run_id"]),
row.update(
{
"validation_info": validator_results,
"extra_tool_calls": int(row["extra_tool_calls"]),
"extra_tool_calls_used": int(row["extra_tool_calls_used"]),
"score": float(row["score"]),
"total_time": float(row["total_time"]),
"run_id": uuid.UUID(row["run_id"]),
}
)

return TaskResult(**row)


def convert_row_to_scenario_result(row: pd.Series) -> ScenarioResult:
"""
@@ -100,10 +99,7 @@ def convert_row_to_scenario_result(row: pd.Series) -> ScenarioResult:
A ScenarioResult object
"""
return ScenarioResult(
task_prompt=row["task_prompt"],
system_prompt=row["system_prompt"],
model_name=row["model_name"],
scene_config_path=row["scene_config_path"],
**row,
score=float(row["score"]),
total_time=float(row["total_time"]),
number_of_tool_calls=int(row["number_of_tool_calls"]),
34 changes: 29 additions & 5 deletions src/rai_bench/rai_bench/results_processing/data_processing.py
@@ -181,17 +181,25 @@ def create_task_metrics_dataframe(


def create_task_details_dataframe(
model_results: ModelResults, task_type: Optional[str] = None
model_results: ModelResults,
task_type: Optional[str] = None,
complexity: Optional[str] = None,
examples_in_system_prompt: Optional[int] = None,
prompt_detail: Optional[str] = None,
) -> pd.DataFrame:
"""
Create a DataFrame with task details, optionally filtered by task type.
Create a DataFrame with task details, optionally filtered by multiple criteria.

Parameters
----------
model_results : ModelResults
The model results object
task_type : Optional[str]
Task type to filter by
complexity : Optional[str]
Complexity to filter by
examples_in_system_prompt : Optional[int]
Number of examples in the system prompt to filter by
prompt_detail : Optional[str]
Prompt detail level to filter by

Returns
-------
@@ -201,14 +209,30 @@
all_detailed_results = get_all_detailed_results_from_model_results(
model_results=model_results
)

if not all_detailed_results:
return pd.DataFrame()

# filter by task type
# Apply filters
if task_type:
all_detailed_results = [r for r in all_detailed_results if r.type == task_type]

if complexity:
all_detailed_results = [
r for r in all_detailed_results if r.complexity == complexity
]

if examples_in_system_prompt is not None:
all_detailed_results = [
r
for r in all_detailed_results
if r.examples_in_system_prompt == examples_in_system_prompt
]

if prompt_detail:
all_detailed_results = [
r for r in all_detailed_results if r.prompt_detail == prompt_detail
]

rows: List[Dict[str, Any]] = [
{
"task_prompt": result.task_prompt,
@@ -217,10 +241,10 @@
"score": result.score,
"total_time": result.total_time,
"extra_tool_calls_used": result.extra_tool_calls_used,
"examples_in_system_prompt": result.examples_in_system_prompt,
}
for result in all_detailed_results
]

return pd.DataFrame(rows)
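
# Usage sketch (illustrative, not part of this diff): `model_results` is assumed
# to be a ModelResults object already loaded from benchmark output; the new
# filters can be combined to narrow the detailed view.
filtered_df = create_task_details_dataframe(
    model_results=model_results,
    task_type="basic",
    complexity="easy",
    examples_in_system_prompt=2,
    prompt_detail="brief",
)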

