feat: tool calling benchmark unified across types and prompts variety #620
b2a12ba to c3b1b35
removed redundant tasks
removed old tasks declarations
all tasks have same tools available now all tool have all topics available
added more tools available add optional toll calls number
add optional calls
c3b1b35 to 0c8cb9d
a5e9176 to d532672
@CodeRabbit full review
✅ Actions performed: Full review triggered.
Walkthrough
The changes introduce a major refactor and extension of the tool-calling agent benchmark framework. Task definitions are modularized and parameterized, supporting new prompt detail levels and few-shot prompting. The public interface for defining, configuring, and filtering tasks is unified via new data models and factory functions. Visualization and result processing functions are generalized to support richer metadata. Documentation and examples are updated accordingly.
Sequence Diagram(s)
sequenceDiagram
participant User
participant CLI/Script
participant BenchmarkConfig
participant TaskFactory
participant Task
participant Benchmark
participant ResultsProcessor
User->>CLI/Script: Specify benchmark args (model, extra_tool_calls, prompt_detail, n_shots, etc.)
CLI/Script->>BenchmarkConfig: Create config with lists for parameters
CLI/Script->>TaskFactory: get_tasks(extra_tool_calls, prompt_detail, n_shots, ...)
TaskFactory->>Task: Instantiate Task with TaskArgs for each parameter combination
TaskFactory-->>CLI/Script: Return list of parameterized Tasks
CLI/Script->>Benchmark: Run benchmark with Tasks and config
Benchmark->>Task: For each Task, execute and collect results
Benchmark->>ResultsProcessor: Pass results including examples_in_system_prompt, prompt_detail
ResultsProcessor-->>User: Visualize and filter results by new metadata fields
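The "one Task per parameter combination" step in the diagram above can be sketched roughly as follows. This is only an illustration, not the PR's implementation: the `TaskArgs` dataclass here is a hypothetical stand-in for the project's model, and `build_task_args` is an invented helper name.

```python
from dataclasses import dataclass
from itertools import product
from typing import List, Optional


@dataclass(frozen=True)
class TaskArgs:
    """Hypothetical stand-in for the benchmark's TaskArgs model."""

    extra_tool_calls: int
    prompt_detail: str
    examples_in_system_prompt: int


def build_task_args(
    extra_tool_calls: Optional[List[int]] = None,
    prompt_detail: Optional[List[str]] = None,
    n_shots: Optional[List[int]] = None,
) -> List[TaskArgs]:
    """Return one TaskArgs per combination of the given parameter lists."""
    # Defaults are created inside the function, not in the signature.
    if extra_tool_calls is None:
        extra_tool_calls = [0]
    if prompt_detail is None:
        prompt_detail = ["brief", "moderate", "descriptive"]
    if n_shots is None:
        n_shots = [0, 2, 5]
    # itertools.product yields the full Cartesian product of the lists.
    return [
        TaskArgs(extra, detail, shots)
        for extra, detail, shots in product(
            extra_tool_calls, prompt_detail, n_shots
        )
    ]
```

With the defaults this yields 1 x 3 x 3 = 9 argument sets, so the number of generated tasks grows multiplicatively with every list that is widened.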
sequenceDiagram
participant User
participant WebUI
participant ResultsProcessor
User->>WebUI: Select filters (task type, complexity, examples_in_system_prompt, prompt_detail)
WebUI->>ResultsProcessor: create_task_details_dataframe with selected filters
ResultsProcessor-->>WebUI: Filtered DataFrame with detailed task results
WebUI-->>User: Display charts and tables grouped by selected fields
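A minimal sketch of filtering results by these metadata fields, assuming a hypothetical `TaskResult` record whose attribute names mirror the fields in the diagram. The function name `filter_results` and the record layout are assumptions for illustration, not the project's `create_task_details_dataframe`.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class TaskResult:
    """Hypothetical result record; field names follow the diagram."""

    type: str
    complexity: str
    examples_in_system_prompt: int
    prompt_detail: str
    success: bool


def filter_results(
    results: List[TaskResult],
    task_type: Optional[str] = None,
    complexity: Optional[str] = None,
    examples_in_system_prompt: Optional[int] = None,
    prompt_detail: Optional[str] = None,
) -> List[TaskResult]:
    """Keep only results matching every filter that was actually given."""
    filters: Dict[str, Any] = {
        "type": task_type,
        "complexity": complexity,
        "examples_in_system_prompt": examples_in_system_prompt,
        "prompt_detail": prompt_detail,
    }
    for attr, wanted in filters.items():
        # Compare against None, not truthiness, so that 0 is a valid filter.
        if wanted is not None:
            results = [r for r in results if getattr(r, attr) == wanted]
    return results
```

The `is not None` check matters if a filter value can be falsy: a plain `if examples_in_system_prompt:` test would silently skip a request to show only 0-shot results.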
✨ Finishing Touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
Documentation and Community
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 17
🔭 Outside diff range comments (2)
src/rai_bench/rai_bench/tool_calling_agent/predefined/tasks.py (1)
30-54: Replace mutable default arguments with None
Using mutable objects as default arguments can lead to unexpected behavior.
 def get_tasks(
-    extra_tool_calls: List[int] = [0],
-    complexities: List[Literal["easy", "medium", "hard"]] = ["easy", "medium", "hard"],
-    prompt_detail: List[Literal["brief", "moderate", "descriptive"]] = [
-        "brief",
-        "moderate",
-        "descriptive",
-    ],
-    n_shots: List[Literal[0, 2, 5]] = [0, 2, 5],
+    extra_tool_calls: List[int] | None = None,
+    complexities: List[Literal["easy", "medium", "hard"]] | None = None,
+    prompt_detail: List[Literal["brief", "moderate", "descriptive"]] | None = None,
+    n_shots: List[Literal[0, 2, 5]] | None = None,
     task_types: List[
         Literal[
             "basic",
             "manipulation",
             "navigation",
             "custom_interfaces",
             "spatial_reasoning",
         ]
-    ] = [
+    ] | None = None,
+) -> List[Task]:
+    if extra_tool_calls is None:
+        extra_tool_calls = [0]
+    if complexities is None:
+        complexities = ["easy", "medium", "hard"]
+    if prompt_detail is None:
+        prompt_detail = ["brief", "moderate", "descriptive"]
+    if n_shots is None:
+        n_shots = [0, 2, 5]
+    if task_types is None:
+        task_types = [
         "basic",
         "manipulation",
         "navigation",
         "custom_interfaces",
         "spatial_reasoning",
-    ],
-) -> List[Task]:
+        ]
     all_tasks: List[Task] = []
src/rai_bench/rai_bench/tool_calling_agent/tasks/navigation.py (1)
36-102: Fix line length violations in system prompts.
The system prompts are well-structured with good incremental examples, but several lines exceed the 79-character limit.
Apply these formatting fixes:
-ROBOT_NAVIGATION_SYSTEM_PROMPT_0_SHOT = """You are an autonomous robot connected to ros2 environment. Your main goal is to fulfill the user's requests.
+ROBOT_NAVIGATION_SYSTEM_PROMPT_0_SHOT = """You are an autonomous robot connected to ros2 environment.
+    Your main goal is to fulfill the user's requests.
     Do not make assumptions about the environment you are currently in.
     You can use ros2 topics, services and actions to operate.
     <rule> As a first step check transforms by getting 1 message from /tf topic </rule>
-    <rule> use /cmd_vel topic very carefully. Obstacle detection works only with nav2 stack, so be careful when it is not used. </rule>>
+    <rule> use /cmd_vel topic very carefully. Obstacle detection works only with nav2 stack,
+    so be careful when it is not used. </rule>
Similar formatting should be applied to lines 93, 100-101, and line 85.
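One caveat when wrapping prompt strings: a line break added inside a triple-quoted literal becomes part of the prompt, whereas adjacent string literals inside parentheses are joined without inserting anything. The sketch below uses a shortened, made-up prompt purely to show the difference; the constant names are illustrative.

```python
# The unwrapped text, kept on one line for comparison.
ORIGINAL = "You are an autonomous robot. Your main goal is to fulfill requests."

# Wrapping inside a triple-quoted literal inserts a newline and the
# indentation of the continuation line into the string itself.
WRAPPED_IN_PLACE = """You are an autonomous robot.
    Your main goal is to fulfill requests."""

# Adjacent literals are concatenated at compile time, so the source line
# is shorter while the resulting string stays identical to ORIGINAL.
WRAPPED_BY_CONCATENATION = (
    "You are an autonomous robot. "
    "Your main goal is to fulfill requests."
)
```

Whether the extra whitespace affects model behavior is an open question for a benchmark, so a text-preserving wrap is the more conservative way to satisfy the line-length check.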
🧹 Nitpick comments (15)
src/rai_bench/rai_bench/examples/benchmarking_models.py (1)
35-35: Fix line length violation.
The line exceeds the 79-character limit enforced by flake8.
-    extra_tool_calls=[0],  # how many extra tool calls allowed to still pass
+    extra_tool_calls=[0],  # extra tool calls allowed to still pass
src/rai_bench/rai_bench/results_processing/data_processing.py (1)
215-235: Consider refactoring the filtering logic for better maintainability.
The sequential filtering approach is correct but could be made more maintainable by using a dictionary-driven approach.
-    # Apply filters
-    if task_type:
-        all_detailed_results = [r for r in all_detailed_results if r.type == task_type]
-
-    if complexity:
-        all_detailed_results = [
-            r for r in all_detailed_results if r.complexity == complexity
-        ]
-
-    if examples_in_system_prompt:
-        all_detailed_results = [
-            r
-            for r in all_detailed_results
-            if r.examples_in_system_prompt == examples_in_system_prompt
-        ]
-
-    if prompt_detail:
-        all_detailed_results = [
-            r for r in all_detailed_results if r.prompt_detail == prompt_detail
-        ]
+    # Apply filters
+    filters = {
+        'type': task_type,
+        'complexity': complexity,
+        'examples_in_system_prompt': examples_in_system_prompt,
+        'prompt_detail': prompt_detail,
+    }
+
+    for attr, value in filters.items():
+        if value is not None:
+            all_detailed_results = [
+                r for r in all_detailed_results if getattr(r, attr) == value
+            ]
src/rai_bench/rai_bench/tool_calling_agent/predefined/manipulation_tasks.py (1)
34-34
: Fix comment formatting for consistency.The comment blocks have formatting issues that should be addressed for code style compliance.
-########## SUBTASKS ################################################################# +# SUBTASKS-######### VALIDATORS ######################################################################################### +# VALIDATORSAlso applies to: 44-44
docs/simulation_and_benchmarking/rai_bench.md (2)
127-135: Fix grammatical issues for better readability.
The documentation has some grammatical issues that should be corrected.
-There are predefined Tasks available which are grouped by categories:
+Predefined Tasks are available, grouped by categories:
-- Basic - require retrieving info from certain topics
+- Basic - requires retrieving info from certain topics
136-150: Fix grammatical issues in task parameter description.
-When creating a Task you can define few params:
+When creating a Task, you can define a few parameters:
-extra_tool_calls - How many extra tool calls can agent make and still pass the Task.
+extra_tool_calls - How many extra tool calls an agent can make and still pass the Task.
src/rai_bench/rai_bench/test_models.py (1)
59-66
: LGTM! Configuration changes align with PR objectives.The addition of
N_shots
andprompt_detail
parameters successfully implements the multi-level prompt system as described in the PR objectives. The type annotations and default values are appropriate.Minor formatting improvement to address line length:
- extra_tool_calls: List[int] = [0] + extra_tool_calls: List[int] = [0] complexities: List[Literal["easy", "medium", "hard"]] = ["easy", "medium", "hard"] - N_shots: List[Literal[0, 2, 5]] = [0, 2, 5] - prompt_detail: List[Literal["brief", "moderate", "descriptive"]] = [ - "brief", - "moderate", - "descriptive", - ] + N_shots: List[Literal[0, 2, 5]] = [0, 2, 5] + prompt_detail: List[Literal["brief", "moderate", "descriptive"]] = [ + "brief", "moderate", "descriptive" + ]src/rai_bench/rai_bench/tool_calling_agent/predefined/custom_interfaces_tasks.py (1)
32-32: Formatting consistency improvement.
Adjust comment formatting for consistency:
-########## SUBTASKS #################################################################
+# SUBTASKS
src/rai_bench/rai_bench/tool_calling_agent/predefined/spatial_reasoning_tasks.py (1)
35-94: Consider refactoring for better maintainability.
The task inputs are well-defined, but there's some duplication between the initial definitions and the categorized versions within the function.
Consider moving the initial true_response_inputs and false_response_inputs definitions into the function or removing them if they're not used elsewhere, to reduce duplication and improve maintainability.
src/rai_bench/rai_bench/results_processing/visualise/tool_calling_agent_display.py (1)
376-376: Fix typo in UI label
-    "Select prompt decriptivness",
+    "Select prompt descriptiveness",
src/rai_bench/rai_bench/tool_calling_agent/tasks/spatial.py (3)
111-116: Simplify conditional logic by removing unnecessary elif
 def get_system_prompt(self) -> str:
     if self.n_shots == 0:
         return SPATIAL_REASONING_SYSTEM_PROMPT_0_SHOT
-    elif self.n_shots == 2:
+    if self.n_shots == 2:
         return SPATIAL_REASONING_SYSTEM_PROMPT_2_SHOT
-    else:
-        return SPATIAL_REASONING_SYSTEM_PROMPT_5_SHOT
+    return SPATIAL_REASONING_SYSTEM_PROMPT_5_SHOT
147-156: Simplify conditional logic by removing unnecessary elif
 def get_prompt(self):
     if self.prompt_detail == "brief":
         return self.get_base_prompt()
-    elif self.prompt_detail == "moderate":
+    if self.prompt_detail == "moderate":
         return f"{self.get_base_prompt()} using visual analysis"
-    else:
-        return (
-            f"{self.get_base_prompt()} using the visual analysis system. "
-            "You can examine the provided image(s) carefully to identify relevant features, "
-            "analyze the visual content, and provide a boolean response based on your observations."
-        )
+    return (
+        f"{self.get_base_prompt()} using the visual analysis system. "
+        "You can examine the provided image(s) carefully to identify relevant features, "
+        "analyze the visual content, and provide a boolean response based on your observations."
+    )
28-28: Break up long line for better readability
-SPATIAL_REASONING_SYSTEM_PROMPT_0_SHOT = """You are a helpful and knowledgeable AI assistant that specializes in interpreting and analyzing visual content. Your task is to answer questions based on the images provided to you. Please response with the use of the provided tools."""
+SPATIAL_REASONING_SYSTEM_PROMPT_0_SHOT = (
+    "You are a helpful and knowledgeable AI assistant that specializes in "
+    "interpreting and analyzing visual content. Your task is to answer questions "
+    "based on the images provided to you. Please response with the use of the provided tools."
+)
src/rai_bench/rai_bench/tool_calling_agent/tasks/navigation.py (1)
170-239: Fix line length violations in task prompt methods.
The prompt methods follow a good consistent pattern, but several lines exceed the character limit.
Consider breaking long strings across multiple lines:
-    return (
-        f"{self.get_base_prompt()} using the robotic navigation system. "
-        "You can use the navigation tools to move the robot to the specified coordinates. "
-        "First get the available actions, then set up the navigation goal to reach point (2.0, 2.0, 0.0)."
-    )
+    return (
+        f"{self.get_base_prompt()} using the robotic navigation system. "
+        "You can use the navigation tools to move the robot to the "
+        "specified coordinates. First get the available actions, then "
+        "set up the navigation goal to reach point (2.0, 2.0, 0.0)."
+    )
Note: The pylint warnings about "elif after return" are style preferences and the current code is acceptable.
src/rai_bench/rai_bench/tool_calling_agent/tasks/custom_interfaces.py (1)
193-201: Consider refactoring constructors with many parameters.
Several task constructors have 7+ parameters, which impacts readability and maintainability.
Consider these approaches to reduce parameter count:
- Group related parameters into configuration objects (e.g., AudioConfig, DetectionConfig)
- Use the builder pattern for complex task initialization
- Move some parameters to class-level constants if they rarely change
This would improve the API design and make the code more maintainable.
Also applies to: 231-241, 283-295, 336-346, 373-382, 445-455
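As a rough sketch of the first option, related parameters can be bundled into a small dataclass that is passed as a single argument. The name `DetectionConfig` comes from the suggestion above, but its fields, the `DetectionTask` class, and the topic name are all hypothetical and do not reflect the actual task constructors.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class DetectionConfig:
    """Hypothetical grouping of detection-related task parameters."""

    image_topic: str
    # default_factory gives every instance its own list.
    classes: List[str] = field(default_factory=list)
    confidence_threshold: float = 0.5


class DetectionTask:
    """Hypothetical task taking one config object instead of many args."""

    def __init__(
        self, detection: DetectionConfig, extra_tool_calls: int = 0
    ) -> None:
        self.detection = detection
        self.extra_tool_calls = extra_tool_calls

    def describe(self) -> str:
        classes = ", ".join(self.detection.classes) or "any class"
        return f"Detect {classes} on {self.detection.image_topic}"
```

A call such as `DetectionTask(DetectionConfig("/camera/image", ["person"]))` keeps the constructor at two parameters, and new detection options can be added to the config without touching every task signature.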
src/rai_bench/rai_bench/tool_calling_agent/tasks/manipulation.py (1)
179-196
: Fix line length violations in prompt methods.The task implementations are well-structured with good error handling, but several prompt strings exceed the line limit.
Format long strings properly:
- f"{self.get_base_prompt()} using the robotic manipulation system. " - "You can control the arm movement to the specified coordinates " - f"and perform the {self.move_to_tool_input.task} action at that location." + f"{self.get_base_prompt()} using the robotic manipulation " + "system. You can control the arm movement to the specified " + f"coordinates and perform the {self.move_to_tool_input.task} " + "action at that location."Also applies to: 215-240, 376-390
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (25)
docs/simulation_and_benchmarking/rai_bench.md (2 hunks)
docs/tutorials/benchmarking.md (6 hunks)
src/rai_bench/pyproject.toml (1 hunks)
src/rai_bench/rai_bench/examples/benchmarking_models.py (2 hunks)
src/rai_bench/rai_bench/examples/tool_calling_agent.py (1 hunks)
src/rai_bench/rai_bench/results_processing/data_loading.py (2 hunks)
src/rai_bench/rai_bench/results_processing/data_processing.py (3 hunks)
src/rai_bench/rai_bench/results_processing/visualise/tool_calling_agent_display.py (4 hunks)
src/rai_bench/rai_bench/test_models.py (2 hunks)
src/rai_bench/rai_bench/tool_calling_agent/benchmark.py (1 hunks)
src/rai_bench/rai_bench/tool_calling_agent/interfaces.py (5 hunks)
src/rai_bench/rai_bench/tool_calling_agent/predefined/__init__.py (1 hunks)
src/rai_bench/rai_bench/tool_calling_agent/predefined/basic_tasks.py (1 hunks)
src/rai_bench/rai_bench/tool_calling_agent/predefined/custom_interfaces_tasks.py (1 hunks)
src/rai_bench/rai_bench/tool_calling_agent/predefined/manipulation_tasks.py (1 hunks)
src/rai_bench/rai_bench/tool_calling_agent/predefined/navigation_tasks.py (1 hunks)
src/rai_bench/rai_bench/tool_calling_agent/predefined/spatial_reasoning_tasks.py (1 hunks)
src/rai_bench/rai_bench/tool_calling_agent/predefined/tasks.py (2 hunks)
src/rai_bench/rai_bench/tool_calling_agent/results_tracking.py (1 hunks)
src/rai_bench/rai_bench/tool_calling_agent/tasks/basic.py (1 hunks)
src/rai_bench/rai_bench/tool_calling_agent/tasks/custom_interfaces.py (5 hunks)
src/rai_bench/rai_bench/tool_calling_agent/tasks/manipulation.py (9 hunks)
src/rai_bench/rai_bench/tool_calling_agent/tasks/navigation.py (3 hunks)
src/rai_bench/rai_bench/tool_calling_agent/tasks/spatial.py (4 hunks)
src/rai_bench/rai_bench/utils.py (1 hunks)
🧰 Additional context used
🪛 Flake8 (7.2.0)
src/rai_bench/rai_bench/tool_calling_agent/results_tracking.py
[error] 47-47: line too long (81 > 79 characters)
(E501)
[error] 49-49: line too long (83 > 79 characters)
(E501)
src/rai_bench/rai_bench/results_processing/data_processing.py
[error] 191-191: line too long (83 > 79 characters)
(E501)
[error] 217-217: line too long (87 > 79 characters)
(E501)
src/rai_bench/rai_bench/tool_calling_agent/predefined/manipulation_tasks.py
[error] 34-34: too many leading '#' for block comment
(E266)
[error] 34-34: line too long (85 > 79 characters)
(E501)
[error] 44-44: too many leading '#' for block comment
(E266)
[error] 44-44: line too long (110 > 79 characters)
(E501)
src/rai_bench/rai_bench/tool_calling_agent/predefined/navigation_tasks.py
[error] 34-34: too many leading '#' for block comment
(E266)
[error] 34-34: line too long (85 > 79 characters)
(E501)
[error] 63-63: too many leading '#' for block comment
(E266)
[error] 63-63: line too long (110 > 79 characters)
(E501)
[error] 67-67: line too long (87 > 79 characters)
(E501)
[error] 68-68: line too long (86 > 79 characters)
(E501)
src/rai_bench/rai_bench/tool_calling_agent/predefined/basic_tasks.py
[error] 39-39: too many leading '#' for block comment
(E266)
[error] 39-39: line too long (85 > 79 characters)
(E501)
[error] 181-181: too many leading '#' for block comment
(E266)
[error] 181-181: line too long (110 > 79 characters)
(E501)
[error] 187-187: line too long (88 > 79 characters)
(E501)
[error] 188-188: line too long (88 > 79 characters)
(E501)
[error] 221-221: line too long (85 > 79 characters)
(E501)
[error] 222-222: line too long (85 > 79 characters)
(E501)
[error] 254-254: line too long (81 > 79 characters)
(E501)
[error] 283-283: line too long (84 > 79 characters)
(E501)
[error] 286-286: line too long (84 > 79 characters)
(E501)
src/rai_bench/rai_bench/examples/benchmarking_models.py
[error] 35-35: line too long (80 > 79 characters)
(E501)
src/rai_bench/rai_bench/tool_calling_agent/predefined/spatial_reasoning_tasks.py
[error] 95-95: too many leading '#' for block comment
(E266)
[error] 95-95: line too long (85 > 79 characters)
(E501)
[error] 100-100: line too long (80 > 79 characters)
(E501)
[error] 103-103: too many leading '#' for block comment
(E266)
[error] 103-103: line too long (110 > 79 characters)
(E501)
[error] 127-127: line too long (85 > 79 characters)
(E501)
[error] 162-162: line too long (87 > 79 characters)
(E501)
[error] 165-165: line too long (85 > 79 characters)
(E501)
[error] 176-176: line too long (81 > 79 characters)
(E501)
src/rai_bench/rai_bench/tool_calling_agent/predefined/custom_interfaces_tasks.py
[error] 32-32: too many leading '#' for block comment
(E266)
[error] 32-32: line too long (85 > 79 characters)
(E501)
[error] 46-46: too many leading '#' for block comment
(E266)
[error] 46-46: line too long (110 > 79 characters)
(E501)
src/rai_bench/rai_bench/results_processing/visualise/tool_calling_agent_display.py
[error] 121-121: line too long (83 > 79 characters)
(E501)
[error] 132-132: line too long (83 > 79 characters)
(E501)
[error] 147-147: line too long (81 > 79 characters)
(E501)
[error] 149-149: line too long (80 > 79 characters)
(E501)
[error] 340-340: line too long (81 > 79 characters)
(E501)
[error] 357-357: line too long (83 > 79 characters)
(E501)
src/rai_bench/rai_bench/test_models.py
[error] 60-60: line too long (86 > 79 characters)
(E501)
src/rai_bench/rai_bench/tool_calling_agent/predefined/tasks.py
[error] 32-32: line too long (87 > 79 characters)
(E501)
src/rai_bench/rai_bench/tool_calling_agent/tasks/navigation.py
[error] 36-36: line too long (151 > 79 characters)
(E501)
[error] 85-85: line too long (90 > 79 characters)
(E501)
[error] 93-93: line too long (220 > 79 characters)
(E501)
[error] 100-100: line too long (120 > 79 characters)
(E501)
[error] 101-101: line too long (181 > 79 characters)
(E501)
[error] 145-145: line too long (82 > 79 characters)
(E501)
[error] 176-176: line too long (81 > 79 characters)
(E501)
[error] 177-177: line too long (99 > 79 characters)
(E501)
[error] 178-178: line too long (114 > 79 characters)
(E501)
[error] 193-193: line too long (80 > 79 characters)
(E501)
[error] 196-196: line too long (81 > 79 characters)
(E501)
[error] 197-197: line too long (95 > 79 characters)
(E501)
[error] 216-216: line too long (81 > 79 characters)
(E501)
[error] 217-217: line too long (82 > 79 characters)
(E501)
[error] 218-218: line too long (88 > 79 characters)
(E501)
[error] 236-236: line too long (81 > 79 characters)
(E501)
[error] 237-237: line too long (96 > 79 characters)
(E501)
[error] 238-238: line too long (94 > 79 characters)
(E501)
src/rai_bench/rai_bench/tool_calling_agent/tasks/spatial.py
[error] 28-28: line too long (280 > 79 characters)
(E501)
[error] 39-39: line too long (105 > 79 characters)
(E501)
[error] 43-43: line too long (82 > 79 characters)
(E501)
[error] 45-45: line too long (91 > 79 characters)
(E501)
[error] 76-76: line too long (86 > 79 characters)
(E501)
[error] 154-154: line too long (97 > 79 characters)
(E501)
[error] 155-155: line too long (104 > 79 characters)
(E501)
[error] 159-159: line too long (83 > 79 characters)
(E501)
[error] 163-163: line too long (87 > 79 characters)
(E501)
[error] 164-164: line too long (85 > 79 characters)
(E501)
[error] 167-167: line too long (89 > 79 characters)
(E501)
[error] 168-168: line too long (110 > 79 characters)
(E501)
src/rai_bench/rai_bench/tool_calling_agent/tasks/custom_interfaces.py
[error] 74-74: line too long (171 > 79 characters)
(E501)
[error] 82-82: line too long (179 > 79 characters)
(E501)
[error] 89-89: line too long (81 > 79 characters)
(E501)
[error] 90-90: line too long (210 > 79 characters)
(E501)
[error] 175-175: line too long (82 > 79 characters)
(E501)
[error] 185-185: line too long (83 > 79 characters)
(E501)
[error] 186-186: line too long (91 > 79 characters)
(E501)
[error] 223-223: line too long (89 > 79 characters)
(E501)
[error] 224-224: line too long (98 > 79 characters)
(E501)
[error] 260-260: line too long (92 > 79 characters)
(E501)
[error] 267-267: line too long (80 > 79 characters)
(E501)
[error] 273-273: line too long (82 > 79 characters)
(E501)
[error] 274-274: line too long (104 > 79 characters)
(E501)
[error] 275-275: line too long (81 > 79 characters)
(E501)
[error] 313-313: line too long (103 > 79 characters)
(E501)
[error] 321-321: line too long (82 > 79 characters)
(E501)
[error] 326-326: line too long (88 > 79 characters)
(E501)
[error] 327-327: line too long (89 > 79 characters)
(E501)
[error] 328-328: line too long (103 > 79 characters)
(E501)
[error] 364-364: line too long (85 > 79 characters)
(E501)
[error] 365-365: line too long (86 > 79 characters)
(E501)
[error] 366-366: line too long (85 > 79 characters)
(E501)
[error] 392-392: line too long (84 > 79 characters)
(E501)
[error] 393-393: line too long (87 > 79 characters)
(E501)
[error] 405-405: line too long (88 > 79 characters)
(E501)
[error] 406-406: line too long (106 > 79 characters)
(E501)
[error] 407-407: line too long (114 > 79 characters)
(E501)
[error] 436-436: line too long (83 > 79 characters)
(E501)
[error] 437-437: line too long (84 > 79 characters)
(E501)
[error] 459-459: line too long (82 > 79 characters)
(E501)
[error] 469-469: line too long (85 > 79 characters)
(E501)
[error] 470-470: line too long (96 > 79 characters)
(E501)
[error] 496-496: line too long (81 > 79 characters)
(E501)
[error] 500-500: line too long (82 > 79 characters)
(E501)
src/rai_bench/rai_bench/tool_calling_agent/tasks/basic.py
[error] 24-24: line too long (87 > 79 characters)
(E501)
[error] 32-32: line too long (171 > 79 characters)
(E501)
[error] 40-40: line too long (179 > 79 characters)
(E501)
[error] 48-48: line too long (164 > 79 characters)
(E501)
[error] 66-66: line too long (88 > 79 characters)
(E501)
[error] 103-103: line too long (109 > 79 characters)
(E501)
[error] 122-122: line too long (90 > 79 characters)
(E501)
[error] 140-140: line too long (91 > 79 characters)
(E501)
[error] 158-158: line too long (98 > 79 characters)
(E501)
[error] 175-175: line too long (98 > 79 characters)
(E501)
[error] 193-193: line too long (93 > 79 characters)
(E501)
[error] 195-195: line too long (97 > 79 characters)
(E501)
[error] 212-212: line too long (97 > 79 characters)
(E501)
[error] 213-213: line too long (85 > 79 characters)
(E501)
[error] 231-231: line too long (94 > 79 characters)
(E501)
[error] 232-232: line too long (85 > 79 characters)
(E501)
[error] 233-233: line too long (88 > 79 characters)
(E501)
src/rai_bench/rai_bench/tool_calling_agent/tasks/manipulation.py
[error] 43-43: line too long (80 > 79 characters)
(E501)
[error] 153-153: line too long (81 > 79 characters)
(E501)
[error] 175-175: line too long (81 > 79 characters)
(E501)
[error] 193-193: line too long (83 > 79 characters)
(E501)
[error] 194-194: line too long (80 > 79 characters)
(E501)
[error] 195-195: line too long (90 > 79 characters)
(E501)
[error] 210-210: line too long (81 > 79 characters)
(E501)
[error] 237-237: line too long (82 > 79 characters)
(E501)
[error] 286-286: line too long (83 > 79 characters)
(E501)
[error] 313-313: line too long (86 > 79 characters)
(E501)
[error] 343-343: line too long (84 > 79 characters)
(E501)
[error] 370-370: line too long (81 > 79 characters)
(E501)
[error] 383-383: line too long (83 > 79 characters)
(E501)
[error] 386-386: line too long (93 > 79 characters)
(E501)
[error] 387-387: line too long (88 > 79 characters)
(E501)
[error] 388-388: line too long (98 > 79 characters)
(E501)
🪛 Ruff (0.11.9)
src/rai_bench/rai_bench/tool_calling_agent/predefined/manipulation_tasks.py
54-54: Do not use mutable data structures for argument defaults. Replace with None; initialize within function (B006)
55-59: Do not use mutable data structures for argument defaults. Replace with None; initialize within function (B006)
60-60: Do not use mutable data structures for argument defaults. Replace with None; initialize within function (B006)
src/rai_bench/rai_bench/tool_calling_agent/predefined/navigation_tasks.py
72-72: Do not use mutable data structures for argument defaults. Replace with None; initialize within function (B006)
73-77: Do not use mutable data structures for argument defaults. Replace with None; initialize within function (B006)
78-78: Do not use mutable data structures for argument defaults. Replace with None; initialize within function (B006)
src/rai_bench/rai_bench/tool_calling_agent/predefined/basic_tasks.py
244-244: Do not use mutable data structures for argument defaults. Replace with None; initialize within function (B006)
245-249: Do not use mutable data structures for argument defaults. Replace with None; initialize within function (B006)
250-250: Do not use mutable data structures for argument defaults. Replace with None; initialize within function (B006)
src/rai_bench/rai_bench/tool_calling_agent/predefined/spatial_reasoning_tasks.py
109-109: Do not use mutable data structures for argument defaults. Replace with None; initialize within function (B006)
110-114: Do not use mutable data structures for argument defaults. Replace with None; initialize within function (B006)
115-115: Do not use mutable data structures for argument defaults. Replace with None; initialize within function (B006)
src/rai_bench/rai_bench/tool_calling_agent/predefined/custom_interfaces_tasks.py
59-59: Do not use mutable data structures for argument defaults. Replace with None; initialize within function (B006)
60-64: Do not use mutable data structures for argument defaults. Replace with None; initialize within function (B006)
65-65: Do not use mutable data structures for argument defaults. Replace with None; initialize within function (B006)
src/rai_bench/rai_bench/tool_calling_agent/predefined/tasks.py
31-31: Do not use mutable data structures for argument defaults. Replace with None; initialize within function (B006)
32-32: Do not use mutable data structures for argument defaults. Replace with None; initialize within function (B006)
33-37: Do not use mutable data structures for argument defaults. Replace with None; initialize within function (B006)
38-38: Do not use mutable data structures for argument defaults. Replace with None; initialize within function (B006)
src/rai_bench/rai_bench/tool_calling_agent/tasks/custom_interfaces.py
198-198: Do not use mutable data structures for argument defaults. Replace with None; initialize within function (B006)
236-236: Do not use mutable data structures for argument defaults. Replace with None; initialize within function (B006)
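For context on the B006 findings: Python evaluates a default argument once, when the function is defined, so a list default is shared across calls. The sketch below uses two made-up helper functions to show the pitfall and the None-sentinel pattern the rule asks for; it is not code from the PR.

```python
from typing import List, Optional


def append_shared(value: int, bucket: List[int] = []) -> List[int]:  # noqa: B006
    """Pitfall: the default list is created once and reused by every call."""
    bucket.append(value)
    return bucket


def append_fresh(value: int, bucket: Optional[List[int]] = None) -> List[int]:
    """None sentinel: a new list is created on each call that omits bucket."""
    if bucket is None:
        bucket = []
    bucket.append(value)
    return bucket


first_shared = append_shared(1)
second_shared = append_shared(2)  # same list object as first_shared
first_fresh = append_fresh(1)
second_fresh = append_fresh(2)  # independent list
```

After these calls both `first_shared` and `second_shared` refer to the same list `[1, 2]`, while the fresh variants hold `[1]` and `[2]`. If the flagged functions only read their default lists, the risk stays latent, but the sentinel pattern removes it entirely.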
🪛 LanguageTool
docs/simulation_and_benchmarking/rai_bench.md
[uncategorized] ~128-~128: This verb does not appear to agree with the subject. Consider using a different form.
Context: ...are grouped by categories: - Basic - require retrieving info from certain topics - ...
(AI_EN_LECTOR_REPLACEMENT_VERB_AGREEMENT)
[uncategorized] ~136-~136: Possible missing comma found.
Context: ...flects the difficulty. When creating a Task you can define few params: ```python c...
(AI_HYDRA_LEO_MISSING_COMMA)
[uncategorized] ~136-~136: You might be missing the article “a” here.
Context: ...y. When creating a Task you can define few params: ```python class TaskArgs(BaseM...
(AI_EN_LECTOR_MISSING_DETERMINER_A)
[uncategorized] ~149-~149: You might be missing the article “an” here.
Context: ...l_calls - How many extra tool calls can agent make and still pass the Task. If you w...
(AI_EN_LECTOR_MISSING_DETERMINER_AN)
🪛 Pylint (3.3.7)
src/rai_bench/rai_bench/tool_calling_agent/tasks/navigation.py
[refactor] 155-160: Unnecessary "elif" after "return", remove the leading "el" from "elif"
(R1705)
[refactor] 170-179: Unnecessary "elif" after "return", remove the leading "el" from "elif"
(R1705)
[refactor] 190-199: Unnecessary "elif" after "return", remove the leading "el" from "elif"
(R1705)
[refactor] 210-219: Unnecessary "elif" after "return", remove the leading "el" from "elif"
(R1705)
[refactor] 230-239: Unnecessary "elif" after "return", remove the leading "el" from "elif"
(R1705)
src/rai_bench/rai_bench/tool_calling_agent/tasks/spatial.py
[refactor] 55-55: Too few public methods (0/2)
(R0903)
[refactor] 59-59: Too few public methods (0/2)
(R0903)
[refactor] 72-72: Too few public methods (0/2)
(R0903)
[refactor] 111-116: Unnecessary "elif" after "return", remove the leading "el" from "elif"
(R1705)
[refactor] 147-156: Unnecessary "elif" after "return", remove the leading "el" from "elif"
(R1705)
src/rai_bench/rai_bench/tool_calling_agent/interfaces.py
[refactor] 458-458: Too few public methods (0/2)
(R0903)
src/rai_bench/rai_bench/tool_calling_agent/tasks/custom_interfaces.py
[refactor] 104-109: Unnecessary "elif" after "return", remove the leading "el" from "elif"
(R1705)
[refactor] 178-187: Unnecessary "elif" after "return", remove the leading "el" from "elif"
(R1705)
[refactor] 193-193: Too many arguments (7/5)
(R0913)
[refactor] 193-193: Too many positional arguments (7/5)
(R0917)
[refactor] 215-225: Unnecessary "elif" after "return", remove the leading "el" from "elif"
(R1705)
[refactor] 231-231: Too many arguments (9/5)
(R0913)
[refactor] 231-231: Too many positional arguments (9/5)
(R0917)
[refactor] 264-277: Unnecessary "elif" after "return", remove the leading "el" from "elif"
(R1705)
[refactor] 283-283: Too many arguments (11/5)
(R0913)
[refactor] 283-283: Too many positional arguments (11/5)
(R0917)
[refactor] 318-330: Unnecessary "elif" after "return", remove the leading "el" from "elif"
(R1705)
[refactor] 336-336: Too many arguments (6/5)
(R0913)
[refactor] 336-336: Too many positional arguments (6/5)
(R0917)
[refactor] 356-367: Unnecessary "elif" after "return", remove the leading "el" from "elif"
(R1705)
[refactor] 373-373: Too many arguments (8/5)
(R0913)
[refactor] 373-373: Too many positional arguments (8/5)
(R0917)
[refactor] 398-408: Unnecessary "elif" after "return", remove the leading "el" from "elif"
(R1705)
[refactor] 429-439: Unnecessary "elif" after "return", remove the leading "el" from "elif"
(R1705)
[refactor] 445-445: Too many arguments (6/5)
(R0913)
[refactor] 445-445: Too many positional arguments (6/5)
(R0917)
[refactor] 462-472: Unnecessary "elif" after "return", remove the leading "el" from "elif"
(R1705)
[refactor] 493-503: Unnecessary "elif" after "return", remove the leading "el" from "elif"
(R1705)
src/rai_bench/rai_bench/tool_calling_agent/tasks/basic.py
[refactor] 78-83: Unnecessary "elif" after "return", remove the leading "el" from "elif"
(R1705)
[refactor] 97-105: Unnecessary "elif" after "return", remove the leading "el" from "elif"
(R1705)
[refactor] 115-123: Unnecessary "elif" after "return", remove the leading "el" from "elif"
(R1705)
[refactor] 133-141: Unnecessary "elif" after "return", remove the leading "el" from "elif"
(R1705)
[refactor] 151-159: Unnecessary "elif" after "return", remove the leading "el" from "elif"
(R1705)
[refactor] 169-177: Unnecessary "elif" after "return", remove the leading "el" from "elif"
(R1705)
[refactor] 187-196: Unnecessary "elif" after "return", remove the leading "el" from "elif"
(R1705)
[refactor] 206-215: Unnecessary "elif" after "return", remove the leading "el" from "elif"
(R1705)
[refactor] 225-235: Unnecessary "elif" after "return", remove the leading "el" from "elif"
(R1705)
src/rai_bench/rai_bench/tool_calling_agent/tasks/manipulation.py
[refactor] 135-140: Unnecessary "elif" after "return", remove the leading "el" from "elif"
(R1705)
[refactor] 187-196: Unnecessary "elif" after "return", remove the leading "el" from "elif"
(R1705)
[refactor] 231-240: Unnecessary "elif" after "return", remove the leading "el" from "elif"
(R1705)
[refactor] 250-259: Unnecessary "elif" after "return", remove the leading "el" from "elif"
(R1705)
[refactor] 280-289: Unnecessary "elif" after "return", remove the leading "el" from "elif"
(R1705)
[refactor] 305-314: Unnecessary "elif" after "return", remove the leading "el" from "elif"
(R1705)
[refactor] 335-344: Unnecessary "elif" after "return", remove the leading "el" from "elif"
(R1705)
[refactor] 380-390: Unnecessary "elif" after "return", remove the leading "el" from "elif"
(R1705)
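The R1705 (`no-else-return`) hints above all ask for the same mechanical change. A minimal sketch of it follows; the function names and prompt strings are hypothetical and not taken from the PR.

```python
def get_system_prompt_before(n_shots: int) -> str:
    # Pattern flagged by pylint R1705: `elif`/`else` following a `return`.
    if n_shots == 0:
        return "zero-shot prompt"
    elif n_shots == 2:
        return "two-shot prompt"
    else:
        return "five-shot prompt"


def get_system_prompt_after(n_shots: int) -> str:
    # Same behavior: every branch returns, so plain `if` statements suffice.
    if n_shots == 0:
        return "zero-shot prompt"
    if n_shots == 2:
        return "two-shot prompt"
    return "five-shot prompt"
```

Both versions behave identically, so this is purely a readability cleanup.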
🔇 Additional comments (23)
src/rai_bench/pyproject.toml (1)
3-3: Version bump appropriately reflects the feature additions.
The version change from "0.1.0" to "0.2.0" correctly follows semantic versioning to indicate new features (prompt detail levels, few-shot examples, task modularization) without breaking changes.
src/rai_bench/rai_bench/examples/tool_calling_agent.py (1)
37-38: Integration of new task parameters looks good.
The addition of `n_shots` and `prompt_detail` parameters correctly extends the example to support the new prompt parameterization features introduced in this PR.
src/rai_bench/rai_bench/tool_calling_agent/benchmark.py (1)
163-166: Enhanced TaskResult metadata looks correct.
The changes appropriately capture the new task configuration parameters:
- Using `get_base_prompt()` for consistent task identification
- Adding `examples_in_system_prompt` and `prompt_detail` for enriched result metadata
These additions will enable better filtering and analysis in the results processing pipeline.
src/rai_bench/rai_bench/tool_calling_agent/predefined/__init__.py (1)
15-27: Excellent module organization following Python best practices.
The centralized import/export pattern with explicit `__all__` definition provides a clean public API while maintaining modular task organization. This supports the PR objective of separating predefined tasks by type for improved code readability.
src/rai_bench/rai_bench/examples/benchmarking_models.py (3)
23-24: Simplified model configuration for focused testing.
Reducing to a single model and vendor streamlines the example for demonstrating the new parameterization features.
39-48: New task parameterization features integrated correctly.
The addition of the `custom_interfaces` task type and the new `N_shots` and `prompt_detail` parameters demonstrates the enhanced benchmark configuration capabilities introduced in this PR. The list format allows for testing multiple parameter combinations.
56-56: Focused benchmark configuration for demonstration.
Reducing to a single benchmark configuration (tool_conf) simplifies the example while showcasing the new parameterization features.
src/rai_bench/rai_bench/tool_calling_agent/predefined/manipulation_tasks.py (1)
68-97: LGTM! Well-structured task generation logic.
The nested loops create a comprehensive set of parameterized tasks. The use of `TaskArgs` to encapsulate configuration parameters is clean and the validator assignment is appropriate for the task types.
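To illustrate the nested-loop generation described in this comment, here is a minimal sketch that builds one configuration per parameter combination. `TaskArgs` below is a hypothetical dataclass stand-in (the field names are assumptions), and `itertools.product` is used in place of explicit nested loops; the PR's actual code may differ.

```python
from dataclasses import dataclass
from itertools import product
from typing import List


@dataclass(frozen=True)
class TaskArgs:
    # Hypothetical stand-in for the PR's TaskArgs; field names are assumed.
    extra_tool_calls: int
    prompt_detail: str
    examples_in_system_prompt: int


def build_task_args(
    extra_tool_calls: List[int],
    prompt_detail: List[str],
    n_shots: List[int],
) -> List[TaskArgs]:
    # product() yields every (extra, detail, shots) combination,
    # which is equivalent to three nested for-loops.
    return [
        TaskArgs(
            extra_tool_calls=extra,
            prompt_detail=detail,
            examples_in_system_prompt=shots,
        )
        for extra, detail, shots in product(extra_tool_calls, prompt_detail, n_shots)
    ]
```

With 2 values for extra calls, 3 detail levels and 3 shot counts, this produces 18 configurations, each of which would then be passed to every task class of a given type.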
docs/simulation_and_benchmarking/rai_bench.md (1)
139-150: LGTM! Clear documentation of TaskArgs parameters.
The code snippet effectively illustrates the new `TaskArgs` configuration options, and their purposes are well explained.
src/rai_bench/rai_bench/results_processing/data_loading.py (2)
73-84: LGTM! Clean refactoring using dictionary unpacking.
The use of dictionary unpacking with explicit type conversions is a good improvement that makes the code more maintainable and aligns well with the new TaskResult fields.
101-106: LGTM! Consistent use of dictionary unpacking pattern.
The refactoring follows the same clean pattern as the TaskResult conversion function.
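The dictionary-unpacking pattern praised here can be sketched as follows. `TaskResultRow` and its fields are hypothetical and only loosely modeled on the metadata named in this review; the real `TaskResult` in the PR has its own field set.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class TaskResultRow:
    # Hypothetical subset of result fields, for illustration only.
    task_prompt: str
    prompt_detail: str
    examples_in_system_prompt: int
    score: float


def row_to_result(row: Dict[str, str]) -> TaskResultRow:
    # Values loaded from CSV arrive as strings, so convert the typed
    # fields explicitly before unpacking the dict into the constructor.
    converted = {
        "task_prompt": row["task_prompt"],
        "prompt_detail": row["prompt_detail"],
        "examples_in_system_prompt": int(row["examples_in_system_prompt"]),
        "score": float(row["score"]),
    }
    return TaskResultRow(**converted)
```

Keeping the conversions in one dict makes it easy to see which fields are coerced when new metadata columns are added.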
src/rai_bench/rai_bench/test_models.py (1)
198-199: Good integration of new parameters.
The addition of `prompt_detail` and `n_shots` parameters to the `get_tasks` function call correctly implements the enhanced task generation capabilities.
src/rai_bench/rai_bench/utils.py (1)
46-62: Excellent CLI interface additions.
The new command-line arguments for `--prompt-detail` and `--n-shots` provide the necessary interface for the enhanced benchmark configuration. The choices are well-defined and match the configuration class, and the help text is clear and informative.
src/rai_bench/rai_bench/tool_calling_agent/predefined/custom_interfaces_tasks.py (1)
69-87: Well-structured task generation logic.
The nested loop structure correctly generates tasks for all parameter combinations, and the use of `TaskArgs` provides a clean abstraction for task configuration.
src/rai_bench/rai_bench/tool_calling_agent/predefined/navigation_tasks.py (2)
36-62: Well-defined navigation task specifications.
The ROS2 action specifications are correctly structured with appropriate expected fields for navigation, spinning, and drive-on-heading actions.
90-109: Efficient task generation using extend.
Good use of `tasks.extend()` to add multiple tasks at once, and the task instantiation covers all the navigation task types appropriately.
src/rai_bench/rai_bench/tool_calling_agent/predefined/spatial_reasoning_tasks.py (2)
119-194: Excellent task complexity categorization.
The categorization of spatial reasoning tasks into easy (object presence), medium (counting/state), and hard (spatial relationships) is well-thought-out and provides good coverage of different visual reasoning capabilities.
205-270: Comprehensive task generation with good organization.
The task generation covers all complexity levels and response types systematically. The code structure is clear and maintainable.
src/rai_bench/rai_bench/tool_calling_agent/interfaces.py (2)
458-464: LGTM! Well-designed configuration model.
The `TaskArgs` model provides a clean interface for task configuration with appropriate defaults and type constraints using `Literal` types.
Note: The pylint warning about too few public methods can be safely ignored for Pydantic data models.
468-502: Excellent refactoring of the Task interface!
The changes improve the design in several ways:
- Using `TaskArgs` simplifies task initialization and makes it more extensible
- Making `type` a class attribute is cleaner than an abstract property
- The new `optional_tool_calls_number` property adds flexibility for tasks that may make preliminary calls
- The updated `max_tool_calls_number` calculation correctly includes all allowed calls
- The `get_base_prompt()` method standardizes prompt handling across tasks
Also applies to: 538-553, 574-580
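The tool-call budget discussed in this comment can be sketched roughly as below. This is an illustration under stated assumptions, not the PR's implementation: the dataclass stands in for the Pydantic `TaskArgs`, `required_calls` and `GetImageTask` are hypothetical names, and the sum used for `max_tool_calls_number` is one plausible reading of "includes all allowed calls".

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class TaskArgs:
    # Stand-in for the Pydantic model; only the field used here is shown.
    extra_tool_calls: int = 0


class Task(ABC):
    def __init__(self, task_args: TaskArgs) -> None:
        self.extra_tool_calls = task_args.extra_tool_calls

    @property
    @abstractmethod
    def required_calls(self) -> int:
        """Minimal number of tool calls needed to solve the task."""

    @property
    def optional_tool_calls_number(self) -> int:
        # Calls that are allowed but not required, e.g. listing topics first.
        return 0

    @property
    def max_tool_calls_number(self) -> int:
        # Assumed formula: required + optional + user-configured extra calls.
        return (
            self.required_calls
            + self.optional_tool_calls_number
            + self.extra_tool_calls
        )


class GetImageTask(Task):
    @property
    def required_calls(self) -> int:
        return 1  # the image request itself

    @property
    def optional_tool_calls_number(self) -> int:
        return 1  # listing topics beforehand is neither required nor penalized
```

Under these assumptions, `GetImageTask(TaskArgs(extra_tool_calls=2)).max_tool_calls_number` evaluates to 4.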
src/rai_bench/rai_bench/tool_calling_agent/tasks/navigation.py (1)
20-28: Good modularization of interface definitions.
The imports from `mocked_ros2_interfaces` and the combination of common and navigation-specific constants provide a clean separation of concerns.
Also applies to: 103-120
src/rai_bench/rai_bench/tool_calling_agent/tasks/basic.py (1)
57-84: Well-designed BasicTask base class.
The base class provides a unified set of tools for all basic tasks and correctly implements the optional tool calls pattern. The system prompt selection is consistent with other task modules.
src/rai_bench/rai_bench/tool_calling_agent/tasks/manipulation.py (1)
92-161: Excellent class hierarchy design.
The separation of `ManipulationTask` and `GrabTask` provides good abstraction layers. The use of `**kwargs` allows flexibility for subclass-specific parameters while maintaining a clean interface through `TaskArgs`.
src/rai_bench/rai_bench/tool_calling_agent/predefined/manipulation_tasks.py
src/rai_bench/rai_bench/tool_calling_agent/predefined/custom_interfaces_tasks.py
src/rai_bench/rai_bench/tool_calling_agent/predefined/navigation_tasks.py
src/rai_bench/rai_bench/results_processing/visualise/tool_calling_agent_display.py
@jmatejcz thank you for this PR! I ran the benchmark and have a couple of questions:
- Are all these API call logs required?
(rai-framework-py3.10) robo-pc-005 ➜ rai git:(jm/feat/tool-calling-tasks) ✗ python src/rai_bench/rai_bench/examples/benchmarking_models.py
UserWarning: <built-in function allocate_lock> is not a Python type (it may be an instance of an object), Pydantic will allow any object with no validation since we
cannot even enforce that the input is an instance of the given type. To get rid of this error wrap the type with `pydantic.SkipValidation`.
2025-06-30 10:26:05 robo-pc-005 httpx[1634151] INFO HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"
2025-06-30 10:26:05 robo-pc-005 httpx[1634151] INFO HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"
2025-06-30 10:26:06 robo-pc-005 httpx[1634151] INFO HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"
I also left some questions in the code.
- My main concern is the level of customization of the brief, moderate and descriptive prompts. I am wondering how well they reflect practical use cases. Usually a more extended system prompt contains few-shot examples, whereas the current more complex prompts are more generic. Did you notice a performance increase with more complex prompts?
- Could you share some example results from the benchmark?
src/rai_bench/rai_bench/tool_calling_agent/predefined/spatial_reasoning_tasks.py
prompt_detail: List[Literal["brief", "moderate", "descriptive"]] = [
    "brief",
    "moderate",
    "descriptive",
],
Is the meaning of this argument described anywhere else (besides the PR description)?
same as above
As discussed, I think the moderate level doesn't provide additional information, so it would be better to remove it.
I think adding an example in the docs of how `prompt_detail` is set in the predefined tasks would be better, because it's hard to explain without an example.
@boczekbartek thank you for the review. Yes, I see that. There is the level of descriptiveness of a Task prompt, and there is the number of examples in the system prompt; they are 2 different params. About the Task prompt:
supress info httpx logs
@boczekbartek adjusted the manipulation_config name and suppressed httpx info logs here: 308c3ce
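For reference, a common way to silence per-request INFO lines like the ones quoted earlier in this thread is to raise the level of the `httpx` logger. This is a sketch of that standard-library approach; whether commit 308c3ce does exactly this is an assumption.

```python
import logging

# The request lines in the log above are emitted by the logger named "httpx"
# at INFO level. Raising its threshold to WARNING drops them while keeping
# warnings and errors from the HTTP client visible.
logging.getLogger("httpx").setLevel(logging.WARNING)
```

Setting the level on the named logger leaves the benchmark's own INFO output untouched.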
@jmatejcz Thank you for applying the changes. LGTM!
Purpose
Unify approach to defining prompt for different types of tasks.
Unify mocks of interfaces, topics, services and actions across tasks.
Provide different types of prompts.
Proposed Changes
Introduced new levels of system prompts that the user can select: the `n_shots` param defines how many examples are in the system prompt. 0, 2 and 5 are available now. Every Task type now has 3 system prompts available.
New levels of Task prompts that the user can select: the `prompt_detail` param defines how descriptive the prompt is. There are:
- `brief` - only a short command, like: `Get RGB camera image.`
- `moderate` - slightly expanded by adding some context, like: `Get RGB camera image from the camera.`
- `descriptive` - a detailed explanation of what can be done to accomplish the task, like: `Get RGB camera image from the robot's camera system. You can explore available camera topics and capture the RGB color image.`
Added an optional tool calls number to each Task. This solves the problem where, for example, getting an image using get_ros2_image requires only 1 tool call, but listing the topics before doing that should not be considered an error or an `extra tool call`. The same holds the other way around: not listing the topics is not an error either. In this case the Task has its optional tool calls number set to 1.
Merged mock interfaces, topics, services and actions. Moved them to a separate file and split them by groups.
Separated predefined tasks into several files by task type to make the code more readable.
Results now contain the prompt detail and how many examples were in the system prompt. Additionally, the task prompt is saved as `base prompt`, which is the same regardless of the prompt_detail param. This is for results processing, so that the same task with a different prompt detail level is not classified as a separate task.
Adjusted the visualisation script and docs to the changes. The UI now has drill-down filters on tasks.
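A minimal sketch of how the prompt detail levels and the shared base prompt might fit together when processing results. The prompt strings are the examples from this description; the function names and the result dictionaries are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List

BASE_PROMPT = "Get RGB camera image."


def get_prompt(prompt_detail: str) -> str:
    # Returns the task prompt for the requested level of detail.
    if prompt_detail == "brief":
        return BASE_PROMPT
    if prompt_detail == "moderate":
        return "Get RGB camera image from the camera."
    return (
        "Get RGB camera image from the robot's camera system. "
        "You can explore available camera topics and capture the RGB color image."
    )


def group_by_base_prompt(results: List[Dict[str, str]]) -> Dict[str, List[str]]:
    # Results are keyed by the base prompt, so the same task run with
    # different detail levels ends up in a single group.
    grouped: Dict[str, List[str]] = defaultdict(list)
    for result in results:
        grouped[result["base_prompt"]].append(result["prompt_detail"])
    return dict(grouped)


results = [
    {"base_prompt": BASE_PROMPT, "prompt_detail": d, "prompt": get_prompt(d)}
    for d in ("brief", "moderate", "descriptive")
]
grouped = group_by_base_prompt(results)
```

Here `grouped` has a single key with three entries, which is the behavior the `base prompt` field is meant to enable.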
Issues
#576 - solved partially; adding concrete tasks to groups will be done in separate PRs
Testing