
feat: tool calling benchmark unified across types and prompts variety #620


Merged
merged 32 commits into from
Jul 1, 2025

Conversation

jmatejcz
Contributor

@jmatejcz jmatejcz commented Jun 4, 2025

Purpose

Unify the approach to defining prompts across different types of tasks.
Unify the mocks of interfaces, topics, services and actions across tasks.
Provide different types of prompts.

Proposed Changes

  • Introduced new levels of system prompts, selectable by the user -> the n_shots param defines how many examples the system prompt contains. 0, 2 and 5 are available now, so every Task type now has 3 system prompts available.

  • New levels of Task prompts, selectable by the user -> the prompt_detail param defines how descriptive the prompt is. There are:

    • brief - only short command, like: Get RGB camera image.
    • moderate - slightly expanded by adding some context like: Get RGB camera image from the camera.
    • descriptive - detailed explanation containing what can be done to accomplish the task, like: Get RGB camera image from the robot's camera system. You can explore available camera topics and capture the RGB color image.

    ⚠️ Warning: I'm not sure about the moderate level. At first it seemed like the right approach to have some middle ground, but now I feel it's not really helping much. I didn't see much, or even any, improvement from brief to moderate. IMO the moderate level can be removed.

  • Added an optional tool calls number to each Task. This solves cases like the following: getting an image with get_ros2_image requires only 1 tool call, but listing the topics beforehand should not be counted as an error or an extra tool call. The same holds the other way around: not listing the topics is not an error either. In this case the Task has its optional tool calls number set to 1.

  • Merged the mock interfaces, topics, services and actions. Moved them to a separate file and split them into groups.

  • Separated the predefined tasks into several files by task type to make the code more readable.

  • Results now contain info about the prompt detail and how many examples the system prompt contained. Additionally, the task prompt is saved as a base prompt, which is the same regardless of the prompt_detail param. This is for processing results, so that the same task with different prompt detail levels is not classified as separate tasks.

  • Adjusted the visualisation script and docs to these changes. The UI now has drill-down filters on tasks.

⚠️ Warning: This PR does not define a large number of new tasks or adjust the task prompts, as it is already a big PR; that will be added in follow-up PRs.
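To make the parameterization above concrete, here is a minimal, self-contained sketch of how n_shots, prompt_detail, and the optional tool calls number could interact for the camera-image example. The class and method names (TaskArgs, get_prompt, get_base_prompt, max_allowed_calls) follow the PR description, but the exact signatures in rai_bench may differ; this is an illustration, not the actual implementation.

```python
from dataclasses import dataclass
from typing import Dict, Literal

PromptDetail = Literal["brief", "moderate", "descriptive"]


@dataclass
class TaskArgs:
    # Per-task configuration described in this PR (illustrative fields)
    n_shots: Literal[0, 2, 5] = 0
    prompt_detail: PromptDetail = "brief"
    extra_tool_calls: int = 0


class GetCameraImageTask:
    # Prompt variants keyed by detail level; the brief variant doubles as
    # the base prompt saved in results, so detail variants group together.
    PROMPTS: Dict[str, str] = {
        "brief": "Get RGB camera image.",
        "moderate": "Get RGB camera image from the camera.",
        "descriptive": (
            "Get RGB camera image from the robot's camera system. "
            "You can explore available camera topics and capture the "
            "RGB color image."
        ),
    }
    required_tool_calls = 1  # e.g. the get_ros2_image call itself
    optional_tool_calls = 1  # e.g. listing topics first is not an error

    def __init__(self, args: TaskArgs) -> None:
        self.args = args

    def get_base_prompt(self) -> str:
        return self.PROMPTS["brief"]

    def get_prompt(self) -> str:
        return self.PROMPTS[self.args.prompt_detail]

    def max_allowed_calls(self) -> int:
        # A run passes if it stays within required + optional + extra calls.
        return (
            self.required_tool_calls
            + self.optional_tool_calls
            + self.args.extra_tool_calls
        )


task = GetCameraImageTask(TaskArgs(prompt_detail="descriptive"))
print(task.max_allowed_calls())  # 2
```

The point of the optional count is visible in max_allowed_calls: an agent that lists topics before fetching the image (2 calls) and one that fetches directly (1 call) both pass with extra_tool_calls=0.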

Issues

#576 - solved partially; adding concrete tasks to the groups will come in separate PRs

Testing

python src/rai_bench/rai_bench/examples/benchmarking_models.py
streamlit run src/rai_bench/rai_bench/examples/visualise_streamlit.py

Summary by CodeRabbit

  • New Features

    • Added support for configuring prompt detail levels and number of few-shot examples in tool-calling agent benchmarks.
    • Introduced new categories and flexible parameterization for task generation, including prompt detail and shot count.
    • Enhanced performance visualization and filtering with new breakdowns by prompt detail and number of examples.
  • Improvements

    • Standardized and modularized task definitions for basic, manipulation, navigation, custom interfaces, and spatial reasoning.
    • Unified and clarified documentation and tutorials to reflect new configuration options and usage patterns.
    • Improved prompt generation with selectable detail levels and few-shot variants across all task types.
    • Enhanced UI for task analysis with additional filtering options.
  • Bug Fixes

    • Corrected and clarified task descriptions and configuration options in documentation.
  • Chores

    • Updated package version to 0.2.0.

@jmatejcz jmatejcz force-pushed the jm/feat/tool-calling-tasks branch from b2a12ba to c3b1b35 Compare June 4, 2025 13:34
@jmatejcz jmatejcz marked this pull request as ready for review June 4, 2025 13:39
@jmatejcz jmatejcz requested a review from maciejmajek June 4, 2025 13:49
@jmatejcz jmatejcz force-pushed the jm/feat/tool-calling-tasks branch from c3b1b35 to 0c8cb9d Compare June 5, 2025 08:59
@jmatejcz jmatejcz removed the request for review from maciejmajek June 5, 2025 09:09
@jmatejcz jmatejcz marked this pull request as draft June 5, 2025 09:09
@jmatejcz jmatejcz marked this pull request as ready for review June 5, 2025 12:12
@jmatejcz jmatejcz requested a review from maciejmajek June 5, 2025 12:12
@jmatejcz jmatejcz force-pushed the jm/feat/tool-calling-tasks branch from a5e9176 to d532672 Compare June 6, 2025 09:13
@jmatejcz
Contributor Author

@CodeRabbit full review

Contributor

coderabbitai bot commented Jun 23, 2025

✅ Actions performed

Full review triggered.

Contributor

coderabbitai bot commented Jun 23, 2025

Walkthrough

The changes introduce a major refactor and extension of the tool-calling agent benchmark framework. Task definitions are modularized and parameterized, supporting new prompt detail levels and few-shot prompting. The public interface for defining, configuring, and filtering tasks is unified via new data models and factory functions. Visualization and result processing functions are generalized to support richer metadata. Documentation and examples are updated accordingly.

Changes

File(s) Change Summary
docs/simulation_and_benchmarking/rai_bench.md, docs/tutorials/benchmarking.md Documentation updated to clarify task configuration, parameterization, and example usage, reflecting new API patterns and flexible benchmark options.
src/rai_bench/pyproject.toml Package version updated from 0.1.0 to 0.2.0.
src/rai_bench/rai_bench/examples/benchmarking_models.py, src/rai_bench/rai_bench/examples/tool_calling_agent.py Example scripts updated to use new task parameterization (extra_tool_calls, prompt_detail, n_shots) and filter logic.
src/rai_bench/rai_bench/results_processing/data_loading.py DataFrame row-to-domain object conversion refactored to use dictionary unpacking, simplifying object construction.
src/rai_bench/rai_bench/results_processing/data_processing.py Task details DataFrame creation extended with new filters: complexity, examples in system prompt, and prompt detail.
src/rai_bench/rai_bench/results_processing/visualise/tool_calling_agent_display.py Visualization functions generalized for arbitrary fields, detailed analysis supports multiple filters, UI enhanced with new selectors.
src/rai_bench/rai_bench/test_models.py Benchmark config for tool-calling agent updated: extra_tool_calls is now a list, new fields N_shots and prompt_detail added, and passed to task generation.
src/rai_bench/rai_bench/tool_calling_agent/benchmark.py TaskResult construction updated to include examples_in_system_prompt and prompt_detail, and use base prompt.
src/rai_bench/rai_bench/tool_calling_agent/interfaces.py New TaskArgs data model introduced; Task interface refactored to use TaskArgs, adds type, optional_tool_calls_number, and get_base_prompt.
src/rai_bench/rai_bench/tool_calling_agent/predefined/__init__.py, .../basic_tasks.py, .../custom_interfaces_tasks.py, .../manipulation_tasks.py, .../navigation_tasks.py, .../spatial_reasoning_tasks.py New modularized task definition modules added, each exporting a function to generate parameterized task lists for a specific category.
src/rai_bench/rai_bench/tool_calling_agent/predefined/tasks.py Refactored to delegate task creation to new modularized functions, supports multiple values for prompt detail and shots, filters by complexity.
src/rai_bench/rai_bench/tool_calling_agent/results_tracking.py TaskResult model extended with examples_in_system_prompt and prompt_detail fields.
src/rai_bench/rai_bench/tool_calling_agent/tasks/basic.py, .../custom_interfaces.py, .../manipulation.py, .../navigation.py, .../spatial.py All task classes refactored: unified constructor with TaskArgs, support for prompt detail and few-shot variants, modularized tool and interface definitions, new/updated prompt generation methods, and richer metadata.
src/rai_bench/rai_bench/utils.py Argument parser extended with --prompt-detail and --n-shots options for command-line configuration.
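The factory pattern described in the walkthrough expands lists of parameter values into one task instance per combination. A minimal sketch of that cross-product expansion (the function name follows the PR, but the body and return shape are illustrative, not the actual rai_bench API) might look like:

```python
from itertools import product


def get_tasks(extra_tool_calls=None, prompt_detail=None, n_shots=None):
    # None defaults avoid the mutable-default-argument pitfall flagged
    # in the review below.
    if extra_tool_calls is None:
        extra_tool_calls = [0]
    if prompt_detail is None:
        prompt_detail = ["brief", "moderate", "descriptive"]
    if n_shots is None:
        n_shots = [0, 2, 5]

    tasks = []
    for etc, detail, shots in product(extra_tool_calls, prompt_detail, n_shots):
        # One parameterized task per combination; the real implementation
        # would instantiate Task objects via TaskArgs here.
        tasks.append(
            {"extra_tool_calls": etc, "prompt_detail": detail, "n_shots": shots}
        )
    return tasks


print(len(get_tasks()))  # 1 * 3 * 3 = 9 parameter combinations
```

With the defaults this yields 9 variants per task definition, which is why the results need the base prompt and metadata fields to regroup them during analysis.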

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant CLI/Script
    participant BenchmarkConfig
    participant TaskFactory
    participant Task
    participant Benchmark
    participant ResultsProcessor

    User->>CLI/Script: Specify benchmark args (model, extra_tool_calls, prompt_detail, n_shots, etc.)
    CLI/Script->>BenchmarkConfig: Create config with lists for parameters
    CLI/Script->>TaskFactory: get_tasks(extra_tool_calls, prompt_detail, n_shots, ...)
    TaskFactory->>Task: Instantiate Task with TaskArgs for each parameter combination
    TaskFactory-->>CLI/Script: Return list of parameterized Tasks
    CLI/Script->>Benchmark: Run benchmark with Tasks and config
    Benchmark->>Task: For each Task, execute and collect results
    Benchmark->>ResultsProcessor: Pass results including examples_in_system_prompt, prompt_detail
    ResultsProcessor-->>User: Visualize and filter results by new metadata fields
sequenceDiagram
    participant User
    participant WebUI
    participant ResultsProcessor

    User->>WebUI: Select filters (task type, complexity, examples_in_system_prompt, prompt_detail)
    WebUI->>ResultsProcessor: create_task_details_dataframe with selected filters
    ResultsProcessor-->>WebUI: Filtered DataFrame with detailed task results
    WebUI-->>User: Display charts and tables grouped by selected fields
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 17

🔭 Outside diff range comments (2)
src/rai_bench/rai_bench/tool_calling_agent/predefined/tasks.py (1)

30-54: Replace mutable default arguments with None

Using mutable objects as default arguments can lead to unexpected behavior.

 def get_tasks(
-    extra_tool_calls: List[int] = [0],
-    complexities: List[Literal["easy", "medium", "hard"]] = ["easy", "medium", "hard"],
-    prompt_detail: List[Literal["brief", "moderate", "descriptive"]] = [
-        "brief",
-        "moderate",
-        "descriptive",
-    ],
-    n_shots: List[Literal[0, 2, 5]] = [0, 2, 5],
+    extra_tool_calls: List[int] | None = None,
+    complexities: List[Literal["easy", "medium", "hard"]] | None = None,
+    prompt_detail: List[Literal["brief", "moderate", "descriptive"]] | None = None,
+    n_shots: List[Literal[0, 2, 5]] | None = None,
     task_types: List[
         Literal[
             "basic",
             "manipulation",
             "navigation",
             "custom_interfaces",
             "spatial_reasoning",
         ]
-    ] = [
+    ] | None = None,
+) -> List[Task]:
+    if extra_tool_calls is None:
+        extra_tool_calls = [0]
+    if complexities is None:
+        complexities = ["easy", "medium", "hard"]
+    if prompt_detail is None:
+        prompt_detail = ["brief", "moderate", "descriptive"]
+    if n_shots is None:
+        n_shots = [0, 2, 5]
+    if task_types is None:
+        task_types = [
         "basic",
         "manipulation",
         "navigation",
         "custom_interfaces",
         "spatial_reasoning",
-    ],
-) -> List[Task]:
+        ]
     all_tasks: List[Task] = []
src/rai_bench/rai_bench/tool_calling_agent/tasks/navigation.py (1)

36-102: Fix line length violations in system prompts.

The system prompts are well-structured with good incremental examples, but several lines exceed the 79-character limit.

Apply these formatting fixes:

-ROBOT_NAVIGATION_SYSTEM_PROMPT_0_SHOT = """You are an autonomous robot connected to ros2 environment. Your main goal is to fulfill the user's requests.
+ROBOT_NAVIGATION_SYSTEM_PROMPT_0_SHOT = """You are an autonomous robot connected to ros2 environment. 
+    Your main goal is to fulfill the user's requests.
     Do not make assumptions about the environment you are currently in.
     You can use ros2 topics, services and actions to operate.

     <rule> As a first step check transforms by getting 1 message from /tf topic </rule>
-    <rule> use /cmd_vel topic very carefully. Obstacle detection works only with nav2 stack, so be careful when it is not used. </rule>>
+    <rule> use /cmd_vel topic very carefully. Obstacle detection works only with nav2 stack, 
+    so be careful when it is not used. </rule>

Similar formatting should be applied to lines 93, 100-101, and line 85.

🧹 Nitpick comments (15)
src/rai_bench/rai_bench/examples/benchmarking_models.py (1)

35-35: Fix line length violation.

The line exceeds the 79-character limit enforced by flake8.

-        extra_tool_calls=[0],  # how many extra tool calls allowed to still pass
+        extra_tool_calls=[0],  # extra tool calls allowed to still pass
src/rai_bench/rai_bench/results_processing/data_processing.py (1)

215-235: Consider refactoring the filtering logic for better maintainability.

The sequential filtering approach is correct but could be made more maintainable by using a dictionary-driven approach.

-    # Apply filters
-    if task_type:
-        all_detailed_results = [r for r in all_detailed_results if r.type == task_type]
-
-    if complexity:
-        all_detailed_results = [
-            r for r in all_detailed_results if r.complexity == complexity
-        ]
-
-    if examples_in_system_prompt:
-        all_detailed_results = [
-            r
-            for r in all_detailed_results
-            if r.examples_in_system_prompt == examples_in_system_prompt
-        ]
-
-    if prompt_detail:
-        all_detailed_results = [
-            r for r in all_detailed_results if r.prompt_detail == prompt_detail
-        ]
+    # Apply filters
+    filters = {
+        'type': task_type,
+        'complexity': complexity,
+        'examples_in_system_prompt': examples_in_system_prompt,
+        'prompt_detail': prompt_detail,
+    }
+    
+    for attr, value in filters.items():
+        if value is not None:
+            all_detailed_results = [
+                r for r in all_detailed_results if getattr(r, attr) == value
+            ]
src/rai_bench/rai_bench/tool_calling_agent/predefined/manipulation_tasks.py (1)

34-34: Fix comment formatting for consistency.

The comment blocks have formatting issues that should be addressed for code style compliance.

-########## SUBTASKS #################################################################
+# SUBTASKS
-######### VALIDATORS #########################################################################################
+# VALIDATORS

Also applies to: 44-44

docs/simulation_and_benchmarking/rai_bench.md (2)

127-135: Fix grammatical issues for better readability.

The documentation has some grammatical issues that should be corrected.

-There are predefined Tasks available which are grouped by categories:
+Predefined Tasks are available, grouped by categories:

--   Basic - require retrieving info from certain topics
+-   Basic - requires retrieving info from certain topics

136-150: Fix grammatical issues in task parameter description.

-When creating a Task you can define few params:
+When creating a Task, you can define a few parameters:

-extra_tool_calls - How many extra tool calls can agent make and still pass the Task.
+extra_tool_calls - How many extra tool calls an agent can make and still pass the Task.
src/rai_bench/rai_bench/test_models.py (1)

59-66: LGTM! Configuration changes align with PR objectives.

The addition of N_shots and prompt_detail parameters successfully implements the multi-level prompt system as described in the PR objectives. The type annotations and default values are appropriate.

Minor formatting improvement to address line length:

-    extra_tool_calls: List[int] = [0]
+    extra_tool_calls: List[int] = [0]
     complexities: List[Literal["easy", "medium", "hard"]] = ["easy", "medium", "hard"]
-    N_shots: List[Literal[0, 2, 5]] = [0, 2, 5]
-    prompt_detail: List[Literal["brief", "moderate", "descriptive"]] = [
-        "brief",
-        "moderate",
-        "descriptive",
-    ]
+    N_shots: List[Literal[0, 2, 5]] = [0, 2, 5]
+    prompt_detail: List[Literal["brief", "moderate", "descriptive"]] = [
+        "brief", "moderate", "descriptive"
+    ]
src/rai_bench/rai_bench/tool_calling_agent/predefined/custom_interfaces_tasks.py (1)

32-32: Formatting consistency improvement.

Adjust comment formatting for consistency:

-########## SUBTASKS #################################################################
+# SUBTASKS
src/rai_bench/rai_bench/tool_calling_agent/predefined/spatial_reasoning_tasks.py (1)

35-94: Consider refactoring for better maintainability.

The task inputs are well-defined, but there's some duplication between the initial definitions and the categorized versions within the function.

Consider moving the initial true_response_inputs and false_response_inputs definitions into the function or removing them if they're not used elsewhere, to reduce duplication and improve maintainability.

src/rai_bench/rai_bench/results_processing/visualise/tool_calling_agent_display.py (1)

376-376: Fix typo in UI label

-        "Select prompt decriptivness",
+        "Select prompt descriptiveness",
src/rai_bench/rai_bench/tool_calling_agent/tasks/spatial.py (3)

111-116: Simplify conditional logic by removing unnecessary elif

     def get_system_prompt(self) -> str:
         if self.n_shots == 0:
             return SPATIAL_REASONING_SYSTEM_PROMPT_0_SHOT
-        elif self.n_shots == 2:
+        if self.n_shots == 2:
             return SPATIAL_REASONING_SYSTEM_PROMPT_2_SHOT
-        else:
-            return SPATIAL_REASONING_SYSTEM_PROMPT_5_SHOT
+        return SPATIAL_REASONING_SYSTEM_PROMPT_5_SHOT

147-156: Simplify conditional logic by removing unnecessary elif

     def get_prompt(self):
         if self.prompt_detail == "brief":
             return self.get_base_prompt()
-        elif self.prompt_detail == "moderate":
+        if self.prompt_detail == "moderate":
             return f"{self.get_base_prompt()} using visual analysis"
-        else:
-            return (
-                f"{self.get_base_prompt()} using the visual analysis system. "
-                "You can examine the provided image(s) carefully to identify relevant features, "
-                "analyze the visual content, and provide a boolean response based on your observations."
-            )
+        return (
+            f"{self.get_base_prompt()} using the visual analysis system. "
+            "You can examine the provided image(s) carefully to identify relevant features, "
+            "analyze the visual content, and provide a boolean response based on your observations."
+        )

28-28: Break up long line for better readability

-SPATIAL_REASONING_SYSTEM_PROMPT_0_SHOT = """You are a helpful and knowledgeable AI assistant that specializes in interpreting and analyzing visual content. Your task is to answer questions based on the images provided to you. Please response with the use of the provided tools."""
+SPATIAL_REASONING_SYSTEM_PROMPT_0_SHOT = (
+    "You are a helpful and knowledgeable AI assistant that specializes in "
+    "interpreting and analyzing visual content. Your task is to answer questions "
+    "based on the images provided to you. Please response with the use of the provided tools."
+)
src/rai_bench/rai_bench/tool_calling_agent/tasks/navigation.py (1)

170-239: Fix line length violations in task prompt methods.

The prompt methods follow a good consistent pattern, but several lines exceed the character limit.

Consider breaking long strings across multiple lines:

-            return (
-                f"{self.get_base_prompt()} using the robotic navigation system. "
-                "You can use the navigation tools to move the robot to the specified coordinates. "
-                "First get the available actions, then set up the navigation goal to reach point (2.0, 2.0, 0.0)."
-            )
+            return (
+                f"{self.get_base_prompt()} using the robotic navigation system. "
+                "You can use the navigation tools to move the robot to the "
+                "specified coordinates. First get the available actions, then "
+                "set up the navigation goal to reach point (2.0, 2.0, 0.0)."
+            )

Note: The pylint warnings about "elif after return" are style preferences and the current code is acceptable.

src/rai_bench/rai_bench/tool_calling_agent/tasks/custom_interfaces.py (1)

193-201: Consider refactoring constructors with many parameters.

Several task constructors have 7+ parameters, which impacts readability and maintainability.

Consider these approaches to reduce parameter count:

  1. Group related parameters into configuration objects (e.g., AudioConfig, DetectionConfig)
  2. Use the builder pattern for complex task initialization
  3. Move some parameters to class-level constants if they rarely change

This would improve the API design and make the code more maintainable.

Also applies to: 231-241, 283-295, 336-346, 373-382, 445-455
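As a rough illustration of the parameter-object approach suggested above, related constructor arguments can be bundled into a small dataclass so the task constructor takes two parameters instead of seven. All names here (DetectionConfig, DetectionTask and their fields) are invented for the example and are not actual rai_bench classes:

```python
from dataclasses import dataclass


@dataclass
class DetectionConfig:
    # Related settings travel together instead of as loose constructor args
    topic: str
    frame_id: str
    confidence_threshold: float = 0.5


class DetectionTask:
    def __init__(self, config: DetectionConfig, extra_tool_calls: int = 0) -> None:
        self.config = config
        self.extra_tool_calls = extra_tool_calls


task = DetectionTask(DetectionConfig(topic="/detections", frame_id="camera_link"))
```

Besides shortening signatures, this makes it cheap to reuse one config across several task variants and keeps defaults in one place.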

src/rai_bench/rai_bench/tool_calling_agent/tasks/manipulation.py (1)

179-196: Fix line length violations in prompt methods.

The task implementations are well-structured with good error handling, but several prompt strings exceed the line limit.

Format long strings properly:

-                f"{self.get_base_prompt()} using the robotic manipulation system. "
-                "You can control the arm movement to the specified coordinates "
-                f"and perform the {self.move_to_tool_input.task} action at that location."
+                f"{self.get_base_prompt()} using the robotic manipulation "
+                "system. You can control the arm movement to the specified "
+                f"coordinates and perform the {self.move_to_tool_input.task} "
+                "action at that location."

Also applies to: 215-240, 376-390

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 650083e and 761ba5c.

📒 Files selected for processing (25)
  • docs/simulation_and_benchmarking/rai_bench.md (2 hunks)
  • docs/tutorials/benchmarking.md (6 hunks)
  • src/rai_bench/pyproject.toml (1 hunks)
  • src/rai_bench/rai_bench/examples/benchmarking_models.py (2 hunks)
  • src/rai_bench/rai_bench/examples/tool_calling_agent.py (1 hunks)
  • src/rai_bench/rai_bench/results_processing/data_loading.py (2 hunks)
  • src/rai_bench/rai_bench/results_processing/data_processing.py (3 hunks)
  • src/rai_bench/rai_bench/results_processing/visualise/tool_calling_agent_display.py (4 hunks)
  • src/rai_bench/rai_bench/test_models.py (2 hunks)
  • src/rai_bench/rai_bench/tool_calling_agent/benchmark.py (1 hunks)
  • src/rai_bench/rai_bench/tool_calling_agent/interfaces.py (5 hunks)
  • src/rai_bench/rai_bench/tool_calling_agent/predefined/__init__.py (1 hunks)
  • src/rai_bench/rai_bench/tool_calling_agent/predefined/basic_tasks.py (1 hunks)
  • src/rai_bench/rai_bench/tool_calling_agent/predefined/custom_interfaces_tasks.py (1 hunks)
  • src/rai_bench/rai_bench/tool_calling_agent/predefined/manipulation_tasks.py (1 hunks)
  • src/rai_bench/rai_bench/tool_calling_agent/predefined/navigation_tasks.py (1 hunks)
  • src/rai_bench/rai_bench/tool_calling_agent/predefined/spatial_reasoning_tasks.py (1 hunks)
  • src/rai_bench/rai_bench/tool_calling_agent/predefined/tasks.py (2 hunks)
  • src/rai_bench/rai_bench/tool_calling_agent/results_tracking.py (1 hunks)
  • src/rai_bench/rai_bench/tool_calling_agent/tasks/basic.py (1 hunks)
  • src/rai_bench/rai_bench/tool_calling_agent/tasks/custom_interfaces.py (5 hunks)
  • src/rai_bench/rai_bench/tool_calling_agent/tasks/manipulation.py (9 hunks)
  • src/rai_bench/rai_bench/tool_calling_agent/tasks/navigation.py (3 hunks)
  • src/rai_bench/rai_bench/tool_calling_agent/tasks/spatial.py (4 hunks)
  • src/rai_bench/rai_bench/utils.py (1 hunks)
🧰 Additional context used
🪛 Flake8 (7.2.0)
src/rai_bench/rai_bench/tool_calling_agent/results_tracking.py

[error] 47-47: line too long (81 > 79 characters)

(E501)


[error] 49-49: line too long (83 > 79 characters)

(E501)

src/rai_bench/rai_bench/results_processing/data_processing.py

[error] 191-191: line too long (83 > 79 characters)

(E501)


[error] 217-217: line too long (87 > 79 characters)

(E501)

src/rai_bench/rai_bench/tool_calling_agent/predefined/manipulation_tasks.py

[error] 34-34: too many leading '#' for block comment

(E266)


[error] 34-34: line too long (85 > 79 characters)

(E501)


[error] 44-44: too many leading '#' for block comment

(E266)


[error] 44-44: line too long (110 > 79 characters)

(E501)

src/rai_bench/rai_bench/tool_calling_agent/predefined/navigation_tasks.py

[error] 34-34: too many leading '#' for block comment

(E266)


[error] 34-34: line too long (85 > 79 characters)

(E501)


[error] 63-63: too many leading '#' for block comment

(E266)


[error] 63-63: line too long (110 > 79 characters)

(E501)


[error] 67-67: line too long (87 > 79 characters)

(E501)


[error] 68-68: line too long (86 > 79 characters)

(E501)

src/rai_bench/rai_bench/tool_calling_agent/predefined/basic_tasks.py

[error] 39-39: too many leading '#' for block comment

(E266)


[error] 39-39: line too long (85 > 79 characters)

(E501)


[error] 181-181: too many leading '#' for block comment

(E266)


[error] 181-181: line too long (110 > 79 characters)

(E501)


[error] 187-187: line too long (88 > 79 characters)

(E501)


[error] 188-188: line too long (88 > 79 characters)

(E501)


[error] 221-221: line too long (85 > 79 characters)

(E501)


[error] 222-222: line too long (85 > 79 characters)

(E501)


[error] 254-254: line too long (81 > 79 characters)

(E501)


[error] 283-283: line too long (84 > 79 characters)

(E501)


[error] 286-286: line too long (84 > 79 characters)

(E501)

src/rai_bench/rai_bench/examples/benchmarking_models.py

[error] 35-35: line too long (80 > 79 characters)

(E501)

src/rai_bench/rai_bench/tool_calling_agent/predefined/spatial_reasoning_tasks.py

[error] 95-95: too many leading '#' for block comment

(E266)


[error] 95-95: line too long (85 > 79 characters)

(E501)


[error] 100-100: line too long (80 > 79 characters)

(E501)


[error] 103-103: too many leading '#' for block comment

(E266)


[error] 103-103: line too long (110 > 79 characters)

(E501)


[error] 127-127: line too long (85 > 79 characters)

(E501)


[error] 162-162: line too long (87 > 79 characters)

(E501)


[error] 165-165: line too long (85 > 79 characters)

(E501)


[error] 176-176: line too long (81 > 79 characters)

(E501)

src/rai_bench/rai_bench/tool_calling_agent/predefined/custom_interfaces_tasks.py

[error] 32-32: too many leading '#' for block comment

(E266)


[error] 32-32: line too long (85 > 79 characters)

(E501)


[error] 46-46: too many leading '#' for block comment

(E266)


[error] 46-46: line too long (110 > 79 characters)

(E501)

src/rai_bench/rai_bench/results_processing/visualise/tool_calling_agent_display.py

[error] 121-121: line too long (83 > 79 characters)

(E501)


[error] 132-132: line too long (83 > 79 characters)

(E501)


[error] 147-147: line too long (81 > 79 characters)

(E501)


[error] 149-149: line too long (80 > 79 characters)

(E501)


[error] 340-340: line too long (81 > 79 characters)

(E501)


[error] 357-357: line too long (83 > 79 characters)

(E501)

src/rai_bench/rai_bench/test_models.py

[error] 60-60: line too long (86 > 79 characters)

(E501)

src/rai_bench/rai_bench/tool_calling_agent/predefined/tasks.py

[error] 32-32: line too long (87 > 79 characters)

(E501)

src/rai_bench/rai_bench/tool_calling_agent/tasks/navigation.py

[error] E501 line too long (> 79 characters): 36 (151), 85 (90), 93 (220), 100 (120), 101 (181), 145 (82), 176 (81), 177 (99), 178 (114), 193 (80), 196 (81), 197 (95), 216 (81), 217 (82), 218 (88), 236 (81), 237 (96), 238 (94)

src/rai_bench/rai_bench/tool_calling_agent/tasks/spatial.py

[error] E501 line too long (> 79 characters): 28 (280), 39 (105), 43 (82), 45 (91), 76 (86), 154 (97), 155 (104), 159 (83), 163 (87), 164 (85), 167 (89), 168 (110)

src/rai_bench/rai_bench/tool_calling_agent/tasks/custom_interfaces.py

[error] E501 line too long (> 79 characters): 74 (171), 82 (179), 89 (81), 90 (210), 175 (82), 185 (83), 186 (91), 223 (89), 224 (98), 260 (92), 267 (80), 273 (82), 274 (104), 275 (81), 313 (103), 321 (82), 326 (88), 327 (89), 328 (103), 364 (85), 365 (86), 366 (85), 392 (84), 393 (87), 405 (88), 406 (106), 407 (114), 436 (83), 437 (84), 459 (82), 469 (85), 470 (96), 496 (81), 500 (82)

src/rai_bench/rai_bench/tool_calling_agent/tasks/basic.py

[error] E501 line too long (> 79 characters): 24 (87), 32 (171), 40 (179), 48 (164), 66 (88), 103 (109), 122 (90), 140 (91), 158 (98), 175 (98), 193 (93), 195 (97), 212 (97), 213 (85), 231 (94), 232 (85), 233 (88)

src/rai_bench/rai_bench/tool_calling_agent/tasks/manipulation.py

[error] E501 line too long (> 79 characters): 43 (80), 153 (81), 175 (81), 193 (83), 194 (80), 195 (90), 210 (81), 237 (82), 286 (83), 313 (86), 343 (84), 370 (81), 383 (83), 386 (93), 387 (88), 388 (98)

🪛 Ruff (0.11.9)

B006 Do not use mutable data structures for argument defaults (replace with None; initialize within function):

src/rai_bench/rai_bench/tool_calling_agent/predefined/manipulation_tasks.py: lines 54, 55-59, 60
src/rai_bench/rai_bench/tool_calling_agent/predefined/navigation_tasks.py: lines 72, 73-77, 78
src/rai_bench/rai_bench/tool_calling_agent/predefined/basic_tasks.py: lines 244, 245-249, 250
src/rai_bench/rai_bench/tool_calling_agent/predefined/spatial_reasoning_tasks.py: lines 109, 110-114, 115
src/rai_bench/rai_bench/tool_calling_agent/predefined/custom_interfaces_tasks.py: lines 59, 60-64, 65
src/rai_bench/rai_bench/tool_calling_agent/predefined/tasks.py: lines 31, 32, 33-37, 38
src/rai_bench/rai_bench/tool_calling_agent/tasks/custom_interfaces.py: lines 198, 236
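As a quick illustration of the fix these B006 findings ask for, here is a minimal sketch (the function names are hypothetical, not from the PR):

```python
from typing import List, Optional

# Flagged pattern (B006): the default list is created once, at function
# definition time, and then shared by every call that relies on the default.
def make_tasks_bad(extra_calls: List[int] = []) -> List[int]:
    extra_calls.append(1)
    return extra_calls

# Suggested fix: default to None and initialize inside the function body,
# so each call gets a fresh list.
def make_tasks_good(extra_calls: Optional[List[int]] = None) -> List[int]:
    if extra_calls is None:
        extra_calls = []
    extra_calls.append(1)
    return extra_calls

print(make_tasks_bad())   # [1]
print(make_tasks_bad())   # [1, 1]  state leaked between calls
print(make_tasks_good())  # [1]
print(make_tasks_good())  # [1]     fresh list every call
```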

🪛 LanguageTool
docs/simulation_and_benchmarking/rai_bench.md

[uncategorized] ~128: verb may not agree with its subject ("- Basic - require retrieving info from certain topics") (AI_EN_LECTOR_REPLACEMENT_VERB_AGREEMENT)
[uncategorized] ~136: possible missing comma ("When creating a Task you can define few params") (AI_HYDRA_LEO_MISSING_COMMA)
[uncategorized] ~136: possibly missing article "a" ("you can define few params") (AI_EN_LECTOR_MISSING_DETERMINER_A)
[uncategorized] ~149: possibly missing article "an" ("How many extra tool calls can agent make") (AI_EN_LECTOR_MISSING_DETERMINER_AN)

🪛 Pylint (3.3.7)

src/rai_bench/rai_bench/tool_calling_agent/tasks/navigation.py

[refactor] R1705 Unnecessary "elif" after "return" (remove the leading "el" from "elif"): lines 155-160, 170-179, 190-199, 210-219, 230-239

src/rai_bench/rai_bench/tool_calling_agent/tasks/spatial.py

[refactor] R0903 Too few public methods (0/2): lines 55, 59, 72
[refactor] R1705 Unnecessary "elif" after "return": lines 111-116, 147-156

src/rai_bench/rai_bench/tool_calling_agent/interfaces.py

[refactor] R0903 Too few public methods (0/2): line 458

src/rai_bench/rai_bench/tool_calling_agent/tasks/custom_interfaces.py

[refactor] R1705 Unnecessary "elif" after "return": lines 104-109, 178-187, 215-225, 264-277, 318-330, 356-367, 398-408, 429-439, 462-472, 493-503
[refactor] R0913/R0917 Too many (positional) arguments: line 193 (7/5), line 231 (9/5), line 283 (11/5), line 336 (6/5), line 373 (8/5), line 445 (6/5)

src/rai_bench/rai_bench/tool_calling_agent/tasks/basic.py

[refactor] R1705 Unnecessary "elif" after "return": lines 78-83, 97-105, 115-123, 133-141, 151-159, 169-177, 187-196, 206-215, 225-235

src/rai_bench/rai_bench/tool_calling_agent/tasks/manipulation.py

[refactor] R1705 Unnecessary "elif" after "return": lines 135-140, 187-196, 231-240, 250-259, 280-289, 305-314, 335-344, 380-390
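For reference, the refactor R1705 suggests looks like this; the prompt-selector function below is a hypothetical stand-in for the flagged methods, with prompt text borrowed from the PR description:

```python
# Flagged shape (R1705): every branch returns or raises, so the
# "elif"/"else" nesting is redundant.
def get_prompt_with_elif(detail: str) -> str:
    if detail == "brief":
        return "Get RGB camera image."
    elif detail == "descriptive":
        return ("Get RGB camera image from the robot's camera system. "
                "You can explore available camera topics.")
    else:
        raise ValueError(f"unknown prompt_detail: {detail}")

# Refactored form Pylint asks for: flat early returns, no elif chain.
def get_prompt_flat(detail: str) -> str:
    if detail == "brief":
        return "Get RGB camera image."
    if detail == "descriptive":
        return ("Get RGB camera image from the robot's camera system. "
                "You can explore available camera topics.")
    raise ValueError(f"unknown prompt_detail: {detail}")
```

Both forms behave identically; the flat version just removes one level of nesting.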

🔇 Additional comments (23)
src/rai_bench/pyproject.toml (1)

3-3: Version bump appropriately reflects the feature additions.

The version change from "0.1.0" to "0.2.0" correctly follows semantic versioning to indicate new features (prompt detail levels, few-shot examples, task modularization) without breaking changes.

src/rai_bench/rai_bench/examples/tool_calling_agent.py (1)

37-38: Integration of new task parameters looks good.

The addition of n_shots and prompt_detail parameters correctly extends the example to support the new prompt parameterization features introduced in this PR.

src/rai_bench/rai_bench/tool_calling_agent/benchmark.py (1)

163-166: Enhanced TaskResult metadata looks correct.

The changes appropriately capture the new task configuration parameters:

  • Using get_base_prompt() for consistent task identification
  • Adding examples_in_system_prompt and prompt_detail for enriched result metadata

These additions will enable better filtering and analysis in the results processing pipeline.

src/rai_bench/rai_bench/tool_calling_agent/predefined/__init__.py (1)

15-27: Excellent module organization following Python best practices.

The centralized import/export pattern with explicit __all__ definition provides a clean public API while maintaining modular task organization. This supports the PR objective of separating predefined tasks by type for improved code readability.

src/rai_bench/rai_bench/examples/benchmarking_models.py (3)

23-24: Simplified model configuration for focused testing.

Reducing to a single model and vendor streamlines the example for demonstrating the new parameterization features.


39-48: New task parameterization features integrated correctly.

The addition of custom_interfaces task type and the new N_shots and prompt_detail parameters demonstrate the enhanced benchmark configuration capabilities introduced in this PR. The list format allows for testing multiple parameter combinations.


56-56: Focused benchmark configuration for demonstration.

Reducing to a single benchmark configuration (tool_conf) simplifies the example while showcasing the new parameterization features.

src/rai_bench/rai_bench/tool_calling_agent/predefined/manipulation_tasks.py (1)

68-97: LGTM! Well-structured task generation logic.

The nested loops create a comprehensive set of parameterized tasks. The use of TaskArgs to encapsulate configuration parameters is clean and the validator assignment is appropriate for the task types.

docs/simulation_and_benchmarking/rai_bench.md (1)

139-150: LGTM! Clear documentation of TaskArgs parameters.

The code snippet effectively illustrates the new TaskArgs configuration options and their purposes are well explained.

src/rai_bench/rai_bench/results_processing/data_loading.py (2)

73-84: LGTM! Clean refactoring using dictionary unpacking.

The use of dictionary unpacking with explicit type conversions is a good improvement that makes the code more maintainable and aligns well with the new TaskResult fields.


101-106: LGTM! Consistent use of dictionary unpacking pattern.

The refactoring follows the same clean pattern as the TaskResult conversion function.

src/rai_bench/rai_bench/test_models.py (1)

198-199: Good integration of new parameters.

The addition of prompt_detail and n_shots parameters to the get_tasks function call correctly implements the enhanced task generation capabilities.

src/rai_bench/rai_bench/utils.py (1)

46-62: Excellent CLI interface additions.

The new command-line arguments for --prompt-detail and --n-shots provide the necessary interface for the enhanced benchmark configuration. The choices are well-defined and match the configuration class, and the help text is clear and informative.

src/rai_bench/rai_bench/tool_calling_agent/predefined/custom_interfaces_tasks.py (1)

69-87: Well-structured task generation logic.

The nested loop structure correctly generates tasks for all parameter combinations, and the use of TaskArgs provides a clean abstraction for task configuration.

src/rai_bench/rai_bench/tool_calling_agent/predefined/navigation_tasks.py (2)

36-62: Well-defined navigation task specifications.

The ROS2 action specifications are correctly structured with appropriate expected fields for navigation, spinning, and drive-on-heading actions.


90-109: Efficient task generation using extend.

Good use of tasks.extend() to add multiple tasks at once, and the task instantiation covers all the navigation task types appropriately.

src/rai_bench/rai_bench/tool_calling_agent/predefined/spatial_reasoning_tasks.py (2)

119-194: Excellent task complexity categorization.

The categorization of spatial reasoning tasks into easy (object presence), medium (counting/state), and hard (spatial relationships) is well-thought-out and provides good coverage of different visual reasoning capabilities.


205-270: Comprehensive task generation with good organization.

The task generation covers all complexity levels and response types systematically. The code structure is clear and maintainable.

src/rai_bench/rai_bench/tool_calling_agent/interfaces.py (2)

458-464: LGTM! Well-designed configuration model.

The TaskArgs model provides a clean interface for task configuration with appropriate defaults and type constraints using Literal types.

Note: The pylint warning about too few public methods can be safely ignored for Pydantic data models.
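A minimal sketch of such a Pydantic configuration model, assuming field names from the PR description (n_shots, prompt_detail, extra_tool_calls); the defaults and exact fields are assumptions, not the merged code:

```python
from typing import Literal

from pydantic import BaseModel, ValidationError

# Hypothetical reconstruction of TaskArgs: Literal types constrain the
# allowed values at construction time, which is why R0903 does not apply.
class TaskArgs(BaseModel):
    n_shots: Literal[0, 2, 5] = 0
    prompt_detail: Literal["brief", "descriptive"] = "brief"
    extra_tool_calls: int = 0

args = TaskArgs(n_shots=2, prompt_detail="descriptive")
print(args.n_shots, args.prompt_detail)  # 2 descriptive

# Out-of-range values are rejected by validation:
try:
    TaskArgs(n_shots=3)
except ValidationError:
    print("rejected: n_shots=3 is not in {0, 2, 5}")
```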


468-502: Excellent refactoring of the Task interface!

The changes improve the design in several ways:

  • Using TaskArgs simplifies task initialization and makes it more extensible
  • Making type a class attribute is cleaner than an abstract property
  • The new optional_tool_calls_number property adds flexibility for tasks that may make preliminary calls
  • The updated max_tool_calls_number calculation correctly includes all allowed calls
  • The get_base_prompt() method standardizes prompt handling across tasks

Also applies to: 538-553, 574-580

src/rai_bench/rai_bench/tool_calling_agent/tasks/navigation.py (1)

20-28: Good modularization of interface definitions.

The imports from mocked_ros2_interfaces and the combination of common and navigation-specific constants provide a clean separation of concerns.

Also applies to: 103-120

src/rai_bench/rai_bench/tool_calling_agent/tasks/basic.py (1)

57-84: Well-designed BasicTask base class.

The base class provides a unified set of tools for all basic tasks and correctly implements the optional tool calls pattern. The system prompt selection is consistent with other task modules.

src/rai_bench/rai_bench/tool_calling_agent/tasks/manipulation.py (1)

92-161: Excellent class hierarchy design.

The separation of ManipulationTask and GrabTask provides good abstraction layers. The use of **kwargs allows flexibility for subclass-specific parameters while maintaining a clean interface through TaskArgs.

@boczekbartek boczekbartek self-requested a review June 25, 2025 09:40
Member

@boczekbartek boczekbartek left a comment


@jmatejcz thank you for this PR! I ran the benchmark. I have a couple of questions:

  1. Are all these API call logs required?
(rai-framework-py3.10) robo-pc-005 ➜  rai git:(jm/feat/tool-calling-tasks) ✗ python src/rai_bench/rai_bench/examples/benchmarking_models.py                   
UserWarning: <built-in function allocate_lock> is not a Python type (it may be an instance of an object), Pydantic will allow any object with no validation since we
 cannot even enforce that the input is an instance of the given type. To get rid of this error wrap the type with `pydantic.SkipValidation`.                        
2025-06-30 10:26:05 robo-pc-005 httpx[1634151] INFO HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"
2025-06-30 10:26:05 robo-pc-005 httpx[1634151] INFO HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"                                            
2025-06-30 10:26:06 robo-pc-005 httpx[1634151] INFO HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"  

I also left some questions in the code.

  1. My main consideration is the level of customization of the brief, moderate and descriptive prompts. I am wondering how well they reflect practical use cases. Usually a more extended system prompt contains few-shot examples, while the current more complex prompts are more generic.
    Did you notice a performance increase with more complex prompts?

  2. Could you share some example results from the benchmark?

Comment on lines 110 to 114
prompt_detail: List[Literal["brief", "moderate", "descriptive"]] = [
"brief",
"moderate",
"descriptive",
],
Member

Is it described elsewhere what is the meaning of this argument? (besides the PR description)

Contributor Author

same as above

Member

As discussed, I think the moderate level doesn't provide additional information, so it would be better to remove it.
I think adding an example in the docs of how prompt_detail is set in the predefined tasks would also help, because it's hard to explain without an example.

Contributor Author

moderate removed here: c03f1cf
docs updated here: 80e5ad2

@jmatejcz
Contributor Author


@boczekbartek thank you for the review

Yes, I see that the descriptive prompt performs better, but I see almost no difference between moderate and brief.

There is the level of descriptiveness of a Task prompt, and there is the number of examples in the system prompt; these are two different params.

About the Task prompt:
At first I had the idea to make 3 levels of prompts,

  • where brief is just, for example: "Get RGB camera image"
  • moderate adds the tools that need to be used in this case
  • and descriptive adds how these tools should be called

But I came to the conclusion that telling the model directly what it has to call, and how, does not make much sense in a benchmark; the whole point is to evaluate how well the model deduces the calls. So I changed the approach to make it more of a guidance. Now descriptive adds, for example: "You can explore available camera topics and capture the RGB color image."
In this case the moderate level doesn't have much to add, so it's almost the same as the brief prompt, and that's why I think it should be removed now, but I'm not sure.
Those are the results on basic tasks:

[image: benchmark results on basic tasks]

supress info httpx logs
@jmatejcz
Contributor Author

@boczekbartek adjusted the manipulation_config name and suppressed httpx info logs here: 308c3ce

@jmatejcz jmatejcz requested a review from boczekbartek June 30, 2025 14:15
@jmatejcz jmatejcz requested a review from boczekbartek July 1, 2025 08:40
Member

@boczekbartek boczekbartek left a comment

@jmatejcz Thank you for applying the changes. LGTM!

@jmatejcz jmatejcz merged commit 083e7e5 into development Jul 1, 2025
6 checks passed
@jmatejcz jmatejcz deleted the jm/feat/tool-calling-tasks branch July 1, 2025 10:49