
feat: tool calling benchmark unified across types and prompts variety #620


Merged
merged 32 commits into from
Jul 1, 2025

Conversation

jmatejcz
Contributor

@jmatejcz jmatejcz commented Jun 4, 2025

Purpose

Unify the approach to defining prompts across different types of tasks.
Unify the mocks of interfaces, topics, services and actions across tasks.
Provide different types of prompts.

Proposed Changes

  • Introduced new levels of system prompts, selectable by the user -> the n_shots param defines how many examples the system prompt contains. 0, 2 and 5 are available now, so every Task type now has 3 system prompts available.

  • New levels of Task prompts, selectable by the user -> the prompt_detail param defines how descriptive the prompt is. There are:

    • brief - only short command, like: Get RGB camera image.
    • moderate - slightly expanded by adding some context like: Get RGB camera image from the camera.
    • descriptive - detailed explanation containing what can be done to accomplish the task, like: Get RGB camera image from the robot's camera system. You can explore available camera topics and capture the RGB color image.

    ⚠️ Warning: I'm not sure about the moderate level. At first it seemed like the right approach to have some middle ground, but now I feel it's not really helping much. I didn't see much, or even any, improvement from brief to moderate. IMO the moderate level can be removed.

  • Added an optional tool calls number to each Task. This solves cases like the following: getting an image with get_ros2_image requires only 1 tool call, but listing the topics beforehand should not be counted as an error or an extra tool call. The same holds the other way around: not listing the topics is not an error either. In this case the Task has its optional tool calls number set to 1.

  • Merged the mock interfaces, topics, services and actions. Moved them to a separate file and split them into groups.

  • Separated the predefined tasks into several files by task type to make the code more readable.

  • Results now contain info about the prompt detail and how many examples the system prompt contained. Additionally, the task prompt is saved as a base prompt, which is the same regardless of the prompt_detail param. This is for processing results, so that the same task with different prompt detail levels is not classified as separate tasks.

  • Adjusted the visualisation script and docs to these changes. The UI now has drill-down filters on tasks.

⚠️ Warning: This PR does not define a large number of new tasks or adjust the task prompts, as it is already a big PR; that will be added in follow-up PRs.
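To make the parameterization above concrete, here is a minimal, self-contained sketch of how n_shots, prompt_detail, and the optional tool calls number could interact for the camera-image example. The class and method names (TaskArgs, get_prompt, get_base_prompt, max_allowed_calls) follow the PR description, but the exact signatures in rai_bench may differ; this is an illustration, not the actual implementation.

```python
from dataclasses import dataclass
from typing import Dict, Literal

PromptDetail = Literal["brief", "moderate", "descriptive"]


@dataclass
class TaskArgs:
    # Per-task configuration described in this PR (illustrative fields)
    n_shots: Literal[0, 2, 5] = 0
    prompt_detail: PromptDetail = "brief"
    extra_tool_calls: int = 0


class GetCameraImageTask:
    # Prompt variants keyed by detail level; the brief variant doubles as
    # the base prompt saved in results, so detail variants group together.
    PROMPTS: Dict[str, str] = {
        "brief": "Get RGB camera image.",
        "moderate": "Get RGB camera image from the camera.",
        "descriptive": (
            "Get RGB camera image from the robot's camera system. "
            "You can explore available camera topics and capture the "
            "RGB color image."
        ),
    }
    required_tool_calls = 1  # e.g. the get_ros2_image call itself
    optional_tool_calls = 1  # e.g. listing topics first is not an error

    def __init__(self, args: TaskArgs) -> None:
        self.args = args

    def get_base_prompt(self) -> str:
        return self.PROMPTS["brief"]

    def get_prompt(self) -> str:
        return self.PROMPTS[self.args.prompt_detail]

    def max_allowed_calls(self) -> int:
        # A run passes if it stays within required + optional + extra calls.
        return (
            self.required_tool_calls
            + self.optional_tool_calls
            + self.args.extra_tool_calls
        )


task = GetCameraImageTask(TaskArgs(prompt_detail="descriptive"))
print(task.max_allowed_calls())  # 2
```

The point of the optional count is visible in max_allowed_calls: an agent that lists topics before fetching the image (2 calls) and one that fetches directly (1 call) both pass with extra_tool_calls=0.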

Issues

#576 - solved partially; adding concrete tasks to the groups will come in separate PRs

Testing

python src/rai_bench/rai_bench/examples/benchmarking_models.py
streamlit run src/rai_bench/rai_bench/examples/visualise_streamlit.py

Summary by CodeRabbit

  • New Features

    • Added support for configuring prompt detail levels and number of few-shot examples in tool-calling agent benchmarks.
    • Introduced new categories and flexible parameterization for task generation, including prompt detail and shot count.
    • Enhanced performance visualization and filtering with new breakdowns by prompt detail and number of examples.
  • Improvements

    • Standardized and modularized task definitions for basic, manipulation, navigation, custom interfaces, and spatial reasoning.
    • Unified and clarified documentation and tutorials to reflect new configuration options and usage patterns.
    • Improved prompt generation with selectable detail levels and few-shot variants across all task types.
    • Enhanced UI for task analysis with additional filtering options.
  • Bug Fixes

    • Corrected and clarified task descriptions and configuration options in documentation.
  • Chores

    • Updated package version to 0.2.0.

@jmatejcz jmatejcz force-pushed the jm/feat/tool-calling-tasks branch from b2a12ba to c3b1b35 Compare June 4, 2025 13:34
@jmatejcz jmatejcz marked this pull request as ready for review June 4, 2025 13:39
@jmatejcz jmatejcz requested a review from maciejmajek June 4, 2025 13:49
@jmatejcz jmatejcz force-pushed the jm/feat/tool-calling-tasks branch from c3b1b35 to 0c8cb9d Compare June 5, 2025 08:59
@jmatejcz jmatejcz removed the request for review from maciejmajek June 5, 2025 09:09
@jmatejcz jmatejcz marked this pull request as draft June 5, 2025 09:09
@jmatejcz jmatejcz marked this pull request as ready for review June 5, 2025 12:12
@jmatejcz jmatejcz requested a review from maciejmajek June 5, 2025 12:12
@jmatejcz jmatejcz force-pushed the jm/feat/tool-calling-tasks branch from a5e9176 to d532672 Compare June 6, 2025 09:13
@jmatejcz
Contributor Author

@CodeRabbit full review

Contributor

coderabbitai bot commented Jun 23, 2025

✅ Actions performed

Full review triggered.

Contributor

coderabbitai bot commented Jun 23, 2025

Walkthrough

The changes introduce a major refactor and extension of the tool-calling agent benchmark framework. Task definitions are modularized and parameterized, supporting new prompt detail levels and few-shot prompting. The public interface for defining, configuring, and filtering tasks is unified via new data models and factory functions. Visualization and result processing functions are generalized to support richer metadata. Documentation and examples are updated accordingly.

Changes

File(s) Change Summary
docs/simulation_and_benchmarking/rai_bench.md, docs/tutorials/benchmarking.md Documentation updated to clarify task configuration, parameterization, and example usage, reflecting new API patterns and flexible benchmark options.
src/rai_bench/pyproject.toml Package version updated from 0.1.0 to 0.2.0.
src/rai_bench/rai_bench/examples/benchmarking_models.py, src/rai_bench/rai_bench/examples/tool_calling_agent.py Example scripts updated to use new task parameterization (extra_tool_calls, prompt_detail, n_shots) and filter logic.
src/rai_bench/rai_bench/results_processing/data_loading.py DataFrame row-to-domain object conversion refactored to use dictionary unpacking, simplifying object construction.
src/rai_bench/rai_bench/results_processing/data_processing.py Task details DataFrame creation extended with new filters: complexity, examples in system prompt, and prompt detail.
src/rai_bench/rai_bench/results_processing/visualise/tool_calling_agent_display.py Visualization functions generalized for arbitrary fields, detailed analysis supports multiple filters, UI enhanced with new selectors.
src/rai_bench/rai_bench/test_models.py Benchmark config for tool-calling agent updated: extra_tool_calls is now a list, new fields N_shots and prompt_detail added, and passed to task generation.
src/rai_bench/rai_bench/tool_calling_agent/benchmark.py TaskResult construction updated to include examples_in_system_prompt and prompt_detail, and use base prompt.
src/rai_bench/rai_bench/tool_calling_agent/interfaces.py New TaskArgs data model introduced; Task interface refactored to use TaskArgs, adds type, optional_tool_calls_number, and get_base_prompt.
src/rai_bench/rai_bench/tool_calling_agent/predefined/__init__.py, .../basic_tasks.py, .../custom_interfaces_tasks.py, .../manipulation_tasks.py, .../navigation_tasks.py, .../spatial_reasoning_tasks.py New modularized task definition modules added, each exporting a function to generate parameterized task lists for a specific category.
src/rai_bench/rai_bench/tool_calling_agent/predefined/tasks.py Refactored to delegate task creation to new modularized functions, supports multiple values for prompt detail and shots, filters by complexity.
src/rai_bench/rai_bench/tool_calling_agent/results_tracking.py TaskResult model extended with examples_in_system_prompt and prompt_detail fields.
src/rai_bench/rai_bench/tool_calling_agent/tasks/basic.py, .../custom_interfaces.py, .../manipulation.py, .../navigation.py, .../spatial.py All task classes refactored: unified constructor with TaskArgs, support for prompt detail and few-shot variants, modularized tool and interface definitions, new/updated prompt generation methods, and richer metadata.
src/rai_bench/rai_bench/utils.py Argument parser extended with --prompt-detail and --n-shots options for command-line configuration.
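The factory pattern described in the walkthrough expands lists of parameter values into one task instance per combination. A minimal sketch of that cross-product expansion (the function name follows the PR, but the body and return shape are illustrative, not the actual rai_bench API) might look like:

```python
from itertools import product


def get_tasks(extra_tool_calls=None, prompt_detail=None, n_shots=None):
    # None defaults avoid the mutable-default-argument pitfall flagged
    # in the review below.
    if extra_tool_calls is None:
        extra_tool_calls = [0]
    if prompt_detail is None:
        prompt_detail = ["brief", "moderate", "descriptive"]
    if n_shots is None:
        n_shots = [0, 2, 5]

    tasks = []
    for etc, detail, shots in product(extra_tool_calls, prompt_detail, n_shots):
        # One parameterized task per combination; the real implementation
        # would instantiate Task objects via TaskArgs here.
        tasks.append(
            {"extra_tool_calls": etc, "prompt_detail": detail, "n_shots": shots}
        )
    return tasks


print(len(get_tasks()))  # 1 * 3 * 3 = 9 parameter combinations
```

With the defaults this yields 9 variants per task definition, which is why the results need the base prompt and metadata fields to regroup them during analysis.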

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant CLI/Script
    participant BenchmarkConfig
    participant TaskFactory
    participant Task
    participant Benchmark
    participant ResultsProcessor

    User->>CLI/Script: Specify benchmark args (model, extra_tool_calls, prompt_detail, n_shots, etc.)
    CLI/Script->>BenchmarkConfig: Create config with lists for parameters
    CLI/Script->>TaskFactory: get_tasks(extra_tool_calls, prompt_detail, n_shots, ...)
    TaskFactory->>Task: Instantiate Task with TaskArgs for each parameter combination
    TaskFactory-->>CLI/Script: Return list of parameterized Tasks
    CLI/Script->>Benchmark: Run benchmark with Tasks and config
    Benchmark->>Task: For each Task, execute and collect results
    Benchmark->>ResultsProcessor: Pass results including examples_in_system_prompt, prompt_detail
    ResultsProcessor-->>User: Visualize and filter results by new metadata fields
sequenceDiagram
    participant User
    participant WebUI
    participant ResultsProcessor

    User->>WebUI: Select filters (task type, complexity, examples_in_system_prompt, prompt_detail)
    WebUI->>ResultsProcessor: create_task_details_dataframe with selected filters
    ResultsProcessor-->>WebUI: Filtered DataFrame with detailed task results
    WebUI-->>User: Display charts and tables grouped by selected fields
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 17

🔭 Outside diff range comments (2)
src/rai_bench/rai_bench/tool_calling_agent/predefined/tasks.py (1)

30-54: Replace mutable default arguments with None

Using mutable objects as default arguments can lead to unexpected behavior.

 def get_tasks(
-    extra_tool_calls: List[int] = [0],
-    complexities: List[Literal["easy", "medium", "hard"]] = ["easy", "medium", "hard"],
-    prompt_detail: List[Literal["brief", "moderate", "descriptive"]] = [
-        "brief",
-        "moderate",
-        "descriptive",
-    ],
-    n_shots: List[Literal[0, 2, 5]] = [0, 2, 5],
+    extra_tool_calls: List[int] | None = None,
+    complexities: List[Literal["easy", "medium", "hard"]] | None = None,
+    prompt_detail: List[Literal["brief", "moderate", "descriptive"]] | None = None,
+    n_shots: List[Literal[0, 2, 5]] | None = None,
     task_types: List[
         Literal[
             "basic",
             "manipulation",
             "navigation",
             "custom_interfaces",
             "spatial_reasoning",
         ]
-    ] = [
+    ] | None = None,
+) -> List[Task]:
+    if extra_tool_calls is None:
+        extra_tool_calls = [0]
+    if complexities is None:
+        complexities = ["easy", "medium", "hard"]
+    if prompt_detail is None:
+        prompt_detail = ["brief", "moderate", "descriptive"]
+    if n_shots is None:
+        n_shots = [0, 2, 5]
+    if task_types is None:
+        task_types = [
         "basic",
         "manipulation",
         "navigation",
         "custom_interfaces",
         "spatial_reasoning",
-    ],
-) -> List[Task]:
+        ]
     all_tasks: List[Task] = []
src/rai_bench/rai_bench/tool_calling_agent/tasks/navigation.py (1)

36-102: Fix line length violations in system prompts.

The system prompts are well-structured with good incremental examples, but several lines exceed the 79-character limit.

Apply these formatting fixes:

-ROBOT_NAVIGATION_SYSTEM_PROMPT_0_SHOT = """You are an autonomous robot connected to ros2 environment. Your main goal is to fulfill the user's requests.
+ROBOT_NAVIGATION_SYSTEM_PROMPT_0_SHOT = """You are an autonomous robot connected to ros2 environment. 
+    Your main goal is to fulfill the user's requests.
     Do not make assumptions about the environment you are currently in.
     You can use ros2 topics, services and actions to operate.

     <rule> As a first step check transforms by getting 1 message from /tf topic </rule>
-    <rule> use /cmd_vel topic very carefully. Obstacle detection works only with nav2 stack, so be careful when it is not used. </rule>>
+    <rule> use /cmd_vel topic very carefully. Obstacle detection works only with nav2 stack, 
+    so be careful when it is not used. </rule>

Similar formatting should be applied to lines 93, 100-101, and line 85.

🧹 Nitpick comments (15)
src/rai_bench/rai_bench/examples/benchmarking_models.py (1)

35-35: Fix line length violation.

The line exceeds the 79-character limit enforced by flake8.

-        extra_tool_calls=[0],  # how many extra tool calls allowed to still pass
+        extra_tool_calls=[0],  # extra tool calls allowed to still pass
src/rai_bench/rai_bench/results_processing/data_processing.py (1)

215-235: Consider refactoring the filtering logic for better maintainability.

The sequential filtering approach is correct but could be made more maintainable by using a dictionary-driven approach.

-    # Apply filters
-    if task_type:
-        all_detailed_results = [r for r in all_detailed_results if r.type == task_type]
-
-    if complexity:
-        all_detailed_results = [
-            r for r in all_detailed_results if r.complexity == complexity
-        ]
-
-    if examples_in_system_prompt:
-        all_detailed_results = [
-            r
-            for r in all_detailed_results
-            if r.examples_in_system_prompt == examples_in_system_prompt
-        ]
-
-    if prompt_detail:
-        all_detailed_results = [
-            r for r in all_detailed_results if r.prompt_detail == prompt_detail
-        ]
+    # Apply filters
+    filters = {
+        'type': task_type,
+        'complexity': complexity,
+        'examples_in_system_prompt': examples_in_system_prompt,
+        'prompt_detail': prompt_detail,
+    }
+    
+    for attr, value in filters.items():
+        if value is not None:
+            all_detailed_results = [
+                r for r in all_detailed_results if getattr(r, attr) == value
+            ]
src/rai_bench/rai_bench/tool_calling_agent/predefined/manipulation_tasks.py (1)

34-34: Fix comment formatting for consistency.

The comment blocks have formatting issues that should be addressed for code style compliance.

-########## SUBTASKS #################################################################
+# SUBTASKS
-######### VALIDATORS #########################################################################################
+# VALIDATORS

Also applies to: 44-44

docs/simulation_and_benchmarking/rai_bench.md (2)

127-135: Fix grammatical issues for better readability.

The documentation has some grammatical issues that should be corrected.

-There are predefined Tasks available which are grouped by categories:
+Predefined Tasks are available, grouped by categories:

--   Basic - require retrieving info from certain topics
+-   Basic - requires retrieving info from certain topics

136-150: Fix grammatical issues in task parameter description.

-When creating a Task you can define few params:
+When creating a Task, you can define a few parameters:

-extra_tool_calls - How many extra tool calls can agent make and still pass the Task.
+extra_tool_calls - How many extra tool calls an agent can make and still pass the Task.
src/rai_bench/rai_bench/test_models.py (1)

59-66: LGTM! Configuration changes align with PR objectives.

The addition of N_shots and prompt_detail parameters successfully implements the multi-level prompt system as described in the PR objectives. The type annotations and default values are appropriate.

Minor formatting improvement to address line length:

-    extra_tool_calls: List[int] = [0]
+    extra_tool_calls: List[int] = [0]
     complexities: List[Literal["easy", "medium", "hard"]] = ["easy", "medium", "hard"]
-    N_shots: List[Literal[0, 2, 5]] = [0, 2, 5]
-    prompt_detail: List[Literal["brief", "moderate", "descriptive"]] = [
-        "brief",
-        "moderate",
-        "descriptive",
-    ]
+    N_shots: List[Literal[0, 2, 5]] = [0, 2, 5]
+    prompt_detail: List[Literal["brief", "moderate", "descriptive"]] = [
+        "brief", "moderate", "descriptive"
+    ]
src/rai_bench/rai_bench/tool_calling_agent/predefined/custom_interfaces_tasks.py (1)

32-32: Formatting consistency improvement.

Adjust comment formatting for consistency:

-########## SUBTASKS #################################################################
+# SUBTASKS
src/rai_bench/rai_bench/tool_calling_agent/predefined/spatial_reasoning_tasks.py (1)

35-94: Consider refactoring for better maintainability.

The task inputs are well-defined, but there's some duplication between the initial definitions and the categorized versions within the function.

Consider moving the initial true_response_inputs and false_response_inputs definitions into the function or removing them if they're not used elsewhere, to reduce duplication and improve maintainability.

src/rai_bench/rai_bench/results_processing/visualise/tool_calling_agent_display.py (1)

376-376: Fix typo in UI label

-        "Select prompt decriptivness",
+        "Select prompt descriptiveness",
src/rai_bench/rai_bench/tool_calling_agent/tasks/spatial.py (3)

111-116: Simplify conditional logic by removing unnecessary elif

     def get_system_prompt(self) -> str:
         if self.n_shots == 0:
             return SPATIAL_REASONING_SYSTEM_PROMPT_0_SHOT
-        elif self.n_shots == 2:
+        if self.n_shots == 2:
             return SPATIAL_REASONING_SYSTEM_PROMPT_2_SHOT
-        else:
-            return SPATIAL_REASONING_SYSTEM_PROMPT_5_SHOT
+        return SPATIAL_REASONING_SYSTEM_PROMPT_5_SHOT

147-156: Simplify conditional logic by removing unnecessary elif

     def get_prompt(self):
         if self.prompt_detail == "brief":
             return self.get_base_prompt()
-        elif self.prompt_detail == "moderate":
+        if self.prompt_detail == "moderate":
             return f"{self.get_base_prompt()} using visual analysis"
-        else:
-            return (
-                f"{self.get_base_prompt()} using the visual analysis system. "
-                "You can examine the provided image(s) carefully to identify relevant features, "
-                "analyze the visual content, and provide a boolean response based on your observations."
-            )
+        return (
+            f"{self.get_base_prompt()} using the visual analysis system. "
+            "You can examine the provided image(s) carefully to identify relevant features, "
+            "analyze the visual content, and provide a boolean response based on your observations."
+        )

28-28: Break up long line for better readability

-SPATIAL_REASONING_SYSTEM_PROMPT_0_SHOT = """You are a helpful and knowledgeable AI assistant that specializes in interpreting and analyzing visual content. Your task is to answer questions based on the images provided to you. Please response with the use of the provided tools."""
+SPATIAL_REASONING_SYSTEM_PROMPT_0_SHOT = (
+    "You are a helpful and knowledgeable AI assistant that specializes in "
+    "interpreting and analyzing visual content. Your task is to answer questions "
+    "based on the images provided to you. Please response with the use of the provided tools."
+)
src/rai_bench/rai_bench/tool_calling_agent/tasks/navigation.py (1)

170-239: Fix line length violations in task prompt methods.

The prompt methods follow a good consistent pattern, but several lines exceed the character limit.

Consider breaking long strings across multiple lines:

-            return (
-                f"{self.get_base_prompt()} using the robotic navigation system. "
-                "You can use the navigation tools to move the robot to the specified coordinates. "
-                "First get the available actions, then set up the navigation goal to reach point (2.0, 2.0, 0.0)."
-            )
+            return (
+                f"{self.get_base_prompt()} using the robotic navigation system. "
+                "You can use the navigation tools to move the robot to the "
+                "specified coordinates. First get the available actions, then "
+                "set up the navigation goal to reach point (2.0, 2.0, 0.0)."
+            )

Note: The pylint warnings about "elif after return" are style preferences and the current code is acceptable.

src/rai_bench/rai_bench/tool_calling_agent/tasks/custom_interfaces.py (1)

193-201: Consider refactoring constructors with many parameters.

Several task constructors have 7+ parameters, which impacts readability and maintainability.

Consider these approaches to reduce parameter count:

  1. Group related parameters into configuration objects (e.g., AudioConfig, DetectionConfig)
  2. Use the builder pattern for complex task initialization
  3. Move some parameters to class-level constants if they rarely change

This would improve the API design and make the code more maintainable.

Also applies to: 231-241, 283-295, 336-346, 373-382, 445-455
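As a rough illustration of the parameter-object approach suggested above, related constructor arguments can be bundled into a small dataclass so the task constructor takes two parameters instead of seven. All names here (DetectionConfig, DetectionTask and their fields) are invented for the example and are not actual rai_bench classes:

```python
from dataclasses import dataclass


@dataclass
class DetectionConfig:
    # Related settings travel together instead of as loose constructor args
    topic: str
    frame_id: str
    confidence_threshold: float = 0.5


class DetectionTask:
    def __init__(self, config: DetectionConfig, extra_tool_calls: int = 0) -> None:
        self.config = config
        self.extra_tool_calls = extra_tool_calls


task = DetectionTask(DetectionConfig(topic="/detections", frame_id="camera_link"))
```

Besides shortening signatures, this makes it cheap to reuse one config across several task variants and keeps defaults in one place.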

src/rai_bench/rai_bench/tool_calling_agent/tasks/manipulation.py (1)

179-196: Fix line length violations in prompt methods.

The task implementations are well-structured with good error handling, but several prompt strings exceed the line limit.

Format long strings properly:

-                f"{self.get_base_prompt()} using the robotic manipulation system. "
-                "You can control the arm movement to the specified coordinates "
-                f"and perform the {self.move_to_tool_input.task} action at that location."
+                f"{self.get_base_prompt()} using the robotic manipulation "
+                "system. You can control the arm movement to the specified "
+                f"coordinates and perform the {self.move_to_tool_input.task} "
+                "action at that location."

Also applies to: 215-240, 376-390

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 650083e and 761ba5c.

📒 Files selected for processing (25)
  • docs/simulation_and_benchmarking/rai_bench.md (2 hunks)
  • docs/tutorials/benchmarking.md (6 hunks)
  • src/rai_bench/pyproject.toml (1 hunks)
  • src/rai_bench/rai_bench/examples/benchmarking_models.py (2 hunks)
  • src/rai_bench/rai_bench/examples/tool_calling_agent.py (1 hunks)
  • src/rai_bench/rai_bench/results_processing/data_loading.py (2 hunks)
  • src/rai_bench/rai_bench/results_processing/data_processing.py (3 hunks)
  • src/rai_bench/rai_bench/results_processing/visualise/tool_calling_agent_display.py (4 hunks)
  • src/rai_bench/rai_bench/test_models.py (2 hunks)
  • src/rai_bench/rai_bench/tool_calling_agent/benchmark.py (1 hunks)
  • src/rai_bench/rai_bench/tool_calling_agent/interfaces.py (5 hunks)
  • src/rai_bench/rai_bench/tool_calling_agent/predefined/__init__.py (1 hunks)
  • src/rai_bench/rai_bench/tool_calling_agent/predefined/basic_tasks.py (1 hunks)
  • src/rai_bench/rai_bench/tool_calling_agent/predefined/custom_interfaces_tasks.py (1 hunks)
  • src/rai_bench/rai_bench/tool_calling_agent/predefined/manipulation_tasks.py (1 hunks)
  • src/rai_bench/rai_bench/tool_calling_agent/predefined/navigation_tasks.py (1 hunks)
  • src/rai_bench/rai_bench/tool_calling_agent/predefined/spatial_reasoning_tasks.py (1 hunks)
  • src/rai_bench/rai_bench/tool_calling_agent/predefined/tasks.py (2 hunks)
  • src/rai_bench/rai_bench/tool_calling_agent/results_tracking.py (1 hunks)
  • src/rai_bench/rai_bench/tool_calling_agent/tasks/basic.py (1 hunks)
  • src/rai_bench/rai_bench/tool_calling_agent/tasks/custom_interfaces.py (5 hunks)
  • src/rai_bench/rai_bench/tool_calling_agent/tasks/manipulation.py (9 hunks)
  • src/rai_bench/rai_bench/tool_calling_agent/tasks/navigation.py (3 hunks)
  • src/rai_bench/rai_bench/tool_calling_agent/tasks/spatial.py (4 hunks)
  • src/rai_bench/rai_bench/utils.py (1 hunks)
🧰 Additional context used
🪛 Flake8 (7.2.0)
src/rai_bench/rai_bench/tool_calling_agent/results_tracking.py

[error] 47-47: line too long (81 > 79 characters)

(E501)


[error] 49-49: line too long (83 > 79 characters)

(E501)

src/rai_bench/rai_bench/results_processing/data_processing.py

[error] 191-191: line too long (83 > 79 characters)

(E501)


[error] 217-217: line too long (87 > 79 characters)

(E501)

src/rai_bench/rai_bench/tool_calling_agent/predefined/manipulation_tasks.py

[error] 34-34: too many leading '#' for block comment

(E266)


[error] 34-34: line too long (85 > 79 characters)

(E501)


[error] 44-44: too many leading '#' for block comment

(E266)


[error] 44-44: line too long (110 > 79 characters)

(E501)

src/rai_bench/rai_bench/tool_calling_agent/predefined/navigation_tasks.py

[error] 34-34: too many leading '#' for block comment

(E266)


[error] 34-34: line too long (85 > 79 characters)

(E501)


[error] 63-63: too many leading '#' for block comment

(E266)


[error] 63-63: line too long (110 > 79 characters)

(E501)


[error] 67-67: line too long (87 > 79 characters)

(E501)


[error] 68-68: line too long (86 > 79 characters)

(E501)

src/rai_bench/rai_bench/tool_calling_agent/predefined/basic_tasks.py

[error] 39-39: too many leading '#' for block comment

(E266)


[error] 39-39: line too long (85 > 79 characters)

(E501)


[error] 181-181: too many leading '#' for block comment

(E266)


[error] 181-181: line too long (110 > 79 characters)

(E501)


[error] 187-187: line too long (88 > 79 characters)

(E501)


[error] 188-188: line too long (88 > 79 characters)

(E501)


[error] 221-221: line too long (85 > 79 characters)

(E501)


[error] 222-222: line too long (85 > 79 characters)

(E501)


[error] 254-254: line too long (81 > 79 characters)

(E501)


[error] 283-283: line too long (84 > 79 characters)

(E501)


[error] 286-286: line too long (84 > 79 characters)

(E501)

src/rai_bench/rai_bench/examples/benchmarking_models.py

[error] 35-35: line too long (80 > 79 characters)

(E501)

src/rai_bench/rai_bench/tool_calling_agent/predefined/spatial_reasoning_tasks.py

[error] 95-95: too many leading '#' for block comment

(E266)


[error] 95-95: line too long (85 > 79 characters)

(E501)


[error] 100-100: line too long (80 > 79 characters)

(E501)


[error] 103-103: too many leading '#' for block comment

(E266)


[error] 103-103: line too long (110 > 79 characters)

(E501)


[error] 127-127: line too long (85 > 79 characters)

(E501)


[error] 162-162: line too long (87 > 79 characters)

(E501)


[error] 165-165: line too long (85 > 79 characters)

(E501)


[error] 176-176: line too long (81 > 79 characters)

(E501)

src/rai_bench/rai_bench/tool_calling_agent/predefined/custom_interfaces_tasks.py

[error] 32-32: too many leading '#' for block comment

(E266)


[error] 32-32: line too long (85 > 79 characters)

(E501)


[error] 46-46: too many leading '#' for block comment

(E266)


[error] 46-46: line too long (110 > 79 characters)

(E501)

src/rai_bench/rai_bench/results_processing/visualise/tool_calling_agent_display.py

[error] 121-121: line too long (83 > 79 characters)

(E501)


[error] 132-132: line too long (83 > 79 characters)

(E501)


[error] 147-147: line too long (81 > 79 characters)

(E501)


[error] 149-149: line too long (80 > 79 characters)

(E501)


[error] 340-340: line too long (81 > 79 characters)

(E501)


[error] 357-357: line too long (83 > 79 characters)

(E501)

src/rai_bench/rai_bench/test_models.py

[error] 60-60: line too long (86 > 79 characters)

(E501)

src/rai_bench/rai_bench/tool_calling_agent/predefined/tasks.py

[error] 32-32: line too long (87 > 79 characters)

(E501)

src/rai_bench/rai_bench/tool_calling_agent/tasks/navigation.py

[error] E501 line too long (> 79 characters): 36 (151), 85 (90), 93 (220), 100 (120), 101 (181), 145 (82), 176 (81), 177 (99), 178 (114), 193 (80), 196 (81), 197 (95), 216 (81), 217 (82), 218 (88), 236 (81), 237 (96), 238 (94)

src/rai_bench/rai_bench/tool_calling_agent/tasks/spatial.py

[error] E501 line too long (> 79 characters): 28 (280), 39 (105), 43 (82), 45 (91), 76 (86), 154 (97), 155 (104), 159 (83), 163 (87), 164 (85), 167 (89), 168 (110)

src/rai_bench/rai_bench/tool_calling_agent/tasks/custom_interfaces.py

[error] E501 line too long (> 79 characters): 74 (171), 82 (179), 89 (81), 90 (210), 175 (82), 185 (83), 186 (91), 223 (89), 224 (98), 260 (92), 267 (80), 273 (82), 274 (104), 275 (81), 313 (103), 321 (82), 326 (88), 327 (89), 328 (103), 364 (85), 365 (86), 366 (85), 392 (84), 393 (87), 405 (88), 406 (106), 407 (114), 436 (83), 437 (84), 459 (82), 469 (85), 470 (96), 496 (81), 500 (82)

src/rai_bench/rai_bench/tool_calling_agent/tasks/basic.py

[error] E501 line too long (> 79 characters): 24 (87), 32 (171), 40 (179), 48 (164), 66 (88), 103 (109), 122 (90), 140 (91), 158 (98), 175 (98), 193 (93), 195 (97), 212 (97), 213 (85), 231 (94), 232 (85), 233 (88)

src/rai_bench/rai_bench/tool_calling_agent/tasks/manipulation.py

[error] E501 line too long (> 79 characters): 43 (80), 153 (81), 175 (81), 193 (83), 194 (80), 195 (90), 210 (81), 237 (82), 286 (83), 313 (86), 343 (84), 370 (81), 383 (83), 386 (93), 387 (88), 388 (98)

🪛 Ruff (0.11.9)

B006 Do not use mutable data structures for argument defaults (replace with None; initialize within function):

src/rai_bench/rai_bench/tool_calling_agent/predefined/manipulation_tasks.py: lines 54, 55-59, 60
src/rai_bench/rai_bench/tool_calling_agent/predefined/navigation_tasks.py: lines 72, 73-77, 78
src/rai_bench/rai_bench/tool_calling_agent/predefined/basic_tasks.py: lines 244, 245-249, 250
src/rai_bench/rai_bench/tool_calling_agent/predefined/spatial_reasoning_tasks.py: lines 109, 110-114, 115
src/rai_bench/rai_bench/tool_calling_agent/predefined/custom_interfaces_tasks.py: lines 59, 60-64, 65
src/rai_bench/rai_bench/tool_calling_agent/predefined/tasks.py: lines 31, 32, 33-37, 38
src/rai_bench/rai_bench/tool_calling_agent/tasks/custom_interfaces.py: lines 198, 236
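As a quick illustration of the fix these B006 findings ask for, here is a minimal sketch (the function names are hypothetical, not from the PR):

```python
from typing import List, Optional

# Flagged pattern (B006): the default list is created once, at function
# definition time, and then shared by every call that relies on the default.
def make_tasks_bad(extra_calls: List[int] = []) -> List[int]:
    extra_calls.append(1)
    return extra_calls

# Suggested fix: default to None and initialize inside the function body,
# so each call gets a fresh list.
def make_tasks_good(extra_calls: Optional[List[int]] = None) -> List[int]:
    if extra_calls is None:
        extra_calls = []
    extra_calls.append(1)
    return extra_calls

print(make_tasks_bad())   # [1]
print(make_tasks_bad())   # [1, 1]  state leaked between calls
print(make_tasks_good())  # [1]
print(make_tasks_good())  # [1]     fresh list every call
```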

🪛 LanguageTool
docs/simulation_and_benchmarking/rai_bench.md

[uncategorized] ~128: verb may not agree with its subject ("- Basic - require retrieving info from certain topics") (AI_EN_LECTOR_REPLACEMENT_VERB_AGREEMENT)
[uncategorized] ~136: possible missing comma ("When creating a Task you can define few params") (AI_HYDRA_LEO_MISSING_COMMA)
[uncategorized] ~136: possibly missing article "a" ("you can define few params") (AI_EN_LECTOR_MISSING_DETERMINER_A)
[uncategorized] ~149: possibly missing article "an" ("How many extra tool calls can agent make") (AI_EN_LECTOR_MISSING_DETERMINER_AN)

🪛 Pylint (3.3.7)

src/rai_bench/rai_bench/tool_calling_agent/tasks/navigation.py

[refactor] R1705 Unnecessary "elif" after "return" (remove the leading "el" from "elif"): lines 155-160, 170-179, 190-199, 210-219, 230-239

src/rai_bench/rai_bench/tool_calling_agent/tasks/spatial.py

[refactor] R0903 Too few public methods (0/2): lines 55, 59, 72
[refactor] R1705 Unnecessary "elif" after "return": lines 111-116, 147-156

src/rai_bench/rai_bench/tool_calling_agent/interfaces.py

[refactor] R0903 Too few public methods (0/2): line 458

src/rai_bench/rai_bench/tool_calling_agent/tasks/custom_interfaces.py

[refactor] R1705 Unnecessary "elif" after "return": lines 104-109, 178-187, 215-225, 264-277, 318-330, 356-367, 398-408, 429-439, 462-472, 493-503
[refactor] R0913/R0917 Too many (positional) arguments: line 193 (7/5), line 231 (9/5), line 283 (11/5), line 336 (6/5), line 373 (8/5), line 445 (6/5)

src/rai_bench/rai_bench/tool_calling_agent/tasks/basic.py

[refactor] R1705 Unnecessary "elif" after "return": lines 78-83, 97-105, 115-123, 133-141, 151-159, 169-177, 187-196, 206-215, 225-235

src/rai_bench/rai_bench/tool_calling_agent/tasks/manipulation.py

[refactor] R1705 Unnecessary "elif" after "return": lines 135-140, 187-196, 231-240, 250-259, 280-289, 305-314, 335-344, 380-390
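For reference, the refactor R1705 suggests looks like this; the prompt-selector function below is a hypothetical stand-in for the flagged methods, with prompt text borrowed from the PR description:

```python
# Flagged shape (R1705): every branch returns or raises, so the
# "elif"/"else" nesting is redundant.
def get_prompt_with_elif(detail: str) -> str:
    if detail == "brief":
        return "Get RGB camera image."
    elif detail == "descriptive":
        return ("Get RGB camera image from the robot's camera system. "
                "You can explore available camera topics.")
    else:
        raise ValueError(f"unknown prompt_detail: {detail}")

# Refactored form Pylint asks for: flat early returns, no elif chain.
def get_prompt_flat(detail: str) -> str:
    if detail == "brief":
        return "Get RGB camera image."
    if detail == "descriptive":
        return ("Get RGB camera image from the robot's camera system. "
                "You can explore available camera topics.")
    raise ValueError(f"unknown prompt_detail: {detail}")
```

Both forms behave identically; the flat version just removes one level of nesting.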

🔇 Additional comments (23)
src/rai_bench/pyproject.toml (1)

3-3: Version bump appropriately reflects the feature additions.

The version change from "0.1.0" to "0.2.0" correctly follows semantic versioning to indicate new features (prompt detail levels, few-shot examples, task modularization) without breaking changes.

src/rai_bench/rai_bench/examples/tool_calling_agent.py (1)

37-38: Integration of new task parameters looks good.

The addition of n_shots and prompt_detail parameters correctly extends the example to support the new prompt parameterization features introduced in this PR.

src/rai_bench/rai_bench/tool_calling_agent/benchmark.py (1)

163-166: Enhanced TaskResult metadata looks correct.

The changes appropriately capture the new task configuration parameters:

  • Using get_base_prompt() for consistent task identification
  • Adding examples_in_system_prompt and prompt_detail for enriched result metadata

These additions will enable better filtering and analysis in the results processing pipeline.

src/rai_bench/rai_bench/tool_calling_agent/predefined/__init__.py (1)

15-27: Excellent module organization following Python best practices.

The centralized import/export pattern with explicit __all__ definition provides a clean public API while maintaining modular task organization. This supports the PR objective of separating predefined tasks by type for improved code readability.

src/rai_bench/rai_bench/examples/benchmarking_models.py (3)

23-24: Simplified model configuration for focused testing.

Reducing to a single model and vendor streamlines the example for demonstrating the new parameterization features.


39-48: New task parameterization features integrated correctly.

The addition of custom_interfaces task type and the new N_shots and prompt_detail parameters demonstrate the enhanced benchmark configuration capabilities introduced in this PR. The list format allows for testing multiple parameter combinations.


56-56: Focused benchmark configuration for demonstration.

Reducing to a single benchmark configuration (tool_conf) simplifies the example while showcasing the new parameterization features.

src/rai_bench/rai_bench/tool_calling_agent/predefined/manipulation_tasks.py (1)

68-97: LGTM! Well-structured task generation logic.

The nested loops create a comprehensive set of parameterized tasks. The use of TaskArgs to encapsulate configuration parameters is clean and the validator assignment is appropriate for the task types.

docs/simulation_and_benchmarking/rai_bench.md (1)

139-150: LGTM! Clear documentation of TaskArgs parameters.

The code snippet effectively illustrates the new TaskArgs configuration options and their purposes are well explained.

src/rai_bench/rai_bench/results_processing/data_loading.py (2)

73-84: LGTM! Clean refactoring using dictionary unpacking.

The use of dictionary unpacking with explicit type conversions is a good improvement that makes the code more maintainable and aligns well with the new TaskResult fields.


101-106: LGTM! Consistent use of dictionary unpacking pattern.

The refactoring follows the same clean pattern as the TaskResult conversion function.

src/rai_bench/rai_bench/test_models.py (1)

198-199: Good integration of new parameters.

The addition of prompt_detail and n_shots parameters to the get_tasks function call correctly implements the enhanced task generation capabilities.

src/rai_bench/rai_bench/utils.py (1)

46-62: Excellent CLI interface additions.

The new command-line arguments for --prompt-detail and --n-shots provide the necessary interface for the enhanced benchmark configuration. The choices are well-defined and match the configuration class, and the help text is clear and informative.

src/rai_bench/rai_bench/tool_calling_agent/predefined/custom_interfaces_tasks.py (1)

69-87: Well-structured task generation logic.

The nested loop structure correctly generates tasks for all parameter combinations, and the use of TaskArgs provides a clean abstraction for task configuration.

src/rai_bench/rai_bench/tool_calling_agent/predefined/navigation_tasks.py (2)

36-62: Well-defined navigation task specifications.

The ROS2 action specifications are correctly structured with appropriate expected fields for navigation, spinning, and drive-on-heading actions.


90-109: Efficient task generation using extend.

Good use of tasks.extend() to add multiple tasks at once, and the task instantiation covers all the navigation task types appropriately.

src/rai_bench/rai_bench/tool_calling_agent/predefined/spatial_reasoning_tasks.py (2)

119-194: Excellent task complexity categorization.

The categorization of spatial reasoning tasks into easy (object presence), medium (counting/state), and hard (spatial relationships) is well-thought-out and provides good coverage of different visual reasoning capabilities.


205-270: Comprehensive task generation with good organization.

The task generation covers all complexity levels and response types systematically. The code structure is clear and maintainable.

src/rai_bench/rai_bench/tool_calling_agent/interfaces.py (2)

458-464: LGTM! Well-designed configuration model.

The TaskArgs model provides a clean interface for task configuration with appropriate defaults and type constraints using Literal types.

Note: The pylint warning about too few public methods can be safely ignored for Pydantic data models.
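A minimal sketch of such a Pydantic configuration model, assuming field names from the PR description (n_shots, prompt_detail, extra_tool_calls); the defaults and exact fields are assumptions, not the merged code:

```python
from typing import Literal

from pydantic import BaseModel, ValidationError

# Hypothetical reconstruction of TaskArgs: Literal types constrain the
# allowed values at construction time, which is why R0903 does not apply.
class TaskArgs(BaseModel):
    n_shots: Literal[0, 2, 5] = 0
    prompt_detail: Literal["brief", "descriptive"] = "brief"
    extra_tool_calls: int = 0

args = TaskArgs(n_shots=2, prompt_detail="descriptive")
print(args.n_shots, args.prompt_detail)  # 2 descriptive

# Out-of-range values are rejected by validation:
try:
    TaskArgs(n_shots=3)
except ValidationError:
    print("rejected: n_shots=3 is not in {0, 2, 5}")
```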


468-502: Excellent refactoring of the Task interface!

The changes improve the design in several ways:

  • Using TaskArgs simplifies task initialization and makes it more extensible
  • Making type a class attribute is cleaner than an abstract property
  • The new optional_tool_calls_number property adds flexibility for tasks that may make preliminary calls
  • The updated max_tool_calls_number calculation correctly includes all allowed calls
  • The get_base_prompt() method standardizes prompt handling across tasks

Also applies to: 538-553, 574-580

src/rai_bench/rai_bench/tool_calling_agent/tasks/navigation.py (1)

20-28: Good modularization of interface definitions.

The imports from mocked_ros2_interfaces and the combination of common and navigation-specific constants provide a clean separation of concerns.

Also applies to: 103-120

src/rai_bench/rai_bench/tool_calling_agent/tasks/basic.py (1)

57-84: Well-designed BasicTask base class.

The base class provides a unified set of tools for all basic tasks and correctly implements the optional tool calls pattern. The system prompt selection is consistent with other task modules.

src/rai_bench/rai_bench/tool_calling_agent/tasks/manipulation.py (1)

92-161: Excellent class hierarchy design.

The separation of ManipulationTask and GrabTask provides good abstraction layers. The use of **kwargs allows flexibility for subclass-specific parameters while maintaining a clean interface through TaskArgs.

@boczekbartek boczekbartek self-requested a review June 25, 2025 09:40
Member

@boczekbartek boczekbartek left a comment


@jmatejcz thank you for this PR! I ran the benchmark. I have a couple of questions:

  1. Are all these API call logs required?
(rai-framework-py3.10) robo-pc-005 ➜  rai git:(jm/feat/tool-calling-tasks) ✗ python src/rai_bench/rai_bench/examples/benchmarking_models.py                   
UserWarning: <built-in function allocate_lock> is not a Python type (it may be an instance of an object), Pydantic will allow any object with no validation since we
 cannot even enforce that the input is an instance of the given type. To get rid of this error wrap the type with `pydantic.SkipValidation`.                        
2025-06-30 10:26:05 robo-pc-005 httpx[1634151] INFO HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"
2025-06-30 10:26:05 robo-pc-005 httpx[1634151] INFO HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"                                            
2025-06-30 10:26:06 robo-pc-005 httpx[1634151] INFO HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"  

I also left some questions in the code.

  1. My main consideration is the level of customization of the brief, moderate and descriptive prompts. I am wondering how well they reflect practical use cases. Usually a more extended system prompt contains few-shot examples, while the current more complex prompts are more generic.
    Did you notice a performance increase with more complex prompts?

  2. Could you share some example results from the benchmark?

Comment on lines 110 to 114
prompt_detail: List[Literal["brief", "moderate", "descriptive"]] = [
"brief",
"moderate",
"descriptive",
],
Member

Is it described elsewhere what is the meaning of this argument? (besides the PR description)

Contributor Author

same as above

Member

As discussed, I think the moderate level doesn't provide additional information, so it would be better to remove it.
I think adding an example in the docs of how prompt_detail is set in the predefined tasks would also help, because it's hard to explain without an example.

Contributor Author

moderate removed here: c03f1cf
docs updated here: 80e5ad2

@jmatejcz
Contributor Author


@boczekbartek thank you for the review

Yes, I see that the descriptive prompt performs better, but I see almost no difference between moderate and brief.

There is the level of descriptiveness of a Task prompt, and there is the number of examples in the system prompt; these are two different params.

About the Task prompt:
At first I had the idea to make 3 levels of prompts,

  • where brief is just, for example: "Get RGB camera image"
  • moderate adds the tools that need to be used in this case
  • and descriptive adds how these tools should be called

But I came to the conclusion that telling the model directly what it has to call, and how, does not make much sense in a benchmark; the whole point is to evaluate how well the model deduces the calls. So I changed the approach to make it more of a guidance. Now descriptive adds, for example: "You can explore available camera topics and capture the RGB color image."
In this case the moderate level doesn't have much to add, so it's almost the same as the brief prompt, and that's why I think it should be removed now, but I'm not sure.
Those are the results on basic tasks:

[image: benchmark results on basic tasks]

supress info httpx logs
@jmatejcz
Contributor Author

@boczekbartek adjusted the manipulation_config name and suppressed httpx info logs here: 308c3ce

@jmatejcz jmatejcz requested a review from boczekbartek June 30, 2025 14:15
@jmatejcz jmatejcz requested a review from boczekbartek July 1, 2025 08:40
Member

@boczekbartek boczekbartek left a comment

@jmatejcz Thank you for applying the changes. LGTM!

@jmatejcz jmatejcz merged commit 083e7e5 into development Jul 1, 2025
6 checks passed
@jmatejcz jmatejcz deleted the jm/feat/tool-calling-tasks branch July 1, 2025 10:49