Skip to content

[WIP] Multiaction Support, CUA and refactoring #257

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 14 commits into
base: main
Choose a base branch
from

Conversation

amanjaiswal73892
Copy link
Collaborator

@amanjaiswal73892 amanjaiswal73892 commented Jun 30, 2025

This pull request introduces a new OpenAICUAModel for handling OpenAI's Computer Use Agent (CUA) functionality, along with several updates to improve tool call handling, multi-action support, and cost calculation. The changes span across multiple files, enhancing the agent's capabilities and refining its behavior.

New OpenAICUAModel Implementation:

  • Added OpenAICUAModel class in src/agentlab/agents/tool_use_agent/openai_cua.py to support OpenAI's CUA, including methods for API calls and handling various computer actions like clicks, scrolls, and drags. This also includes a default configuration for CUA-specific behavior.

Enhancements to Tool Use Agent:

  • Introduced contains_image method in tool_use_agent.py to detect the presence of images in groups, and updated the apply method in the Obs class to handle OpenAI CUA mode-specific behavior. [1] [2]
  • Modified the Goal class to allow custom system messages, enabling dynamic behavior based on configuration.
  • Added multi-action support in the get_action method, allowing multiple actions in a single step if configured. [1] [2]

WIP Config Classes:

  • Added WIP config classes for APIPayload config for llm calls and other config management.

Testing Improvements:

  • Adjusted test cases in test_response_api.py to align with the new tool call handling and action representation, ensuring robust validation of the changes. [1] [2] [3]
  • WIP Tests need to be updated for multiactions representation which is independent of the environment.

Refinements to Cost Calculation:

  • Updated cost calculation logic in tracking.py to handle both chat completion and response APIs, ensuring accurate token usage and caching details. [1] [2]

Testing Improvements:

  • Adjusted test cases in test_response_api.py to align with the new tool call handling and action representation, ensuring robust validation of the changes. [1] [2] [3]

Minor Updates:

  • Added ToolCalls import to relevant files for improved tool call handling. [1] [2]
  • Deprecated the ToolCall class in tool_use_agent.py by commenting out its implementation.

These changes collectively enhance the agent's functionality and adaptability, particularly for OpenAI CUA scenarios, while improving maintainability and test coverage.

Description by Korbit AI

What change is being made?

Add multiaction support for the Computer Use Agent (CUA), integrate OpenAI API enhancements, and refactor tool call handling alongside configuration updates.

Why are these changes being made?

The addition of multiaction capabilities enhances agent functionality by allowing multiple actions per step, aligning with more complex real-world scenarios. The integration of OpenAI API enrichments introduces robust handling and conversion of tool calls, while refactoring improves maintainability, code clarity, and positioning for future feature expansions.

Is this description stale? Ask me to generate a new description by commenting /korbit-generate-pr-description

@amanjaiswal73892 amanjaiswal73892 marked this pull request as draft June 30, 2025 18:10
Copy link

@korbit-ai korbit-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review by Korbit AI

Korbit automatically attempts to detect when you fix issues in new commits.
Category Issue Status
Performance Unreachable Conditional Logic ▹ view
Performance Inefficient Token Detail Extraction ▹ view
Error Handling Insufficient error context ▹ view
Functionality Incomplete Multi-key Processing ▹ view
Performance Inefficient Message Type Checking in Loop ▹ view
Functionality Missing CUA to Environment Action Conversion ▹ view
Files scanned
File Path Reviewed
src/agentlab/agents/tool_use_agent/openai_cua.py
src/agentlab/llm/tracking.py
src/agentlab/agents/tool_use_agent/tool_use_agent.py
src/agentlab/analyze/agent_xray.py
src/agentlab/llm/response_api.py

Explore our documentation to understand the languages and file types we support and the files we ignore.

Check out our docs on how you can make Korbit work best for you and your team.

Loving Korbit!? Share us on LinkedIn Reddit and X

Comment on lines +309 to +322
api_type = 'chatcompletion' if hasattr(usage, "prompt_tokens_details") else 'response'
if api_type == 'chatcompletion':
total_input_tokens = usage.prompt_tokens
output_tokens = usage.completion_tokens
cached_input_tokens = usage.prompt_tokens_details.cached_tokens
non_cached_input_tokens = total_input_tokens - cached_input_tokens
elif api_type == 'response':
total_input_tokens = usage.input_tokens
output_tokens = usage.output_tokens
cached_input_tokens = usage.input_tokens_details.cached_tokens
non_cached_input_tokens = total_input_tokens - cached_input_tokens
else:
logging.warning(f"Unsupported API type: {api_type}. Defaulting cost to 0.0.")
return 0.0
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unreachable Conditional Logic category Performance

Tell me more
What is the issue?

Redundant API type check that can never reach the else branch due to the binary ternary operator assignment.

Why this matters

Unnecessary conditional logic and error handling that adds computational overhead with no possible benefit since api_type can only be 'chatcompletion' or 'response'.

Suggested change ∙ Feature Preview

Remove the else branch since it's impossible to reach and simplify the code to:

api_type = 'chatcompletion' if hasattr(usage, "prompt_tokens_details") else 'response'
if api_type == 'chatcompletion':
    ...
else:  # api_type == 'response'
    ...
Provide feedback to improve future suggestions

Nice Catch Incorrect Not in Scope Not in coding standard Other

💬 Looking for more details? Reply to this comment to chat with Korbit.

Comment on lines 165 to +169
usage = dict(getattr(response, "usage", {}))
if 'prompt_tokens_details' in usage:
usage['cached_tokens'] = usage['prompt_token_details'].cached_tokens
if 'input_tokens_details' in usage:
usage['cached_tokens'] = usage['input_tokens_details'].cached_tokens
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inefficient Token Detail Extraction category Performance

Tell me more
What is the issue?

Dictionary conversion and multiple attribute lookups for cached tokens are performed unnecessarily.

Why this matters

Converting usage to dict and performing multiple lookups creates unnecessary object creation and memory allocation. The second condition overwrites the first one if both are true.

Suggested change ∙ Feature Preview

Optimize the code to avoid dict conversion and prevent redundant assignments:

usage = getattr(response, "usage", {})
cached_tokens = None
if hasattr(usage, "prompt_tokens_details"):
    cached_tokens = usage.prompt_tokens_details.cached_tokens
elif hasattr(usage, "input_tokens_details"):
    cached_tokens = usage.input_tokens_details.cached_tokens
if cached_tokens is not None:
    usage['cached_tokens'] = cached_tokens
Provide feedback to improve future suggestions

Nice Catch Incorrect Not in Scope Not in coding standard Other

💬 Looking for more details? Reply to this comment to chat with Korbit.

Comment on lines +73 to +87
try:
action_mapping = {
"click": lambda: self._handle_click_action(action),
"scroll": lambda: self._handle_scroll_action(action),
"keypress": lambda: self._handle_keypress_action(action),
"type": lambda: self._handle_type_action(action),
"wait": lambda: self._handle_wait_action(action),
"screenshot": lambda: self._handle_screenshot_action(action),
"drag": lambda: self._handle_drag_action(action),
}

if action_type in action_mapping:
return action_mapping[action_type]()
else:
raise ValueError(f"Unrecognized openAI CUA action type: {action_type}")
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Insufficient error context category Error Handling

Tell me more
What is the issue?

The error message for unrecognized action types doesn't include the list of valid actions.

Why this matters

Without knowing the valid options, developers will have to search through code to understand what actions are supported, making debugging and usage more difficult.

Suggested change ∙ Feature Preview

Include valid actions in error message:

valid_actions = list(action_mapping.keys())
if action_type in action_mapping:
    return action_mapping[action_type]()
else:
    raise ValueError(f"Unrecognized openAI CUA action type: {action_type}. Valid actions are: {valid_actions}")
Provide feedback to improve future suggestions

Nice Catch Incorrect Not in Scope Not in coding standard Other

💬 Looking for more details? Reply to this comment to chat with Korbit.

Comment on lines +104 to +113
def _handle_keypress_action(self, action):
keys = action.keys
#TODO: Check this if is suitable for BGYM env.
for k in keys:
print(f"Action: keypress '{k}'")
if k.lower() == "enter":
key = "Enter"
elif k.lower() == "space":
key = " "
return "keyboard_press", {"key": key}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Incomplete Multi-key Processing category Functionality

Tell me more
What is the issue?

The keypress handler returns inside the for loop after processing only the first key, ignoring all subsequent keys in the action.keys list.

Why this matters

This will cause multi-key combinations (like Ctrl+C) to fail as only the first key will be processed.

Suggested change ∙ Feature Preview
def _handle_keypress_action(self, action):
    keys = action.keys
    processed_keys = []
    for k in keys:
        if k.lower() == "enter":
            processed_keys.append("Enter")
        elif k.lower() == "space":
            processed_keys.append(" ")
        else:
            processed_keys.append(k)
    return "keyboard_press", {"keys": processed_keys}
Provide feedback to improve future suggestions

Nice Catch Incorrect Not in Scope Not in coding standard Other

💬 Looking for more details? Reply to this comment to chat with Korbit.

Comment on lines +102 to +103
if not isinstance(messages[i], ToolCalls):
messages[i].mark_all_previous_msg_for_caching()
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inefficient Message Type Checking in Loop category Performance

Tell me more
What is the issue?

Type checking in a loop to determine caching behavior creates unnecessary overhead for each message iteration.

Why this matters

Frequent type checking in loops can impact performance, especially with large message histories. The caching logic could be optimized to avoid repeated type checks.

Suggested change ∙ Feature Preview

Consider restructuring the caching logic to handle message types more efficiently. One approach is to separate ToolCalls messages earlier:

if messages and not isinstance(messages[-1], ToolCalls):
    messages[-1].mark_all_previous_msg_for_caching()
Provide feedback to improve future suggestions

Nice Catch Incorrect Not in Scope Not in coding standard Other

💬 Looking for more details? Reply to this comment to chat with Korbit.

Comment on lines +635 to +636
def cua_action_to_env_tool_name_and_args(self, action: str) -> tuple[str, Dict[str, Any]]:
pass
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing CUA to Environment Action Conversion category Functionality

Tell me more
What is the issue?

Empty function implementation for handling Computer Use Actions (CUA) to environment tool name and arguments conversion.

Why this matters

Will cause system failure when OpenAI returns a computer_call action type since the conversion functionality is not implemented.

Suggested change ∙ Feature Preview

Implement the CUA conversion function to properly handle OpenAI's Computer Use Actions format:

def cua_action_to_env_tool_name_and_args(self, action: str) -> tuple[str, Dict[str, Any]]:
    # Example implementation - adapt based on actual CUA format
    action_parts = action.split('(', 1)
    action_name = action_parts[0].strip()
    args_str = action_parts[1].rstrip(')')
    
    # Parse arguments string to dictionary
    args_dict = {}
    if args_str:
        for arg in args_str.split(','):
            key, value = arg.split('=', 1)
            args_dict[key.strip()] = value.strip().strip('"')
            
    return action_name, args_dict
Provide feedback to improve future suggestions

Nice Catch Incorrect Not in Scope Not in coding standard Other

💬 Looking for more details? Reply to this comment to chat with Korbit.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants