Visual browsing in CodeAct using set-of-marks annotated webpage screenshots #6464


Merged (26 commits, Feb 1, 2025)
26 commits
All commits by adityasoni9998:

965cee7  added gitignore (Sep 28, 2024)
a3c8bcc  Merge remote-tracking branch 'upstream/main' (Oct 5, 2024)
c2505c0  Merge remote-tracking branch 'upstream/main' (Oct 7, 2024)
6aba462  Merge remote-tracking branch 'upstream/main' (Oct 12, 2024)
1fef52a  Merge remote-tracking branch 'upstream/main' (Nov 6, 2024)
bad7ccf  Merge remote-tracking branch 'upstream/main' (Dec 6, 2024)
5258fe8  Merge remote-tracking branch 'upstream/main' (Dec 6, 2024)
dcdb448  Merge remote-tracking branch 'upstream/main' (Dec 6, 2024)
072d956  Merge remote-tracking branch 'upstream/main' (Jan 20, 2025)
ce0979f  Merge remote-tracking branch 'upstream/main' (Jan 24, 2025)
f5eed29  Visual browsing using Set-of-marks annotated screenshot in CodeActAgent (Jan 25, 2025)
4315c22  Allow screenshot-based browsing in openhands with set-of-marks annota… (Jan 26, 2025)
dfee306  Merge remote-tracking branch 'upstream/main' (Jan 26, 2025)
f45e7ec  Merge branch 'main' into codeact_browsing (Jan 26, 2025)
9b742c5  Added LLM-check for visual browsing tool usage. (not support for GPT-… (Jan 26, 2025)
6b94c97  Merge remote-tracking branch 'upstream/main' (Jan 27, 2025)
70772cc  Merge remote-tracking branch 'upstream/main' into codeact_browsing (Jan 27, 2025)
a7d38cd  Undo changes in package-lock.json. (Jan 27, 2025)
26c4f72  Merge remote-tracking branch 'upstream/main' (Jan 30, 2025)
818533f  Merge branch 'main' into codeact_browsing (Jan 30, 2025)
a699a0d  Merge remote-tracking branch 'upstream/main' (Jan 30, 2025)
d34c412  Merge branch 'main' into codeact_browsing (Jan 30, 2025)
e66a113  Merge remote-tracking branch 'upstream/main' (Feb 1, 2025)
ee1173e  Merge branch 'main' into codeact_browsing (Feb 1, 2025)
83fa5b0  Rename visual browsing flag in agent config. (Feb 1, 2025)
0b6a385  Minor bug-fix, rename flag for som-browsing in codeact. (Feb 1, 2025)
29 changes: 25 additions & 4 deletions openhands/agenthub/codeact_agent/codeact_agent.py
@@ -11,6 +11,7 @@
from openhands.core.config import AgentConfig
from openhands.core.logger import openhands_logger as logger
from openhands.core.message import ImageContent, Message, TextContent
from openhands.core.schema import ActionType
from openhands.events.action import (
Action,
AgentDelegateAction,
@@ -304,10 +305,30 @@ def get_observation_message(
) # Content is already truncated by openhands-aci
elif isinstance(obs, BrowserOutputObservation):
text = obs.get_agent_obs_text()
message = Message(
role='user',
content=[TextContent(text=text)],
)
if (
obs.trigger_by_action == ActionType.BROWSE_INTERACTIVE
and obs.set_of_marks is not None
and len(obs.set_of_marks) > 0
and self.config.enable_som_visual_browsing
and self.llm.vision_is_active()
and (
self.mock_function_calling
or self.llm.is_visual_browser_tool_active()
Collaborator:

@adityasoni9998 Could you please tell why we check for self.mock_function_calling here? I'd like to remove this check and just check self.llm.is_visual_browser_tool_active. After all, if it's not active, why would we insert that message?

Contributor Author (@adityasoni9998):

I have kept self.mock_function_calling here to cover cases where the LLM supports vision but either does not support tool use or the user is mocking function calling. My understanding is that if mock_function_calling is set to True, we process all messages with role='tool' and convert them to messages with role='user', so this feature should still work in that case.
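The conversion described here can be illustrated with a minimal, hypothetical sketch (the real OpenHands logic lives in llm.py and fn_call_converter.py and is more involved):

```python
def downgrade_tool_messages(messages: list[dict]) -> list[dict]:
    """Rewrite role='tool' messages as role='user' messages so that
    models without native function calling can still consume them."""
    converted = []
    for msg in messages:
        if msg.get('role') == 'tool':
            # copy so the caller's original message list is untouched
            converted.append({**msg, 'role': 'user'})
        else:
            converted.append(msg)
    return converted
```

Under this scheme an image attached to a tool result is delivered as ordinary user content, which is why vision-capable models without tool support could still receive screenshots.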

Collaborator:

Thank you for the answer! I think that case is covered anyway because they're separate checks in llm.py: whether it supports vision and whether it supports function calling. I thought about this some more in the meantime, and I removed mock_function_calling completely from the agent. I think the result is equivalent. 😅

Contributor Author (@adityasoni9998):

So you mean mock_function_calling is not required at all and is a redundant feature? If not, the expected behaviour should be that visual browsing can still be done when these three conditions all hold:

  1. the LLM supports vision,
  2. mock_function_calling is set to True,
  3. the LLM doesn't support function calling (i.e. role='tool' is not allowed in LLM input).

Collaborator:

It's redundant in the agent. Otherwise it's perfectly valid to ask whether an LLM has function calling or not, and that happens in llm.py.

I'm a bit confused by those conditions though, possibly because this attribute really shouldn't exist. How about: visual browsing can be active, both

  • when fn calling is native,
  • and when it's not, just the same.

Is this statement correct?

Reference:

Contributor Author (@adityasoni9998):

Ok, I understand what you are doing in this PR. I think by removing mock_function_calling from this if block, we are essentially eliminating the possibility of using this visual browsing feature with vision-capable LLMs (like gpt-4o). VISUAL_BROWSING_TOOL_SUPPORTED_MODELS consists of only those models which support images with role='tool'. I think we can do visual browsing even without tool use (simply pass images with role='user', exactly what mock_function_calling did before), but the above PR will no longer allow that.

Collaborator (@enyst, Feb 7, 2025):

OK, we need to figure this out; thanks a lot for your time on this! There seem to be some misunderstandings here (on my side too, for sure):

  • not important, just to get it out of the way: role='tool' doesn't really define native fn calling (for example, Anthropic Claude Sonnet 3.5 doesn't support role='tool', but does of course support fn calling natively; litellm does the conversion for us under the hood, and actually sends 'user' to the Anthropic API)
  • there is indeed an important difference in APIs between those that support native fn calling and those that have only non-native fn calling (the format is different; for example, we send a 'tools' parameter separately from messages)
  • non-native function calling is supported; that PR did not remove anything from the functionality itself. llm.py does it, with the helper fn_call_converter.py.
  • the models listed in VISUAL_BROWSING_TOOL_SUPPORTED_MODELS support the visual browsing tool in both cases: whether fn calling is enabled or not. (Is there a doubt on this? Maybe we can try to make sure.)

However, I seem to understand from what you say that although VISUAL_BROWSING_TOOL_SUPPORTED_MODELS lists models, there are other models supported by the visual browsing tool? Can you please tell which models, apart from those, are supported?

Collaborator (@enyst, Feb 7, 2025):

> VISUAL_BROWSING_TOOL_SUPPORTED_MODELS consists of only those models which support images with role='tool'

Why does it need to list only models with native fn calling? Could we:

  • list ALL models that support the visual browsing tool, whether they have native fn calling or not?
  • or remove it altogether, and assume that if a model supports vision, then it can support the tool too?

Contributor Author (@adityasoni9998):

My understanding is that if tool calling is not supported by the model at all, the current idea is to just replace role='tool' with role='user' in all messages, without the need to explicitly set mock_function_calling=True. If this is correct, then visual browsing will be supported by all models that support vision, which implies we can remove VISUAL_BROWSING_TOOL_SUPPORTED_MODELS altogether.

Collaborator:

Yes! If tool calling is disabled, we convert the request we send to the LLM API into a text-based, regular format; practically, we describe the equivalent "tools" in-context. This always happens if native fn calling is not supported at all or is disabled.

Great, thank you! More cleaning 😅

)
):
text += 'Image: Current webpage screenshot (Note that only visible portion of webpage is present in the screenshot. You may need to scroll to view the remaining portion of the web-page.)\n'
message = Message(
role='user',
content=[
TextContent(text=text),
ImageContent(image_urls=[obs.set_of_marks]),
],
)
else:
message = Message(
role='user',
content=[TextContent(text=text)],
)
elif isinstance(obs, AgentDelegateObservation):
text = truncate_content(
obs.outputs['content'] if 'content' in obs.outputs else '',
1 change: 1 addition & 0 deletions openhands/core/config/agent_config.py
@@ -21,6 +21,7 @@ class AgentConfig(BaseModel):
"""

codeact_enable_browsing: bool = Field(default=True)
enable_som_visual_browsing: bool = Field(default=False)
codeact_enable_llm_editor: bool = Field(default=False)
codeact_enable_jupyter: bool = Field(default=True)
micro_agent_name: str | None = Field(default=None)
6 changes: 5 additions & 1 deletion openhands/core/message.py
@@ -101,7 +101,11 @@ def _list_serializer(self) -> dict:
# See discussion here for details: https://github.com/BerriAI/litellm/issues/6422#issuecomment-2438765472
if self.role == 'tool' and item.cache_prompt:
role_tool_with_prompt_caching = True
d.pop('cache_control')
if isinstance(d, dict):
d.pop('cache_control')
elif isinstance(d, list):
for d_item in d:
d_item.pop('cache_control')
if isinstance(item, TextContent):
content.append(d)
elif isinstance(item, ImageContent) and self.vision_enabled:
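The dict-vs-list branch added above exists because a serialized content item may be either a single dict (text content) or a list of dicts (e.g. one per image URL). A hypothetical, standalone sketch of the stripping logic (using a `None` default so a missing key is tolerated, which the diff's version does not do):

```python
def strip_cache_control(d):
    """Remove 'cache_control' from a serialized content item, which may
    be a single dict or a list of dicts (one per image URL)."""
    if isinstance(d, dict):
        d.pop('cache_control', None)
    elif isinstance(d, list):
        for d_item in d:
            d_item.pop('cache_control', None)
    return d
```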
19 changes: 19 additions & 0 deletions openhands/llm/llm.py
@@ -74,6 +74,16 @@
'o1-2024-12-17',
]

# visual browsing tool supported models
# This flag is needed since gpt-4o and gpt-4o-mini do not allow passing image_urls with role='tool'
VISUAL_BROWSING_TOOL_SUPPORTED_MODELS = [
'claude-3-5-sonnet',
'claude-3-5-sonnet-20240620',
'claude-3-5-sonnet-20241022',
'o1-2024-12-17',
]


REASONING_EFFORT_SUPPORTED_MODELS = [
'o1-2024-12-17',
]
@@ -474,6 +484,15 @@ def is_function_calling_active(self) -> bool:
)
return supports_fn_call

def is_visual_browser_tool_active(self) -> bool:
return (
self.config.model in VISUAL_BROWSING_TOOL_SUPPORTED_MODELS
or self.config.model.split('/')[-1] in VISUAL_BROWSING_TOOL_SUPPORTED_MODELS
or any(
m in self.config.model for m in VISUAL_BROWSING_TOOL_SUPPORTED_MODELS
)
)

def _post_completion(self, response: ModelResponse) -> float:
"""Post-process the completion response.

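The membership test in is_visual_browser_tool_active matches bare model names, provider-prefixed names, and any name containing a supported model as a substring. The same check as a standalone sketch (hypothetical helper name):

```python
VISUAL_BROWSING_TOOL_SUPPORTED_MODELS = [
    'claude-3-5-sonnet',
    'claude-3-5-sonnet-20240620',
    'claude-3-5-sonnet-20241022',
    'o1-2024-12-17',
]

def supports_visual_browsing(model: str) -> bool:
    """Match exact names, provider-prefixed names ('anthropic/...'),
    and names that contain a supported model as a substring."""
    return (
        model in VISUAL_BROWSING_TOOL_SUPPORTED_MODELS
        or model.split('/')[-1] in VISUAL_BROWSING_TOOL_SUPPORTED_MODELS
        or any(m in model for m in VISUAL_BROWSING_TOOL_SUPPORTED_MODELS)
    )
```

Note the substring clause also matches dated variants not listed explicitly, since they contain the bare 'claude-3-5-sonnet' entry.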