Visual browsing in CodeAct using set-of-marks annotated webpage screenshots #6464

adityasoni9998 · 2025-01-26T04:50:38Z

End-user friendly description of the problem this fixes or functionality that this introduces

Include this change in the Release Notes. If checked, you must provide an end-user friendly description for your change below

Give a summary of what the PR does, explaining any non-trivial design decisions
Allow visual browsing in CodeAct Agent by passing set-of-marks annotated webpage screenshots in LLM input. Previously, browsing in CodeAct was fully dependent on text-based observations like AXTree.

Link of any specific issues this addresses

…ted screenshot.

…4o models)

frontend/package-lock.json

openhands/agenthub/codeact_agent/codeact_agent.py

openhands/core/message.py

openhands/agenthub/codeact_agent/codeact_agent.py

xingyaoww

LGTM! This is exciting to have!

openhands/core/config/agent_config.py

li-boxuan

LGTM!

enyst · 2025-02-03T19:28:36Z

openhands/agenthub/codeact_agent/codeact_agent.py

+                and self.llm.vision_is_active()
+                and (
+                    self.mock_function_calling
+                    or self.llm.is_visual_browser_tool_active()


@adityasoni9998 Could you please tell, why do we check for self.mock_function_calling here? I'd like to remove this check and just check for self.llm.is_visual_browser_tool_active. After all, if it's not active, why would we insert that message?

I have kept self.mock_function_calling here to consider those cases where the LLM supports vision but it either does not support tool use or if user is trying to mock function calling. My understanding is that is if the parameter mock_function_calling is set to True, then we process all messages with role=tool and convert them to messages with role=user. So, this feature should still work in this case.

Thank you for the answer! I think that case is covered anyway because they're separate checks in llm.py: whether it supports vision and whether it supports function calling. I thought about this some more in the meantime, and I removed mock_function_calling completely from the agent. I think the result is equivalent. 😅

So you mean mock_function_calling is not required at all and is a redundant feature? If not, the expected behaviour should be that visual browsing can be done even when these three conditions hold true:

LLM supports vision

mock_function_calling is set to True

LLM doesn't support function calling (i.e. role='tool' is not allowed in LLM input)

It's redundant in the agent. Otherwise it's perfectly valid to ask whether a llm has function calling or not, and it happens in llm.py.

I'm a bit confused by those conditions though, possibly because this attribute really shouldn't exist. How about: visual browsing can be active, both

when fn calling is native,

and when it's not, just the same.

Is this statement correct?

Reference:

Simplify fn calling usage #6596

Ok I understood what you are doing in this PR. I think by removing mock_function_calling from this if block, we are essentially eliminating the possibility of using this visual browser feature with vision-supported LLMs (like gpt-4o). VISUAL_BROWSING_TOOL_SUPPORTED_MODELS consists of only those models which support images with role='tool'. I think we can do visual browsing even without tool-use (simply pass images with role='user', exactly what mock_function_calling did before) but the above PR will no longer allow that.

OK, we need to figure this out, thanks a lot for your time on this! because there seem to be some misunderstandings here (on my side too for sure):

not important, just to get this out of the way: role='tool' doesn't really define native fn calling ( for example, Anthropic Claude Sonnet 3.5 doesn't support role='tool', but does support fn calling of course natively; litellm is doing it the conversion for us undercover, and actually sends 'user' to anthropic API)

there is indeed an important difference in APIs, between those that support native fn calling, and those that have only non-native fn calling (the format is different; for example, we send a 'tools' parameter separately from messages)

non-native function calling is supported, that PR did not remove anything from the functionality itself. llm.py is doing it, with the helper fn_call_converter.py.

the models listed in VISUAL_BROWSING_TOOL_SUPPORTED_MODELS support the visual browsing tool. Both cases: whether fn calling is enabled or not. (is there a doubt on this? maybe we can try to make sure)

However, I seem to understand from what you say, that although VISUAL_BROWSING_TOOL_SUPPORTED_MODELS lists models, there are other models that are supported by visual browsing tool? Can you please tell, which models apart from those, are supported?

VISUAL_BROWSING_TOOL_SUPPORTED_MODELS consists of only those models which support images with role='tool'

Why does it need to list only models with native fn calling? Could we:

list ALL models that support the visual browsing tool, whether they have native fn calling or not?

or remove it all together, and assume that if a model supports vision, then it can support the tool too?

My understanding is that if tool-calling is not supported by the model at all, the current idea is to just replace role='tool' with role='user' in all messages without the need of explicitly setting mock_function_calling=True. If this is correct, then visual browsing will be supported by all models that support vision, which implies we can remove VISUAL_BROWSING_TOOL_SUPPORTED_MODELS altogether.

Yes! If tool calling is disabled, we convert the format in which send the request to the LLM API, into a text-based, regular format, practically we describe the equivalent "tools" in-context. This always happens if native is not supported at all or is disabled.

Great, thank you! More cleaning 😅

…nshots (All-Hands-AI#6464)

adityasoni9998 added 15 commits September 28, 2024 19:37

added gitignore

965cee7

Merge remote-tracking branch 'upstream/main'

a3c8bcc

Merge remote-tracking branch 'upstream/main'

c2505c0

Merge remote-tracking branch 'upstream/main'

6aba462

Merge remote-tracking branch 'upstream/main'

1fef52a

Merge remote-tracking branch 'upstream/main'

bad7ccf

Merge remote-tracking branch 'upstream/main'

5258fe8

Merge remote-tracking branch 'upstream/main'

dcdb448

Merge remote-tracking branch 'upstream/main'

072d956

Merge remote-tracking branch 'upstream/main'

ce0979f

Visual browsing using Set-of-marks annotated screenshot in CodeActAgent

f5eed29

Allow screenshot-based browsing in openhands with set-of-marks annota…

4315c22

…ted screenshot.

Merge remote-tracking branch 'upstream/main'

dfee306

Merge branch 'main' into codeact_browsing

f45e7ec

Added LLM-check for visual browsing tool usage. (not support for GPT-…

9b742c5

…4o models)

adityasoni9998 marked this pull request as ready for review January 26, 2025 18:24

xingyaoww reviewed Jan 27, 2025

View reviewed changes

frontend/package-lock.json Outdated Show resolved Hide resolved

openhands/agenthub/codeact_agent/codeact_agent.py Show resolved Hide resolved

enyst reviewed Jan 27, 2025

View reviewed changes

openhands/core/message.py Show resolved Hide resolved

adityasoni9998 added 3 commits January 27, 2025 18:38

Merge remote-tracking branch 'upstream/main'

6b94c97

Merge remote-tracking branch 'upstream/main' into codeact_browsing

70772cc

Undo changes in package-lock.json.

a7d38cd

enyst reviewed Jan 28, 2025

View reviewed changes

openhands/agenthub/codeact_agent/codeact_agent.py Show resolved Hide resolved

xingyaoww approved these changes Jan 28, 2025

View reviewed changes

openhands/core/config/agent_config.py Outdated Show resolved Hide resolved

xingyaoww reviewed Jan 28, 2025

View reviewed changes

openhands/core/config/agent_config.py Outdated Show resolved Hide resolved

adityasoni9998 added 4 commits January 30, 2025 13:46

Merge remote-tracking branch 'upstream/main'

26c4f72

Merge branch 'main' into codeact_browsing

818533f

Merge remote-tracking branch 'upstream/main'

a699a0d

Merge branch 'main' into codeact_browsing

d34c412

li-boxuan reviewed Jan 31, 2025

View reviewed changes

openhands/core/config/agent_config.py Outdated Show resolved Hide resolved

Merge remote-tracking branch 'upstream/main'

e66a113

adityasoni9998 added 3 commits February 1, 2025 11:37

Merge branch 'main' into codeact_browsing

ee1173e

Rename visual browsing flag in agent config.

83fa5b0

Minor bug-fix, rename flag for som-browsing in codeact.

0b6a385

li-boxuan approved these changes Feb 1, 2025

View reviewed changes

xingyaoww merged commit a593d9b into All-Hands-AI:main Feb 1, 2025
18 checks passed

enyst reviewed Feb 3, 2025

View reviewed changes

ryanhoangt mentioned this pull request Feb 3, 2025

[Experimental] Screenshot-based browsing #5324

Closed

1 task

zchn pushed a commit to zchn/OpenHands that referenced this pull request Feb 4, 2025

Visual browsing in CodeAct using set-of-marks annotated webpage scree…

f7bf4e1

…nshots (All-Hands-AI#6464)

li-boxuan mentioned this pull request Feb 9, 2025

Evaluation harness: Add agent config option #6662

Merged

1 task

enyst mentioned this pull request Feb 10, 2025

Clean up global in llm.py (we figured it's not needed) #6675

Merged

xingyaoww mentioned this pull request Apr 7, 2025

[Agent] Support browser control via screenshots #4570

Closed

Visual browsing in CodeAct using set-of-marks annotated webpage screenshots #6464

Visual browsing in CodeAct using set-of-marks annotated webpage screenshots #6464

Uh oh!

Conversation

adityasoni9998 commented Jan 26, 2025

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

xingyaoww left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

li-boxuan left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

enyst Feb 7, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

enyst Feb 7, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

enyst Feb 7, 2025 •

edited

Loading

enyst Feb 7, 2025 •

edited

Loading