-
Notifications
You must be signed in to change notification settings - Fork 7.1k
Visual browsing in CodeAct using set-of-marks annotated webpage screenshots #6464
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
965cee7
a3c8bcc
c2505c0
6aba462
1fef52a
bad7ccf
5258fe8
dcdb448
072d956
ce0979f
f5eed29
4315c22
dfee306
f45e7ec
9b742c5
6b94c97
70772cc
a7d38cd
26c4f72
818533f
a699a0d
d34c412
e66a113
ee1173e
83fa5b0
0b6a385
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -11,6 +11,7 @@ | |
from openhands.core.config import AgentConfig | ||
from openhands.core.logger import openhands_logger as logger | ||
from openhands.core.message import ImageContent, Message, TextContent | ||
from openhands.core.schema import ActionType | ||
from openhands.events.action import ( | ||
Action, | ||
AgentDelegateAction, | ||
|
@@ -304,10 +305,30 @@ def get_observation_message( | |
) # Content is already truncated by openhands-aci | ||
elif isinstance(obs, BrowserOutputObservation): | ||
text = obs.get_agent_obs_text() | ||
message = Message( | ||
role='user', | ||
content=[TextContent(text=text)], | ||
) | ||
if ( | ||
obs.trigger_by_action == ActionType.BROWSE_INTERACTIVE | ||
enyst marked this conversation as resolved.
Show resolved
Hide resolved
|
||
and obs.set_of_marks is not None | ||
and len(obs.set_of_marks) > 0 | ||
and self.config.enable_som_visual_browsing | ||
and self.llm.vision_is_active() | ||
and ( | ||
self.mock_function_calling | ||
or self.llm.is_visual_browser_tool_active() | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @adityasoni9998 Could you please tell, why do we check for There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I have kept There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Thank you for the answer! I think that case is covered anyway because they're separate checks in There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. So you mean
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It's redundant in the agent. Otherwise it's perfectly valid to ask whether a llm has function calling or not, and it happens in llm.py. I'm a bit confused by those conditions though, possibly because this attribute really shouldn't exist. How about: visual browsing can be active, both
Is this statement correct? Reference: There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Ok I understood what you are doing in this PR. I think by removing mock_function_calling from this if block, we are essentially eliminating the possibility of using this visual browser feature with vision-supported LLMs (like gpt-4o). VISUAL_BROWSING_TOOL_SUPPORTED_MODELS consists of only those models which support images with role='tool'. I think we can do visual browsing even without tool-use (simply pass images with role='user', exactly what mock_function_calling did before) but the above PR will no longer allow that. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. OK, we need to figure this out, thanks a lot for your time on this! because there seem to be some misunderstandings here (on my side too for sure):
However, I seem to understand from what you say, that although VISUAL_BROWSING_TOOL_SUPPORTED_MODELS lists models, there are other models that are supported by visual browsing tool? Can you please tell, which models apart from those, are supported? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Why does it need to list only models with native fn calling? Could we:
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. My understanding is that if tool-calling is not supported by the model at all, the current idea is to just replace role='tool' with role='user' in all messages without the need of explicitly setting There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes! If tool calling is disabled, we convert the format in which send the request to the LLM API, into a text-based, regular format, practically we describe the equivalent "tools" in-context. This always happens if native is not supported at all or is disabled. Great, thank you! More cleaning 😅 |
||
) | ||
): | ||
text += 'Image: Current webpage screenshot (Note that only visible portion of webpage is present in the screenshot. You may need to scroll to view the remaining portion of the web-page.)\n' | ||
message = Message( | ||
role='user', | ||
content=[ | ||
TextContent(text=text), | ||
ImageContent(image_urls=[obs.set_of_marks]), | ||
], | ||
) | ||
else: | ||
message = Message( | ||
role='user', | ||
content=[TextContent(text=text)], | ||
) | ||
elif isinstance(obs, AgentDelegateObservation): | ||
text = truncate_content( | ||
obs.outputs['content'] if 'content' in obs.outputs else '', | ||
|
Uh oh!
There was an error while loading. Please reload this page.