Visual browsing in CodeAct using set-of-marks annotated webpage screenshots #6464
Conversation
LGTM! This is exciting to have!
LGTM!
```python
and self.llm.vision_is_active()
and (
    self.mock_function_calling
    or self.llm.is_visual_browser_tool_active()
```
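For context, a minimal sketch of the full guard as a standalone predicate (the function wrapper is invented here; only the inner boolean expression comes from the diff):

```python
# Hypothetical wrapper around the condition shown in the diff above;
# `agent` stands in for the CodeAct agent instance. Only the inner
# boolean expression is taken from the diff itself.
def should_attach_som_screenshot(agent) -> bool:
    return (
        agent.llm.vision_is_active()  # the model can consume images at all
        and (
            agent.mock_function_calling  # tool results get re-sent as user messages
            or agent.llm.is_visual_browser_tool_active()  # model accepts images in tool results
        )
    )
```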
@adityasoni9998 Could you please tell why we check for `self.mock_function_calling` here? I'd like to remove this check and just check for `self.llm.is_visual_browser_tool_active()`. After all, if it's not active, why would we insert that message?
I have kept `self.mock_function_calling` here to cover those cases where the LLM supports vision but either does not support tool use or the user is trying to mock function calling. My understanding is that if the parameter `mock_function_calling` is set to True, then we process all messages with role='tool' and convert them to messages with role='user'. So this feature should still work in that case.
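A rough sketch of the conversion described above (the helper name is hypothetical; the real logic lives in `llm.py`):

```python
# Hypothetical illustration of what mock function calling does to roles:
# tool results are re-sent as plain user messages, so models without
# native tool support can still consume them (images included).
def demote_tool_messages(messages: list[dict]) -> list[dict]:
    converted = []
    for msg in messages:
        if msg.get('role') == 'tool':
            msg = {**msg, 'role': 'user'}  # keep the content, change the role
            msg.pop('tool_call_id', None)  # this field is only valid on role='tool'
        converted.append(msg)
    return converted
```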
Thank you for the answer! I think that case is covered anyway, because they're separate checks in `llm.py`: whether it supports vision and whether it supports function calling. I thought about this some more in the meantime, and I removed `mock_function_calling` completely from the agent. I think the result is equivalent. 😅
So you mean `mock_function_calling` is not required at all and is a redundant feature? If not, the expected behaviour should be that visual browsing can be done even when these three conditions hold true:
- LLM supports vision
- `mock_function_calling` is set to True
- LLM doesn't support function calling (i.e. role='tool' is not allowed in LLM input)
It's redundant in the agent. Otherwise it's perfectly valid to ask whether an LLM has function calling or not, and that happens in `llm.py`.
I'm a bit confused by those conditions, though, possibly because this attribute really shouldn't exist. How about: visual browsing can be active both
- when fn calling is native,
- and when it's not, just the same.

Is this statement correct?
Reference:
Ok, I understand what you are doing in this PR. I think by removing `mock_function_calling` from this if block, we are essentially eliminating the possibility of using this visual browsing feature with vision-capable LLMs (like gpt-4o). VISUAL_BROWSING_TOOL_SUPPORTED_MODELS consists of only those models which support images with role='tool'. I think we can do visual browsing even without tool use (simply pass images with role='user', exactly what `mock_function_calling` did before), but the above PR will no longer allow that.
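To make the role distinction concrete, here is the same screenshot attached both ways (OpenAI-style chat payloads; the ids and text are made up):

```python
# Illustrative payloads only. A set-of-marks screenshot attached as a tool
# result requires a model that accepts images with role='tool'; attaching
# it as a user message works with any vision model (the mock path).
screenshot_part = {
    'type': 'image_url',
    'image_url': {'url': 'data:image/png;base64,<som-screenshot>'},
}

as_tool_result = {
    'role': 'tool',
    'tool_call_id': 'call_123',  # made-up id
    'content': [screenshot_part],
}

as_user_message = {
    'role': 'user',
    'content': [
        {'type': 'text', 'text': 'Current annotated webpage screenshot:'},
        screenshot_part,
    ],
}
```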
OK, we need to figure this out (thanks a lot for your time on this!), because there seem to be some misunderstandings here, on my side too for sure:
- not important, just to get this out of the way: role='tool' doesn't really define native fn calling (for example, Anthropic Claude Sonnet 3.5 doesn't support role='tool', but of course does support fn calling natively; litellm does the conversion for us under the hood and actually sends 'user' to the Anthropic API)
- there is indeed an important difference between APIs that support native fn calling and those that have only non-native fn calling (the format is different; for example, we send a 'tools' parameter separately from the messages, as in the sketch after this list)
- non-native function calling is supported; that PR did not remove anything from the functionality itself. `llm.py` does it, with the helper `fn_call_converter.py`.
- the models listed in VISUAL_BROWSING_TOOL_SUPPORTED_MODELS support the visual browsing tool in both cases, whether fn calling is enabled or not. (Is there a doubt about this? Maybe we can try to make sure.)

However, I seem to understand from what you say that although VISUAL_BROWSING_TOOL_SUPPORTED_MODELS lists some models, there are other models that are supported by the visual browsing tool? Can you please tell which models, apart from those, are supported?
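To illustrate the API difference from the second bullet, a schematic litellm call in both styles (the tool definition, prompt text, and model names are invented for illustration):

```python
import litellm

messages = [{'role': 'user', 'content': 'Open the docs page and summarize it.'}]
tool = {
    'type': 'function',
    'function': {
        'name': 'browse',
        'description': 'Interact with the browser.',
        'parameters': {'type': 'object', 'properties': {'code': {'type': 'string'}}},
    },
}

# Native fn calling: tools travel as a separate, structured parameter.
litellm.completion(model='gpt-4o', messages=messages, tools=[tool])

# Non-native fn calling: the same tool is described in-context as text,
# and the model is asked to emit calls in a parseable wrapper.
system = {
    'role': 'system',
    'content': 'You can call browse(code). Wrap calls in <execute_browse>...</execute_browse>.',
}
litellm.completion(model='ollama/llama3', messages=[system, *messages])
```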
> VISUAL_BROWSING_TOOL_SUPPORTED_MODELS consists of only those models which support images with role='tool'

Why does it need to list only models with native fn calling? Could we:
- list ALL models that support the visual browsing tool, whether they have native fn calling or not?
- or remove it altogether, and assume that if a model supports vision, then it can support the tool too?
My understanding is that if tool calling is not supported by the model at all, the current idea is to just replace role='tool' with role='user' in all messages, without the need to explicitly set `mock_function_calling=True`. If this is correct, then visual browsing will be supported by all models that support vision, which implies we can remove VISUAL_BROWSING_TOOL_SUPPORTED_MODELS altogether.
Yes! If tool calling is disabled, we convert the format in which we send the request to the LLM API into a text-based, regular format; practically, we describe the equivalent "tools" in-context. This always happens if native fn calling is not supported at all or is disabled.
Great, thank you! More cleaning 😅
End-user friendly description of the problem this fixes or functionality that this introduces
Give a summary of what the PR does, explaining any non-trivial design decisions
Allow visual browsing in CodeAct Agent by passing set-of-marks annotated webpage screenshots in LLM input. Previously, browsing in CodeAct was fully dependent on text-based observations like AXTree.
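As a rough sketch of the mechanism (function and field names here are illustrative, not the PR's actual code), the browser observation now carries both the text observation and the annotated screenshot:

```python
# Illustrative: turn a browser observation into a multimodal message that
# pairs the text-based AXTree with the set-of-marks (SoM) screenshot.
def browser_observation_to_message(axtree_text: str, som_screenshot_b64: str) -> dict:
    return {
        'role': 'user',
        'content': [
            {'type': 'text', 'text': f'Accessibility tree:\n{axtree_text}'},
            {
                'type': 'image_url',
                'image_url': {'url': f'data:image/png;base64,{som_screenshot_b64}'},
            },
        ],
    }
```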
Link of any specific issues this addresses