Skip to content

Visual browsing in CodeAct using set-of-marks annotated webpage screenshots #6464

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 26 commits into from
Feb 1, 2025

Conversation

adityasoni9998
Copy link
Contributor

End-user friendly description of the problem this fixes or functionality that this introduces

  • Include this change in the Release Notes. If checked, you must provide an end-user friendly description for your change below

Give a summary of what the PR does, explaining any non-trivial design decisions
Allow visual browsing in CodeAct Agent by passing set-of-marks annotated webpage screenshots in LLM input. Previously, browsing in CodeAct was fully dependent on text-based observations like AXTree.


Link of any specific issues this addresses

@adityasoni9998 adityasoni9998 marked this pull request as ready for review January 26, 2025 18:24
Copy link
Collaborator

@xingyaoww xingyaoww left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! This is exciting to have!

Copy link
Collaborator

@li-boxuan li-boxuan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@xingyaoww xingyaoww merged commit a593d9b into All-Hands-AI:main Feb 1, 2025
18 checks passed
and self.llm.vision_is_active()
and (
self.mock_function_calling
or self.llm.is_visual_browser_tool_active()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@adityasoni9998 Could you please tell, why do we check for self.mock_function_calling here? I'd like to remove this check and just check for self.llm.is_visual_browser_tool_active. After all, if it's not active, why would we insert that message?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have kept self.mock_function_calling here to consider those cases where the LLM supports vision but it either does not support tool use or if user is trying to mock function calling. My understanding is that is if the parameter mock_function_calling is set to True, then we process all messages with role=tool and convert them to messages with role=user. So, this feature should still work in this case.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the answer! I think that case is covered anyway because they're separate checks in llm.py: whether it supports vision and whether it supports function calling. I thought about this some more in the meantime, and I removed mock_function_calling completely from the agent. I think the result is equivalent. 😅

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So you mean mock_function_calling is not required at all and is a redundant feature? If not, the expected behaviour should be that visual browsing can be done even when these three conditions hold true:

  1. LLM supports vision
  2. mock_function_calling is set to True
  3. LLM doesn't support function calling (i.e. role='tool' is not allowed in LLM input)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's redundant in the agent. Otherwise it's perfectly valid to ask whether a llm has function calling or not, and it happens in llm.py.

I'm a bit confused by those conditions though, possibly because this attribute really shouldn't exist. How about: visual browsing can be active, both

  • when fn calling is native,
  • and when it's not, just the same.

Is this statement correct?

Reference:

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok I understood what you are doing in this PR. I think by removing mock_function_calling from this if block, we are essentially eliminating the possibility of using this visual browser feature with vision-supported LLMs (like gpt-4o). VISUAL_BROWSING_TOOL_SUPPORTED_MODELS consists of only those models which support images with role='tool'. I think we can do visual browsing even without tool-use (simply pass images with role='user', exactly what mock_function_calling did before) but the above PR will no longer allow that.

Copy link
Collaborator

@enyst enyst Feb 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, we need to figure this out, thanks a lot for your time on this! because there seem to be some misunderstandings here (on my side too for sure):

  • not important, just to get this out of the way: role='tool' doesn't really define native fn calling ( for example, Anthropic Claude Sonnet 3.5 doesn't support role='tool', but does support fn calling of course natively; litellm is doing it the conversion for us undercover, and actually sends 'user' to anthropic API)
  • there is indeed an important difference in APIs, between those that support native fn calling, and those that have only non-native fn calling (the format is different; for example, we send a 'tools' parameter separately from messages)
  • non-native function calling is supported, that PR did not remove anything from the functionality itself. llm.py is doing it, with the helper fn_call_converter.py.
  • the models listed in VISUAL_BROWSING_TOOL_SUPPORTED_MODELS support the visual browsing tool. Both cases: whether fn calling is enabled or not. (is there a doubt on this? maybe we can try to make sure)

However, I seem to understand from what you say, that although VISUAL_BROWSING_TOOL_SUPPORTED_MODELS lists models, there are other models that are supported by visual browsing tool? Can you please tell, which models apart from those, are supported?

Copy link
Collaborator

@enyst enyst Feb 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

VISUAL_BROWSING_TOOL_SUPPORTED_MODELS consists of only those models which support images with role='tool'

Why does it need to list only models with native fn calling? Could we:

  • list ALL models that support the visual browsing tool, whether they have native fn calling or not?
  • or remove it all together, and assume that if a model supports vision, then it can support the tool too?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding is that if tool-calling is not supported by the model at all, the current idea is to just replace role='tool' with role='user' in all messages without the need of explicitly setting mock_function_calling=True. If this is correct, then visual browsing will be supported by all models that support vision, which implies we can remove VISUAL_BROWSING_TOOL_SUPPORTED_MODELS altogether.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes! If tool calling is disabled, we convert the format in which send the request to the LLM API, into a text-based, regular format, practically we describe the equivalent "tools" in-context. This always happens if native is not supported at all or is disabled.

Great, thank you! More cleaning 😅

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants