[agent, browsing] Support viewing pdf and png/jpg via browser #7457

Merged
merged 31 commits into from
Mar 28, 2025

@xingyaoww xingyaoww commented Mar 24, 2025

  • This change is worth documenting at https://docs.all-hands.dev/
  • Include this change in the Release Notes. If checked, you must provide an end-user friendly description for your change below

End-user friendly description of the problem this fixes or functionality that this introduces.

Turn on vision-based browsing by default and allow OpenHands to view PDFs and images in the browser.


Give a summary of what the PR does, explaining any non-trivial design decisions.

Prerequisite: #7452 needs to be merged first.

  1. Turn on the set-of-mark based vision browsing
  2. Add a viewer to the action execution server so the agent can view PDFs and images directly. Playwright does not allow navigating to file:///workspace/some.pdf, because that triggers a download workflow instead of displaying the file in the browser, so we need this hack to display PDFs and images.
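
To illustrate the idea in item 2, here is a minimal sketch of a viewer page generator: instead of navigating to a file:// URL, the server returns an HTML page that embeds the file inline. This is not the PR's actual implementation. The function name `render_viewer_html`, the extension table, and the choice to embed the file as a base64 data URI are all assumptions made for illustration.

```python
import base64
import html
from pathlib import Path

# Hypothetical mapping; the extensions actually supported by the PR may differ.
MIME_TYPES = {
    ".pdf": "application/pdf",
    ".png": "image/png",
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
}


def render_viewer_html(file_path: str) -> str:
    """Return an HTML page that shows the file inline instead of downloading it."""
    path = Path(file_path)
    if not path.is_absolute():
        raise ValueError("file_path must be absolute")
    mime = MIME_TYPES.get(path.suffix.lower())
    if mime is None:
        raise ValueError(f"unsupported file extension: {path.suffix!r}")

    # Embed the file contents directly in the page as a data URI.
    encoded = base64.b64encode(path.read_bytes()).decode("ascii")
    data_uri = f"data:{mime};base64,{encoded}"
    title = html.escape(path.name)
    if mime == "application/pdf":
        body = f'<embed src="{data_uri}" type="application/pdf" width="100%" height="100%">'
    else:
        body = f'<img src="{data_uri}" alt="{title}">'
    return f"<!DOCTYPE html><html><head><title>{title}</title></head><body>{body}</body></html>"
```

Whether an embedded PDF renders this way depends on the browser's built-in PDF support, so a real viewer may need a JavaScript-based PDF renderer instead.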

The agent can now browse through a PDF in the browser:

image

The agent can now also view PNG files directly in the browser:

image

Link of any specific issues this addresses.


To run this PR locally, use the following command:

docker run -it --rm \
  -p 3000:3000 \
  -v /var/run/docker.sock:/var/run/docker.sock \
  --add-host host.docker.internal:host-gateway \
  -e SANDBOX_RUNTIME_CONTAINER_IMAGE=docker.all-hands.dev/all-hands-ai/runtime:e52dd8f-nikolaik \
  --name openhands-app-e52dd8f \
  docker.all-hands.dev/all-hands-ai/openhands:e52dd8f

@xingyaoww xingyaoww requested a review from li-boxuan March 27, 2025 15:24
@xingyaoww xingyaoww force-pushed the xw/view-pdf-via-browser branch from cd060ee to 273437f Compare March 27, 2025 15:26
@xingyaoww xingyaoww requested a review from neubig March 27, 2025 16:28
@All-Hands-AI All-Hands-AI deleted a comment from openhands-agent Mar 27, 2025
@@ -18,6 +18,11 @@
fill('a12', 'example with "quotes"')
click('a51')
click('48', button='middle', modifiers=['Shift'])

You can also use the browser to view pdf, png, jpg files.
You should first check the content of /tmp/oh-server-url to get the server url, and then use it to view the file by `goto("{server_url}/view?path={absolute_file_path}")`.
Collaborator

I have a stupid question: isn't server_url always "127.0.0.1"?

Collaborator (Author)

Yes, but the port can differ across environments (SaaS, Docker), so we need to tell the agent the port 😢

Collaborator

Ah yeah I forgot about that.

So in the long term, do we want to turn the prompts into templates and then just fill in the port info at runtime?
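
As a rough sketch of the flow discussed in this thread (read the server URL from /tmp/oh-server-url, fill it into the prompt at runtime, and build the `/view?path=...` URL), something like the following could work. It uses Python's standard `string.Template`. The helper names, the `$server_url` placeholder, and the assumption that the file contains a full URL including scheme and port are hypothetical and not taken from the PR.

```python
from pathlib import Path
from string import Template
from urllib.parse import quote

# Hypothetical prompt template; the placeholder name `server_url` is an assumption.
PROMPT_TEMPLATE = Template(
    'To view a pdf, png, or jpg file, use goto("$server_url/view?path={absolute_file_path}").'
)


def read_server_url(url_file: str = "/tmp/oh-server-url") -> str:
    """Read the server URL that the runtime wrote to disk (assumed to include the port)."""
    return Path(url_file).read_text().strip().rstrip("/")


def render_prompt(server_url: str) -> str:
    """Fill the environment-specific server URL into the prompt at runtime."""
    return PROMPT_TEMPLATE.substitute(server_url=server_url)


def build_view_url(server_url: str, absolute_file_path: str) -> str:
    """Build the viewer URL, percent-encoding the path for use in a query string."""
    return f"{server_url}/view?path={quote(absolute_file_path, safe='/')}"
```

With templating, the agent would no longer need to read the file itself, since the rendered prompt would already contain the correct host and port.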

]

# Check if the file extension is supported
if file_extension not in supported_extensions:
Collaborator

@adityasoni9998 I remember you found that files downloaded by browsergym always lack extensions in their filenames, right?

Maybe we should handle that case here as well and try to guess the file type?

Collaborator

Anyway, this is non-blocking.

Collaborator (Author)

Yeah, I think we can do that on the browsergym side.
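
For reference, guessing the file type of an extensionless download could be sketched by checking the file's leading bytes against the well-known signatures for PDF, PNG, and JPEG. The function name `guess_extension` is hypothetical, and nothing here is taken from the PR or from browsergym.

```python
from typing import Optional

# Well-known leading byte signatures for the three formats discussed here.
MAGIC_SIGNATURES = (
    (b"%PDF-", ".pdf"),
    (b"\x89PNG\r\n\x1a\n", ".png"),
    (b"\xff\xd8\xff", ".jpg"),
)


def guess_extension(file_path: str) -> Optional[str]:
    """Guess a file's extension from its first bytes; return None if unrecognized."""
    with open(file_path, "rb") as f:
        header = f.read(8)  # the longest signature above is 8 bytes
    for signature, extension in MAGIC_SIGNATURES:
        if header.startswith(signature):
            return extension
    return None
```

A caller could fall back to this check only when the filename has no extension, and keep rejecting the file when the result is None.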

@xingyaoww xingyaoww enabled auto-merge (squash) March 28, 2025 06:58
@xingyaoww xingyaoww merged commit ac8b5e7 into main Mar 28, 2025
15 checks passed
@xingyaoww xingyaoww deleted the xw/view-pdf-via-browser branch March 28, 2025 07:07

5 participants