[agent, browsing] Support viewing pdf and png/jpg via browser #7457
@@ -18,6 +18,11 @@
fill('a12', 'example with "quotes"')
click('a51')
click('48', button='middle', modifiers=['Shift'])

You can also use the browser to view pdf, png, jpg files.
You should first check the content of /tmp/oh-server-url to get the server url, and then use it to view the file by `goto("{server_url}/view?path={absolute_file_path}")`.
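For reference, the lookup this prompt asks the agent to perform could be sketched as follows. This is a minimal illustration, not code from the PR: `read_server_url` and `build_view_url` are hypothetical helper names, and percent-encoding the path is an assumption (the prompt interpolates the raw path).

```python
from pathlib import Path
from urllib.parse import quote


def read_server_url(url_file: str = "/tmp/oh-server-url") -> str:
    """Return the server URL stored in url_file, stripped of whitespace and any trailing slash."""
    return Path(url_file).read_text().strip().rstrip("/")


def build_view_url(server_url: str, absolute_file_path: str) -> str:
    """Build the viewer URL for a file, following the URL pattern in the prompt."""
    if not absolute_file_path.startswith("/"):
        raise ValueError("the viewer expects an absolute file path")
    # Assumption of this sketch: percent-encode the path so spaces and other
    # special characters survive in the query string.
    return f"{server_url}/view?path={quote(absolute_file_path, safe='/')}"
```

The agent would then pass the result to `goto(...)`, for example `goto(build_view_url(read_server_url(), "/workspace/some.pdf"))`.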
I have a stupid question: isn't `server_url` always "127.0.0.1"?
Yes, but the port can differ across environments (SaaS, Docker), so we need to tell the agent the port 😢
Ah yeah, I forgot about that.
So in the long term, do we want to turn the prompts into templates and just fill in the port info at runtime?
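A rough sketch of what the templating idea could look like, using only the standard library. Everything here is hypothetical: `BROWSER_HINT`, `render_browser_hint`, the shortened prompt wording, and the `http://` scheme are assumptions for illustration, not part of this PR.

```python
from string import Template

# Hypothetical prompt template. Only $server_url is filled in at runtime;
# {absolute_file_path} stays verbatim so the agent can substitute it itself.
BROWSER_HINT = Template(
    "You can also use the browser to view pdf, png, jpg files.\n"
    'Use `goto("$server_url/view?path={absolute_file_path}")` to view a file.'
)


def render_browser_hint(port: int, host: str = "127.0.0.1") -> str:
    """Fill the runtime-specific server URL into the prompt text."""
    return BROWSER_HINT.substitute(server_url=f"http://{host}:{port}")
```

With something like this, the agent would no longer need to read /tmp/oh-server-url, since the port would already be part of the rendered prompt.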
]

# Check if the file extension is supported
if file_extension not in supported_extensions:
@adityasoni9998 I remember you found that files downloaded by browsergym always lack extensions in their filenames, right?
Maybe we should handle that case here as well and try to guess the file type?
Anyway, this is non-blocking.
Yeah, I think we can do that on the browsergym side.
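For whoever picks up the extension-less case discussed in this thread, one possible approach is to sniff the file's leading bytes. The sketch below is an assumption about how that could work, not something this PR implements; `guess_extension` and `sniff_file_extension` are hypothetical names. The byte signatures themselves are the standard ones for PDF, PNG, and JPEG.

```python
from typing import Optional

# Well-known leading byte signatures for the file types the viewer supports.
MAGIC_PREFIXES = {
    b"%PDF-": ".pdf",
    b"\x89PNG\r\n\x1a\n": ".png",
    b"\xff\xd8\xff": ".jpg",
}


def guess_extension(header: bytes) -> Optional[str]:
    """Return the extension matching the header's signature, or None if unknown."""
    for prefix, extension in MAGIC_PREFIXES.items():
        if header.startswith(prefix):
            return extension
    return None


def sniff_file_extension(path: str) -> Optional[str]:
    """Read the first few bytes of a file and guess its type."""
    with open(path, "rb") as f:
        # 8 bytes covers the longest signature above (PNG).
        return guess_extension(f.read(8))
```

The extension check could fall back to this guess only when the filename has no extension, so files that already have one keep the current behavior.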
Co-authored-by: Boxuan Li <[email protected]>
End-user friendly description of the problem this fixes or functionality that this introduces.
Turn on vision-based browsing by default and allow OpenHands to see PDF / images in browser.
Give a summary of what the PR does, explaining any non-trivial design decisions.
Pre-req: we need to get #7452 in first
Adds a `viewer` to the action execution server so the agent can use it to see PDFs and images directly. Playwright disallows `file:///workspace/some.pdf` because it triggers the download workflow rather than displaying the file in the browser, so we need this hack to display PDFs and images.
The agent can now browse through a PDF in the browser:
The agent can also now see PNG images directly in the browser:
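To illustrate the idea of serving files as a page instead of via `file://`, here is a minimal sketch of a function that checks the extension and wraps the file in HTML. It is not the PR's implementation: `render_viewer_html`, the `MIME_TYPES` table, and the use of base64 data URIs are all assumptions, and whether an embedded PDF actually renders depends on the browser build.

```python
import base64
import html
import os

# Hypothetical mapping of supported extensions to MIME types.
MIME_TYPES = {
    ".pdf": "application/pdf",
    ".png": "image/png",
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
}


def render_viewer_html(file_path: str) -> str:
    """Return an HTML page that embeds the file as a base64 data URI."""
    file_extension = os.path.splitext(file_path)[1].lower()
    # Check if the file extension is supported
    if file_extension not in MIME_TYPES:
        raise ValueError(f"Unsupported file extension: {file_extension!r}")

    mime = MIME_TYPES[file_extension]
    with open(file_path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("ascii")
    data_uri = f"data:{mime};base64,{encoded}"

    if mime == "application/pdf":
        # Assumption: the browser can display PDFs through <embed>.
        body = f'<embed src="{data_uri}" type="{mime}" width="100%" height="100%">'
    else:
        name = html.escape(os.path.basename(file_path))
        body = f'<img src="{data_uri}" alt="{name}">'
    return f"<!DOCTYPE html><html><body>{body}</body></html>"
```

A `/view?path=...` handler would return this HTML with a `text/html` content type, so the browser renders a page rather than starting a download.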
Link of any specific issues this addresses.
To run this PR locally, use the following command: