Skip to content

[OH-Versa] Add remaining browsing & GAIA eval improvement #9015

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 11 commits into from
Jun 25, 2025

Conversation

ryanhoangt
Copy link
Collaborator

@ryanhoangt ryanhoangt commented Jun 9, 2025

  • This change is worth documenting at https://docs.all-hands.dev/
  • Include this change in the Release Notes. If checked, you must provide an end-user friendly description for your change below

End-user friendly description of the problem this fixes or functionality this introduces.


Summarize what the PR does, explaining any non-trivial design decisions.

Ported from #8598

  • Add file download observation via browsing
  • Some minor improvements to multimodal and text-based browsing
  • GAIA eval changes

Eval result:

  • GAIA - claude-3-7-sonnet:
    • main: 12/30
    • This PR: 15/30 (50%)

Link of any specific issues this addresses:


To run this PR locally, use the following command:

docker run -it --rm   -p 3000:3000   -v /var/run/docker.sock:/var/run/docker.sock   --add-host host.docker.internal:host-gateway   -e SANDBOX_RUNTIME_CONTAINER_IMAGE=docker.all-hands.dev/all-hands-ai/runtime:4796978-nikolaik   --name openhands-app-4796978   docker.all-hands.dev/all-hands-ai/openhands:4796978

@ryanhoangt ryanhoangt changed the title [OH-Versa] Add remaining browsing improvement & GAIA eval [OH-Versa] Add remaining browsing & GAIA eval improvement Jun 9, 2025
@neubig
Copy link
Contributor

neubig commented Jun 9, 2025

Thanks @ryanhoangt , has this ben evaluated at all?

@@ -198,6 +200,8 @@ def __init__(
self.start_time = time.time()
self.last_execution_time = self.start_time
self._initialized = False
self.downloaded_files: list[str] = []
self.downloads_directory = '/workspace/downloads'
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This folder might not exist at all (in GAIA eval harness we explicitly create this folder). Whatever folder path we use, we should also keep both docker runtime and local runtime in mind.

@xingyaoww Do you have some idea here? I was going to suggest a new config option until I see a recent refactoring of yours (#8242) that deprecates that approach.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Anyways, whatever decision we make, we need to make sure browser_env.py uses the same path.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the folder in browser_env.py must match with this folder. Also note that the directory must pre-exist or have sufficient permissions so that it can be created by the Playwright logic. The simplest, error-free way is to have a default location (that can be configured somehow if you want) where we can download files while ensuring that the directory exists before running the agent.

Also I think a better idea would be to directly open the file in the browser (using Xinyao's HTTP-based file viewing server) whenever possible. Say the agent downloaded a PDF, we can show the content of the file download observation in the text and open the file in the browser and send screenshot to agent.

Copy link
Collaborator Author

@ryanhoangt ryanhoangt Jun 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@adityasoni9998 what do you think about making this a hidden directory, so that it doesn't interfere a lot with the ACI? If the user launches OH from scratch, the working directory will be /workspace, and having a downloads/ directory there might be a bit unexpected to the agent?

Regarding the ACI output, if the download dir isn't hidden, the output is:

Here's the files and directories up to 2 levels deep in /workspace, excluding hidden items:
/workspace/
/workspace/downloads/
/workspace/test.txt

If we make it /workspace/.downloads, it will be:

Here's the files and directories up to 2 levels deep in /workspace, excluding hidden items:
/workspace/
/workspace/test.txt

1 hidden files/directories in this directory are excluded. You can use 'ls -la /workspace' to see them.

which I think looks a bit nicer.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think either should just work fine. Anyways we provide the absolute path of the downloaded file to the agent whenever a FileDownloadObservation is returned by the action_execution_server resulting from a BrowseInteractiveAction

@li-boxuan
Copy link
Collaborator

@OpenHands

Please add a test for download action in tests/runtime/test_browsing.py.

Copy link

openhands-ai bot commented Jun 10, 2025

I'm on it! li-boxuan can track my progress at all-hands.dev

if not browser_observation.error:
return browser_observation
else:
curr_files = os.listdir(self.downloads_directory)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@adityasoni9998 I don't exactly remember about this, but we are doing a hack here, right? We are saying, if a browse action fails, then we assume it's a download action?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So here is my understanding of how downloads work in BrowserGym and Playwright:

  1. When agent tries to download a file, apparently the following error is returned by BrowserGym Error: net::ERR_ABORTED at <URL> (for example try executing goto() action for https://www.dailyscript.com/scripts/GOLDFINGER.doc to observe this). If you set the playwright config in browser_env.py to accept downloads, then apparently a file is downloaded and still an error is returned by browsergym.
  2. So my logic is to check if a new file is present in the downloads folder (I keep track of this list), whenever the agent faces any error during browsing. We don't assume every browsing failure is a download.

I also remember that the above behaviour occurs if the agent clicks on a link that points to a file. I am not sure if there are cases when no error is raised by BGym and a file is downloaded at the same time. There maybe such cases, but I don't have any reproducible example for now.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BrowserGym/Playwright is also very weird for file extensions of downloaded files - these files are downloaded with some random strings (similar to a uuid str) that don't have any extension. This is why I use puremagic to guess the extension when copying it to workspace. I don't know of better ways of tackling this.

Copy link

openhands-ai bot commented Jun 10, 2025

I've successfully added a test for the download action in the browser functionality to the tests/runtime/test_browsing.py file. Here's a summary of what I did:

  1. First, I explored the codebase to understand how the browser functionality works, particularly focusing on:

    • The browser environment setup in openhands/runtime/browser/browser_env.py
    • The browser actions in openhands/events/action/browse.py
    • The browser observations in openhands/events/observation/browse.py
  2. I created a new test function called test_download_file that:

    • Creates a test file to be downloaded
    • Creates an HTML page with a download link
    • Starts a simple HTTP server
    • Uses the browser to navigate to the HTML page
    • Clicks on the download link
    • Verifies that the file was downloaded to the /workspace/downloads/ directory
    • Checks that the downloaded file has the correct content
    • Cleans up after the test
  3. I ran the pre-commit hooks to ensure code quality and formatting standards were met.

  4. I committed the changes and pushed them to the ht/oh-versa-eval branch, successfully updating PR [OH-Versa] Add remaining browsing & GAIA eval improvement #9015.

The test verifies that the browser environment is correctly configured to handle downloads, which is an important feature for agents that need to download files from the web.

if not browser_observation.error:
return browser_observation
else:
curr_files = os.listdir(self.downloads_directory)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So here is my understanding of how downloads work in BrowserGym and Playwright:

  1. When agent tries to download a file, apparently the following error is returned by BrowserGym Error: net::ERR_ABORTED at <URL> (for example try executing goto() action for https://www.dailyscript.com/scripts/GOLDFINGER.doc to observe this). If you set the playwright config in browser_env.py to accept downloads, then apparently a file is downloaded and still an error is returned by browsergym.
  2. So my logic is to check if a new file is present in the downloads folder (I keep track of this list), whenever the agent faces any error during browsing. We don't assume every browsing failure is a download.

I also remember that the above behaviour occurs if the agent clicks on a link that points to a file. I am not sure if there are cases when no error is raised by BGym and a file is downloaded at the same time. There maybe such cases, but I don't have any reproducible example for now.

@@ -198,6 +200,8 @@ def __init__(
self.start_time = time.time()
self.last_execution_time = self.start_time
self._initialized = False
self.downloaded_files: list[str] = []
self.downloads_directory = '/workspace/downloads'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the folder in browser_env.py must match with this folder. Also note that the directory must pre-exist or have sufficient permissions so that it can be created by the Playwright logic. The simplest, error-free way is to have a default location (that can be configured somehow if you want) where we can download files while ensuring that the directory exists before running the agent.

Also I think a better idea would be to directly open the file in the browser (using Xinyao's HTTP-based file viewing server) whenever possible. Say the agent downloaded a PDF, we can show the content of the file download observation in the text and open the file in the browser and send screenshot to agent.

if not browser_observation.error:
return browser_observation
else:
curr_files = os.listdir(self.downloads_directory)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BrowserGym/Playwright is also very weird for file extensions of downloaded files - these files are downloaded with some random strings (similar to a uuid str) that don't have any extension. This is why I use puremagic to guess the extension when copying it to workspace. I don't know of better ways of tackling this.

@neubig neubig removed request for neubig and xingyaoww June 14, 2025 12:50
@neubig
Copy link
Contributor

neubig commented Jun 14, 2025

@ryanhoangt I think I may let @adityasoni9998 and @li-boxuan review this one, But I'm very happy to have it in.

@ryanhoangt ryanhoangt requested a review from enyst June 17, 2025 14:47
Copy link
Contributor

@adityasoni9998 adityasoni9998 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes look good to me! I am not too aware about the changes in the test_mcp.py file. But the source code logic seems fine.

Copy link
Collaborator

@enyst enyst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks generally good to me, thank you. GAIA eval works as e2e test 😅

Copy link

openhands-ai bot commented Jun 25, 2025

Looks like there are a few issues preventing this PR from being merged!

  • GitHub Actions are failing:
    • Docker

If you'd like me to help, just leave a comment, like

@OpenHands please fix the failing actions on PR #9015

Feel free to include any additional details that might help me get this PR into a better state.

You can manage your notification settings

@ryanhoangt ryanhoangt merged commit dfa5467 into main Jun 25, 2025
18 checks passed
@ryanhoangt ryanhoangt deleted the ht/oh-versa-eval branch June 25, 2025 05:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants