[OH-Versa] Add remaining browsing & GAIA eval improvement #9015

ryanhoangt · 2025-06-09T17:51:20Z

This change is worth documenting at https://docs.all-hands.dev/
Include this change in the Release Notes. If checked, you must provide an end-user friendly description for your change below

End-user friendly description of the problem this fixes or functionality this introduces.

Summarize what the PR does, explaining any non-trivial design decisions.

Ported from #8598

Add file download observation via browsing
Some minor improvements to multimodal and text-based browsing
GAIA eval changes

Eval result:

GAIA - claude-3-7-sonnet:
- main: 12/30
- This PR: 15/30 (50%)

Link of any specific issues this addresses:

To run this PR locally, use the following command:

docker run -it --rm   -p 3000:3000   -v /var/run/docker.sock:/var/run/docker.sock   --add-host host.docker.internal:host-gateway   -e SANDBOX_RUNTIME_CONTAINER_IMAGE=docker.all-hands.dev/all-hands-ai/runtime:4796978-nikolaik   --name openhands-app-4796978   docker.all-hands.dev/all-hands-ai/openhands:4796978

…s, gaia eval

neubig · 2025-06-09T18:44:19Z

Thanks @ryanhoangt , has this ben evaluated at all?

li-boxuan · 2025-06-10T03:39:11Z

openhands/runtime/action_execution_server.py

@@ -198,6 +200,8 @@ def __init__(
        self.start_time = time.time()
        self.last_execution_time = self.start_time
        self._initialized = False
+        self.downloaded_files: list[str] = []
+        self.downloads_directory = '/workspace/downloads'


This folder might not exist at all (in GAIA eval harness we explicitly create this folder). Whatever folder path we use, we should also keep both docker runtime and local runtime in mind.

@xingyaoww Do you have some idea here? I was going to suggest a new config option until I see a recent refactoring of yours (#8242) that deprecates that approach.

Anyways, whatever decision we make, we need to make sure browser_env.py uses the same path.

Yes, the folder in browser_env.py must match with this folder. Also note that the directory must pre-exist or have sufficient permissions so that it can be created by the Playwright logic. The simplest, error-free way is to have a default location (that can be configured somehow if you want) where we can download files while ensuring that the directory exists before running the agent.

Also I think a better idea would be to directly open the file in the browser (using Xinyao's HTTP-based file viewing server) whenever possible. Say the agent downloaded a PDF, we can show the content of the file download observation in the text and open the file in the browser and send screenshot to agent.

@adityasoni9998 what do you think about making this a hidden directory, so that it doesn't interfere a lot with the ACI? If the user launches OH from scratch, the working directory will be /workspace, and having a downloads/ directory there might be a bit unexpected to the agent?

Regarding the ACI output, if the download dir isn't hidden, the output is:

Here's the files and directories up to 2 levels deep in /workspace, excluding hidden items: /workspace/ /workspace/downloads/ /workspace/test.txt

If we make it /workspace/.downloads, it will be:

Here's the files and directories up to 2 levels deep in /workspace, excluding hidden items: /workspace/ /workspace/test.txt 1 hidden files/directories in this directory are excluded. You can use 'ls -la /workspace' to see them.

which I think looks a bit nicer.

I think either should just work fine. Anyways we provide the absolute path of the downloaded file to the agent whenever a FileDownloadObservation is returned by the action_execution_server resulting from a BrowseInteractiveAction

li-boxuan · 2025-06-10T03:41:40Z

@OpenHands

Please add a test for download action in tests/runtime/test_browsing.py.

openhands-ai · 2025-06-10T03:41:48Z

I'm on it! li-boxuan can track my progress at all-hands.dev

li-boxuan · 2025-06-10T03:44:28Z

openhands/runtime/action_execution_server.py

+        if not browser_observation.error:
+            return browser_observation
+        else:
+            curr_files = os.listdir(self.downloads_directory)


@adityasoni9998 I don't exactly remember about this, but we are doing a hack here, right? We are saying, if a browse action fails, then we assume it's a download action?

So here is my understanding of how downloads work in BrowserGym and Playwright:

When agent tries to download a file, apparently the following error is returned by BrowserGym Error: net::ERR_ABORTED at <URL> (for example try executing goto() action for https://www.dailyscript.com/scripts/GOLDFINGER.doc to observe this). If you set the playwright config in browser_env.py to accept downloads, then apparently a file is downloaded and still an error is returned by browsergym.

So my logic is to check if a new file is present in the downloads folder (I keep track of this list), whenever the agent faces any error during browsing. We don't assume every browsing failure is a download.

I also remember that the above behaviour occurs if the agent clicks on a link that points to a file. I am not sure if there are cases when no error is raised by BGym and a file is downloaded at the same time. There maybe such cases, but I don't have any reproducible example for now.

BrowserGym/Playwright is also very weird for file extensions of downloaded files - these files are downloaded with some random strings (similar to a uuid str) that don't have any extension. This is why I use puremagic to guess the extension when copying it to workspace. I don't know of better ways of tackling this.

openhands-ai · 2025-06-10T03:44:59Z

I've successfully added a test for the download action in the browser functionality to the tests/runtime/test_browsing.py file. Here's a summary of what I did:

First, I explored the codebase to understand how the browser functionality works, particularly focusing on:
- The browser environment setup in openhands/runtime/browser/browser_env.py
- The browser actions in openhands/events/action/browse.py
- The browser observations in openhands/events/observation/browse.py
I created a new test function called test_download_file that:
- Creates a test file to be downloaded
- Creates an HTML page with a download link
- Starts a simple HTTP server
- Uses the browser to navigate to the HTML page
- Clicks on the download link
- Verifies that the file was downloaded to the /workspace/downloads/ directory
- Checks that the downloaded file has the correct content
- Cleans up after the test
I ran the pre-commit hooks to ensure code quality and formatting standards were met.
I committed the changes and pushed them to the ht/oh-versa-eval branch, successfully updating PR [OH-Versa] Add remaining browsing & GAIA eval improvement #9015.

The test verifies that the browser environment is correctly configured to handle downloads, which is an important feature for agents that need to download files from the web.

adityasoni9998 · 2025-06-10T06:37:34Z

openhands/runtime/action_execution_server.py

+        if not browser_observation.error:
+            return browser_observation
+        else:
+            curr_files = os.listdir(self.downloads_directory)


So here is my understanding of how downloads work in BrowserGym and Playwright:

When agent tries to download a file, apparently the following error is returned by BrowserGym Error: net::ERR_ABORTED at <URL> (for example try executing goto() action for https://www.dailyscript.com/scripts/GOLDFINGER.doc to observe this). If you set the playwright config in browser_env.py to accept downloads, then apparently a file is downloaded and still an error is returned by browsergym.

So my logic is to check if a new file is present in the downloads folder (I keep track of this list), whenever the agent faces any error during browsing. We don't assume every browsing failure is a download.

I also remember that the above behaviour occurs if the agent clicks on a link that points to a file. I am not sure if there are cases when no error is raised by BGym and a file is downloaded at the same time. There maybe such cases, but I don't have any reproducible example for now.

openhands/runtime/browser/browser_env.py

evaluation/benchmarks/gaia/scripts/run_infer.sh

adityasoni9998 · 2025-06-10T06:50:35Z

openhands/runtime/action_execution_server.py

@@ -198,6 +200,8 @@ def __init__(
        self.start_time = time.time()
        self.last_execution_time = self.start_time
        self._initialized = False
+        self.downloaded_files: list[str] = []
+        self.downloads_directory = '/workspace/downloads'


Yes, the folder in browser_env.py must match with this folder. Also note that the directory must pre-exist or have sufficient permissions so that it can be created by the Playwright logic. The simplest, error-free way is to have a default location (that can be configured somehow if you want) where we can download files while ensuring that the directory exists before running the agent.

Also I think a better idea would be to directly open the file in the browser (using Xinyao's HTTP-based file viewing server) whenever possible. Say the agent downloaded a PDF, we can show the content of the file download observation in the text and open the file in the browser and send screenshot to agent.

adityasoni9998 · 2025-06-10T06:53:22Z

openhands/runtime/action_execution_server.py

+        if not browser_observation.error:
+            return browser_observation
+        else:
+            curr_files = os.listdir(self.downloads_directory)


BrowserGym/Playwright is also very weird for file extensions of downloaded files - these files are downloaded with some random strings (similar to a uuid str) that don't have any extension. This is why I use puremagic to guess the extension when copying it to workspace. I don't know of better ways of tackling this.

neubig · 2025-06-14T12:50:49Z

@ryanhoangt I think I may let @adityasoni9998 and @li-boxuan review this one, But I'm very happy to have it in.

openhands/memory/conversation_memory.py

Co-authored-by: Engel Nyst <[email protected]>

adityasoni9998

The changes look good to me! I am not too aware about the changes in the test_mcp.py file. But the source code logic seems fine.

enyst

Looks generally good to me, thank you. GAIA eval works as e2e test 😅

openhands-ai · 2025-06-25T04:04:44Z

Looks like there are a few issues preventing this PR from being merged!

GitHub Actions are failing:
- Docker

If you'd like me to help, just leave a comment, like

@OpenHands please fix the failing actions on PR #9015

Feel free to include any additional details that might help me get this PR into a better state.

_{^{You can manage your notification settings}}

add browsing file download obs, remaining visual browsing improvement…

11eb700

…s, gaia eval

ryanhoangt requested review from xingyaoww and neubig as code owners June 9, 2025 17:51

ryanhoangt changed the title ~~[OH-Versa] Add remaining browsing improvement & GAIA eval~~ [OH-Versa] Add remaining browsing & GAIA eval improvement Jun 9, 2025

li-boxuan reviewed Jun 10, 2025

View reviewed changes

Add test for download action in browser

3e16cb7

mamoodi assigned ryanhoangt Jun 10, 2025

adityasoni9998 reviewed Jun 10, 2025

View reviewed changes

neubig removed request for neubig and xingyaoww June 14, 2025 12:50

enyst reviewed Jun 14, 2025

View reviewed changes

openhands/memory/conversation_memory.py Outdated Show resolved Hide resolved

Merge branch 'main' into ht/oh-versa-eval

60e8b3d

ryanhoangt requested review from mamoodi, rbren and amanape as code owners June 17, 2025 07:58

Merge branch 'main' into ht/oh-versa-eval

77c401e

ryanhoangt removed request for mamoodi, rbren and amanape June 17, 2025 08:05

ryanhoangt and others added 3 commits June 17, 2025 16:06

Update openhands/memory/conversation_memory.py

70d5427

Co-authored-by: Engel Nyst <[email protected]>

make download dir hidden

be37ace

fix test

a7224c7

ryanhoangt requested review from adityasoni9998 and li-boxuan June 17, 2025 14:47

ryanhoangt requested a review from enyst June 17, 2025 14:47

adityasoni9998 approved these changes Jun 23, 2025

View reviewed changes

enyst approved these changes Jun 23, 2025

View reviewed changes

ryanhoangt added 3 commits June 23, 2025 16:28

add ffmpeg to runtime

2443262

Merge branch 'main' into ht/oh-versa-eval

5910215

add ffprobe

21bf0a5

remove ffprobe

4796978

ryanhoangt merged commit dfa5467 into main Jun 25, 2025
18 checks passed

ryanhoangt deleted the ht/oh-versa-eval branch June 25, 2025 05:36

[OH-Versa] Add remaining browsing & GAIA eval improvement #9015

[OH-Versa] Add remaining browsing & GAIA eval improvement #9015

Uh oh!

Conversation

ryanhoangt commented Jun 9, 2025 • edited by github-actions bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

neubig commented Jun 9, 2025

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ryanhoangt Jun 17, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

li-boxuan commented Jun 10, 2025

Uh oh!

openhands-ai bot commented Jun 10, 2025

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

openhands-ai bot commented Jun 10, 2025

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

neubig commented Jun 14, 2025

Uh oh!

Uh oh!

adityasoni9998 left a comment

Choose a reason for hiding this comment

Uh oh!

enyst left a comment

Choose a reason for hiding this comment

Uh oh!

openhands-ai bot commented Jun 25, 2025

Uh oh!

Uh oh!

Uh oh!

ryanhoangt commented Jun 9, 2025 •

edited by github-actions bot

Loading

ryanhoangt Jun 17, 2025 •

edited

Loading