-
Notifications
You must be signed in to change notification settings - Fork 6.9k
[OH-Versa] Add remaining browsing & GAIA eval improvement #9015
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Thanks @ryanhoangt , has this ben evaluated at all? |
@@ -198,6 +200,8 @@ def __init__( | |||
self.start_time = time.time() | |||
self.last_execution_time = self.start_time | |||
self._initialized = False | |||
self.downloaded_files: list[str] = [] | |||
self.downloads_directory = '/workspace/downloads' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This folder might not exist at all (in GAIA eval harness we explicitly create this folder). Whatever folder path we use, we should also keep both docker runtime and local runtime in mind.
@xingyaoww Do you have some idea here? I was going to suggest a new config option until I see a recent refactoring of yours (#8242) that deprecates that approach.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Anyways, whatever decision we make, we need to make sure browser_env.py
uses the same path.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the folder in browser_env.py must match with this folder. Also note that the directory must pre-exist or have sufficient permissions so that it can be created by the Playwright logic. The simplest, error-free way is to have a default location (that can be configured somehow if you want) where we can download files while ensuring that the directory exists before running the agent.
Also I think a better idea would be to directly open the file in the browser (using Xinyao's HTTP-based file viewing server) whenever possible. Say the agent downloaded a PDF, we can show the content of the file download observation in the text and open the file in the browser and send screenshot to agent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@adityasoni9998 what do you think about making this a hidden directory, so that it doesn't interfere a lot with the ACI? If the user launches OH from scratch, the working directory will be /workspace
, and having a downloads/
directory there might be a bit unexpected to the agent?
Regarding the ACI output, if the download dir isn't hidden, the output is:
Here's the files and directories up to 2 levels deep in /workspace, excluding hidden items:
/workspace/
/workspace/downloads/
/workspace/test.txt
If we make it /workspace/.downloads
, it will be:
Here's the files and directories up to 2 levels deep in /workspace, excluding hidden items:
/workspace/
/workspace/test.txt
1 hidden files/directories in this directory are excluded. You can use 'ls -la /workspace' to see them.
which I think looks a bit nicer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think either should just work fine. Anyways we provide the absolute path of the downloaded file to the agent whenever a FileDownloadObservation is returned by the action_execution_server
resulting from a BrowseInteractiveAction
Please add a test for download action in |
I'm on it! li-boxuan can track my progress at all-hands.dev |
if not browser_observation.error: | ||
return browser_observation | ||
else: | ||
curr_files = os.listdir(self.downloads_directory) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@adityasoni9998 I don't exactly remember about this, but we are doing a hack here, right? We are saying, if a browse action fails, then we assume it's a download action?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So here is my understanding of how downloads work in BrowserGym and Playwright:
- When agent tries to download a file, apparently the following error is returned by BrowserGym
Error: net::ERR_ABORTED at <URL>
(for example try executing goto() action for https://www.dailyscript.com/scripts/GOLDFINGER.doc to observe this). If you set the playwright config in browser_env.py to accept downloads, then apparently a file is downloaded and still an error is returned by browsergym. - So my logic is to check if a new file is present in the downloads folder (I keep track of this list), whenever the agent faces any error during browsing. We don't assume every browsing failure is a download.
I also remember that the above behaviour occurs if the agent clicks on a link that points to a file. I am not sure if there are cases when no error is raised by BGym and a file is downloaded at the same time. There maybe such cases, but I don't have any reproducible example for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BrowserGym/Playwright is also very weird for file extensions of downloaded files - these files are downloaded with some random strings (similar to a uuid str) that don't have any extension. This is why I use puremagic to guess the extension when copying it to workspace. I don't know of better ways of tackling this.
I've successfully added a test for the download action in the browser functionality to the
The test verifies that the browser environment is correctly configured to handle downloads, which is an important feature for agents that need to download files from the web. |
if not browser_observation.error: | ||
return browser_observation | ||
else: | ||
curr_files = os.listdir(self.downloads_directory) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So here is my understanding of how downloads work in BrowserGym and Playwright:
- When agent tries to download a file, apparently the following error is returned by BrowserGym
Error: net::ERR_ABORTED at <URL>
(for example try executing goto() action for https://www.dailyscript.com/scripts/GOLDFINGER.doc to observe this). If you set the playwright config in browser_env.py to accept downloads, then apparently a file is downloaded and still an error is returned by browsergym. - So my logic is to check if a new file is present in the downloads folder (I keep track of this list), whenever the agent faces any error during browsing. We don't assume every browsing failure is a download.
I also remember that the above behaviour occurs if the agent clicks on a link that points to a file. I am not sure if there are cases when no error is raised by BGym and a file is downloaded at the same time. There maybe such cases, but I don't have any reproducible example for now.
@@ -198,6 +200,8 @@ def __init__( | |||
self.start_time = time.time() | |||
self.last_execution_time = self.start_time | |||
self._initialized = False | |||
self.downloaded_files: list[str] = [] | |||
self.downloads_directory = '/workspace/downloads' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the folder in browser_env.py must match with this folder. Also note that the directory must pre-exist or have sufficient permissions so that it can be created by the Playwright logic. The simplest, error-free way is to have a default location (that can be configured somehow if you want) where we can download files while ensuring that the directory exists before running the agent.
Also I think a better idea would be to directly open the file in the browser (using Xinyao's HTTP-based file viewing server) whenever possible. Say the agent downloaded a PDF, we can show the content of the file download observation in the text and open the file in the browser and send screenshot to agent.
if not browser_observation.error: | ||
return browser_observation | ||
else: | ||
curr_files = os.listdir(self.downloads_directory) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BrowserGym/Playwright is also very weird for file extensions of downloaded files - these files are downloaded with some random strings (similar to a uuid str) that don't have any extension. This is why I use puremagic to guess the extension when copying it to workspace. I don't know of better ways of tackling this.
@ryanhoangt I think I may let @adityasoni9998 and @li-boxuan review this one, But I'm very happy to have it in. |
Co-authored-by: Engel Nyst <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The changes look good to me! I am not too aware about the changes in the test_mcp.py file. But the source code logic seems fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks generally good to me, thank you. GAIA eval works as e2e test 😅
Looks like there are a few issues preventing this PR from being merged!
If you'd like me to help, just leave a comment, like
Feel free to include any additional details that might help me get this PR into a better state. You can manage your notification settings |
End-user friendly description of the problem this fixes or functionality this introduces.
Summarize what the PR does, explaining any non-trivial design decisions.
Ported from #8598
Eval result:
claude-3-7-sonnet
:main
: 12/30Link of any specific issues this addresses:
To run this PR locally, use the following command: