[feat] WebArena benchmark, MiniWoB++ benchmark and related arch changes #2170
Merged
81fb9f7 add webarena, and revamp messaging for webarena eval (frankxu2004)
0b0328f Merge branch 'main' of https://github.com/OpenDevin/OpenDevin into we… (frankxu2004)
c1b3d8e add changes for browsergym (frankxu2004)
2afa63a Merge branch 'main' of https://github.com/OpenDevin/OpenDevin into we… (frankxu2004)
5abfa39 update infer script (frankxu2004)
0194bc5 fix unit tests (frankxu2004)
bad115e update (frankxu2004)
7330bc4 Merge branch 'main' of https://github.com/OpenDevin/OpenDevin into we… (frankxu2004)
1a533b4 add multiple run for miniwob (frankxu2004)
85cff1d update instruction, remove personal path (frankxu2004)
854d41d Merge branch 'main' of https://github.com/OpenDevin/OpenDevin into we… (frankxu2004)
4e6959e update (frankxu2004)
db0904e Merge branch 'main' of https://github.com/OpenDevin/OpenDevin into we… (frankxu2004)
be9771e add code for getting final reward, fix integration, add results (frankxu2004)
1ebf177 Merge branch 'main' of https://github.com/OpenDevin/OpenDevin into we… (frankxu2004)
f1e1dfd add avg cost calculation (frankxu2004)
d9c13c1 Merge branch 'main' of https://github.com/OpenDevin/OpenDevin into we… (frankxu2004)
5bcd31b Merge branch 'main' of https://github.com/OpenDevin/OpenDevin into we… (frankxu2004)
@@ -0,0 +1,57 @@
# MiniWoB++ Evaluation with OpenDevin Browsing Agents

This folder contains the evaluation for the [MiniWoB++](https://miniwob.farama.org/) benchmark, powered by [BrowserGym](https://github.com/ServiceNow/BrowserGym), which makes it easy to evaluate how well a browsing-capable agent performs on synthetic web browsing tasks.

## Setup OpenDevin Environment

Please follow [this document](https://github.com/OpenDevin/OpenDevin/blob/main/Development.md) to set up a local development environment for OpenDevin.

## Configure OpenDevin and your LLM

Create a `config.toml` file at the root of the workspace if it does not already exist.

Add the following configurations:

```toml
[core]
max_iterations = 100
cache_dir = "/tmp/cache"
sandbox_container_image = "ghcr.io/opendevin/sandbox:latest"
sandbox_type = "ssh"
ssh_hostname = "localhost"
sandbox_timeout = 120

# TODO: Change these to the model you want to evaluate
[eval_gpt4_1106_preview]
model = "gpt-4-1106-preview"
api_key = "XXX"
temperature = 0.0

[eval_some_openai_compatible_model]
model = "openai/MODEL_NAME"
base_url = "https://OPENAI_COMPATIBLE_URL/v1"
api_key = "XXX"
temperature = 0.0
```

## Setup MiniWoB++ Environment and Environment Variables of MiniWoB++
MiniWoB++ tasks are served as static web pages, which must be reachable by URL from the machine running the OpenDevin agents.

- Clone miniwob (use a specific frozen commit for reproducibility)
```sh
git clone git@github.com:Farama-Foundation/miniwob-plusplus.git
git -C "./miniwob-plusplus" reset --hard 7fd85d71a4b60325c6585396ec4f48377d049838
```

- Setup Miniwob URL (change `<PATH_TO_MINIWOB_CLONED_REPO>` here to the absolute path to your `miniwob-plusplus` folder)
```sh
export MINIWOB_URL="file://<PATH_TO_MINIWOB_CLONED_REPO>/miniwob/html/miniwob/"
```

## Test if your environment works

Open the MiniWoB URL above in a browser and check that the pages load correctly.
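
The manual browser check can be complemented by a small script. This is a rough sketch under stated assumptions: `miniwob_url_for` and `check_miniwob_url` are hypothetical helpers (not part of OpenDevin or BrowserGym), and it assumes the task pages are `.html` files sitting directly under the directory that `MINIWOB_URL` points to.

```python
import os
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import url2pathname


def miniwob_url_for(repo: Path) -> str:
    """Build the file:// URL for the task pages inside a miniwob-plusplus clone."""
    html_dir = repo.resolve() / "miniwob" / "html" / "miniwob"
    return html_dir.as_uri() + "/"


def check_miniwob_url(url: str) -> list[str]:
    """Return the names of the .html pages found under a file:// MINIWOB_URL."""
    parsed = urlparse(url)
    if parsed.scheme != "file":
        raise ValueError(f"expected a file:// URL, got {url!r}")
    directory = Path(url2pathname(parsed.path))
    if not directory.is_dir():
        raise FileNotFoundError(f"{directory} is not a directory")
    return sorted(p.name for p in directory.glob("*.html"))


if __name__ == "__main__":
    # Requires MINIWOB_URL to be exported as described above.
    pages = check_miniwob_url(os.environ["MINIWOB_URL"])
    print(f"found {len(pages)} task pages")
```

Building the URL with `Path.as_uri()` avoids hand-editing the placeholder and handles characters that need percent-encoding.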

## Submit your evaluation results

You can start your own fork of [our huggingface evaluation outputs](https://huggingface.co/spaces/OpenDevin/evaluation) and submit a PR of your evaluation results following the guide [here](https://huggingface.co/docs/hub/en/repositories-pull-requests-discussions#pull-requests-and-discussions).
@frankxu2004 can you please help me understand this? If `state.history` has length 1, we already have an (action, observation) pair, so why do we need to send another action to get the first observation? Isn't that the second observation?
I need to merge it into #2021, which has refactored history to get events from the event stream, so they're no longer perfect pairs... I'm trying not to break this :)
Sorry, this is a bit hacky. When `len(state.history) == 1`, the history is `[(MessageAction("user's initial instruction"), NullObservation())]`, so we send a `BrowserInteractiveAction("noop()")` and get back the current BrowserObservation.
I think in the future, when the EventStream is more accessible to other processes (e.g. inside the sandbox, or the separate Browser process), we could simply let the browser subscribe to the EventStream and send the initial Observation to it. The agent would then already have an initial observation, without having to rely on this hack.
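
For readers following the thread, the check described in this comment can be sketched as below. The dataclasses are simplified stand-ins named after the classes mentioned in the comment, not the real OpenDevin types, and `first_step` is a hypothetical helper rather than the agent's actual method.

```python
from dataclasses import dataclass, field


# Simplified stand-ins for the event classes named in the comment (assumptions).
@dataclass
class MessageAction:
    content: str


@dataclass
class NullObservation:
    content: str = ""


@dataclass
class BrowserInteractiveAction:
    browser_actions: str


@dataclass
class State:
    # Each entry is an (action, observation) pair.
    history: list = field(default_factory=list)


def first_step(state: State):
    """Return a noop browser action when only the initial instruction has been seen."""
    if len(state.history) == 1:
        # The noop changes nothing on the page; it only triggers a fresh observation.
        return BrowserInteractiveAction("noop()")
    return None  # otherwise fall through to the normal LLM-driven step


state = State(history=[(MessageAction("user's initial instruction"), NullObservation())])
action = first_step(state)
```

The check depends on history being a list of pairs whose first entry is the user's instruction, which is the assumption the #2021 refactor removes.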
Aha! Sorry, I edited my comment to explain meanwhile, too.
Okay, thank you. I don't know if #2021 goes far enough for that, just in that direction I guess.
If they are no longer in perfect pairs, do we have an idea of what observation is from what action?
For this specific case, maybe we can check whether there is no BrowserObservation at all in the whole history and, if so, issue a noop to get the initial browser observation. What do you think?
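
A minimal sketch of that proposal, assuming history becomes a flat list of events rather than (action, observation) pairs. The classes are simplified stand-ins, and `needs_initial_observation` is a hypothetical helper name.

```python
from dataclasses import dataclass


# Simplified stand-ins for events in a flat event stream (assumptions).
@dataclass
class MessageAction:
    content: str


@dataclass
class BrowserObservation:
    url: str


def needs_initial_observation(events: list) -> bool:
    """True when no browser observation has been recorded anywhere in the history."""
    return not any(isinstance(event, BrowserObservation) for event in events)


# Only the user's instruction so far: a noop would be issued.
fresh = [MessageAction("user's initial instruction")]
# A browser observation already exists: no noop is needed.
seen = fresh + [BrowserObservation(url="about:blank")]
```

Unlike a length check, this does not depend on how many events precede the first browser observation. Note that it would also return `True` for an empty history.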
Just to note, when the browsing agent runs as a delegate, it doesn't seem to have a previous MessageAction, it starts with 0 history I think, so this doesn't do any extra 'noop', but the rest works.
hmm, how does the browsing agent get the delegate intent then?
Via `inputs`: https://github.com/OpenDevin/OpenDevin/blob/45ce09d70ec9e46ded03df3b168f684e030c6dee/agenthub/browsing_agent/browsing_agent.py#L112
I think passing a MessageAction is better... if we can do it. Figuring out the current task is needed in other parts of the code.
Or, I don't know if a message action, but the delegate action is at this point the latest action in the stream. That action has `inputs`, which is passed in State, which is read above.
We had an issue with a stubborn integration test and while debugging it on 2021 branch, I commented out this bit of code, and it all worked without it. I am less sure about `main` right now (I live in a parallel world these days, a world with a refactored history 😅), but testing on `main` it worked just fine without it and fixed the integration test in a normal way (with the expected history), so I removed it from `main` to fix that test.
Without it, the llm gets a prompt with the task, and no history, and that seemed to work fine, both as independent agent and as delegate. The LLM asks to go to the URL, which results in an observation, so it all continues normally from there.
Do you think that actually breaks something? What use case was it, that required another observation, is it possible it's no longer the case?
For normal web browsing, we don't need to issue the noop: the browser always starts with a blank page, so there is no meaningful initial observation anyway, and the browsing agent will first issue a goto(url) action to kick-start the process.
However, for evaluation on benchmarks such as webarena and miniwob++, the browser env is initialized with a specific URL, and the initial observation is crucial for the browsing agent to make its first decision. That's why I added the noop action first to get the initial observation. I fixed this issue in a new PR: #2341