````python
Here is an example with chain of thought of a valid action when clicking on a button:
"
In order to accomplish my goal I need to click on the button with bid 12
```click("12")```
"
""".strip()

if USE_CONCISE_ANSWER:
    concise_instruction = """\

Here is another example with chain of thought of a valid action when providing a concise answer to user:
"
In order to accomplish my goal I need to send the information asked back to the user. This page list the information of HP Inkjet Fax Machine, which is the product identified in the objective. Its price is $279.49. I will send a message back to user with the answer.
````
# MiniWoB Evaluation with OpenDevin Browsing Agents
This folder contains the evaluation harness for the [MiniWoB++](https://miniwob.farama.org/) benchmark, powered by [BrowserGym](https://github.com/ServiceNow/BrowserGym), which makes it easy to evaluate how well a browsing-capable agent performs on synthetic web browsing tasks.
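
Under the hood, BrowserGym exposes each MiniWoB++ task as a standard Gymnasium environment. The sketch below is only meant to illustrate that interface; the task name `click-test`, the `MINIWOB_URL` value, and the action string are illustrative assumptions, not something this harness requires you to write yourself (the evaluation scripts drive BrowserGym for you).

```python
# Minimal sketch of the BrowserGym/MiniWoB++ interface (illustrative only).
import os

import gymnasium as gym
import browsergym.miniwob  # noqa: F401  # importing registers the browsergym/miniwob.* task IDs

# Assumption: MINIWOB_URL points at the html pages of a local miniwob-plusplus checkout.
os.environ.setdefault("MINIWOB_URL", "file:///PATH_TO_MINIWOB_CLONED_REPO/miniwob/html/miniwob/")

env = gym.make("browsergym/miniwob.click-test")
obs, info = env.reset()
# A browsing agent reads `obs` (DOM snapshot, accessibility tree, screenshot, ...)
# and emits an action string such as the one below; "12" is a made-up element bid.
obs, reward, terminated, truncated, info = env.step('click("12")')
env.close()
```
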
## Setup OpenDevin Environment
Please follow [this document](https://github.com/OpenDevin/OpenDevin/blob/main/Development.md) to set up a local development environment for OpenDevin.
## Configure OpenDevin and your LLM
Create a `config.toml` file at the root of the workspace if it does not already exist.

- Set up the MiniWoB URL in `evaluation/miniwob/scripts/run_infer.sh` (change `PATH_TO_MINIWOB_CLONED_REPO` here to the absolute path of your `miniwob-plusplus` folder).

Open the MiniWoB URLs above in a browser and check that they load correctly.
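
If you would rather sanity-check reachability from a script than a browser, something like the snippet below works. The `file://` URL and the `click-test.html` page name are assumptions based on the standard `miniwob-plusplus` repository layout; adjust them to your setup.

```python
# Quick reachability check for a locally cloned MiniWoB++ task page (illustrative helper).
from urllib.request import urlopen

# Assumption: standard miniwob-plusplus layout; replace the placeholder with your absolute path.
MINIWOB_URL = "file:///PATH_TO_MINIWOB_CLONED_REPO/miniwob/html/miniwob/"

with urlopen(MINIWOB_URL + "click-test.html") as resp:
    html = resp.read().decode("utf-8", errors="replace")

print(f"Loaded {len(html)} bytes from {MINIWOB_URL}click-test.html")
```
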
## Run Evaluation
```sh
bash evaluation/miniwob/scripts/run_infer.sh
```
Results will be in `evaluation/evaluation_outputs/outputs/miniwob/`.

To calculate the average reward, run:
```sh
poetry run python evaluation/miniwob/get_success_rate.py evaluation/evaluation_outputs/outputs/miniwob/SOME_AGENT/EXP_NAME/output.jsonl
```
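
The averaging itself is straightforward. A hedged sketch of the computation is shown below; it assumes each line of `output.jsonl` stores the task's final reward as a number under a `test_result` field, which is an assumption — check the actual file or `get_success_rate.py` for the real field name.

```python
# Hypothetical re-implementation of the reward averaging (field names are assumptions).
import json
import sys


def average_reward(path: str) -> float:
    rewards = []
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            # Assumed layout: each record stores the task's final reward under "test_result".
            rewards.append(float(record.get("test_result", 0.0)))
    return sum(rewards) / len(rewards) if rewards else 0.0


if __name__ == "__main__":
    print(f"average reward: {average_reward(sys.argv[1]):.3f}")
```
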
## Submit your evaluation results
You can fork [our huggingface evaluation outputs](https://huggingface.co/spaces/OpenDevin/evaluation) and submit a PR with your evaluation results, following the guide [here](https://huggingface.co/docs/hub/en/repositories-pull-requests-discussions#pull-requests-and-discussions).
## BrowsingAgent V1.0 result
Tested on BrowsingAgent V1.0

MiniWoB++, 125 tasks (3 runs due to random task initialization), max 10 steps per task