Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[feat] WebArena benchmark, MiniWoB++ benchmark and related arch changes #2170

Merged
merged 18 commits into from
Jun 6, 2024

Conversation

frankxu2004
Copy link
Collaborator

@frankxu2004 frankxu2004 commented Jun 1, 2024

Major changes:

  • Arch changes to support eval with browsergym infrastructure inside the project
  • Incorporate WebArena benchmark code
  • Incorporate MiniWoB++ benchmark code

Eval Results:

All tested on BrowsingAgent V1.0

MiniWoB++, 125 tasks (3 runs due to random init task)

GPT4o: 0.384, 0.416, 0.424, avg: 0.408
GPT3.5: 0.288, 0.256, 0.272, avg: 0.272

WebArena, 812 tasks (high cost, single run due to fixed task)
GPT4o: 0.1478
GPT3.5: 0.0517

@frankxu2004 frankxu2004 changed the title [feat] WebArena benchmark and related arch changes [feat] WebArena benchmark, MiniWoB++ benchmark and related arch changes Jun 3, 2024
@frankxu2004 frankxu2004 marked this pull request as ready for review June 3, 2024 08:09
--agent-cls $AGENT \
--llm-config $MODEL_CONFIG \
--max-iterations 15 \
--max-chars 10000000 \
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

qq: 10000000 is a random number?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, it's just there to not truncate anything

obs, info = env.reset()
# EVAL only: save the goal into file for evaluation
if self.eval_mode:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Prefer to find a better way to record the intermedia data in the future, to avoid evaluation and runtime logic get mixed.

But it is fine for me to do this quick implementation here.

Copy link
Collaborator Author

@frankxu2004 frankxu2004 Jun 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. This is ugly for now - I used filesystem to do IPC...

A much better solution in my opinion to better support BrowserGym benchmarks are allowing BrowserEnv (running in a separate process) to access the EventStream that our main process have. If we can send the initial MessageAction from BrowserEnv process then these goal sharing code and reward saving code can be just removed - they can be communicated to the main process and we can handle all these eval-specific things in evaluation codebase only.

Maybe somehow in the future if we have a unified MQ this might be easier? Or if we do not want middleware we can probably make the EventStream subscribe-able with multiprocessing Now maybe we have to expose something to allow for better IPC between browser environment to the main process where we handle all the events. Love your thoughts, but I think this might be worth a new issue?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this might be worth a new issue?

Yes, I agree. Better to do it in future PR.

@frankxu2004
Copy link
Collaborator Author

Also related: I wonder if we should allow configurable agents somewhere in the codebase? Now we can configure LLM configs, but not agents. It's nice to have configurable options passed the Agent class to control the prompts and stuff.

@li-boxuan
Copy link
Collaborator

For some reason I cannot directly push to your branch. Could you please run

ONLY_TEST_AGENT=BrowsingAgent ./tests/integration/regenerate.sh

to fix the test, now that the prompt has been updated?

@frankxu2004
Copy link
Collaborator Author

@li-boxuan fixed, and should be in a mergeable state.

@yufansong yufansong merged commit 48151bd into All-Hands-AI:main Jun 6, 2024
2 checks passed
if len(state.history) == 1:
# initialize and retrieve the first observation by issuing an noop OP
# TODO: need more elegant way of doing this
return BrowseInteractiveAction(browser_actions='noop()')
Copy link
Collaborator

@enyst enyst Jun 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@frankxu2004 can you please help me understand this, if we have state.history of 1, we already have an (action, observation), why do we need to send another action to get the first observation, isn't that the second observation?

I need to merge it in #2021 which has refactored history to get events from the event stream, and they're no longer perfect pairs... I'm trying to not break this :)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry this is a bit hacky - when len(state.history) == 1, the history is [(MessageAction("user's initial instruction"), NullObservattion())], and then we send a BrowserInteractiveAction("noop()"), and get a current BrowserObservation.

I think in the future, when the EventStream is more accessible to other processes (e.g. inside sandbox, or the separate Browser process), we could simply let the browser to subscribe to EventStream, and send the initial Observation to the EventStream, and then the agent will already have an initial observation, without having to rely on this hack

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aha! Sorry, I edited my comment to explain meanwhile, too.

Okay, thank you. I don't know if #2021 goes far enough for that, just in that direction I guess.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If they are no longer in perfect pairs, do we have an idea of what observation is from what action?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this specific case, maybe we can check if there’s no BrowserObservation at all in the whole history, we issue a noop to get the initial browser observation . What do you think

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to note, when the browsing agent runs as a delegate, it doesn't seem to have a previous MessageAction, it starts with 0 history I think, so this doesn't do any extra 'noop', but the rest works.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, how does the browsing agent get the delegate intent then?

Copy link
Collaborator

@enyst enyst Jun 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Via inputs: https://github.com/OpenDevin/OpenDevin/blob/45ce09d70ec9e46ded03df3b168f684e030c6dee/agenthub/browsing_agent/browsing_agent.py#L112

I think passing a MessageAction is better... if we can do it. Figuring out the current task is needed in other parts of the code.

Or, I don't know if a message action, but the delegate action is at this point the latest action in the stream. That action has inputs, which is passed in State, which is read above.

Copy link
Collaborator

@enyst enyst Jun 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think in the future, when the EventStream is more accessible to other processes (e.g. inside sandbox, or the separate Browser process), we could simply let the browser to subscribe to EventStream, and send the initial Observation to the EventStream, and then the agent will already have an initial observation, without having to rely on this hack

We had an issue with a stubborn integration test and while debugging it on 2021 branch, I commented out this bit of code, and it all worked without it. I am less sure about main right now (I live in a parallel world these days, a world with a refactored history 😅), but testing on main it worked just fine without it and fixed the integration test in a normal way (with the expected history), so I removed it from main to fix that test.

Without it, the llm gets a prompt with the task, and no history, and that seemed to work fine, both as independent agent and as delegate. The LLM asks to go to the URL, which results in an observation, so it all continues normally from there.

Do you think that actually breaks something? What use case was it, that required another observation, is it possible it's no longer the case?

Copy link
Collaborator Author

@frankxu2004 frankxu2004 Jun 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For normal web browsing, we don't need to issue the noop since the browser always starts with a blank page, so the initial observation is non-existent anyway, and the browsing agent will first issue a goto(url) action to kick start the process.

However, for evaluation on benchmarks such as webarena and miniwob++, the browser env is initialized with some specific URL and the initial observation is quite crucial for the browsing agent to make the first decision - that's way I added the noop action first to get the initial observation. I fixed this issue in a new PR: #2341

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants