-
Notifications
You must be signed in to change notification settings - Fork 5.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[feat] WebArena benchmark, MiniWoB++ benchmark and related arch changes #2170
Conversation
--agent-cls $AGENT \ | ||
--llm-config $MODEL_CONFIG \ | ||
--max-iterations 15 \ | ||
--max-chars 10000000 \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
qq: 10000000
is a random number?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, it's just there to not truncate anything
obs, info = env.reset() | ||
# EVAL only: save the goal into file for evaluation | ||
if self.eval_mode: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Prefer to find a better way to record the intermedia data in the future, to avoid evaluation and runtime logic get mixed.
But it is fine for me to do this quick implementation here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree. This is ugly for now - I used filesystem to do IPC...
A much better solution in my opinion to better support BrowserGym benchmarks are allowing BrowserEnv (running in a separate process) to access the EventStream
that our main process have. If we can send the initial MessageAction
from BrowserEnv
process then these goal sharing code and reward saving code can be just removed - they can be communicated to the main process and we can handle all these eval-specific things in evaluation codebase only.
Maybe somehow in the future if we have a unified MQ this might be easier? Or if we do not want middleware we can probably make the EventStream
subscribe-able with multiprocessing
Now maybe we have to expose something to allow for better IPC between browser environment to the main process where we handle all the events. Love your thoughts, but I think this might be worth a new issue?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this might be worth a new issue?
Yes, I agree. Better to do it in future PR.
Also related: I wonder if we should allow configurable agents somewhere in the codebase? Now we can configure LLM configs, but not agents. It's nice to have configurable options passed the Agent class to control the prompts and stuff. |
For some reason I cannot directly push to your branch. Could you please run
to fix the test, now that the prompt has been updated? |
@li-boxuan fixed, and should be in a mergeable state. |
if len(state.history) == 1: | ||
# initialize and retrieve the first observation by issuing an noop OP | ||
# TODO: need more elegant way of doing this | ||
return BrowseInteractiveAction(browser_actions='noop()') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@frankxu2004 can you please help me understand this, if we have state.history of 1, we already have an (action, observation), why do we need to send another action to get the first observation, isn't that the second observation?
I need to merge it in #2021 which has refactored history to get events from the event stream, and they're no longer perfect pairs... I'm trying to not break this :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry this is a bit hacky - when len(state.history) == 1, the history is [(MessageAction("user's initial instruction"), NullObservattion())], and then we send a BrowserInteractiveAction("noop()"), and get a current BrowserObservation.
I think in the future, when the EventStream is more accessible to other processes (e.g. inside sandbox, or the separate Browser process), we could simply let the browser to subscribe to EventStream, and send the initial Observation to the EventStream, and then the agent will already have an initial observation, without having to rely on this hack
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aha! Sorry, I edited my comment to explain meanwhile, too.
Okay, thank you. I don't know if #2021 goes far enough for that, just in that direction I guess.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If they are no longer in perfect pairs, do we have an idea of what observation is from what action?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For this specific case, maybe we can check if there’s no BrowserObservation at all in the whole history, we issue a noop to get the initial browser observation . What do you think
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to note, when the browsing agent runs as a delegate, it doesn't seem to have a previous MessageAction, it starts with 0 history I think, so this doesn't do any extra 'noop', but the rest works.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, how does the browsing agent get the delegate intent then?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think passing a MessageAction is better... if we can do it. Figuring out the current task is needed in other parts of the code.
Or, I don't know if a message action, but the delegate action is at this point the latest action in the stream. That action has inputs
, which is passed in State, which is read above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think in the future, when the EventStream is more accessible to other processes (e.g. inside sandbox, or the separate Browser process), we could simply let the browser to subscribe to EventStream, and send the initial Observation to the EventStream, and then the agent will already have an initial observation, without having to rely on this hack
We had an issue with a stubborn integration test and while debugging it on 2021 branch, I commented out this bit of code, and it all worked without it. I am less sure about main
right now (I live in a parallel world these days, a world with a refactored history 😅), but testing on main
it worked just fine without it and fixed the integration test in a normal way (with the expected history), so I removed it from main
to fix that test.
Without it, the llm gets a prompt with the task, and no history, and that seemed to work fine, both as independent agent and as delegate. The LLM asks to go to the URL, which results in an observation, so it all continues normally from there.
Do you think that actually breaks something? What use case was it, that required another observation, is it possible it's no longer the case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For normal web browsing, we don't need to issue the noop since the browser always starts with a blank page, so the initial observation is non-existent anyway, and the browsing agent will first issue a goto(url) action to kick start the process.
However, for evaluation on benchmarks such as webarena and miniwob++, the browser env is initialized with some specific URL and the initial observation is quite crucial for the browsing agent to make the first decision - that's way I added the noop action first to get the initial observation. I fixed this issue in a new PR: #2341
Major changes:
Eval Results:
All tested on BrowsingAgent V1.0
MiniWoB++, 125 tasks (3 runs due to random init task)
GPT4o: 0.384, 0.416, 0.424, avg: 0.408
GPT3.5: 0.288, 0.256, 0.272, avg: 0.272
WebArena, 812 tasks (high cost, single run due to fixed task)
GPT4o: 0.1478
GPT3.5: 0.0517