-
Notifications
You must be signed in to change notification settings - Fork 5.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CodeActAgent: Only delegate to BrowsingAgent as last resort #2326
CodeActAgent: Only delegate to BrowsingAgent as last resort #2326
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks OK, but it makes the prompt more complex. It'd be good to evaluate this against SWE-bench lite to check its effect on accuracy.
I have a bunch of evals I'd like to run once #2085 is merged, so I'll add this one to the list.
agent = delegate_action | ||
task = thought | ||
return AgentDelegateAction( | ||
agent=agent, inputs={'task': task}, thought=thought |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not an issue with this PR, just picking your brain on this: do we really have to set task in inputs? Can we, if not passing an actual MessageAction, at least make delegate action and obs a little more in-line with others, and for example use a field in AgentDelegateAction for the task, and use the content
field in AgentDelegateObservation, which all obs have, for its result?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we really have to set task in inputs?
Some micro-agents needs more than just a task as an input. For example, coder-agent needs both "task" and "summary" of codebase - although one could argue "summary" could be added to the "task" prompt.
and use the content field in AgentDelegateObservation
Some micro-agents needs more than just a "content" as an output. E.g. CommitWriterAgent can either return outputs['answer']
if it generates a commit message, or outputs['reason']
if it rejects the request.
which all obs have, for its result
Fundamentally, AgentDelegateObservation is different from others. CmdOutputObservation, e.g., is essentially an observation of stdout/stderr - that being said, what if in the future, we need to catch other side-effects? Likely we would need to do something similar here - use outputs
as a dict rather than content
as a string.
I believe this will lower the score - the performance between CodeActAgent and CodeActSWEAgent on SWE-bench lite has proved so. Question is how much it hurts the benchmark - that would be interesting to know. |
1cb27ad
to
ccd44cd
Compare
tests/integration/mock/CodeActAgent/test_browse_internet/prompt_001.log
Outdated
Show resolved
Hide resolved
Co-authored-by: Yufan Song <[email protected]>
I think |
Yeah I mostly agree. That being said, I guess it's still interesting to see how CodeActAgent perform differently than CodeActSWEAgent on SWE-bench lite though. |
@@ -38,8 +45,8 @@ | |||
|
|||
SYSTEM_SUFFIX = """Responses should be concise. | |||
The assistant should attempt fewer things at a time instead of putting too much commands OR code in one "execute" block. | |||
Include ONLY ONE <execute_ipython>, <execute_bash>, or <execute_browse> per response, unless the assistant is finished with the task or need more input or action from the user in order to proceed. | |||
IMPORTANT: Execute code using <execute_ipython>, <execute_bash>, or <execute_browse> whenever possible. | |||
Include ONLY ONE <execute_ipython>, <execute_bash>, <execute_browse>, or <execute_delegate> per response, unless the assistant is finished with the task or needs more input or action from the user in order to proceed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm actually thinking about a path where we:
- Refractor the arch a bit to use event stream of orchestrating everything
- Make primitives of
<execute_browse>
available inside<execute_ipython>
as python function (e.g.,goto_url(XXX)
) so it is one fewer tags for model to remember - We ONLY keep
<execute_ipython>
,<execute_bash>
, or<execute_delegate>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool, let's hold this PR for now until we can remove execute_browse
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Blocked by #2404
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, so I guess this issue is blocked by a refactor that will likely be done by @xingyaoww , so maybe I should assign it to him for now?
Converted to a draft since this is not ready for review at the moment. |
I am closing this PR. I feel like this makes our prompt unnecessarily complicated. Just simply delegating browsing actions to BrowsingAgent seems to be the right thing to do. FWIW, the following bug mentioned in the opening comment
was fixed by another PR. |
#2103 was checked-in to main branch, but it is reported that it seems to be an overkill (and waste at least one round of conversation) to delegate a simple web browsing action to BrowsingAgent.
This PR is a partial revert of the previous PR: it brings browsing action back, and prompts CodeActAgent to delegate to BrowsingAgent only if it cannot find any other way. Hopefully this will also make LLM to browse less - it is also reported that CodeActAgent likes to browse even if it could just use API.
This PR doesn't solve one problem: sometimes the browsing agent has finished its job and reported back the result, yet codeact agent still decides to try other ways or delegates again, as if the job is not done. My guesstimate is since we only include the final observation of child agent in parent agent's history, the parent loses the chain of thought to achieve the result, and thus gets confused. I believe the solution would be to append child's history to parent's history, but in a summary format (a.k.a. "condensed"). @enyst has been doing some work and preliminary result seems promising. See discussion here #2103 (comment).
I regenerated all prompts & responses use GPT-4o since this PR changes the prompt of CodeActAgent by a lot. Using real LLM to regenerate could help us make sure that the agent at least keeps the bare minimum quality to pass all tests.
Now CodeActAgent would first use simple
goto
browse action, figures out the answer is hidden behind a button, and then decides to delegate to BrowsingAgent. This seems promising, but it might also just because I included a similar example in the one-shot learning.