Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CodeActAgent: Only delegate to BrowsingAgent as last resort #2326

Closed

Conversation

li-boxuan
Copy link
Collaborator

@li-boxuan li-boxuan commented Jun 8, 2024

#2103 was checked-in to main branch, but it is reported that it seems to be an overkill (and waste at least one round of conversation) to delegate a simple web browsing action to BrowsingAgent.

This PR is a partial revert of the previous PR: it brings browsing action back, and prompts CodeActAgent to delegate to BrowsingAgent only if it cannot find any other way. Hopefully this will also make LLM to browse less - it is also reported that CodeActAgent likes to browse even if it could just use API.

This PR doesn't solve one problem: sometimes the browsing agent has finished its job and reported back the result, yet codeact agent still decides to try other ways or delegates again, as if the job is not done. My guesstimate is since we only include the final observation of child agent in parent agent's history, the parent loses the chain of thought to achieve the result, and thus gets confused. I believe the solution would be to append child's history to parent's history, but in a summary format (a.k.a. "condensed"). @enyst has been doing some work and preliminary result seems promising. See discussion here #2103 (comment).

I regenerated all prompts & responses use GPT-4o since this PR changes the prompt of CodeActAgent by a lot. Using real LLM to regenerate could help us make sure that the agent at least keeps the bare minimum quality to pass all tests.

Now CodeActAgent would first use simple goto browse action, figures out the answer is hidden behind a button, and then decides to delegate to BrowsingAgent. This seems promising, but it might also just because I included a similar example in the one-shot learning.

Copy link
Contributor

@neubig neubig left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks OK, but it makes the prompt more complex. It'd be good to evaluate this against SWE-bench lite to check its effect on accuracy.

I have a bunch of evals I'd like to run once #2085 is merged, so I'll add this one to the list.

@neubig neubig self-assigned this Jun 8, 2024
agent = delegate_action
task = thought
return AgentDelegateAction(
agent=agent, inputs={'task': task}, thought=thought
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not an issue with this PR, just picking your brain on this: do we really have to set task in inputs? Can we, if not passing an actual MessageAction, at least make delegate action and obs a little more in-line with others, and for example use a field in AgentDelegateAction for the task, and use the content field in AgentDelegateObservation, which all obs have, for its result?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we really have to set task in inputs?

Some micro-agents needs more than just a task as an input. For example, coder-agent needs both "task" and "summary" of codebase - although one could argue "summary" could be added to the "task" prompt.

and use the content field in AgentDelegateObservation

Some micro-agents needs more than just a "content" as an output. E.g. CommitWriterAgent can either return outputs['answer'] if it generates a commit message, or outputs['reason'] if it rejects the request.

which all obs have, for its result

Fundamentally, AgentDelegateObservation is different from others. CmdOutputObservation, e.g., is essentially an observation of stdout/stderr - that being said, what if in the future, we need to catch other side-effects? Likely we would need to do something similar here - use outputs as a dict rather than content as a string.

@li-boxuan
Copy link
Collaborator Author

This looks OK, but it makes the prompt more complex. It'd be good to evaluate this against SWE-bench lite to check its effect on accuracy.

I believe this will lower the score - the performance between CodeActAgent and CodeActSWEAgent on SWE-bench lite has proved so. Question is how much it hurts the benchmark - that would be interesting to know.

@li-boxuan li-boxuan force-pushed the boxuan/tweak-codeact-browse-prompt branch from 1cb27ad to ccd44cd Compare June 9, 2024 18:38
@frankxu2004 frankxu2004 requested a review from xingyaoww June 10, 2024 00:38
@assertion
Copy link
Contributor

This looks OK, but it makes the prompt more complex. It'd be good to evaluate this against SWE-bench lite to check its effect on accuracy.

I believe this will lower the score - the performance between CodeActAgent and CodeActSWEAgent on SWE-bench lite has proved so. Question is how much it hurts the benchmark - that would be interesting to know.

I think CodeActAgent for end-user, and CodeActSWEAgent for benchmark. So CodeActAgent can explore on multi-agent mode( maybe some other new skills/modes ), and don't pay too much attention on the benchmark score, and CodeActSWEAgent focus on the benchmark score, accuracy/cost/performance are the most important, so maybe new skills/modes won't used at the beginning. @li-boxuan @neubig

@li-boxuan
Copy link
Collaborator Author

So CodeActAgent can explore on multi-agent mode( maybe some other new skills/modes ), and don't pay too much attention on the benchmark score

Yeah I mostly agree. That being said, I guess it's still interesting to see how CodeActAgent perform differently than CodeActSWEAgent on SWE-bench lite though.

@@ -38,8 +45,8 @@

SYSTEM_SUFFIX = """Responses should be concise.
The assistant should attempt fewer things at a time instead of putting too much commands OR code in one "execute" block.
Include ONLY ONE <execute_ipython>, <execute_bash>, or <execute_browse> per response, unless the assistant is finished with the task or need more input or action from the user in order to proceed.
IMPORTANT: Execute code using <execute_ipython>, <execute_bash>, or <execute_browse> whenever possible.
Include ONLY ONE <execute_ipython>, <execute_bash>, <execute_browse>, or <execute_delegate> per response, unless the assistant is finished with the task or needs more input or action from the user in order to proceed.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm actually thinking about a path where we:

  1. Refractor the arch a bit to use event stream of orchestrating everything
  2. Make primitives of <execute_browse> available inside <execute_ipython> as python function (e.g., goto_url(XXX)) so it is one fewer tags for model to remember
  3. We ONLY keep <execute_ipython>, <execute_bash>, or <execute_delegate>

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool, let's hold this PR for now until we can remove execute_browse

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Blocked by #2404

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, so I guess this issue is blocked by a refactor that will likely be done by @xingyaoww , so maybe I should assign it to him for now?

@neubig neubig assigned xingyaoww and unassigned neubig Jun 16, 2024
@neubig neubig marked this pull request as draft June 22, 2024 02:43
@neubig
Copy link
Contributor

neubig commented Jun 22, 2024

Converted to a draft since this is not ready for review at the moment.

@li-boxuan
Copy link
Collaborator Author

li-boxuan commented Jul 11, 2024

I am closing this PR. I feel like this makes our prompt unnecessarily complicated. Just simply delegating browsing actions to BrowsingAgent seems to be the right thing to do.

FWIW, the following bug mentioned in the opening comment

This PR doesn't solve one problem: sometimes the browsing agent has finished its job and reported back the result, yet codeact agent still decides to try other ways or delegates again, as if the job is not done

was fixed by another PR.

@li-boxuan li-boxuan closed this Jul 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants