CodeActAgent: Only delegate to BrowsingAgent as last resort #2326

li-boxuan · 2024-06-08T01:01:44Z

#2103 was merged into the main branch, but it has been reported that delegating a simple web browsing action to BrowsingAgent is overkill and wastes at least one round of conversation.

This PR is a partial revert of that PR: it brings the browsing action back and prompts CodeActAgent to delegate to BrowsingAgent only if it cannot find any other way. Hopefully this will also make the LLM browse less; it has also been reported that CodeActAgent likes to browse even when it could just use an API.

This PR doesn't solve one problem: sometimes the browsing agent has finished its job and reported back the result, yet codeact agent still decides to try other ways or delegates again, as if the job is not done. My guess is that, since we only include the final observation of the child agent in the parent agent's history, the parent loses the chain of thought that led to the result and thus gets confused. I believe the solution would be to append the child's history to the parent's history, but in a summarized (a.k.a. "condensed") format. @enyst has been doing some work on this and the preliminary results seem promising. See the discussion here: #2103 (comment).
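To make the "condensed history" idea concrete, here is a minimal sketch. It is not the project's implementation: `Event` and `condense_history` are hypothetical names, and plain truncation stands in for whatever summarization the real condenser would use.

```python
from dataclasses import dataclass


@dataclass
class Event:
    """Hypothetical stand-in for one action/observation pair in an agent's history."""
    action: str
    observation: str


def condense_history(events: list[Event], max_chars: int = 80) -> str:
    """Summarize a child agent's history as one short line per step.

    Assumption: truncating each observation is a good-enough summary here;
    a real condenser would more likely ask an LLM to summarize.
    """
    lines = []
    for step, event in enumerate(events, start=1):
        obs = event.observation
        if len(obs) > max_chars:
            obs = obs[:max_chars] + "..."
        lines.append(f"{step}. {event.action} -> {obs}")
    return "\n".join(lines)


# The child (browsing) agent's full history, made up for illustration.
child_history = [
    Event("goto('http://localhost:8000')", "Page shows a button labelled 'Click me'"),
    Event("click('button')", "Page now shows: The answer is 42"),
]
summary = condense_history(child_history)

# The parent keeps the condensed steps, not only the final observation.
parent_history = ["User: what is the answer hidden on localhost:8000?"]
parent_history.append("BrowsingAgent steps:\n" + summary)
```

With the intermediate steps visible, the parent can see how the result was obtained, which is the context it is missing when only the final observation is recorded.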

I regenerated all prompts and responses using GPT-4o, since this PR changes the CodeActAgent prompt a lot. Regenerating with a real LLM helps us make sure that the agent at least keeps the bare minimum quality needed to pass all tests.

Now CodeActAgent first uses a simple goto browse action, figures out that the answer is hidden behind a button, and then decides to delegate to BrowsingAgent. This seems promising, but it might also just be because I included a similar example in the one-shot learning.


neubig

This looks OK, but it makes the prompt more complex. It'd be good to evaluate this against SWE-bench lite to check its effect on accuracy.

I have a bunch of evals I'd like to run once #2085 is merged, so I'll add this one to the list.

agenthub/codeact_agent/prompt.py

enyst · 2024-06-08T15:11:19Z

agenthub/codeact_agent/codeact_agent.py

+                agent = delegate_action
+                task = thought
+            return AgentDelegateAction(
+                agent=agent, inputs={'task': task}, thought=thought


Not an issue with this PR, just picking your brain on this: do we really have to set task in inputs? Can we, if not passing an actual MessageAction, at least make delegate action and obs a little more in-line with others, and for example use a field in AgentDelegateAction for the task, and use the content field in AgentDelegateObservation, which all obs have, for its result?

li-boxuan · 2024-06-09

> do we really have to set task in inputs?

Some micro-agents need more than just a task as input. For example, coder-agent needs both a "task" and a "summary" of the codebase, although one could argue that "summary" could be added to the "task" prompt.

> and use the content field in AgentDelegateObservation

Some micro-agents need more than just a "content" as output. E.g. CommitWriterAgent can either return outputs['answer'] if it generates a commit message, or outputs['reason'] if it rejects the request.

> which all obs have, for its result

Fundamentally, AgentDelegateObservation is different from the others. CmdOutputObservation, for example, is essentially an observation of stdout/stderr. That said, what if we need to capture other side effects in the future? We would likely need to do something similar there: use outputs as a dict rather than content as a string.
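The trade-off can be sketched roughly as follows. These are simplified, hypothetical classes, not the real AgentDelegateAction and AgentDelegateObservation definitions; only the `inputs`/`outputs` dict shape and the 'task', 'summary', 'answer', and 'reason' keys come from the discussion above.

```python
from dataclasses import dataclass, field


@dataclass
class DelegateAction:
    """Hypothetical simplification: inputs is a free-form dict, not a single task string."""
    agent: str
    inputs: dict = field(default_factory=dict)
    thought: str = ""


@dataclass
class DelegateObservation:
    """Hypothetical simplification: outputs is a dict rather than one content string."""
    outputs: dict = field(default_factory=dict)


def describe_result(obs: DelegateObservation) -> str:
    """Branch on which key the child agent returned."""
    if "answer" in obs.outputs:
        return f"accepted: {obs.outputs['answer']}"
    if "reason" in obs.outputs:
        return f"rejected: {obs.outputs['reason']}"
    return "no result"


# A child agent that needs two inputs (values are made up for illustration).
coder = DelegateAction(
    agent="coder-agent",
    inputs={"task": "Add a --verbose flag", "summary": "CLI tool with one entry point"},
)

# A child agent that may return either of two outputs.
accepted = describe_result(DelegateObservation(outputs={"answer": "Fix typo in README"}))
rejected = describe_result(DelegateObservation(outputs={"reason": "No staged changes"}))
```

A single `content` string would force the 'answer' and 'reason' cases into one field, which is the limitation described in the reply.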

li-boxuan · 2024-06-09T05:03:38Z

> This looks OK, but it makes the prompt more complex. It'd be good to evaluate this against SWE-bench lite to check its effect on accuracy.

I believe this will lower the score - the performance between CodeActAgent and CodeActSWEAgent on SWE-bench lite has proved so. Question is how much it hurts the benchmark - that would be interesting to know.


agenthub/codeact_agent/prompt.py

tests/integration/mock/CodeActAgent/test_browse_internet/prompt_001.log

Co-authored-by: Yufan Song <[email protected]>

assertion · 2024-06-10T03:26:57Z

> > This looks OK, but it makes the prompt more complex. It'd be good to evaluate this against SWE-bench lite to check its effect on accuracy.

> I believe this will lower the score - the performance between CodeActAgent and CodeActSWEAgent on SWE-bench lite has proved so. Question is how much it hurts the benchmark - that would be interesting to know.

I think CodeActAgent is for end users and CodeActSWEAgent is for benchmarks. So CodeActAgent can explore multi-agent mode (and maybe other new skills/modes) without paying too much attention to the benchmark score, while CodeActSWEAgent focuses on the benchmark score, where accuracy, cost, and performance matter most, so new skills/modes probably won't be used there at the beginning. @li-boxuan @neubig

li-boxuan · 2024-06-10T03:56:50Z

> So CodeActAgent can explore multi-agent mode (and maybe other new skills/modes) without paying too much attention to the benchmark score

Yeah, I mostly agree. That said, I guess it's still interesting to see how CodeActAgent performs differently from CodeActSWEAgent on SWE-bench lite.


tests/integration/test_agent.py

xingyaoww · 2024-06-11T09:45:09Z

agenthub/codeact_agent/prompt.py

@@ -38,8 +45,8 @@

 SYSTEM_SUFFIX = """Responses should be concise.
 The assistant should attempt fewer things at a time instead of putting too much commands OR code in one "execute" block.
-Include ONLY ONE <execute_ipython>, <execute_bash>, or <execute_browse> per response, unless the assistant is finished with the task or need more input or action from the user in order to proceed.
-IMPORTANT: Execute code using <execute_ipython>, <execute_bash>, or <execute_browse> whenever possible.
+Include ONLY ONE <execute_ipython>, <execute_bash>, <execute_browse>, or <execute_delegate> per response, unless the assistant is finished with the task or needs more input or action from the user in order to proceed.
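As an illustration of the "ONLY ONE action per response" rule in the prompt above, here is one hypothetical way a caller could check a model response against it. The function name and error handling are assumptions for this sketch; this is not how the project parses responses.

```python
import re
from typing import Optional

# The four tags named in the updated prompt line.
ACTION_TAGS = ("execute_ipython", "execute_bash", "execute_browse", "execute_delegate")
_ACTION_RE = re.compile(r"<(" + "|".join(ACTION_TAGS) + r")>(.*?)</\1>", re.DOTALL)


def extract_single_action(response: str) -> Optional[tuple[str, str]]:
    """Return (tag, body) for the single action in a response, or None if there is none.

    Raises ValueError when the response contains more than one action block.
    """
    matches = _ACTION_RE.findall(response)
    if len(matches) > 1:
        raise ValueError(f"expected at most one action, found {len(matches)}")
    if not matches:
        return None
    tag, body = matches[0]
    return tag, body.strip()


action = extract_single_action("Let me check.\n<execute_bash>\nls -la\n</execute_bash>")
```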


I'm actually thinking about a path where we:

1. Refactor the architecture a bit to use the event stream for orchestrating everything

2. Make the primitives of <execute_browse> available inside <execute_ipython> as Python functions (e.g., goto_url(XXX)), so there is one fewer tag for the model to remember

3. ONLY keep <execute_ipython>, <execute_bash>, and <execute_delegate>
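A rough sketch of what a browse primitive exposed as a plain Python function could look like. Only the name goto_url comes from the comment above; the signature is an assumption, and a standard-library fetch stands in for the real browser environment, which this sketch does not drive.

```python
from urllib.request import urlopen


def goto_url(url: str, max_chars: int = 2000, timeout: float = 10.0) -> str:
    """Fetch a page and return the start of its body as text.

    Assumption: a plain fetch is enough for illustration; it does not render
    JavaScript or support clicking, unlike a real browsing primitive.
    """
    with urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")[:max_chars]


# Inside an <execute_ipython> block the model would then simply call the function;
# a data: URL is used here so the example runs without network access.
page_text = goto_url("data:text/plain;charset=utf-8,hello%20world")
```

Because the primitive is an ordinary function, the model only needs to remember the ipython tag rather than a separate browse tag.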

li-boxuan · 2024-06-12

Cool, let's hold this PR for now until we can remove execute_browse

li-boxuan · 2024-06-12

Blocked by #2404

neubig · 2024-06-16

OK, so I guess this issue is blocked by a refactor that will likely be done by @xingyaoww, so maybe I should assign it to him for now?

neubig · 2024-06-22T02:43:42Z

Converted to a draft since this is not ready for review at the moment.

li-boxuan · 2024-07-11T03:46:07Z

I am closing this PR. I feel like this makes our prompt unnecessarily complicated. Just simply delegating browsing actions to BrowsingAgent seems to be the right thing to do.

FWIW, the following bug mentioned in the opening comment

> This PR doesn't solve one problem: sometimes the browsing agent has finished its job and reported back the result, yet codeact agent still decides to try other ways or delegates again, as if the job is not done

was fixed by another PR.

li-boxuan added 5 commits June 7, 2024 17:53

One-shot learning for CodeActAgent to delegate

83ca2d4

Fix indentation in the prompt example

c8632f2

Tune prompts more

73494fc

Regenerate

b504697

Merge remote-tracking branch 'upstream/main' into boxuan/tweak-codeac…

d497d82

…t-browse-prompt

neubig reviewed Jun 8, 2024

agenthub/codeact_agent/prompt.py (outdated)

neubig self-assigned this Jun 8, 2024

enyst reviewed Jun 8, 2024


li-boxuan mentioned this pull request Jun 9, 2024

Refactored prompt.py to reduce token usage #1996

Merged

li-boxuan added 2 commits June 8, 2024 23:35

Merge remote-tracking branch 'upstream/main' into boxuan/tweak-codeac…

0283970

…t-browse-prompt

Fix typo

ccd44cd

li-boxuan force-pushed the boxuan/tweak-codeact-browse-prompt branch from 1cb27ad to ccd44cd Compare June 9, 2024 18:38

Merge remote-tracking branch 'upstream/main' into boxuan/tweak-codeac…

7124cf5

…t-browse-prompt

frankxu2004 requested a review from xingyaoww June 10, 2024 00:38

yufansong reviewed Jun 10, 2024

agenthub/codeact_agent/prompt.py (outdated)

agenthub/codeact_agent/prompt.py (outdated)

tests/integration/mock/CodeActAgent/test_browse_internet/prompt_001.log (outdated)

Apply suggestions from code review

e0cbf4a

Co-authored-by: Yufan Song <[email protected]>

Regenerate prompts

9009896

Merge remote-tracking branch 'upstream/main' into boxuan/tweak-codeac…

c698c2a

…t-browse-prompt

li-boxuan commented Jun 11, 2024

tests/integration/test_agent.py

xingyaoww reviewed Jun 11, 2024


neubig assigned xingyaoww and unassigned neubig Jun 16, 2024

neubig marked this pull request as draft June 22, 2024 02:43

li-boxuan closed this Jul 11, 2024
