Enable CodeAct agents with browsing, and also enable arbitrary BrowserGym action support #1807

Merged
merged 8 commits into from
May 15, 2024

20 changes: 18 additions & 2 deletions agenthub/codeact_agent/codeact_agent.py
@@ -13,11 +13,13 @@
from opendevin.events.action import (
Action,
AgentFinishAction,
BrowseInteractiveAction,
CmdRunAction,
IPythonRunCellAction,
MessageAction,
)
from opendevin.events.observation import (
BrowserOutputObservation,
CmdOutputObservation,
IPythonRunCellObservation,
)
@@ -33,7 +35,7 @@

def parse_response(response) -> str:
action = response.choices[0].message.content
for lang in ['bash', 'ipython']:
for lang in ['bash', 'ipython', 'browse']:
if f'<execute_{lang}>' in action and f'</execute_{lang}>' not in action:
action += f'</execute_{lang}>'
return action
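The `'browse'` entry matters because `</execute_browse>` is also registered as a stop sequence in this PR, so the completion typically comes back without its closing tag. Below is a minimal standalone sketch of the same repair step, working on a plain string rather than the LLM response object. The function name `close_open_tags` is illustrative and not part of the PR.

```python
def close_open_tags(text: str, langs=('bash', 'ipython', 'browse')) -> str:
    """Append a missing closing tag for any opened <execute_*> block."""
    for lang in langs:
        open_tag, close_tag = f'<execute_{lang}>', f'</execute_{lang}>'
        if open_tag in text and close_tag not in text:
            text += close_tag
    return text


# A completion cut off at the stop sequence (closing tag missing)
truncated = 'Let me open the page:\n<execute_browse>\ngoto("http://127.0.0.1:5000")\n'
repaired = close_open_tags(truncated)
```

Applying the function twice is harmless, since a string that already contains the closing tag is returned unchanged.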
@@ -85,7 +87,7 @@ def swe_agent_edit_hack(bash_command: str) -> str:


class CodeActAgent(Agent):
VERSION = '1.2'
VERSION = '1.3'
"""
The Code Act Agent is a minimalist agent.
The agent works by passing the model a list of action-observation pairs and prompting the model to take the next step.
@@ -171,6 +173,7 @@ def step(self, state: State) -> Action:
Returns:
- CmdRunAction(command) - bash command to run
- IPythonRunCellAction(code) - IPython code to run
- BrowseInteractiveAction(browsergym_command) - BrowserGym commands to run
- MessageAction(content) - Message action to run (e.g. ask for clarification)
- AgentFinishAction() - end the interaction
"""
@@ -205,6 +208,9 @@ def step(self, state: State) -> Action:
content = '\n'.join(splitted)
content = truncate_observation(content)
self.messages.append({'role': 'user', 'content': content})
elif isinstance(obs, BrowserOutputObservation):
content = 'OBSERVATION:\n' + truncate_observation(obs.content)
self.messages.append({'role': 'user', 'content': content})

latest_user_message = [m for m in self.messages if m['role'] == 'user'][-1]
if latest_user_message:
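The new `BrowserOutputObservation` branch turns browser output into a user message in the same way as the other observation types. Here is a rough sketch of that mapping. `FakeBrowserOutput`, `MAX_CHARS`, and the body of `truncate_observation` are assumptions made for this example; the PR only shows that a `truncate_observation` helper is called, not how it works.

```python
from dataclasses import dataclass

MAX_CHARS = 200  # hypothetical limit, chosen only for this example


@dataclass
class FakeBrowserOutput:
    """Stand-in for BrowserOutputObservation; only `content` is modelled."""
    content: str


def truncate_observation(text: str, max_chars: int = MAX_CHARS) -> str:
    """Hypothetical truncation: keep the head and tail of long output."""
    if len(text) <= max_chars:
        return text
    half = max_chars // 2
    return text[:half] + '\n[... truncated ...]\n' + text[-half:]


def observation_to_message(obs: FakeBrowserOutput) -> dict:
    """Mirror the new branch: prefix, truncate, and tag as a user message."""
    content = 'OBSERVATION:\n' + truncate_observation(obs.content)
    return {'role': 'user', 'content': content}


message = observation_to_message(FakeBrowserOutput('[1, 2, 3]'))
```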
@@ -217,6 +223,7 @@ def step(self, state: State) -> Action:
stop=[
'</execute_ipython>',
'</execute_bash>',
'</execute_browse>',
],
temperature=0.0,
)
@@ -251,6 +258,15 @@ def step(self, state: State) -> Action:
code_group = python_code.group(1).strip()
thought = action_str.replace(python_code.group(0), '').strip()
return IPythonRunCellAction(code=code_group, thought=thought)
elif browse_command := re.search(
r'<execute_browse>(.*)</execute_browse>', action_str, re.DOTALL
):
# BrowserGym actions were found
browse_actions = browse_command.group(1).strip()
thought = action_str.replace(browse_command.group(0), '').strip()
return BrowseInteractiveAction(
browser_actions=browse_actions, thought=thought
)
else:
# We assume the LLM is GOOD enough that when it returns pure natural language,
# it wants to talk to the user
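The browse branch above separates the model's reply into the BrowserGym command string and the surrounding "thought" text. The sketch below reproduces that split in isolation, using the same regular expression as the diff. The helper name `split_browse_action` is made up for this example.

```python
import re

# Same pattern as in the diff; DOTALL lets the command span several lines
BROWSE_PATTERN = re.compile(r'<execute_browse>(.*)</execute_browse>', re.DOTALL)


def split_browse_action(action_str: str):
    """Return (browser_actions, thought), or None if no browse block is present."""
    match = BROWSE_PATTERN.search(action_str)
    if match is None:
        return None
    browser_actions = match.group(1).strip()
    # Everything outside the tagged block is kept as the thought
    thought = action_str.replace(match.group(0), '').strip()
    return browser_actions, thought


reply = (
    "Sure! Let me browse the server's homepage:\n"
    '<execute_browse>\ngoto("http://127.0.0.1:5000")\n</execute_browse>'
)
result = split_browse_action(reply)
```

Note that `(.*)` is greedy, so a reply with two browse blocks would be captured as a single span. With the stop sequences configured in this PR, generation should normally end at the first closing tag, which makes that case unlikely.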
22 changes: 20 additions & 2 deletions agenthub/codeact_agent/prompt.py
@@ -34,6 +34,8 @@
</execute_ipython>
The assistant can execute bash commands on behalf of the user by wrapping them with <execute_bash> and </execute_bash>.
For example, you can list the files in the current directory by <execute_bash> ls </execute_bash>.
The assistant can browse the Internet with commands on behalf of the user by wrapping them with <execute_browse> and </execute_browse>.
For example, you can browse a given URL by <execute_browse> goto("<URL>") </execute_browse>.
The assistant should attempt fewer things at a time instead of putting too much commands OR code in one "execute" block.
The assistant can install Python packages through bash by <execute_bash> pip install [package needed] </execute_bash> and should always import packages and define variables before starting to use them.
The assistant should stop <execute> and provide an answer when they have already obtained the answer from the execution result.
@@ -49,8 +51,8 @@
If you require access to GitHub but $GITHUB_TOKEN is not set, ask the user to set it for you."""

SYSTEM_SUFFIX = """The assistant's response should be concise.
You should include <execute_ipython> or <execute_bash> in every one of your responses, unless you are finished with the task or need more input or action from the user in order to proceed.
IMPORTANT: Whenever possible, execute the code for the user using <execute_ipython> or <execute_bash> instead of providing it.
You should include <execute_ipython> or <execute_bash> or <execute_browse> in every one of your responses, unless you are finished with the task or need more input or action from the user in order to proceed.
IMPORTANT: Whenever possible, execute the code for the user using <execute_ipython> or <execute_bash> or <execute_browse> instead of providing it.
"""

EXAMPLES = """
@@ -154,6 +156,21 @@ def index():
ASSISTANT:
The server is running on port 5000 with PID 124. You can access the list of numbers by visiting http://127.0.0.1:5000. If you have any further questions, feel free to ask!

USER: Now browse the newly started server's homepage and show me the content.

ASSISTANT:
Sure! Let me browse the server's homepage at http://127.0.0.1:5000:
<execute_browse>
goto("http://127.0.0.1:5000")
</execute_browse>

USER:
Observation:
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

ASSISTANT:
The content of the server's homepage is "[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]". If you have any further questions, feel free to ask!

USER: Now kill the server, make it display the numbers in a table format.

ASSISTANT:
@@ -230,4 +247,5 @@ def index():
"I don't understand your input. \n"
'If you want to execute a bash command, please use <execute_bash> YOUR_COMMAND_HERE </execute_bash>.\n'
'If you want to execute a block of Python code, please use <execute_ipython> YOUR_COMMAND_HERE </execute_ipython>.\n'
'If you want to browse the Internet, please use <execute_browse> YOUR_COMMAND_HERE </execute_browse>.\n'
)
5 changes: 5 additions & 0 deletions frontend/src/services/actions.ts
@@ -17,6 +17,11 @@ const messageActions = {
store.dispatch(setUrl(url));
store.dispatch(setScreenshotSrc(screenshotSrc));
},
[ActionType.BROWSE_INTERACTIVE]: (message: ActionMessage) => {
const { url, screenshotSrc } = message.args;
store.dispatch(setUrl(url));
store.dispatch(setScreenshotSrc(screenshotSrc));
},
[ActionType.WRITE]: (message: ActionMessage) => {
const { path, content } = message.args;
store.dispatch(updatePath(path));
3 changes: 3 additions & 0 deletions frontend/src/types/ActionType.tsx
@@ -23,6 +23,9 @@ enum ActionType {
// Opens a web page.
BROWSE = "browse",

// Interact with the browser instance.
BROWSE_INTERACTIVE = "browse_interactive",

// Searches long-term memory.
RECALL = "recall",

4 changes: 4 additions & 0 deletions opendevin/core/schema/action.py
@@ -40,6 +40,10 @@ class ActionTypeSchema(BaseModel):
"""Opens a web page.
"""

BROWSE_INTERACTIVE: str = Field(default='browse_interactive')
"""Interact with the browser instance.
"""

RECALL: str = Field(default='recall')
"""Searches long-term memory
"""
3 changes: 2 additions & 1 deletion opendevin/events/action/__init__.py
@@ -7,7 +7,7 @@
AgentSummarizeAction,
ChangeAgentStateAction,
)
from .browse import BrowseURLAction
from .browse import BrowseInteractiveAction, BrowseURLAction
from .commands import CmdKillAction, CmdRunAction, IPythonRunCellAction
from .empty import NullAction
from .files import FileReadAction, FileWriteAction
@@ -20,6 +20,7 @@
'CmdRunAction',
'CmdKillAction',
'BrowseURLAction',
'BrowseInteractiveAction',
'FileReadAction',
'FileWriteAction',
'AgentRecallAction',
12 changes: 12 additions & 0 deletions opendevin/events/action/browse.py
@@ -16,3 +16,15 @@ class BrowseURLAction(Action):
@property
def message(self) -> str:
return f'Browsing URL: {self.url}'


@dataclass
class BrowseInteractiveAction(Action):
browser_actions: str
Contributor

Hi, I only see `goto` among the browser_actions in this PR. Are there any other types? If there is only one browser action, then maybe BrowseInteractiveAction is just the same as BrowseURLAction?

Contributor

We're planning on adding more actions later, but want to do it a little bit at a time as we validate that this doesn't hurt accuracy.

Collaborator Author

For now, this is true: the prompts only show the model how to use goto(). However, there are many other actions that can be supported: https://github.com/ServiceNow/BrowserGym/blob/main/core/src/browsergym/core/action/highlevel.py

This PR lays the groundwork for that support in the future.

Contributor

OK, got it.

thought: str = ''
action: str = ActionType.BROWSE_INTERACTIVE
runnable: ClassVar[bool] = True

@property
def message(self) -> str:
return f'Executing browser actions: {self.browser_actions}'
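To make the shape of the new action concrete, here is a self-contained sketch of the dataclass as it reads once the interleaved review comments are set aside. The `Action` base class below is a hypothetical stand-in (the real one is imported from `opendevin.events.action`), and the literal `'browse_interactive'` replaces `ActionType.BROWSE_INTERACTIVE` so the example runs on its own.

```python
from dataclasses import dataclass
from typing import ClassVar


@dataclass
class Action:
    """Hypothetical minimal base class, used only for this example."""
    runnable: ClassVar[bool] = False


@dataclass
class BrowseInteractiveAction(Action):
    browser_actions: str  # BrowserGym command string, e.g. goto("<URL>")
    thought: str = ''
    action: str = 'browse_interactive'  # value of ActionType.BROWSE_INTERACTIVE
    runnable: ClassVar[bool] = True  # class-level flag, not an instance field

    @property
    def message(self) -> str:
        return f'Executing browser actions: {self.browser_actions}'


act = BrowseInteractiveAction(browser_actions='goto("http://127.0.0.1:5000")')
```

Because `browser_actions` is a free-form string rather than a URL field, the same class can carry other BrowserGym commands later without a schema change, which matches the intent described in the review thread.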
3 changes: 2 additions & 1 deletion opendevin/events/serialization/action.py
@@ -7,7 +7,7 @@
AgentRejectAction,
ChangeAgentStateAction,
)
from opendevin.events.action.browse import BrowseURLAction
from opendevin.events.action.browse import BrowseInteractiveAction, BrowseURLAction
from opendevin.events.action.commands import (
CmdKillAction,
CmdRunAction,
@@ -22,6 +22,7 @@
CmdRunAction,
IPythonRunCellAction,
BrowseURLAction,
BrowseInteractiveAction,
FileReadAction,
FileWriteAction,
AgentRecallAction,
5 changes: 5 additions & 0 deletions opendevin/runtime/runtime.py
@@ -5,6 +5,7 @@
from opendevin.events.action import (
Action,
AgentRecallAction,
BrowseInteractiveAction,
BrowseURLAction,
CmdKillAction,
CmdRunAction,
@@ -154,6 +155,10 @@ async def write(self, action: FileWriteAction) -> Observation:
async def browse(self, action: BrowseURLAction) -> Observation:
pass

@abstractmethod
async def browse_interactive(self, action: BrowseInteractiveAction) -> Observation:
pass

@abstractmethod
async def recall(self, action: AgentRecallAction) -> Observation:
pass
25 changes: 18 additions & 7 deletions opendevin/runtime/server/browse.py
@@ -1,15 +1,23 @@
import os

from opendevin.core.schema import ActionType
from opendevin.events.observation import BrowserOutputObservation


async def browse(action, browser) -> BrowserOutputObservation: # type: ignore
asked_url = action.url
if not asked_url.startswith('http'):
asked_url = os.path.abspath(os.curdir) + action.url
try:
# action in BrowserGym: see https://github.com/ServiceNow/BrowserGym/blob/main/core/src/browsergym/core/action/functions.py
if action.action == ActionType.BROWSE:
# legacy BrowseURLAction
asked_url = action.url
if not asked_url.startswith('http'):
asked_url = os.path.abspath(os.curdir) + action.url
action_str = f'goto("{asked_url}")'
elif action.action == ActionType.BROWSE_INTERACTIVE:
# new BrowseInteractiveAction, supports full featured BrowserGym actions
# action in BrowserGym: see https://github.com/ServiceNow/BrowserGym/blob/main/core/src/browsergym/core/action/functions.py
action_str = action.browser_actions
else:
raise ValueError(f'Invalid action type: {action.action}')
try:
# obs provided by BrowserGym: see https://github.com/ServiceNow/BrowserGym/blob/main/core/src/browsergym/core/env.py#L396
obs = browser.step(action_str)
return BrowserOutputObservation(
@@ -21,9 +29,12 @@ async def browse(action, browser) -> BrowserOutputObservation: # type: ignore
last_browser_action=obs['last_action'], # last browser env action performed
focused_element_bid=obs['focused_element_bid'], # focused element bid
screenshot=obs['screenshot'], # base64-encoded screenshot, png
url=asked_url,
url=obs['url'], # URL of the page
)
except Exception as e:
return BrowserOutputObservation(
content=str(e), screenshot='', error=True, url=asked_url
content=str(e),
screenshot='',
error=True,
url=asked_url if action.action == ActionType.BROWSE else '',
)
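Since the removed and added lines are interleaved above without +/- markers, the resulting control flow can be hard to follow. The following is a sketch of the post-change dispatch only, as read from the added lines. It uses plain string constants and `SimpleNamespace` objects in place of `ActionType` and the real action classes, and the helper name `to_action_str` is an assumption of this example, not a function in the PR.

```python
import os
from types import SimpleNamespace

BROWSE = 'browse'  # stands in for ActionType.BROWSE
BROWSE_INTERACTIVE = 'browse_interactive'  # stands in for ActionType.BROWSE_INTERACTIVE


def to_action_str(action) -> str:
    """Translate either browse action into a BrowserGym action string."""
    if action.action == BROWSE:
        # legacy BrowseURLAction: wrap the URL in goto(...)
        url = action.url
        if not url.startswith('http'):
            url = os.path.abspath(os.curdir) + action.url
        return f'goto("{url}")'
    if action.action == BROWSE_INTERACTIVE:
        # BrowseInteractiveAction: pass the command string through unchanged
        return action.browser_actions
    raise ValueError(f'Invalid action type: {action.action}')


legacy = SimpleNamespace(action=BROWSE, url='http://127.0.0.1:5000')
interactive = SimpleNamespace(
    action=BROWSE_INTERACTIVE, browser_actions='goto("http://example.com")'
)
```

In the diff, the resulting string is passed to `browser.step(...)`. On success the observation's URL comes from `obs['url']`; on failure `asked_url` is reported only for the legacy action, because an interactive command need not contain a URL at all.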
4 changes: 4 additions & 0 deletions opendevin/runtime/server/runtime.py
@@ -1,5 +1,6 @@
from opendevin.events.action import (
AgentRecallAction,
BrowseInteractiveAction,
BrowseURLAction,
CmdKillAction,
CmdRunAction,
@@ -58,6 +59,9 @@ async def write(self, action: FileWriteAction) -> Observation:
async def browse(self, action: BrowseURLAction) -> Observation:
return await browse(action, self.browser)

async def browse_interactive(self, action: BrowseInteractiveAction) -> Observation:
return await browse(action, self.browser)

async def recall(self, action: AgentRecallAction) -> Observation:
return NullObservation('')

21 changes: 19 additions & 2 deletions tests/integration/mock/CodeActAgent/test_edits/prompt_001.log
@@ -9,6 +9,8 @@ print("Hello World!")
</execute_ipython>
The assistant can execute bash commands on behalf of the user by wrapping them with <execute_bash> and </execute_bash>.
For example, you can list the files in the current directory by <execute_bash> ls </execute_bash>.
The assistant can browse the Internet with commands on behalf of the user by wrapping them with <execute_browse> and </execute_browse>.
For example, you can browse a given URL by <execute_browse> goto("<URL>") </execute_browse>.
The assistant should attempt fewer things at a time instead of putting too much commands OR code in one "execute" block.
The assistant can install Python packages through bash by <execute_bash> pip install [package needed] </execute_bash> and should always import packages and define variables before starting to use them.
The assistant should stop <execute> and provide an answer when they have already obtained the answer from the execution result.
@@ -38,8 +40,8 @@ end_of_edit - replaces lines <start_line> through <end_line> (inclusive) with th
Please note that THE EDIT COMMAND REQUIRES PROPER INDENTATION. If you'd like to add the line ' print(x)' you must fully write that out, with all those spaces before the code! Indentation is important and code that is not indented correctly will fail and require fixing before it can be run.

The assistant's response should be concise.
You should include <execute_ipython> or <execute_bash> in every one of your responses, unless you are finished with the task or need more input or action from the user in order to proceed.
IMPORTANT: Whenever possible, execute the code for the user using <execute_ipython> or <execute_bash> instead of providing it.
You should include <execute_ipython> or <execute_bash> or <execute_browse> in every one of your responses, unless you are finished with the task or need more input or action from the user in order to proceed.
IMPORTANT: Whenever possible, execute the code for the user using <execute_ipython> or <execute_bash> or <execute_browse> instead of providing it.


----------
@@ -146,6 +148,21 @@ Press CTRL+C to quit
ASSISTANT:
The server is running on port 5000 with PID 124. You can access the list of numbers by visiting http://127.0.0.1:5000. If you have any further questions, feel free to ask!

USER: Now browse the newly started server's homepage and show me the content.

ASSISTANT:
Sure! Let me browse the server's homepage at http://127.0.0.1:5000:
<execute_browse>
goto("http://127.0.0.1:5000")
</execute_browse>

USER:
Observation:
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

ASSISTANT:
The content of the server's homepage is "[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]". If you have any further questions, feel free to ask!

USER: Now kill the server, make it display the numbers in a table format.

ASSISTANT:
34 changes: 25 additions & 9 deletions tests/integration/mock/CodeActAgent/test_edits/prompt_002.log
@@ -9,6 +9,8 @@ print("Hello World!")
</execute_ipython>
The assistant can execute bash commands on behalf of the user by wrapping them with <execute_bash> and </execute_bash>.
For example, you can list the files in the current directory by <execute_bash> ls </execute_bash>.
The assistant can browse the Internet with commands on behalf of the user by wrapping them with <execute_browse> and </execute_browse>.
For example, you can browse a given URL by <execute_browse> goto("<URL>") </execute_browse>.
The assistant should attempt fewer things at a time instead of putting too much commands OR code in one "execute" block.
The assistant can install Python packages through bash by <execute_bash> pip install [package needed] </execute_bash> and should always import packages and define variables before starting to use them.
The assistant should stop <execute> and provide an answer when they have already obtained the answer from the execution result.
@@ -38,8 +40,8 @@ end_of_edit - replaces lines <start_line> through <end_line> (inclusive) with th
Please note that THE EDIT COMMAND REQUIRES PROPER INDENTATION. If you'd like to add the line ' print(x)' you must fully write that out, with all those spaces before the code! Indentation is important and code that is not indented correctly will fail and require fixing before it can be run.

The assistant's response should be concise.
You should include <execute_ipython> or <execute_bash> in every one of your responses, unless you are finished with the task or need more input or action from the user in order to proceed.
IMPORTANT: Whenever possible, execute the code for the user using <execute_ipython> or <execute_bash> instead of providing it.
You should include <execute_ipython> or <execute_bash> or <execute_browse> in every one of your responses, unless you are finished with the task or need more input or action from the user in order to proceed.
IMPORTANT: Whenever possible, execute the code for the user using <execute_ipython> or <execute_bash> or <execute_browse> instead of providing it.


----------
@@ -146,6 +148,21 @@ Press CTRL+C to quit
ASSISTANT:
The server is running on port 5000 with PID 124. You can access the list of numbers by visiting http://127.0.0.1:5000. If you have any further questions, feel free to ask!

USER: Now browse the newly started server's homepage and show me the content.

ASSISTANT:
Sure! Let me browse the server's homepage at http://127.0.0.1:5000:
<execute_browse>
goto("http://127.0.0.1:5000")
</execute_browse>

USER:
Observation:
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

ASSISTANT:
The content of the server's homepage is "[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]". If you have any further questions, feel free to ask!

USER: Now kill the server, make it display the numbers in a table format.

ASSISTANT:
@@ -228,20 +245,19 @@ ENVIRONMENT REMINDER: You have 9 turns left to complete the task.

----------

Certainly! I will start by opening the file `bad.txt` to see the content and then proceed to fix the typos.
Let's start by examining the contents of `bad.txt` to identify the typos. I'll read the file first.

<execute_bash>
open bad.txt
cat bad.txt
</execute_bash>

----------

OBSERVATION:
[File: /workspace/bad.txt (4 lines total)]
1:This is a stupid typoo.
2:Really?
3:No mor typos!
4:Enjoy!
This is a stupid typoo.
Really?
No mor typos!
Enjoy!
[Command -1 finished with exit code 0]]

ENVIRONMENT REMINDER: You have 8 turns left to complete the task.