-
Notifications
You must be signed in to change notification settings - Fork 5.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Eval,Arch] Update GPTQ eval and add headless_mode
for Controller
#2994
Conversation
@xingyaoww this made my day: 🤣 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, looks good to me and I agree that headless mode is a good idea. I had a few comments though.
evaluation/gpqa/run_infer.py
Outdated
MOST IMPORTANT: Format your response as follows: | ||
<<FINAL_ANSWER|| | ||
<insert correct answer here, must be one of A, B, C, D> (Please dont use any additional characters. Just the letter of the correct answer (A/B/C/D).) | ||
||FINAL_ANSWER>> | ||
|
||
Additional Instructions: | ||
- You should ONLY interact with the environment provided to you AND NEVER ASK FOR HUMAN HELP. | ||
""" | ||
Additional Instructions: | ||
- Do not try to solve the question in a single step. Break it down into smaller steps. | ||
- You should ONLY interact with the environment provided to you AND NEVER ASK FOR HUMAN HELP. | ||
|
||
# NOTE: You can actually set slightly different instruction for different agents | ||
instruction += AGENT_CLS_TO_INST_SUFFIX[agent.__class__.__name__] | ||
- SUPER IMPORTANT: When you have reported the answer to the user in the requested format, (and only once that is done) in the next turn, please run the following command: <execute_bash> exit </execute_bash>. | ||
- Again you are being told a million times to first report the answer in the requested format (see again below for reference) before exiting. DO NOT EXIT WITHOUT REPORTING THE ANSWER FIRST. | ||
That is, when you have decided on the answer report in the following format: | ||
|
||
<<FINAL_ANSWER|| | ||
<insert correct answer here, must be one of A, B, C, D> (Please dont use any additional characters. Just the letter of the correct answer (A/B/C/D).) | ||
||FINAL_ANSWER>> | ||
<execute_bash> exit </execute_bash> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is duplicated with the codeact user response above, could we deduplicate?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because these were the code we used to run experiments in our paper, maybe we should leave them as is for reproducibility and improve in future iterations?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could, but because the strings are the same, wouldn't deduplicating be functionally equivalent?
Co-authored-by: Graham Neubig <[email protected]>
Co-authored-by: Graham Neubig <[email protected]>
f'[Agent Controller {self.id}] Delegate state: {delegate_state}' | ||
) | ||
if delegate_state == AgentState.ERROR or ( | ||
self.headless_mode and delegate_state == AgentState.PAUSED |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest we change await self.set_agent_state_to(AgentState.PAUSED)
code to await self.set_agent_state_to(AgentState.ERROR)
in headless mode here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea! pushed the changes in the latest commits.
opendevin/core/main.py
Outdated
@@ -40,6 +40,7 @@ async def run_agent_controller( | |||
sandbox: Sandbox | None = None, | |||
runtime_tools_config: dict | None = None, | |||
sid: str | None = None, | |||
headless_mode: bool = False, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be a good idea to default this to True
, so that we only need to change this to False
in the backend for the application without touching all over the places (e.g., tests, evaluations)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes that makes sense. When running OpenDevin via CLI, there's no way to "continue" once the application is paused.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also agree that this is probably OK.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done in the latest commit to reduce the amount of changes in this PR
<<FINAL_ANSWER|| | ||
<insert correct answer here, must be one of A, B, C, D> (Please dont use any additional characters. Just the letter of the correct answer (A/B/C/D).) | ||
||FINAL_ANSWER>> | ||
""".strip() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@neubig is this similar to what you have in mind for deduplication?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AFAIK these parts are also duplicated?
'Again you are being told a million times to first report the answer in the requested format (see again below for reference) before exiting. DO NOT EXIT WITHOUT REPORTING THE ANSWER FIRST.\n'
'That is, when you have decided on the answer report in the following format:\n'
But this is not super-important, please don't let me block you.
What is the problem that this fixes or functionality that this introduces? Does it fix any open issues?
Give a summary of what the PR does, explaining any non-trivial design decisions
Add
thought
argument toAgentFinishAction
for CodeAct.Update GPQA inference code to be more aligned with the actual experiment we run in the paper by applying changes from this reproducible patch https://huggingface.co/spaces/OpenDevin/evaluation/blob/main/outputs/gpqa/reproducibility.patch
Add a
headless_mode
for controller (mainly for evaluation, default to False).headless_mode
is True, we will stop the whole run when the delegated agent (e.g., BrowsingAgent in the CodeAct -> Browsing delegation) hits the max iteration limit. Currently, the behavior is the delegated agent will wait for the user to hit the continue button, which puts an evaluation into an infinite loop since it is running headless.run_infer.py
if people like this idea!Other references