Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Eval,Arch] Update GPTQ eval and add headless_mode for Controller #2994

Merged
merged 11 commits into from
Jul 20, 2024

Conversation

xingyaoww
Copy link
Collaborator

What is the problem that this fixes or functionality that this introduces? Does it fix any open issues?


Give a summary of what the PR does, explaining any non-trivial design decisions

  • Add thought argument to AgentFinishAction for CodeAct.

  • Update GPQA inference code to be more aligned with the actual experiment we run in the paper by applying changes from this reproducible patch https://huggingface.co/spaces/OpenDevin/evaluation/blob/main/outputs/gpqa/reproducibility.patch

  • Add a headless_mode for controller (mainly for evaluation, default to False).

    • When headless_mode is True, we will stop the whole run when the delegated agent (e.g., BrowsingAgent in the CodeAct -> Browsing delegation) hits the max iteration limit. Currently, the behavior is the delegated agent will wait for the user to hit the continue button, which puts an evaluation into an infinite loop since it is running headless.
    • I can apply changes to all evaluation's run_infer.py if people like this idea!

Other references

@xingyaoww xingyaoww requested a review from li-boxuan July 18, 2024 13:33
@tobitege
Copy link
Collaborator

tobitege commented Jul 18, 2024

@xingyaoww this made my day: 🤣
"Again you are being told a million times to first report the answer in the requested format"

@xingyaoww
Copy link
Collaborator Author

@tobitege This was original written by @1jsingh in that reproducibility patch 😆

@xingyaoww xingyaoww requested a review from neubig July 18, 2024 15:00
Copy link
Contributor

@neubig neubig left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, looks good to me and I agree that headless mode is a good idea. I had a few comments though.

Comment on lines 204 to 220
MOST IMPORTANT: Format your response as follows:
<<FINAL_ANSWER||
<insert correct answer here, must be one of A, B, C, D> (Please dont use any additional characters. Just the letter of the correct answer (A/B/C/D).)
||FINAL_ANSWER>>

Additional Instructions:
- You should ONLY interact with the environment provided to you AND NEVER ASK FOR HUMAN HELP.
"""
Additional Instructions:
- Do not try to solve the question in a single step. Break it down into smaller steps.
- You should ONLY interact with the environment provided to you AND NEVER ASK FOR HUMAN HELP.

# NOTE: You can actually set slightly different instruction for different agents
instruction += AGENT_CLS_TO_INST_SUFFIX[agent.__class__.__name__]
- SUPER IMPORTANT: When you have reported the answer to the user in the requested format, (and only once that is done) in the next turn, please run the following command: <execute_bash> exit </execute_bash>.
- Again you are being told a million times to first report the answer in the requested format (see again below for reference) before exiting. DO NOT EXIT WITHOUT REPORTING THE ANSWER FIRST.
That is, when you have decided on the answer report in the following format:

<<FINAL_ANSWER||
<insert correct answer here, must be one of A, B, C, D> (Please dont use any additional characters. Just the letter of the correct answer (A/B/C/D).)
||FINAL_ANSWER>>
<execute_bash> exit </execute_bash>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is duplicated with the codeact user response above, could we deduplicate?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because these were the code we used to run experiments in our paper, maybe we should leave them as is for reproducibility and improve in future iterations?

Copy link
Contributor

@neubig neubig Jul 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could, but because the strings are the same, wouldn't deduplicating be functionally equivalent?

xingyaoww and others added 2 commits July 19, 2024 10:53
f'[Agent Controller {self.id}] Delegate state: {delegate_state}'
)
if delegate_state == AgentState.ERROR or (
self.headless_mode and delegate_state == AgentState.PAUSED
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest we change await self.set_agent_state_to(AgentState.PAUSED) code to await self.set_agent_state_to(AgentState.ERROR) in headless mode here

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea! pushed the changes in the latest commits.

@@ -40,6 +40,7 @@ async def run_agent_controller(
sandbox: Sandbox | None = None,
runtime_tools_config: dict | None = None,
sid: str | None = None,
headless_mode: bool = False,
Copy link
Collaborator Author

@xingyaoww xingyaoww Jul 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be a good idea to default this to True, so that we only need to change this to False in the backend for the application without touching all over the places (e.g., tests, evaluations)?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes that makes sense. When running OpenDevin via CLI, there's no way to "continue" once the application is paused.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also agree that this is probably OK.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done in the latest commit to reduce the amount of changes in this PR

<<FINAL_ANSWER||
<insert correct answer here, must be one of A, B, C, D> (Please dont use any additional characters. Just the letter of the correct answer (A/B/C/D).)
||FINAL_ANSWER>>
""".strip()
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@neubig is this similar to what you have in mind for deduplication?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIK these parts are also duplicated?

        'Again you are being told a million times to first report the answer in the requested format (see again below for reference) before exiting. DO NOT EXIT WITHOUT REPORTING THE ANSWER FIRST.\n'
        'That is, when you have decided on the answer report in the following format:\n'

But this is not super-important, please don't let me block you.

@xingyaoww xingyaoww enabled auto-merge (squash) July 19, 2024 21:53
@xingyaoww xingyaoww merged commit 6b16a5d into main Jul 20, 2024
@xingyaoww xingyaoww deleted the xw/gpqa branch July 20, 2024 03:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants