OpenDevin checkout to reproduce CodeActAgent 1.3 (25% accuracy on SWE-Bench Lite) #2319

Closed
mihaela-bornea opened this issue Jun 7, 2024 · 16 comments


@mihaela-bornea

Describe your question

Hello. Can you please clarify which version of the code I need to check out to run CodeAct 1.3?

I would like to reproduce the results reported here.

Thanks

@mamoodi
Collaborator

mamoodi commented Jun 8, 2024

@mihaela-bornea are you looking for the exact OpenDevin version running that CodeAct version? Or would just running the latest 0.6 work for you?
The SWE-bench is a little involved to run.

@tobitege
Collaborator

tobitege commented Jun 8, 2024

> @mihaela-bornea are you looking for the exact OpenDevin version running that CodeAct version? Or would just running the latest 0.6 work for you? The SWE-bench is a little involved to run.

No, this issue can be closed. The question was actually about running CodeAct as a standalone feature, without OpenDevin. It was answered in Discord: that is not supported or possible.

@mamoodi closed this as completed Jun 8, 2024

@mihaela-bornea
Author

@mamoodi That is correct. I was looking for the OpenDevin version running that CodeAct version. More precisely the OpenDevin code for CodeAct 1.3, the one you used to obtain the 25%. Right now, if I clone the repo I will run CodeAct v1.5. What past version do I need to check out for CodeAct 1.3?

@tobitege
Collaborator

tobitege commented Jun 10, 2024

Version 1.3 came out with this commit:

a84d19f

If you click through to the files you're interested in, you can, for example, use GitHub's "Blame" feature to find when a line, such as the one containing VERSION, last changed. From there you can step back through earlier commits. That is just one way to find it.

@tobitege
Collaborator

Example (the red-framed button takes you one commit back):
[Screenshot of the GitHub Blame view with the button highlighted]
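
For anyone who prefers the command line to the Blame view, the same search can be sketched with git's pickaxe option (`git log -S`), which lists commits that change how many times a given string occurs in a file. This is only a sketch: the search string and file path are placeholders you supply, not values confirmed in this thread, and the helper names are hypothetical.

```python
import subprocess
from typing import List


def pickaxe_command(needle: str, path: str) -> List[str]:
    """Build a `git log -S` command for commits that change the number of
    occurrences of `needle` in `path`."""
    return ["git", "log", "--oneline", f"-S{needle}", "--", path]


def find_commits(repo_dir: str, needle: str, path: str) -> List[str]:
    """Run the pickaxe search inside an existing clone and return one line
    per matching commit (abbreviated hash followed by the subject)."""
    result = subprocess.run(
        pickaxe_command(needle, path),
        cwd=repo_dir,
        check=True,
        capture_output=True,
        text=True,
    )
    return [line for line in result.stdout.splitlines() if line]
```

A call would look like `find_commits(".", "<version string>", "<path to the agent file>")`, with both placeholders filled in from the checkout you are inspecting.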

@mihaela-bornea
Author

OK, thanks. I was not sure whether I should use the first or the last commit with CodeAct 1.3. Just to confirm: was the 25% result obtained with the commit you posted above (a84d19f)?

@tobitege
Collaborator

> OK, thanks. I was not sure whether I should use the first or the last commit with CodeAct 1.3. Just to confirm: was the 25% result obtained with the commit you posted above (a84d19f)?

Hmm... actually, I think I had a misunderstanding:

  • the CodeActAgent does exist as v1.3 in the codebase
  • the CodeActSWEAgent, which is/was used in evaluations, only ever existed as v1.5 in the codebase.
    So it's not clear to me that you can "go back" with it.

@mihaela-bornea
Author

OK, so what commit do you recommend I use to get as close as possible to your 25% result?

@tobitege
Collaborator

If you intend to run evaluations, then the CodeActSWEAgent commits are the ones to use.

@mihaela-bornea
Author

I actually intend to run inference with OpenDevin CodeAct and obtain the same patches as the ones in your 25% experiment.

When I evaluate these patches, I expect to obtain 25%. I am not concerned about the evaluation script as I understand how to run evaluation.

@enyst reopened this Jun 10, 2024

@avisil

avisil commented Jun 10, 2024

Hi @tobitege (cc @neubig), what do we need to do to reproduce the 25% on SWE-Bench Lite?

We want to do the following:

  1. Run inference on all 300 examples with GPT-4o and CodeAct 1.3 (not 1.5, which the thread points to, since 1.5 yields ~22 per the HF eval here), unless you tell us that current main with GPT-4o can also reach 25% on SWE-Bench Lite.
  2. Run the eval script and then obtain 25% as mentioned in this link.

Thanks!

@xingyaoww
Collaborator

xingyaoww commented Jun 10, 2024

  1. Per the Hugging Face eval benchmark, the commit ID associated with "CodeAct v1.3" is cd18ab215f65d22eafab18ca410c993f1dff8469. So to reproduce, you need to run git checkout cd18ab215f65d22eafab18ca410c993f1dff8469 OR git checkout codeact_1.3_swebench.

  2. We haven't stabilized the eval pipeline yet. Checking out that commit will mostly work (you should follow the README at that commit!), but you might need to restart the run_infer.sh script multiple times to get the full results (there are some buggy Docker leaks here and there, but it will finish). Just a few hours ago, we improved the evaluation pipeline based on SWE-bench-docker, and running it on the same set of patches actually gives us 26.3%: Add SWEBench-docker eval #2085
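
The two steps above could be scripted roughly as follows. Treat this as a sketch under assumptions rather than the project's own tooling: the helper names are made up for illustration, the command passed to the retry loop (for example the run_infer.sh invocation described in the README at that commit) is something you supply yourself, and the loop assumes the script exits non-zero when a run is cut short, which this thread does not confirm.

```python
import subprocess
from typing import Callable, Sequence

# Commit ID quoted above for "CodeAct v1.3".
CODEACT_1_3_COMMIT = "cd18ab215f65d22eafab18ca410c993f1dff8469"


def same_commit(a: str, b: str) -> bool:
    """True when one hash is a prefix of the other (short vs. full SHA)."""
    a, b = a.strip().lower(), b.strip().lower()
    return bool(a) and bool(b) and (a.startswith(b) or b.startswith(a))


def checkout_and_verify(repo_dir: str, ref: str = CODEACT_1_3_COMMIT,
                        expected_sha: str = CODEACT_1_3_COMMIT) -> str:
    """Check out `ref` (hash, tag, or branch) in an existing clone and
    confirm that HEAD now points at `expected_sha`."""
    subprocess.run(["git", "checkout", ref], cwd=repo_dir, check=True)
    head = subprocess.run(
        ["git", "rev-parse", "HEAD"],
        cwd=repo_dir, check=True, capture_output=True, text=True,
    ).stdout.strip()
    if not same_commit(head, expected_sha):
        raise RuntimeError(f"HEAD is {head}, expected {expected_sha}")
    return head


def _run(cmd: Sequence[str]) -> int:
    """Default runner: execute the command and return its exit code."""
    return subprocess.run(list(cmd)).returncode


def run_until_success(cmd: Sequence[str], max_attempts: int = 5,
                      run: Callable[[Sequence[str]], int] = _run) -> int:
    """Re-run `cmd` until it exits with code 0; return the attempts used."""
    for attempt in range(1, max_attempts + 1):
        if run(cmd) == 0:
            return attempt
    raise RuntimeError(f"{list(cmd)} still failing after {max_attempts} attempts")
```

Passing `ref="codeact_1.3_swebench"` with the default `expected_sha` also checks that the named ref resolves to the quoted commit.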

@avisil

avisil commented Jun 10, 2024

I see, and the 26.3 is with hints?

@xingyaoww
Collaborator

@avisil Yes - that's correct!

I'm going to close this for now -- feel free to re-open if there are any further questions!

@avisil

avisil commented Jun 10, 2024

Thanks so much for your responses!

@xingyaoww
Collaborator

BTW, feel free to join our Slack (https://bit.ly/OpenDevin-Slack) if you have any other questions! We have a #swe-bench-eval channel for quick discussion there :)
