OpenDevin checkout to reproduce CodeActAgent 1.3 (25% accuracy on SWE-Bench Lite) #2319

mihaela-bornea · 2024-06-07T17:59:01Z

Describe your question

Hello. Could you please clarify which version of the code I need to check out to run CodeAct 1.3?

I would like to reproduce the results reported here.

Thanks

mamoodi · 2024-06-08T19:09:35Z

@mihaela-bornea are you looking for the exact OpenDevin version running that CodeAct version? Or would just running the latest 0.6 work for you?
The SWE-bench is a little involved to run.

tobitege · 2024-06-08T19:13:15Z

> @mihaela-bornea are you looking for the exact OpenDevin version running that CodeAct version? Or would just running the latest 0.6 work for you? The SWE-bench is a little involved to run.

No, this issue can be closed. The question was actually about running CodeAct as a standalone feature, without OpenDevin. It was answered in Discord: that is not supported/possible.

mihaela-bornea · 2024-06-10T12:03:17Z

@mamoodi That is correct. I was looking for the OpenDevin version running that CodeAct version. More precisely the OpenDevin code for CodeAct 1.3, the one you used to obtain the 25%. Right now, if I clone the repo I will run CodeAct v1.5. What past version do I need to check out for CodeAct 1.3?

tobitege · 2024-06-10T12:35:04Z

Version 1.3 came out with this commit:

a84d19f

If you click through the files you're interested in, you can, for example, use the "Blame" feature on GitHub to find when a line, like the one with the VERSION in it, last changed. From there you can step back through earlier commits. This is just one way to find it.
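A rough command-line counterpart to the "Blame" approach is git's pickaxe search (`git log -S<string>`), which lists commits that changed the number of occurrences of a string. The sketch below only builds that command; the search string and the file path are hypothetical placeholders, since the thread does not say which file holds the VERSION line.

```python
import subprocess


def pickaxe_command(needle: str, path: str) -> list[str]:
    """Build a `git log -S<needle> --oneline -- <path>` argument list.

    `needle` and `path` are supplied by the caller; nothing here is
    specific to the OpenDevin repository layout.
    """
    return ["git", "log", f"-S{needle}", "--oneline", "--", path]


def commits_touching(needle: str, path: str, repo_dir: str = ".") -> list[str]:
    """Run the pickaxe search inside repo_dir and return one line per commit."""
    result = subprocess.run(
        pickaxe_command(needle, path),
        cwd=repo_dir, check=True, capture_output=True, text=True,
    )
    return result.stdout.splitlines()
```

For example, `commits_touching("1.3", "path/to/agent_file.py")` would list the commits where the string "1.3" was added to or removed from that (hypothetical) file.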

tobitege · 2024-06-10T12:41:45Z

Example (the red-framed button takes you one commit back):

mihaela-bornea · 2024-06-10T12:49:29Z

OK, thanks. I was not sure if I should use the first or the last commit with CodeAct 1.3. Just to confirm, the 25% result was with the commit you posted above (a84d19f)?

tobitege · 2024-06-10T12:57:52Z

> OK, thanks. I was not sure if I should use the first or the last commit with CodeAct 1.3. Just to confirm, the 25% result was with the commit you posted above (a84d19f)?

Hmm... actually, I think I had a misunderstanding:

- The CodeActAgent does exist as v1.3 in the codebase.
- The CodeActSWEAgent, which is/was used in evaluations, only ever existed as v1.5 in the codebase.

So it's not clear to me that you can "go back" with it.

mihaela-bornea · 2024-06-10T13:02:51Z

OK, so what commit do you recommend I use to get as close as possible to your 25% result?

tobitege · 2024-06-10T13:16:23Z

If you intend to run evaluations, then the CodeActSWEAgent commits are the ones to use.

mihaela-bornea · 2024-06-10T13:21:47Z

I actually intend to run inference with OpenDevin CodeAct and obtain the same patches as the ones in your 25% experiment.

When I evaluate these patches, I expect to obtain 25%. I am not concerned about the evaluation script as I understand how to run evaluation.

avisil · 2024-06-10T14:22:07Z

Hi @tobitege (cc @neubig), what do we need to do to reproduce the 25% on SWE-Bench Lite?

We want to do the following:

1. Run inference on all 300 examples with GPT-4o and CodeAct 1.3 (not 1.5, since per the HF eval here 1.5 yields ~22%), unless you say that the current main with GPT-4o can also reach 25% on SWE-Bench Lite.
2. Run the eval script and obtain the 25% mentioned in this link.

Thanks!
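To make the target concrete, here is a small sketch of the arithmetic behind the percentages, assuming the score is simply resolved instances divided by the 300 SWE-Bench Lite instances. The function names are illustrative, and the resolved counts are derived from the percentages in this thread, not reported in it.

```python
import math


def resolve_rate(resolved: int, total: int = 300) -> float:
    """Percentage of instances resolved, rounded to one decimal place."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100.0 * resolved / total, 1)


def resolved_needed(target_percent: float, total: int = 300) -> int:
    """Smallest resolved count whose rate reaches target_percent."""
    return math.ceil(target_percent * total / 100.0)
```

Under that assumption, `resolved_needed(25)` gives 75, so reproducing the 25% figure would mean about 75 of the 300 generated patches passing evaluation.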

xingyaoww · 2024-06-10T17:57:58Z

Per the Hugging Face eval benchmark, the commit ID associated with "CodeAct v1.3" is cd18ab215f65d22eafab18ca410c993f1dff8469. So to reproduce, you need to run either git checkout cd18ab215f65d22eafab18ca410c993f1dff8469 or git checkout codeact_1.3_swebench.
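After checking out, it can be worth confirming that the working tree really sits on that commit before starting a long run. The following is a minimal sketch using `git rev-parse HEAD`; the helper names are illustrative and not part of OpenDevin.

```python
import subprocess

# Commit ID quoted in this comment for "CodeAct v1.3".
EXPECTED_COMMIT = "cd18ab215f65d22eafab18ca410c993f1dff8469"


def is_expected_commit(head_sha: str, expected: str = EXPECTED_COMMIT) -> bool:
    """True if head_sha is the expected commit or a prefix of it (>= 7 chars)."""
    head = head_sha.strip().lower()
    return len(head) >= 7 and expected.startswith(head)


def current_head(repo_dir: str = ".") -> str:
    """Return the full SHA of HEAD in repo_dir via `git rev-parse HEAD`."""
    result = subprocess.run(
        ["git", "rev-parse", "HEAD"],
        cwd=repo_dir, check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()
```

Running `is_expected_commit(current_head())` from inside the clone should return True once the checkout has succeeded.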
We haven't stabilized the eval pipeline yet. Checking out that commit will mostly work (you should follow the README at that commit!), but you might need to restart the run_infer.sh script multiple times to get the full results (there are some buggy Docker leaks here and there, but it will finish). Just a few hours ago, we improved the evaluation pipeline based on SWE-bench-docker, and re-running the evaluation on the same set of patches actually gives us 26.3%: Add SWEBench-docker eval #2085
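Since the script may need several restarts, one option is to wrap it in a small retry loop. This is only a sketch: it assumes the script exits with a non-zero status when a run dies and that re-running it continues toward the full results, neither of which is confirmed here. The function name and parameters are illustrative.

```python
import subprocess
import sys
import time


def run_until_success(cmd, max_attempts=5, delay_seconds=0.0):
    """Re-run cmd until it exits with status 0; return the attempts used.

    Raises RuntimeError if every attempt fails.
    """
    for attempt in range(1, max_attempts + 1):
        completed = subprocess.run(cmd)
        if completed.returncode == 0:
            return attempt
        print(f"attempt {attempt} exited with {completed.returncode}",
              file=sys.stderr)
        if attempt < max_attempts:
            time.sleep(delay_seconds)  # brief pause before restarting
    raise RuntimeError(f"command failed after {max_attempts} attempts")
```

A call would look like `run_until_success(["bash", "path/to/run_infer.sh"], max_attempts=10)`, where the path and any arguments are placeholders; the README at that commit is the authority on how to invoke the script.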

avisil · 2024-06-10T18:09:11Z

I see. And the 26.3% is with hints?

xingyaoww · 2024-06-10T18:14:54Z

@avisil Yes - that's correct!

I'm going to close this for now. Feel free to re-open if there are any further questions!

avisil · 2024-06-10T18:15:46Z

Thanks so much for your responses!

xingyaoww · 2024-06-10T18:20:21Z

BTW, feel free to join our Slack (https://bit.ly/OpenDevin-Slack) if you have any other questions! We have a #swe-bench-eval channel for quick discussion there :)

mihaela-bornea added the question label Jun 7, 2024

mamoodi closed this as completed Jun 8, 2024

enyst reopened this Jun 10, 2024

xingyaoww closed this as completed Jun 10, 2024
