Skip to content

feat: append_file incl. all tests [agentskills] #2346

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 13 commits into from
Jun 10, 2024

Conversation

tobitege
Copy link
Collaborator

@tobitege tobitege commented Jun 9, 2024

Replacement for discarded first draft: #2207

Adds append_file to Agentskills plugin. Includes unit tests and adapted integration tests.

This PR adds a new method append_file(content: str) to the agent skills.

In testing OD with some file operations, it is a noticable issue for repeated errors by the model (used Gemini Pro 1.5) that adding content to a file often causes exceptions due to an invalid "start" line number for the edit_file command.

It seems very hard for the model to identify or keep track the total number of lines in a file or it is just not good enough in counting (ask an LLM to count the words of its answer and you know what I mean).
This issue can cause extra cost when it shouldn't.

Such an error then looks like this:
grafik

An example prompt to demonstrate the use of it could be like:
"Write the numbers 5 to 10 into a new file named test.txt"
A followup prompt could then just say:
"Append the numbers 20 to 25 to the same file."
If it were to use the edit_file command, chances would be high that it would use the wrong line number(s) again.

cc @neubig @li-boxuan

(blocked by #2085 as well)

@tobitege
Copy link
Collaborator Author

tobitege commented Jun 9, 2024

It'd be cool if one reviewer could trigger the workflow to start just to see how the tests go, but no merging yet.

def append_file(content: str) -> None:
"""Append content to the open file.

It appends text `content` to the end of the open file. Remember, the file must be open before editing.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You will create a follow-up PR to add file_name as part of edit_file's argument anyways, right? Shall we include file_name here, rather than rely on the CURRENT_FILE?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct, in a follow-up... just for the fact that integration tests are a PITA to get right, to be blunt. ;)
That both append_/edit_file then get as 1st param a file_name makes total sense.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok sounds fine to me

Copy link
Collaborator

@li-boxuan li-boxuan Jun 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure about @neubig 's benchmarking plan. If we decide to do benchmarking before this PR gets merged, then we should probably include the fix to edit_file in this PR too. Otherwise, we could merge this first, and then edit_file, and maybe a few other PRs, then run evaluation at once.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I decided to work on bringing in the new file_name param for both append/edit_file methods.
Core changes and unit tests done. Integration tests next...

if not CURRENT_FILE or not os.path.isfile(CURRENT_FILE):
raise FileNotFoundError('No file open. Use the open_file function first.')

# Use a temporary file to write changes
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not simply append?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My thought was that it would be a change in behavior, as the other methods error if a file wasn't explicitly opened and/or created first.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. This behavior will be changed soon anyways, right? I mean, you will revert this part in a follow-up PR?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's the same as in edit_file. I didn't know this was about to be changed.

Copy link
Collaborator Author

@tobitege tobitege Jun 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @li-boxuan @xingyaoww
Till now the behavior is that only existing files could be edited and the LLM was supposed to use create_file and/or open_file for that.
I'd just like to confirm before I'd remove that exception for non-existing files, because that would require more changes again to prompts etc.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I was already thinking of refactoring the append/edit methods so that both just call one internal method with an extra bool for append, but wanted to do this in followup PR, no problem.

The append is really just to make it easier for the LLM since there were often cases where it tried to start at "end of file + 1" line, got an exception, and had to do it again, thus waisting time+money.

Copy link
Collaborator

@xingyaoww xingyaoww Jun 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sense! My inclination is that ideally, we can do the refractor, if possible, merge it - and when I finish the swe-bench eval, I'll re-run the eval from the start to make sure we are not degrading perf.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd just prefer to get this merged before I do a refactor and in the meantime another patch comes along with changed unit/integration tests. That takes a lot of time to redo.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approve the PR to unblock you :)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need a 2nd reviewer 🤣

@li-boxuan
Copy link
Collaborator

li-boxuan commented Jun 9, 2024

Btw why is this blocked by #2085? I think we could finish this PR, merge it, note it down, and do evaluation afterwards.

Doing evaluation for every change to prompt might be too costly.

@tobitege
Copy link
Collaborator Author

tobitege commented Jun 9, 2024

Btw why is this blocked by #2085? I think we could finish this PR, merge it, note it down, and do evaluation afterwards.

Doing evaluation for every change to prompt might be too costly.

I copied that note about #2085 over, just in case.

@tobitege
Copy link
Collaborator Author

tobitege commented Jun 9, 2024

Ok, great to see that besides Mac, all the unit and integration tests worked. 🎸

@neubig please let me know if this stays blocked for 2085

@tobitege
Copy link
Collaborator Author

Commit incoming shortly with file_name added as parameter for append-/edit_file methods incl. updated tests...

@tobitege tobitege requested review from enyst and li-boxuan June 10, 2024 06:58
):
print(f'Setting workspace_base to {config.workspace_base}')
workspace_base = config.workspace_base

Copy link
Collaborator Author

@tobitege tobitege Jun 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@li-boxuan this is one thing that took me hours to find and resolve:
The runtime loads values from the config.toml file, which may have a different workspace_base than what regenerate sets. This resulted in files being read and written in 2 different folders, thus multiple tests failing, like the "bad.txt" one.
Running in WSL here.

Edit: sorry, it's not showing all the lines above, should be 19-28, for context.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's very weird, because the script sets environment variable WORKSPACE_BASE, which would override the value in config.toml.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just tried and it worked as expected. I set workspace_base="./workspace_base_in_toml" in my config.toml and ran the script. I printed out workspace_base after L17 and the value indeed was _test_workspace (defined in the script).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am using Mac btw. Not sure if we have some os-dependent behavior somewhere

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am using Mac btw. Not sure if we have some os-dependent behavior somewhere

That is what I fear with WSL right now.
Could you try this on your end if the file-related tests work with the config.toml file pointing to some different folder, please?
TEST_ONLY=true ONLY_TEST_AGENT="CodeActAgent" ./tests/integration/regenerate.sh
In my case, I have in config.toml:
workspace_base="/mnt/d/github/workspace"
which is one folder "below" my OpenDevin repo.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that you mention it, I do have that set to true in my config.toml.
I guess I should try with false then? 😁

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it's tested with integration tests? I would set it to false, and clean up both containers and (opendevin) images from docker to be sure.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should start a separate issue about this as this gets a bit long within this PR, what do you think?

Copy link
Collaborator

@li-boxuan li-boxuan Jun 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

persist_sandbox is not tested because it doesn't work (fully) yet. E.g. #2176 I don't think it would be even able to pass all tests at the moment.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So yeah that might very likely be the curse.

@tobitege
Copy link
Collaborator Author

Also, can someone confirm that this should work without a config.toml present?
TEST_ONLY=true ONLY_TEST_AGENT="CodeActAgent" ./tests/integration/regenerate.sh
For me this errors if no config.toml exists.

@li-boxuan
Copy link
Collaborator

Also, can someone confirm that this should work without a config.toml present?

Yes. That's exactly how CI is set up.

@enyst
Copy link
Collaborator

enyst commented Jun 10, 2024

Also, can someone confirm that this should work without a config.toml present? TEST_ONLY=true ONLY_TEST_AGENT="CodeActAgent" ./tests/integration/regenerate.sh For me this errors if no config.toml exists.

It works with no toml for me too. Do you have env vars set, like any of the workspace vars? Any chance you see what error it gives you with no toml?

@tobitege
Copy link
Collaborator Author

Any chance you see what error it gives you with no toml?

My WSL env doesn't have any OpenDevin vars.
I'll give it a try again without toml later today and try to find out more details.

@tobitege
Copy link
Collaborator Author

If you guys are ok with the extra check staying - and avoiding potential WSL issues - then I think this PR is good to go. 🚀

Copy link
Collaborator

@xingyaoww xingyaoww left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM overall! Approve to unblock the PR so @tobitege can work on a refractored version of the edit_file - or feel free to do it in this PR as well.

I can run a more comprehensive eval once we fix all the kinks in swe-bench eval and revert when necessary.

@tobitege
Copy link
Collaborator Author

I can run a more comprehensive eval once we fix all the kinks in swe-bench eval and revert when necessary.

Ah, excellent, just overlapped with my reply above. Cheers!

@tobitege
Copy link
Collaborator Author

cc @neubig @li-boxuan ✉️ kind invitation for review ✏️ 👨 😃

@tobitege tobitege changed the title feat: append_file incl. all tests [agentskills] (wait for 2085!) feat: append_file incl. all tests [agentskills] Jun 10, 2024
@tobitege
Copy link
Collaborator Author

@yufansong please click 🕹️ 😂

@neubig neubig enabled auto-merge (squash) June 10, 2024 17:05
@neubig neubig merged commit 9605106 into All-Hands-AI:main Jun 10, 2024
2 checks passed
first_error_line = None
for line in error_message.split('\n'):
if line.strip():
# The format of the error message is: <filename>:<line>:<column>: <error code> <error message>
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

love this!

@tobitege tobitege mentioned this pull request Jun 11, 2024
6 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants