
Add prompt caching (Sonnet, Haiku only) #3411

Merged: 39 commits merged into All-Hands-AI:main from add-prompt-caching on Aug 27, 2024

Conversation

@Kaushikdkrikhanu (Contributor) commented Aug 16, 2024

What is the problem that this fixes or functionality that this introduces? Does it fix any open issues?
Adds prompt caching to the CodeAct agent and logs cached tokens.


Give a summary of what the PR does, explaining any non-trivial design decisions
#3397

  1. Adds a cache_prompt: bool parameter to the Message class.
  2. The CodeAct agent checks whether the model supports caching via the supports_caching_prompt attribute added to the LLM class.
  3. Adds cache breakpoints to the system prompt, the in-context example, and the last two user messages (a minimal sketch follows at the end of this description).
  4. Moves the reminder into a new message appended at the end, rather than appending it to the last message (which was preventing the last message from being cached).
  5. Adds logs for cached tokens at the end of the response log file.
(Screen recording: 2024-08-18.23-39-39.mp4)

It doesn't seem like Gemini caching support has been added to litellm yet, so this only adds support for Claude for now (BerriAI/litellm#4284).
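For illustration, the breakpoints correspond to Anthropic-style cache_control markers on message content blocks, passed through litellm. A minimal sketch of the idea, assuming Anthropic-compatible prompt caching in litellm; the prompts, history, and model name below are placeholders, not the actual agent code:

```python
# Sketch only: shows where the cache breakpoints land, not the CodeAct agent code.
import litellm

def cached_block(text: str) -> dict:
    # A text block that ends a cacheable prefix (Anthropic "ephemeral" cache).
    return {"type": "text", "text": text, "cache_control": {"type": "ephemeral"}}

messages = [
    # Breakpoint 1: the (large) system prompt.
    {"role": "system", "content": [cached_block("You are the CodeAct agent ...")]},
    # Breakpoint 2: the in-context example.
    {"role": "user", "content": [cached_block("Here is an example session ...")]},
    {"role": "assistant", "content": [{"type": "text", "text": "OK, understood."}]},
    # Breakpoints 3 and 4: the last two user messages, so the cached prefix
    # rolls forward with the conversation.
    {"role": "user", "content": [cached_block("Previous user turn / observation ...")]},
    {"role": "assistant", "content": [{"type": "text", "text": "Agent action ..."}]},
    {"role": "user", "content": [cached_block("Latest user turn / observation ...")]},
]

response = litellm.completion(
    model="claude-3-5-sonnet-20240620",
    messages=messages,
    # Beta header Anthropic required for prompt caching at the time;
    # newer litellm versions may handle this automatically.
    extra_headers={"anthropic-beta": "prompt-caching-2024-07-31"},
)
print(response.usage)
```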


Other references

@Kaushikdkrikhanu Kaushikdkrikhanu marked this pull request as draft August 16, 2024 01:11
@Kaushikdkrikhanu (Contributor, Author) commented Aug 16, 2024

(screenshot)
So basically, we need to be smart about what we want to cache (link). What would be a good way?

@enyst (Collaborator) commented Aug 16, 2024

Those first two messages are a great start! The initial prompt is quite large. Thank you for doing this ❤️

Just to note, from the Anthropic docs, there is also the pricing: cache writes cost 25% more than regular input tokens, while cache hits cost only 10% of the regular input price. I'm thinking that we may want to have this feature enabled by default, but it should probably be possible to disable easily enough, for various scenarios.
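As a quick sanity check on those multipliers (assuming 1.25x for a cache write and 0.10x for a cache read, as described above), caching a large prefix already pays for itself the first time it is reused within the cache lifetime:

```python
# Back-of-the-envelope cost of resending the same prefix over several turns,
# in "input-token price units" (multipliers as quoted above: 1.25x write, 0.10x read).
def relative_prefix_cost(prefix_tokens: int, turns: int, cached: bool) -> float:
    if not cached:
        return prefix_tokens * 1.0 * turns                 # full price every turn
    return prefix_tokens * (1.25 + 0.10 * (turns - 1))     # write once, then cache hits

print(relative_prefix_cost(10_000, turns=2, cached=False))  # 20000
print(relative_prefix_cost(10_000, turns=2, cached=True))   # ~13500, cheaper after one reuse
```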

@Kaushikdkrikhanu (Contributor, Author)
> Those first two messages are a great start! The initial prompt is quite large. Thank you for doing this ❤️
>
> Just to note, from the Anthropic docs, there is also the pricing: cache writes cost 25% more than regular input tokens, while cache hits cost only 10% of the regular input price. I'm thinking that we may want to have this feature enabled by default, but it should probably be possible to disable easily enough, for various scenarios.

Just realized one thing: we can just put all the previous prompts (after the first two) inside one message and we are done, for example:

Hey agent, this has happened so far: {action -> ..., observation -> ..., action -> ..., observation -> ...}

Every 10 messages, we update this block.

@neubig neubig added this to the 2024-08 milestone Aug 16, 2024
@enyst (Collaborator) commented Aug 17, 2024

A few more from Anthropic docs:

> The cache has a 5-minute lifetime, refreshed each time the cached content is used.

> Prompt Caching caches the full prefix
> Prompt Caching references the entire prompt - tools, system, and messages up to and including the block designated with cache_control.

So we don't need to merge messages ourselves; instead, we set a checkpoint on the end message. Which message is the "end" message is less clear to me. 😅 I have to test moving the message checkpoint some more. If I see this right, though, we also don't need to define separate checkpoints for the two initial messages: just one on the in-context example will include everything before it, and it's always there. What do you think?

The response reports the number of tokens cached and the cache hits; IMHO it would be useful to add that to the logs, just like we do for cost.

They also say,

> The minimum cacheable prompt length is:
> 1024 tokens for Claude 3.5 Sonnet and Claude 3 Opus
> 2048 tokens for Claude 3.0 Haiku

Well, the CodeAct prompt is large. 🤣 So that's likely not an issue here.
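On the logging part: Anthropic's prompt-caching responses report cache activity in the usage block (cache_creation_input_tokens for tokens written, cache_read_input_tokens for tokens served from cache). A minimal sketch of surfacing that from a litellm response, reading the fields defensively since exact attribute names can vary by litellm version:

```python
import logging

logger = logging.getLogger(__name__)

def log_cache_usage(response) -> None:
    # Pull the raw Anthropic cache counters off the usage block if present.
    usage = response.usage
    cache_writes = getattr(usage, "cache_creation_input_tokens", 0) or 0
    cache_reads = getattr(usage, "cache_read_input_tokens", 0) or 0
    logger.info(
        "prompt tokens: %s | cache writes: %s | cache hits (tokens read): %s",
        usage.prompt_tokens, cache_writes, cache_reads,
    )
```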

@enyst enyst mentioned this pull request Aug 17, 2024
@Kaushikdkrikhanu (Contributor, Author)
> So we don't need to merge messages ourselves; instead, we set a checkpoint on the end message.

Then, wouldn't it be a cache miss if new content is added? I will check this out. Thanks for pointing it out; I missed it. Will also add it to the logs.

@Kaushikdkrikhanu (Contributor, Author) commented Aug 18, 2024

> So we don't need to merge messages ourselves; instead, we set a checkpoint on the end message.
>
> Then, wouldn't it be a cache miss if new content is added? I will check this out. Thanks for pointing it out; I missed it. Will also add it to the logs.

https://github.com/anthropics/anthropic-cookbook/blob/main/misc/prompt_caching.ipynb. The idea is to add cache breakpoints on the two most recent messages. This makes the caching incremental: whatever was missed gets added, and the earlier part is reused. Also, it's a good idea to cache the system prompt, since the user might reset and start a new conversation.
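To make that incremental placement concrete, here is a rough sketch of the rolling breakpoints (an illustrative helper, not the PR's code; it assumes each message's content is a list of Anthropic-style text blocks):

```python
def set_cache_breakpoints(messages: list[dict]) -> list[dict]:
    # Rolling strategy from the cookbook: mark the two most recent user
    # messages, so each request reuses the prefix cached by the previous one
    # and only the new tail gets written to the cache.
    user_indices = [i for i, m in enumerate(messages) if m["role"] == "user"]
    for i in user_indices:
        for block in messages[i]["content"]:
            block.pop("cache_control", None)              # drop stale markers
    for i in user_indices[-2:]:
        messages[i]["content"][-1]["cache_control"] = {"type": "ephemeral"}
    return messages
```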

@Kaushikdkrikhanu (Contributor, Author) commented Aug 19, 2024

I feel caching works now.

(Screen recording: 2024-08-18.23-39-39.mp4)

@Kaushikdkrikhanu Kaushikdkrikhanu marked this pull request as ready for review August 19, 2024 06:58
@Kaushikdkrikhanu Kaushikdkrikhanu marked this pull request as draft August 19, 2024 07:04
@Kaushikdkrikhanu Kaushikdkrikhanu marked this pull request as ready for review August 19, 2024 07:15
@Kaushikdkrikhanu (Contributor, Author)
Let's see if the integration tests are happier now; they were missing the reminder in the prompt.

I am unable to run them.

@enyst enyst force-pushed the add-prompt-caching branch from e6b5b0c to fa516f7 on August 24, 2024 01:35
@enyst (Collaborator) commented Aug 24, 2024

@Kaushikdkrikhanu Sorry about this. We've been making changes to the workflows these days. We don't need to regenerate; I think the problem now is that a workflow is required but has permission errors on forks. That's nothing we can fix in this PR; we need to fix it on the repo.

On the PR, I added some tests, as @anthony-chaudhary suggested, and ran some experiments. I hope you don't mind that I also committed and tested on the branch. After trying some other approaches, I eventually came around to your solution (breakpoints on the last messages, reminder in a new TextContent), with only a small tweak. Please do tell if you think it's not okay as it is now.

@anthony-chaudhary You make a good point that this could be treated as more of an LLM thing (it is an LLM option, after all!), but I'm sure, as @Kaushikdkrikhanu mentioned, that litellm will make it easier to use. Moreover, apart from @Kaushikdkrikhanu's argument about changing the Message after it is serialized, there is another consideration: caching covers everything from the beginning of the system prompt up to the marked message. Agents currently have an almost 100% static history of messages, which lets us insert the caching markers fairly easily at every step, as we do here, but that might change. I'm not sure yet what the prompts will look like, but in any case I think the agent should be able to decide when it can cache and when it can't.
(Also why I think we may want a config option, maybe in a follow-up PR.)

I think we could let the LLM class set the headers, though; it doesn't feel right in the agent. 😅
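For instance, the header could live behind a flag on an LLM wrapper rather than in the agent. A hypothetical sketch (not the repo's actual LLM class; the header value is the one Anthropic documented for the prompt-caching beta at the time, and newer litellm versions may add it automatically):

```python
import litellm

CACHING_BETA_HEADER = {"anthropic-beta": "prompt-caching-2024-07-31"}

class CachingAwareLLM:
    # Hypothetical wrapper: owns the caching header so agents don't have to.
    def __init__(self, model: str, supports_prompt_caching: bool = False):
        self.model = model
        self.supports_prompt_caching = supports_prompt_caching

    def completion(self, messages: list[dict], **kwargs):
        # Only attach the beta header when the configured model supports caching.
        if self.supports_prompt_caching:
            headers = dict(kwargs.pop("extra_headers", {}) or {})
            headers.update(CACHING_BETA_HEADER)
            kwargs["extra_headers"] = headers
        return litellm.completion(model=self.model, messages=messages, **kwargs)
```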

@enyst (Collaborator) commented Aug 24, 2024

@Kaushikdkrikhanu Also, I think you're right that litellm will make this more friendly. Regardless, we can add an option to have the LLM class deal with the headers and refactor them out of the agent, but FWIW it's fine with me if that's a subsequent PR. Thank you!

@Kaushikdkrikhanu (Contributor, Author) commented Aug 24, 2024

@enyst Thank you for taking the time to make the changes, look into the issues, write unit tests, and test it out.

@neubig (Contributor) commented Aug 25, 2024

Hey @enyst and @Kaushikdkrikhanu, just wanted to check: it seems like the tests are failing. We're all excited to get this in, so would you be able to try to resolve them?

@Kaushikdkrikhanu (Contributor, Author)
(screenshot)
Is this some permissions issue?

@tobitege (Collaborator)
Since the PR workflow is fixed now, all tests should finish successfully.

@tobitege tobitege changed the title Add prompt caching Add prompt caching (Sonnet, Haiku only) Aug 26, 2024
@tobitege tobitege removed their request for review August 26, 2024 21:44
@tobitege tobitege added the enhancement and agent quality labels Aug 26, 2024
@neubig neubig merged commit 5bb931e into All-Hands-AI:main Aug 27, 2024
@neubig (Contributor) commented Aug 27, 2024

Awesome, thanks for this! Exciting feature :)
