fix(core): implement proper token counting for mixed content in getNumTokens #8341
Conversation
fix(core): implement proper token counting for mixed content in getNumTokens

Previously, getNumTokens would return 0 for any non-string MessageContent, which caused incorrect token counts when messages contained mixed content (e.g., text + image_url arrays). This affected methods like getNumTokensFromMessages that rely on getNumTokens for accurate counting.

Changes:
- Remove TODO and replace with proper text extraction logic
- Extract text content from MessageContentComplex arrays
- Join multiple text blocks while ignoring non-text content types
- Maintain backward compatibility for string content
- Add comprehensive tests covering mixed content scenarios

Fixes issue where messages with image_url content would not count text tokens, leading to inaccurate token usage calculations.

fixes langchain-ai#8310
d86a6c5 to 90cde92 (Compare)
Better than what we had before, but I don't know if this is "end-state" token counting (for instance, in Python we consider other things, like function calls, as part of the token count; there are probably others we should be considering too). At a minimum we should probably re-add the comment to signal that a more appropriate solution is still needed. It would be interesting to compare different message content instances with the GPT-2 tokenizer and this fallback method.
I'll also call out that I don't think the blast radius of this is big (I think), since we only use this in the trimMessages utility.
Thanks for reviewing!
There have been two reported instances of this failing for users; see #8310 and #8336. I agree with you: this is definitely not perfect, but it addresses the issues reported. I've added a comment to revisit this section at a later time. Let me know if there is anything else I can do.
thanks @christian-bromann!
Description
This PR fixes a critical issue in the BaseLanguageModel.getNumTokens() method, where it would return 0 tokens for any non-string MessageContent, effectively ignoring the text content of mixed-content messages.
Problem
Previously, when calling getNumTokensFromMessages() with messages containing mixed content (e.g., text + image_url arrays), the text portions were not counted, leading to inaccurate token usage calculations. The method had a TODO comment and simply returned 0 for any array content:
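Roughly, the old behavior looked like this (a paraphrased sketch, not the verbatim library source; the function name and the character-based fallback are illustrative):

```typescript
import type { MessageContent } from "@langchain/core/messages";

// Paraphrase of the pre-fix logic: any non-string content short-circuited to 0.
async function getNumTokensBefore(content: MessageContent): Promise<number> {
  if (typeof content !== "string") {
    // Mixed content (e.g. text + image_url arrays) was never counted.
    return 0;
  }
  // Rough fallback heuristic of ~4 characters per token, shown here for illustration.
  return Math.ceil(content.length / 4);
}
```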
Solution
Implemented proper text extraction logic (sketched after this list) that:
- Extracts text content from MessageContentComplex[] arrays and concatenates all text blocks
- Ignores non-text content types such as image_url
- Maintains backward compatibility for plain string content
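A minimal sketch of that extraction approach (the helper name extractTextForCounting and its exact shape are hypothetical, not the PR's literal code):

```typescript
import type { MessageContent } from "@langchain/core/messages";

// Keep only text blocks from complex content and join them for tokenization;
// image_url and other non-text block types contribute nothing to the count.
function extractTextForCounting(content: MessageContent): string {
  if (typeof content === "string") {
    // Plain string content is handled exactly as before.
    return content;
  }
  return content
    .filter((block) => block.type === "text")
    .map((block) => (block as { type: "text"; text: string }).text)
    .join(" ");
}
```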
Example
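An illustrative usage sketch; the model class, model name, API key, and expected count are assumptions for demonstration, not taken from the PR:

```typescript
import { HumanMessage } from "@langchain/core/messages";
import { ChatOpenAI } from "@langchain/openai";

// Placeholder model and key; any model inheriting the base getNumTokens works.
const model = new ChatOpenAI({ model: "gpt-4o-mini", apiKey: "sk-placeholder" });

const message = new HumanMessage({
  content: [
    { type: "text", text: "Describe this image in one short sentence." },
    { type: "image_url", image_url: { url: "https://example.com/cat.png" } },
  ],
});

// Before this fix, the non-string content produced a count of 0;
// with the fix, the text block's tokens are included.
const count = await model.getNumTokens(message.content);
console.log(count); // > 0
```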
Changes Made
- Replaced the TODO in getNumTokens() with proper text extraction
- Reorganized tests into describe/it blocks for better clarity
Testing
- Behavior of getNumTokensFromMessages with mixed content verified (see the test sketch below)
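A hedged sketch of the kind of test added (Jest globals and FakeListChatModel from the core testing utilities are assumed; the PR's actual tests may differ):

```typescript
import { FakeListChatModel } from "@langchain/core/utils/testing";
import type { MessageContent } from "@langchain/core/messages";

describe("getNumTokens", () => {
  it("counts the text portion of mixed content", async () => {
    // A fake model avoids network calls while still using the base token counting.
    const model = new FakeListChatModel({ responses: ["ok"] });
    const mixed: MessageContent = [
      { type: "text", text: "How many tokens is this sentence?" },
      { type: "image_url", image_url: { url: "https://example.com/img.png" } },
    ];
    // Before the fix this returned 0 because the content is not a plain string.
    expect(await model.getNumTokens(mixed)).toBeGreaterThan(0);
  });
});
```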
This fix ensures accurate token counting for modern chat applications that use rich message content with text, images, and other media types.
Fixes #8310
Fixes #8336