fix(core): implement proper token counting for mixed content in getNumTokens #8341


Merged: 3 commits merged into langchain-ai:main on Jun 17, 2025

Conversation

@christian-bromann (Contributor) commented Jun 10, 2025

Description

This PR fixes a critical issue in the BaseLanguageModel.getNumTokens() method where it would return 0 tokens for any non-string MessageContent, effectively ignoring text content in mixed-content messages.

Problem

Previously, when calling getNumTokensFromMessages() with messages containing mixed content (e.g., text + image_url arrays), the text portions were not being counted, leading to inaccurate token usage calculations. The method had a TODO comment and simply returned 0 for any array content:

// TODO: Figure out correct value.
if (typeof content !== "string") {
  return 0;
}

Solution

Implemented proper text extraction logic that:

  • Maintains backward compatibility: String content works exactly as before
  • Extracts text from arrays: Processes MessageContentComplex[] to find and concatenate all text blocks
  • Ignores non-text content: Image URLs, audio, and other media content are properly ignored for token counting
  • Handles edge cases: Empty arrays, image-only content, and multiple text blocks
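A minimal sketch of the extraction behavior described above (an illustrative reconstruction, not the verbatim patch; `ContentBlock` and `extractText` are hypothetical names standing in for the real types and method internals):

```typescript
// Hypothetical stand-in for langchain's MessageContentComplex entries.
type ContentBlock = { type: string; text?: string; [key: string]: unknown };

// Mirrors the described behavior: strings pass through unchanged,
// arrays keep only their text blocks, and non-text media is ignored.
function extractText(content: string | ContentBlock[]): string {
  if (typeof content === "string") {
    return content; // backward compatible: string content works as before
  }
  return content
    .filter((block) => block.type === "text" && typeof block.text === "string")
    .map((block) => block.text as string)
    .join(" ");
}
```

Empty arrays and image-only arrays fall through to an empty string, so they contribute zero tokens without throwing.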

Example

import { HumanMessage } from "@langchain/core/messages";

const messages = [
  new HumanMessage({
    content: [
      {
        type: "text",
        text: "What is this image?", // This text is now properly counted
      },
      {
        type: "image_url",
        image_url: {
          url: "https://example.com/image.jpg",
        },
      },
    ],
  }),
];

// Before: getNumTokensFromMessages would undercount due to 0 tokens from getNumTokens
// After: Correctly counts tokens for "What is this image?"

Changes Made

  • Core fix: Replaced TODO implementation in getNumTokens() with proper text extraction
  • Comprehensive tests: Added test suite covering mixed content, edge cases, and compatibility
  • Test restructuring: Improved test organization with describe/it blocks for better clarity

Testing

  • ✅ All existing tests pass (backward compatibility maintained)
  • ✅ New tests cover mixed content scenarios
  • ✅ Edge cases tested (empty arrays, image-only content, multiple text blocks)
  • ✅ Integration with getNumTokensFromMessages verified
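To illustrate why the fix matters for the counts themselves (this is not the PR's actual test file; `extractTextParts`, `approxTokens`, and the chars/4 heuristic are assumptions standing in for the real tokenizer path):

```typescript
// Illustrative only: extractTextParts mirrors the fix's extraction step,
// and approxTokens applies a rough chars/4 heuristic as a stand-in tokenizer.
type Block = { type: string; text?: string };

const extractTextParts = (content: string | Block[]): string =>
  typeof content === "string"
    ? content
    : content
        .filter((b) => b.type === "text" && typeof b.text === "string")
        .map((b) => b.text as string)
        .join(" ");

const approxTokens = (content: string | Block[]): number =>
  Math.ceil(extractTextParts(content).length / 4);

const mixed: Block[] = [
  { type: "text", text: "What is this image?" },
  { type: "image_url" },
];

// Before the fix the array case contributed 0 tokens; now the text block
// ("What is this image?", 19 chars) counts the same as a plain string.
```

With the old behavior, `mixed` would have produced 0 regardless of its text content; after the fix it matches the plain-string count.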

This fix ensures accurate token counting for modern chat applications that use rich message content with text, images, and other media types.

Fixes #8310
Fixes #8336


@dosubot dosubot bot added size:L This PR changes 100-499 lines, ignoring generated files. auto:bug Related to a bug, vulnerability, unexpected error with an existing feature labels Jun 10, 2025
…mTokens

Previously, getNumTokens would return 0 for any non-string MessageContent,
which caused incorrect token counts when messages contained mixed content
(e.g., text + image_url arrays). This affected methods like
getNumTokensFromMessages that rely on getNumTokens for accurate counting.

Changes:
- Remove TODO and replace with proper text extraction logic
- Extract text content from MessageContentComplex arrays
- Join multiple text blocks while ignoring non-text content types
- Maintain backward compatibility for string content
- Add comprehensive tests covering mixed content scenarios

Fixes issue where messages with image_url content would not count
text tokens, leading to inaccurate token usage calculations.

fixes langchain-ai#8310
@hntrl (Contributor) left a comment


Better than what we had before, but I don't know if this is "end-state" token counting (for instance, in Python we consider other things like function calls part of the token count; there are probably others we should be considering). At a minimum we should probably re-add the comment to signal that a more appropriate solution is needed. It would be interesting to compare different message content instances with the GPT-2 tokenizer and this fallback method.

I'll also call out that I don't think the blast radius on this is big, since we only use this in the trimMessages utility:

https://github.com/langchain-ai/langchain/blob/master/libs/core/langchain_core/messages/utils.py#L86

@christian-bromann (Contributor, Author) replied:
Thanks for reviewing!

I'll also call out that I don't think the blast radius on this is big, since we only use this in the trimMessages utility

There have been 2 reported instances of this failing for users, see #8310 and #8336. I agree with you, this is definitely not perfect but addresses the issues reported. I've added a comment to revisit this section at a later time. Let me know if there is anything else I can do.

@dosubot dosubot bot added the lgtm PRs that are ready to be merged as-is label Jun 17, 2025
@hntrl hntrl merged commit 055f972 into langchain-ai:main Jun 17, 2025
36 checks passed
@hntrl (Contributor) commented Jun 17, 2025

thanks @christian-bromann!

@christian-bromann christian-bromann deleted the cb/getNumTokens-fix branch June 17, 2025 01:16
Successfully merging this pull request may close these issues:

  • trimMessages tokenCounter can't process complex messages
  • getNumTokens only handles string values