fix(core): implement proper token counting for mixed content in getNumTokens #8341


Merged: 3 commits merged into langchain-ai:main on Jun 17, 2025

Conversation

@christian-bromann (Contributor) commented Jun 10, 2025

Description

This PR fixes a critical issue in the BaseLanguageModel.getNumTokens() method where it would return 0 tokens for any non-string MessageContent, effectively ignoring text content in mixed-content messages.

Problem

Previously, when calling getNumTokensFromMessages() with messages containing mixed content (e.g., text + image_url arrays), the text portions were not being counted, leading to inaccurate token usage calculations. The method had a TODO comment and simply returned 0 for any array content:

// TODO: Figure out correct value.
if (typeof content !== "string") {
  return 0;
}

Solution

Implemented proper text extraction logic that:

  • Maintains backward compatibility: String content works exactly as before
  • Extracts text from arrays: Processes MessageContentComplex[] to find and concatenate all text blocks
  • Ignores non-text content: Image URLs, audio, and other media content are properly ignored for token counting
  • Handles edge cases: Empty arrays, image-only content, and multiple text blocks
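A minimal sketch of the extraction behavior described above (an illustrative reconstruction, not the verbatim patch; `ContentBlock` and `extractText` are hypothetical names standing in for the real types and method internals):

```typescript
// Hypothetical stand-in for langchain's MessageContentComplex entries.
type ContentBlock = { type: string; text?: string; [key: string]: unknown };

// Mirrors the described behavior: strings pass through unchanged,
// arrays keep only their text blocks, and non-text media is ignored.
function extractText(content: string | ContentBlock[]): string {
  if (typeof content === "string") {
    return content; // backward compatible: string content works as before
  }
  return content
    .filter((block) => block.type === "text" && typeof block.text === "string")
    .map((block) => block.text as string)
    .join(" ");
}
```

Empty arrays and image-only arrays fall through to an empty string, so they contribute zero tokens without throwing.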

Example

import { HumanMessage } from "@langchain/core/messages";

const messages = [
  new HumanMessage({
    content: [
      {
        type: "text",
        text: "What is this image?", // This text is now properly counted
      },
      {
        type: "image_url",
        image_url: {
          url: "https://example.com/image.jpg",
        },
      },
    ],
  }),
];

// Before: getNumTokensFromMessages would undercount due to 0 tokens from getNumTokens
// After: Correctly counts tokens for "What is this image?"

Changes Made

  • Core fix: Replaced TODO implementation in getNumTokens() with proper text extraction
  • Comprehensive tests: Added test suite covering mixed content, edge cases, and compatibility
  • Test restructuring: Improved test organization with describe/it blocks for better clarity

Testing

  • ✅ All existing tests pass (backward compatibility maintained)
  • ✅ New tests cover mixed content scenarios
  • ✅ Edge cases tested (empty arrays, image-only content, multiple text blocks)
  • ✅ Integration with getNumTokensFromMessages verified
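To illustrate why the fix matters for the counts themselves (this is not the PR's actual test file; `extractTextParts`, `approxTokens`, and the chars/4 heuristic are assumptions standing in for the real tokenizer path):

```typescript
// Illustrative only: extractTextParts mirrors the fix's extraction step,
// and approxTokens applies a rough chars/4 heuristic as a stand-in tokenizer.
type Block = { type: string; text?: string };

const extractTextParts = (content: string | Block[]): string =>
  typeof content === "string"
    ? content
    : content
        .filter((b) => b.type === "text" && typeof b.text === "string")
        .map((b) => b.text as string)
        .join(" ");

const approxTokens = (content: string | Block[]): number =>
  Math.ceil(extractTextParts(content).length / 4);

const mixed: Block[] = [
  { type: "text", text: "What is this image?" },
  { type: "image_url" },
];

// Before the fix the array case contributed 0 tokens; now the text block
// ("What is this image?", 19 chars) counts the same as a plain string.
```

With the old behavior, `mixed` would have produced 0 regardless of its text content; after the fix it matches the plain-string count.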

This fix ensures accurate token counting for modern chat applications that use rich message content with text, images, and other media types.

Fixes #8310
Fixes #8336


@dosubot dosubot bot added size:L This PR changes 100-499 lines, ignoring generated files. auto:bug Related to a bug, vulnerability, unexpected error with an existing feature labels Jun 10, 2025
…mTokens

Previously, getNumTokens would return 0 for any non-string MessageContent,
which caused incorrect token counts when messages contained mixed content
(e.g., text + image_url arrays). This affected methods like
getNumTokensFromMessages that rely on getNumTokens for accurate counting.

Changes:
- Remove TODO and replace with proper text extraction logic
- Extract text content from MessageContentComplex arrays
- Join multiple text blocks while ignoring non-text content types
- Maintain backward compatibility for string content
- Add comprehensive tests covering mixed content scenarios

Fixes issue where messages with image_url content would not count
text tokens, leading to inaccurate token usage calculations.

fixes langchain-ai#8310
@hntrl (Contributor) left a comment


Better than what we had before, but I don't know if this is "end-state" token counting (for instance, in Python we consider other things like function calls part of the token count; there are probably others we should be considering). At a minimum we should probably re-add the comment to signal that a more appropriate solution is needed. It would be interesting to compare different message content instances with the GPT-2 tokenizer and this fallback method.

I'll also call out that I don't think the blast radius on this is big, since we only use this in the trimMessages utility:

https://github.com/langchain-ai/langchain/blob/master/libs/core/langchain_core/messages/utils.py#L86

@christian-bromann (Contributor, Author) replied:
Thanks for reviewing!

I'll also call out that I don't think the blast radius on this is big, since we only use this in the trimMessages utility

There have been 2 reported instances of this failing for users, see #8310 and #8336. I agree with you, this is definitely not perfect but addresses the issues reported. I've added a comment to revisit this section at a later time. Let me know if there is anything else I can do.

@dosubot dosubot bot added the lgtm PRs that are ready to be merged as-is label Jun 17, 2025
@hntrl hntrl merged commit 055f972 into langchain-ai:main Jun 17, 2025
36 checks passed
@hntrl (Contributor) commented Jun 17, 2025

thanks @christian-bromann!

@christian-bromann christian-bromann deleted the cb/getNumTokens-fix branch June 17, 2025 01:16
Successfully merging this pull request may close these issues:

  • trimMessages tokenCounter can't process complex messages
  • getNumTokens only handles string values