Functioning decode on multimodal Gemma3-4b #1689
Conversation
Great work, thanks Hengtao!
MaxText/layers/models.py (outdated)

```diff
@@ -703,5 +718,6 @@ def __call__(
         slot=slot,
         page_state=page_state,
         bidirectional_mask=bidirectional_mask,
+        image_embeddings=image_embeddings if self.config.use_multimodal else None,
```
Nit: line 702 sets image_embeddings to None, so we can keep just one of the two defaults. We could also handle it the same way as bidirectional_mask, to be consistent.
Done: now just passing image_embeddings=image_embeddings, to align with bidirectional_mask. A sketch of the resulting pattern is below.
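For readers following the thread, here is a minimal sketch of the agreed pattern: set the None default in one place and forward the value unconditionally, mirroring bidirectional_mask. The simplified signature and the encode_images hook are hypothetical stand-ins; the real call site is Transformer.__call__ in models.py.

```python
# Sketch only: names other than use_multimodal / image_embeddings /
# bidirectional_mask are hypothetical stand-ins for the MaxText call site.
def forward(decoder, config, tokens, bidirectional_mask, encode_images=None):
  image_embeddings = None  # single default (the "line 702" assignment above)
  if config.use_multimodal and encode_images is not None:
    image_embeddings = encode_images()  # hypothetical vision-encoder hook
  return decoder(
      decoder_input_tokens=tokens,
      bidirectional_mask=bidirectional_mask,
      # Forwarded unconditionally, exactly like bidirectional_mask; no
      # `if config.use_multimodal else None` at the call site anymore.
      image_embeddings=image_embeddings,
  )
```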
```diff
@@ -34,6 +34,7 @@
 from MaxText.layers import normalizations, quantizations
 from MaxText.layers import pipeline
 from MaxText import maxtext_utils
+from MaxText import multimodal_utils
```
I think this will cause a circular dependency, since I added this in the BUILD file (https://source.corp.google.com/piper///depot/google3/third_party/py/maxtext/BUILD;l=360?q=BUILD%20f:maxtext). Since only a few image-related variables are used in gemma3.py (https://github.com/AI-Hypercomputer/maxtext/blob/main/MaxText/layers/gemma3.py#L48-L53), it may be better to move them to multimodal_utils.
Thank you, Aireen, for the catch!
I have moved all the Gemma-related static values from gemma3.py to multimodal_utils.py; a sketch of the layout after the move follows this thread. Alongside this PR, I will amend the copybara config to remove multimodal_utils.py's dependency on :layers. Let me know if this sounds good to you!
Sounds good, thank you!
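For context, a sketch of what the move looks like. The constant names and values below are assumptions based on the gemma3.py lines linked above, not the verbatim MaxText definitions:

```python
# multimodal_utils.py: a leaf module with no imports from MaxText.layers,
# which breaks the layers -> multimodal_utils -> layers cycle noted above.
# Values are illustrative assumptions for Gemma3 image preprocessing.
GEMMA_DEFAULT_IMAGE_SIZE = 896     # assumed vision-encoder input resolution
GEMMA_IMAGE_MEAN = (127.5,) * 3    # assumed per-channel normalization mean
GEMMA_IMAGE_STD = (127.5,) * 3     # assumed per-channel normalization std
NUM_IMAGE_CHANNELS = 3             # RGB

# gemma3.py then imports the shared values instead of defining them:
# from MaxText import multimodal_utils
# image_size = multimodal_utils.GEMMA_DEFAULT_IMAGE_SIZE
```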
Awesome, thanks Hengtao!!
Description
Insert the vision embeddings into the text embeddings and enable a fully functioning decode forward pass on the multimodal Gemma3-4b model.
merge_mm_embeddings inserts the image_embeddings into the text_embeddings at the image placeholder token positions marked by the bidirectional_mask; a sketch of one way to implement such a merge follows.
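A minimal JAX sketch of the merge, assuming [batch, seq, dim] text embeddings, [batch, num_image_tokens, dim] image embeddings, and a boolean [batch, seq] mask. This illustrates the idea only and is not the verbatim MaxText implementation:

```python
import jax.numpy as jnp

def merge_mm_embeddings(text_embeddings, image_embeddings, bidirectional_mask):
  """Scatter image embeddings into the placeholder token positions.

  Assumed shapes (a sketch, not the exact MaxText signature):
    text_embeddings:    [batch, seq_len, emb_dim]
    image_embeddings:   [batch, num_image_tokens, emb_dim]
    bidirectional_mask: [batch, seq_len] bool, True at <start_of_image> slots
  """
  # The k-th True position in each row should receive the k-th image embedding.
  positions = jnp.cumsum(bidirectional_mask, axis=-1) - 1
  positions = jnp.clip(positions, 0, image_embeddings.shape[1] - 1)
  # Gather, for every sequence position, the image embedding it would receive.
  gathered = jnp.take_along_axis(image_embeddings, positions[..., None], axis=1)
  # Keep the text embedding everywhere except at placeholder positions.
  return jnp.where(bidirectional_mask[..., None], gathered, text_embeddings)
```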
Tests
A full decode forward pass from the command line, using prompt='Describe image <start_of_image>' and an input image, yields logs containing the expected image description.
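An illustrative invocation is below. Only prompt and use_multimodal appear verbatim in this PR; model_name, load_parameters_path, and image_path are assumed flag names, and all paths are placeholders:

```sh
# Illustrative only; paths are placeholders and some flag names are assumptions.
python3 -m MaxText.decode MaxText/configs/base.yml \
  model_name=gemma3-4b \
  use_multimodal=true \
  load_parameters_path=gs://your-bucket/gemma3-4b/checkpoint \
  prompt="Describe image <start_of_image>" \
  image_path=/path/to/test_image.jpg
```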
Checklist
Before submitting this PR, please make sure (put X in square brackets):