
Add Idefics3Model for ColSmol #2996

Open · wants to merge 10 commits into main
Conversation

@akshayballal95 (Contributor) commented Jun 16, 2025

This PR adds the Idefics3 model, mainly for use with ColSmol, but the support can be extended to use it as a causal VLM with models like Idefics3-Llama.

Although the model works as intended, and the output tensors are acceptably close to the transformers library counterpart, execution is quite slow (about 75% slower than the transformers implementation). Upon inspection, I found that one operation takes an inordinately long time:

https://github.com/akshayballal95/candle/blob/e71c90e8d1db5ed84ff65cef8cc4f9f3301ff30e/candle-transformers/src/models/idefics3/model.rs#L596-L599

  let special_image_token_indices = input_ids
      .eq(self.image_token_id as f64)?
      .nonzero()?
      .repeat((1, embed_dim))?;

Specifically, the nonzero operation contributes most of the time, but I don't think the issue is with nonzero itself. I have used this operation in other parts of the code, and it runs quite fast everywhere else, just not in this particular case. I understand that it involves moving the tensor to the CPU, but when I keep input_ids on the CPU this operation becomes faster; strangely, the inference time for Llama then increases. Replacing nonzero with other operations such as cumsum shows the same problem.
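
As a point of comparison with the observation above about keeping input_ids on the CPU, here is a minimal sketch of a variant that moves only input_ids to the CPU for this step and then moves the resulting indices back to the original device. It assumes the same nonzero helper used in the snippet above is available; the function name is illustrative and not part of this PR:

```rust
use candle_core::{Device, Result, Tensor};

// Hypothetical variant of the snippet above (not the PR's code): compute the
// image-token indices on a CPU copy of input_ids, so only the small id tensor
// crosses devices, then move the resulting indices back to the original device.
fn special_image_token_indices(
    input_ids: &Tensor,
    image_token_id: u32,
    embed_dim: usize,
) -> Result<Tensor> {
    let ids_cpu = input_ids.to_device(&Device::Cpu)?;
    ids_cpu
        .eq(image_token_id as f64)?
        .nonzero()? // same nonzero operation as in the snippet above
        .to_device(input_ids.device())?
        .repeat((1, embed_dim))
}
```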

Here are the times for major parts of the model:

TextEmbed time: 22.219µs
Vision Transformer time: 34.432157ms

Input Merging time: 466.175661ms
-> Special Image Token Indices time: 466.096411ms ❌

LLama Inference time: 15.215606ms
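
For context, these numbers come from simple wall-clock timing around each stage. A minimal sketch of such a measurement helper (illustrative only, not the PR's code), together with one caveat about reading per-stage timings on GPU backends:

```rust
use candle_core::{Result, Tensor};
use std::time::Instant;

// Illustrative timing wrapper; `stage` stands in for one of the steps above
// (text embedding, vision transformer, input merging, Llama inference).
// Caveat: on GPU backends kernel launches can be asynchronous, so a stage that
// forces a device-to-host copy (as nonzero does) may also absorb time from
// earlier, still-running kernels.
fn timed<F: FnOnce() -> Result<Tensor>>(label: &str, stage: F) -> Result<Tensor> {
    let start = Instant::now();
    let out = stage()?;
    println!("{label} time: {:?}", start.elapsed());
    Ok(out)
}
```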

Let me know if more information is needed or if this is normal behavior.

- Introduced the Idefics3 model architecture with vision and text configurations.
- Added processing utilities for image handling, including resizing, normalization, and padding.
- Implemented a main example to demonstrate model usage with image input.
- Included tests for image processing and model functionality.
- Removed the batched unfold helper function to streamline the code.
- Enhanced the unfold function to correctly handle tensor dimensions and permutations (an illustrative sketch of the unfold semantics follows this commit list).
- Updated index tensor creation to accommodate varying dimensions.
- Cleaned up debug print statements for clarity.
- Streamlined the unfold function by removing unnecessary comments and enhancing index tensor creation for better dimension handling.
- Improved broadcasting logic for index tensors and reshaping to accommodate varying tensor dimensions.
- Updated test cases to reflect changes in the unfolding logic and added debug print statements for output verification.
- Simplified the inverse permutation logic by directly swapping the unfolding dimension.
- Updated test cases to ensure correct handling of tensor dimensions and improved output verification.
- Enhanced debug print statements for clarity in 2D and 3D test cases.
- Updated the Idefics3 example to utilize the new Idefics3VisionTransformer for improved image processing.
- Refactored the main function to include patch extraction and attention mask generation.
- Simplified the loading of model components and adjusted tensor handling for better clarity and performance.
- Added new utility functions for tensor unfolding and processing, ensuring compatibility with varying input dimensions.
- Improved test cases to validate the new transformer architecture and its functionality.
- Introduced a new ColSmol example demonstrating image processing with the ColIdefics3 model.
- Added processing utilities for image handling, including resizing, normalization, and padding.
- Implemented a command-line interface for user interaction with the model.
- Updated README with usage instructions and example output.
- Enhanced the main function to include image retrieval and processing logic.
- Introduced a new example for ColSmol demonstrating image processing capabilities.
- Added a new `tokenize_batch` function to streamline prompt tokenization for multiple images.
- Refactored the `Idefics3ImageProcessor` to improve image handling, including resizing and padding.
- Simplified the `preprocess` method to handle batches of images more efficiently.
- Updated the main function in the Idefics3 example to support multiple image inputs.
- Cleaned up unused code and comments for better readability and maintainability.
- Added timing measurement for the retrieval process in the main function to evaluate performance.
- Refactored the image processing logic in `Idefics3ImageProcessor` to combine rescaling and normalization steps for improved efficiency.
- Simplified the creation of padded image and mask lists using a more concise mapping approach.
- Cleaned up code for better readability and maintainability.
- Removed unnecessary whitespace in `main.rs` files for cleaner code.
- Adjusted the `PageRetriever` instantiation in `colsmol/main.rs` to improve retrieval efficiency.
- Enhanced readability in `Idefics3ImageProcessor` by aligning method calls and improving code structure.
- Updated README to reflect changes in example usage and clarify command-line instructions.
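
Since several of the commits above revolve around the unfold helper, here is a minimal sketch of the sliding-window semantics it provides, built from narrow and stack. This is illustrative only (the PR's implementation builds index tensors and permutes dimensions instead), and the function name is hypothetical:

```rust
use candle_core::{Result, Tensor};

// Illustrative unfold: extract windows of length `size` every `step` elements
// along `dim`, stacking them along a new axis inserted at `dim`, so an input of
// shape (..., len, ...) becomes (..., n_windows, size, ...). Assumes len >= size.
fn unfold_sketch(t: &Tensor, dim: usize, size: usize, step: usize) -> Result<Tensor> {
    let len = t.dim(dim)?;
    let n_windows = (len - size) / step + 1;
    let windows = (0..n_windows)
        .map(|i| t.narrow(dim, i * step, size))
        .collect::<Result<Vec<_>>>()?;
    Tensor::stack(&windows, dim)
}
```

Stacking at `dim` keeps each window's contents adjacent to the original dimension; a caller that wants PyTorch's unfold layout (window as the trailing axis) can permute afterwards.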
@akshayballal95 marked this pull request as draft June 16, 2025 20:35
@akshayballal95 marked this pull request as ready for review June 16, 2025 20:35
- Deleted the test module from `candle-examples/examples/colsmol/processing.rs` to streamline the codebase.
- This change focuses on cleaning up the example by removing unused test cases that are no longer relevant.