
Add Idefics3Model for ColSmol #2996

Open · wants to merge 10 commits into main
Conversation

@akshayballal95 (Contributor) commented Jun 16, 2025

This PR adds the Idefics3 model, mainly for use with ColSmol, but the support can be extended to use it as a causal VLM with models like Idefics3-Llama.

Although the model works as intended, and the output tensors are acceptably close to the transformers library counterpart, execution is quite slow (about 75% slower than the transformers implementation). Upon inspection, I found that one operation takes an inordinately long time:

https://github.com/akshayballal95/candle/blob/e71c90e8d1db5ed84ff65cef8cc4f9f3301ff30e/candle-transformers/src/models/idefics3/model.rs#L596-L599

  let special_image_token_indices = input_ids
      .eq(self.image_token_id as f64)?
      .nonzero()?
      .repeat((1, embed_dim))?;

Specifically, the nonzero operation contributes most of the time, but I don't think the issue is with nonzero itself. I have used this operation in other parts of the code, and it runs quite fast everywhere else, just not in this particular case. I understand that it involves moving the tensor to the CPU, but when I keep input_ids on the CPU this operation becomes faster; strangely, the inference time for Llama then increases. Replacing nonzero with other operations such as cumsum shows the same problem.
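
As a point of comparison with the observation above about keeping input_ids on the CPU, here is a minimal sketch of a variant that moves only input_ids to the CPU for this step and then moves the resulting indices back to the original device. It assumes the same nonzero helper used in the snippet above is available; the function name is illustrative and not part of this PR:

```rust
use candle_core::{Device, Result, Tensor};

// Hypothetical variant of the snippet above (not the PR's code): compute the
// image-token indices on a CPU copy of input_ids, so only the small id tensor
// crosses devices, then move the resulting indices back to the original device.
fn special_image_token_indices(
    input_ids: &Tensor,
    image_token_id: u32,
    embed_dim: usize,
) -> Result<Tensor> {
    let ids_cpu = input_ids.to_device(&Device::Cpu)?;
    ids_cpu
        .eq(image_token_id as f64)?
        .nonzero()? // same nonzero operation as in the snippet above
        .to_device(input_ids.device())?
        .repeat((1, embed_dim))
}
```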

Here are the times for major parts of the model:

TextEmbed time: 22.219µs
Vision Transformer time: 34.432157ms

Input Merging time: 466.175661ms
-> Special Image Token Indices time: 466.096411ms ❌

LLama Inference time: 15.215606ms
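
For context, these numbers come from simple wall-clock timing around each stage. A minimal sketch of such a measurement helper (illustrative only, not the PR's code), together with one caveat about reading per-stage timings on GPU backends:

```rust
use candle_core::{Result, Tensor};
use std::time::Instant;

// Illustrative timing wrapper; `stage` stands in for one of the steps above
// (text embedding, vision transformer, input merging, Llama inference).
// Caveat: on GPU backends kernel launches can be asynchronous, so a stage that
// forces a device-to-host copy (as nonzero does) may also absorb time from
// earlier, still-running kernels.
fn timed<F: FnOnce() -> Result<Tensor>>(label: &str, stage: F) -> Result<Tensor> {
    let start = Instant::now();
    let out = stage()?;
    println!("{label} time: {:?}", start.elapsed());
    Ok(out)
}
```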

Let me know if more information is needed or if this is normal behavior.

- Introduced the Idefics3 model architecture with vision and text configurations.
- Added processing utilities for image handling, including resizing, normalization, and padding.
- Implemented a main example to demonstrate model usage with image input.
- Included tests for image processing and model functionality.
- Removed the batched unfold helper function to streamline the code.
- Enhanced the unfold function to correctly handle tensor dimensions and permutations (an illustrative sketch of the unfold semantics follows this commit list).
- Updated index tensor creation to accommodate varying dimensions.
- Cleaned up debug print statements for clarity.
- Streamlined the unfold function by removing unnecessary comments and enhancing index tensor creation for better dimension handling.
- Improved broadcasting logic for index tensors and reshaping to accommodate varying tensor dimensions.
- Updated test cases to reflect changes in the unfolding logic and added debug print statements for output verification.
- Simplified the inverse permutation logic by directly swapping the unfolding dimension.
- Updated test cases to ensure correct handling of tensor dimensions and improved output verification.
- Enhanced debug print statements for clarity in 2D and 3D test cases.
- Updated the Idefics3 example to utilize the new Idefics3VisionTransformer for improved image processing.
- Refactored the main function to include patch extraction and attention mask generation.
- Simplified the loading of model components and adjusted tensor handling for better clarity and performance.
- Added new utility functions for tensor unfolding and processing, ensuring compatibility with varying input dimensions.
- Improved test cases to validate the new transformer architecture and its functionality.
- Introduced a new ColSmol example demonstrating image processing with the ColIdefics3 model.
- Added processing utilities for image handling, including resizing, normalization, and padding.
- Implemented a command-line interface for user interaction with the model.
- Updated README with usage instructions and example output.
- Enhanced the main function to include image retrieval and processing logic.
- Introduced a new example for ColSmol demonstrating image processing capabilities.
- Added a new `tokenize_batch` function to streamline prompt tokenization for multiple images.
- Refactored the `Idefics3ImageProcessor` to improve image handling, including resizing and padding.
- Simplified the `preprocess` method to handle batches of images more efficiently.
- Updated the main function in the Idefics3 example to support multiple image inputs.
- Cleaned up unused code and comments for better readability and maintainability.
- Added timing measurement for the retrieval process in the main function to evaluate performance.
- Refactored the image processing logic in `Idefics3ImageProcessor` to combine rescaling and normalization steps for improved efficiency.
- Simplified the creation of padded image and mask lists using a more concise mapping approach.
- Cleaned up code for better readability and maintainability.
- Removed unnecessary whitespace in `main.rs` files for cleaner code.
- Adjusted the `PageRetriever` instantiation in `colsmol/main.rs` to improve retrieval efficiency.
- Enhanced readability in `Idefics3ImageProcessor` by aligning method calls and improving code structure.
- Updated README to reflect changes in example usage and clarify command-line instructions.
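
Since several of the commits above revolve around the unfold helper, here is a minimal sketch of the sliding-window semantics it provides, built from narrow and stack. This is illustrative only (the PR's implementation builds index tensors and permutes dimensions instead), and the function name is hypothetical:

```rust
use candle_core::{Result, Tensor};

// Illustrative unfold: extract windows of length `size` every `step` elements
// along `dim`, stacking them along a new axis inserted at `dim`, so an input of
// shape (..., len, ...) becomes (..., n_windows, size, ...). Assumes len >= size.
fn unfold_sketch(t: &Tensor, dim: usize, size: usize, step: usize) -> Result<Tensor> {
    let len = t.dim(dim)?;
    let n_windows = (len - size) / step + 1;
    let windows = (0..n_windows)
        .map(|i| t.narrow(dim, i * step, size))
        .collect::<Result<Vec<_>>>()?;
    Tensor::stack(&windows, dim)
}
```

Stacking at `dim` keeps each window's contents adjacent to the original dimension; a caller that wants PyTorch's unfold layout (window as the trailing axis) can permute afterwards.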
@akshayballal95 marked this pull request as draft June 16, 2025 20:35
@akshayballal95 marked this pull request as ready for review June 16, 2025 20:35
- Deleted the test module from `candle-examples/examples/colsmol/processing.rs` to streamline the codebase.
- This change focuses on cleaning up the example by removing unused test cases that are no longer relevant.