Add Idefics3Model for ColSmol #2996
Open: akshayballal95 wants to merge 10 commits into huggingface:main from akshayballal95:idefics3
- Introduced the Idefics3 model architecture with vision and text configurations.
- Added processing utilities for image handling, including resizing, normalization, and padding.
- Implemented a main example to demonstrate model usage with image input.
- Included tests for image processing and model functionality.

- Removed the batched unfold helper function to streamline the code.
- Enhanced the unfold function to correctly handle tensor dimensions and permutations.
- Updated index tensor creation to accommodate varying dimensions.
- Cleaned up debug print statements for clarity.

- Streamlined the unfold function by removing unnecessary comments and enhancing index tensor creation for better dimension handling.
- Improved broadcasting logic for index tensors and reshaping to accommodate varying tensor dimensions.
- Updated test cases to reflect changes in the unfolding logic and added debug print statements for output verification.

- Simplified the inverse permutation logic by directly swapping the unfolding dimension.
- Updated test cases to ensure correct handling of tensor dimensions and improved output verification.
- Enhanced debug print statements for clarity in 2D and 3D test cases.

- Updated the Idefics3 example to utilize the new Idefics3VisionTransformer for improved image processing.
- Refactored the main function to include patch extraction and attention mask generation.
- Simplified the loading of model components and adjusted tensor handling for better clarity and performance.
- Added new utility functions for tensor unfolding and processing, ensuring compatibility with varying input dimensions.
- Improved test cases to validate the new transformer architecture and its functionality.

- Introduced a new ColSmol example demonstrating image processing with the ColIdefics3 model.
- Added processing utilities for image handling, including resizing, normalization, and padding.
- Implemented a command-line interface for user interaction with the model.
- Updated README with usage instructions and example output.
- Enhanced the main function to include image retrieval and processing logic.

- Introduced a new example for ColSmol demonstrating image processing capabilities.
- Added a new `tokenize_batch` function to streamline prompt tokenization for multiple images.
- Refactored the `Idefics3ImageProcessor` to improve image handling, including resizing and padding.
- Simplified the `preprocess` method to handle batches of images more efficiently.
- Updated the main function in the Idefics3 example to support multiple image inputs.
- Cleaned up unused code and comments for better readability and maintainability.

- Added timing measurement for the retrieval process in the main function to evaluate performance.
- Refactored the image processing logic in `Idefics3ImageProcessor` to combine rescaling and normalization steps for improved efficiency.
- Simplified the creation of padded image and mask lists using a more concise mapping approach.
- Cleaned up code for better readability and maintainability.

- Removed unnecessary whitespace in `main.rs` files for cleaner code.
- Adjusted the `PageRetriever` instantiation in `colsmol/main.rs` to improve retrieval efficiency.
- Enhanced readability in `Idefics3ImageProcessor` by aligning method calls and improving code structure.
- Updated README to reflect changes in example usage and clarify command-line instructions.

- Deleted the test module from `candle-examples/examples/colsmol/processing.rs` to streamline the codebase.
- This change focuses on cleaning up the example by removing unused test cases that are no longer relevant.
This PR adds the Idefics3 model, mainly for use with ColSmol. Support could later be extended to causal VLM use with models such as Idefics3-Llama.
The model works as intended, and the output tensors are acceptably close to those of the Transformers library counterpart, but execution is quite slow (about 75% slower than the Transformers implementation). Upon inspection, I found that one particular operation takes an inordinately long time:
https://github.com/akshayballal95/candle/blob/e71c90e8d1db5ed84ff65cef8cc4f9f3301ff30e/candle-transformers/src/models/idefics3/model.rs#L596-L599
Specifically, the `nonzero` operation contributes to the time. But I don't think the issue is with `nonzero` itself. I have used this operation in other parts of the code, and it runs quite fast everywhere else, just not in this particular case. I understand that it involves moving the tensors to the CPU, and when I keep `input_ids` on the CPU, this operation does become faster. Strangely, however, the inference time for Llama then increases. Replacing `nonzero` with other operations such as `cumsum` shows the same problem.

Here are the times for major parts of the model:
Let me know if more information is needed or if this is normal behavior.
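
For readers without the linked lines at hand, here is a minimal plain-Rust sketch of what the two variants mentioned above compute. It assumes the step in question locates image-placeholder tokens in `input_ids`, which is how Idefics3-style models usually merge image features into the text sequence. The function names and the token id used in the comments are hypothetical. The sketch uses only the standard library, not candle's tensor API, so it illustrates the logic rather than the device behavior.

```rust
/// Positions at which `input_ids` equals `image_token_id`.
/// Plain-Rust analogue of calling `nonzero` on a 1-D equality mask.
/// (Hypothetical helper, not part of candle.)
fn image_token_positions(input_ids: &[u32], image_token_id: u32) -> Vec<usize> {
    input_ids
        .iter()
        .enumerate()
        .filter_map(|(pos, &id)| (id == image_token_id).then_some(pos))
        .collect()
}

/// Cumulative-sum variant: for every position, `Some(k)` if it holds the
/// k-th image token (0-based), otherwise `None`. The running count plays
/// the role of `cumsum` over the mask.
/// (Hypothetical helper, not part of candle.)
fn image_feature_slots(input_ids: &[u32], image_token_id: u32) -> Vec<Option<usize>> {
    let mut seen = 0usize;
    input_ids
        .iter()
        .map(|&id| {
            if id == image_token_id {
                seen += 1;
                Some(seen - 1)
            } else {
                None
            }
        })
        .collect()
}
```

Both helpers make a single pass over the sequence, so for prompt-sized inputs the index computation itself should be cheap.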