FlashSloth: Lightning Multimodal Large Language Models via Embedded Visual Compression


👣Introduction

This repository implements FlashSloth, an innovative lightweight multimodal large language model (MLLM). Through a unique embedded visual compression design, FlashSloth significantly enhances the descriptive capabilities of visual tokens while maintaining exceptional performance, even with a substantial reduction in visual token count.


Key advantages:

  • Enhanced Visual Feature Description: Even when the visual tokens are compressed by a factor of nine, the model's performance remains comparable to its uncompressed counterpart, demonstrating strong feature extraction and compression capabilities.
  • Extremely Low Training Costs: FlashSloth is markedly more computationally efficient to train than comparable models: pre-training completes in only about 6.4 GPU hours and requires roughly 12 GB of memory per GPU.
  • Superior Inference Efficiency: In critical metrics such as TFLOPs, GPU memory usage, response time, and throughput, FlashSloth dramatically outperforms other lightweight MLLMs. Check out our paper.
  • Outstanding Model Performance: Despite using relatively limited training data, FlashSloth remains competitive with current state-of-the-art (SOTA) methods and even achieves superior results on several vision-language (VL) tasks, for instance scoring 75.7 on MMBench and 75.3 on AI2D. Check out our Model Zoo.

Model Architecture Updates:

FlashSloth enhances visual feature description by ingeniously integrating Spatial Attention Pooling (SAP) and Embedded Query Module (EmbQ), enabling precise and efficient visual information extraction:


  • Spatial Attention Pooling (SAP): This module learns attention weights that selectively aggregate visual features within local image regions, compressing redundant visual tokens while retaining the most salient and meaningful visual characteristics.
  • Embedded Query Module (EmbQ): A lightweight and highly integrated module within the FlashSloth architecture, EmbQ directly extracts instruction-relevant information from images. By eliminating the need for additional language modeling or complex alignment pre-training, the module simplifies the model design while significantly improving multi-modal input processing efficiency and accuracy.

The synergistic interaction between these two modules enables FlashSloth to achieve precise visual understanding and efficient representation while maintaining model lightweight characteristics.
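
The concrete SAP and EmbQ designs are specified in the paper and the code; purely to make the pooling idea above concrete, here is a minimal PyTorch-style sketch (not the repository's implementation) that scores the tokens inside each 3×3 window and collapses each window into a single token with a softmax-weighted sum, which corresponds to the 9× token reduction mentioned earlier. All class names, attribute names, and shapes below are illustrative assumptions.

    import torch
    import torch.nn as nn

    class SpatialAttentionPoolingSketch(nn.Module):
        """Hypothetical sketch of attention-weighted window pooling (not the official SAP module)."""

        def __init__(self, dim: int, window: int = 3):
            super().__init__()
            self.window = window
            self.score = nn.Linear(dim, 1)  # one attention logit per visual token

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (B, H, W, C) grid of visual tokens; H and W divisible by the window size
            B, H, W, C = x.shape
            k = self.window
            # partition the grid into non-overlapping k x k windows -> (B, N, k*k, C)
            x = x.reshape(B, H // k, k, W // k, k, C).permute(0, 1, 3, 2, 4, 5)
            x = x.reshape(B, (H // k) * (W // k), k * k, C)
            attn = self.score(x).softmax(dim=2)   # (B, N, k*k, 1): weights within each window
            return (attn * x).sum(dim=2)          # (B, N, C): 9x fewer tokens for k = 3

    # e.g. a 27 x 27 token grid with 1152-d features shrinks to 81 pooled tokens
    pooled = SpatialAttentionPoolingSketch(dim=1152)(torch.randn(1, 27, 27, 1152))
    print(pooled.shape)  # torch.Size([1, 81, 1152])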

📣 News

  • [2024.12.05] 🔥🔥🔥 We release the model architecture and training code for FlashSloth, and provide two evaluation methods.
  • [2024.12.06] 🚀 We release our paper on arXiv: FlashSloth: Lightning Multimodal Large Language Models via Embedded Visual Compression.
  • [2024.12.28] 💫 We release our model weights on HuggingFace. Enjoy it!
  • [2025.02.27] 🎉 Our FlashSloth has been accepted by CVPR2025.

🗓️TODO

  • Release training code.
  • Release evaluation and inference code.
  • Release paper.
  • Release checkpoints.
  • Release demo.
  • Release the model with stronger Chinese language capabilities.
  • Deployment on edge devices.
  • Support more modalities, e.g., audio and video.


🛠️ Setup

  1. Clone this repository and navigate to the FlashSloth folder

    git clone https://github.com/codefanw/FlashSloth.git
    cd FlashSloth
    
  2. Install the required packages

    conda create -n flashsloth python=3.10 -y
    conda activate flashsloth
    pip install -r requirements.txt
    pip install flash-attn==2.4.2 --no-build-isolation
    

🚀Training

The training procedure for FlashSloth consists of two stages (❄️ = frozen, 🔥 = trainable; a minimal freeze/unfreeze sketch follows the list):

  • Stage I: Pretraining
    • ❄️ Vision Encoder + 🔥 Projector & Spatial Attention Pooling + ❄️ Query Tokens + ❄️ Embedded Query Module + ❄️ LLM
    • This stage takes approximately 0.8 hours with a batch size of 256 and requires around 12 GB of GPU memory on average.
  • Stage II: Instruction Tuning
    • FlashSloth: ❄️ Vision Encoder + 🔥 Projector & Spatial Attention Pooling + 🔥 Query Tokens + 🔥 Embedded Query Module + 🔥 LLM
    • FlashSloth_HD: 🔥 Vision Encoder + 🔥 Projector & Spatial Attention Pooling + 🔥 Query Tokens + 🔥 Embedded Query Module + 🔥 LLM
    • This stage takes about 8 hours for FlashSloth, utilizing 8x A800 (80 GB) GPUs, with a batch size of 128 and an average GPU memory requirement of 47 GB.
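
As a minimal sketch of what the ❄️/🔥 split above means in practice (attribute names such as vision_encoder, projector, and sap are hypothetical placeholders, not the repository's actual module names):

    import torch.nn as nn

    def configure_stage1(model: nn.Module) -> None:
        """Stage I sketch: freeze everything, then unfreeze the projector and SAP."""
        for p in model.parameters():
            p.requires_grad_(False)
        for module in (model.projector, model.sap):   # hypothetical attribute names
            for p in module.parameters():
                p.requires_grad_(True)

    def configure_stage2(model: nn.Module, image_hd: bool = False) -> None:
        """Stage II sketch: train everything; the base model keeps the vision encoder frozen."""
        for p in model.parameters():
            p.requires_grad_(True)
        if not image_hd:
            for p in model.vision_encoder.parameters():
                p.requires_grad_(False)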

Note: To train with fewer GPUs or less memory, decrease per_device_train_batch_size and increase gradient_accumulation_steps accordingly, keeping the global batch size (per_device_train_batch_size × gradient_accumulation_steps × number of GPUs) unchanged.
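
For example, with the Stage II global batch size of 128, the required accumulation steps can be derived as below (the per-device batch sizes are illustrative, not the scripts' defaults):

    def grad_accum_steps(global_batch: int, per_device_batch: int, n_gpus: int) -> int:
        """Gradient accumulation steps needed to keep the global batch size fixed."""
        assert global_batch % (per_device_batch * n_gpus) == 0, "settings must divide the global batch"
        return global_batch // (per_device_batch * n_gpus)

    print(grad_accum_steps(128, per_device_batch=4, n_gpus=8))  # 4  (8 GPUs)
    print(grad_accum_steps(128, per_device_batch=4, n_gpus=2))  # 16 (fewer GPUs, more accumulation)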

Prepare checkpoints

export HF_ENDPOINT=https://hf-mirror.com
# (Optional) If you cannot access Hugging Face directly, use the line above to go through a mirror.
python scripts/download_models.py

Prepare data

If you want to train using the data provided by LLaVA-v1.5, you can follow the steps below:

  • Prepare alignment pre-training data

    Please download the caption annotations blip_laion_cc_sbu_558k.json and the corresponding images from the LLaVA-Pretrain dataset.

  • Prepare instruction tuning data

    Please download the annotation file of the mixed instruction-tuning data, llava_v1_5_mix665k.json, and the images from its constituent datasets (COCO train2017, GQA, OCR-VQA, TextVQA, and Visual Genome), following LLaVA's data preparation instructions (a quick sanity check is sketched below).
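
Assuming the standard LLaVA annotation format (a JSON list whose entries carry an optional "image" path relative to the image root), a quick check like the following can confirm that the images were placed correctly; the paths here are illustrative, not mandated by the repository:

    import json
    from pathlib import Path

    image_root = Path("playground/data")  # illustrative image root
    entries = json.loads(Path("llava_v1_5_mix665k.json").read_text())

    missing = [e["image"] for e in entries
               if "image" in e and not (image_root / e["image"]).exists()]
    print(f"{len(entries)} samples, {len(missing)} missing images")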

Training

  • Save model: This step combines the vision encoder with the large language model to form FlashSloth and initializes the newly added parameters so that DeepSpeed ZeRO-3 works correctly.

    bash scripts/save.sh

  • Multimodal pretraining:

    bash scripts/pretrain.sh

  • Multimodal instruction tuning: To train the high-resolution version (FlashSloth_HD), set image_hd to True.

    bash scripts/finetune.sh

⚡Evaluation

We provide two methods to evaluate FlashSloth:

Way-1: LLaVA-v1.5

We follow the evaluation protocol of LLaVA-v1.5 and conduct experiments on GQA, MMBench, MME, POPE, and TextVQA. All evaluation scripts are available in the scripts/eval directory. To prepare the task-specific data, download eval.zip and extract it to the ./playground/data/eval directory. For detailed instructions, please refer to LLaVA's Evaluation.md. You can easily run the following script to evaluate across five tasks:

bash scripts/benchmark.sh 

Way-2: lmms-eval (recommended)

For formal usage, you can install the package from PyPI by running the following command:

pip install lmms-eval

For development, you can install the package by cloning the repository and running the following command:

git clone https://github.com/EvolvingLMMs-Lab/lmms-eval
cd lmms-eval
pip install -e .

Then, you can evaluate all 14 tasks reported in our paper with a single command. Note that some tasks require an OpenAI API key (openai_api) to be configured.

bash eval.sh

Inference

You can run the following commands to start the model demo.

pip install gradio==3.43.2
python demo.py

We have also deployed a FlashSloth demo on Hugging Face Spaces; it may take a while to wake up, so please allow about 5 minutes for the online demo to become responsive.
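
demo.py wires FlashSloth into a Gradio interface; purely as an illustration of that pattern (the generate_response stub below stands in for the actual model call and is not part of the repository), a minimal Gradio 3.x app looks like this:

    import gradio as gr

    def generate_response(image, question):
        # Placeholder: the real demo runs FlashSloth on the (image, question) pair.
        return "FlashSloth's answer would appear here."

    gr.Interface(
        fn=generate_response,
        inputs=[gr.Image(type="pil"), gr.Textbox(label="Question")],
        outputs=gr.Textbox(label="Answer"),
        title="FlashSloth Demo",
    ).launch()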

🦥Model Zoo

| Model | Checkpoint | POPE | MME | MMB | MM-Vet | SEED-Image | MMMU | MathVista | GQA | SQA | TextVQA | AI2D | ChartQA | DocVQA | RealWorldQA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| FlashSloth | FlashSloth | 86.3 | 1702.0 | 73.0 | 41.9 | 68.0 | 39.7 | 42.5 | 61.1 | 88.6 | 64.6 | 72.5 | 51.0 | 48.6 | 54.8 |
| FlashSloth_HD | FlashSloth_HD | 87.2 | 1745.0 | 75.7 | 49.0 | 71.2 | 37.8 | 40.6 | 62.5 | 91.1 | 71.0 | 75.3 | 69.8 | 74.8 | 59.9 |

To help reproduce our results, we also provide the weights from the first-stage training: FlashSloth-stage1.

🤝 Acknowledgments

  • LLaVA and IMP: For providing the codebase we built upon. Thanks for their excellent work!
  • Phi and Siglip: For the amazing open-sourced base models used in our work.
  • lmms-eval: For providing the great open-sourced evaluation framework.

🌟 Star History

Star History Chart

✨ Example

(Example figures 1–6.)

✏️ Citation

If you find our paper and code helpful, we kindly invite you to give it a star and consider citing our work.

@article{tong2024flashsloth,
  title={FlashSloth: Lightning Multimodal Large Language Models via Embedded Visual Compression},
  author={Tong, Bo and Lai, Bokai and Zhou, Yiyi and Luo, Gen and Shen, Yunhang and Li, Ke and Sun, Xiaoshuai and Ji, Rongrong},
  journal={arXiv preprint arXiv:2412.04317},
  year={2024}
}
