
Rethinking Large-scale Dataset Compression: Shifting Focus From Labels to Images

[Paper | BibTeX | 🤗Dataset | 📂Logs]


Official Implementation for "Rethinking Large-scale Dataset Compression: Shifting Focus From Labels to Images".

Lingao Xiao, Songhua Liu, Yang He*, Xinchao Wang

Abstract: Dataset distillation and dataset pruning are two prominent techniques for compressing datasets to improve computational and storage efficiency. Despite their overlapping objectives, these approaches are rarely compared directly. Even within each field, evaluation protocols are inconsistent across methods, which complicates fair comparison and hinders reproducibility. Given these limitations, we introduce a benchmark that equitably evaluates methodologies from both the distillation and pruning literatures. Notably, our benchmark reveals that in the mainstream dataset distillation setting for large-scale datasets, which relies heavily on soft labels from pre-trained models, even randomly selected subsets can achieve surprisingly competitive performance. This finding suggests that an overemphasis on soft labels may be diverting attention from the intrinsic value of the image data, while also imposing additional burdens in generation, storage, and application. To address these issues, we propose a new framework for dataset compression, termed Prune, Combine, and Augment (PCA), which leverages image data exclusively, relies solely on hard labels for evaluation, and achieves state-of-the-art performance in this setup. By shifting the emphasis back to the images, our benchmark and PCA framework pave the way for more balanced and accessible techniques in dataset compression research.

TODOs

  • release large-scale benchmark
  • release SOTA datasets
  • release PCA framework
  • release PCA datasets

*Note: for the soft-label benchmark, we use fast evaluation code without relabeling.

Datasets (🤗Hugging Face)

SOTA datasets used in our experiments are available at 🤗Hugging Face. We have preprocessed all images to a fixed 224x224 resolution and created the datasets for a fair storage comparison.

Type legend: DD = Dataset Distillation, DP = Dataset Pruning.

| Method | Type | Venue | Dataset Key | Available IPCs |
| --- | --- | --- | --- | --- |
| Random | - | - | he-yang/2025-rethinkdc-imagenet-random-ipc-[IPC] | [1,10,20,50,100,200] |
| SRe2L | DD | NeurIPS'23 | he-yang/2025-rethinkdc-imagenet-sre2l-ipc-[IPC] | [10,50,100] |
| CDA | DD | TMLR'24 | he-yang/2025-rethinkdc-imagenet-cda-ipc-[IPC] | [10,50,100] |
| G-VBSM | DD | CVPR'24 | he-yang/2025-rethinkdc-imagenet-gvbsm-ipc-[IPC] | [10,50,100] |
| LPLD | DD | NeurIPS'24 | he-yang/2025-rethinkdc-imagenet-lpld-ipc-[IPC] | [10,50,100] |
| RDED | DD | CVPR'24 | he-yang/2025-rethinkdc-imagenet-rded-ipc-[IPC] | [10,50,100] |
| DWA | DD | NeurIPS'24 | he-yang/2025-rethinkdc-imagenet-dwa-ipc-[IPC] | [10,50,100] |
| Forgetting | DP | ICLR'19 | he-yang/2025-rethinkdc-imagenet-forgetting-ipc-[IPC] | [10,50,100] |
| EL2N | DP | NeurIPS'21 | he-yang/2025-rethinkdc-imagenet-el2n-ipc-[IPC] | [10,50,100] |
| AUM | DP | NeurIPS'20 | he-yang/2025-rethinkdc-imagenet-aum-ipc-[IPC] | [10,50,100] |
| CCS | DP | ICLR'23 | he-yang/2025-rethinkdc-imagenet-ccs-ipc-[IPC] | [10,50,100] |
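To inspect one of these datasets before running the benchmark, it can be loaded directly with the Hugging Face datasets library. The sketch below is illustrative only: the key and cache directory mirror the defaults documented here, but the split names and feature columns are assumptions you should confirm from the printed output.

# Minimal sketch: download and inspect a benchmark dataset from Hugging Face.
# The split/column layout is an assumption; print(ds) reveals the real one.
from datasets import load_dataset

ds = load_dataset(
    "he-yang/2025-rethinkdc-imagenet-random-ipc-10",  # key from the table above
    cache_dir="./hf_cache",  # matches the CLI default --hf-cache-dir
)
print(ds)  # DatasetDict: splits, row counts, and feature columns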

Installation

1. Install from pip (tested on python=3.12)

pip install rethinkdc
2. Or install from source

Step 1: Clone Repo,

git clone https://github.com/ArmandXiao/Rethinking-Dataset-Compression.git
cd Rethinking-Dataset-Compression

Step 2: Create Environment,

conda env create -f environment.yml
conda activate rethinkdc

Step 3: Install Benchmark,

make build
make install

Usage

1. Prepare ImageNet Validation Folder:

# download and prepare ImageNet Val (skip if you have)
wget -qO- https://github.com/ArmandXiao/Rethinking-Dataset-Compression/script/download_val.sh | bash

# set environment (IMPORTANT!)
export IMAGENET_VAL_DIR="Your ImageNet Val Path"
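Before training, it is worth confirming that IMAGENET_VAL_DIR points at a correctly laid-out folder. A minimal sketch, assuming the standard ImageNet layout of 1,000 WNID-named class subdirectories (the layout download_val.sh is expected to produce):

# Minimal sketch: check that the validation folder has the standard
# ImageNet structure of 1000 class subdirectories (e.g. n01440764/).
import os

val_dir = os.environ["IMAGENET_VAL_DIR"]
classes = [d for d in os.listdir(val_dir) if os.path.isdir(os.path.join(val_dir, d))]
print(f"{len(classes)} class folders found in {val_dir}")  # expect 1000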

2. Hyper-parameters for "rethinkdc"

rethinkdc --help
📘 Manual
Rethinking Large-scale Dataset Compression
usage: rethinkdc [-h] [--soft | --hard | --yaml YAML] [--batch-size BATCH_SIZE] [--gradient-accumulation-steps GRADIENT_ACCUMULATION_STEPS] [-j WORKERS]
                 [--val-dir VAL_DIR] [--output-dir OUTPUT_DIR] [--hf-cache-dir HF_CACHE_DIR] [--mode MODE] [--cos] [--adamw-lr ADAMW_LR]
                 [--adamw-weight-decay ADAMW_WEIGHT_DECAY] [--sgd-setting] [--hard-label] [--start-epoch START_EPOCH] [--epochs EPOCHS] [--model MODEL]
                 [--teacher-model TEACHER_MODEL] [-T TEMPERATURE] [--mix-type MIX_TYPE] [--mixup MIXUP] [--cutmix CUTMIX] [--ipc IPC] [--wandb-project WANDB_PROJECT]
                 [--wandb-api-key WANDB_API_KEY]
                 

Example Usage:
        rethinkdc he-yang/2025-rethinkdc-imagenet-random-ipc-10 --soft --ipc 10 --output-dir ./random_ipc10_soft 

                                                      options                                                       
-h, --help          ┃ show this help message and exit                            ┃ str    ┃ ==SUPPRESS==            
                                               Configuration Options                                                
--soft              ┃ Use standard_soft_config.yaml (Example: rethinkdc PATH     ┃ str    ┃ False                   
                    ┃ --soft)                                                    ┃        ┃                         
--hard              ┃ Use standard_hard_config.yaml (Example: rethinkdc PATH     ┃ str    ┃ False                   
                    ┃ --hard)                                                    ┃        ┃                         
--yaml              ┃ Custom config file (Example: rethinkdc                     ┃ str    ┃                         
                    ┃ YOUR_PATH_TO_CONFIG.yaml)                                  ┃        ┃                         
                                                    Data Options                                                    
train_dir           ┃ path to training dataset or huggingface dataset key        ┃ str    ┃                         
--batch-size        ┃ batch size                                                 ┃ int    ┃ 1024                    
--gradient-accumul… ┃ gradient accumulation steps for small gpu memory           ┃ int    ┃ 1                       
-j, --workers       ┃ number of data loading workers                             ┃ int    ┃ 16                      
--val-dir           ┃ path to validation dataset                                 ┃ str    ┃ /path/to/imagenet/val   
--output-dir        ┃ path to output dir                                         ┃ str    ┃ ./save/1024             
--hf-cache-dir      ┃ cache dir for huggingface dataset                          ┃ str    ┃ ./hf_cache              
--mode              ┃ mode for training                                          ┃ str    ┃ fkd_save                
                                                  Training Options                                                  
--cos               ┃ cosine lr scheduler                                        ┃ str    ┃ False                   
--adamw-lr          ┃ adamw learning rate                                        ┃ float  ┃ 0.001                   
--adamw-weight-dec… ┃ adamw weight decay                                         ┃ float  ┃ 0.01                    
--sgd-setting       ┃ using sgd evaluation setting (lr=0.1, scheduler=cos)       ┃ str    ┃ False                   
--hard-label        ┃ use hard label                                             ┃ str    ┃ False                   
--start-epoch       ┃ start epoch                                                ┃ int    ┃ 0                       
--epochs            ┃ total epoch                                                ┃ int    ┃ 300                     
                                                   Model Options                                                    
--model             ┃ student model name                                         ┃ str    ┃ resnet18                
--teacher-model     ┃ teacher model name                                         ┃ str    ┃                         
-T, --temperature   ┃ temperature for distillation loss                          ┃ float  ┃ 3.0                     
                                                Mixup/CutMix Options                                                
--mix-type          ┃ choices in {mixup, cutmix, None}                           ┃ str    ┃                         
--mixup             ┃ mixup alpha, mixup enabled if > 0. (default: 0.8)          ┃ float  ┃ 0.8                     
--cutmix            ┃ cutmix alpha, cutmix enabled if > 0. (default: 1.0)        ┃ float  ┃ 1.0                     
--ipc               ┃ number of images per class                                 ┃ int    ┃ 50                      
                                                   Wandb Options                                                    
--wandb-project     ┃ wandb project name                                         ┃ str    ┃ Temperature             
--wandb-api-key     ┃ wandb api key                                              ┃ str    ┃                         

For more information, please visit the project repository: https://github.com/ArmandXiao/Rethinking-Dataset-Compression
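For reference, the --soft configuration evaluates with soft labels distilled from --teacher-model, and -T controls the softmax temperature. The sketch below is the textbook temperature-scaled distillation loss (Hinton et al., 2015), shown only to illustrate what the temperature does; the benchmark's actual loss implementation may differ.

# Textbook temperature-scaled KD loss, for illustration only; not
# necessarily the exact loss used by rethinkdc.
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, T=3.0):
    log_p_student = F.log_softmax(student_logits / T, dim=1)
    p_teacher = F.softmax(teacher_logits / T, dim=1)
    # The T**2 factor keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (T ** 2)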

3. Example Usage (more examples can be found in the script folder; a hard-label variant is sketched below):

rethinkdc [YOUR_PATH_TO_DATASET] [*ARGS]

# example (test random subset)
rethinkdc he-yang/2025-rethinkdc-imagenet-random-ipc-10 --soft --ipc 10 --output-dir ./random_ipc10_soft
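A hard-label counterpart of the same run, built from the documented --hard flag (an illustrative variant, not a command taken from the script folder):

# example (test random subset with hard labels)
rethinkdc he-yang/2025-rethinkdc-imagenet-random-ipc-10 --hard --ipc 10 --output-dir ./random_ipc10_hard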

Main Table Result (📂Google Drive)

Logs for the main tables are provided in Google Drive for reference.

| Table | Explanation |
| --- | --- |
| Table 3 | Random baselines in the soft-label setting with standard evaluation |
| Table 4 & Table 18 | SOTA methods in the soft-label setting with std |
| Table 5 & Table 19 | SOTA methods in the hard-label setting with std |
| Table 6 | SOTA pruning rules |
| Table 7 | Ablation study of PCA |
| Table 8 | Cross-architecture performance of PCA |
| Table 12 & Table 22 | Regularization-based data augmentation |
| Table 20 | Pure noise as input |
| Table 24 | PCA using different pruning methods |

Related Repos

Citation

@article{xiao2025rethinkdc,
  title={Rethinking Large-scale Dataset Compression: Shifting Focus From Labels to Images},
  author={Xiao, Lingao and Liu, Songhua and He, Yang and Wang, Xinchao},
  journal={arXiv preprint arXiv:2502.06434},
  year={2025}
}
