
Commit 5ea0457

Prepare for v0.1.0 release (#322)
* update * update * update * update
1 parent ff8fddc commit 5ea0457

File tree

5 files changed · +298 −7 lines changed

README.md

Lines changed: 13 additions & 2 deletions
````diff
@@ -16,6 +16,7 @@ Finetrainers is a work-in-progress library to support (accessible) training of d
 ## Table of Contents
 
 - [Quickstart](#quickstart)
+- [Features](#features)
 - [News](#news)
 - [Support Matrix](#support-matrix)
 - [Featured Projects](#featured-projects-)
@@ -25,11 +26,11 @@ Finetrainers is a work-in-progress library to support (accessible) training of d
 
 Clone the repository and make sure the requirements are installed: `pip install -r requirements.txt` and install `diffusers` from source by `pip install git+https://github.com/huggingface/diffusers`. The requirements specify `diffusers>=0.32.1`, but it is always recommended to use the `main` branch of Diffusers for the latest features and bugfixes. Note that the `main` branch for `finetrainers` is also the development branch, and stable support should be expected from the release tags.
 
-Checkout to the latest release tag:
+Checkout to the latest stable release tag:
 
 ```bash
 git fetch --all --tags
-git checkout tags/v0.0.1
+git checkout tags/v0.1.0
 ```
 
 Follow the instructions mentioned in the [README](https://github.com/a-r-r-o-w/finetrainers/tree/v0.0.1) for the latest stable release.
@@ -51,6 +52,16 @@ Please checkout [`docs/models`](./docs/models/) and [`examples/training`](./exam
 > [!IMPORTANT]
 > It is recommended to use Pytorch 2.5.1 or above for training. Previous versions can lead to completely black videos, OOM errors, or other issues and are not tested. For fully reproducible training, please use the same environment as mentioned in [environment.md](./docs/environment.md).
 
+## Features
+
+- DDP, FSDP-2 & HSDP support for all models with low-rank and full-rank training
+- Memory-efficient single-GPU training
+- Auto-detection of commonly used dataset formats
+- Combined image/video datasets, multiple chainable local/remote datasets, multi-resolution bucketing & more
+- Memory-efficient precomputation support with/without on-the-fly precomputation for large scale datasets
+- Standardized model specification format for training arbitrary models
+- Fake FP8 training (QAT upcoming!)
+
 ## News
 
 - 🔥 **2025-03-07**: CogView4 support added!
````
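As a quick sanity check after following the quickstart above (the requirements pin `diffusers>=0.32.1`, with the `main` branch recommended), something like the following can confirm the installed Diffusers version. The exact version string will of course depend on the environment; this is only an illustrative sketch.

```python
from packaging import version

import diffusers

# The quickstart expects diffusers>=0.32.1; installing from source yields a dev version of main.
print(diffusers.__version__)
assert version.parse(diffusers.__version__) >= version.parse("0.32.1")
```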

docs/args.md

Lines changed: 275 additions & 0 deletions
# Arguments

This document lists all the arguments that can be passed to the `train.py` script. For more information, please take a look at the `finetrainers/args.py` file.

## Table of contents

- [General arguments](#general)
- [SFT training arguments](#sft-training)

## General

<!-- TODO(aryan): write a github workflow that automatically updates this page -->

```
PARALLEL ARGUMENTS
------------------
parallel_backend (`str`, defaults to `accelerate`):
    The parallel backend to use for training. Choose between ['accelerate', 'ptd'].
pp_degree (`int`, defaults to `1`):
    The degree of pipeline parallelism.
dp_degree (`int`, defaults to `1`):
    The degree of data parallelism (number of model replicas).
dp_shards (`int`, defaults to `-1`):
    The number of data parallel shards (number of model partitions).
cp_degree (`int`, defaults to `1`):
    The degree of context parallelism.

MODEL ARGUMENTS
---------------
model_name (`str`):
    Name of model to train. To get a list of models, run `python train.py --list_models`.
pretrained_model_name_or_path (`str`):
    Path to pretrained model or model identifier from https://huggingface.co/models. The model should be
    loadable based on the specified `model_name`.
revision (`str`, defaults to `None`):
    If provided, the model will be loaded from a specific branch of the model repository.
variant (`str`, defaults to `None`):
    Variant of model weights to use. Some models provide weight variants, such as `fp16`, to reduce disk
    storage requirements.
cache_dir (`str`, defaults to `None`):
    The directory where the downloaded models and datasets will be stored, or loaded from.
tokenizer_id (`str`, defaults to `None`):
    Identifier for the tokenizer model. This is useful when using a different tokenizer than the default from `pretrained_model_name_or_path`.
tokenizer_2_id (`str`, defaults to `None`):
    Identifier for the second tokenizer model. This is useful when using a different tokenizer than the default from `pretrained_model_name_or_path`.
tokenizer_3_id (`str`, defaults to `None`):
    Identifier for the third tokenizer model. This is useful when using a different tokenizer than the default from `pretrained_model_name_or_path`.
text_encoder_id (`str`, defaults to `None`):
    Identifier for the text encoder model. This is useful when using a different text encoder than the default from `pretrained_model_name_or_path`.
text_encoder_2_id (`str`, defaults to `None`):
    Identifier for the second text encoder model. This is useful when using a different text encoder than the default from `pretrained_model_name_or_path`.
text_encoder_3_id (`str`, defaults to `None`):
    Identifier for the third text encoder model. This is useful when using a different text encoder than the default from `pretrained_model_name_or_path`.
transformer_id (`str`, defaults to `None`):
    Identifier for the transformer model. This is useful when using a different transformer model than the default from `pretrained_model_name_or_path`.
vae_id (`str`, defaults to `None`):
    Identifier for the VAE model. This is useful when using a different VAE model than the default from `pretrained_model_name_or_path`.
text_encoder_dtype (`torch.dtype`, defaults to `torch.bfloat16`):
    Data type for the text encoder when generating text embeddings.
text_encoder_2_dtype (`torch.dtype`, defaults to `torch.bfloat16`):
    Data type for text encoder 2 when generating text embeddings.
text_encoder_3_dtype (`torch.dtype`, defaults to `torch.bfloat16`):
    Data type for text encoder 3 when generating text embeddings.
transformer_dtype (`torch.dtype`, defaults to `torch.bfloat16`):
    Data type for the transformer model.
vae_dtype (`torch.dtype`, defaults to `torch.bfloat16`):
    Data type for the VAE model.
layerwise_upcasting_modules (`List[str]`, defaults to `[]`):
    Modules that should have fp8 storage weights but higher precision computation. Choose between ['transformer'].
layerwise_upcasting_storage_dtype (`torch.dtype`, defaults to `float8_e4m3fn`):
    Data type for the layerwise upcasting storage. Choose between ['float8_e4m3fn', 'float8_e5m2'].
layerwise_upcasting_skip_modules_pattern (`List[str]`, defaults to `["patch_embed", "pos_embed", "x_embedder", "context_embedder", "^proj_in$", "^proj_out$", "norm"]`):
    Modules to skip for layerwise upcasting. Layers such as normalization and modulation, when cast to fp8 precision
    naively (as done in layerwise upcasting), can lead to poorer training and inference quality. We skip these layers
    by default, and recommend adding more layers to the default list based on the model architecture.

DATASET ARGUMENTS
-----------------
dataset_config (`str`):
    Path to a dataset file containing information about the training data. This file can contain information about one or
    more datasets in JSON format. The file must have a key called "datasets", which is a list of dictionaries. Each
    dictionary must contain the following keys (see the example configuration after this block):
    - "data_root": (`str`)
        The root directory containing the dataset. This parameter must be provided if `dataset_file` is not provided.
    - "dataset_file": (`str`)
        Path to a CSV/JSON/JSONL/PARQUET/ARROW/HF_HUB_DATASET file containing metadata for training. This parameter
        must be provided if `data_root` is not provided.
    - "dataset_type": (`str`)
        Type of dataset. Choose between ['image', 'video'].
    - "id_token": (`str`)
        Identifier token prepended to each prompt, if provided. This is useful for LoRA-type training
        for single subject/concept/style training, but is not necessary.
    - "image_resolution_buckets": (`List[Tuple[int, int]]`)
        Resolution buckets for images. This should be a list of tuples containing 2 values, where each tuple
        represents the resolution (height, width). All images will be resized to the nearest bucket resolution.
        This parameter must be provided if `dataset_type` is 'image'.
    - "video_resolution_buckets": (`List[Tuple[int, int, int]]`)
        Resolution buckets for videos. This should be a list of tuples containing 3 values, where each tuple
        represents the resolution (num_frames, height, width). All videos will be resized to the nearest bucket
        resolution. This parameter must be provided if `dataset_type` is 'video'.
    - "reshape_mode": (`str`)
        All input images/videos are reshaped using this mode. Choose between the following:
        ["center_crop", "random_crop", "bicubic"].
    - "remove_common_llm_caption_prefixes": (`boolean`)
        Whether or not to remove common LLM caption prefixes. See `~constants.py` for the list of common prefixes.
dataset_shuffle_buffer_size (`int`, defaults to `1`):
    The buffer size for shuffling the dataset. This is useful for shuffling the dataset before training. The default
    value of `1` means that the dataset will not be shuffled.
precomputation_items (`int`, defaults to `512`):
    Number of data samples to precompute at once for memory-efficient training. The higher this value,
    the more disk space will be used to save the precomputed samples (conditions and latents).
precomputation_dir (`str`, defaults to `None`):
    The directory where the precomputed samples will be stored. If not provided, the precomputed samples
    will be stored in a temporary directory of the output directory.
precomputation_once (`bool`, defaults to `False`):
    Precompute embeddings from all datasets at once before training. This is useful to save time during training
    with smaller datasets. If set to `False`, disk space will be saved by precomputing embeddings on-the-fly during
    training when required. Make sure to set `precomputation_items` to a reasonable value in line with the size
    of your dataset(s).

DATALOADER ARGUMENTS
--------------------
See https://pytorch.org/docs/stable/data.html for more information.

dataloader_num_workers (`int`, defaults to `0`):
    Number of subprocesses to use for data loading. `0` means that the data will be loaded in a blocking manner
    on the main process.
pin_memory (`bool`, defaults to `False`):
    Whether or not to use pinned memory in the PyTorch dataloader. This is useful for faster data loading.

DIFFUSION ARGUMENTS
-------------------
flow_resolution_shifting (`bool`, defaults to `False`):
    Resolution-dependent shifting of timestep schedules.
    [Scaling Rectified Flow Transformers for High-Resolution Image Synthesis](https://arxiv.org/abs/2403.03206).
    TODO(aryan): We don't support this yet.
flow_base_seq_len (`int`, defaults to `256`):
    Base number of tokens for images/video when applying resolution-dependent shifting.
flow_max_seq_len (`int`, defaults to `4096`):
    Maximum number of tokens for images/video when applying resolution-dependent shifting.
flow_base_shift (`float`, defaults to `0.5`):
    Base shift for timestep schedules when applying resolution-dependent shifting.
flow_max_shift (`float`, defaults to `1.15`):
    Maximum shift for timestep schedules when applying resolution-dependent shifting.
flow_shift (`float`, defaults to `1.0`):
    Instead of training with uniform/logit-normal sigmas, shift them as (shift * sigma) / (1 + (shift - 1) * sigma).
    Setting this higher is helpful when trying to train models for high-resolution generation or to produce better
    samples with a lower number of inference steps (see the sketch after this block).
flow_weighting_scheme (`str`, defaults to `none`):
    We default to the "none" weighting scheme for uniform sampling and uniform loss.
    Choose between ['sigma_sqrt', 'logit_normal', 'mode', 'cosmap', 'none'].
flow_logit_mean (`float`, defaults to `0.0`):
    Mean to use when using the `'logit_normal'` weighting scheme.
flow_logit_std (`float`, defaults to `1.0`):
    Standard deviation to use when using the `'logit_normal'` weighting scheme.
flow_mode_scale (`float`, defaults to `1.29`):
    Scale of mode weighting scheme. Only effective when using `'mode'` as the `weighting_scheme`.

TRAINING ARGUMENTS
------------------
training_type (`str`, defaults to `None`):
    Type of training to perform. Choose between ['lora'].
seed (`int`, defaults to `42`):
    A seed for reproducible training.
batch_size (`int`, defaults to `1`):
    Per-device batch size.
train_steps (`int`, defaults to `1000`):
    Total number of training steps to perform.
max_data_samples (`int`, defaults to `2**64`):
    Maximum number of data samples observed during training. If this is less than the number required by
    `train_steps`, training will stop early.
gradient_accumulation_steps (`int`, defaults to `1`):
    Number of gradient steps to accumulate before performing an optimizer step.
gradient_checkpointing (`bool`, defaults to `False`):
    Whether or not to use gradient/activation checkpointing to save memory at the expense of a slower
    backward pass.
checkpointing_steps (`int`, defaults to `500`):
    Save a checkpoint of the training state every X training steps. These checkpoints can be used both
    as final checkpoints in case they are better than the last checkpoint, and are also suitable for
    resuming training using `resume_from_checkpoint`.
checkpointing_limit (`int`, defaults to `None`):
    Maximum number of checkpoints to store.
resume_from_checkpoint (`str`, defaults to `None`):
    Whether training should be resumed from a previous checkpoint. Use a path saved by `checkpointing_steps`,
    or `"latest"` to automatically select the last available checkpoint.

OPTIMIZER ARGUMENTS
-------------------
optimizer (`str`, defaults to `adamw`):
    The optimizer type to use. Choose between the following:
    - Torch optimizers: ["adam", "adamw"]
    - Bitsandbytes optimizers: ["adam-bnb", "adamw-bnb", "adam-bnb-8bit", "adamw-bnb-8bit"]
lr (`float`, defaults to `1e-4`):
    Initial learning rate (after the potential warmup period) to use.
lr_scheduler (`str`, defaults to `cosine_with_restarts`):
    The scheduler type to use. Choose between ['linear', 'cosine', 'cosine_with_restarts', 'polynomial',
    'constant', 'constant_with_warmup'].
lr_warmup_steps (`int`, defaults to `500`):
    Number of steps for the warmup in the lr scheduler.
lr_num_cycles (`int`, defaults to `1`):
    Number of hard resets of the lr in cosine_with_restarts scheduler.
lr_power (`float`, defaults to `1.0`):
    Power factor of the polynomial scheduler.
beta1 (`float`, defaults to `0.9`):
beta2 (`float`, defaults to `0.95`):
beta3 (`float`, defaults to `0.999`):
weight_decay (`float`, defaults to `0.0001`):
    Penalty for large weights in the model.
epsilon (`float`, defaults to `1e-8`):
    Small value to avoid division by zero in the optimizer.
max_grad_norm (`float`, defaults to `1.0`):
    Maximum gradient norm to clip the gradients.

VALIDATION ARGUMENTS
--------------------
validation_dataset_file (`str`, defaults to `None`):
    Path to a CSV/JSON/PARQUET/ARROW file containing information for validation. The file must contain at least the
    "caption" column. Other columns such as "image_path" and "video_path" can be provided too. If provided, "image_path"
    will be used to load a PIL.Image.Image and set the "image" key in the sample dictionary. Similarly, "video_path"
    will be used to load a List[PIL.Image.Image] and set the "video" key in the sample dictionary.
    The validation dataset file may contain other attributes specific to inference/validation such as:
    - "height" and "width" and "num_frames": Resolution
    - "num_inference_steps": Number of inference steps
    - "guidance_scale": Classifier-free Guidance Scale
    - ... (any number of additional attributes can be provided. The ModelSpecification::validate method will be
      invoked with the sample dictionary to validate the sample.)
validation_steps (`int`, defaults to `500`):
    Number of training steps after which a validation step is performed.
enable_model_cpu_offload (`bool`, defaults to `False`):
    Whether or not to offload different modeling components to CPU during validation.

MISCELLANEOUS ARGUMENTS
-----------------------
tracker_name (`str`, defaults to `finetrainers`):
    Name of the tracker/project to use for logging training metrics.
push_to_hub (`bool`, defaults to `False`):
    Whether or not to push the model to the Hugging Face Hub.
hub_token (`str`, defaults to `None`):
    The API token to use for pushing the model to the Hugging Face Hub.
hub_model_id (`str`, defaults to `None`):
    The model identifier to use for pushing the model to the Hugging Face Hub.
output_dir (`str`, defaults to `None`):
    The directory where the model checkpoints and logs will be stored.
logging_dir (`str`, defaults to `logs`):
    The directory where the logs will be stored.
logging_steps (`int`, defaults to `1`):
    Training logs will be tracked every `logging_steps` steps.
allow_tf32 (`bool`, defaults to `False`):
    Whether or not to allow the use of TF32 matmul on compatible hardware.
nccl_timeout (`int`, defaults to `1800`):
    Timeout (in seconds) for NCCL communication.
report_to (`str`, defaults to `wandb`):
    The name of the logger to use for logging training metrics. Choose between ['wandb'].
verbose (`int`, defaults to `1`):
    Verbosity level for logging:
    - 0: Diffusers/Transformers warning logging on local main process only
    - 1: Diffusers/Transformers info logging on local main process only
    - 2: Diffusers/Transformers debug logging on local main process only
    - 3: Diffusers/Transformers debug logging on all processes
```
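To make the `dataset_config` schema described under DATASET ARGUMENTS concrete, here is a minimal sketch that writes a hypothetical two-dataset configuration from Python. The paths, the `id_token`, and the bucket sizes are placeholder values; only the keys follow the description above.

```python
import json

# Hypothetical dataset configuration following the schema described above.
# Paths, id_token, and resolution buckets are placeholder values.
config = {
    "datasets": [
        {
            "data_root": "/data/my-image-dataset",  # folder-based dataset
            "dataset_type": "image",
            "id_token": "SKS",  # optional trigger token prepended to prompts
            "image_resolution_buckets": [[512, 512], [768, 768]],  # (height, width)
            "reshape_mode": "center_crop",
            "remove_common_llm_caption_prefixes": True,
        },
        {
            "dataset_file": "/data/my-video-dataset/metadata.csv",  # metadata-file-based dataset
            "dataset_type": "video",
            "video_resolution_buckets": [[49, 480, 832]],  # (num_frames, height, width)
            "reshape_mode": "center_crop",
            "remove_common_llm_caption_prefixes": True,
        },
    ]
}

with open("dataset_config.json", "w") as f:
    json.dump(config, f, indent=2)
```

The resulting `dataset_config.json` would then be passed to `train.py` through the `dataset_config` argument described above.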
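Similarly, the `flow_shift` transform listed under DIFFUSION ARGUMENTS, `(shift * sigma) / (1 + (shift - 1) * sigma)`, can be made concrete with a small standalone sketch (plain Python, not the library's internal implementation):

```python
def shift_sigmas(sigmas, shift=1.0):
    # Applies the flow_shift transform described above to sigmas in [0, 1].
    return [(shift * s) / (1 + (shift - 1) * s) for s in sigmas]

sigmas = [0.1, 0.25, 0.5, 0.75, 0.9]
print(shift_sigmas(sigmas, shift=1.0))  # shift=1.0 leaves the sigmas unchanged
print(shift_sigmas(sigmas, shift=3.0))  # shift>1 pushes sigmas toward 1, e.g. 0.5 -> 0.75
```

Shifts greater than 1 push sigmas toward 1, i.e. toward noisier timesteps, which matches the guidance above about high-resolution training and sampling with fewer inference steps.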
## SFT training

If using `--training_type lora`, these arguments can be specified.

```
rank (int):
    Rank of the low rank approximation.
lora_alpha (int):
    The lora_alpha parameter to compute scaling factor (lora_alpha / rank) for low-rank matrices.
target_modules (`str` or `List[str]`):
    Target modules for the low rank approximation. Can be a regex string or a list of regex strings.
```

No additional arguments are required for `--training_type full-finetune`.
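For reference, the `(lora_alpha / rank)` scaling mentioned above is the standard LoRA formulation, where the low-rank update is scaled before being added to the frozen weight. A generic sketch (not finetrainers' internal implementation; shapes and values are arbitrary):

```python
import numpy as np

d_out, d_in, rank, lora_alpha = 64, 64, 8, 16

W = np.random.randn(d_out, d_in) * 0.02  # frozen base weight
A = np.random.randn(rank, d_in) * 0.02   # trainable low-rank factor A
B = np.zeros((d_out, rank))              # trainable low-rank factor B, zero-initialized

scaling = lora_alpha / rank              # the (lora_alpha / rank) factor described above
W_adapted = W + scaling * (B @ A)        # effective weight used in the forward pass
print(W_adapted.shape, scaling)          # (64, 64) 2.0
```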

docs/models/README.md

Lines changed: 6 additions & 4 deletions
````diff
@@ -1,4 +1,4 @@
-# FineTrainers training documentation
+# Finetrainers training documentation
 
 This directory contains the training-related specifications for all the models we support in `finetrainers`. Each model page has:
 - an example training command
@@ -20,9 +20,11 @@ The following table shows the algorithms supported for training and the models t
 
 | Model | SFT | Control | ControlNet | Distillation |
 |:-----------------------------------------:|:---:|:-------:|:----------:|:------------:|
-| [CogVideoX](./cogvideox.md) | 🤗 | 😡 | 😡 | 😡 |
-| [LTX-Video](./ltx_video.md) | 🤗 | 😡 | 😡 | 😡 |
-| [HunyuanVideo](./hunyuan_video.md)) | 🤗 | 😡 | 😡 | 😡 |
+| [CogVideoX](./cogvideox.md) | 🤗 | 😡 | 😡 | 😡 |
+| [CogView4](./cogview4.md) | 🤗 | 😡 | 😡 | 😡 |
+| [HunyuanVideo](./hunyuan_video.md) | 🤗 | 😡 | 😡 | 😡 |
+| [LTX-Video](./ltx_video.md) | 🤗 | 😡 | 😡 | 😡 |
+| [Wan](./wan.md) | 🤗 | 😡 | 😡 | 😡 |
 
 For launching SFT Training:
 - `--training_type lora`: Trains a new set of low-rank weights of the model, yielding a smaller adapter model. Currently, only LoRA is supported from [🤗 PEFT](https://github.com/huggingface/peft)
````

finetrainers/__init__.py

Lines changed: 3 additions & 0 deletions
````diff
@@ -3,3 +3,6 @@
 from .logging import get_logger
 from .models import ModelSpecification
 from .trainer import SFTTrainer
+
+
+__version__ = "0.1.0"
````
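With `__version__` now exposed at the package root, a quick way to confirm which release is installed (assuming `finetrainers` is importable in the current environment):

```python
import finetrainers

# Should print "0.1.0" when this release tag is checked out / installed.
print(finetrainers.__version__)
```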

setup.py

Lines changed: 1 addition & 1 deletion
````diff
@@ -9,7 +9,7 @@
 
 setup(
     name="finetrainers",
-    version="0.0.1",
+    version="0.1.0",
     description="Finetrainers is a work-in-progress library to support (accessible) training of diffusion models",
     long_description=long_description,
     long_description_content_type="text/markdown",
````
