# Arguments

This document lists all the arguments that can be passed to the `train.py` script. For more information, see the `finetrainers/args.py` file.

## Table of contents

- [General arguments](#general)
- [SFT training arguments](#sft-training)

## General

<!-- TODO(aryan): write a github workflow that automatically updates this page -->

```
PARALLEL ARGUMENTS
------------------
parallel_backend (`str`, defaults to `accelerate`):
  The parallel backend to use for training. Choose between ['accelerate', 'ptd'].
pp_degree (`int`, defaults to `1`):
  The degree of pipeline parallelism.
dp_degree (`int`, defaults to `1`):
  The degree of data parallelism (number of model replicas).
dp_shards (`int`, defaults to `-1`):
  The number of data parallel shards (number of model partitions).
cp_degree (`int`, defaults to `1`):
  The degree of context parallelism.

MODEL ARGUMENTS
---------------
model_name (`str`):
  Name of model to train. To get a list of models, run `python train.py --list_models`.
pretrained_model_name_or_path (`str`):
  Path to pretrained model or model identifier from https://huggingface.co/models. The model should be
  loadable based on the specified `model_name`.
revision (`str`, defaults to `None`):
  If provided, the model will be loaded from a specific branch of the model repository.
variant (`str`, defaults to `None`):
  Variant of model weights to use. Some models provide weight variants, such as `fp16`, to reduce disk
  storage requirements.
cache_dir (`str`, defaults to `None`):
  The directory where the downloaded models and datasets will be stored, or loaded from.
tokenizer_id (`str`, defaults to `None`):
  Identifier for the tokenizer model. This is useful when using a different tokenizer than the default from `pretrained_model_name_or_path`.
tokenizer_2_id (`str`, defaults to `None`):
  Identifier for the second tokenizer model. This is useful when using a different tokenizer than the default from `pretrained_model_name_or_path`.
tokenizer_3_id (`str`, defaults to `None`):
  Identifier for the third tokenizer model. This is useful when using a different tokenizer than the default from `pretrained_model_name_or_path`.
text_encoder_id (`str`, defaults to `None`):
  Identifier for the text encoder model. This is useful when using a different text encoder than the default from `pretrained_model_name_or_path`.
text_encoder_2_id (`str`, defaults to `None`):
  Identifier for the second text encoder model. This is useful when using a different text encoder than the default from `pretrained_model_name_or_path`.
text_encoder_3_id (`str`, defaults to `None`):
  Identifier for the third text encoder model. This is useful when using a different text encoder than the default from `pretrained_model_name_or_path`.
transformer_id (`str`, defaults to `None`):
  Identifier for the transformer model. This is useful when using a different transformer model than the default from `pretrained_model_name_or_path`.
vae_id (`str`, defaults to `None`):
  Identifier for the VAE model. This is useful when using a different VAE model than the default from `pretrained_model_name_or_path`.
text_encoder_dtype (`torch.dtype`, defaults to `torch.bfloat16`):
  Data type for the text encoder when generating text embeddings.
text_encoder_2_dtype (`torch.dtype`, defaults to `torch.bfloat16`):
  Data type for the text encoder 2 when generating text embeddings.
text_encoder_3_dtype (`torch.dtype`, defaults to `torch.bfloat16`):
  Data type for the text encoder 3 when generating text embeddings.
transformer_dtype (`torch.dtype`, defaults to `torch.bfloat16`):
  Data type for the transformer model.
vae_dtype (`torch.dtype`, defaults to `torch.bfloat16`):
  Data type for the VAE model.
layerwise_upcasting_modules (`List[str]`, defaults to `[]`):
  Modules that should have fp8 storage weights but higher precision computation. Choose between ['transformer'].
layerwise_upcasting_storage_dtype (`torch.dtype`, defaults to `float8_e4m3fn`):
  Data type for the layerwise upcasting storage. Choose between ['float8_e4m3fn', 'float8_e5m2'].
layerwise_upcasting_skip_modules_pattern (`List[str]`, defaults to `["patch_embed", "pos_embed", "x_embedder", "context_embedder", "^proj_in$", "^proj_out$", "norm"]`):
  Modules to skip for layerwise upcasting. Layers such as normalization and modulation, when cast to fp8 precision
  naively (as done in layerwise upcasting), can lead to poorer training and inference quality. We skip these layers
  by default, and recommend adding more layers to the default list based on the model architecture.

DATASET ARGUMENTS
-----------------
dataset_config (`str`):
  Path to a dataset configuration file containing information about the training data. This file can describe one or
  more datasets in JSON format. The file must have a key called "datasets", which is a list of dictionaries. Each
  dictionary must contain the following keys (a short sketch of such a file is shown after this section):
    - "data_root": (`str`)
      The root directory containing the dataset. This parameter must be provided if `dataset_file` is not provided.
    - "dataset_file": (`str`)
      Path to a CSV/JSON/JSONL/PARQUET/ARROW/HF_HUB_DATASET file containing metadata for training. This parameter
      must be provided if `data_root` is not provided.
    - "dataset_type": (`str`)
      Type of dataset. Choose between ['image', 'video'].
    - "id_token": (`str`)
      Identifier token appended to the start of each prompt if provided. This is useful for LoRA-type training
      on a single subject/concept/style, but is not required.
    - "image_resolution_buckets": (`List[Tuple[int, int]]`)
      Resolution buckets for images. This should be a list of tuples containing 2 values, where each tuple
      represents the resolution (height, width). All images will be resized to the nearest bucket resolution.
      This parameter must be provided if `dataset_type` is 'image'.
    - "video_resolution_buckets": (`List[Tuple[int, int, int]]`)
      Resolution buckets for videos. This should be a list of tuples containing 3 values, where each tuple
      represents the resolution (num_frames, height, width). All videos will be resized to the nearest bucket
      resolution. This parameter must be provided if `dataset_type` is 'video'.
    - "reshape_mode": (`str`)
      All input images/videos are reshaped using this mode. Choose between the following:
      ["center_crop", "random_crop", "bicubic"].
    - "remove_common_llm_caption_prefixes": (`boolean`)
      Whether or not to remove common LLM caption prefixes. See `~constants.py` for the list of common prefixes.
dataset_shuffle_buffer_size (`int`, defaults to `1`):
  The buffer size for shuffling the dataset before training. The default value of `1` means that the dataset will
  not be shuffled.
precomputation_items (`int`, defaults to `512`):
  Number of data samples to precompute at once for memory-efficient training. The higher this value,
  the more disk space will be used to save the precomputed samples (conditions and latents).
precomputation_dir (`str`, defaults to `None`):
  The directory where the precomputed samples will be stored. If not provided, the precomputed samples
  will be stored in a temporary directory inside the output directory.
precomputation_once (`bool`, defaults to `False`):
  Precompute embeddings from all datasets at once before training. This is useful to save time during training
  with smaller datasets. If set to `False`, disk space is saved by precomputing embeddings on-the-fly during
  training when required. Make sure to set `precomputation_items` to a reasonable value in line with the size
  of your dataset(s).

DATALOADER ARGUMENTS
--------------------
See https://pytorch.org/docs/stable/data.html for more information.

dataloader_num_workers (`int`, defaults to `0`):
  Number of subprocesses to use for data loading. `0` means that the data will be loaded in a blocking manner
  on the main process.
pin_memory (`bool`, defaults to `False`):
  Whether or not to use the pinned memory setting in the PyTorch dataloader. This is useful for faster data loading.

DIFFUSION ARGUMENTS
-------------------
flow_resolution_shifting (`bool`, defaults to `False`):
  Resolution-dependent shifting of timestep schedules.
  [Scaling Rectified Flow Transformers for High-Resolution Image Synthesis](https://arxiv.org/abs/2403.03206).
  TODO(aryan): We don't support this yet.
flow_base_seq_len (`int`, defaults to `256`):
  Base number of tokens for images/video when applying resolution-dependent shifting.
flow_max_seq_len (`int`, defaults to `4096`):
  Maximum number of tokens for images/video when applying resolution-dependent shifting.
flow_base_shift (`float`, defaults to `0.5`):
  Base shift for timestep schedules when applying resolution-dependent shifting.
flow_max_shift (`float`, defaults to `1.15`):
  Maximum shift for timestep schedules when applying resolution-dependent shifting.
flow_shift (`float`, defaults to `1.0`):
  Instead of training with uniform/logit-normal sigmas, shift them as (shift * sigma) / (1 + (shift - 1) * sigma).
  Setting this higher is helpful when training models for high-resolution generation or for producing better
  samples with fewer inference steps (a short worked example is shown after this section).
flow_weighting_scheme (`str`, defaults to `none`):
  We default to the "none" weighting scheme for uniform sampling and uniform loss.
  Choose between ['sigma_sqrt', 'logit_normal', 'mode', 'cosmap', 'none'].
flow_logit_mean (`float`, defaults to `0.0`):
  Mean to use when using the `'logit_normal'` weighting scheme.
flow_logit_std (`float`, defaults to `1.0`):
  Standard deviation to use when using the `'logit_normal'` weighting scheme.
flow_mode_scale (`float`, defaults to `1.29`):
  Scale of mode weighting scheme. Only effective when using `'mode'` as the `weighting_scheme`.

TRAINING ARGUMENTS
------------------
training_type (`str`, defaults to `None`):
  Type of training to perform. Choose between ['lora'].
seed (`int`, defaults to `42`):
  A seed for reproducible training.
batch_size (`int`, defaults to `1`):
  Per-device batch size.
train_steps (`int`, defaults to `1000`):
  Total number of training steps to perform.
max_data_samples (`int`, defaults to `2**64`):
  Maximum number of data samples observed during training. If this is lower than the number required by
  `train_steps`, training will stop early.
gradient_accumulation_steps (`int`, defaults to `1`):
  Number of gradient steps to accumulate before performing an optimizer step.
gradient_checkpointing (`bool`, defaults to `False`):
  Whether or not to use gradient/activation checkpointing to save memory at the expense of a slower
  backward pass.
checkpointing_steps (`int`, defaults to `500`):
  Save a checkpoint of the training state every X training steps. These checkpoints can be used as final
  checkpoints if they turn out better than the last one, and are also suitable for resuming training via
  `resume_from_checkpoint`.
checkpointing_limit (`int`, defaults to `None`):
  Maximum number of checkpoints to store.
resume_from_checkpoint (`str`, defaults to `None`):
  Resume training from a previous checkpoint. Pass a path saved by `checkpointing_steps`, or `"latest"` to
  automatically select the last available checkpoint.

OPTIMIZER ARGUMENTS
-------------------
optimizer (`str`, defaults to `adamw`):
  The optimizer type to use. Choose between the following:
    - Torch optimizers: ["adam", "adamw"]
    - Bitsandbytes optimizers: ["adam-bnb", "adamw-bnb", "adam-bnb-8bit", "adamw-bnb-8bit"]
lr (`float`, defaults to `1e-4`):
  Initial learning rate (after the potential warmup period) to use.
lr_scheduler (`str`, defaults to `cosine_with_restarts`):
  The scheduler type to use. Choose between ['linear', 'cosine', 'cosine_with_restarts', 'polynomial',
  'constant', 'constant_with_warmup'].
lr_warmup_steps (`int`, defaults to `500`):
  Number of steps for the warmup in the lr scheduler.
lr_num_cycles (`int`, defaults to `1`):
  Number of hard resets of the lr in cosine_with_restarts scheduler.
lr_power (`float`, defaults to `1.0`):
  Power factor of the polynomial scheduler.
beta1 (`float`, defaults to `0.9`):
  Exponential decay rate for the first-moment estimates in Adam-style optimizers.
beta2 (`float`, defaults to `0.95`):
  Exponential decay rate for the second-moment estimates in Adam-style optimizers.
beta3 (`float`, defaults to `0.999`):
weight_decay (`float`, defaults to `0.0001`):
  Penalty for large weights in the model.
epsilon (`float`, defaults to `1e-8`):
  Small value to avoid division by zero in the optimizer.
max_grad_norm (`float`, defaults to `1.0`):
  Maximum gradient norm to clip the gradients.

VALIDATION ARGUMENTS
--------------------
validation_dataset_file (`str`, defaults to `None`):
  Path to a CSV/JSON/PARQUET/ARROW file containing information for validation. The file must contain at least the
  "caption" column. Other columns such as "image_path" and "video_path" can be provided too. If provided, "image_path"
  will be used to load a PIL.Image.Image and set the "image" key in the sample dictionary. Similarly, "video_path"
  will be used to load a List[PIL.Image.Image] and set the "video" key in the sample dictionary.
  The validation dataset file may contain other attributes specific to inference/validation such as:
    - "height" and "width" and "num_frames": Resolution
    - "num_inference_steps": Number of inference steps
    - "guidance_scale": Classifier-free Guidance Scale
    - ... (any number of additional attributes can be provided. The ModelSpecification::validate method will be
      invoked with the sample dictionary to validate the sample.)
  A short sketch of creating such a file is shown after this section.
validation_steps (`int`, defaults to `500`):
  Number of training steps after which a validation step is performed.
enable_model_cpu_offload (`bool`, defaults to `False`):
  Whether or not to offload different modeling components to CPU during validation.

MISCELLANEOUS ARGUMENTS
-----------------------
tracker_name (`str`, defaults to `finetrainers`):
  Name of the tracker/project to use for logging training metrics.
push_to_hub (`bool`, defaults to `False`):
  Whether or not to push the model to the Hugging Face Hub.
hub_token (`str`, defaults to `None`):
  The API token to use for pushing the model to the Hugging Face Hub.
hub_model_id (`str`, defaults to `None`):
  The model identifier to use for pushing the model to the Hugging Face Hub.
output_dir (`str`, defaults to `None`):
  The directory where the model checkpoints and logs will be stored.
logging_dir (`str`, defaults to `logs`):
  The directory where the logs will be stored.
logging_steps (`int`, defaults to `1`):
  Training logs will be tracked every `logging_steps` steps.
allow_tf32 (`bool`, defaults to `False`):
  Whether or not to allow the use of TF32 matmul on compatible hardware.
nccl_timeout (`int`, defaults to `1800`):
  Timeout for NCCL communication.
report_to (`str`, defaults to `wandb`):
  The name of the logger to use for logging training metrics. Choose between ['wandb'].
verbose (`int`, defaults to `1`):
  Verbosity level of the logs:
  - 0: Diffusers/Transformers warning logging on local main process only
  - 1: Diffusers/Transformers info logging on local main process only
  - 2: Diffusers/Transformers debug logging on local main process only
  - 3: Diffusers/Transformers debug logging on all processes
```
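
The `dataset_config` argument described above expects a JSON file with a top-level `"datasets"` list. As a minimal sketch (the paths, identifier token, and bucket sizes below are hypothetical placeholders, not defaults), such a file could be written with a few lines of Python:

```
import json

# Hypothetical dataset configuration; adjust the paths, buckets and options to
# your own data. The keys follow the DATASET ARGUMENTS section above.
config = {
    "datasets": [
        {
            "data_root": "/path/to/my/videos",             # or use "dataset_file" instead
            "dataset_type": "video",
            "id_token": "SKS",                             # optional trigger token for LoRA-style training
            "video_resolution_buckets": [[49, 480, 720]],  # (num_frames, height, width)
            "reshape_mode": "center_crop",
            "remove_common_llm_caption_prefixes": True,
        }
    ]
}

with open("my_dataset_config.json", "w") as f:
    json.dump(config, f, indent=2)
```

The resulting path is what gets passed as the `dataset_config` argument when invoking `train.py`.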
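
The `flow_shift` argument rescales sampled sigmas as `(shift * sigma) / (1 + (shift - 1) * sigma)`, exactly as stated above. A tiny sketch of the effect (plain Python, no finetrainers internals):

```
# With shift > 1, intermediate sigmas are pushed towards 1.0 (the noisier end of
# the schedule), which is the regime described above for high-resolution training.
def shift_sigma(sigma: float, shift: float) -> float:
    return (shift * sigma) / (1 + (shift - 1) * sigma)

for sigma in (0.25, 0.5, 0.75):
    print(sigma, "->", round(shift_sigma(sigma, shift=3.0), 3))
# 0.25 -> 0.5, 0.5 -> 0.75, 0.75 -> 0.9
```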
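
The `validation_dataset_file` can be, for example, a CSV file whose only required column is "caption"; additional columns become per-sample inference attributes as described above (which attributes are actually honored depends on the model's `ModelSpecification::validate` implementation). A minimal sketch with made-up values:

```
import csv

# Hypothetical validation file; "caption" is required, the remaining columns are
# optional inference attributes described in VALIDATION ARGUMENTS.
rows = [
    {
        "caption": "A video of a cat chasing a ball in a garden",
        "height": 480,
        "width": 720,
        "num_frames": 49,
        "num_inference_steps": 50,
        "guidance_scale": 5.0,
    }
]

with open("validation.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
```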

## SFT training

If using `--training_type lora`, these arguments can be specified.

```
rank (`int`):
  Rank of the low-rank approximation.
lora_alpha (`int`):
  The lora_alpha parameter to compute the scaling factor (lora_alpha / rank) for the low-rank matrices
  (a short sketch of this scaling is shown after this section).
target_modules (`str` or `List[str]`):
  Target modules for the low-rank approximation. Can be a regex string or a list of regex strings.
```
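
As noted above, the low-rank update is scaled by `lora_alpha / rank`. A short sketch of what that means, using the generic LoRA formulation rather than finetrainers-specific code (layer sizes are hypothetical):

```
import torch

rank, lora_alpha = 16, 32
in_features, out_features = 3072, 3072  # hypothetical layer size

# Generic LoRA factors: B starts at zero so the update is zero at initialization.
A = torch.randn(rank, in_features) * 0.01
B = torch.zeros(out_features, rank)

scaling = lora_alpha / rank   # 2.0 for this choice of hyperparameters
delta_W = scaling * (B @ A)   # effective update added to the frozen weight
print(delta_W.shape)          # torch.Size([3072, 3072])
```

Raising `lora_alpha` relative to `rank` therefore increases the magnitude of the learned update without changing the number of trainable parameters.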

No additional arguments are required for `--training_type full-finetune`.