# Arguments

This document lists all the arguments that can be passed to the `train.py` script. For more information, see the `finetrainers/args.py` file.

## Table of contents

- [General arguments](#general)
- [SFT training arguments](#sft-training)

## General

<!-- TODO(aryan): write a github workflow that automatically updates this page -->

```
PARALLEL ARGUMENTS
------------------
parallel_backend (`str`, defaults to `accelerate`):
  The parallel backend to use for training. Choose between ['accelerate', 'ptd'].
pp_degree (`int`, defaults to `1`):
  The degree of pipeline parallelism.
dp_degree (`int`, defaults to `1`):
  The degree of data parallelism (number of model replicas).
dp_shards (`int`, defaults to `-1`):
  The number of data parallel shards (number of model partitions).
cp_degree (`int`, defaults to `1`):
  The degree of context parallelism.

MODEL ARGUMENTS
---------------
model_name (`str`):
  Name of model to train. To get a list of models, run `python train.py --list_models`.
pretrained_model_name_or_path (`str`):
  Path to pretrained model or model identifier from https://huggingface.co/models. The model should be
  loadable based on the specified `model_name`.
revision (`str`, defaults to `None`):
  If provided, the model will be loaded from a specific branch of the model repository.
variant (`str`, defaults to `None`):
  Variant of model weights to use. Some models provide weight variants, such as `fp16`, to reduce disk
  storage requirements.
cache_dir (`str`, defaults to `None`):
  The directory where the downloaded models and datasets will be stored, or loaded from.
tokenizer_id (`str`, defaults to `None`):
  Identifier for the tokenizer model. This is useful when using a different tokenizer than the default from `pretrained_model_name_or_path`.
tokenizer_2_id (`str`, defaults to `None`):
  Identifier for the second tokenizer model. This is useful when using a different tokenizer than the default from `pretrained_model_name_or_path`.
tokenizer_3_id (`str`, defaults to `None`):
  Identifier for the third tokenizer model. This is useful when using a different tokenizer than the default from `pretrained_model_name_or_path`.
text_encoder_id (`str`, defaults to `None`):
  Identifier for the text encoder model. This is useful when using a different text encoder than the default from `pretrained_model_name_or_path`.
text_encoder_2_id (`str`, defaults to `None`):
  Identifier for the second text encoder model. This is useful when using a different text encoder than the default from `pretrained_model_name_or_path`.
text_encoder_3_id (`str`, defaults to `None`):
  Identifier for the third text encoder model. This is useful when using a different text encoder than the default from `pretrained_model_name_or_path`.
transformer_id (`str`, defaults to `None`):
  Identifier for the transformer model. This is useful when using a different transformer model than the default from `pretrained_model_name_or_path`.
vae_id (`str`, defaults to `None`):
  Identifier for the VAE model. This is useful when using a different VAE model than the default from `pretrained_model_name_or_path`.
text_encoder_dtype (`torch.dtype`, defaults to `torch.bfloat16`):
  Data type for the text encoder when generating text embeddings.
text_encoder_2_dtype (`torch.dtype`, defaults to `torch.bfloat16`):
  Data type for the text encoder 2 when generating text embeddings.
text_encoder_3_dtype (`torch.dtype`, defaults to `torch.bfloat16`):
  Data type for the text encoder 3 when generating text embeddings.
transformer_dtype (`torch.dtype`, defaults to `torch.bfloat16`):
  Data type for the transformer model.
vae_dtype (`torch.dtype`, defaults to `torch.bfloat16`):
  Data type for the VAE model.
layerwise_upcasting_modules (`List[str]`, defaults to `[]`):
  Modules that should have fp8 storage weights but higher precision computation. Choose between ['transformer'].
layerwise_upcasting_storage_dtype (`torch.dtype`, defaults to `float8_e4m3fn`):
  Data type for the layerwise upcasting storage. Choose between ['float8_e4m3fn', 'float8_e5m2'].
layerwise_upcasting_skip_modules_pattern (`List[str]`, defaults to `["patch_embed", "pos_embed", "x_embedder", "context_embedder", "^proj_in$", "^proj_out$", "norm"]`):
  Modules to skip for layerwise upcasting. Layers such as normalization and modulation, when cast to fp8 precision
  naively (as done in layerwise upcasting), can lead to poorer training and inference quality. We skip these layers
  by default, and recommend adding more layers to the default list based on the model architecture.

DATASET ARGUMENTS
-----------------
dataset_config (`str`):
  Path to a dataset configuration file containing information about the training data. This file can describe one or
  more datasets in JSON format. The file must have a key called "datasets", which is a list of dictionaries. Each
  dictionary must contain the following keys (a short sketch of such a file is shown after this section):
    - "data_root": (`str`)
      The root directory containing the dataset. This parameter must be provided if `dataset_file` is not provided.
    - "dataset_file": (`str`)
      Path to a CSV/JSON/JSONL/PARQUET/ARROW/HF_HUB_DATASET file containing metadata for training. This parameter
      must be provided if `data_root` is not provided.
    - "dataset_type": (`str`)
      Type of dataset. Choose between ['image', 'video'].
    - "id_token": (`str`)
      Identifier token appended to the start of each prompt if provided. This is useful for LoRA-type training
      on a single subject/concept/style, but is not required.
    - "image_resolution_buckets": (`List[Tuple[int, int]]`)
      Resolution buckets for images. This should be a list of tuples containing 2 values, where each tuple
      represents the resolution (height, width). All images will be resized to the nearest bucket resolution.
      This parameter must be provided if `dataset_type` is 'image'.
    - "video_resolution_buckets": (`List[Tuple[int, int, int]]`)
      Resolution buckets for videos. This should be a list of tuples containing 3 values, where each tuple
      represents the resolution (num_frames, height, width). All videos will be resized to the nearest bucket
      resolution. This parameter must be provided if `dataset_type` is 'video'.
    - "reshape_mode": (`str`)
      All input images/videos are reshaped using this mode. Choose between the following:
      ["center_crop", "random_crop", "bicubic"].
    - "remove_common_llm_caption_prefixes": (`boolean`)
      Whether or not to remove common LLM caption prefixes. See `~constants.py` for the list of common prefixes.
dataset_shuffle_buffer_size (`int`, defaults to `1`):
  The buffer size for shuffling the dataset before training. The default value of `1` means that the dataset will
  not be shuffled.
precomputation_items (`int`, defaults to `512`):
  Number of data samples to precompute at once for memory-efficient training. The higher this value,
  the more disk space will be used to save the precomputed samples (conditions and latents).
precomputation_dir (`str`, defaults to `None`):
  The directory where the precomputed samples will be stored. If not provided, the precomputed samples
  will be stored in a temporary directory inside the output directory.
precomputation_once (`bool`, defaults to `False`):
  Precompute embeddings from all datasets at once before training. This is useful to save time during training
  with smaller datasets. If set to `False`, disk space is saved by precomputing embeddings on-the-fly during
  training when required. Make sure to set `precomputation_items` to a reasonable value in line with the size
  of your dataset(s).

DATALOADER ARGUMENTS
--------------------
See https://pytorch.org/docs/stable/data.html for more information.

dataloader_num_workers (`int`, defaults to `0`):
  Number of subprocesses to use for data loading. `0` means that the data will be loaded in a blocking manner
  on the main process.
pin_memory (`bool`, defaults to `False`):
  Whether or not to use the pinned memory setting in the PyTorch dataloader. This is useful for faster data loading.

DIFFUSION ARGUMENTS
-------------------
flow_resolution_shifting (`bool`, defaults to `False`):
  Resolution-dependent shifting of timestep schedules.
  [Scaling Rectified Flow Transformers for High-Resolution Image Synthesis](https://arxiv.org/abs/2403.03206).
  TODO(aryan): We don't support this yet.
flow_base_seq_len (`int`, defaults to `256`):
  Base number of tokens for images/video when applying resolution-dependent shifting.
flow_max_seq_len (`int`, defaults to `4096`):
  Maximum number of tokens for images/video when applying resolution-dependent shifting.
flow_base_shift (`float`, defaults to `0.5`):
  Base shift for timestep schedules when applying resolution-dependent shifting.
flow_max_shift (`float`, defaults to `1.15`):
  Maximum shift for timestep schedules when applying resolution-dependent shifting.
flow_shift (`float`, defaults to `1.0`):
  Instead of training with uniform/logit-normal sigmas, shift them as (shift * sigma) / (1 + (shift - 1) * sigma).
  Setting this higher is helpful when training models for high-resolution generation or for producing better
  samples with fewer inference steps (a short worked example is shown after this section).
flow_weighting_scheme (`str`, defaults to `none`):
  We default to the "none" weighting scheme for uniform sampling and uniform loss.
  Choose between ['sigma_sqrt', 'logit_normal', 'mode', 'cosmap', 'none'].
flow_logit_mean (`float`, defaults to `0.0`):
  Mean to use when using the `'logit_normal'` weighting scheme.
flow_logit_std (`float`, defaults to `1.0`):
  Standard deviation to use when using the `'logit_normal'` weighting scheme.
flow_mode_scale (`float`, defaults to `1.29`):
  Scale of mode weighting scheme. Only effective when using `'mode'` as the `weighting_scheme`.

TRAINING ARGUMENTS
------------------
training_type (`str`, defaults to `None`):
  Type of training to perform. Choose between ['lora'].
seed (`int`, defaults to `42`):
  A seed for reproducible training.
batch_size (`int`, defaults to `1`):
  Per-device batch size.
train_steps (`int`, defaults to `1000`):
  Total number of training steps to perform.
max_data_samples (`int`, defaults to `2**64`):
  Maximum number of data samples observed during training. If this is lower than the number required by
  `train_steps`, training will stop early.
gradient_accumulation_steps (`int`, defaults to `1`):
  Number of gradient steps to accumulate before performing an optimizer step.
gradient_checkpointing (`bool`, defaults to `False`):
  Whether or not to use gradient/activation checkpointing to save memory at the expense of a slower
  backward pass.
checkpointing_steps (`int`, defaults to `500`):
  Save a checkpoint of the training state every X training steps. These checkpoints can be used as final
  checkpoints if they turn out better than the last one, and are also suitable for resuming training via
  `resume_from_checkpoint`.
checkpointing_limit (`int`, defaults to `None`):
  Maximum number of checkpoints to store.
resume_from_checkpoint (`str`, defaults to `None`):
  Resume training from a previous checkpoint. Pass a path saved by `checkpointing_steps`, or `"latest"` to
  automatically select the last available checkpoint.

OPTIMIZER ARGUMENTS
-------------------
optimizer (`str`, defaults to `adamw`):
  The optimizer type to use. Choose between the following:
    - Torch optimizers: ["adam", "adamw"]
    - Bitsandbytes optimizers: ["adam-bnb", "adamw-bnb", "adam-bnb-8bit", "adamw-bnb-8bit"]
lr (`float`, defaults to `1e-4`):
  Initial learning rate (after the potential warmup period) to use.
lr_scheduler (`str`, defaults to `cosine_with_restarts`):
  The scheduler type to use. Choose between ['linear', 'cosine', 'cosine_with_restarts', 'polynomial',
  'constant', 'constant_with_warmup'].
lr_warmup_steps (`int`, defaults to `500`):
  Number of steps for the warmup in the lr scheduler.
lr_num_cycles (`int`, defaults to `1`):
  Number of hard resets of the lr in cosine_with_restarts scheduler.
lr_power (`float`, defaults to `1.0`):
  Power factor of the polynomial scheduler.
beta1 (`float`, defaults to `0.9`):
  Exponential decay rate for the first-moment estimates in Adam-style optimizers.
beta2 (`float`, defaults to `0.95`):
  Exponential decay rate for the second-moment estimates in Adam-style optimizers.
beta3 (`float`, defaults to `0.999`):
weight_decay (`float`, defaults to `0.0001`):
  Penalty for large weights in the model.
epsilon (`float`, defaults to `1e-8`):
  Small value to avoid division by zero in the optimizer.
max_grad_norm (`float`, defaults to `1.0`):
  Maximum gradient norm to clip the gradients.

VALIDATION ARGUMENTS
--------------------
validation_dataset_file (`str`, defaults to `None`):
  Path to a CSV/JSON/PARQUET/ARROW file containing information for validation. The file must contain at least the
  "caption" column. Other columns such as "image_path" and "video_path" can be provided too. If provided, "image_path"
  will be used to load a PIL.Image.Image and set the "image" key in the sample dictionary. Similarly, "video_path"
  will be used to load a List[PIL.Image.Image] and set the "video" key in the sample dictionary.
  The validation dataset file may contain other attributes specific to inference/validation such as:
    - "height" and "width" and "num_frames": Resolution
    - "num_inference_steps": Number of inference steps
    - "guidance_scale": Classifier-free Guidance Scale
    - ... (any number of additional attributes can be provided. The ModelSpecification::validate method will be
      invoked with the sample dictionary to validate the sample.)
  A short sketch of creating such a file is shown after this section.
validation_steps (`int`, defaults to `500`):
  Number of training steps after which a validation step is performed.
enable_model_cpu_offload (`bool`, defaults to `False`):
  Whether or not to offload different modeling components to CPU during validation.

MISCELLANEOUS ARGUMENTS
-----------------------
tracker_name (`str`, defaults to `finetrainers`):
  Name of the tracker/project to use for logging training metrics.
push_to_hub (`bool`, defaults to `False`):
  Whether or not to push the model to the Hugging Face Hub.
hub_token (`str`, defaults to `None`):
  The API token to use for pushing the model to the Hugging Face Hub.
hub_model_id (`str`, defaults to `None`):
  The model identifier to use for pushing the model to the Hugging Face Hub.
output_dir (`str`, defaults to `None`):
  The directory where the model checkpoints and logs will be stored.
logging_dir (`str`, defaults to `logs`):
  The directory where the logs will be stored.
logging_steps (`int`, defaults to `1`):
  Training logs will be tracked every `logging_steps` steps.
allow_tf32 (`bool`, defaults to `False`):
  Whether or not to allow the use of TF32 matmul on compatible hardware.
nccl_timeout (`int`, defaults to `1800`):
  Timeout for NCCL communication.
report_to (`str`, defaults to `wandb`):
  The name of the logger to use for logging training metrics. Choose between ['wandb'].
verbose (`int`, defaults to `1`):
  Verbosity level of the logs:
  - 0: Diffusers/Transformers warning logging on local main process only
  - 1: Diffusers/Transformers info logging on local main process only
  - 2: Diffusers/Transformers debug logging on local main process only
  - 3: Diffusers/Transformers debug logging on all processes
```
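
The `dataset_config` argument described above expects a JSON file with a top-level `"datasets"` list. As a minimal sketch (the paths, identifier token, and bucket sizes below are hypothetical placeholders, not defaults), such a file could be written with a few lines of Python:

```
import json

# Hypothetical dataset configuration; adjust the paths, buckets and options to
# your own data. The keys follow the DATASET ARGUMENTS section above.
config = {
    "datasets": [
        {
            "data_root": "/path/to/my/videos",             # or use "dataset_file" instead
            "dataset_type": "video",
            "id_token": "SKS",                             # optional trigger token for LoRA-style training
            "video_resolution_buckets": [[49, 480, 720]],  # (num_frames, height, width)
            "reshape_mode": "center_crop",
            "remove_common_llm_caption_prefixes": True,
        }
    ]
}

with open("my_dataset_config.json", "w") as f:
    json.dump(config, f, indent=2)
```

The resulting path is what gets passed as the `dataset_config` argument when invoking `train.py`.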
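
The `flow_shift` argument rescales sampled sigmas as `(shift * sigma) / (1 + (shift - 1) * sigma)`, exactly as stated above. A tiny sketch of the effect (plain Python, no finetrainers internals):

```
# With shift > 1, intermediate sigmas are pushed towards 1.0 (the noisier end of
# the schedule), which is the regime described above for high-resolution training.
def shift_sigma(sigma: float, shift: float) -> float:
    return (shift * sigma) / (1 + (shift - 1) * sigma)

for sigma in (0.25, 0.5, 0.75):
    print(sigma, "->", round(shift_sigma(sigma, shift=3.0), 3))
# 0.25 -> 0.5, 0.5 -> 0.75, 0.75 -> 0.9
```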
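
The `validation_dataset_file` can be, for example, a CSV file whose only required column is "caption"; additional columns become per-sample inference attributes as described above (which attributes are actually honored depends on the model's `ModelSpecification::validate` implementation). A minimal sketch with made-up values:

```
import csv

# Hypothetical validation file; "caption" is required, the remaining columns are
# optional inference attributes described in VALIDATION ARGUMENTS.
rows = [
    {
        "caption": "A video of a cat chasing a ball in a garden",
        "height": 480,
        "width": 720,
        "num_frames": 49,
        "num_inference_steps": 50,
        "guidance_scale": 5.0,
    }
]

with open("validation.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
```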

## SFT training

If using `--training_type lora`, these arguments can be specified.

```
rank (`int`):
  Rank of the low-rank approximation.
lora_alpha (`int`):
  The lora_alpha parameter to compute the scaling factor (lora_alpha / rank) for the low-rank matrices
  (a short sketch of this scaling is shown after this section).
target_modules (`str` or `List[str]`):
  Target modules for the low-rank approximation. Can be a regex string or a list of regex strings.
```
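
As noted above, the low-rank update is scaled by `lora_alpha / rank`. A short sketch of what that means, using the generic LoRA formulation rather than finetrainers-specific code (layer sizes are hypothetical):

```
import torch

rank, lora_alpha = 16, 32
in_features, out_features = 3072, 3072  # hypothetical layer size

# Generic LoRA factors: B starts at zero so the update is zero at initialization.
A = torch.randn(rank, in_features) * 0.01
B = torch.zeros(out_features, rank)

scaling = lora_alpha / rank   # 2.0 for this choice of hyperparameters
delta_W = scaling * (B @ A)   # effective update added to the frozen weight
print(delta_W.shape)          # torch.Size([3072, 3072])
```

Raising `lora_alpha` relative to `rank` therefore increases the magnitude of the learned update without changing the number of trainable parameters.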

No additional arguments are required for `--training_type full-finetune`.