
Commit 8222d3f

Support flash/flex/xformers/sage attention (#377)
* support attention providers for training/inference: flash attn, flex attn, native, xformers; for inference only: sage; refactor ConfigMixin for arguments
* update
* fix for sdpa replacement; fix for backward pass
* fix flash-attn shape when not using varlen; remove contiguous for now; remove custom block mask code
* update docs
* add basic tests
* update arg name
* update docs
* more doc updates
* dispatcher fixes
* add back context manager for external use; make style
* update date
1 parent 3c583bf commit 8222d3f

File tree

20 files changed: +1572 additions, -148 deletions


README.md

Lines changed: 3 additions & 0 deletions
@@ -57,6 +57,7 @@ Please checkout [`docs/models`](./docs/models/) and [`examples/training`](./exam
 - DDP, FSDP-2 & HSDP support for all models
 - LoRA and full-rank finetuning; Conditional Control training
 - Memory-efficient single-GPU training
+- Multiple attention backends supported - `flash`, `flex`, `sage`, `xformers` (see [attention](./docs/models/attention.md) docs)
 - Auto-detection of commonly used dataset formats
 - Combined image/video datasets, multiple chainable local/remote datasets, multi-resolution bucketing & more
 - Memory-efficient precomputation support with/without on-the-fly precomputation for large scale datasets
@@ -65,6 +66,8 @@ Please checkout [`docs/models`](./docs/models/) and [`examples/training`](./exam

 ## News

+- 🔥 **2025-04-25**: Support for different attention providers added!
+- 🔥 **2025-04-21**: Wan I2V support added!
 - 🔥 **2025-04-12**: Channel-concatenated control conditioning support added for CogView4 and Wan!
 - 🔥 **2025-04-08**: `torch.compile` support added!
 - 🔥 **2025-04-06**: Flux support added!

docs/args.md

Lines changed: 20 additions & 0 deletions
@@ -270,6 +270,26 @@ float32_matmul_precision (`str`, defaults to `highest`):
     The precision to use for float32 matmul. Choose between ['highest', 'high', 'medium'].
 ```

+### Attention Provider
+
+These arguments control the attention provider used by the different modeling components. The attention provider may be set differently for training and validation/inference.
+
+```
+attn_provider_training (`str`, defaults to "native"):
+    The attention provider to use for training. Choose between
+    [
+        'flash', 'flash_varlen', 'flex', 'native', '_native_cudnn', '_native_efficient', '_native_flash',
+        '_native_math'
+    ]
+attn_provider_inference (`str`, defaults to "native"):
+    The attention provider to use for validation/inference. Choose between
+    [
+        'flash', 'flash_varlen', 'flex', 'native', '_native_cudnn', '_native_efficient', '_native_flash',
+        '_native_math', 'sage', 'sage_varlen', '_sage_qk_int8_pv_fp8_cuda', '_sage_qk_int8_pv_fp8_cuda_sm90',
+        '_sage_qk_int8_pv_fp16_cuda', '_sage_qk_int8_pv_fp16_triton', 'xformers'
+    ]
+```
+
 ## SFT training

 If using `--training_type lora`, these arguments can be specified.

docs/environment.md

Lines changed: 8 additions & 0 deletions
@@ -26,3 +26,11 @@ NVIDIA A100-SXM4-80GB, 81920 MiB
 ```

 Other versions of dependencies may or may not work as expected. We would like to make finetrainers work on a wider range of environments, but due to the complexity of testing at the early stages of development, we are unable to do so. The long term goals include compatibility with most pytorch versions on CUDA, MPS, ROCm and XLA devices.
+
+
+## Configuration
+
+The following environment variables may be configured to change the default behaviour of finetrainers:
+
+`FINETRAINERS_ATTN_PROVIDER`: Sets the default attention provider for training/validation. Defaults to `native`, as in native PyTorch SDPA. See [attention docs](./models/attention.md) for more information.
+`FINETRAINERS_ATTN_CHECKS`: Whether or not to run basic sanity checks when using different attention providers. This is useful for debugging, but it should be left disabled for longer training runs. Defaults to `"0"`. Can be set to any truthy env value (`1`, `ON`, `YES`, `TRUE`).
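For a concrete picture of how these variables take effect, here is a minimal sketch (illustrative only): the values are read in `finetrainers/constants.py` at import time, so they must be configured before finetrainers is imported. The chosen values below are just examples.

```python
# Illustrative only: configure the environment before importing finetrainers,
# since finetrainers/constants.py reads these variables at import time.
import os

os.environ["FINETRAINERS_ATTN_PROVIDER"] = "flash"  # default attention provider for training/validation
os.environ["FINETRAINERS_ATTN_CHECKS"] = "1"        # any of "1", "ON", "YES", "TRUE" enables the checks

import finetrainers  # noqa: E402  (imported after configuring the environment)
```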

docs/models/attention.md

Lines changed: 87 additions & 0 deletions
@@ -0,0 +1,87 @@
+# Attention backends
+
+Finetrainers supports multiple attention backends to cover different hardware and to trade off between speed and memory usage. The following attention implementations are supported:
+- Training:
+  - If the model uses attention masks: `flash_varlen`, `flex`, `native`
+  - If the model does not use attention masks: `flash`, `flex`, `native`, `xformers`
+- Inference:
+  - If the model uses attention masks: `flash_varlen`, `flex`, `native`, `sage_varlen`
+  - If the model does not use attention masks: `flash`, `flash_varlen`, `flex`, `native`, `sage`, `sage_varlen`, `xformers`
+
+Additionally, some specialized methods are available for debugging purposes: `_native_cudnn`, `_native_efficient`, `_native_flash`, `_native_math`, `_sage_qk_int8_pv_fp8_cuda`, `_sage_qk_int8_pv_fp8_cuda_sm90`, `_sage_qk_int8_pv_fp16_cuda`, `_sage_qk_int8_pv_fp16_triton`. With time, more attention-specific optimizations and custom implementations will be supported. Contributions are welcome!
+
+Unfortunately, due to limited time for testing, only specific versions of the packages that provide these implementations are supported. Other versions may work. The minimum supported versions will gradually be lowered for more flexibility, but for now, please use the following versions:
+- `flash-attn>=2.6.3`
+- `sageattention>=2.1.1`
+- `xformers>=0.0.29.post3`
+
+This guide will help you quickly install flash-attn, sageattention, and xformers so that your models run faster and use less memory for training/inference. We'll cover installation on Linux (Ubuntu 22.04) and Windows (using WSL).
+
+Before you start, make sure to use a clean Python virtual environment, so that conflicting dependencies or failed installations do not leave your system in a hard-to-recover state.
+
+### Flash attention
+
+Providers covered: `flash`, `flash_varlen`
+
+The installation steps have only been tested on Ubuntu 22.04, with CUDA 12.2 and 12.6.
+- Check your CUDA version: look at the output of `nvidia-smi` or run `nvcc --version`.
+- You might need the following packages: `pip install packaging ninja`
+- Linux: Run `pip install flash-attn --no-build-isolation`. Verify the version with `pip show flash-attn`.
+- WSL: The same instructions as above should work. Native Windows might require building from source - check the community guides and follow the instructions [here](https://github.com/Dao-AILab/flash-attention).
+
+### Sage attention
+
+Providers covered: `sage`, `sage_varlen`, `_sage_qk_int8_pv_fp8_cuda`, `_sage_qk_int8_pv_fp8_cuda_sm90`, `_sage_qk_int8_pv_fp16_cuda`, `_sage_qk_int8_pv_fp16_triton`
+
+The FP8 implementations require a CUDA compute capability of 90 or higher (H100, RTX 5090, etc.). Some may also work on compute capability 89 (RTX 4090, for example). The FP16 implementations require a compute capability of at least 80 (A100, RTX 3090, etc.). For other GPUs, the FP16 implementations may or may not work (this is untested).
+
+- Check your compute capability with the following command:
+  ```bash
+  python -c "import torch; print(torch.cuda.get_device_capability())"
+  ```
+- Check your CUDA version: look at the output of `nvidia-smi` or run `nvcc --version`.
+- You might need the following packages: `pip install triton`. For Windows, check out the [triton-windows](https://github.com/woct0rdho/triton-windows) project.
+- Linux/WSL: Run `pip install git+https://github.com/thu-ml/SageAttention`. Verify the version with `pip show sageattention`.
+- Make sure to also look at the official installation guide in [SageAttention](https://github.com/thu-ml/SageAttention)!
+
+### xformers
+
+Providers covered: `xformers`
+
+- Check your CUDA version: look at the output of `nvidia-smi` or run `nvcc --version`.
+- Linux/WSL: Run `pip install -U xformers --index-url https://download.pytorch.org/whl/cu126` (assuming CUDA 12.6). Verify the version with `pip show xformers`.
+- Make sure to also look at the official installation guide in [xformers](https://github.com/facebookresearch/xformers)!
+
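After following the installation steps above, a quick import check can confirm that each extension actually loads in your environment. This is an illustrative sketch only; the import names are the usual ones for these packages, and the version attributes for flash-attn and xformers are assumptions - adjust if your installation differs.

```python
# Illustrative sanity check: a successful import confirms each extension loads.
import flash_attn
import xformers
import sageattention  # noqa: F401

print("flash-attn:", flash_attn.__version__)  # expect >= 2.6.3
print("xformers:", xformers.__version__)      # expect >= 0.0.29.post3
print("sageattention imported successfully")  # version can be checked with `pip show sageattention`
```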
+----------
+
+All other providers are either native PyTorch implementations or require a specific PyTorch version (for example, Flex Attention requires a torch version of at least 2.5.0).
+
+----------
+
+## Usage
+
+There are two ways to use the attention dispatcher mechanism:
+- Replace `scaled_dot_product_attention` globally:
+  ```python
+  import torch.nn.functional as F
+  from finetrainers.models.attention_dispatch import attention_dispatch
+
+  F.scaled_dot_product_attention = attention_dispatch
+  ```
+- Replace all occurrences of `scaled_dot_product_attention` in your code with `attention_dispatch`.
+
+```python
+# Use dispatcher directly
+from finetrainers.models.attention_dispatch import attention_provider, AttentionProvider
+
+with attention_provider(AttentionProvider.FLASH_VARLEN):
+    model(...)
+
+# or,
+with attention_provider("sage_varlen"):
+    model(...)
+```
+
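To make the two usage patterns above concrete, here is a minimal end-to-end sketch. It assumes `attention_dispatch` is signature-compatible with `torch.nn.functional.scaled_dot_product_attention` (which the global replacement above relies on); tensor shapes follow the usual `(batch, heads, seq_len, head_dim)` convention and the chosen provider is just an example.

```python
# A minimal sketch of the dispatcher in action (assumptions noted above).
import torch
import torch.nn.functional as F

from finetrainers.models.attention_dispatch import attention_dispatch, attention_provider

# Route all SDPA calls through the dispatcher.
F.scaled_dot_product_attention = attention_dispatch

query, key, value = (torch.randn(1, 8, 128, 64, device="cuda", dtype=torch.bfloat16) for _ in range(3))

# Run the dispatched call under a specific provider.
with attention_provider("_native_flash"):
    output = F.scaled_dot_product_attention(query, key, value)

print(output.shape)  # torch.Size([1, 8, 128, 64])
```

Outside the context manager, the default provider applies (`FINETRAINERS_ATTN_PROVIDER`, `native` by default).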
+## Context Parallel
+
+TODO

finetrainers/args.py

Lines changed: 108 additions & 19 deletions
@@ -2,22 +2,88 @@
 import os
 import pathlib
 import sys
-from typing import Any, Dict, List, Optional
+from typing import Any, Dict, List, Literal, Optional, Union

 import torch

 from .config import SUPPORTED_MODEL_CONFIGS, ModelType, TrainingType
 from .logging import get_logger
 from .parallel import ParallelBackendEnum
-from .trainer.config_utils import ConfigMixin
-from .utils import get_non_null_items
+from .utils import ArgsConfigMixin, get_non_null_items


 logger = get_logger()

+# fmt: off
+# Must match src/finetrainers/models/attention_dispatch.py
+AttentionProviderTraining = Literal["flash", "flash_varlen", "flex", "native", "_native_cudnn", "_native_efficient", "_native_flash", "_native_math", "xformers"]
+AttentionProviderValidation = Literal["flash", "flash_varlen", "flex", "native", "_native_cudnn", "_native_efficient", "_native_flash", "_native_math", "sage", "sage_varlen", "_sage_qk_int8_pv_fp8_cuda", "_sage_qk_int8_pv_fp8_cuda_sm90", "_sage_qk_int8_pv_fp16_cuda", "_sage_qk_int8_pv_fp16_triton", "xformers"]
+
+# We do a union because every ArgsConfigMixin registered to BaseArgs can be looked up using the `__getattribute__` override
+BaseArgsType = Union["BaseArgs", "AttentionProviderArgs"]
+# fmt: on
+
+
+class AttentionProviderArgs(ArgsConfigMixin):
+    """
+    Args:
+        attn_provider_training (`List[str]`, defaults to `None`):
+            Each entry must be a string of the form `"<component_name>:<attention_provider>"`. For example, if you want to use
+            the flash varlen attention implementation on the `transformer` module, you can set this argument to
+            `"transformer:flash_varlen"`. The attention provider will be used for training.
+            Options for `<attention_provider>` are:
+                flash, flash_varlen, flex, native, _native_cudnn, _native_efficient, _native_flash, _native_math, xformers
+        attn_provider_inference (`List[str]`, defaults to `None`):
+            Each entry must be a string of the form `"<component_name>:<attention_provider>"`. For example, if you want to use
+            the flash varlen attention implementation on the `transformer` module, you can set this argument to
+            `"transformer:flash_varlen"`. The attention provider will be used for validation/inference.
+            Options for `<attention_provider>` are:
+                flash, flash_varlen, flex, native, _native_cudnn, _native_efficient, _native_flash, _native_math,
+                sage, sage_varlen, _sage_qk_int8_pv_fp8_cuda, _sage_qk_int8_pv_fp8_cuda_sm90,
+                _sage_qk_int8_pv_fp16_cuda, _sage_qk_int8_pv_fp16_triton, xformers
+    """
+
+    attn_provider_training: List[AttentionProviderTraining] = None
+    attn_provider_inference: List[AttentionProviderValidation] = None
+
+    def add_args(self, parser: argparse.ArgumentParser) -> None:
+        parser.add_argument(
+            "--attn_provider_training",
+            type=str,
+            default=None,
+            nargs="+",
+            help="Attention provider for training. Must be a string of the form `<component_name>:<attention_provider>`.",
+        )
+        parser.add_argument(
+            "--attn_provider_inference",
+            type=str,
+            default=None,
+            nargs="+",
+            help="Attention provider for inference. Must be a string of the form `<component_name>:<attention_provider>`.",
+        )
+
+    def map_args(self, argparse_args: argparse.Namespace, mapped_args: "BaseArgs"):
+        attn_training = argparse_args.attn_provider_training
+        attn_inference = argparse_args.attn_provider_inference
+        if attn_training is None:
+            attn_training = []
+        if attn_inference is None:
+            attn_inference = []
+        mapped_args.attn_provider_training = attn_training
+        mapped_args.attn_provider_inference = attn_inference
+
+    def validate_args(self, args: "BaseArgs"):
+        pass
+
+    def to_dict(self) -> Dict[str, Any]:
+        return {
+            "attn_provider_training": self.attn_provider_training,
+            "attn_provider_inference": self.attn_provider_inference,
+        }
+

 class BaseArgs:
-    r"""
+    """
     The arguments for the finetrainers training script.

     For helpful information about arguments, run `python train.py --help`.
@@ -314,16 +380,9 @@ class BaseArgs:
     vae_dtype: torch.dtype = torch.bfloat16
     layerwise_upcasting_modules: List[str] = []
     layerwise_upcasting_storage_dtype: torch.dtype = torch.float8_e4m3fn
-    layerwise_upcasting_skip_modules_pattern: List[str] = [
-        "patch_embed",
-        "pos_embed",
-        "x_embedder",
-        "context_embedder",
-        "time_embed",
-        "^proj_in$",
-        "^proj_out$",
-        "norm",
-    ]
+    # fmt: off
+    layerwise_upcasting_skip_modules_pattern: List[str] = ["patch_embed", "pos_embed", "x_embedder", "context_embedder", "time_embed", "^proj_in$", "^proj_out$", "norm"]
+    # fmt: on

     # Dataset arguments
     dataset_config: str = None
@@ -399,10 +458,21 @@ class BaseArgs:
     compile_modules: List[str] = []
     compile_scopes: List[str] = None
     allow_tf32: bool = False
-    float32_matmul_precision: Optional[str] = None
+    float32_matmul_precision: str = "highest"

-    # Additional registered arguments
-    _registered_config_mixins: List[ConfigMixin] = []
+    # Attention provider arguments
+    attention_provider_args: AttentionProviderArgs = AttentionProviderArgs()
+
+    _registered_config_mixins: List[ArgsConfigMixin] = []
+    _arg_group_map: Dict[str, ArgsConfigMixin] = {}
+
+    def __init__(self):
+        self._arg_group_map: Dict[str, ArgsConfigMixin] = {
+            "attention_provider_args": self.attention_provider_args,
+        }
+
+        for arg_config_mixin in self._arg_group_map.values():
+            self.register_args(arg_config_mixin)

     def to_dict(self) -> Dict[str, Any]:
         parallel_arguments = {
@@ -545,7 +615,7 @@ def to_dict(self) -> Dict[str, Any]:
             "torch_config_arguments": torch_config_arguments,
         }

-    def register_args(self, config: ConfigMixin) -> None:
+    def register_args(self, config: ArgsConfigMixin) -> None:
         if not hasattr(self, "_extended_add_arguments"):
             self._extended_add_arguments = []
         self._extended_add_arguments.append((config.add_args, config.validate_args, config.map_args))
@@ -583,6 +653,25 @@ def parse_args(self):

         return mapped_args

+    def __getattribute__(self, name: str):
+        try:
+            return object.__getattribute__(self, name)
+        except AttributeError:
+            for arg_group in self._arg_group_map.values():
+                if hasattr(arg_group, name):
+                    return getattr(arg_group, name)
+            raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
+
+    def __setattr__(self, name: str, value: Any):
+        if name in self.__dict__:
+            object.__setattr__(self, name, value)
+            return
+        for arg_group in self._arg_group_map.values():
+            if hasattr(arg_group, name):
+                setattr(arg_group, name, value)
+                return
+        object.__setattr__(self, name, value)
+

 def _add_args(parser: argparse.ArgumentParser) -> None:
     _add_parallel_arguments(parser)
@@ -749,7 +838,7 @@ def _add_torch_config_arguments(parser: argparse.ArgumentParser) -> None:
     parser.add_argument(
         "--float32_matmul_precision",
         type=str,
-        default=None,
+        default="highest",
         choices=["highest", "high", "medium"],
         help="The precision to use for float32 matmul. Choose between ['highest', 'high', 'medium'].",
     )

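To illustrate how the new arguments could be consumed downstream, here is a hedged sketch: `parse_attention_providers` is a hypothetical helper (not part of this commit) that splits the documented `"<component_name>:<attention_provider>"` entries into a mapping, and the attribute access relies on the `__getattribute__`/`__setattr__` overrides shown above.

```python
# Illustrative sketch only. `parse_attention_providers` is a hypothetical helper,
# not part of this commit; BaseArgs and the argument names come from the diff above.
from typing import Dict, List

from finetrainers.args import BaseArgs


def parse_attention_providers(specs: List[str]) -> Dict[str, str]:
    """Split "<component_name>:<attention_provider>" entries into a mapping."""
    providers = {}
    for spec in specs:
        component_name, separator, provider = spec.partition(":")
        if not separator:
            raise ValueError(f"Expected '<component_name>:<attention_provider>', got {spec!r}")
        providers[component_name] = provider
    return providers


args = BaseArgs()
# Attributes of registered ArgsConfigMixin groups resolve directly on BaseArgs
# through the __getattribute__/__setattr__ overrides.
args.attn_provider_training = ["transformer:flash_varlen"]
args.attn_provider_inference = ["transformer:sage"]

print(parse_attention_providers(args.attn_provider_training))   # {'transformer': 'flash_varlen'}
print(parse_attention_providers(args.attn_provider_inference))  # {'transformer': 'sage'}
```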
finetrainers/constants.py

Lines changed: 6 additions & 3 deletions
@@ -1,6 +1,12 @@
 import os


+ENV_VARS_TRUE_VALUES = {"1", "ON", "YES", "TRUE"}
+
+FINETRAINERS_LOG_LEVEL = os.environ.get("FINETRAINERS_LOG_LEVEL", "INFO")
+FINETRAINERS_ATTN_PROVIDER = os.environ.get("FINETRAINERS_ATTN_PROVIDER", "native")
+FINETRAINERS_ATTN_CHECKS = os.getenv("FINETRAINERS_ATTN_CHECKS", "0") in ENV_VARS_TRUE_VALUES
+
 DEFAULT_HEIGHT_BUCKETS = [256, 320, 384, 480, 512, 576, 720, 768, 960, 1024, 1280, 1536]
 DEFAULT_WIDTH_BUCKETS = [256, 320, 384, 480, 512, 576, 720, 768, 960, 1024, 1280, 1536]
 DEFAULT_FRAME_BUCKETS = [49]
@@ -16,9 +22,6 @@
         for width in DEFAULT_WIDTH_BUCKETS:
             DEFAULT_VIDEO_RESOLUTION_BUCKETS.append((frames, height, width))

-
-FINETRAINERS_LOG_LEVEL = os.environ.get("FINETRAINERS_LOG_LEVEL", "INFO")
-
 PRECOMPUTED_DIR_NAME = "precomputed"
 PRECOMPUTED_CONDITIONS_DIR_NAME = "conditions"
 PRECOMPUTED_LATENTS_DIR_NAME = "latents"
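A small, illustrative example of consuming the new constants from other code (only the constants themselves come from this commit; the printout is an example):

```python
# Illustrative: the constants reflect the environment at the time finetrainers was imported.
from finetrainers.constants import FINETRAINERS_ATTN_CHECKS, FINETRAINERS_ATTN_PROVIDER

print("default attention provider:", FINETRAINERS_ATTN_PROVIDER)  # "native" unless overridden
if FINETRAINERS_ATTN_CHECKS:
    print("attention sanity checks are enabled")
```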

finetrainers/models/__init__.py

Lines changed: 1 addition & 0 deletions
@@ -1 +1,2 @@
+from .attention_dispatch import AttentionProvider, attention_dispatch, attention_provider
 from .modeling_utils import ControlModelSpecification, ModelSpecification
