torchtitan is a reference architecture for large-scale LLM training using native PyTorch. It aims to showcase PyTorch's latest distributed training features in a clean, minimal codebase. The library is designed to be simple to understand, use, and extend for different training purposes, requiring minimal changes to the model code when applying various parallelism techniques.
torchtitan offers several advanced capabilities:
- FSDP2 with per-parameter sharding (see the first sketch below)
- FP8 support
- Async tensor parallelism in PyTorch
- Efficient checkpointing with PyTorch Distributed Checkpoint (DCP) (see the second sketch below)
- Zero-bubble pipeline parallelism
- Context parallelism for training long-context LLMs (up to 1M sequence length)
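As a concrete illustration of the first item, here is a minimal sketch of FSDP2-style per-parameter sharding using the `fully_shard` API. This is not torchtitan's own code; it assumes a recent PyTorch release (2.6 or later) where `fully_shard` is exported from `torch.distributed.fsdp`, and the toy model is purely illustrative:

```python
# Minimal FSDP2 sketch: per-parameter sharding via fully_shard.
# Run with `torchrun --nproc_per_node=N this_file.py` on GPUs.
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import fully_shard

dist.init_process_group("nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

# Toy model standing in for a transformer stack.
model = nn.Sequential(
    nn.Linear(1024, 4096),
    nn.GELU(),
    nn.Linear(4096, 1024),
).cuda()

# Unlike FSDP1, which flattened groups of parameters into a single
# FlatParameter, FSDP2 shards each parameter individually as a DTensor.
for layer in model:
    if isinstance(layer, nn.Linear):
        fully_shard(layer)  # shard each block's parameters
fully_shard(model)          # shard any remaining root parameters

out = model(torch.randn(8, 1024, device="cuda"))
out.sum().backward()
dist.destroy_process_group()
```

Applying `fully_shard` per submodule before the root mirrors the usual FSDP2 pattern: each wrapped block becomes its own communication unit, which lets all-gathers overlap with compute.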
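And a minimal sketch of saving and loading with PyTorch DCP (`torch.distributed.checkpoint`). The `dcp.save` / `dcp.load` entry points exist in PyTorch 2.2 and later; the checkpoint path below is illustrative, not a torchtitan convention:

```python
# Sketch of checkpointing with PyTorch DCP. In a multi-rank job, each rank
# writes only its own shards in parallel; single-process use also works.
import torch.nn as nn
import torch.distributed.checkpoint as dcp

model = nn.Linear(1024, 1024)  # stands in for a (possibly FSDP2-sharded) model
state_dict = {"model": model.state_dict()}

# Save: writes a sharded checkpoint directory.
dcp.save(state_dict, checkpoint_id="checkpoints/step_1000")

# Load: restores tensors in place into the existing state_dict, resharding
# as needed if the world size changed between save and load.
dcp.load(state_dict, checkpoint_id="checkpoints/step_1000")
```

Because DCP loads into an existing (possibly sharded) state dict rather than materializing a full checkpoint on one rank, both save and load scale with the number of ranks instead of bottlenecking on rank 0.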