torchtitan is a reference architecture for large-scale LLM training using native PyTorch. It aims to showcase PyTorch's latest distributed training features in a clean, minimal codebase. The library is designed to be simple to understand, use, and extend for different training purposes, requiring minimal changes to the model code when applying various parallelism techniques.
torchtitan offers several advanced capabilities:
- FSDP2 with per-parameter sharding (see the first sketch below)
- FP8 support
- Async tensor parallelism in PyTorch
- Efficient checkpointing with PyTorch Distributed Checkpoint (DCP) (see the second sketch below)
- Zero-bubble pipeline parallelism
- Context parallelism for training long-context LLMs (up to 1M sequence length)
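As a concrete illustration of the first item, here is a minimal sketch of FSDP2-style per-parameter sharding using the `fully_shard` API. This is not torchtitan's own code; it assumes a recent PyTorch release (2.6 or later) where `fully_shard` is exported from `torch.distributed.fsdp`, and the toy model is purely illustrative:

```python
# Minimal FSDP2 sketch: per-parameter sharding via fully_shard.
# Run with `torchrun --nproc_per_node=N this_file.py` on GPUs.
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import fully_shard

dist.init_process_group("nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

# Toy model standing in for a transformer stack.
model = nn.Sequential(
    nn.Linear(1024, 4096),
    nn.GELU(),
    nn.Linear(4096, 1024),
).cuda()

# Unlike FSDP1, which flattened groups of parameters into a single
# FlatParameter, FSDP2 shards each parameter individually as a DTensor.
for layer in model:
    if isinstance(layer, nn.Linear):
        fully_shard(layer)  # shard each block's parameters
fully_shard(model)          # shard any remaining root parameters

out = model(torch.randn(8, 1024, device="cuda"))
out.sum().backward()
dist.destroy_process_group()
```

Applying `fully_shard` per submodule before the root mirrors the usual FSDP2 pattern: each wrapped block becomes its own communication unit, which lets all-gathers overlap with compute.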
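And a minimal sketch of saving and loading with PyTorch DCP (`torch.distributed.checkpoint`). The `dcp.save` / `dcp.load` entry points exist in PyTorch 2.2 and later; the checkpoint path below is illustrative, not a torchtitan convention:

```python
# Sketch of checkpointing with PyTorch DCP. In a multi-rank job, each rank
# writes only its own shards in parallel; single-process use also works.
import torch.nn as nn
import torch.distributed.checkpoint as dcp

model = nn.Linear(1024, 1024)  # stands in for a (possibly FSDP2-sharded) model
state_dict = {"model": model.state_dict()}

# Save: writes a sharded checkpoint directory.
dcp.save(state_dict, checkpoint_id="checkpoints/step_1000")

# Load: restores tensors in place into the existing state_dict, resharding
# as needed if the world size changed between save and load.
dcp.load(state_dict, checkpoint_id="checkpoints/step_1000")
```

Because DCP loads into an existing (possibly sharded) state dict rather than materializing a full checkpoint on one rank, both save and load scale with the number of ranks instead of bottlenecking on rank 0.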