Commit 0c8a193

(1/n) Support 2D Parallelism (#19846)
1 parent 0f12271 commit 0c8a193

File tree: 17 files changed, +1821 -71 lines changed

docs/source-pytorch/conf.py (+2 -2)

@@ -356,8 +356,6 @@ def _load_py_module(name: str, location: str) -> ModuleType:
     "torchmetrics": ("https://lightning.ai/docs/torchmetrics/stable/", None),
     "lightning_habana": ("https://lightning-ai.github.io/lightning-Habana/", None),
     "tensorboardX": ("https://tensorboardx.readthedocs.io/en/stable/", None),
-    # needed for referencing App from lightning scope
-    "lightning.app": ("https://lightning.ai/docs/app/stable/", None),
     # needed for referencing Fabric from lightning scope
     "lightning.fabric": ("https://lightning.ai/docs/fabric/stable/", None),
     # TODO: these are missing objects.inv
@@ -637,4 +635,6 @@ def package_list_from_file(file):
     "https://www.intel.com/content/www/us/en/products/docs/processors/what-is-a-gpu.html",
     "https://www.microsoft.com/en-us/research/blog/zero-infinity-and-deepspeed-unlocking-unprecedented-model-scale-for-deep-learning-training/", # noqa: E501
     "https://stackoverflow.com/questions/66640705/how-can-i-install-grpcio-on-an-apple-m1-silicon-laptop",
+    "https://openai.com/blog/.*",
+    "https://tinyurl.com/.*", # has a human verification check on redirect
 ]
New file (+45 lines)

## Tensor Parallel and 2D Parallel

This example shows how to apply tensor-parallelism to your model (here Llama 2 7B) with the `ModelParallelStrategy`, and how it can be combined with FSDP (2D parallelism).
PyTorch 2.3+ and a machine with at least 4 GPUs and 24 GB memory each are required to run this example.

```bash
pip install 'torch>=2.3'
```

Navigate to this example folder and run the training script:

```bash
cd examples/fabric/tensor_parallel
python train.py
```

You should see an output like this:

```
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/4
Initializing distributed: GLOBAL_RANK: 3, MEMBER: 4/4
Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/4
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/4
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 4 processes
----------------------------------------------------------------------------------------------------

Number of model parameters: 6.7 B
Starting training ...
Iteration 0 complete
Iteration 1 complete
Iteration 2 complete
Iteration 3 complete
Iteration 4 complete
Iteration 5 complete
Iteration 6 complete
Iteration 7 complete
Saving a (distributed) checkpoint ...
Training successfully completed!
Peak memory usage: 17.95 GB
```

> \[!NOTE\]
> The `ModelParallelStrategy` is experimental and subject to change. Report issues on [GitHub](https://github.com/Lightning-AI/pytorch-lightning/issues).
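
For readers who want a feel for how the pieces fit together before opening the example's `train.py`, here is a minimal sketch of 2D parallelism with Fabric's `ModelParallelStrategy`. It is not code from this commit: the `FeedForward` module, the parallelization plan, the mesh dimension names (`"data_parallel"`, `"tensor_parallel"`), and the exact `ModelParallelStrategy` arguments are assumptions based on how the strategy is documented elsewhere and may differ from what this PR adds.

```python
import torch
import torch.nn as nn
from torch.distributed._composable.fsdp import fully_shard  # FSDP2, experimental in PyTorch 2.3
from torch.distributed.device_mesh import DeviceMesh
from torch.distributed.tensor.parallel import ColwiseParallel, RowwiseParallel, parallelize_module

import lightning as L
from lightning.fabric.strategies import ModelParallelStrategy


class FeedForward(nn.Module):
    """Stand-in for a single transformer MLP block (hypothetical, not the example's Llama 2 model)."""

    def __init__(self, dim: int = 1024, hidden: int = 4096) -> None:
        super().__init__()
        self.w1 = nn.Linear(dim, hidden, bias=False)
        self.w2 = nn.Linear(hidden, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(torch.relu(self.w1(x)))


def parallelize(model: FeedForward, device_mesh: DeviceMesh) -> FeedForward:
    # Tensor parallelism: shard w1 column-wise and w2 row-wise across the
    # tensor-parallel ranks (Megatron-style split of the MLP block).
    tp_mesh = device_mesh["tensor_parallel"]
    model = parallelize_module(model, tp_mesh, {"w1": ColwiseParallel(), "w2": RowwiseParallel()})
    # Data parallelism: additionally shard the parameters with FSDP2 across
    # the data-parallel ranks, giving the 2D layout.
    fully_shard(model, mesh=device_mesh["data_parallel"])
    return model


if __name__ == "__main__":
    # 2 data-parallel groups x 2 tensor-parallel ranks = 4 GPUs.
    strategy = ModelParallelStrategy(
        parallelize_fn=parallelize,
        data_parallel_size=2,
        tensor_parallel_size=2,
    )
    fabric = L.Fabric(accelerator="cuda", devices=4, strategy=strategy)
    fabric.launch()

    model = fabric.setup(FeedForward())  # the strategy applies parallelize() when the model is set up
    x = torch.randn(8, 1024, device=fabric.device)
    fabric.print(model(x).shape)
```

Splitting `w1` column-wise and `w2` row-wise keeps the intermediate activations sharded so the block needs only a single all-reduce, while `fully_shard` spreads the resulting parameters over the remaining mesh dimension, which is the usual way tensor parallelism and FSDP are composed into 2D parallelism.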
New file (+21 lines)

import torch
from torch.utils.data import Dataset


class RandomTokenDataset(Dataset):
    def __init__(self, vocab_size: int, seq_length: int):
        self.vocab_size = vocab_size
        self.seq_length = seq_length
        self.tokens = torch.randint(
            self.vocab_size,
            size=(len(self), self.seq_length + 1),
            # Set a seed to make this toy dataset the same on each rank
            # Fabric will add a `DistributedSampler` to shard the data correctly
            generator=torch.Generator().manual_seed(42),
        )

    def __len__(self) -> int:
        return 128

    def __getitem__(self, item: int):
        return self.tokens[item]
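
As a quick, hedged usage sketch (not part of this commit): each item has `seq_length + 1` tokens, which is the usual layout for next-token prediction where inputs and labels are shifted by one position. The `vocab_size` and `seq_length` values below are illustrative, not taken from the example's training script.

```python
from torch.utils.data import DataLoader

# Illustrative values only; the actual train.py may configure this differently.
dataset = RandomTokenDataset(vocab_size=32_000, seq_length=128)
dataloader = DataLoader(dataset, batch_size=8)

for batch in dataloader:
    # Shift by one token: predict position t+1 from positions <= t.
    inputs, labels = batch[:, :-1], batch[:, 1:]
    print(inputs.shape, labels.shape)  # torch.Size([8, 128]) torch.Size([8, 128])
    break
```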
