
Commit 395a044

Merge branch 'master' into carmocca/bwd-sync-enabled
2 parents e52f688 + f23b3b1 commit 395a044

34 files changed: +743 -169 lines changed

.github/workflows/ci-pkg-install.yml

Lines changed: 1 addition & 1 deletion
@@ -115,7 +115,7 @@ jobs:
          done
      - name: Install pytest doctest extension
        run: |
-          pip install -q "pytest-doctestplus>=0.9.0"
+          pip install -q -r requirements/doctests.txt
           pip list

      - name: DocTest package
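An editorial aside: the contents of ``requirements/doctests.txt`` are not shown in this diff. Based on the inline pin it replaces, it presumably lists at least the doctest plugin, roughly like this (hypothetical sketch):

.. code-block:: text

    # requirements/doctests.txt (hypothetical content)
    pytest-doctestplus >=0.9.0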

docs/source-fabric/advanced/compile.rst

Lines changed: 24 additions & 7 deletions
@@ -3,7 +3,7 @@ Speed up models by compiling them
#################################

Compiling your PyTorch model can result in significant speedups, especially on the latest generations of GPUs.
-This guide shows you how to apply ``torch.compile`` correctly in your code.
+This guide shows you how to apply `torch.compile <https://pytorch.org/docs/2.2/generated/torch.compile.html>`_ correctly in your code.

.. note::

@@ -132,7 +132,7 @@ If there are regions in the code that it doesn't understand, it will introduce a
Graph breaks aren't a deal breaker, since the optimized parts should still run faster.
But if you want to get the most out of ``torch.compile``, you might want to invest rewriting the problematic section of the code that produce the breaks.

-You can check whether your model produces graph breaks by calling ``torch.compile`` with ``fullraph=True``:
+You can check whether your model produces graph breaks by calling ``torch.compile`` with ``fullgraph=True``:

.. code-block:: python

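For context: the code block opened by the last context line above is not shown in this hunk; the same snippet appears in the new compile guide added elsewhere in this commit:

.. code-block:: python

    # Force an error if there is a graph break in the model
    model = torch.compile(model, fullgraph=True)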
@@ -223,6 +223,9 @@ On PyTorch 2.2 and later, ``torch.compile`` will detect dynamism automatically a
Numbers produced with NVIDIA A100 SXM4 40GB, PyTorch 2.2.0, CUDA 12.1.


+If you still see recompilation issues after dealing with the aforementioned cases, there is a `Compile Profiler in PyTorch <https://pytorch.org/docs/stable/torch.compiler_troubleshooting.html#excessive-recompilation>`_ for further investigation.
+
+
----

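A minimal sketch (editorial addition, not part of this commit) of how to surface the reason for each recompilation while you investigate; ``torch._logging.set_logs`` and the ``TORCH_LOGS`` environment variable are assumed to be available (PyTorch 2.1+):

.. code-block:: python

    import torch

    # Log a message explaining which guard failed every time the model recompiles
    torch._logging.set_logs(recompiles=True)

Equivalently, launch the script with ``TORCH_LOGS="recompiles"`` set in the environment.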

@@ -266,11 +269,11 @@ You can find a full list of compile options in the `PyTorch documentation <https
A note about torch.compile in practice
**************************************

-In practice, you will find that ``torch.compile`` often doesn't work well and can even be counter-productive.
-Compilation may fail with cryptic error messages that are impossible to debug without help from the PyTorch team.
-It is also not uncommon that ``torch.compile`` will produce a significantly *slower* model or one with much higher memory usage.
-On top of that, the compilation phase itself can be incredibly slow, taking several minutes to finish.
-For these reasons, we recommend that you don't waste too much time trying to apply ``torch.compile`` during development, and rather evaluate its effectiveness toward the end when you are about to launch long-running, expensive experiments.
+In practice, you will find that ``torch.compile`` may not work well at first or may be counter-productive to performance.
+Compilation may fail with cryptic error messages that are hard to debug, luckily the PyTorch team is responsive and it's likely that messaging will improve in time.
+It is not uncommon that ``torch.compile`` will produce a significantly *slower* model or one with higher memory usage. You'll need to invest time in this phase if the model is not among the ones that have a happy path.
+As a note, the compilation phase itself will take some time, taking up to several minutes.
+For these reasons, we recommend that you don't invest too much time trying to apply ``torch.compile`` during development, and rather evaluate its effectiveness toward the end when you are about to launch long-running, expensive experiments.
Always compare the speed and memory usage of the compiled model against the original model!


@@ -301,4 +304,18 @@ However, should you have issues compiling DDP and FSDP models, you can opt out o
    model = fabric.setup(model, _reapply_compile=False)


+----
+
+
+********************
+Additional Resources
+********************
+
+Here are a few resources for further reading after you complete this tutorial:
+
+- `PyTorch 2.0 Paper <https://pytorch.org/blog/pytorch-2-paper-tutorial/>`_
+- `GenAI with PyTorch 2.0 blog post series <https://pytorch.org/blog/accelerating-generative-ai-4/>`_
+- `Training Production AI Models with PyTorch 2.0 <https://pytorch.org/blog/training-production-ai-models/>`_
+- `Empowering Models with Performance: The Art of Generalized Model Transformation Approach <https://pytorch.org/blog/empowering-models-performance/>`_
+
|

docs/source-fabric/api/fabric_args.rst

Lines changed: 4 additions & 1 deletion
@@ -36,13 +36,16 @@ See also: :doc:`../fundamentals/accelerators`
strategy
========

-Choose a training strategy: ``"dp"``, ``"ddp"``, ``"ddp_spawn"``, ``"xla"``, ``"deepspeed"``, ``"fsdp"````.
+Choose a training strategy: ``"dp"``, ``"ddp"``, ``"ddp_spawn"``, ``"ddp_find_unused_parameters_true"``, ``"xla"``, ``"deepspeed"``, ``"fsdp"``.

.. code-block:: python

    # Running with the DistributedDataParallel strategy on 4 GPUs
    fabric = Fabric(strategy="ddp", accelerator="gpu", devices=4)

+    # Running with the DDP strategy with find unused parameters enabled on 4 GPUs
+    fabric = Fabric(strategy="ddp_find_unused_parameters_true", accelerator="gpu", devices=4)
+
    # Running with the DDP Spawn strategy using 4 CPU processes
    fabric = Fabric(strategy="ddp_spawn", accelerator="cpu", devices=4)

docs/source-fabric/guide/checkpoint/distributed_checkpoint.rst

Lines changed: 2 additions & 2 deletions
@@ -187,7 +187,7 @@ It is possible to convert a distributed checkpoint to a regular, single-file che

.. code-block:: bash

-    python -m lightning.fabric.utilities.consolidate_checkpoint path/to/my/checkpoint
+    fabric consolidate path/to/my/checkpoint

You will need to do this for example if you want to load the checkpoint into a script that doesn't use FSDP, or need to export the checkpoint to a different format for deployment, evaluation, etc.

@@ -202,7 +202,7 @@ You will need to do this for example if you want to load the checkpoint into a s

.. code-block:: bash

-    python -m lightning.fabric.utilities.consolidate_checkpoint my-checkpoint.ckpt
+    fabric consolidate my-checkpoint.ckpt

This saves a new file ``my-checkpoint.ckpt.consolidated`` next to the sharded checkpoint which you can load normally in PyTorch:

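The example that follows this line is cut off by the hunk; for reference, a consolidated checkpoint is a regular PyTorch file, so loading it is a plain ``torch.load`` call (a minimal sketch, editorial addition):

.. code-block:: python

    import torch

    checkpoint = torch.load("my-checkpoint.ckpt.consolidated")
    print(checkpoint.keys())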

Lines changed: 290 additions & 0 deletions
@@ -0,0 +1,290 @@
#################################
Speed up models by compiling them
#################################

Compiling your LightningModule can result in significant speedups, especially on the latest generations of GPUs.
This guide shows you how to apply `torch.compile <https://pytorch.org/docs/2.2/generated/torch.compile.html>`_ correctly in your code.

.. note::

    This requires PyTorch >= 2.0.


----


*******************************************
Apply torch.compile to your LightningModule
*******************************************

Compiling a LightningModule is as simple as adding one line of code, calling :func:`torch.compile`:

.. code-block:: python

    import torch
    import lightning as L

    # Define the model
    model = MyLightningModule()

    # Compile the model
    model = torch.compile(model)

    # Run with the Trainer
    trainer = L.Trainer()
    trainer.fit(model)


.. important::

    You should compile the model **before** calling ``trainer.fit()`` as shown above for an optimal integration with features in Trainer.

The newly added call to ``torch.compile()`` by itself doesn't do much. It just wraps the model in a "compiled model".
The actual optimization will start when calling the ``forward()`` method for the first time:

.. code-block:: python

    # 1st execution compiles the model (slow)
    output = model(input)

    # All future executions will be fast (for inputs of the same size)
    output = model(input)
    output = model(input)
    ...

**When you pass the LightningModule to the Trainer, it will automatically also compile the ``*_step()`` methods.**

When measuring the speed of a compiled model and comparing it to a regular model, it is important to
always exclude the first call to ``forward()``/``*_step()`` from your measurements, since it includes the compilation time.

.. collapse:: Full example with benchmark

    Below is an example that measures the speedup you get when compiling the InceptionV3 from TorchVision.

    .. code-block:: python

        import statistics
        import torch
        import torchvision.models as models
        import lightning as L
        from torch.utils.data import DataLoader


        class MyLightningModule(L.LightningModule):
            def __init__(self):
                super().__init__()
                self.model = models.inception_v3()

            def training_step(self, batch):
                return self.model(batch).logits.sum()

            def train_dataloader(self):
                return DataLoader([torch.randn(3, 512, 512) for _ in range(256)], batch_size=16)

            def configure_optimizers(self):
                return torch.optim.SGD(self.parameters(), lr=0.01)


        class Benchmark(L.Callback):
            """A callback that measures the median execution time between the start and end of a batch."""
            def __init__(self):
                self.start = torch.cuda.Event(enable_timing=True)
                self.end = torch.cuda.Event(enable_timing=True)
                self.times = []

            def median_time(self):
                return statistics.median(self.times)

            def on_train_batch_start(self, trainer, *args, **kwargs):
                self.start.record()

            def on_train_batch_end(self, trainer, *args, **kwargs):
                # Exclude the first iteration to let the model warm up
                if trainer.global_step > 1:
                    self.end.record()
                    torch.cuda.synchronize()
                    self.times.append(self.start.elapsed_time(self.end) / 1000)


        model = MyLightningModule()

        # Compile!
        compiled_model = torch.compile(model)

        # Measure the median iteration time with uncompiled model
        benchmark = Benchmark()
        trainer = L.Trainer(accelerator="cuda", devices=1, max_steps=10, callbacks=[benchmark])
        trainer.fit(model)
        eager_time = benchmark.median_time()

        # Measure the median iteration time with compiled model
        benchmark = Benchmark()
        trainer = L.Trainer(accelerator="cuda", devices=1, max_steps=10, callbacks=[benchmark])
        trainer.fit(compiled_model)
        compile_time = benchmark.median_time()

        # Compare the speedup for the compiled execution
        speedup = eager_time / compile_time
        print(f"Eager median time: {eager_time:.4f} seconds")
        print(f"Compile median time: {compile_time:.4f} seconds")
        print(f"Speedup: {speedup:.1f}x")


    On an NVIDIA A100 SXM4 40GB with PyTorch 2.2.0, CUDA 12.1, we get the following speedup:

    .. code-block:: text

        Eager median time: 0.0863 seconds
        Compile median time: 0.0709 seconds
        Speedup: 1.2x

----


******************
Avoid graph breaks
******************

When ``torch.compile`` looks at the code in your model's ``forward()`` or ``*_step()`` method, it will try to compile as much of the code as possible.
If there are regions in the code that it doesn't understand, it will introduce a so-called "graph break" that essentially splits the code in optimized and unoptimized parts.
Graph breaks aren't a deal breaker, since the optimized parts should still run faster.
But if you want to get the most out of ``torch.compile``, you might want to invest rewriting the problematic section of the code that produce the breaks.

You can check whether your model produces graph breaks by calling ``torch.compile`` with ``fullgraph=True``:

.. code-block:: python

    # Force an error if there is a graph break in the model
    model = torch.compile(model, fullgraph=True)

Be aware that the error messages produced here are often quite cryptic, so you will likely have to do some `troubleshooting <https://pytorch.org/docs/stable/torch.compiler_troubleshooting.html>`_ to fully optimize your model.

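To see where the breaks come from without turning them into hard errors, ``torch._dynamo.explain`` can summarize them. A minimal sketch (editorial addition; ``example_input`` is a placeholder and the helper's exact call signature has changed between PyTorch releases, so verify against your version):

.. code-block:: python

    import torch
    import torch._dynamo

    # Summarize the graphs, graph breaks, and break reasons for one example input
    explanation = torch._dynamo.explain(model)(example_input)
    print(explanation)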
----


*******************
Avoid recompilation
*******************

As mentioned before, the compilation of the model happens the first time you call ``forward()`` or the first time the Trainer calls the ``*_step()`` methods.
At this point, PyTorch will inspect the input tensor(s) and optimize the compiled code for the particular shape, data type and other properties the input has.
If the shape of the input remains the same across all calls, PyTorch will reuse the compiled code it generated and you will get the best speedup.
However, if these properties change across subsequent calls to ``forward()``/``*_step()``, PyTorch will be forced to recompile the model for the new shapes, and this will significantly slow down your training if it happens on every iteration.

**When your training suddenly becomes slow, it's probably because PyTorch is recompiling the model!**
Here are some common scenarios when this can happen:

- You are using a dataset with different inputs or shapes for validation than for training, causing a recompilation whenever the Trainer switches between training and validation.
- Your dataset size is not divisible by the batch size, and the dataloader has ``drop_last=False`` (the default).
  The last batch in your training loop will be smaller and trigger a recompilation (see the sketch after this list).

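For the second scenario above, the usual fix is to drop the incomplete final batch so every step sees the same shape. A minimal sketch (editorial addition; ``train_dataset`` is a placeholder):

.. code-block:: python

    from torch.utils.data import DataLoader

    # Every batch now has exactly 16 samples, so the input shape never changes
    train_loader = DataLoader(train_dataset, batch_size=16, drop_last=True)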
Ideally, you should try to make the input shape(s) to ``forward()`` static.
However, when this is not possible, you can request PyTorch to compile the code by taking into account possible changes to the input shapes.

.. code-block:: python

    # On PyTorch < 2.2
    model = torch.compile(model, dynamic=True)

A model compiled with ``dynamic=True`` will typically be slower than a model compiled with static shapes, but it will avoid the extreme cost of recompilation every iteration.
On PyTorch 2.2 and later, ``torch.compile`` will detect dynamism automatically and you should no longer need to set this.

If you still see recompilation issues after dealing with the aforementioned cases, there is a `Compile Profiler in PyTorch <https://pytorch.org/docs/stable/torch.compiler_troubleshooting.html#excessive-recompilation>`_ for further investigation.


----


***********************************
Experiment with compilation options
***********************************

There are optional settings that, depending on your model, can give additional speedups.

**CUDA Graphs:** By enabling CUDA Graphs, CUDA will record all computations in a graph and replay it every time forward and backward is called.
The requirement is that your model must be static, i.e., the input shape must not change and your model must execute the same operations every time.
Enabling CUDA Graphs often results in a significant speedup, but sometimes also increases the memory usage of your model.

.. code-block:: python

    # Enable CUDA Graphs
    compiled_model = torch.compile(model, mode="reduce-overhead")

    # This does the same
    compiled_model = torch.compile(model, options={"triton.cudagraphs": True})

|

**Shape padding:** The specific shape/size of the tensors involved in the computation of your model (input, activations, weights, gradients, etc.) can have an impact on the performance.
With shape padding enabled, ``torch.compile`` can extend the tensors by padding to a size that gives a better memory alignment.
Naturally, the tradeoff here is that it will consume a bit more memory.

.. code-block:: python

    # Default is False
    compiled_model = torch.compile(model, options={"shape_padding": True})


You can find a full list of compile options in the `PyTorch documentation <https://pytorch.org/docs/stable/generated/torch.compile.html>`_.


----


**************************************
A note about torch.compile in practice
**************************************

In practice, you will find that ``torch.compile`` may not work well at first or may be counter-productive to performance.
Compilation may fail with cryptic error messages that are hard to debug, luckily the PyTorch team is responsive and it's likely that messaging will improve in time.
It is not uncommon that ``torch.compile`` will produce a significantly *slower* model or one with higher memory usage. You'll need to invest time in this phase if the model is not among the ones that have a happy path.
As a note, the compilation phase itself will take some time, taking up to several minutes.
For these reasons, we recommend that you don't invest too much time trying to apply ``torch.compile`` during development, and rather evaluate its effectiveness toward the end when you are about to launch long-running, expensive experiments.
Always compare the speed and memory usage of the compiled model against the original model!

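To make that comparison concrete, peak GPU memory can be recorded around each run in addition to the timing callback shown earlier. A minimal sketch (editorial addition, assuming a CUDA device):

.. code-block:: python

    import torch

    torch.cuda.reset_peak_memory_stats()
    trainer.fit(model)  # repeat with a fresh Trainer and trainer.fit(compiled_model), then compare
    print(f"Peak memory: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")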
----


***********
Limitations
***********

There are a few limitations you should be aware of when using ``torch.compile`` **in conjunction with the Trainer**:

* The Trainer currently does not reapply ``torch.compile`` over DDP/FSDP, meaning distributed operations can't benefit from speed ups at the moment.
  This limitation will be lifted in the future.

* In some cases, using ``self.log()`` in your LightningModule will cause compilation errors.
  Until addressed, you can work around these issues by applying ``torch.compile`` to the submodule(s) of your LightningModule rather than to the entire LightningModule at once.

  .. code-block:: python

      import torch
      import lightning as L

      class MyLightningModule(L.LightningModule):
          def __init__(self):
              super().__init__()
              self.model = MySubModule()
              self.model = torch.compile(self.model)
              ...


----


********************
Additional Resources
********************

Here are a few resources for further reading after you complete this tutorial:

- `PyTorch 2.0 Paper <https://pytorch.org/blog/pytorch-2-paper-tutorial/>`_
- `GenAI with PyTorch 2.0 blog post series <https://pytorch.org/blog/accelerating-generative-ai-4/>`_
- `Training Production AI Models with PyTorch 2.0 <https://pytorch.org/blog/training-production-ai-models/>`_
- `Empowering Models with Performance: The Art of Generalized Model Transformation Approach <https://pytorch.org/blog/empowering-models-performance/>`_

|
