Commit 43fd659

[Resumable IterableDataset] Add IterableDataset state_dict (#6658)

* add ex_iterable state_dict() and load_state_dict()
* style
* resuming + IterableDataset state_dict() + load_state_dict()
* minor
* fix tests
* fix one more test
* implement CyclingMultiSourcesExamplesIterable resuming
* remove unused code
* fix spark
* fix spark tests
* add test dependency
* enable long paths for git
* fix git command
* no additional deps for windows
* fix tests
* fix iter_arrow resuming
* tests
* fix map and filter resuming
* fix spark
* mark spark resuming as experimental
* docs
* fix docs
* docs
* add note
* add to docs too

1 parent 3d95159 commit 43fd659

12 files changed: +872 −149 lines changed


.github/workflows/ci.yml

+4 −3

@@ -56,9 +56,10 @@ jobs:
       - name: Install uv
         run: pip install --upgrade uv
       - name: Install dependencies
-        run: |
-          uv pip install --system "datasets[tests,metrics-tests] @ ."
-          uv pip install --system -r additional-tests-requirements.txt --no-deps
+        run: uv pip install --system "datasets[tests,metrics-tests] @ ."
+      - name: Install dependencies (latest versions)
+        if: ${{ matrix.os == 'ubuntu-latest' }}
+        run: uv pip install --system -r additional-tests-requirements.txt --no-deps
       - name: Install dependencies (latest versions)
         if: ${{ matrix.deps_versions == 'deps-latest' }}
         run: uv pip install --system --upgrade pyarrow huggingface-hub dill

additional-tests-requirements.txt

+1 −0

@@ -1,4 +1,5 @@
 unbabel-comet>=1.0.0
+git+https://github.com/pytorch/data.git
 git+https://github.com/google-research/bleurt.git
 git+https://github.com/ns-moosavi/coval.git
 git+https://github.com/hendrycks/math.git

docs/source/about_mapstyle_vs_iterable.mdx

+31 −0

@@ -205,6 +205,37 @@ for epoch in range(n_epochs):
     pass
 ```
 
+## Checkpoint and resuming differences
+
+If your training loop stops, you may want to restart the training from where it was. To do so you can save a checkpoint of your model and optimizers, as well as your data loader.
+
+To restart the iteration of a map-style dataset, you can simply skip the first examples:
+
+```python
+my_dataset = my_dataset.select(range(start_index, len(my_dataset)))
+```
+
+But if you use a `DataLoader` with a `Sampler`, you should instead save the state of your sampler (you might have to write a custom sampler that allows resuming).
+
+On the other hand, iterable datasets don't provide random access to a specific example index to resume from. But you can use [`IterableDataset.state_dict`] and [`IterableDataset.load_state_dict`] to resume from a checkpoint instead, similarly to what you can do for models and optimizers:
+
+```python
+>>> iterable_dataset = Dataset.from_dict({"a": range(6)}).to_iterable_dataset(num_shards=3)
+>>> # save in the middle of training
+>>> state_dict = iterable_dataset.state_dict()
+>>> # and resume later
+>>> iterable_dataset.load_state_dict(state_dict)
+```
+
+Under the hood, the iterable dataset keeps track of the current shard being read and the example index in the current shard, and it stores this info in the `state_dict`.
+
+To resume from a checkpoint, the dataset skips all the shards that were previously read to restart from the current shard.
+Then it reads the shard and skips examples until it reaches the exact example from the checkpoint.
+
+Therefore restarting a dataset is quite fast, since it will not re-read the shards that have already been iterated on. Still, resuming a dataset is generally not instantaneous since it has to restart reading from the beginning of the current shard and skip examples until it reaches the checkpoint location.
+
+This can be used with the `StatefulDataLoader` from `torchdata`, see [streaming with a PyTorch DataLoader](./use_with_pytorch#stream-data).
+
 ## Switch from map-style to iterable
 
 If you want to benefit from the "lazy" behavior of an [`IterableDataset`] or their speed advantages, you can switch your map-style [`Dataset`] to an [`IterableDataset`]:
docs/source/package_reference/main_classes.mdx

+2 −0

@@ -172,6 +172,8 @@ The base class [`IterableDataset`] implements an iterable Dataset backed by pyth
     - shuffle
     - skip
     - take
+    - load_state_dict
+    - state_dict
     - info
     - split
     - builder_name
docs/source/stream.mdx

+58 −0

@@ -360,3 +360,61 @@ Lastly, create a simple training loop and start training:
 </frameworkcontent>
 
 <!-- TODO: Write the TF content! -->
+
+### Save a dataset checkpoint and resume iteration
+
+If your training loop stops, you may want to restart the training from where it was. To do so you can save a checkpoint of your model and optimizers, as well as your data loader.
+
+Iterable datasets don't provide random access to a specific example index to resume from, but you can use [`IterableDataset.state_dict`] and [`IterableDataset.load_state_dict`] to resume from a checkpoint instead, similarly to what you can do for models and optimizers:
+
+```python
+>>> iterable_dataset = Dataset.from_dict({"a": range(6)}).to_iterable_dataset(num_shards=3)
+>>> for idx, example in enumerate(iterable_dataset):
+...     print(example)
+...     if idx == 2:
+...         state_dict = iterable_dataset.state_dict()
+...         print("checkpoint")
+...         break
+>>> iterable_dataset.load_state_dict(state_dict)
+>>> print("restart from checkpoint")
+>>> for example in iterable_dataset:
+...     print(example)
+```
+
+Returns:
+
+```
+{'a': 0}
+{'a': 1}
+{'a': 2}
+checkpoint
+restart from checkpoint
+{'a': 3}
+{'a': 4}
+{'a': 5}
+```
+
+Under the hood, the iterable dataset keeps track of the current shard being read and the example index in the current shard, and it stores this info in the `state_dict`.
+
+To resume from a checkpoint, the dataset skips all the shards that were previously read to restart from the current shard.
+Then it reads the shard and skips examples until it reaches the exact example from the checkpoint.
+
+Therefore restarting a dataset is quite fast, since it will not re-read the shards that have already been iterated on. Still, resuming a dataset is generally not instantaneous since it has to restart reading from the beginning of the current shard and skip examples until it reaches the checkpoint location.
+
+This can be used with the `StatefulDataLoader` from `torchdata`:
+
+```python
+>>> from torchdata.stateful_dataloader import StatefulDataLoader
+>>> iterable_dataset = load_dataset("deepmind/code_contests", streaming=True, split="train")
+>>> dataloader = StatefulDataLoader(iterable_dataset, batch_size=32, num_workers=4)
+>>> # checkpoint
+>>> state_dict = dataloader.state_dict()  # uses iterable_dataset.state_dict() under the hood
+>>> # resume from checkpoint
+>>> dataloader.load_state_dict(state_dict)  # uses iterable_dataset.load_state_dict() under the hood
+```
+
+<Tip>
+
+Resuming returns exactly where the checkpoint was saved except in two cases: 1) examples from shuffle buffers are lost when resuming and the buffers are refilled with new data, and 2) combinations of `.with_format(arrow)` and batched `.map()` may skip one batch.
+
+</Tip>
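The new docs section says to checkpoint the model and optimizers together with the data loader. A minimal sketch of how those pieces could be combined (not from this commit; the stand-in linear model and dummy loss are placeholders, and the checkpoint step is arbitrary):

```python
# Illustrative sketch only: keep model, optimizer and dataloader state in one checkpoint.
# A real loop would compute the loss from `batch`; a stand-in model and dummy loss keep
# the example self-contained. Assumes torch and torchdata are installed.
import torch
from datasets import load_dataset
from torchdata.stateful_dataloader import StatefulDataLoader

iterable_dataset = load_dataset("deepmind/code_contests", streaming=True, split="train")
dataloader = StatefulDataLoader(iterable_dataset, batch_size=32, num_workers=4)

model = torch.nn.Linear(1, 1)  # stand-in model
optimizer = torch.optim.AdamW(model.parameters())

checkpoint = None
for step, batch in enumerate(dataloader):
    loss = model(torch.ones(1)).sum()  # placeholder loss that ignores `batch`
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    if step == 100:
        # everything needed to resume, in one dict; in practice you would torch.save() it
        checkpoint = {
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
            "dataloader": dataloader.state_dict(),  # wraps IterableDataset.state_dict()
        }
        break

# resuming: restore each component from the same dict
model.load_state_dict(checkpoint["model"])
optimizer.load_state_dict(checkpoint["optimizer"])
dataloader.load_state_dict(checkpoint["dataloader"])
```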

docs/source/use_with_pytorch.mdx

+14 −0

@@ -213,6 +213,20 @@ If the dataset is split in several shards (i.e. if the dataset consists of multi
 
 In this case each worker is given a subset of the list of shards to stream from.
 
+If you need a DataLoader that you can checkpoint and resume in the middle of training, you can use the `StatefulDataLoader` from [torchdata](https://github.com/pytorch/data):
+
+```py
+>>> from torchdata.stateful_dataloader import StatefulDataLoader
+>>> my_iterable_dataset = load_dataset("deepmind/code_contests", streaming=True, split="train")
+>>> dataloader = StatefulDataLoader(my_iterable_dataset, batch_size=32, num_workers=4)
+>>> # save in the middle of training
+>>> state_dict = dataloader.state_dict()
+>>> # and resume later
+>>> dataloader.load_state_dict(state_dict)
+```
+
+This is possible thanks to [`IterableDataset.state_dict`] and [`IterableDataset.load_state_dict`].
+
 ### Distributed
 
 To split your dataset across your training nodes, you can use [`datasets.distributed.split_dataset_by_node`]:
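To survive an actual process restart, the dataloader state also has to be persisted. A small sketch under the same setup as the docs snippet above (not from this commit; the file name and checkpoint step are arbitrary, and it assumes the state dict is picklable):

```python
# Sketch, not from this commit: write the StatefulDataLoader state to disk so a new
# process can resume streaming where the previous one stopped.
import pickle

from datasets import load_dataset
from torchdata.stateful_dataloader import StatefulDataLoader

def make_dataloader():
    # the dataset and dataloader must be rebuilt identically before restoring state
    ds = load_dataset("deepmind/code_contests", streaming=True, split="train")
    return StatefulDataLoader(ds, batch_size=32, num_workers=4)

dataloader = make_dataloader()
for step, batch in enumerate(dataloader):
    if step == 10:
        with open("loader_state.pkl", "wb") as f:
            pickle.dump(dataloader.state_dict(), f)
        break

# in a fresh process: rebuild, restore the state, then keep iterating
dataloader = make_dataloader()
with open("loader_state.pkl", "rb") as f:
    dataloader.load_state_dict(pickle.load(f))
for batch in dataloader:
    ...  # training continues right after the checkpoint
```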
