Describe the bug
Hi, I've encountered an issue when saving datasets that led to the save taking up to multiple hours.
Specifically, Dataset.save_to_disk is a lot slower with num_proc>1 than with num_proc=1.
The documentation mentions that "Multiprocessing is disabled by default.", but there is no explanation of how to enable it.
Steps to reproduce the bug
import numpy as np
from datasets import Dataset
n_samples = int(4e6)
n_tokens_sample = 100
data_dict = {
    'tokens': np.random.randint(0, 100, (n_samples, n_tokens_sample)),
}
dataset = Dataset.from_dict(data_dict)
dataset.save_to_disk('test_dataset', num_proc=1)
dataset.save_to_disk('test_dataset', num_proc=4)
dataset.save_to_disk('test_dataset', num_proc=8)
This results in:
>>> dataset.save_to_disk('test_dataset', num_proc=1)
Saving the dataset (7/7 shards): 100%|██████████████| 4000000/4000000 [00:17<00:00, 228075.15 examples/s]
>>> dataset.save_to_disk('test_dataset', num_proc=4)
Saving the dataset (7/7 shards): 100%|██████████████| 4000000/4000000 [01:49<00:00, 36583.75 examples/s]
>>> dataset.save_to_disk('test_dataset', num_proc=8)
Saving the dataset (8/8 shards): 100%|██████████████| 4000000/4000000 [02:11<00:00, 30518.43 examples/s]
With larger datasets it can take hours, but I didn't benchmark that for this bug report.
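For completeness, here is a minimal timing sketch that measures the wall-clock time of each save instead of relying on the progress-bar rate (the test_dataset_numproc{n} output directories are placeholder names, not from the runs above):

import time

import numpy as np
from datasets import Dataset

n_samples = int(4e6)
n_tokens_sample = 100
dataset = Dataset.from_dict({
    'tokens': np.random.randint(0, 100, (n_samples, n_tokens_sample)),
})

# Time each save separately; only num_proc changes between runs.
for num_proc in (1, 4, 8):
    start = time.perf_counter()
    dataset.save_to_disk(f'test_dataset_numproc{num_proc}', num_proc=num_proc)
    print(f'num_proc={num_proc}: {time.perf_counter() - start:.1f} s')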
Expected behavior
I would expect using num_proc>1 to be faster than num_proc=1, not slower.
Environment info
- datasets version: 3.1.0
- Platform: Linux-5.15.153.1-microsoft-standard-WSL2-x86_64-with-glibc2.35
- Python version: 3.10.12
- huggingface_hub version: 0.26.2
- PyArrow version: 18.0.0
- Pandas version: 2.2.3
- fsspec version: 2024.6.1
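For reference, the version numbers above can be collected with a small snippet like this (assuming all of the listed packages are installed in the environment being reported):

import platform
import sys

import datasets
import fsspec
import huggingface_hub
import pandas
import pyarrow

# Print the same fields as the environment info above.
print('datasets version:', datasets.__version__)
print('Platform:', platform.platform())
print('Python version:', sys.version.split()[0])
print('huggingface_hub version:', huggingface_hub.__version__)
print('PyArrow version:', pyarrow.__version__)
print('Pandas version:', pandas.__version__)
print('fsspec version:', fsspec.__version__)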