Describe the bug
Hi, I've encountered an issue when saving datasets that led to the save taking up to multiple hours.
Specifically, Dataset.save_to_disk is a lot slower with num_proc>1 than with num_proc=1.
The documentation mentions that "Multiprocessing is disabled by default.", but there is no explanation of how to enable it.
Steps to reproduce the bug
import numpy as np
from datasets import Dataset
n_samples = int(4e6)
n_tokens_sample = 100
data_dict = {
    'tokens': np.random.randint(0, 100, (n_samples, n_tokens_sample)),
}
dataset = Dataset.from_dict(data_dict)
dataset.save_to_disk('test_dataset', num_proc=1)
dataset.save_to_disk('test_dataset', num_proc=4)
dataset.save_to_disk('test_dataset', num_proc=8)
This results in:
>>> dataset.save_to_disk('test_dataset', num_proc=1)
Saving the dataset (7/7 shards): 100%|██████████████| 4000000/4000000 [00:17<00:00, 228075.15 examples/s]
>>> dataset.save_to_disk('test_dataset', num_proc=4)
Saving the dataset (7/7 shards): 100%|██████████████| 4000000/4000000 [01:49<00:00, 36583.75 examples/s]
>>> dataset.save_to_disk('test_dataset', num_proc=8)
Saving the dataset (8/8 shards): 100%|██████████████| 4000000/4000000 [02:11<00:00, 30518.43 examples/s]
With larger datasets it can take hours, but I didn't benchmark that for this bug report.
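For completeness, here is a minimal timing sketch that measures the wall-clock time of each save instead of relying on the progress-bar rate (the test_dataset_numproc{n} output directories are placeholder names, not from the runs above):

import time

import numpy as np
from datasets import Dataset

n_samples = int(4e6)
n_tokens_sample = 100
dataset = Dataset.from_dict({
    'tokens': np.random.randint(0, 100, (n_samples, n_tokens_sample)),
})

# Time each save separately; only num_proc changes between runs.
for num_proc in (1, 4, 8):
    start = time.perf_counter()
    dataset.save_to_disk(f'test_dataset_numproc{num_proc}', num_proc=num_proc)
    print(f'num_proc={num_proc}: {time.perf_counter() - start:.1f} s')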
Expected behavior
I would expect using num_proc>1 to be faster than num_proc=1, not slower.
Environment info
- datasets version: 3.1.0
- Platform: Linux-5.15.153.1-microsoft-standard-WSL2-x86_64-with-glibc2.35
- Python version: 3.10.12
- huggingface_hub version: 0.26.2
- PyArrow version: 18.0.0
- Pandas version: 2.2.3
- fsspec version: 2024.6.1
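For reference, the version numbers above can be collected with a small snippet like this (assuming all of the listed packages are installed in the environment being reported):

import platform
import sys

import datasets
import fsspec
import huggingface_hub
import pandas
import pyarrow

# Print the same fields as the environment info above.
print('datasets version:', datasets.__version__)
print('Platform:', platform.platform())
print('Python version:', sys.version.split()[0])
print('huggingface_hub version:', huggingface_hub.__version__)
print('PyArrow version:', pyarrow.__version__)
print('Pandas version:', pandas.__version__)
print('fsspec version:', fsspec.__version__)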