Dataset.save_to_disk hangs when using num_proc > 1 #7290

Open
@JohannesAck

Describe the bug

Hi, I've encountered a small issue when saving datasets that caused saving to take up to multiple hours.
Specifically, Dataset.save_to_disk is a lot slower with num_proc>1 than with num_proc=1.

The documentation mentions that "Multiprocessing is disabled by default.", but it does not explain how to enable it.

Steps to reproduce the bug

import numpy as np
from datasets import Dataset

# 4 million samples of 100 random integer tokens each
n_samples = int(4e6)
n_tokens_sample = 100
data_dict = {
    'tokens' : np.random.randint(0, 100, (n_samples, n_tokens_sample)),
}

dataset = Dataset.from_dict(data_dict)
# Save the same in-memory dataset with 1, 4, and 8 worker processes
dataset.save_to_disk('test_dataset', num_proc=1)
dataset.save_to_disk('test_dataset', num_proc=4)
dataset.save_to_disk('test_dataset', num_proc=8)

This results in:

>>> dataset.save_to_disk('test_dataset', num_proc=1)
Saving the dataset (7/7 shards): 100%|██████████████| 4000000/4000000 [00:17<00:00, 228075.15 examples/s]
>>> dataset.save_to_disk('test_dataset', num_proc=4)
Saving the dataset (7/7 shards): 100%|██████████████| 4000000/4000000 [01:49<00:00, 36583.75 examples/s]
>>> dataset.save_to_disk('test_dataset', num_proc=8)
Saving the dataset (8/8 shards): 100%|██████████████| 4000000/4000000 [02:11<00:00, 30518.43 examples/s]

With larger datasets it can take hours, but I didn't benchmark that for this bug report.
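
As a side note on the shard counts in the progress bars above (7, 7, and 8), here is a rough sketch of how they could arise. It assumes the shard count is derived as "enough shards to stay under a 500 MB default shard size, but at least one per worker", and that the tokens are stored as 8-byte integers. Both are assumptions about the library's internals and about NumPy's default dtype on this platform, and `estimate_num_shards` is a hypothetical helper, not part of datasets.

```python
def estimate_num_shards(dataset_nbytes: int, num_proc: int = 1,
                        max_shard_size: int = 500 * 10**6) -> int:
    """Hypothetical helper: guess the number of shards written by save_to_disk.

    Assumed rule (not confirmed by this report): one more shard than the
    number of full max_shard_size chunks, and never fewer shards than workers.
    """
    num_shards = int(dataset_nbytes / max_shard_size) + 1
    return max(num_shards, num_proc)


n_samples = 4_000_000
n_tokens_sample = 100
# Assumption: int64 tokens, i.e. 8 bytes each; Arrow offsets are ignored.
approx_nbytes = n_samples * n_tokens_sample * 8

for num_proc in (1, 4, 8):
    print(num_proc, estimate_num_shards(approx_nbytes, num_proc))
```

Under these assumptions the estimate gives 7, 7, and 8 shards, so each worker in the num_proc=4 and num_proc=8 runs handles only one or two shards. That suggests the slowdown comes from per-process overhead rather than from the sharding itself, though this is a guess.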

Expected behavior

I would expect using num_proc>1 to be faster instead of slower than num_proc=1.
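
To quantify the gap more systematically, a small timing harness along these lines might help. It is only a sketch: `time_saves` and `slowdown_vs_single` are hypothetical helper names, and the harness takes any callable so it can wrap `dataset.save_to_disk` without depending on it.

```python
import time
from typing import Callable, Dict, Iterable


def time_saves(save: Callable[[int], None],
               num_procs: Iterable[int]) -> Dict[int, float]:
    """Call save(num_proc) for each value and return wall-clock seconds."""
    timings = {}
    for num_proc in num_procs:
        start = time.perf_counter()
        save(num_proc)
        timings[num_proc] = time.perf_counter() - start
    return timings


def slowdown_vs_single(timings: Dict[int, float]) -> Dict[int, float]:
    """Express each timing as a multiple of the num_proc=1 timing."""
    base = timings[1]
    return {num_proc: seconds / base for num_proc, seconds in timings.items()}


# Intended usage with the dataset from the reproduction steps (not run here):
# timings = time_saves(
#     lambda n: dataset.save_to_disk(f'test_dataset_{n}', num_proc=n),
#     [1, 4, 8],
# )
# print(slowdown_vs_single(timings))
```

Writing each run to its own directory, as in the commented usage, keeps one run from overwriting another's files. With the timings reported above (about 17 s for num_proc=1 and 109 s for num_proc=4), the ratio is roughly 6.4x slower instead of faster.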

Environment info

  • datasets version: 3.1.0
  • Platform: Linux-5.15.153.1-microsoft-standard-WSL2-x86_64-with-glibc2.35
  • Python version: 3.10.12
  • huggingface_hub version: 0.26.2
  • PyArrow version: 18.0.0
  • Pandas version: 2.2.3
  • fsspec version: 2024.6.1
