
Commit 5d1cc86

[data] shard the dataset to allow multiprocessing when streaming is enabled (#7530)

* Shard the dataset when streaming to allow multiprocessing
* Allow the user to leave dataset_shards unset to preserve backward compatibility

1 parent 6d6e0f4

File tree

4 files changed: +12 −4 lines

README.md (+1 −1)

@@ -204,7 +204,7 @@ Compared to ChatGLM's [P-Tuning](https://github.com/THUDM/ChatGLM2-6B/tree/main/
 [23/08/11] We supported **[DPO training](https://arxiv.org/abs/2305.18290)** for instruction-tuned models. See [examples](examples/README.md) for usage.
-[23/07/31] We supported **dataset streaming**. Try `streaming: true` and `max_steps: 10000` arguments to load your dataset in streaming mode.
+[23/07/31] We supported **dataset streaming**. Try `streaming: true` and `max_steps: 10000` arguments to load your dataset in streaming mode. Use `dataset_shards` to enable parallel preprocessing with streaming.
 [23/07/29] We released two instruction-tuned 13B models at Hugging Face. See these Hugging Face Repos ([LLaMA-2](https://huggingface.co/hiyouga/Llama-2-Chinese-13b-chat) / [Baichuan](https://huggingface.co/hiyouga/Baichuan-13B-sft)) for details.
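Why sharding matters here, in brief: an iterable (streaming) dataset is divided among DataLoader workers shard by shard, so an unsharded stream can only ever feed a single worker no matter how many workers are configured. A minimal, dependency-free sketch of that shard-to-worker assignment (the `assign_shards` helper is hypothetical, for illustration only — it is not part of this patch):

```python
def assign_shards(num_shards: int, num_workers: int) -> dict[int, list[int]]:
    """Round-robin shard -> worker assignment (hypothetical helper).

    Each worker consumes a disjoint subset of shards; a worker with an
    empty list has nothing to read and sits idle.
    """
    return {
        w: [s for s in range(num_shards) if s % num_workers == w]
        for w in range(num_workers)
    }

print(assign_shards(4, 4))  # {0: [0], 1: [1], 2: [2], 3: [3]} — every worker busy
print(assign_shards(1, 4))  # {0: [0], 1: [], 2: [], 3: []} — three workers idle
```

This is why the new help text recommends setting `dataset_shards` equal to `dataloader_num_workers`: fewer shards than workers leaves workers idle.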

README_zh.md (+1 −1)

@@ -206,7 +206,7 @@ https://github.com/user-attachments/assets/43b700c6-a178-41db-b1f8-8190a5d3fcfc
 [23/08/11] We supported **[DPO training](https://arxiv.org/abs/2305.18290)** for instruction-tuned models. See [examples](examples/README_zh.md) for usage.
-[23/07/31] We supported **dataset streaming**. Use the `streaming: true` and `max_steps: 10000` arguments to load your dataset in streaming mode.
+[23/07/31] We supported **dataset streaming**. Use the `streaming: true` and `max_steps: 10000` arguments to load your dataset in streaming mode. Use `dataset_shards` to enable multiprocess loading.
 [23/07/29] We released two instruction-tuned 13B models on Hugging Face. See our Hugging Face repos ([LLaMA-2](https://huggingface.co/hiyouga/Llama-2-Chinese-13b-chat) / [Baichuan](https://huggingface.co/hiyouga/Baichuan-13B-sft)) for details.

src/llamafactory/data/loader.py (+6 −2)

@@ -101,10 +101,12 @@ def _load_single_dataset(
         split=dataset_attr.split,
         cache_dir=cache_dir,
         token=model_args.ms_hub_token,
-        use_streaming=data_args.streaming,
+        use_streaming=data_args.streaming and not data_args.dataset_shards,  # only True when the user requested streaming but not sharding
     )
     if isinstance(dataset, MsDataset):
         dataset = dataset.to_hf_dataset()
+    if data_args.streaming and data_args.dataset_shards:
+        dataset = dataset.to_iterable_dataset(num_shards=data_args.dataset_shards)

 elif dataset_attr.load_from == "om_hub":
     check_version("openmind>=0.8.0", mandatory=True)

@@ -131,10 +133,12 @@ def _load_single_dataset(
         split=dataset_attr.split,
         cache_dir=model_args.cache_dir,
         token=model_args.hf_hub_token,
-        streaming=data_args.streaming,
         num_proc=data_args.preprocessing_num_workers,
         trust_remote_code=model_args.trust_remote_code,
+        streaming=data_args.streaming and not data_args.dataset_shards,
     )
+    if data_args.streaming and data_args.dataset_shards:
+        dataset = dataset.to_iterable_dataset(num_shards=data_args.dataset_shards)

     if dataset_attr.num_samples is not None and not data_args.streaming:
         target_num = dataset_attr.num_samples

src/llamafactory/hparams/data_args.py (+4 −0)

@@ -83,6 +83,10 @@ class DataArguments:
         default=None,
         metadata={"help": "The number of processes to use for the pre-processing."},
     )
+    dataset_shards: Optional[int] = field(
+        default=None,
+        metadata={"help": "The number of shards to split the dataset into. Only used in streaming mode. Set it equal to dataloader_num_workers; leaving it unset while streaming keeps the dataset unsharded, so it can only be processed by a single worker."},
+    )
     max_samples: Optional[int] = field(
         default=None,
         metadata={"help": "For debugging purposes, truncate the number of examples for each dataset."},
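Putting the pieces together, a streaming run that uses the new argument might look like this (a hypothetical config fragment following the README's `key: value` argument style; the value 4 is an arbitrary example):

```yaml
streaming: true
max_steps: 10000
dataset_shards: 4          # pair with dataloader_num_workers
dataloader_num_workers: 4  # each worker consumes its own shard
```

Without `dataset_shards`, this configuration would still run, but preprocessing of the stream would be limited to a single worker.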
