README.md (+2, -2)
@@ -204,7 +204,7 @@ Compared to ChatGLM's [P-Tuning](https://github.com/THUDM/ChatGLM2-6B/tree/main/
 
 [23/08/11] We supported **[DPO training](https://arxiv.org/abs/2305.18290)** for instruction-tuned models. See [examples](examples/README.md) for usage.
 
-[23/07/31] We supported **dataset streaming**. Try `streaming: true` and `max_steps: 10000` arguments to load your dataset in streaming mode. Use `dataset_shards` to enable parallel preprocessing with streaming.
+[23/07/31] We supported **dataset streaming**. Try `streaming: true` and `max_steps: 10000` arguments to load your dataset in streaming mode.
 
 [23/07/29] We released two instruction-tuned 13B models at Hugging Face. See these Hugging Face Repos ([LLaMA-2](https://huggingface.co/hiyouga/Llama-2-Chinese-13b-chat) / [Baichuan](https://huggingface.co/hiyouga/Baichuan-13B-sft)) for details.
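
The `streaming: true` / `max_steps: 10000` pairing in the changelog entry can be pictured with a small standard-library sketch. A streamed dataset is consumed lazily and has no known length, so the loop needs an explicit step budget to terminate. The names `stream_examples`, `batched`, and `train`, as well as the batch size, are hypothetical and do not reflect LLaMA-Factory's actual code paths.

```python
from itertools import islice
from typing import Iterator, List


def stream_examples() -> Iterator[dict]:
    """Hypothetical stand-in for a streamed dataset: lazy records, no len()."""
    index = 0
    while True:  # an endless source, like a large corpus read on the fly
        yield {"id": index, "text": f"example {index}"}
        index += 1


def batched(source: Iterator[dict], batch_size: int) -> Iterator[List[dict]]:
    """Group a lazy stream into fixed-size batches without materializing it."""
    while True:
        batch = list(islice(source, batch_size))
        if not batch:
            return
        yield batch


def train(max_steps: int, batch_size: int = 4) -> int:
    """Run at most `max_steps` steps; the stream itself never signals an end."""
    steps_done = 0
    for _batch in islice(batched(stream_examples(), batch_size), max_steps):
        # A real trainer would compute a loss and update weights here.
        steps_done += 1
    return steps_done


print(train(max_steps=10))  # stops because of max_steps, not because data ran out
```

Without the step budget, the loop above would never finish, which is why the entry pairs the two arguments.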
src/llamafactory/data/loader.py (+4, -6)
@@ -101,12 +101,10 @@ def _load_single_dataset(
 split=dataset_attr.split,
 cache_dir=cache_dir,
 token=model_args.ms_hub_token,
-use_streaming=data_args.streaming and not data_args.dataset_shards,  # only set to True when user specified streaming but do not want dataset to be sharded
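
The deleted `use_streaming` expression combined two settings. A minimal sketch of how it evaluated, using a hypothetical `DataArgs` stand-in that holds only the two fields the line read (it is not the real `DataArguments` class):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DataArgs:
    """Hypothetical stand-in with only the two fields used by the removed line."""
    streaming: bool = False
    dataset_shards: Optional[int] = None


def removed_use_streaming(data_args: DataArgs) -> bool:
    """The expression from the deleted line, copied as-is."""
    return data_args.streaming and not data_args.dataset_shards


# Print the result for each combination of the two settings.
for args in (DataArgs(False, None), DataArgs(True, None), DataArgs(True, 4)):
    print(args, "->", removed_use_streaming(args))
```

The expression is true only when streaming is requested and no shard count is given; note that `dataset_shards=0` behaves like `None` here. Since this commit also deletes the `dataset_shards` field, the second operand no longer has anything to read.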
src/llamafactory/hparams/data_args.py (-4)
@@ -83,10 +83,6 @@ class DataArguments:
         default=None,
         metadata={"help": "The number of processes to use for the pre-processing."},
     )
-    dataset_shards: Optional[int] = field(
-        default=None,
-        metadata={"help": "The number of shards to split the dataset into. Only used in streaming mode. This should be set to the same as dataloader_num_workers. Not setting this while streaming data will cause the dataset to be non-sharded and thus only can be processed using one worker."},
-    )
     max_samples: Optional[int] = field(
         default=None,
         metadata={"help": "For debugging purposes, truncate the number of examples for each dataset."},