How to specify the split (train/validation) for the dataset in cli #3789

ma7555 · 2025-04-07T12:32:58Z

How can I specify the split to use for training and validation?

CUDA_VISIBLE_DEVICES=0 MAX_PIXELS=262144 \
swift sft \
    --model LLM-Research/gemma-3-1b-it \
    --train_type full \
    --dataset 'swift/path-vqa#train' \
    --val_dataset 'swift/path-vqa#validation' \
    --torch_dtype bfloat16 \
    --num_train_epochs 3

Of course, this fails because #train is treated as a subset, not a split. How can I specify the split?


Jintao-Huang · 2025-04-07T15:50:39Z

https://github.com/modelscope/ms-swift/blob/main/swift/llm/dataset/dataset/mllm.py#L174

Use the 'dataset:subset' form, for example 'modelscope/coco_2014_caption:validation'.
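
For readers unfamiliar with the syntax, here is a rough sketch of how such a dataset string could be split into its parts. It assumes the dataset_id[:subset][#sample_count] form described in the ms-swift command-line documentation. The names DatasetSpec and parse_dataset_spec are hypothetical, and this is not ms-swift's actual parser.

```python
from typing import NamedTuple, Optional


class DatasetSpec(NamedTuple):
    """Parts of a dataset string (hypothetical helper type, not part of ms-swift)."""
    dataset_id: str
    subset: Optional[str]
    sample_count: Optional[int]


def parse_dataset_spec(spec: str) -> DatasetSpec:
    """Split 'dataset_id[:subset][#sample_count]' into its parts.

    Simplified on purpose: hub prefixes and local paths that contain
    ':' or '#' are not handled.
    """
    # Assumption: everything after '#' is a sample count, so it must be an integer.
    body, has_count, count = spec.partition("#")
    sample_count = int(count) if has_count else None
    # Assumption: everything after ':' names a subset.
    dataset_id, has_subset, subset = body.partition(":")
    return DatasetSpec(dataset_id, subset if has_subset else None, sample_count)


print(parse_dataset_spec("modelscope/coco_2014_caption:validation"))
```

Under these assumptions, 'swift/path-vqa#train' raises a ValueError, because 'train' is not an integer sample count. A split name therefore cannot be passed after '#'.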

ma7555 · 2025-04-07T16:55:16Z

Hello @Jintao-Huang, I have to use swift/path-vqa; it is not optional.

Jintao-Huang · 2025-04-08T13:21:35Z

https://github.com/modelscope/ms-swift/blob/main/swift/llm/dataset/data/dataset_info.json#L612

You may need to modify the source code to resolve the issue, perhaps with a modification like the following:

https://github.com/modelscope/ms-swift/blob/main/swift/llm/dataset/data/dataset_info.json#L105

    {
        "ms_dataset_id": "swift/path-vqa",
        "hf_dataset_id": "flaviagiammarino/path-vqa",
        "subsets": [{
            "name": "train",
            "split": ["train"]
        },
        {
            "name": "validation",
            "split": ["validation"]
        }],
        "columns": {
            "question": "query",
            "answer": "response"
        },
        "tags": ["multi-modal", "vqa", "medical"]
    },
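
If editing the JSON by hand is error-prone, the same change can be sketched in Python. This is a minimal illustration that works on an in-memory copy of the entry. The helper add_split_subsets is hypothetical and not part of ms-swift, and the entry below only repeats the fields shown above.

```python
import json
from typing import Any


def add_split_subsets(entry: dict[str, Any], splits: list[str]) -> dict[str, Any]:
    """Return a copy of a dataset_info entry with one subset per split.

    Each subset is named after the split it exposes, mirroring the
    structure in the snippet above. Hypothetical helper, not part of ms-swift.
    """
    patched = dict(entry)  # shallow copy, so the original entry keeps its keys
    patched["subsets"] = [{"name": split, "split": [split]} for split in splits]
    return patched


# Entry fields copied from the snippet above, minus the "subsets" key.
entry = {
    "ms_dataset_id": "swift/path-vqa",
    "hf_dataset_id": "flaviagiammarino/path-vqa",
    "columns": {"question": "query", "answer": "response"},
    "tags": ["multi-modal", "vqa", "medical"],
}

patched = add_split_subsets(entry, ["train", "validation"])
print(json.dumps(patched, indent=4))
```

With an entry like this in place, the two splits should be selectable as subsets. Following the 'dataset:subset' form suggested earlier in the thread, that would presumably be --dataset 'swift/path-vqa:train' and --val_dataset 'swift/path-vqa:validation'. This has not been verified against a particular ms-swift version.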
