
dtype issue when using accelerate #3685

@ZhangYi1999

Description


When I use accelerate in the lerobot repo to train a policy on multiple GPUs, I get an error that never occurs without accelerate:

[rank0]: RuntimeError: expected scalar type Float but found BFloat16
[rank1]: RuntimeError: expected scalar type Float but found BFloat16

I hit this problem when training the PI0 policy and a customized DiT policy. However, I don't know how to identify which layer or tensor is causing it.

Is there a mixed-precision configuration that fixes this, so that I don't have to track down which part of the network has a mismatched dtype?
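Not an answer to the config question, but one way to locate the offending layer without guessing: list which parameters ended up in which dtype, and register forward pre-hooks that flag any module whose input dtype differs from its weight dtype (the usual trigger for "expected scalar type Float but found BFloat16"). A minimal sketch, with a hypothetical toy model standing in for the real policy:

```python
import torch
import torch.nn as nn

def report_param_dtypes(model: nn.Module) -> dict:
    """Group parameter names by dtype to spot mixed float32/bfloat16 weights."""
    by_dtype = {}
    for name, p in model.named_parameters():
        by_dtype.setdefault(p.dtype, []).append(name)
    return by_dtype

def add_dtype_mismatch_hooks(model: nn.Module) -> None:
    """Print every module whose first tensor input has a different dtype
    than its weight, just before the module runs."""
    def hook(module, args):
        weight = getattr(module, "weight", None)
        tensors = [a for a in args if isinstance(a, torch.Tensor)]
        if weight is not None and tensors and tensors[0].dtype != weight.dtype:
            print(f"{module.__class__.__name__}: "
                  f"input {tensors[0].dtype}, weight {weight.dtype}")
    for m in model.modules():
        m.register_forward_pre_hook(hook)

# Hypothetical model with a deliberately mismatched second layer.
model = nn.Sequential(nn.Linear(4, 4), nn.Linear(4, 4))
model[1].to(torch.bfloat16)

report = report_param_dtypes(model)   # shows float32 and bfloat16 groups
add_dtype_mismatch_hooks(model)       # next forward pass prints the culprit
```

Running one forward pass of a real batch after installing the hooks should print the first layer where the dtypes diverge, which narrows the search to a single module.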

This is my accelerate environment and config:

- `Accelerate` version: 1.9.0
- Platform: Linux-5.15.0-140-generic-x86_64-with-glibc2.35
- `accelerate` bash location: /opt/venv/bin/accelerate
- Python version: 3.10.12
- Numpy version: 2.2.6
- PyTorch version: 2.7.1+cu126
- PyTorch accelerator: CUDA
- System RAM: 362.13 GB
- GPU type: Tesla V100-PCIE-16GB
- `Accelerate` default config:
        - compute_environment: LOCAL_MACHINE
        - distributed_type: DEEPSPEED
        - mixed_precision: bf16
        - use_cpu: False
        - debug: True
        - num_processes: 2
        - machine_rank: 0
        - num_machines: 1
        - rdzv_backend: static
        - same_network: True
        - main_training_function: main
        - enable_cpu_affinity: False
        - deepspeed_config: {'gradient_accumulation_steps': 1, 'gradient_clipping': 10.0, 'offload_optimizer_device': 'none', 'offload_param_device': 'none', 'zero3_init_flag': True, 'zero_stage': 2}
        - downcast_bf16: no
        - tpu_use_cluster: False
        - tpu_use_sudo: False
        - tpu_env: []
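With `mixed_precision: bf16` under DeepSpeed, the model weights are typically cast to bfloat16, so a float32 batch coming out of the dataloader can trigger exactly this error. A common workaround (a sketch, not the lerobot code path) is to cast the floating-point tensors of each batch to the model's dtype in the training loop, leaving integer tensors such as token ids untouched:

```python
import torch

def cast_batch_to(batch, dtype):
    """Recursively cast floating-point tensors in a nested batch
    (dict / list / tuple / tensor) to `dtype`; leave other tensors as-is."""
    if isinstance(batch, torch.Tensor):
        return batch.to(dtype) if batch.is_floating_point() else batch
    if isinstance(batch, dict):
        return {k: cast_batch_to(v, dtype) for k, v in batch.items()}
    if isinstance(batch, (list, tuple)):
        return type(batch)(cast_batch_to(v, dtype) for v in batch)
    return batch

# Hypothetical batch: observations are float, action ids are integer.
batch = {"obs": torch.randn(2, 3), "action_ids": torch.arange(2)}
casted = cast_batch_to(batch, torch.bfloat16)
```

In the loop this would be something like `batch = cast_batch_to(batch, next(model.parameters()).dtype)` before the forward pass; whether that is preferable to keeping the model in fp32 and relying on autocast depends on how the policy's loss is computed.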
