When I use Accelerate with the lerobot repo to train a policy on multiple GPUs, I hit an error that never occurs without Accelerate:
[rank0]: RuntimeError: expected scalar type Float but found BFloat16
[rank1]: RuntimeError: expected scalar type Float but found BFloat16
I hit this error when training both the PI0 policy and a custom DiT policy, but I don't know how to identify which layer or tensor causes it. Is there a mixed-precision configuration that avoids the problem, so I don't have to track down which part of the network has the mismatched dtype?
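For reference, one way to narrow down the offending layer without bisecting the model by hand is to attach forward pre-hooks that compare each module's parameter dtype to the dtype of its incoming tensors. This is only a debugging sketch (the toy model at the bottom is a hypothetical stand-in for the policy, not lerobot code):

```python
import torch
import torch.nn as nn

def report_dtype_mismatches(model: nn.Module):
    """Print every module whose own parameters disagree in dtype
    with the floating-point tensors entering its forward pass."""
    def make_hook(name):
        def hook(module, inputs):
            param = next(module.parameters(recurse=False), None)
            if param is None:
                return  # module holds no parameters of its own
            for i, t in enumerate(inputs):
                if torch.is_tensor(t) and t.is_floating_point() and t.dtype != param.dtype:
                    print(f"{name}: params are {param.dtype}, input[{i}] is {t.dtype}")
        return hook

    for name, module in model.named_modules():
        module.register_forward_pre_hook(make_hook(name))

# Hypothetical usage on a toy model that reproduces the mismatch:
model = nn.Sequential(nn.Linear(8, 8), nn.LayerNorm(8))
model[0].bfloat16()  # cast only the Linear; the LayerNorm stays fp32 on purpose
report_dtype_mismatches(model)
try:
    model(torch.randn(2, 8, dtype=torch.bfloat16))
except RuntimeError as e:
    print("raised:", e)  # the same class of dtype-mismatch error as above
```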
This is my environment and default `accelerate` config:
- `Accelerate` version: 1.9.0
- Platform: Linux-5.15.0-140-generic-x86_64-with-glibc2.35
- `accelerate` bash location: /opt/venv/bin/accelerate
- Python version: 3.10.12
- Numpy version: 2.2.6
- PyTorch version: 2.7.1+cu126
- PyTorch accelerator: CUDA
- System RAM: 362.13 GB
- GPU type: Tesla V100-PCIE-16GB
- `Accelerate` default config:
- compute_environment: LOCAL_MACHINE
- distributed_type: DEEPSPEED
- mixed_precision: bf16
- use_cpu: False
- debug: True
- num_processes: 2
- machine_rank: 0
- num_machines: 1
- rdzv_backend: static
- same_network: True
- main_training_function: main
- enable_cpu_affinity: False
- deepspeed_config: {'gradient_accumulation_steps': 1, 'gradient_clipping': 10.0, 'offload_optimizer_device': 'none', 'offload_param_device': 'none', 'zero3_init_flag': True, 'zero_stage': 2}
- downcast_bf16: no
- tpu_use_cluster: False
- tpu_use_sudo: False
- tpu_env: []
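If there is no config-only fix, a workaround I'm considering (not verified on lerobot) is to cast all floating-point inputs to bf16 before the forward pass and run the step under `accelerator.autocast()`. A minimal sketch with toy stand-ins for the policy, optimizer, and data, simplified to a plain (non-DeepSpeed) launch:

```python
import torch
from accelerate import Accelerator

accelerator = Accelerator(mixed_precision="bf16")

# Toy stand-ins for the real policy, optimizer, and dataloader.
model = torch.nn.Linear(16, 4)
optimizer = torch.optim.AdamW(model.parameters())
data = [torch.randn(8, 16) for _ in range(4)]

model, optimizer = accelerator.prepare(model, optimizer)

for batch in data:
    # Cast floating-point inputs to bf16 up front so fp32 activations
    # never meet bf16 parameters (or vice versa) inside the model.
    batch = batch.to(accelerator.device, dtype=torch.bfloat16)
    with accelerator.autocast():            # autocast the forward pass
        loss = model(batch).float().mean()  # reduce the loss in fp32
    accelerator.backward(loss)
    optimizer.step()
    optimizer.zero_grad()
```

Does something like this make sense here, or is there a cleaner knob in the DeepSpeed config?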