System Info
- `lerobot` version: 0.1.0
- Platform: macOS-15.4.1-arm64-arm-64bit
- Python version: 3.10.13
- Huggingface_hub version: 0.30.2
- Dataset version: 3.5.1
- Numpy version: 2.2.5
- PyTorch version (GPU?): 2.6.0 (False)
- Cuda version: N/A
- Using GPU in script?: Apple Silicon - mps
Information
- One of the scripts in the examples/ folder of LeRobot
- My own task or dataset (give details below)
Reproduction
Hello,
First of all, thank you very much for the awesome project — I really appreciate the work you've put into it!
I followed the example from main/examples/10_use_so100.md and recorded 50 episodes of a simple task where the robot picks up a brick and places it into a bin. However, when I started training, I quickly noticed that the gradient norm exploded (eventually overflowing to inf), and the loss soon became NaN as well.
Do you have any suggestions on how to stabilize the training or prevent this from happening?
Train command launched:
python lerobot/scripts/train.py \
  --dataset.repo_id=fbeck/so100_test \
  --policy.type=act \
  --output_dir=outputs/train/act_so100_test_brick \
  --job_name=act_so100_test_brick \
  --policy.device=mps \
  --wandb.enable=true
Console logs:
INFO 2025-05-03 16:31:30 ts/train.py:127 Creating dataset
Resolving data files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [00:00<00:00, 314415.59it/s]
INFO 2025-05-03 16:31:30 ts/train.py:138 Creating policy
INFO 2025-05-03 16:31:31 ts/train.py:144 Creating optimizer and scheduler
INFO 2025-05-03 16:31:31 ts/train.py:156 Output dir: outputs/train/act_so100_test_brick
INFO 2025-05-03 16:31:31 ts/train.py:159 cfg.steps=100000 (100K)
INFO 2025-05-03 16:31:31 ts/train.py:160 dataset.num_frames=16901 (17K)
INFO 2025-05-03 16:31:31 ts/train.py:161 dataset.num_episodes=50
INFO 2025-05-03 16:31:31 ts/train.py:162 num_learnable_params=51597190 (52M)
INFO 2025-05-03 16:31:31 ts/train.py:163 num_total_params=51597238 (52M)
INFO 2025-05-03 16:31:31 ts/train.py:202 Start offline training on a fixed dataset
INFO 2025-05-03 16:32:53 ts/train.py:232 step:200 smpl:2K ep:5 epch:0.09 loss:7.924 grdn:18966476.762 lr:1.0e-05 updt_s:0.378 data_s:0.033
INFO 2025-05-03 16:33:58 ts/train.py:232 step:400 smpl:3K ep:9 epch:0.19 loss:5.584 grdn:20729258418.160 lr:1.0e-05 updt_s:0.323 data_s:0.001
INFO 2025-05-03 16:35:03 ts/train.py:232 step:600 smpl:5K ep:14 epch:0.28 loss:5.579 grdn:inf lr:1.0e-05 updt_s:0.323 data_s:0.001
INFO 2025-05-03 16:36:08 ts/train.py:232 step:800 smpl:6K ep:19 epch:0.38 loss:5.584 grdn:inf lr:1.0e-05 updt_s:0.325 data_s:0.001
INFO 2025-05-03 16:37:15 ts/train.py:232 step:1K smpl:8K ep:24 epch:0.47 loss:5.492 grdn:inf lr:1.0e-05 updt_s:0.337 data_s:0.001
INFO 2025-05-03 16:38:23 ts/train.py:232 step:1K smpl:10K ep:28 epch:0.57 loss:5.564 grdn:inf lr:1.0e-05 updt_s:0.338 data_s:0.001
INFO 2025-05-03 16:39:33 ts/train.py:232 step:1K smpl:11K ep:33 epch:0.66 loss:nan grdn:nan lr:1.0e-05 updt_s:0.350 data_s:0.001
INFO 2025-05-03 16:40:42 ts/train.py:232 step:2K smpl:13K ep:38 epch:0.76 loss:nan grdn:nan lr:1.0e-05 updt_s:0.339 data_s:0.001
INFO 2025-05-03 16:41:48 ts/train.py:232 step:2K smpl:14K ep:43 epch:0.85 loss:nan grdn:nan lr:1.0e-05 updt_s:0.329 data_s:0.001
INFO 2025-05-03 16:42:53 ts/train.py:232 step:2K smpl:16K ep:47 epch:0.95 loss:nan grdn:nan lr:1.0e-05 updt_s:0.326 data_s:0.001
Expected behavior
I've already tried a few adjustments (lowering the learning rate, reducing the KL weight, and tweaking other hyperparameters), but none of these changes stabilized the parameters during training.
Training with cpu seems to work: the gradient norm stays under control.
PushT training with mps was fine.
This behaviour seems to be restricted to Apple Silicon, as other users on Macs are hitting the same problem when training ACT: https://discord.com/channels/1216765309076115607/1366842424370139216/1367828917867511891
What would you recommend as the best way to investigate this issue further?
Once again, thank you for your support and for sharing the project.
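Edit: in case it helps, here is a rough, untested sketch of the kind of check I had in mind for narrowing this down: run one forward/backward pass with identical weights and inputs on cpu and on mps, and report the first parameter whose gradient is non-finite. The tiny nn.Sequential model and random batch below are just stand-ins, not LeRobot's actual policy or dataset classes.

```python
import torch
import torch.nn as nn

def first_nonfinite_grad(model: nn.Module, batch: torch.Tensor, target: torch.Tensor) -> str | None:
    """Run one forward/backward pass and return the name of the first parameter
    whose gradient contains NaN/inf, or None if all gradients are finite."""
    model.zero_grad(set_to_none=True)
    loss = nn.functional.l1_loss(model(batch), target)  # L1 term, as in ACT's reconstruction loss
    loss.backward()
    for name, param in model.named_parameters():
        if param.grad is not None and not torch.isfinite(param.grad).all():
            return name
    return None

torch.manual_seed(0)
# Placeholder model and batch; swap in the real policy and a real dataset batch.
cpu_model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 6))
batch, target = torch.randn(8, 32), torch.randn(8, 6)
print("cpu:", first_nonfinite_grad(cpu_model, batch, target))

if torch.backends.mps.is_available():
    mps_model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 6)).to("mps")
    mps_model.load_state_dict(cpu_model.state_dict())  # identical weights on both devices
    print("mps:", first_nonfinite_grad(mps_model, batch.to("mps"), target.to("mps")))
```

If cpu reports None and mps reports a layer name when run on the real policy, that would at least point at the op where things first blow up.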
I was also facing the same issue with two camera sources. One was a wrist-mounted innomaker 1080p camera and the other was an iPhone. After modifying the dataset's metadata and removing the iPhone videos, I was able to train an ACT model with non-NaN loss/grad values (though the resulting single-camera policy is really bad).
I was also on an Apple Silicon MBP. I'll test whether the issue occurs with only 2 camera sources (and whether using more than 2 camera sources fixes it).
Update: I tested training with 3 cameras and it still experiences loss/gradient explosion (all NaNs) when training on a MacBook Pro.
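A quick way to check whether one of the camera streams is actually feeding bad values into the model would be something like the sketch below: scan a few batches and report non-finite values and value ranges per image key. The `dataloader` and the `observation.images.*` key names are placeholders for however the dataset is actually set up, not a claim about a specific LeRobot API.

```python
import torch

def check_image_streams(dataloader, image_keys, num_batches=10):
    """Scan a few batches and print finiteness and value ranges per camera key."""
    for i, batch in enumerate(dataloader):
        if i >= num_batches:
            break
        for key in image_keys:
            img = batch[key].float()
            finite = torch.isfinite(img).all().item()
            print(
                f"batch {i} {key}: finite={finite} "
                f"min={img.min().item():.3f} max={img.max().item():.3f} mean={img.mean().item():.3f}"
            )

# Hypothetical usage; the key names depend on how the cameras were registered:
# check_image_streams(train_dataloader, ["observation.images.wrist", "observation.images.phone"])
```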