Issue with Gradient Explosion and NaN Loss in SO100 Training Example #1066

Open
fbeck-dev opened this issue May 3, 2025 · 3 comments
Labels: bug, policies

Comments


fbeck-dev commented May 3, 2025

System Info

- `lerobot` version: 0.1.0
- Platform: macOS-15.4.1-arm64-arm-64bit
- Python version: 3.10.13
- Huggingface_hub version: 0.30.2
- Dataset version: 3.5.1
- Numpy version: 2.2.5
- PyTorch version (GPU?): 2.6.0 (False)
- Cuda version: N/A
- Using GPU in script?: Apple Silicon - mps

Information

  • One of the scripts in the examples/ folder of LeRobot
  • My own task or dataset (give details below)

Reproduction

Hello,

First of all, thank you very much for the awesome project — I really appreciate the work you've put into it!

I followed the example from main/examples/10_use_so100.md and recorded 50 episodes of a simple task where the robot picks up a brick and places it into a bin. However, when I started training, I quickly noticed that the gradient norm was exploding, and the loss eventually became NaN as well.

Do you have any suggestions on how to stabilize the training or prevent this from happening?

Train command launched:

python lerobot/scripts/train.py \
  --dataset.repo_id=fbeck/so100_test \
  --policy.type=act \
  --output_dir=outputs/train/act_so100_test_brick \
  --job_name=act_so100_test_brick \
  --policy.device=mps \
  --wandb.enable=true
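The CPU run mentioned under "Expected behavior" below was essentially the same command with only the device flag swapped, e.g. (the _cpu suffix is just to keep the output directories apart):

python lerobot/scripts/train.py \
  --dataset.repo_id=fbeck/so100_test \
  --policy.type=act \
  --output_dir=outputs/train/act_so100_test_brick_cpu \
  --job_name=act_so100_test_brick_cpu \
  --policy.device=cpu \
  --wandb.enable=true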

Config:

INFO 2025-05-03 16:31:28 ts/train.py:111 {'batch_size': 8,
'dataset': {'episodes': None,
'image_transforms': {'enable': False,
'max_num_transforms': 3,
'random_order': False,
'tfs': {'brightness': {'kwargs': {'brightness': [0.8,
1.2]},
'type': 'ColorJitter',
'weight': 1.0},
'contrast': {'kwargs': {'contrast': [0.8,
1.2]},
'type': 'ColorJitter',
'weight': 1.0},
'hue': {'kwargs': {'hue': [-0.05,
0.05]},
'type': 'ColorJitter',
'weight': 1.0},
'saturation': {'kwargs': {'saturation': [0.5,
1.5]},
'type': 'ColorJitter',
'weight': 1.0},
'sharpness': {'kwargs': {'sharpness': [0.5,
1.5]},
'type': 'SharpnessJitter',
'weight': 1.0}}},
'repo_id': 'fbeck/so100_test',
'revision': None,
'root': None,
'use_imagenet_stats': True,
'video_backend': 'torchcodec'},
'env': None,
'eval': {'batch_size': 50, 'n_episodes': 50, 'use_async_envs': False},
'eval_freq': 20000,
'job_name': 'act_so100_test_brick',
'log_freq': 200,
'num_workers': 4,
'optimizer': {'betas': [0.9, 0.999],
'eps': 1e-08,
'grad_clip_norm': 10,
'lr': 1e-05,
'type': 'adamw',
'weight_decay': 0.0001},
'output_dir': 'outputs/train/act_so100_test_brick',
'policy': {'chunk_size': 100,
'device': 'mps',
'dim_feedforward': 3200,
'dim_model': 512,
'dropout': 0.1,
'feedforward_activation': 'relu',
'input_features': {},
'kl_weight': 10.0,
'latent_dim': 32,
'n_action_steps': 100,
'n_decoder_layers': 1,
'n_encoder_layers': 4,
'n_heads': 8,
'n_obs_steps': 1,
'n_vae_encoder_layers': 4,
'normalization_mapping': {'ACTION': <NormalizationMode.MEAN_STD: 'MEAN_STD'>,
'STATE': <NormalizationMode.MEAN_STD: 'MEAN_STD'>,
'VISUAL': <NormalizationMode.MEAN_STD: 'MEAN_STD'>},
'optimizer_lr': 1e-05,
'optimizer_lr_backbone': 1e-05,
'optimizer_weight_decay': 0.0001,
'output_features': {},
'pre_norm': False,
'pretrained_backbone_weights': 'ResNet18_Weights.IMAGENET1K_V1',
'replace_final_stride_with_dilation': False,
'temporal_ensemble_coeff': None,
'type': 'act',
'use_amp': False,
'use_vae': True,
'vision_backbone': 'resnet18'},
'resume': False,
'save_checkpoint': True,
'save_freq': 20000,
'scheduler': None,
'seed': 1000,
'steps': 100000,
'use_policy_training_preset': True,
'wandb': {'disable_artifact': False,
'enable': True,
'entity': None,
'mode': None,
'notes': None,
'project': 'lerobot',
'run_id': None}}

Console logs:

INFO 2025-05-03 16:31:30 ts/train.py:127 Creating dataset
Resolving data files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [00:00<00:00, 314415.59it/s]
INFO 2025-05-03 16:31:30 ts/train.py:138 Creating policy
INFO 2025-05-03 16:31:31 ts/train.py:144 Creating optimizer and scheduler
INFO 2025-05-03 16:31:31 ts/train.py:156 Output dir: outputs/train/act_so100_test_brick
INFO 2025-05-03 16:31:31 ts/train.py:159 cfg.steps=100000 (100K)
INFO 2025-05-03 16:31:31 ts/train.py:160 dataset.num_frames=16901 (17K)
INFO 2025-05-03 16:31:31 ts/train.py:161 dataset.num_episodes=50
INFO 2025-05-03 16:31:31 ts/train.py:162 num_learnable_params=51597190 (52M)
INFO 2025-05-03 16:31:31 ts/train.py:163 num_total_params=51597238 (52M)
INFO 2025-05-03 16:31:31 ts/train.py:202 Start offline training on a fixed dataset
INFO 2025-05-03 16:32:53 ts/train.py:232 step:200 smpl:2K ep:5 epch:0.09 loss:7.924 grdn:18966476.762 lr:1.0e-05 updt_s:0.378 data_s:0.033
INFO 2025-05-03 16:33:58 ts/train.py:232 step:400 smpl:3K ep:9 epch:0.19 loss:5.584 grdn:20729258418.160 lr:1.0e-05 updt_s:0.323 data_s:0.001
INFO 2025-05-03 16:35:03 ts/train.py:232 step:600 smpl:5K ep:14 epch:0.28 loss:5.579 grdn:inf lr:1.0e-05 updt_s:0.323 data_s:0.001
INFO 2025-05-03 16:36:08 ts/train.py:232 step:800 smpl:6K ep:19 epch:0.38 loss:5.584 grdn:inf lr:1.0e-05 updt_s:0.325 data_s:0.001
INFO 2025-05-03 16:37:15 ts/train.py:232 step:1K smpl:8K ep:24 epch:0.47 loss:5.492 grdn:inf lr:1.0e-05 updt_s:0.337 data_s:0.001
INFO 2025-05-03 16:38:23 ts/train.py:232 step:1K smpl:10K ep:28 epch:0.57 loss:5.564 grdn:inf lr:1.0e-05 updt_s:0.338 data_s:0.001
INFO 2025-05-03 16:39:33 ts/train.py:232 step:1K smpl:11K ep:33 epch:0.66 loss:nan grdn:nan lr:1.0e-05 updt_s:0.350 data_s:0.001
INFO 2025-05-03 16:40:42 ts/train.py:232 step:2K smpl:13K ep:38 epch:0.76 loss:nan grdn:nan lr:1.0e-05 updt_s:0.339 data_s:0.001
INFO 2025-05-03 16:41:48 ts/train.py:232 step:2K smpl:14K ep:43 epch:0.85 loss:nan grdn:nan lr:1.0e-05 updt_s:0.329 data_s:0.001
INFO 2025-05-03 16:42:53 ts/train.py:232 step:2K smpl:16K ep:47 epch:0.95 loss:nan grdn:nan lr:1.0e-05 updt_s:0.326 data_s:0.001

Expected behavior

I've already tried a few adjustments, such as lowering the learning rate, reducing kl_weight, and tweaking other hyperparameters, but none of them stabilized the parameters during training.
Training on CPU seems to work; the gradient norms stay under control.
PushT training with mps was fine.

This behaviour seems to be restricted to Apple Silicon, as other Mac users report the same trouble with ACT training: https://discord.com/channels/1216765309076115607/1366842424370139216/1367828917867511891
What would you recommend as the best way to investigate this issue further?
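As a starting point, here is a minimal check I'd run (a rough sketch: plain PyTorch with a small stand-in model instead of the real ACT policy, so it may not trigger the exact failure, but the same hooks can be attached to any nn.Module to flag the first layer that produces a non-finite output):

import torch
import torch.nn as nn

def attach_nan_hooks(model: nn.Module) -> None:
    """Print the modules whose forward output contains NaN/Inf values."""
    def make_hook(name):
        def hook(module, inputs, output):
            outs = output if isinstance(output, (tuple, list)) else (output,)
            for out in outs:
                if torch.is_tensor(out) and not torch.isfinite(out).all():
                    print(f"non-finite output from {name} ({module.__class__.__name__})")
        return hook
    for name, module in model.named_modules():
        module.register_forward_hook(make_hook(name))

# Stand-in model; attaching the same hooks to the real policy before one
# forward/backward pass on mps should show which layer goes non-finite first.
device = "mps" if torch.backends.mps.is_available() else "cpu"
model = nn.Sequential(nn.Linear(32, 512), nn.ReLU(), nn.Linear(512, 6)).to(device)
attach_nan_hooks(model)

x = torch.randn(8, 32, device=device)
model(x).pow(2).mean().backward()
print("grad norm:", torch.nn.utils.clip_grad_norm_(model.parameters(), float("inf")).item())

Wrapping the backward pass in torch.autograd.detect_anomaly() would be the other thing I'd try, although it slows training down considerably.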

Once again, thank you for your support and for sharing the project.


Oddadmix commented May 4, 2025

I have been facing the same issue with 2 camera sources. With a single camera source training seems fine, but the resulting accuracy is not good at all.

How many camera sources are you using, and what's your setup?

@imstevenpmwork added the bug and policies labels on May 6, 2025

a10v commented May 12, 2025

I was also facing the same issue with two camera sources. One was a wrist-mounted Innomaker 1080p camera and the other was an iPhone. After modifying the dataset's metadata and removing the iPhone videos, I was able to train an ACT model with non-NaN loss/grad values (although the resulting single-camera policy is really bad).

I was also on an Apple Silicon MBP. I'll test whether the issue occurs with only 2 camera sources (and whether more than 2 camera sources fixes it).

Update: I tested training with 3 cameras, and it still hits loss/gradient explosion (all NaNs) when training on a MacBook Pro.
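For anyone else debugging this, one quick way to rule out bad frames in the dataset itself before blaming the device is a scan like the following (a rough sketch: it assumes LeRobotDataset loads the dataset the same way the official examples do, and the repo id is a placeholder):

import torch
from lerobot.common.datasets.lerobot_dataset import LeRobotDataset

# Scan every frame for NaN/Inf values in any float tensor (camera images, state, action).
# "your_user/your_dataset" is a placeholder for the actual repo id.
dataset = LeRobotDataset("your_user/your_dataset")
bad = 0
for i in range(len(dataset)):
    item = dataset[i]
    for key, value in item.items():
        if torch.is_tensor(value) and value.is_floating_point() and not torch.isfinite(value).all():
            print(f"non-finite values at frame {i}, key {key}")
            bad += 1
print(f"done, {bad} problematic tensors found")

If the scan comes back clean, that points back at the device rather than at a particular camera.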

Oddadmix commented:

I found that the issue is with mps, not the 2 cameras. I tried the training on a GPU and it worked perfectly fine.
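If anyone wants to sanity-check the device hypothesis without the full training loop, a minimal stand-alone comparison like this (plain PyTorch, a toy model standing in for the policy, so it may well not reproduce the blow-up) shows whether CPU and mps already disagree on identical inputs:

import torch
import torch.nn as nn
import torch.nn.functional as F

def grad_norm_on(device: str, seed: int = 0) -> float:
    """Run one identical forward/backward pass on the given device; return the total grad norm."""
    torch.manual_seed(seed)
    # Model and data are created on CPU with the same seed, then moved, so both runs
    # start from identical weights and inputs.
    model = nn.Sequential(nn.Linear(32, 512), nn.GELU(), nn.Linear(512, 6)).to(device)
    x = torch.randn(8, 32).to(device)
    target = torch.randn(8, 6).to(device)
    F.l1_loss(model(x), target).backward()
    return torch.nn.utils.clip_grad_norm_(model.parameters(), float("inf")).item()

cpu_norm = grad_norm_on("cpu")
print("cpu grad norm:", cpu_norm)
if torch.backends.mps.is_available():
    mps_norm = grad_norm_on("mps")
    print("mps grad norm:", mps_norm, "| ratio vs cpu:", mps_norm / cpu_norm)

If the toy model matches across devices, the next step would be the same comparison with the actual policy and a real batch.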
