Issue with Gradient Explosion and NaN Loss in SO100 Training Example #1066

Open
fbeck-dev opened this issue May 3, 2025 · 3 comments
Labels: bug, policies

Comments


fbeck-dev commented May 3, 2025

System Info

- `lerobot` version: 0.1.0
- Platform: macOS-15.4.1-arm64-arm-64bit
- Python version: 3.10.13
- Huggingface_hub version: 0.30.2
- Dataset version: 3.5.1
- Numpy version: 2.2.5
- PyTorch version (GPU?): 2.6.0 (False)
- Cuda version: N/A
- Using GPU in script?: Apple Silicon - mps

Information

  • One of the scripts in the examples/ folder of LeRobot
  • My own task or dataset (give details below)

Reproduction

Hello,

First of all, thank you very much for the awesome project — I really appreciate the work you've put into it!

I followed the example from main/examples/10_use_so100.md and recorded 50 episodes of a simple task where the robot picks up a brick and places it into a bin. However, when I started training, I quickly noticed that the gradient norm was exploding, and the loss eventually became NaN as well.

Do you have any suggestions on how to stabilize the training or prevent this from happening?

Train command launched:

python lerobot/scripts/train.py \
  --dataset.repo_id=fbeck/so100_test \
  --policy.type=act \
  --output_dir=outputs/train/act_so100_test_brick \
  --job_name=act_so100_test_brick \
  --policy.device=mps \
  --wandb.enable=true
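The CPU run mentioned under "Expected behavior" below was essentially the same command with only the device flag swapped, e.g. (the _cpu suffix is just to keep the output directories apart):

python lerobot/scripts/train.py \
  --dataset.repo_id=fbeck/so100_test \
  --policy.type=act \
  --output_dir=outputs/train/act_so100_test_brick_cpu \
  --job_name=act_so100_test_brick_cpu \
  --policy.device=cpu \
  --wandb.enable=true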

Config:

INFO 2025-05-03 16:31:28 ts/train.py:111 {'batch_size': 8,
'dataset': {'episodes': None,
'image_transforms': {'enable': False,
'max_num_transforms': 3,
'random_order': False,
'tfs': {'brightness': {'kwargs': {'brightness': [0.8,
1.2]},
'type': 'ColorJitter',
'weight': 1.0},
'contrast': {'kwargs': {'contrast': [0.8,
1.2]},
'type': 'ColorJitter',
'weight': 1.0},
'hue': {'kwargs': {'hue': [-0.05,
0.05]},
'type': 'ColorJitter',
'weight': 1.0},
'saturation': {'kwargs': {'saturation': [0.5,
1.5]},
'type': 'ColorJitter',
'weight': 1.0},
'sharpness': {'kwargs': {'sharpness': [0.5,
1.5]},
'type': 'SharpnessJitter',
'weight': 1.0}}},
'repo_id': 'fbeck/so100_test',
'revision': None,
'root': None,
'use_imagenet_stats': True,
'video_backend': 'torchcodec'},
'env': None,
'eval': {'batch_size': 50, 'n_episodes': 50, 'use_async_envs': False},
'eval_freq': 20000,
'job_name': 'act_so100_test_brick',
'log_freq': 200,
'num_workers': 4,
'optimizer': {'betas': [0.9, 0.999],
'eps': 1e-08,
'grad_clip_norm': 10,
'lr': 1e-05,
'type': 'adamw',
'weight_decay': 0.0001},
'output_dir': 'outputs/train/act_so100_test_brick',
'policy': {'chunk_size': 100,
'device': 'mps',
'dim_feedforward': 3200,
'dim_model': 512,
'dropout': 0.1,
'feedforward_activation': 'relu',
'input_features': {},
'kl_weight': 10.0,
'latent_dim': 32,
'n_action_steps': 100,
'n_decoder_layers': 1,
'n_encoder_layers': 4,
'n_heads': 8,
'n_obs_steps': 1,
'n_vae_encoder_layers': 4,
'normalization_mapping': {'ACTION': <NormalizationMode.MEAN_STD: 'MEAN_STD'>,
'STATE': <NormalizationMode.MEAN_STD: 'MEAN_STD'>,
'VISUAL': <NormalizationMode.MEAN_STD: 'MEAN_STD'>},
'optimizer_lr': 1e-05,
'optimizer_lr_backbone': 1e-05,
'optimizer_weight_decay': 0.0001,
'output_features': {},
'pre_norm': False,
'pretrained_backbone_weights': 'ResNet18_Weights.IMAGENET1K_V1',
'replace_final_stride_with_dilation': False,
'temporal_ensemble_coeff': None,
'type': 'act',
'use_amp': False,
'use_vae': True,
'vision_backbone': 'resnet18'},
'resume': False,
'save_checkpoint': True,
'save_freq': 20000,
'scheduler': None,
'seed': 1000,
'steps': 100000,
'use_policy_training_preset': True,
'wandb': {'disable_artifact': False,
'enable': True,
'entity': None,
'mode': None,
'notes': None,
'project': 'lerobot',
'run_id': None}}

Console logs:

INFO 2025-05-03 16:31:30 ts/train.py:127 Creating dataset
Resolving data files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [00:00<00:00, 314415.59it/s]
INFO 2025-05-03 16:31:30 ts/train.py:138 Creating policy
INFO 2025-05-03 16:31:31 ts/train.py:144 Creating optimizer and scheduler
INFO 2025-05-03 16:31:31 ts/train.py:156 Output dir: outputs/train/act_so100_test_brick
INFO 2025-05-03 16:31:31 ts/train.py:159 cfg.steps=100000 (100K)
INFO 2025-05-03 16:31:31 ts/train.py:160 dataset.num_frames=16901 (17K)
INFO 2025-05-03 16:31:31 ts/train.py:161 dataset.num_episodes=50
INFO 2025-05-03 16:31:31 ts/train.py:162 num_learnable_params=51597190 (52M)
INFO 2025-05-03 16:31:31 ts/train.py:163 num_total_params=51597238 (52M)
INFO 2025-05-03 16:31:31 ts/train.py:202 Start offline training on a fixed dataset
INFO 2025-05-03 16:32:53 ts/train.py:232 step:200 smpl:2K ep:5 epch:0.09 loss:7.924 grdn:18966476.762 lr:1.0e-05 updt_s:0.378 data_s:0.033
INFO 2025-05-03 16:33:58 ts/train.py:232 step:400 smpl:3K ep:9 epch:0.19 loss:5.584 grdn:20729258418.160 lr:1.0e-05 updt_s:0.323 data_s:0.001
INFO 2025-05-03 16:35:03 ts/train.py:232 step:600 smpl:5K ep:14 epch:0.28 loss:5.579 grdn:inf lr:1.0e-05 updt_s:0.323 data_s:0.001
INFO 2025-05-03 16:36:08 ts/train.py:232 step:800 smpl:6K ep:19 epch:0.38 loss:5.584 grdn:inf lr:1.0e-05 updt_s:0.325 data_s:0.001
INFO 2025-05-03 16:37:15 ts/train.py:232 step:1K smpl:8K ep:24 epch:0.47 loss:5.492 grdn:inf lr:1.0e-05 updt_s:0.337 data_s:0.001
INFO 2025-05-03 16:38:23 ts/train.py:232 step:1K smpl:10K ep:28 epch:0.57 loss:5.564 grdn:inf lr:1.0e-05 updt_s:0.338 data_s:0.001
INFO 2025-05-03 16:39:33 ts/train.py:232 step:1K smpl:11K ep:33 epch:0.66 loss:nan grdn:nan lr:1.0e-05 updt_s:0.350 data_s:0.001
INFO 2025-05-03 16:40:42 ts/train.py:232 step:2K smpl:13K ep:38 epch:0.76 loss:nan grdn:nan lr:1.0e-05 updt_s:0.339 data_s:0.001
INFO 2025-05-03 16:41:48 ts/train.py:232 step:2K smpl:14K ep:43 epch:0.85 loss:nan grdn:nan lr:1.0e-05 updt_s:0.329 data_s:0.001
INFO 2025-05-03 16:42:53 ts/train.py:232 step:2K smpl:16K ep:47 epch:0.95 loss:nan grdn:nan lr:1.0e-05 updt_s:0.326 data_s:0.001

Expected behavior

I've already tried a few adjustments, such as lowering the learning rate, reducing kl_weight, and tweaking other hyperparameters, but none of them stabilized the parameters during training.
Training on CPU seems to work; the gradient norms stay under control.
PushT training with mps was fine.

This behaviour seems to be restricted to Apple Silicon, as other Mac users report the same trouble with ACT training: https://discord.com/channels/1216765309076115607/1366842424370139216/1367828917867511891
What would you recommend as the best way to investigate this issue further?
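As a starting point, here is a minimal check I'd run (a rough sketch: plain PyTorch with a small stand-in model instead of the real ACT policy, so it may not trigger the exact failure, but the same hooks can be attached to any nn.Module to flag the first layer that produces a non-finite output):

import torch
import torch.nn as nn

def attach_nan_hooks(model: nn.Module) -> None:
    """Print the modules whose forward output contains NaN/Inf values."""
    def make_hook(name):
        def hook(module, inputs, output):
            outs = output if isinstance(output, (tuple, list)) else (output,)
            for out in outs:
                if torch.is_tensor(out) and not torch.isfinite(out).all():
                    print(f"non-finite output from {name} ({module.__class__.__name__})")
        return hook
    for name, module in model.named_modules():
        module.register_forward_hook(make_hook(name))

# Stand-in model; attaching the same hooks to the real policy before one
# forward/backward pass on mps should show which layer goes non-finite first.
device = "mps" if torch.backends.mps.is_available() else "cpu"
model = nn.Sequential(nn.Linear(32, 512), nn.ReLU(), nn.Linear(512, 6)).to(device)
attach_nan_hooks(model)

x = torch.randn(8, 32, device=device)
model(x).pow(2).mean().backward()
print("grad norm:", torch.nn.utils.clip_grad_norm_(model.parameters(), float("inf")).item())

Wrapping the backward pass in torch.autograd.detect_anomaly() would be the other thing I'd try, although it slows training down considerably.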

Once again, thank you for your support and for sharing the project.


Oddadmix commented May 4, 2025

I have been facing the same issue with 2 camera sources. With a single camera source training seems fine, but the resulting accuracy is not good at all.

How many camera sources are you using, and what's your setup?

@imstevenpmwork added the bug and policies labels on May 6, 2025

a10v commented May 12, 2025

I was also facing the same issue with two camera sources. One was a wrist-mounted Innomaker 1080p camera and the other was an iPhone. After modifying the dataset's metadata and removing the iPhone videos, I was able to train an ACT model with non-NaN loss/grad values (although the resulting single-camera policy is really bad).

I was also on an Apple Silicon MBP. I'll test whether the issue occurs with only 2 camera sources (and whether more than 2 camera sources fixes it).

Update: I tested training with 3 cameras, and it still hits loss/gradient explosion (all NaNs) when training on a MacBook Pro.
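For anyone else debugging this, one quick way to rule out bad frames in the dataset itself before blaming the device is a scan like the following (a rough sketch: it assumes LeRobotDataset loads the dataset the same way the official examples do, and the repo id is a placeholder):

import torch
from lerobot.common.datasets.lerobot_dataset import LeRobotDataset

# Scan every frame for NaN/Inf values in any float tensor (camera images, state, action).
# "your_user/your_dataset" is a placeholder for the actual repo id.
dataset = LeRobotDataset("your_user/your_dataset")
bad = 0
for i in range(len(dataset)):
    item = dataset[i]
    for key, value in item.items():
        if torch.is_tensor(value) and value.is_floating_point() and not torch.isfinite(value).all():
            print(f"non-finite values at frame {i}, key {key}")
            bad += 1
print(f"done, {bad} problematic tensors found")

If the scan comes back clean, that points back at the device rather than at a particular camera.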

Oddadmix commented:

I found that the issue is with mps, not the 2 cameras. I tried the training on a GPU and it worked perfectly fine.
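If anyone wants to sanity-check the device hypothesis without the full training loop, a minimal stand-alone comparison like this (plain PyTorch, a toy model standing in for the policy, so it may well not reproduce the blow-up) shows whether CPU and mps already disagree on identical inputs:

import torch
import torch.nn as nn
import torch.nn.functional as F

def grad_norm_on(device: str, seed: int = 0) -> float:
    """Run one identical forward/backward pass on the given device; return the total grad norm."""
    torch.manual_seed(seed)
    # Model and data are created on CPU with the same seed, then moved, so both runs
    # start from identical weights and inputs.
    model = nn.Sequential(nn.Linear(32, 512), nn.GELU(), nn.Linear(512, 6)).to(device)
    x = torch.randn(8, 32).to(device)
    target = torch.randn(8, 6).to(device)
    F.l1_loss(model(x), target).backward()
    return torch.nn.utils.clip_grad_norm_(model.parameters(), float("inf")).item()

cpu_norm = grad_norm_on("cpu")
print("cpu grad norm:", cpu_norm)
if torch.backends.mps.is_available():
    mps_norm = grad_norm_on("mps")
    print("mps grad norm:", mps_norm, "| ratio vs cpu:", mps_norm / cpu_norm)

If the toy model matches across devices, the next step would be the same comparison with the actual policy and a real batch.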
