Description
BioNeMo Framework Version
Bug Description
Evo2 training crashes whenever pipeline model parallelism is enabled, i.e., when --pipeline-model-parallel-size > 1: the ranks fail with an IndexError while fetching a decoder layer inside the activation-recompute path (full traceback below).
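A plausible reading of the traceback (my assumption, not verified against the source): _checkpointed_forward walks the stage's local layer list in fixed chunks of recompute_num_layers, and the last chunk can run past the end of the list when the per-stage layer count is not a multiple of the chunk size. A minimal sketch of that pattern, with numbers chosen to match this failing run rather than read from the code:

# Simplified stand-in for the loop in
# nemo/collections/llm/gpt/model/megatron/hyena/hyena_block.py::_checkpointed_forward
layers = list(range(12))      # 12 layers owned by this pipeline stage
recompute_num_layers = 5      # --activation-checkpoint-recompute-num-layers=5

for layer_idx in range(0, len(layers), recompute_num_layers):
    # chunk starts are 0, 5, 10; the last chunk spans [10, 15)
    for index in range(layer_idx, layer_idx + recompute_num_layers):
        layers[index]         # third chunk asks for index 12 -> IndexError

If that reading is right, clamping the chunk end to the local layer count (or requiring the per-stage layer count to be divisible by recompute_num_layers) would avoid the overrun; again, this is a hypothesis, not a confirmed diagnosis.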
Steps to Reproduce
- Build the Docker image at the relevant commit SHA.
- Run the command below in a multi-GPU setting with --pipeline-model-parallel-size=2 (a back-of-the-envelope check of the resulting layer split follows the command):
train_evo2 -d /workspace/bionemo2/sub-packages/bionemo-evo2/examples/configs/full_pretrain_shortphase_config.yaml --dataset-dir /data/evo2 --grad-acc-batches 1 --fp8 --fp8-wgrad --activation-checkpoint-recompute-num-layers 5 --enable-preemption --ckpt-async-save --use-megatron-comm-overlap-llama3-8k --overlap-grad-reduce --clip-grad=250 --eod-pad-in-loss-mask --seq-length=8192 --seed 3735928559 --lr=0.00015 --wd=0.1 --min-lr=1.5e-05 --warmup-steps=5000 --tensor-parallel-size=1 --context-parallel-size=1 --pipeline-model-parallel-size=2 --workers 8 --num-nodes=2 --devices=8 --micro-batch-size=8 --model-size=1b --max-steps=490000 --early-stop-on-step 650 --limit-val-batches=20 --log-every-n-steps=50 --val-check-interval=200 --create-tflops-callback --create-tensorboard-logger --disable-checkpointing
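For context, a hypothetical arithmetic check of the layer split this command produces. Both the 25-layer count for the 1b model and the even-split rule across stages are my assumptions, not facts taken from the log:

# Hypothetical arithmetic only -- total_layers=25 and the even-split rule
# are assumed properties of the 1b config.
total_layers = 25
pp_size = 2   # --pipeline-model-parallel-size=2
per_stage = [total_layers // pp_size + (1 if r < total_layers % pp_size else 0)
             for r in range(pp_size)]
print(per_stage)  # [13, 12]

recompute = 5  # --activation-checkpoint-recompute-num-layers=5
for local in per_stage:
    # index of the last layer the fixed-size recompute loop would touch
    last_index_touched = ((local - 1) // recompute) * recompute + recompute - 1
    print(f"{local} local layers -> loop touches index {last_index_touched}")
# 13 local layers -> index 14 (valid indices stop at 12)
# 12 local layers -> index 14 (valid indices stop at 11)

Under those assumptions both stages overrun their local layer list, which would explain why rank 0 and rank 1 fail with the same "index 12 is out of range" error.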
Error Messages and Logs
IndexError: index 12 is out of range
1: [rank1]: Traceback (most recent call last):
1: [rank1]: File "/usr/local/bin/train_evo2", line 8, in <module>
1: [rank1]: sys.exit(main())
1: [rank1]: ^^^^^^
1: [rank1]: File "/usr/local/lib/python3.12/dist-packages/bionemo/evo2/run/train.py", line 705, in main
1: [rank1]: train(args=args)
1: [rank1]: File "/usr/local/lib/python3.12/dist-packages/bionemo/evo2/run/train.py", line 698, in train
1: [rank1]: trainer.fit(model, data_module)
1: [rank1]: File "/usr/local/lib/python3.12/dist-packages/lightning/pytorch/trainer/trainer.py", line 538, in fit
1: [rank1]: call._call_and_handle_interrupt(
1: [rank1]: File "/usr/local/lib/python3.12/dist-packages/lightning/pytorch/trainer/call.py", line 46, in _call_and_handle_interrupt
1: [rank1]: return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
1: [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
1: [rank1]: File "/usr/local/lib/python3.12/dist-packages/lightning/pytorch/strategies/launchers/subprocess_script.py", line 105, in launch
1: [rank1]: return function(*args, **kwargs)
1: [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^
1: [rank1]: File "/usr/local/lib/python3.12/dist-packages/lightning/pytorch/trainer/trainer.py", line 574, in _fit_impl
1: [rank1]: self._run(model, ckpt_path=ckpt_path)
1: [rank1]: File "/usr/local/lib/python3.12/dist-packages/lightning/pytorch/trainer/trainer.py", line 981, in _run
1: [rank1]: results = self._run_stage()
1: [rank1]: ^^^^^^^^^^^^^^^^^
1: [rank1]: File "/usr/local/lib/python3.12/dist-packages/lightning/pytorch/trainer/trainer.py", line 1025, in _run_stage
1: [rank1]: self.fit_loop.run()
1: [rank1]: File "/usr/local/lib/python3.12/dist-packages/lightning/pytorch/loops/fit_loop.py", line 205, in run
1: [rank1]: self.advance()
1: [rank1]: File "/usr/local/lib/python3.12/dist-packages/lightning/pytorch/loops/fit_loop.py", line 363, in advance
1: [rank1]: self.epoch_loop.run(self._data_fetcher)
1: [rank1]: File "/usr/local/lib/python3.12/dist-packages/lightning/pytorch/loops/training_epoch_loop.py", line 140, in run
1: [rank1]: self.advance(data_fetcher)
1: [rank1]: File "/usr/local/lib/python3.12/dist-packages/lightning/pytorch/loops/training_epoch_loop.py", line 250, in advance
1: [rank1]: batch_output = self.automatic_optimization.run(trainer.optimizers[0], batch_idx, kwargs)
1: [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
1: [rank1]: File "/usr/local/lib/python3.12/dist-packages/lightning/pytorch/loops/optimization/automatic.py", line 190, in run
1: [rank1]: self._optimizer_step(batch_idx, closure)
1: [rank1]: File "/usr/local/lib/python3.12/dist-packages/lightning/pytorch/loops/optimization/automatic.py", line 268, in _optimizer_step
1: [rank1]: call._call_lightning_module_hook(
1: [rank1]: File "/usr/local/lib/python3.12/dist-packages/lightning/pytorch/trainer/call.py", line 167, in _call_lightning_module_hook
1: [rank1]: output = fn(*args, **kwargs)
1: [rank1]: ^^^^^^^^^^^^^^^^^^^
1: [rank1]: File "/usr/local/lib/python3.12/dist-packages/lightning/pytorch/core/module.py", line 1306, in optimizer_step
1: [rank1]: optimizer.step(closure=optimizer_closure)
1: [rank1]: File "/usr/local/lib/python3.12/dist-packages/lightning/pytorch/core/optimizer.py", line 153, in step
1: [rank1]: step_output = self._strategy.optimizer_step(self._optimizer, closure, **kwargs)
1: [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
1: [rank1]: File "/usr/local/lib/python3.12/dist-packages/nemo/lightning/pytorch/strategies/megatron_strategy.py", line 721, in optimizer_step
1: [rank1]: optimizer_output = super().optimizer_step(optimizer, closure, model, **kwargs)
1: [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
1: [rank1]: File "/usr/local/lib/python3.12/dist-packages/lightning/pytorch/strategies/ddp.py", line 270, in optimizer_step
1: [rank1]: optimizer_output = super().optimizer_step(optimizer, closure, model, **kwargs)
1: [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
1: [rank1]: File "/usr/local/lib/python3.12/dist-packages/lightning/pytorch/strategies/strategy.py", line 238, in optimizer_step
1: [rank1]: return self.precision_plugin.optimizer_step(optimizer, model=model, closure=closure, **kwargs)
1: [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
1: [rank1]: File "/usr/local/lib/python3.12/dist-packages/lightning/pytorch/plugins/precision/precision.py", line 122, in optimizer_step
1: [rank1]: return optimizer.step(closure=closure, **kwargs)
1: [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
1: [rank1]: File "/usr/local/lib/python3.12/dist-packages/torch/optim/lr_scheduler.py", line 140, in wrapper
1: [rank1]: return func.__get__(opt, opt.__class__)(*args, **kwargs)
1: [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
1: [rank1]: File "/usr/local/lib/python3.12/dist-packages/nemo/core/optim/mcore_optim.py", line 129, in step
1: [rank1]: loss = closure()
1: [rank1]: ^^^^^^^^^
1: [rank1]: File "/usr/local/lib/python3.12/dist-packages/lightning/pytorch/plugins/precision/precision.py", line 108, in _wrap_closure
1: [rank1]: closure_result = closure()
1: [rank1]: ^^^^^^^^^
1: [rank1]: File "/usr/local/lib/python3.12/dist-packages/lightning/pytorch/loops/optimization/automatic.py", line 144, in __call__
1: [rank1]: self._result = self.closure(*args, **kwargs)
1: [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
1: [rank1]: File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
1: [rank1]: return func(*args, **kwargs)
1: [rank1]: ^^^^^^^^^^^^^^^^^^^^^
1: [rank1]: File "/usr/local/lib/python3.12/dist-packages/lightning/pytorch/loops/optimization/automatic.py", line 129, in closure
1: [rank1]: step_output = self._step_fn()
1: [rank1]: ^^^^^^^^^^^^^^^
1: [rank1]: File "/usr/local/lib/python3.12/dist-packages/lightning/pytorch/loops/optimization/automatic.py", line 317, in _training_step
1: [rank1]: training_step_output = call._call_strategy_hook(trainer, "training_step", *kwargs.values())
1: [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
1: [rank1]: File "/usr/local/lib/python3.12/dist-packages/lightning/pytorch/trainer/call.py", line 319, in _call_strategy_hook
1: [rank1]: output = fn(*args, **kwargs)
1: [rank1]: ^^^^^^^^^^^^^^^^^^^
1: [rank1]: File "/usr/local/lib/python3.12/dist-packages/nemo/lightning/pytorch/strategies/megatron_strategy.py", line 655, in training_step
1: [rank1]: out = self.model.training_step(dataloader_iter, *args, **kwargs)
1: [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
1: [rank1]: File "/usr/local/lib/python3.12/dist-packages/nemo/lightning/megatron_parallel.py", line 384, in training_step
1: [rank1]: return self._step(
1: [rank1]: ^^^^^^^^^^^
1: [rank1]: File "/usr/local/lib/python3.12/dist-packages/nemo/lightning/megatron_parallel.py", line 496, in _step
1: [rank1]: return self.forward(
1: [rank1]: ^^^^^^^^^^^^^
1: [rank1]: File "/usr/local/lib/python3.12/dist-packages/nemo/lightning/megatron_parallel.py", line 346, in forward
1: [rank1]: microbatch_outputs = step()
1: [rank1]: ^^^^^^
1: [rank1]: File "/usr/local/lib/python3.12/dist-packages/nemo/lightning/megatron_parallel.py", line 1251, in __call__
1: [rank1]: return self.forward_backward_func(
1: [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^
1: [rank1]: File "/usr/local/lib/python3.12/dist-packages/megatron/core/pipeline_parallel/schedules.py", line 1741, in forward_backward_pipelining_without_interleaving
1: [rank1]: output_tensor, num_tokens = forward_step(
1: [rank1]: ^^^^^^^^^^^^^
1: [rank1]: File "/usr/local/lib/python3.12/dist-packages/megatron/core/pipeline_parallel/schedules.py", line 275, in forward_step
1: [rank1]: output_tensor, loss_func = forward_step_func(data_iterator, model)
1: [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
1: [rank1]: File "/usr/local/lib/python3.12/dist-packages/nemo/lightning/megatron_parallel.py", line 559, in wrapped_forward_step_func
1: [rank1]: output_tensor = _forward_step(model, batch)
1: [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^
1: [rank1]: File "/usr/local/lib/python3.12/dist-packages/nemo/lightning/megatron_parallel.py", line 861, in wrapped
1: [rank1]: return attr(*args)
1: [rank1]: ^^^^^^^^^^^
1: [rank1]: File "/usr/local/lib/python3.12/dist-packages/nemo/collections/llm/gpt/model/base.py", line 606, in forward_step
1: [rank1]: return self.config.forward_step_fn(self, batch)
1: [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
1: [rank1]: File "/usr/local/lib/python3.12/dist-packages/nemo/collections/llm/gpt/model/hyena.py", line 155, in hyena_forward_step
1: [rank1]: return model(**forward_args)
1: [rank1]: ^^^^^^^^^^^^^^^^^^^^^
1: [rank1]: File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1740, in _wrapped_call_impl
1: [rank1]: return self._call_impl(*args, **kwargs)
1: [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
1: [rank1]: File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _call_impl
1: [rank1]: return forward_call(*args, **kwargs)
1: [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
1: [rank1]: File "/usr/local/lib/python3.12/dist-packages/nemo/collections/llm/gpt/model/hyena.py", line 120, in forward
1: [rank1]: output_tensor = self.module(
1: [rank1]: ^^^^^^^^^^^^
1: [rank1]: File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1740, in _wrapped_call_impl
1: [rank1]: return self._call_impl(*args, **kwargs)
1: [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
1: [rank1]: File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _call_impl
1: [rank1]: return forward_call(*args, **kwargs)
1: [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
1: [rank1]: File "/usr/local/lib/python3.12/dist-packages/megatron/core/distributed/data_parallel_base.py", line 22, in forward
1: [rank1]: return self.module(*inputs, **kwargs)
1: [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
1: [rank1]: File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1740, in _wrapped_call_impl
1: [rank1]: return self._call_impl(*args, **kwargs)
1: [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
1: [rank1]: File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1848, in _call_impl
1: [rank1]: return inner()
1: [rank1]: ^^^^^^^
1: [rank1]: File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1794, in inner
1: [rank1]: result = forward_call(*args, **kwargs)
1: [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
1: [rank1]: File "/usr/local/lib/python3.12/dist-packages/megatron/core/transformer/module.py", line 178, in forward
1: [rank1]: outputs = self.module(*inputs, **kwargs)
1: [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
1: [rank1]: File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1740, in _wrapped_call_impl
1: [rank1]: return self._call_impl(*args, **kwargs)
1: [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
1: [rank1]: File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1848, in _call_impl
1: [rank1]: return inner()
1: [rank1]: ^^^^^^^
1: [rank1]: File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1794, in inner
1: [rank1]: result = forward_call(*args, **kwargs)
1: [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
1: [rank1]: File "/usr/local/lib/python3.12/dist-packages/nemo/collections/llm/gpt/model/megatron/hyena/hyena_model.py", line 263, in forward
1: [rank1]: hidden_states = self.decoder(
1: [rank1]: ^^^^^^^^^^^^^
1: [rank1]: File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1740, in _wrapped_call_impl
1: [rank1]: return self._call_impl(*args, **kwargs)
1: [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
1: [rank1]: File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1848, in _call_impl
1: [rank1]: return inner()
1: [rank1]: ^^^^^^^
1: [rank1]: File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1794, in inner
1: [rank1]: result = forward_call(*args, **kwargs)
1: [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
1: [rank1]: File "/usr/local/lib/python3.12/dist-packages/nemo/collections/llm/gpt/model/megatron/hyena/hyena_block.py", line 294, in forward
1: [rank1]: hidden_states = self._checkpointed_forward(
1: [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^
1: [rank1]: File "/usr/local/lib/python3.12/dist-packages/nemo/collections/llm/gpt/model/megatron/hyena/hyena_block.py", line 220, in _checkpointed_forward
1: [rank1]: hidden_states = checkpoint_handler(custom(layer_idx, layer_idx + self.config.recompute_num_layers))
1: [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
1: [rank1]: File "/usr/local/lib/python3.12/dist-packages/nemo/collections/llm/gpt/model/megatron/hyena/hyena_block.py", line 192, in checkpoint_handler
1: [rank1]: return te_checkpoint(
1: [rank1]: ^^^^^^^^^^^^^^
1: [rank1]: File "/usr/local/lib/python3.12/dist-packages/megatron/core/extensions/transformer_engine.py", line 1235, in te_checkpoint
1: [rank1]: return checkpoint(
1: [rank1]: ^^^^^^^^^^^
1: [rank1]: File "/usr/local/lib/python3.12/dist-packages/torch/_compile.py", line 32, in inner
1: [rank1]: return disable_fn(*args, **kwargs)
1: [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^
1: [rank1]: File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py", line 738, in _fn
1: [rank1]: return fn(*args, **kwargs)
1: [rank1]: ^^^^^^^^^^^^^^^^^^^
1: [rank1]: File "/usr/local/lib/python3.12/dist-packages/transformer_engine/pytorch/distributed.py", line 668, in checkpoint
1: [rank1]: return _CheckpointFunction.apply(
1: [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^
1: [rank1]: File "/usr/local/lib/python3.12/dist-packages/torch/autograd/function.py", line 575, in apply
1: [rank1]: return super().apply(*args, **kwargs) # type: ignore[misc]
1: [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
1: [rank1]: File "/usr/local/lib/python3.12/dist-packages/transformer_engine/pytorch/distributed.py", line 310, in forward
1: [rank1]: outputs = run_function(*args, **kwargs)
1: [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
1: [rank1]: File "/usr/local/lib/python3.12/dist-packages/nemo/collections/llm/gpt/model/megatron/hyena/hyena_block.py", line 176, in custom_forward
1: [rank1]: layer = self._get_layer(index)
1: [rank1]: ^^^^^^^^^^^^^^^^^^^^^^
1: [rank1]: File "/usr/local/lib/python3.12/dist-packages/nemo/collections/llm/gpt/model/megatron/hyena/hyena_block.py", line 163, in _get_layer
1: [rank1]: return self.layers[layer_number]
1: [rank1]: ~~~~~~~~~~~^^^^^^^^^^^^^^
1: [rank1]: File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/container.py", line 334, in __getitem__
1: [rank1]: return self._modules[self._get_abs_string_index(idx)]
1: [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
1: [rank1]: File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/container.py", line 316, in _get_abs_string_index
1: [rank1]: raise IndexError(f"index {idx} is out of range")
1: [rank1]: IndexError: index 12 is out of range
0: [rank0]: (identical traceback on rank 0; truncated in the captured log at the final IndexError line)
Docker Image
No response
System Information
Environment Details:
- OS: [e.g., Ubuntu 20.04]
- CPU: [e.g., Intel i9-12900K]
- RAM: [e.g., 64GB]
GPU Details:
- GPU Model: [e.g., NVIDIA RTX 4090]
- GPU Memory: [e.g., 24GB]
- CUDA Version: [e.g., 12.1]
- CUDA Driver: [e.g., 525.85.05]
- cuDNN Version: [e.g., 8.9.0]