I have been trying to run some of the exp training code on nvcr.io/nvidia/pytorch:23.09-py3. However, I keep getting errors regardless of which script I run. After some testing, it seems that even GPT training on a single GPU triggers the error.
What might be the cause of this INTERNAL ASSERT FAILED error? My bash script and the relevant log output are shown below.
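For what it's worth, the traceback below ends in F.dropout on the attention probabilities, so a minimal, Megatron-free check of that op may help isolate the problem. This is only a sketch under the assumption that the failure is in the fp16 CUDA dropout path rather than in the training scripts; the tensor shape is a small stand-in, not the real attention shape.

```python
# Minimal repro sketch, run inside the same nvcr.io/nvidia/pytorch:23.09-py3
# container. Assumption: the INTERNAL ASSERT originates in the fp16 CUDA
# dropout path the traceback ends in, not in the Megatron-LM code itself.
import torch
import torch.nn.functional as F

x = torch.randn(2, 12, 1024, 1024, device="cuda", dtype=torch.float16)
y = F.dropout(x, p=0.1, training=True)  # same call as the attention dropout in the trace
torch.cuda.synchronize()
print("dropout ok:", y.shape, y.dtype)
```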
Bash script executed:
### Pre-training for GPT-2 (125M parameters).
##
# Distributed hyperparameters.
DISTRIBUTED_ARGUMENTS="\
--nproc_per_node 1 \
--nnodes 1 \
--node_rank 0 \
--master_addr localhost \
--master_port 6000"
# Model hyperparameters.
MODEL_ARGUMENTS="\
--num-layers 12 \
--hidden-size 768 \
--num-attention-heads 12 \
--seq-length 1024 \
--max-position-embeddings 1024"
# Training hyperparameters.
TRAINING_ARGUMENTS="\
--micro-batch-size 32 \
--global-batch-size 512 \
--train-iters ${TRAINING_STEPS} \
--lr-decay-iters ${TRAINING_STEPS} \
--lr 0.00015 \
--min-lr 0.00001 \
--lr-decay-style cosine \
--lr-warmup-fraction 0.01 \
--clip-grad 1.0 \
--init-method-std 0.01"
DATA_PATH=my-gpt2_text_document
# NOTE: We don't train for enough tokens for the
# split to matter.
DATA_ARGUMENTS="\
--data-path ${DATA_PATH} \
--vocab-file ./ckpt_gpt/gpt2-vocab.json \
--merge-file ./ckpt_gpt/gpt2-merges.txt \
--make-vocab-size-divisible-by 1024 \
--split 969,30,1"
COMPUTE_ARGUMENTS="\
--fp16 \
--DDP-impl local"
CHECKPOINT_ARGUMENTS="\
--save-interval 2000 \
--save ./${EXP_DIR}"
EVALUATION_ARGUMENTS="\
--eval-iters 100 \
--log-interval 100 \
--eval-interval 1000"
torchrun ${DISTRIBUTED_ARGUMENTS} \
pretrain_gpt.py \
${MODEL_ARGUMENTS} \
${TRAINING_ARGUMENTS} \
${DATA_ARGUMENTS} \
${COMPUTE_ARGUMENTS} \
${CHECKPOINT_ARGUMENTS} \
${EVALUATION_ARGUMENTS} |& tee ./${EXP_DIR}/train.log
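For reference, these flags imply 16 gradient-accumulation steps per iteration on a single GPU, assuming Megatron-LM's usual relation global-batch-size = micro-batch-size × data-parallel-size × accumulation-steps. A quick sanity check of that arithmetic, with the values hard-coded from the script above:

```python
# Batch-size arithmetic implied by TRAINING_ARGUMENTS above, assuming the
# usual Megatron-LM relation; data-parallel size is 1 on a single GPU.
micro_batch_size = 32
global_batch_size = 512
data_parallel_size = 1

accum_steps = global_batch_size // (micro_batch_size * data_parallel_size)
assert accum_steps * micro_batch_size * data_parallel_size == global_batch_size
print("gradient-accumulation steps per iteration:", accum_steps)  # -> 16
```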
Log file (error section):
> elasped time to build and save sample-idx mapping (seconds): 0.000692
> building shuffle index with split [0, 51039) and [51039, 52173) ...
> elasped time to build and save shuffle-idx mapping (seconds): 0.001060
> loading doc-idx mapping from my-gpt2_text_document_test_indexmap_51200ns_1024sl_1234s_doc_idx.npy
> loading sample-idx mapping from my-gpt2_text_document_test_indexmap_51200ns_1024sl_1234s_sample_idx.npy
> loading shuffle-idx mapping from my-gpt2_text_document_test_indexmap_51200ns_1024sl_1234s_shuffle_idx.npy
loaded indexed file in 0.001 seconds
total number of samples: 52174
total number of epochs: 46
> finished creating GPT datasets ...
[after dataloaders are built] datetime: 2024-11-14 00:12:12
done with setup ...
(min, max) time across ranks (ms):
model-and-optimizer-setup ......................: (65.30, 65.30)
train/valid/test-data-iterators-setup ..........: (5738.83, 5738.83)
training ...
[before the start of training step] datetime: 2024-11-14 00:12:12
Traceback (most recent call last):
File "/mount/Megatron-LM-stanford/pretrain_gpt.py", line 154, in <module>
pretrain(train_valid_test_datasets_provider, model_provider,
File "/mount/Megatron-LM-stanford/megatron/training.py", line 147, in pretrain
iteration = train(forward_step_func,
File "/mount/Megatron-LM-stanford/megatron/training.py", line 712, in train
train_step(forward_step_func,
File "/mount/Megatron-LM-stanford/megatron/training.py", line 421, in train_step
losses_reduced = forward_backward_func(
File "/mount/Megatron-LM-stanford/megatron/schedules.py", line 263, in forward_backward_no_pipelining
output_tensor = forward_step(forward_step_func, data_iterator,
File "/mount/Megatron-LM-stanford/megatron/schedules.py", line 133, in forward_step
output_tensor, loss_func = forward_step_func(data_iterator, model)
File "/mount/Megatron-LM-stanford/pretrain_gpt.py", line 124, in forward_step
output_tensor = model(tokens, position_ids, attention_mask,
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/mount/Megatron-LM-stanford/megatron/model/distributed.py", line 59, in forward
return self.module(*inputs, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/mount/Megatron-LM-stanford/megatron/model/module.py", line 184, in forward
outputs = self.module(*inputs, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/mount/Megatron-LM-stanford/megatron/model/gpt_model.py", line 80, in forward
lm_output = self.language_model(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/mount/Megatron-LM-stanford/megatron/model/language_model.py", line 432, in forward
encoder_output = self.encoder(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/mount/Megatron-LM-stanford/megatron/model/transformer.py", line 1227, in forward
hidden_states = layer(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/mount/Megatron-LM-stanford/megatron/model/transformer.py", line 739, in forward
self.self_attention(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/mount/Megatron-LM-stanford/megatron/model/transformer.py", line 601, in forward
context_layer = self.core_attention(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/mount/Megatron-LM-stanford/megatron/model/transformer.py", line 313, in forward
attention_probs = self.attention_dropout(attention_probs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/dropout.py", line 58, in forward
return F.dropout(input, self.p, self.training, self.inplace)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/functional.py", line 1268, in dropout
return _VF.dropout_(input, p, training) if inplace else _VF.dropout(input, p, training)
RuntimeError: handle_0 INTERNAL ASSERT FAILED at "/opt/pytorch/pytorch/c10/cuda/driver_api.cpp":15, please report a bug to PyTorch.
[2024-11-14 00:12:17,743] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 2892) of binary: /usr/bin/python
Traceback (most recent call last):
File "/usr/local/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 806, in main
run(args)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 797, in run
elastic_launch(
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
pretrain_gpt.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-11-14_00:12:17
host : 0c9f17b7c8c7
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 2892)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
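Side note: the "traceback : To enable traceback" line at the bottom refers to torchrun's structured error report; per the linked page, wrapping the launched entrypoint with the record decorator fills in the error_file/traceback fields of that report. A minimal sketch, where main() stands in for the top-level code in pretrain_gpt.py (an assumption about where it would be applied):

```python
# Sketch of the @record hook from
# https://pytorch.org/docs/stable/elastic/errors.html (linked in the log).
# With this wrapper, torchrun's failure summary includes the child's traceback
# and error file instead of "error_file: <N/A>".
from torch.distributed.elastic.multiprocessing.errors import record

@record
def main():
    # ... the pretrain(...) call from pretrain_gpt.py would go here (assumption) ...
    raise RuntimeError("example failure surfaced in torchrun's report")

if __name__ == "__main__":
    main()
```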