Hello, I'm trying to make a DeepSpeed version of code that worked without DeepSpeed and check whether the results can be replicated. However, the DeepSpeed version does not seem to be working properly, so I'd like to ask for guidance.
TLDR:
Gradients cannot be accessed from model_engine (only from the model it wraps).
Gradients do not seem to be aggregated across GPUs after .backward() (i.e. when we run with 4 GPUs and a per-GPU batch size of 32, the input batch size is 32 and the gradient's leading dimension is also 32).
The validation (eval mode) loss for the same data is the same for all ranks except rank 0.
We would greatly appreciate it if anyone could give us even a slight hint for debugging this.
Below is a detailed explanation.
Background
The DeepSpeed version and the non-DeepSpeed version of the code exist separately.
DeepSpeed version: pretrain_main_deepspeed.py, pretrain_trainer_deepspeed.py
Non-DeepSpeed version: pretrain_main.py, pretrain_trainer.py
Experimental conditions:
IDENTICAL: global batch size, lr, wd, similar optimizers (AdamW vs. FusedAdam)
DIFFERENT: number of GPUs
=> We expect them to produce the same training curves.
I found that the 1-node 4-GPU DeepSpeed version seems to work fine based on validation loss, following the loss curve of the 1-node 4-GPU run with torch's DataParallel. (Orange: DeepSpeed, Blue: no DeepSpeed)
However, in the 16-node condition the loss doesn't converge to the expected level. I've tried the variants below.
(b) batch size (same per-GPU batch size of 128, or per-GPU batch size of 32) × lr (optimal 1-node lr × N, or × sqrt(N))
Below is the result of (b).
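As a concrete example of the lr variants in (b), this is the arithmetic involved (a sketch; the base lr value and N = 16 here are placeholders, not my actual numbers):

```python
import math

# Placeholder values: base_lr stands for the optimal lr found in the 1-node runs,
# N is the scale-up factor relative to that baseline (16 nodes here).
base_lr = 1e-3
N = 16

lr_linear = base_lr * N           # linear scaling rule
lr_sqrt = base_lr * math.sqrt(N)  # square-root scaling rule
print(f"linear: {lr_linear}, sqrt: {lr_sqrt}")
```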
Implementation details
I've attached the scripts I've used: run_deepspeed_CBraMod_pretraining_lucy.sh.zip and pretrain_trainer_deepspeed.py.zip.
Note) train_micro_batch_size_per_gpu is not used in the actual code; the batch_size argument is the actual batch size fed into the dataloader.
```python
data_loader = DataLoader(
    train_dataset,
    batch_size=params.batch_size,  # per-GPU batch size
    sampler=train_sampler,
    num_workers=16,
    shuffle=(train_sampler is None),  # shuffle only if not using a distributed sampler
    persistent_workers=True,
    prefetch_factor=3,
    pin_memory=True,
    drop_last=True,
    collate_fn=collate_fn_for_data_info,
)
```
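For reference, here is roughly how the engine is set up (a minimal sketch; the config values and variable names are illustrative, not the exact ones from my run):

```python
import deepspeed

# Illustrative config only; in my run the dataloader is built manually (above),
# so train_micro_batch_size_per_gpu is effectively just metadata.
ds_config = {
    "train_micro_batch_size_per_gpu": 32,  # intended to match params.batch_size
    "gradient_accumulation_steps": 1,
    # DeepSpeed derives: train_batch_size = micro_batch * grad_accum * world_size
}

model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```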
Problem & Questions
So I'm guessing there's a fundamental problem in my DeepSpeed code.
I've printed out the following to see if gradients are accumulated across ranks.
There were three problems:
Cannot inspect gradients from model_engine
Gradient shapes are not as expected
Validation loss on fake data differs only for rank 0
Questions
How can I inspect gradients across ranks?
I've tried the two versions below. The model version printed gradients while the model_engine version didn't. (2 nodes × 4 GPUs)
```python
def debug_gradients(self, batch_idx, log_every=10):
    """Quick gradient debugging"""
    if batch_idx % log_every != 0:
        return
    rank = deepspeed.comm.get_rank() if self.params.deepspeed else 8888
    print(f"\n[Rank {rank}] Gradient Check - Batch {batch_idx}")

    print("✅MODEL VERSION")
    for name, param in self.model.named_parameters():
        grad = deepspeed.utils.safe_get_full_grad(param)
        if grad is not None:
            print("Gradient Shape: ", name, grad.shape)
        else:
            print("Gradient is None for parameter:", name)

    print("✅MODEL ENGINE VERSION")
    for name, param in self.model_engine.named_parameters():
        if param.grad is not None:
            grad_norm = param.grad.norm().item()
            print(f"  {name}: shape={param.grad.shape}, norm={grad_norm:.4f}")
        else:
            print(f"  {name}: No gradient")
```
```
x.shape: torch.Size([32, 19, 30, 500])
✅MODEL VERSION
Gradient is None for parameter: mask_encoding
Gradient Shape: embedding.0.proj_in.0.weight torch.Size([32, 1, 1, 63])
Gradient Shape: embedding.0.proj_in.0.bias torch.Size([32])
Gradient Shape: embedding.0.proj_in.1.weight torch.Size([32])
Gradient Shape: embedding.0.proj_in.1.bias torch.Size([32])
Gradient Shape: embedding.0.proj_in.3.weight torch.Size([32, 32, 1, 3])
✅MODEL ENGINE VERSION
module.mask_encoding: No gradient
module.embedding.0.proj_in.0.weight: No gradient
module.embedding.0.proj_in.0.bias: No gradient
module.embedding.0.proj_in.1.weight: No gradient
module.embedding.0.proj_in.1.bias: No gradient
```
Also, the input shape and gradient shapes weren't what I expected.
(1 node × 4 GPUs)
Since the per-GPU batch_size is 32, I think the gradient shape should be (128, -, -, -), because gradients should be shared across model_engine. How should I investigate this?
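To make the cross-rank check concrete, this is the kind of probe I plan to run right after model_engine.backward(loss) (a sketch; it assumes deepspeed.comm is already initialized):

```python
import torch
import deepspeed

def check_grad_sync(model_engine):
    """Compare one parameter's gradient norm across ranks right after backward()."""
    rank = deepspeed.comm.get_rank()
    world_size = deepspeed.comm.get_world_size()

    # Take the first parameter that actually has a (full) gradient.
    for name, param in model_engine.module.named_parameters():
        grad = deepspeed.utils.safe_get_full_grad(param)
        if grad is not None:
            local_norm = grad.norm()
            # Gather every rank's norm into one tensor so rank 0 can compare them.
            norms = torch.zeros(world_size, device=local_norm.device)
            norms[rank] = local_norm
            deepspeed.comm.all_reduce(norms)  # default op is SUM; other slots were zero
            if rank == 0:
                print(f"{name}: per-rank grad norms = {norms.tolist()}")
            break
```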
I've checked the validation loss with fake data (same shape as the input x, built from torch.ones and torch.zeros) and found that the validation loss differs only on rank 0. (2 nodes × 4 GPUs)
```python
def validate(self, epoch, normalize_factor=100.0):
    self.model_engine.eval()
    valid_losses_rank = []  # losses on this rank

    # tqdm only on rank 0
    iterable_valid_loader = self.valid_data_loader
    if self.is_rank_0:
        iterable_valid_loader = tqdm(self.valid_data_loader, desc=f"Validation Epoch {epoch}", mininterval=10)

    with torch.no_grad():
        for batch_idx, (x, data_info_list) in enumerate(iterable_valid_loader):
            x = x.to(self.device, dtype=torch.float32) / 100.0
            print("validation x.shape: ", x.shape)
            ##!
            fake_data = torch.ones_like(x, device=self.device, dtype=torch.float32) / 2  # same fake data on every rank
            fake_loss = self.SSL.compute_loss(fake_data, data_info_list=data_info_list)
            print(f"Fake loss for validation for rank {deepspeed.comm.get_rank()}: {fake_loss.item()}")  # print fake loss for debugging
            ##!
```
```
Fake loss for validation for rank 0: 0.8215165138244629
Fake loss for validation for rank 4: 0.8218275308609009
Fake loss for validation for rank 5: 0.8218275308609009
Fake loss for validation for rank 7: 0.8218275308609009
Fake loss for validation for rank 6: 0.8218275308609009
Fake loss for validation for rank 2: 0.8218275308609009
Fake loss for validation for rank 1: 0.8218275308609009
Fake loss for validation for rank 3: 0.8218275308609009
```
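For completeness, this is the kind of aggregation I was planning to add so the per-rank fake losses can be compared in one place (a sketch; it assumes deepspeed.comm is initialized and that fake_loss lives on this rank's device):

```python
import torch
import deepspeed

def gather_fake_losses(fake_loss, device):
    """Collect every rank's fake loss so they can be compared side by side."""
    world_size = deepspeed.comm.get_world_size()
    rank = deepspeed.comm.get_rank()

    losses = torch.zeros(world_size, device=device)
    losses[rank] = fake_loss.detach().to(device)
    deepspeed.comm.all_reduce(losses)  # sum of one-hot contributions -> one value per rank

    if rank == 0:
        print("Fake losses by rank:", losses.tolist())
    return losses
```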