Describe the bug
- If `calculate_per_token_loss` is used and cp > 1:
  1. aux_loss is first divided by the square of the full num_tokens (which already accounts for cp);
  2. aux_loss is then scaled by num_local_tokens here;
  3. finally, both the main_loss gradient and the aux_loss gradient are scaled by 1/(num_local_tokens * dp_size * num_micro_batches) in the `finalize_model_grads` function.

  However, the num_local_tokens used there is not the local count but the full count (Line 179 in a845aa7), so aux_loss should be scaled by the full num_tokens (accounting for cp and sp), not by num_local_tokens. See the sketch after this list.
- If `calculate_per_token_loss` is not used but cp > 1, the gradient is divided by dp * cp in the `finalize_model_grads` function. lm_loss is scaled by cp in advance, but aux_loss is not, so should we multiply aux_loss by cp? (See the second function in the sketch below.)
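
To make the arithmetic above easier to check, here is a minimal numeric sketch (plain Python, not Megatron-LM code) of the two cases. All names in it (`tokens_local`, `tokens_full`, `step2_scale`, `cp`, `dp`, `num_micro_batches`) are hypothetical placeholders for the quantities described above, and case 1 assumes, as stated above, that `finalize_model_grads` uses the full token count while the current step-2 multiplier is the per-cp-rank count.

```python
def per_token_loss_aux_coeff(tokens_full, step2_scale, dp, num_micro_batches):
    """Case 1 (calculate_per_token_loss enabled, cp > 1).

    Effective multiplier applied to the raw load-balancing statistic:
      step 1: divide by the square of the full num_tokens,
      step 2: multiply by `step2_scale` (num_local_tokens today,
              the full num_tokens under the proposed fix),
      step 3: divide by num_tokens * dp_size * num_micro_batches in
              finalize_model_grads, where (per this report) that token
              count is the full one, not the local one.
    """
    coeff = 1.0 / tokens_full ** 2                      # step 1
    coeff *= step2_scale                                # step 2
    coeff /= tokens_full * dp * num_micro_batches       # step 3
    return coeff


def non_per_token_loss_factors(cp, dp):
    """Case 2 (calculate_per_token_loss disabled, cp > 1).

    finalize_model_grads divides gradients by dp * cp; lm_loss is
    pre-scaled by cp, aux_loss is not.
    """
    lm_factor = cp / (dp * cp)     # = 1 / dp
    aux_factor = 1.0 / (dp * cp)   # smaller than lm_factor by a factor of cp
    return lm_factor, aux_factor


if __name__ == "__main__":
    tokens_local, cp, dp, num_micro_batches = 1024, 2, 4, 8
    tokens_full = tokens_local * cp

    # Case 1: the two step-2 choices differ by exactly a factor of cp.
    print(per_token_loss_aux_coeff(tokens_full, tokens_local, dp, num_micro_batches))
    print(per_token_loss_aux_coeff(tokens_full, tokens_full, dp, num_micro_batches))

    # Case 2: lm_loss ends up with 1/dp, aux_loss with 1/(dp * cp).
    print(non_per_token_loss_factors(cp, dp))
```

If this reading of the code is right, both cases leave the aux loss smaller than the main loss by a factor of cp, which is the gap the two questions above are pointing at.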