
Releases: NVIDIA/Megatron-LM

NVIDIA Megatron Core 0.12.1

23 May 09:54
Merge branch 'gaod/llama4/te_fix' into 'core_r0.12.0'

Fix the TE assertion for release

See merge request ADLR/megatron-lm!3340

NVIDIA Megatron Core 0.12.0

06 May 21:10
core_v0.12.0
d580efc
  • Add FP8 recipe selection to arguments (--fp8-recipe, --first-last-layers-bf16, --num-layers-at-start-in-bf16, --num-layers-at-end-in-bf16); see the launch sketch after this list
  • Context parallel: fix loss scaling when calculate_per_token_loss=True
  • Make the number of data parallel communication buckets configurable (--ddp-num-buckets, --ddp-pad-buckets-for-high-nccl-busbw)
  • Inference
    • Support in-flight batching and chunked KV cache
    • Reduce memory usage by:
      • not materializing the full attention mask
      • materializing logits only for the last token during decode
      • removing an obsolete tensor reference
  • Hybrid Model
    • Inference
      • Add CUDA graph support
      • Change tools/run_mamba_text_generation_server.py to use megatron.core.inference
      • Fix a shape issue when materializing logits for Mamba model
    • Improve initialization of Mamba layers
    • Add configuration switches (--mamba-state-dim, --mamba-head-dim, --mamba-num-groups, --is-hybrid-model)
    • Make num_floating_point_operations work with hybrid model
    • Make hybrid_conversion.py work with mixer that uses TE linear
    • Add FP8 support
    • Fix Mamba dt_bias tensor parallelism
    • Support multimodal tokenizer
    • Improve data parallelism scaling
  • MoE
    • Features:
      • DeepEP support, compatible with all parallelisms and both token-drop and dropless modes
      • Important precision improvement: enable FP32/FP64 routing and unpermutation via --moe-router-dtype; FP32 is recommended for all fine-grained MoE training
      • CUDA Graph support for MoE
      • Multi-Token Prediction (MTP) Support
      • Fused indices_to_multihot kernel for DeepEP dispatcher
    • Bug fixes:
      • Fix hang issue with MoE+Dense hybrid models
      • Update theoretical memory and TFLOPS estimation for MoE and MLA
      • Fix MoE aux-loss scaling for per-token loss
      • Fix group-limited routing and expert bias; verified through DeepSeek-V3 end-to-end runs
    • Known issues:
      • Checkpoints trained with Custom FSDP for MoE may not be compatible with 3D-parallel training.
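
A minimal launch sketch tying the new 0.12.0 arguments together, assuming they compose with an ordinary pretrain_gpt.py launch. The flag names come from the notes above; the model size, parallelism, data paths, and the "delayed" recipe name passed to --fp8-recipe are illustrative assumptions, not recommendations from this release.

```bash
#!/bin/bash
# Sketch only: single-node pretrain_gpt.py launch exercising the new
# 0.12.0 flags listed above. Model/parallelism/data values are placeholders.
DATA_PREFIX=/path/to/dataset_text_document

# FP8 recipe selection, keeping the boundary transformer layers in BF16.
# The recipe name "delayed" is an assumption; check the argument help for choices.
FP8_ARGS=(
  --fp8-format hybrid
  --fp8-recipe delayed
  --first-last-layers-bf16
  --num-layers-at-start-in-bf16 1
  --num-layers-at-end-in-bf16 1
)

# Configurable number of data-parallel communication buckets.
DDP_ARGS=(
  --ddp-num-buckets 8
  --ddp-pad-buckets-for-high-nccl-busbw
)

# FP32 routing and unpermutation, as recommended above for fine-grained MoE.
MOE_ARGS=(
  --num-experts 64
  --moe-router-topk 8
  --moe-router-dtype fp32
)

torchrun --nproc_per_node 8 pretrain_gpt.py \
  --num-layers 32 --hidden-size 4096 --num-attention-heads 32 \
  --seq-length 4096 --max-position-embeddings 4096 \
  --micro-batch-size 1 --global-batch-size 256 \
  --tensor-model-parallel-size 2 --pipeline-model-parallel-size 2 \
  --sequence-parallel \
  --train-iters 1000 --lr 1e-4 --bf16 \
  "${FP8_ARGS[@]}" "${DDP_ARGS[@]}" "${MOE_ARGS[@]}" \
  --data-path "${DATA_PREFIX}" \
  --tokenizer-type GPT2BPETokenizer \
  --vocab-file /path/to/gpt2-vocab.json \
  --merge-file /path/to/gpt2-merges.txt
```

The --moe-router-dtype fp32 setting follows the precision recommendation above; drop the MOE_ARGS group for a dense model.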

NVIDIA Megatron Core 0.12.0rc3

15 Apr 19:50

Prerelease: NVIDIA Megatron Core 0.12.0rc3 (2025-04-15)

NVIDIA Megatron Core 0.12.0rc2

09 Apr 10:27

Prerelease: NVIDIA Megatron Core 0.12.0rc2 (2025-04-09)

NVIDIA Megatron Core 0.11.0

14 Mar 22:59
aa6207e
  • Add multi-datacenter training support through N/S connections
  • MoE
    • Features
      • Support DeepSeek-V3 fine-tuning (see the fine-tuning sketch after this list)
        • Aux-loss-free load balancing strategy
        • Node-limited routing and device-limited routing support.
        • Tensor Parallelism support for MLA and Sequence Auxiliary Loss
        • MTP (with TP and PP support) is coming soon.
      • Permutation/unpermutation fusion kernel from Transformer Engine.
      • Uneven virtual pipeline parallel split support in the first and last PP stages.
    • Bug fixes:
      • Fix the grad scale when TP != expert-TP and average_in_collective is enabled in DDP.
      • Fix TEGroupedMLP distributed-checkpoint compatibility issue with FP8 padding/unpadding.
    • Known Issues:
      • When training the Dense+MoE hybrid model, the process will hang if any PP rank does not have expert params.
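
In the same spirit, a hedged sketch of resuming fine-tuning on an MoE checkpoint with expert parallelism. The notes do not name the arguments behind aux-loss-free balancing or node-/device-limited routing, so those are omitted; --expert-tensor-parallel-size is assumed to be the flag that sets expert TP independently of TP (the grad-scale fix above applies when the two differ and average_in_collective is enabled in DDP).

```bash
#!/bin/bash
# Sketch only: fine-tuning an MoE checkpoint with expert parallelism.
# --expert-tensor-parallel-size is an assumption about how expert TP is
# decoupled from TP; model/data values are placeholders.
CHECKPOINT_DIR=/path/to/pretrained_moe_checkpoint
SAVE_DIR=/path/to/finetuned_checkpoint

torchrun --nproc_per_node 8 pretrain_gpt.py \
  --load "${CHECKPOINT_DIR}" --save "${SAVE_DIR}" --finetune \
  --num-layers 32 --hidden-size 4096 --num-attention-heads 32 \
  --seq-length 4096 --max-position-embeddings 4096 \
  --micro-batch-size 1 --global-batch-size 128 \
  --train-iters 2000 --lr 1e-5 --bf16 \
  --tensor-model-parallel-size 2 \
  --expert-tensor-parallel-size 1 \
  --expert-model-parallel-size 4 \
  --sequence-parallel \
  --num-experts 64 --moe-router-topk 8 \
  --moe-grouped-gemm \
  --data-path /path/to/dataset_text_document \
  --tokenizer-type GPT2BPETokenizer \
  --vocab-file /path/to/gpt2-vocab.json \
  --merge-file /path/to/gpt2-merges.txt
```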

NVIDIA Megatron Core 0.11.0rc0

20 Feb 10:43
7c00175

Prerelease: NVIDIA Megatron Core 0.11.0rc0 (2025-02-20)

NVIDIA Megatron Core 0.10.0

17 Feb 17:31
7ee599a
  • Add MLA to MCore
  • Enable FP8 for GroupedMLP
  • MoE Parallel Folding
  • Enhance MoE architecture: support MoE layer frequency patterns and configurable MoE FFN hidden size (see the sketch after this list)
  • Multimodal: NVLM training and evaluation support in MCore
  • Mamba Hybrid
    • Increase performance and reduce memory footprint of Triton language/compiler distributed caching
    • Add more unit testing and fix bugs
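
A short sketch of the configurable MoE layer pattern and expert FFN width mentioned above. The --moe-layer-freq and --moe-ffn-hidden-size argument names are assumptions on my part; the notes name the features but not the flags, so verify them against the argument help.

```bash
#!/bin/bash
# Sketch only: place an MoE layer every second transformer layer and give
# the experts a narrower FFN than the dense layers. The --moe-layer-freq and
# --moe-ffn-hidden-size flag names are assumptions, not confirmed by the notes.
torchrun --nproc_per_node 8 pretrain_gpt.py \
  --num-layers 24 --hidden-size 2048 --num-attention-heads 16 \
  --ffn-hidden-size 8192 \
  --seq-length 4096 --max-position-embeddings 4096 \
  --micro-batch-size 1 --global-batch-size 256 \
  --train-iters 1000 --lr 1e-4 --bf16 \
  --num-experts 8 --moe-router-topk 2 \
  --moe-layer-freq 2 \
  --moe-ffn-hidden-size 1024 \
  --data-path /path/to/dataset_text_document \
  --tokenizer-type GPT2BPETokenizer \
  --vocab-file /path/to/gpt2-vocab.json \
  --merge-file /path/to/gpt2-merges.txt
```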

NVIDIA Megatron Core 0.9.0

24 Oct 10:30
  • Uneven pipeline parallelism
    • Enable pipeline parallelism where first and last ranks have fewer transformer layers than the intermediate ranks
  • Per-layer CUDA graph support for GPT training with Transformer Engine modules
  • Enable different TP sizes for the vision encoder
  • Enable pipeline parallelism for T5 & Llava models
  • Support multi-tile multi-image input in Llava models
  • MoE
    • FP8 support
    • Runtime upcycling support
    • Dispatcher implementation optimizations
    • Shared expert support with overlapping optimizations
      • Qwen Model support
  • Mamba Hybrid
    • Main branch is no longer compatible with released checkpoints (use ssm branch)
    • Add distributed checkpointing
    • Fix bugs related to inference
    • Add unit tests
  • Known Issues
    • When using sequence parallelism, dropout in the transformer block forward pass does not use the appropriate RNG context.

NVIDIA Megatron Core 0.8.0

13 Aug 12:12
  • Multimodal
    • Added initial support for training vision language models using the LLaVA architecture
    • Added initial support for inference with multimodal inputs
    • End-to-end multimodal example from data collection to training to evaluation is provided in examples/multimodal
  • MoE
    • Context Parallel support.
    • Distributed checkpoint support for grouped GEMM.
  • Mamba
    • Added initial support for training and inference of Mamba-2 models
    • Support for hybrid models consisting of Mamba-2, attention, and MLP layers
    • Examples provided in examples/mamba

NVIDIA Megatron Core 0.7.0

05 Jun 23:12
  • MoE
    • Token drop support
    • Several efficiency optimizations
    • Improved model parallelism
    • Memory optimizations
  • Distributed checkpointing
    • Enabled for Retro
    • Asynchronous checkpoint saving
  • Several minor bug fixes, speed improvements, and memory optimizations