Releases · NVIDIA/Megatron-LM
NVIDIA Megatron Core 0.12.1
Merge branch 'gaod/llama4/te_fix' into 'core_r0.12.0': fix the TE assertion for release. See merge request ADLR/megatron-lm!3340
NVIDIA Megatron Core 0.12.0
- Add FP8 recipe selection to arguments (--fp8-recipe, --first-last-layers-bf16, --num-layers-at-start-in-bf16, --num-layers-at-end-in-bf16)
- Context parallel: fix loss scaling when calculate_per_token_loss=True
- Make the number of data parallel communication buckets configurable (--ddp-num-buckets, --ddp-pad-buckets-for-high-nccl-busbw); these and the FP8 flags above are illustrated in the sketches after these notes
- Inference
  - Support in-flight batching and chunked KV cache
  - Reduce memory usage
    - by not materializing the full attention mask
    - by only materializing logits for the last token during decode
    - by removing an obsolete tensor reference
- Hybrid Model
  - Inference
    - Add CUDA graph support
    - Change tools/run_mamba_text_generation_server.py to use megatron.core.inference
    - Fix a shape issue when materializing logits for Mamba model
  - Improve initialization of Mamba layers
  - Add configuration switches (--mamba-state-dim, --mamba-head-dim, --mamba-num-groups, --is-hybrid-model); see the sketch after these notes
  - Make num_floating_point_operations work with hybrid model
  - Make hybrid_conversion.py work with mixer that uses TE linear
  - Add FP8 support
  - Fix Mamba dt_bias tensor parallelism
- Support multimodal tokenizer
- Improve data parallelism scaling
- MoE
  - Features:
    - DeepEP support, compatible with all the parallelisms and token drop / dropless
    - Important precision improvement: Enable FP32/FP64 routing and unpermutation using --moe-router-dtype; FP32 is recommended for all fine-grained MoE training (see the sketch after these notes)
    - CUDA Graph support for MoE
    - Multi-Token Prediction (MTP) support
    - Fused indices_to_multihot kernel for the DeepEP dispatcher
  - Bug fixes:
    - Fix hang issue with MoE+Dense hybrid models
    - Update theoretical memory and TFLOPS estimation for MoE and MLA
    - Fix MoE aux loss scaling for per-token loss
    - Fixes for group-limited routing and expert bias, verified through DeepSeek-V3 end-to-end runs
  - Known issues:
    - Checkpoints trained with Custom FSDP for MoE may not be compatible with 3D parallel training.
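The new mixed-precision and DDP-bucketing options above are ordinary command-line arguments. A minimal sketch of how they might be passed to a standard launch; only the flag names come from these notes, while the script name and all values (recipe, layer counts, bucket count) are illustrative assumptions, not recommendations:

```bash
# Illustrative only: values are assumptions, not tuned recommendations.
FP8_AND_DDP_ARGS=(
    --fp8-recipe delayed                   # select the FP8 scaling recipe (new in 0.12.0)
    --first-last-layers-bf16               # keep the boundary layers in BF16
    --num-layers-at-start-in-bf16 1        # leading layers kept in BF16
    --num-layers-at-end-in-bf16 1          # trailing layers kept in BF16
    --ddp-num-buckets 8                    # number of data-parallel communication buckets
    --ddp-pad-buckets-for-high-nccl-busbw  # pad buckets for higher NCCL bus bandwidth
)
# Appended to an existing FP8-enabled GPT launch, e.g.:
#   torchrun --nproc_per_node=8 pretrain_gpt.py "${GPT_ARGS[@]}" "${FP8_AND_DDP_ARGS[@]}"
```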
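The new Mamba configuration switches are likewise plain arguments. A minimal sketch for a hybrid (Mamba + attention + MLP) model; the dimension values are illustrative guesses and would need to match the intended architecture:

```bash
# Illustrative only: dimensions are placeholders, not a validated configuration.
MAMBA_HYBRID_ARGS=(
    --is-hybrid-model        # enable the hybrid Mamba/attention/MLP model path
    --mamba-state-dim 128    # SSM state dimension
    --mamba-head-dim 64      # Mamba mixer head dimension
    --mamba-num-groups 8     # number of Mamba groups
)
# These flags apply to training as well as to the text-generation server
# (tools/run_mamba_text_generation_server.py, which now uses megatron.core.inference).
```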
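For the router precision improvement, --moe-router-dtype selects the dtype used for routing and unpermutation. A minimal sketch; the surrounding MoE flags and their values are shown only for context and are assumptions, not part of this release note:

```bash
# Illustrative only: expert count and top-k are placeholders.
MOE_PRECISION_ARGS=(
    --num-experts 64          # hypothetical fine-grained expert count
    --moe-router-topk 4       # hypothetical top-k routing
    --moe-router-dtype fp32   # run routing and unpermutation in FP32 (fp64 is also accepted)
)
```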
NVIDIA Megatron Core 0.12.0rc3
Prerelease: NVIDIA Megatron Core 0.12.0rc3 (2025-04-15)
NVIDIA Megatron Core 0.12.0rc2
Prerelease: NVIDIA Megatron Core 0.12.0rc2 (2025-04-09)
NVIDIA Megatron Core 0.11.0
- Add multi-datacenter training support through N/S connection
- MoE
  - Features:
    - Support DeepSeek-V3 fine-tuning
      - Aux-loss-free load balancing strategy
      - Node-limited routing and device-limited routing support
      - Tensor parallelism support for MLA and sequence auxiliary loss
      - MTP (with TP and PP support) is coming soon
    - Permutation / unpermutation fusion kernel from TransformerEngine
    - Uneven virtual pipeline parallel split support in first and last PP stage
  - Bug fixes:
    - Fix the grad scale when TP != expert-TP and average_in_collective is enabled in DDP
    - Fix TEGroupedMLP distckpt compatibility issue with FP8 padding/unpadding
  - Known issues:
    - When training a Dense + MoE hybrid model, the process will hang if any PP rank does not have expert params
NVIDIA Megatron Core 0.11.0rc0
Prerelease: NVIDIA Megatron Core 0.11.0rc0 (2025-02-20)
NVIDIA Megatron Core 0.10.0
- Adding MLA to MCore
- Enable FP8 for GroupedMLP
- MoE Parallel Folding
- Enhance MoE Architecture: Support MoE Layer Frequency Patterns and Configurable MoE FFN Hidden Size (see the sketch after these notes)
- Multimodal: NVLM training and evaluation support in MCore
- Mamba Hybrid
- Increase performance and reduce memory footprint of Triton language/compiler distributed caching
- Add more unit testing and fix bugs
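The layer-frequency pattern and configurable expert FFN size from the item above are driven by MoE arguments. A minimal sketch, assuming the flag names --moe-layer-freq and --moe-ffn-hidden-size (they are not spelled out in these notes, so check them against the argument definitions, e.g. megatron/training/arguments.py); the values are illustrative only:

```bash
# Illustrative only: flag names and values are assumptions based on the feature description.
MOE_ARCH_ARGS=(
    --num-experts 8             # hypothetical expert count
    --moe-layer-freq 2          # e.g. make every 2nd transformer layer an MoE layer
    --moe-ffn-hidden-size 1024  # expert FFN hidden size, decoupled from the dense FFN size
)
```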
NVIDIA Megatron Core 0.9.0
- Uneven pipeline parallelism
  - Enable pipeline parallelism where first and last ranks have fewer transformer layers than the intermediate ranks
- Per layer CUDAGraph support for GPT training with Transformer Engine modules
- Enable different TP sizes for the vision encoder
- Enable pipeline parallelism for T5 & Llava models
- Support multi-tile multi-image input in Llava models
- MoE
  - FP8 support
  - Runtime upcycling support
  - Dispatcher implementation optimizations
  - Shared expert support with overlapping optimizations
  - Qwen Model support
- Mamba Hybrid
  - Main branch is no longer compatible with released checkpoints (use ssm branch)
  - Add distributed checkpointing
  - Fix bugs related to inference
  - Add unit tests
- Known Issues
  - When using sequence parallel, during the transformer block forward pass, dropout is not using the appropriate RNG context.
NVIDIA Megatron Core 0.8.0
- Multimodal
  - Added initial support for training vision language models using the LLaVA architecture
  - Added initial support for inference with multimodal inputs
  - End-to-end multimodal example from data collection to training to evaluation is provided in examples/multimodal
- MoE
  - Context Parallel support
  - Distributed checkpoint support for grouped GEMM
- Mamba
  - Added initial support for training and inference of Mamba-2 models
  - Support for hybrid models consisting of Mamba-2, attention, and MLP layers
  - Examples provided in examples/mamba
NVIDIA Megatron Core 0.7.0
- MoE
  - Token drop support
  - Several efficiency optimizations
  - Improved model parallelism
  - Memory optimizations
- Distributed checkpointing
  - Enabled for Retro
  - Asynchronous checkpoint saving
- Several minor bug fixes, speed improvements, and memory optimizations