
Ifu dev 20250318 v2.2 #212


Draft · wants to merge 71 commits into base: dev

Conversation

VeeraRajasekhar (Contributor)

Description

Please include a brief summary of the changes, relevant motivation and context.

Fixes # (issue)

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • Change A
  • Change B

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

ptrendx and others added 30 commits February 14, 2025 17:11
Signed-off-by: Przemek Tredak <[email protected]>
…x FP8 related codes (#1468)

* add prob permute; fix fp8tensor

Signed-off-by: Hongxiao Bai <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* revert unnecessary changes in UT

Signed-off-by: Hongxiao Bai <[email protected]>

* remove unnecessary probs dtype convert

Signed-off-by: Hongxiao Bai <[email protected]>

* keep the output nums if probs is not provided

Signed-off-by: Hongxiao Bai <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* refine the doc string

Signed-off-by: Hongxiao Bai <[email protected]>

* fix lint

Signed-off-by: Hongxiao Bai <[email protected]>

* use fp32 compute type

Signed-off-by: Hongxiao Bai <[email protected]>

* style fix

Signed-off-by: Hongxiao Bai <[email protected]>

* fix empty input return

Signed-off-by: Hongxiao Bai <[email protected]>

* separate prob related functions out

Signed-off-by: Hongxiao Bai <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Hongxiao Bai <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Xin Yao <[email protected]>
Co-authored-by: Phuong Nguyen <[email protected]>
flax module with compute dtype inferred from the inputs

Signed-off-by: Phuong Nguyen <[email protected]>
* Fix issues for MCore DDP.

Signed-off-by: Dennis Liu <[email protected]>

* Remove force data release for CPU offloading.

Signed-off-by: Dennis Liu <[email protected]>

* Add preserved attributes.

Signed-off-by: Dennis Liu <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add main_grad to preserved attributes.

Signed-off-by: Dennis Liu <[email protected]>

* Change prepare_for_saving to original tensor and add .data to CPU hook.

Signed-off-by: Dennis Liu <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update.

Signed-off-by: Dennis Liu <[email protected]>

* Fix for LayernormLinear in FP8.

Signed-off-by: Dennis Liu <[email protected]>

---------

Signed-off-by: Dennis Liu <[email protected]>
Co-authored-by: Xin Yao <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Fix typo

Signed-off-by: Tim Moon <[email protected]>
Co-authored-by: Kirthi Shankar Sivamani <[email protected]>
* fix fuse_wgrad_accumulation for GroupedLinear

Signed-off-by: Xin Yao <[email protected]>

* fix fuse_wgrad_accumulation for GroupedLinear

Signed-off-by: Xin Yao <[email protected]>

* update tests

Signed-off-by: Xin Yao <[email protected]>

---------

Signed-off-by: Xin Yao <[email protected]>
Co-authored-by: Tim Moon <[email protected]>
* Fix te sequential for older pytorch versions

Signed-off-by: Kirthi Shankar Sivamani <[email protected]>

* Fixes

Signed-off-by: Kirthi Shankar Sivamani <[email protected]>

---------

Signed-off-by: Kirthi Shankar Sivamani <[email protected]>
* commit some debug code

Signed-off-by: Xiaowei Ren <[email protected]>

* add more debug info

Signed-off-by: Xiaowei Ren <[email protected]>

* debug code commit and typo fix

Signed-off-by: Xiaowei Ren <[email protected]>

* a typo fix

Signed-off-by: Xiaowei Ren <[email protected]>

* remove debug info

Signed-off-by: Xiaowei Ren <[email protected]>

* do not return lse

Signed-off-by: Xiaowei Ren <[email protected]>

* add amax_per_step for quantizers of CP

Signed-off-by: Xiaowei Ren <[email protected]>

* fix FP8 + CP

Signed-off-by: Xiaowei Ren <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* bug fix

Signed-off-by: Xiaowei Ren <[email protected]>

* bug fix

Signed-off-by: Xiaowei Ren <[email protected]>

* dtype fix

Signed-off-by: Xiaowei Ren <[email protected]>

* bug fix

Signed-off-by: Xiaowei Ren <[email protected]>

---------

Signed-off-by: Xiaowei Ren <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Xiaowei Ren <[email protected]>
… (#1466)

Use same API in optimizer zero_grad as PyT optimizers

Signed-off-by: Tim Moon <[email protected]>
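
For reference, the PyTorch-style semantics being matched here, as a hedged sketch (not the TE optimizer code): `zero_grad(set_to_none=True)` either frees the gradient buffers or zeroes them in place.

```python
import torch

def zero_grad(params, set_to_none: bool = True) -> None:
    """Illustrative only: mirror torch.optim.Optimizer.zero_grad semantics."""
    for p in params:
        if p.grad is None:
            continue
        if set_to_none:
            p.grad = None          # free the buffer; next backward re-allocates it
        else:
            p.grad.detach_()
            p.grad.zero_()         # keep the buffer, fill it with zeros
```
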
* Remove dependency on transformer_engine::Tensor in attention.cu

Signed-off-by: Kshitij Janardan Lakhani <[email protected]>

* Templatize thd_partition_indices_kernel and thd_read_half_tensor_kernel kernels ONLY for invoking recompilation and not directly using the pre-compiled symbols in libtransformer.so

Signed-off-by: Kshitij Janardan Lakhani <[email protected]>

* Modify attention.cu for thd templatized kernels. Remove dependency on common.h

Signed-off-by: Kshitij Janardan Lakhani <[email protected]>

* Move thd structs from libtransformer.so to framework extensions include header

Code cleanup

Signed-off-by: Kshitij Janardan Lakhani <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Consolidate and move thd_utils from common to framework extensions

Signed-off-by: Kshitij Janardan Lakhani <[email protected]>

* Remove template decorators around thd_partition_indices_kernel and thd_read_half_tensor_kernel

Signed-off-by: Kshitij Janardan Lakhani <[email protected]>

Code clean up

Signed-off-by: Kshitij Janardan Lakhani <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Kshitij Janardan Lakhani <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* fix

Signed-off-by: Pawel Gadzinski <[email protected]>

* reshape inp

Signed-off-by: Pawel Gadzinski <[email protected]>

---------

Signed-off-by: Pawel Gadzinski <[email protected]>
* non-exit tests

Signed-off-by: Pawel Gadzinski <[email protected]>

* fix

Signed-off-by: Pawel Gadzinski <[email protected]>

* fix

Signed-off-by: Pawel Gadzinski <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Pawel Gadzinski <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* minor fixes for attention

Signed-off-by: Charlene Yang <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Charlene Yang <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Fix a crash with module._apply(lambda t: t.cpu())

Signed-off-by: Guyue Huang <[email protected]>

* Add comments

Signed-off-by: Guyue Huang <[email protected]>

* Make sure tensor is moved to dst device before quantizer quantizes

Signed-off-by: Guyue Huang <[email protected]>

---------

Signed-off-by: Guyue Huang <[email protected]>
Co-authored-by: Tim Moon <[email protected]>
* add remove_caches api

Signed-off-by: Youngeun Kwon <[email protected]>

* Update transformer_engine/pytorch/tensor/float8_tensor.py

Co-authored-by: Tim Moon <[email protected]>
Signed-off-by: Youngeun Kwon <[email protected]>

* explicit delete

Signed-off-by: Youngeun Kwon <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Youngeun Kwon <[email protected]>
Co-authored-by: Tim Moon <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Added parallel cross entropy loss implementation using online softmax

Signed-off-by: Selvaraj Anandaraj <[email protected]>
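
Since the commit above names the technique, here is a minimal PyTorch sketch of cross entropy computed via an online (streaming) softmax over vocab chunks; the actual change is a Triton kernel, and all names below are illustrative rather than the implementation added in this PR.

```python
import torch

def cross_entropy_online_softmax(logits: torch.Tensor, target: torch.Tensor,
                                 chunk: int = 1024) -> torch.Tensor:
    """Per-token CE via a streaming logsumexp: one pass over vocab chunks,
    keeping only a running max and a rescaled running denominator."""
    n, v = logits.shape
    running_max = torch.full((n,), float("-inf"), device=logits.device, dtype=logits.dtype)
    running_sum = torch.zeros(n, device=logits.device, dtype=logits.dtype)
    for start in range(0, v, chunk):
        block = logits[:, start:start + chunk]
        block_max = block.max(dim=1).values
        new_max = torch.maximum(running_max, block_max)
        # Rescale the accumulated denominator to the new max, then add this chunk.
        running_sum = running_sum * torch.exp(running_max - new_max) \
            + torch.exp(block - new_max.unsqueeze(1)).sum(dim=1)
        running_max = new_max
    logsumexp = running_max + torch.log(running_sum)
    target_logit = logits.gather(1, target.unsqueeze(1)).squeeze(1)
    return (logsumexp - target_logit).mean()
```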

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Added tests

Signed-off-by: Selvaraj Anandaraj <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Added reshape of loss output

Signed-off-by: Selvaraj Anandaraj <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Added to test list

Signed-off-by: Selvaraj Anandaraj <[email protected]>

* Added Triton dependency

Signed-off-by: Selvaraj Anandaraj <[email protected]>

* Added copyright

Signed-off-by: Selvaraj Anandaraj <[email protected]>

* Fixed lint errors

Signed-off-by: Selvaraj Anandaraj <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update setup.py

Co-authored-by: Kirthi Shankar Sivamani <[email protected]>
Signed-off-by: Selvaraj Anandaraj <[email protected]>

* Fixed lint and triton failure

Signed-off-by: Selvaraj Anandaraj <[email protected]>

* Removed flattening for scalars

Signed-off-by: Selvaraj Anandaraj <[email protected]>

* Skip tests on Blackwell due to TE CI caveat

Signed-off-by: Selvaraj Anandaraj <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Added reason arg

Signed-off-by: Selvaraj Anandaraj <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Do not register Triton dependency with setuptools

Signed-off-by: Tim Moon <[email protected]>

---------

Signed-off-by: Selvaraj Anandaraj <[email protected]>
Signed-off-by: Selvaraj Anandaraj <[email protected]>
Signed-off-by: Tim Moon <[email protected]>
Co-authored-by: Selvaraj Anandaraj <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Kirthi Shankar Sivamani <[email protected]>
Co-authored-by: Tim Moon <[email protected]>
Co-authored-by: Tim Moon <[email protected]>
* Added TMA alignment check to cast_fp8_1D

Signed-off-by: Oleg Goncharov <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Use tensor const-ref instead of tensor const-ptr

Signed-off-by: Tim Moon <[email protected]>

---------

Signed-off-by: Oleg Goncharov <[email protected]>
Signed-off-by: Tim Moon <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Tim Moon <[email protected]>
Co-authored-by: Tim Moon <[email protected]>
* Skip context parallelism tests if not enough GPUs

Signed-off-by: Tim Moon <[email protected]>
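
The skip pattern being added is the standard pytest one; a hedged sketch in which the GPU threshold and test name are made up for illustration:

```python
import pytest
import torch

@pytest.mark.skipif(torch.cuda.device_count() < 4,
                    reason="context parallelism test needs at least 4 GPUs")
def test_context_parallel_attention():
    ...
```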

* Apply suggestions from code review

Signed-off-by: Tim Moon <[email protected]>

---------

Signed-off-by: Tim Moon <[email protected]>
Signed-off-by: Tim Moon <[email protected]>
…p (#1452)

* Support vectorized local reduction for p2p-based ReduceScatter overlap

Signed-off-by: Sangkug Lym <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* cleanup

Signed-off-by: Sangkug Lym <[email protected]>

---------

Signed-off-by: Sangkug Lym <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* TP-RS local reduction: fix lint err

Signed-off-by: Sangkug Lym <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Sangkug Lym <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Fix quantized tensor shape

Signed-off-by: Kirthi Shankar Sivamani <[email protected]>

* add shape to make_like; add test for chunk

Signed-off-by: Kirthi Shankar Sivamani <[email protected]>

* Fix typo from suggestion

Signed-off-by: Kirthi Shankar Sivamani <[email protected]>

---------

Signed-off-by: Kirthi Shankar Sivamani <[email protected]>
…e (#1516)

* Enforce torch 2.0 and run attn tests with torch.compile

Signed-off-by: Kirthi Shankar Sivamani <[email protected]>

* replace torch.compile with jit_fuser

Signed-off-by: Kirthi Shankar Sivamani <[email protected]>

* Fixes

Signed-off-by: Kirthi Shankar Sivamani <[email protected]>

---------

Signed-off-by: Kirthi Shankar Sivamani <[email protected]>
* delete extra tensor objects after restoring float8 tensors

Signed-off-by: Sudhakar Singh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* nit fix

Signed-off-by: Sudhakar Singh <[email protected]>

* fix the leak in float8tensor and mxfloat8tensor classes

Signed-off-by: Sudhakar Singh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* uncomment the fix

Signed-off-by: Sudhakar Singh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix lint

Signed-off-by: Sudhakar Singh <[email protected]>

---------

Signed-off-by: Sudhakar Singh <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
…rt (#1528)

Set flag in norm modules for Mcore sequence-parallel support

Signed-off-by: Tim Moon <[email protected]>
* Support THD + ring attention for self attn

Signed-off-by: Reese Wang <[email protected]>

* Consolidate reorder strategy

Signed-off-by: Reese Wang <[email protected]>

* Fix dataclass frozen issue

Signed-off-by: Reese Wang <[email protected]>

* Remove redundant code

Signed-off-by: Reese Wang <[email protected]>

* Use AttnBiasType, AttnMaskType, QKVLayout in cpp_extension

Signed-off-by: Reese Wang <[email protected]>

* Fix lint

Signed-off-by: Reese Wang <[email protected]>

* Refine P2P helper check_supported

Signed-off-by: Reese Wang <[email protected]>

* Add segment_ids/pos check

Signed-off-by: Reese Wang <[email protected]>

* Fixup

Signed-off-by: Reese Wang <[email protected]>

* Add dual chunk swap example

Signed-off-by: Reese Wang <[email protected]>

* Align different reorder code structure

Signed-off-by: Reese Wang <[email protected]>

---------

Signed-off-by: Reese Wang <[email protected]>
Co-authored-by: Phuong Nguyen <[email protected]>
Added constexpr checks of tensor boundaries

Signed-off-by: Oleg Goncharov <[email protected]>
* Expose only required symbols from libtransformer_engine.so during linking for pytorch

Signed-off-by: Kshitij Janardan Lakhani <[email protected]>

* Augment libtransformer_engine.version for jax compatibility

Signed-off-by: Kshitij Janardan Lakhani <[email protected]>

* Augment the libtransformer_engine.version to ensure compatibility with CPP tests
Remove getenv from the .version file
Combine system.cpp and system.h

Signed-off-by: Kshitij Janardan Lakhani <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Nit: Remove commented code for not including common.h

Signed-off-by: Kshitij Janardan Lakhani <[email protected]>

* Replace explicit getenv instantiations with a helper template
Use filesystem calls in file_exists()

Signed-off-by: Kshitij Janardan Lakhani <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Revert comment to falsy instead of false

Co-authored-by: Tim Moon <[email protected]>
Signed-off-by: Kshitij Lakhani <[email protected]>

---------

Signed-off-by: Kshitij Janardan Lakhani <[email protected]>
Signed-off-by: Kshitij Lakhani <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Tim Moon <[email protected]>
ksivaman and others added 30 commits March 7, 2025 11:32
Don't set data to null

Signed-off-by: Kirthi Shankar Sivamani <[email protected]>
Signed-off-by: Kirthi Shankar Sivamani <[email protected]>
Fix incorrect docstrings in tensor saving functions

Signed-off-by: Tim Moon <[email protected]>
* fix recompilation of out and lse correction in p2p+bshd/sbhd

Signed-off-by: Xiaowei Ren <[email protected]>

* fix recompilation of get_seq_chunk_ids_for_reordering

Signed-off-by: Xiaowei Ren <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix recompilation of reorder_seq_chunks_for_a2a

Signed-off-by: Xiaowei Ren <[email protected]>

* recover a change

Signed-off-by: Xiaowei Ren <[email protected]>

* typo fix

Signed-off-by: Xiaowei Ren <[email protected]>

* minor change to softmax_lse correction

Signed-off-by: Xiaowei Ren <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* cache cu_seqlens for BSHD/SBHD format

Signed-off-by: Xiaowei Ren <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* do not need to allocate out buffer for BSHD/SBHD

Signed-off-by: Xiaowei Ren <[email protected]>

* code refactoring

Signed-off-by: Xiaowei Ren <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* minor fix

Signed-off-by: Xiaowei Ren <[email protected]>

* refactor init out correction

Signed-off-by: Xiaowei Ren <[email protected]>

* fix a docstring

Signed-off-by: Xiaowei Ren <[email protected]>

* typo fix

Signed-off-by: Xiaowei Ren <[email protected]>

* code refactoring

Signed-off-by: Xiaowei Ren <[email protected]>

* fix init out correct dtype

Signed-off-by: Xiaowei Ren <[email protected]>

* add pad_between_seqs to DPA API

Signed-off-by: Xiaowei Ren <[email protected]>

* add pad_between_seqs to the API of MHA and transformer layer

Signed-off-by: Xiaowei Ren <[email protected]>

* add pad_between_seqs to the API of MHA and transformer layer

Signed-off-by: Xiaowei Ren <[email protected]>

---------

Signed-off-by: Xiaowei Ren <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* check in per-tensor current scaling full recipe

Signed-off-by: zhongboz <[email protected]>
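
For orientation, a minimal sketch of what per-tensor current scaling means: the scale is derived from the tensor's own amax at quantization time, rather than from an amax history as in delayed scaling. This is illustrative only and omits the distributed amax reduction, use_split_accumulator GEMM settings, and recipe plumbing that the commits below add.

```python
import torch

FP8_E4M3_MAX = 448.0  # largest representable E4M3 value

def quantize_current_scaling(x: torch.Tensor, eps: float = 1e-12):
    """Per-tensor current scaling: scale computed from this tensor's amax."""
    amax = x.abs().amax().float()
    scale = FP8_E4M3_MAX / torch.clamp(amax, min=eps)
    x_fp8 = (x.float() * scale).to(torch.float8_e4m3fn)   # quantized data
    return x_fp8, scale.reciprocal()                       # keep 1/scale for dequantization

def dequantize(x_fp8: torch.Tensor, scale_inv: torch.Tensor) -> torch.Tensor:
    return x_fp8.float() * scale_inv
```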

[pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

Signed-off-by: zhongboz <[email protected]>

setup basics of current scaling quantizer in python level

Signed-off-by: zhongboz <[email protected]>

[pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

Signed-off-by: zhongboz <[email protected]>

add test case for current scaling dequantize

Signed-off-by: zhongboz <[email protected]>

[pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

Signed-off-by: zhongboz <[email protected]>

finish linear layer fwd bwd test, determined error with bf16

Signed-off-by: zhongboz <[email protected]>

[pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

Signed-off-by: zhongboz <[email protected]>

achieved zero tolerance for Linear by specifying gemm use_split_accumulator config

Signed-off-by: zhongboz <[email protected]>

enable layernormlinear with current scaling, pass bitwise test

Signed-off-by: zhongboz <[email protected]>

refactor test case code

Signed-off-by: zhongboz <[email protected]>

make current scaling quantizers distributed, pass distributed linear & layernormlinear tests

Signed-off-by: zhongboz <[email protected]>

bug fix: use cached fp8 recipe in backward

Signed-off-by: zhongboz <[email protected]>

fix layernorm_mlp with current scaling, fix activation_helper with current scaling

Signed-off-by: zhongboz <[email protected]>

support detailed numerical settings from recipe to quantization kernel

Signed-off-by: zhongboz <[email protected]>

resolving MR comments

Signed-off-by: zhongboz <[email protected]>

recipe naming

Signed-off-by: zhongboz <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* resolve mr comments, remove IS_CURRENT_SCALING template from kernels

Signed-off-by: zhongboz <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* resolve mr comments, make current scaling c++ test cases

Signed-off-by: zhongboz <[email protected]>

* add current scaling to test_numerics.py, skip act recomp and grouped linear

Signed-off-by: zhongboz <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add benchmark for quantizer

Signed-off-by: zhongboz <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add benchmarks for linear layer

Signed-off-by: zhongboz <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* bug fix, typo

Signed-off-by: zhongboz <[email protected]>

* resolve more mr comments

Signed-off-by: zhongboz <[email protected]>

* avoid potential race condition by not using from_blob to construct amax tensor in C++

Signed-off-by: zhongboz <[email protected]>

* resolve more comments

Signed-off-by: zhongboz <[email protected]>

* Debug linter warnings and license check

Signed-off-by: Tim Moon <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Debug import error in FP8 tensor test

Signed-off-by: Tim Moon <[email protected]>

* Debug compilation error with CUDA 12.1 for Turing

Signed-off-by: Tim Moon <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* resolve mr comments, fix activation cast fusion

Signed-off-by: zhongboz <[email protected]>

* resolve comments, add NVTEQuantizationParams for compute scale

Signed-off-by: zhongboz <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* remove is_current_scaling check totally from common folder

Signed-off-by: zhongboz <[email protected]>

* remove benchmarks, will contribute in another repo

Signed-off-by: zhongboz <[email protected]>

* adjust cs default recipe config

Signed-off-by: zhongboz <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* adjust comments in test

Signed-off-by: zhongboz <[email protected]>

* Remove current scaling mode from core lib

Signed-off-by: Tim Moon <[email protected]>

* Refactor current-scaling-specific logic in core C++ lib

Move amax and scale update functions out of casting functions, and put into dedicated current-scaling source file. Add general API for accessing quantization config object.

Signed-off-by: Tim Moon <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add missing header in C++ tests

Signed-off-by: Tim Moon <[email protected]>

* Disable test config with FP8 transpose on Blackwell

Signed-off-by: Tim Moon <[email protected]>

* Fix compilation error in C++ test

Signed-off-by: Tim Moon <[email protected]>

---------

Signed-off-by: zhongboz <[email protected]>
Signed-off-by: Tim Moon <[email protected]>
Co-authored-by: zhongboz <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Tim Moon <[email protected]>
Co-authored-by: Tim Moon <[email protected]>
* Verified TE2.0 with offloading

Signed-off-by: Selvaraj Anandaraj <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Skipping tests for Ampere and removed child class preparing

Signed-off-by: Selvaraj Anandaraj <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* offloading support for MXFP8 dtype

Signed-off-by: Selvaraj Anandaraj <[email protected]>
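
As background for the offloading commits in this block, the general mechanism in stock PyTorch looks like the following hedged sketch: saved-tensor hooks that round-trip activations through host memory. The TE changes extend this idea to quantized tensors such as MXFP8, which carry separate data and scale components that each need to be moved; the helper names below are illustrative.

```python
import torch

def pack_to_cpu(t: torch.Tensor):
    # Stash the saved activation on the host; remember its original device.
    return t.device, t.to("cpu", non_blocking=True)

def unpack_from_cpu(packed):
    device, cpu_tensor = packed
    return cpu_tensor.to(device, non_blocking=True)

# Offload everything autograd saves for backward while the context is active:
# with torch.autograd.graph.saved_tensors_hooks(pack_to_cpu, unpack_from_cpu):
#     y = model(x)
# y.sum().backward()
```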

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Changed quantized tensor detection mechanism

Signed-off-by: Selvaraj Anandaraj <[email protected]>

* Fix mxfp8 offload, lint errors, and var name

Signed-off-by: Kirthi Shankar Sivamani <[email protected]>

* Supported disabling offloading for quantized tensors

Signed-off-by: Selvaraj Anandaraj <[email protected]>

* bug fix

Signed-off-by: Selvaraj Anandaraj <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fixed bugs

Signed-off-by: Selvaraj Anandaraj <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Added support for None in list of Quantized data tensors

Signed-off-by: root <[email protected]>

* Hopper backward compatibility cleanup

Signed-off-by: Selvaraj Anandaraj <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Coding style nit

Signed-off-by: Selvaraj Anandaraj <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Added guards

Signed-off-by: Selvaraj Anandaraj <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Selvaraj Anandaraj <[email protected]>
Signed-off-by: Kirthi Shankar Sivamani <[email protected]>
Co-authored-by: Selvaraj Anandaraj <[email protected]>
Co-authored-by: Kirthi Shankar Sivamani <[email protected]>
Internal quantizer for input to the modules

Signed-off-by: Przemek Tredak <[email protected]>
Remove Megatron-LM convergence test

Signed-off-by: Tim Moon <[email protected]>
Co-authored-by: Kirthi Shankar Sivamani <[email protected]>
Revert "Use internal quantizer for input to the modules (#1551)"

This reverts commit b3e7035.

Signed-off-by: Przemek Tredak <[email protected]>
…… (#1540)

Remove xla_ignore_channel_id check and ignore Scan loop warning in unit test

Signed-off-by: Reese Wang <[email protected]>
* fix dtypes in fused attn bwd for FP8

Signed-off-by: Charlene Yang <[email protected]>

* add comments for dtypes

Signed-off-by: Charlene Yang <[email protected]>

* remove redundant qkv_dtype in fwd

Signed-off-by: Charlene Yang <[email protected]>

* remove Nones in bwd returns

Signed-off-by: Charlene Yang <[email protected]>

---------

Signed-off-by: Charlene Yang <[email protected]>
* Explicitly use python3 and pip3

Signed-off-by: Tim Moon <[email protected]>

* Run pre-commit as Python module

Signed-off-by: Tim Moon <[email protected]>

* Replace some missed references to "python" or "pip"

Signed-off-by: Tim Moon <[email protected]>

---------

Signed-off-by: Tim Moon <[email protected]>
Signed-off-by: Tim Moon <[email protected]>
Make ffi compatible with jax 0.4

Signed-off-by: Reese Wang <[email protected]>
Co-authored-by: Phuong Nguyen <[email protected]>
* Delete row-wise data in single-GPU linear forward

Signed-off-by: Tim Moon <[email protected]>

* Debug Python->C++ parsing of transpose-only Float8Tensors

Signed-off-by: Tim Moon <[email protected]>

* Debug tensor shape calculation without row-wise data

Signed-off-by: Tim Moon <[email protected]>

* Debug correctness issues with only column-wise data

Signed-off-by: Tim Moon <[email protected]>

* Only cache column-wise input in LayerNormLinear

Signed-off-by: Tim Moon <[email protected]>

* Support MXFP8 all-gather with only column-wise data

Signed-off-by: Tim Moon <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix moe cases, lint, rm unused ctx

Signed-off-by: Kirthi Shankar Sivamani <[email protected]>

* Fix CPU activation offloading and use consistent logic for save/restore

Signed-off-by: Kirthi Shankar Sivamani <[email protected]>

* Fix tests

Signed-off-by: Kirthi Shankar Sivamani <[email protected]>

* Fix typo

Signed-off-by: Kirthi Shankar Sivamani <[email protected]>

* RM stray file

Signed-off-by: Kirthi Shankar Sivamani <[email protected]>

* Fix distributed and cpp tests

Signed-off-by: Kirthi Shankar Sivamani <[email protected]>

* Fix norm cpp tests

Signed-off-by: Kirthi Shankar Sivamani <[email protected]>

* Rm stray file

Signed-off-by: Kirthi Shankar Sivamani <[email protected]>

* RM stray file

Signed-off-by: Kirthi Shankar Sivamani <[email protected]>

* Fix MXFP8 AG

Signed-off-by: Kirthi Shankar Sivamani <[email protected]>

* Fix FP8 with sequence parallelism

Signed-off-by: Kirthi Shankar Sivamani <[email protected]>

* Fix UB bulk dgrad

Signed-off-by: Kirthi Shankar Sivamani <[email protected]>

---------

Signed-off-by: Tim Moon <[email protected]>
Signed-off-by: Kirthi Shankar Sivamani <[email protected]>
Co-authored-by: Kirthi Shankar Sivamani <[email protected]>
…e (#1558)

* add tex.bgrad_quantize support for CS

Signed-off-by: zhongboz <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Remove unused import

Signed-off-by: Tim Moon <[email protected]>

---------

Signed-off-by: zhongboz <[email protected]>
Signed-off-by: Tim Moon <[email protected]>
Co-authored-by: zhongboz <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Tim Moon <[email protected]>
Co-authored-by: Tim Moon <[email protected]>
update FE to 1.11

Signed-off-by: Charlene Yang <[email protected]>
fix cpu device import error

Signed-off-by: Hongxiao Bai <[email protected]>
Co-authored-by: Tim Moon <[email protected]>
* Add options to comm overlap tests

Signed-off-by: Vasudevan Rengasamy <[email protected]>

* Fix Typo

Signed-off-by: Vasudevan Rengasamy <[email protected]>

* Update tests/pytorch/distributed/run_layer_with_overlap.py

Signed-off-by: Tim Moon <[email protected]>

---------

Signed-off-by: Vasudevan Rengasamy <[email protected]>
Signed-off-by: Tim Moon <[email protected]>
Co-authored-by: Tim Moon <[email protected]>
* Create pytorch/dot_product_attention module and pytorch/d_p_a/utils.py
Move attention logging into a separate class in pytorch/d_p_a/utils.py

Signed-off-by: Kshitij Janardan Lakhani <[email protected]>

* Create FlashAttentionUtils class in pytorch/d_p_a/utils.py for versioning info
Move versioning info out of pytorch/attention.py

Signed-off-by: Kshitij Janardan Lakhani <[email protected]>

* Move AttentionParams and get_attention_backend from attention.py to d_p_a/utils.py
Fix tests and imports for the above refactor change

Signed-off-by: Kshitij Janardan Lakhani <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Move get_qkv_layout(), get_full_mask(), get_alibi(), get_attention_quantizers() to d_p_a/utils.py

Signed-off-by: Kshitij Janardan Lakhani <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Move tensor packing and unpacking helper functions from pyt/attention.py to d_p_a/utils.py

Signed-off-by: Kshitij Janardan Lakhani <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Move cumulative seqlens and indices methods from pyt/attention.py to d_p_a/utils.py
Rename cumulative functions from using _cu_ to using _cumul_ to differentiate from the CUDA cu* call convention
Rename tensor packaging methods with a leading underscore to mark them as internal to the file

Signed-off-by: Kshitij Janardan Lakhani <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Remove unnecessary imports in pytorch/attention.py and d_p_a/utils.py

Signed-off-by: Kshitij Janardan Lakhani <[email protected]>

* Create d_p_a/inference.py and move InferenceParams from pyt/attention.py to it
Modify tests and other files to import InferenceParams correctly

Signed-off-by: Kshitij Janardan Lakhani <[email protected]>

Modify docs api for InferenceParams

Signed-off-by: Kshitij Janardan Lakhani <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Create d_p_a/rope.py and move RoPE methods from  pytorch/attention.py to it

Signed-off-by: Kshitij Janardan Lakhani <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Code cleanup

Signed-off-by: Kshitij Janardan Lakhani <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix qa testing induced bug
Code clean up

Signed-off-by: Kshitij Janardan Lakhani <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix incorrect pack_tensor arg type
Code clean up

Signed-off-by: Kshitij Janardan Lakhani <[email protected]>

* nit: Resolve lint errors

Signed-off-by: Kshitij Janardan Lakhani <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Remove typedef FAUtils for FlashAttentionUtils
Use attn_log instead of att_log

Signed-off-by: Kshitij Janardan Lakhani <[email protected]>

Fix lint error

Signed-off-by: Kshitij Janardan Lakhani <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* nit: Fix the function name from get_cumul to the earlier get_cu

Signed-off-by: Kshitij Janardan Lakhani <[email protected]>

* nit: Fix typos, explicit imports and remove extra comments

Signed-off-by: Kshitij Janardan Lakhani <[email protected]>

---------

Signed-off-by: Kshitij Janardan Lakhani <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Charlene Yang <[email protected]>
…554)

* support tp-comm-overlap in Current Scaling recipe

Signed-off-by: Li Tao <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* clean

Signed-off-by: Li Tao <[email protected]>

* fix test recipe argument to generalize to MXFP8

Signed-off-by: Li Tao <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Reduce duplicated transpose in certain cases

Signed-off-by: Li Tao <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Use per_tensor_scaling() to judge DS or CS

Signed-off-by: Li Tao <[email protected]>

* minor fixes

Signed-off-by: Li Tao <[email protected]>

* change comment description

Signed-off-by: Li Tao <[email protected]>

* add multi-layer unit test for tp overlap

Signed-off-by: Li Tao <[email protected]>

* support test case that run for several times

Signed-off-by: Li Tao <[email protected]>

* avoid saving ub tensor in prepare_for_saving

Signed-off-by: Li Tao <[email protected]>

* fix

Signed-off-by: Li Tao <[email protected]>

* switch to a simple fix

Signed-off-by: Li Tao <[email protected]>

* formatting

Signed-off-by: Li Tao <[email protected]>

* simplify test cases; avoid additional clone()

Signed-off-by: Li Tao <[email protected]>

* fall back to get_buffer in layernormmlp

Signed-off-by: Li Tao <[email protected]>

* use 2 layers for fp8 tpoverlap multi-layer test for better tolerance, limit max gpus for test

Signed-off-by: zhongboz <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Li Tao <[email protected]>
Signed-off-by: zhongboz <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: zhongboz <[email protected]>
* Add issue template

Signed-off-by: Kirthi Shankar Sivamani <[email protected]>

* Fixes

Signed-off-by: Kirthi Shankar Sivamani <[email protected]>

* Make GPU info section

Signed-off-by: Kirthi Shankar Sivamani <[email protected]>

---------

Signed-off-by: Kirthi Shankar Sivamani <[email protected]>
* Do not create multiple cublas handle

Signed-off-by: Przemek Tredak <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix for multiple GPUs per thread

Signed-off-by: Przemek Tredak <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix multithreaded execution

Signed-off-by: Przemek Tredak <[email protected]>

* Fix from conflict

Signed-off-by: Kirthi Shankar Sivamani <[email protected]>

---------

Signed-off-by: Przemek Tredak <[email protected]>
Signed-off-by: Kirthi Shankar Sivamani <[email protected]>
Co-authored-by: Kirthi Shankar Sivamani <[email protected]>
* DistOpt support with offloading

Signed-off-by: Selvaraj Anandaraj <[email protected]>

* Added distopt support for TE2.0

Signed-off-by: Selvaraj Anandaraj <[email protected]>

* Restricted this to MCore DistOpt only

Signed-off-by: Selvaraj Anandaraj <[email protected]>

* Added guards

Signed-off-by: Selvaraj Anandaraj <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update transformer_engine/pytorch/module/linear.py

Co-authored-by: Kirthi Shankar Sivamani <[email protected]>
Signed-off-by: Selvaraj Anandaraj <[email protected]>

* Update transformer_engine/pytorch/module/layernorm_linear.py

Co-authored-by: Kirthi Shankar Sivamani <[email protected]>
Signed-off-by: Selvaraj Anandaraj <[email protected]>

---------

Signed-off-by: Selvaraj Anandaraj <[email protected]>
Signed-off-by: Selvaraj Anandaraj <[email protected]>
Co-authored-by: Selvaraj Anandaraj <[email protected]>
Co-authored-by: Kirthi Shankar Sivamani <[email protected]>
* [QA] Add error handling

- Standardize test failure handling using the unified 'test_fail' and 'error_exit' functions.

Signed-off-by: Linxi Ding <[email protected]>

* Update script to use explicit python3, pip3, and python3 -m pytest calls

- Change pip to pip3.
- Change python to python3.
- Change pytest to python3 -m pytest.

Signed-off-by: Linxi Ding <[email protected]>

---------

Signed-off-by: Linxi Ding <[email protected]>
* Update full recompute feature to save recipe.

The recompute context uses the same recipe
and fp8 settings as the original fwd pass.

Signed-off-by: Keith Wyss <[email protected]>
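
A hedged sketch of the idea described above (not the actual TE checkpoint code): the forward pass records which recipe and FP8 setting were active, and the backward-time recomputation re-enters the same autocast. The class name and argument handling here are illustrative.

```python
import torch
import transformer_engine.pytorch as te

class _RecomputeWithSavedRecipe(torch.autograd.Function):
    """Illustrative recompute wrapper that reuses the forward-pass FP8 recipe."""

    @staticmethod
    def forward(ctx, run_fn, fp8_enabled, fp8_recipe, *args):
        # Assumes all *args are tensors; remember the recipe/settings for backward.
        ctx.run_fn, ctx.fp8_enabled, ctx.fp8_recipe = run_fn, fp8_enabled, fp8_recipe
        ctx.save_for_backward(*args)
        with torch.no_grad(), te.fp8_autocast(enabled=fp8_enabled, fp8_recipe=fp8_recipe):
            return run_fn(*args)

    @staticmethod
    def backward(ctx, *grads):
        inputs = [a.detach().requires_grad_(a.requires_grad) for a in ctx.saved_tensors]
        # Recompute under the *same* recipe and fp8 settings as the original forward.
        with torch.enable_grad(), te.fp8_autocast(enabled=ctx.fp8_enabled,
                                                  fp8_recipe=ctx.fp8_recipe):
            outputs = ctx.run_fn(*inputs)
        torch.autograd.backward(outputs, grads)
        return (None, None, None) + tuple(i.grad for i in inputs)
```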

* Formatted python code.

Signed-off-by: Keith Wyss <[email protected]>

* Simplify code by relying on recipe in ctx

Signed-off-by: Keith Wyss <[email protected]>

* MR feedback: import style

Signed-off-by: Keith Wyss <[email protected]>

---------

Signed-off-by: Keith Wyss <[email protected]>
Co-authored-by: Kirthi Shankar Sivamani <[email protected]>
* add paged attention; test_kv_cache_accuray and test_paged_attn pass

Signed-off-by: Charlene Yang <[email protected]>
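
To make the feature concrete, a toy sketch of a paged KV cache: keys and values live in fixed-size pages, and a per-sequence page table maps logical token positions to physical pages. The class and layout below are illustrative, not TE's kv_cache_manager.

```python
import torch

class PagedKVCache:
    """Toy paged KV cache with a per-sequence page table. Illustrative only."""

    def __init__(self, num_pages: int, page_size: int, num_heads: int, head_dim: int,
                 device: str = "cpu"):
        self.page_size = page_size
        self.k_pages = torch.zeros(num_pages, page_size, num_heads, head_dim, device=device)
        self.v_pages = torch.zeros_like(self.k_pages)
        self.free_pages = list(range(num_pages))
        self.page_table: dict[int, list[int]] = {}   # seq_id -> physical page ids
        self.seq_len: dict[int, int] = {}

    def append(self, seq_id: int, k: torch.Tensor, v: torch.Tensor):
        """Append one token's K/V (shape [num_heads, head_dim]) for a sequence."""
        pos = self.seq_len.get(seq_id, 0)
        if pos % self.page_size == 0:                 # current page full: grab a new one
            self.page_table.setdefault(seq_id, []).append(self.free_pages.pop())
        page = self.page_table[seq_id][pos // self.page_size]
        slot = pos % self.page_size
        self.k_pages[page, slot] = k
        self.v_pages[page, slot] = v
        self.seq_len[seq_id] = pos + 1

    def gather(self, seq_id: int):
        """Materialize contiguous K/V (real kernels index the pages directly)."""
        pages = self.page_table[seq_id]
        n = self.seq_len[seq_id]
        k = self.k_pages[pages].reshape(-1, *self.k_pages.shape[2:])[:n]
        v = self.v_pages[pages].reshape(-1, *self.v_pages.shape[2:])[:n]
        return k, v
```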

* remove unnecessary change from last commit

Signed-off-by: Charlene Yang <[email protected]>

* test_fused_attn pass

Signed-off-by: Charlene Yang <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* remove unnecessary import in test_numerics

Signed-off-by: Charlene Yang <[email protected]>

* add license for test

Signed-off-by: Charlene Yang <[email protected]>

* fix lint

Signed-off-by: Charlene Yang <[email protected]>

* add to L0 test

Signed-off-by: Charlene Yang <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update license for test_paged_attn

Signed-off-by: Charlene Yang <[email protected]>

* update kv_cache_manager license

Signed-off-by: Charlene Yang <[email protected]>

* fix build issue from previous merge

Signed-off-by: Charlene Yang <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* WIP: minor fix/preparation for inference/cuda graph

Signed-off-by: Charlene Yang <[email protected]>

* WIP: non-paged

Signed-off-by: Charlene Yang <[email protected]>

* WIP: non-paged, bshd/sbhd

Signed-off-by: Charlene Yang <[email protected]>

* WIP: non-paged, thd, no CG

Signed-off-by: Charlene Yang <[email protected]>

* WIP: non-paged, thd, CG

Signed-off-by: Charlene Yang <[email protected]>

* WIP: non-paged, CG

Signed-off-by: Charlene Yang <[email protected]>

* WIP: non-paged, using paged kernel

Signed-off-by: Charlene Yang <[email protected]>

* WIP: restructure kernels

Signed-off-by: Charlene Yang <[email protected]>

* WIP: paged, CG

Signed-off-by: Charlene Yang <[email protected]>

* WIP: padding + BRCM

Signed-off-by: Charlene Yang <[email protected]>

* WIP: restructure IP, clean up

Signed-off-by: Charlene Yang <[email protected]>

* WIP: fix non-CG, fused

Signed-off-by: Charlene Yang <[email protected]>

* WIP: fix last commit

Signed-off-by: Charlene Yang <[email protected]>

* WIP: unfused, non-CG

Signed-off-by: Charlene Yang <[email protected]>

* WIP: flash-attn, non-CG

Signed-off-by: Charlene Yang <[email protected]>

* WIP: flash_attn_with_kvcache

Signed-off-by: Charlene Yang <[email protected]>

* commit two files missed by bcef6b34

Signed-off-by: Charlene Yang <[email protected]>

* WIP: thd_bshd_bshd

Signed-off-by: Charlene Yang <[email protected]>

* WIP: fix last commit

Signed-off-by: Charlene Yang <[email protected]>

* WIP: fix 1c31b68d

Signed-off-by: Charlene Yang <[email protected]>

* WIP: add bshd_2sbhd, sbhd_2bshd

Signed-off-by: Charlene Yang <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* WIP: some cleanup

Signed-off-by: Charlene Yang <[email protected]>

* WIP: all qkv_format combinations and merge CM files

Signed-off-by: Charlene Yang <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* WIP: some lint fixes

Signed-off-by: Charlene Yang <[email protected]>

* WIP: add docstring for IP

Signed-off-by: Charlene Yang <[email protected]>

* fix sequences_pre

Signed-off-by: Charlene Yang <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* WIP: minor fixes for multi-layer

Signed-off-by: Charlene Yang <[email protected]>

* WIP: initial multi-layer test

Signed-off-by: Charlene Yang <[email protected]>

* WIP: minor clean up

Signed-off-by: Charlene Yang <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* WIP: clean up

Signed-off-by: Charlene Yang <[email protected]>

* WIP: switch to flash_attn_varlen_func

Signed-off-by: Charlene Yang <[email protected]>

* WIP: fix unfused for separate q/kv format

Signed-off-by: Charlene Yang <[email protected]>

* WIP: fix fused for separate q/kv formats

Signed-off-by: Charlene Yang <[email protected]>

* WIP: flash attn + TELayer + 2 layers

Signed-off-by: Charlene Yang <[email protected]>

* WIP: unfused + TL + 2layers

Signed-off-by: Charlene Yang <[email protected]>

* WIP: all modules/backend

Signed-off-by: Charlene Yang <[email protected]>

* WIP: minor cleanup

Signed-off-by: Charlene Yang <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* WIP: FlashAttention on Hopper with 2.7.3

Signed-off-by: Charlene Yang <[email protected]>

* WIP: FlashAttention + v3 from 39e7179

Signed-off-by: Charlene Yang <[email protected]>

* WIP: FlashAttention + v3 + FP8 + WIP

Signed-off-by: Charlene Yang <[email protected]>

* WIP: add backend support table

Signed-off-by: Charlene Yang <[email protected]>

* WIP: clean up

Signed-off-by: Charlene Yang <[email protected]>

* WIP: separate use_flash_attention_2 and _3

Signed-off-by: Charlene Yang <[email protected]>

* WIP: tweaks to paged attn script

Signed-off-by: Charlene Yang <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* WIP: enable/disable certain cases for fused attn

Signed-off-by: Charlene Yang <[email protected]>

* WIP: small fixes for lint and cg

Signed-off-by: Charlene Yang <[email protected]>

* WIP: minor fixes for attn/infer

Signed-off-by: Charlene Yang <[email protected]>

* WIP: fix CP

Signed-off-by: Charlene Yang <[email protected]>

* WIP: readd page info to FADescriptor_v1

Signed-off-by: Charlene Yang <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* minor tweak to test_numerics.py

Signed-off-by: Charlene Yang <[email protected]>

* fix 9.5/9.7 sq/skv + mask logic

Signed-off-by: Charlene Yang <[email protected]>

* clean up

Signed-off-by: Charlene Yang <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* minor fix for FA3

Signed-off-by: Charlene Yang <[email protected]>

* more minor fixes for FA3

Signed-off-by: Charlene Yang <[email protected]>

* test page_size=1 for FA3

Signed-off-by: Charlene Yang <[email protected]>

* fix t3hd/th3d strides

Signed-off-by: Charlene Yang <[email protected]>

* fix ckpt recompute and fa3 k_scale

Signed-off-by: Charlene Yang <[email protected]>

* raise dynamo recompile limit for test

Signed-off-by: Charlene Yang <[email protected]>

* remove thunder test from L0

Signed-off-by: Charlene Yang <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix FA selection logic

Signed-off-by: Charlene Yang <[email protected]>

* fix FA3 q_descale shape

Signed-off-by: Charlene Yang <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* remove page_table from IP.step() returns

Signed-off-by: Charlene Yang <[email protected]>

* fix FP8 FlashAttn DPA fp8_dpa tests

Signed-off-by: Charlene Yang <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix CP

Signed-off-by: Charlene Yang <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* minor tweaks

Signed-off-by: Charlene Yang <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update FA3 note and L3 test

Signed-off-by: Charlene Yang <[email protected]>

* fix lint

Signed-off-by: Charlene Yang <[email protected]>

* remove redundant import in test

Signed-off-by: Charlene Yang <[email protected]>

* adopt new FA3 APIs from FA2.7.3+/hopper for CP and non-CP

Signed-off-by: Charlene Yang <[email protected]>

* fix lint

Signed-off-by: Charlene Yang <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* relax tols for TransformerLayers

Signed-off-by: Charlene Yang <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix merge

Signed-off-by: Charlene Yang <[email protected]>

* fix merge 2

Signed-off-by: Charlene Yang <[email protected]>

* fix FA import comments

Signed-off-by: Charlene Yang <[email protected]>

* relax tols for Ampere

Signed-off-by: Charlene Yang <[email protected]>

* fix fa3 version and reduce messaging

Signed-off-by: Charlene Yang <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update FA3 to its latest commit on main

Signed-off-by: Charlene Yang <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add default values to IP and assertion to graph.py

Signed-off-by: Charlene Yang <[email protected]>

* add more comments in attention

Signed-off-by: Charlene Yang <[email protected]>

* use custom_cache_manager instead of cache_manager

Signed-off-by: Charlene Yang <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Charlene Yang <[email protected]>
Signed-off-by: Charlene Yang <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Fix softmax shape for THD format.

Signed-off-by: Michael Goldfarb <[email protected]>
* Do not apply bias when apply_bias is False

Signed-off-by: Przemek Tredak <[email protected]>

* Bwd fix for LNMLP and tests

Signed-off-by: Przemek Tredak <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix for the dbias calculation

Signed-off-by: Przemek Tredak <[email protected]>

* Improve tests and cleaning the logic

Signed-off-by: Przemek Tredak <[email protected]>

* Tightened test tolerances a little

Signed-off-by: Przemek Tredak <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Revert "Tightened test tolerances a little"

This reverts commit 2e20a92c884a84759006541adc1d638ab91dde62.

Signed-off-by: Przemek Tredak <[email protected]>

* Update tests/pytorch/test_numerics.py

Co-authored-by: Tim Moon <[email protected]>
Signed-off-by: Przemyslaw Tredak <[email protected]>

* Fix the Gelu Aux type

Signed-off-by: Przemek Tredak <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Remove use_fc1_bias option

Signed-off-by: Przemek Tredak <[email protected]>

---------

Signed-off-by: Przemek Tredak <[email protected]>
Signed-off-by: Przemyslaw Tredak <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Tim Moon <[email protected]>