Ifu dev 20250318 v2.2 #212
Draft: VeeraRajasekhar wants to merge 71 commits into dev from IFU-dev-20250318-v2.2
+14,719 −4,401
Conversation
Signed-off-by: Przemek Tredak <[email protected]>
…x FP8 related codes (#1468) * add prob permute; fix fp8tensor Signed-off-by: Hongxiao Bai <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * revert unnecessary changes in UT Signed-off-by: Hongxiao Bai <[email protected]> * remove unnecessary probs dtype convert Signed-off-by: Hongxiao Bai <[email protected]> * keep the output nums if probs is not provided Signed-off-by: Hongxiao Bai <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * refine the doc string Signed-off-by: Hongxiao Bai <[email protected]> * fix lint Signed-off-by: Hongxiao Bai <[email protected]> * use fp32 compute type Signed-off-by: Hongxiao Bai <[email protected]> * style fix Signed-off-by: Hongxiao Bai <[email protected]> * fix empty input return Signed-off-by: Hongxiao Bai <[email protected]> * separate prob related functions out Signed-off-by: Hongxiao Bai <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: Hongxiao Bai <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Xin Yao <[email protected]> Co-authored-by: Phuong Nguyen <[email protected]>
flax module with compute dtype inferred from the inputs Signed-off-by: Phuong Nguyen <[email protected]>
* Fix issues for MCore DDP. Signed-off-by: Dennis Liu <[email protected]> * Remove force data release for CPU offloading. Signed-off-by: Dennis Liu <[email protected]> * Add preserved attributeds. Signed-off-by: Dennis Liu <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Add main_grad to prevserved attributes. Signed-off-by: Dennis Liu <[email protected]> * Change prepare_for_saving to original tensor and add .data to CPU hook. Signed-off-by: Dennis Liu <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Update. Signed-off-by: Dennis Liu <[email protected]> * Fix for LayernormLinear in FP8. Signed-off-by: Dennis Liu <[email protected]> --------- Signed-off-by: Dennis Liu <[email protected]> Co-authored-by: Xin Yao <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Fix typo Signed-off-by: Tim Moon <[email protected]> Co-authored-by: Kirthi Shankar Sivamani <[email protected]>
* fix fuse_wgrad_accumulation for GroupedLinear Signed-off-by: Xin Yao <[email protected]> * fix fuse_wgrad_accumulation for GroupedLinear Signed-off-by: Xin Yao <[email protected]> * update tests Signed-off-by: Xin Yao <[email protected]> --------- Signed-off-by: Xin Yao <[email protected]> Co-authored-by: Tim Moon <[email protected]>
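For context, the fuse_wgrad_accumulation path referenced here follows the Megatron-style main_grad pattern, where the weight-gradient GEMM accumulates directly into a persistent FP32 buffer instead of producing a separate .grad tensor. The sketch below is a generic illustration of that pattern (attribute and helper names are only illustrative), not the GroupedLinear fix itself.

import torch

weight = torch.nn.Parameter(torch.randn(256, 256))
# Persistent FP32 accumulation buffer, typically owned by the distributed optimizer.
weight.main_grad = torch.zeros_like(weight, dtype=torch.float32)

def backward_wgrad(grad_output: torch.Tensor, inp: torch.Tensor) -> None:
    # With fused wgrad accumulation the GEMM writes into main_grad in place (beta=1.0);
    # here that is emulated with an explicit add_ of dW = grad_output^T @ inp.
    weight.main_grad.add_(grad_output.t().float() @ inp.float())

backward_wgrad(torch.randn(32, 256), torch.randn(32, 256))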
* Fix te sequential for older pytorch versions Signed-off-by: Kirthi Shankar Sivamani <[email protected]> * FIxes Signed-off-by: Kirthi Shankar Sivamani <[email protected]> --------- Signed-off-by: Kirthi Shankar Sivamani <[email protected]>
* commit some debug code Signed-off-by: Xiaowei Ren <[email protected]> * add more debug info Signed-off-by: Xiaowei Ren <[email protected]> * debug code commit and typo fix Signed-off-by: Xiaowei Ren <[email protected]> * a typo fix Signed-off-by: Xiaowei Ren <[email protected]> * remove debug info Signed-off-by: Xiaowei Ren <[email protected]> * do not return lse Signed-off-by: Xiaowei Ren <[email protected]> * add amax_per_step for quantizers of CP Signed-off-by: Xiaowei Ren <[email protected]> * fix FP8 + CP Signed-off-by: Xiaowei Ren <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * bug fix Signed-off-by: Xiaowei Ren <[email protected]> * bug fix Signed-off-by: Xiaowei Ren <[email protected]> * dtype fix Signed-off-by: Xiaowei Ren <[email protected]> * bug fix Signed-off-by: Xiaowei Ren <[email protected]> --------- Signed-off-by: Xiaowei Ren <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Xiaowei Ren <[email protected]>
… (#1466) Use same API in optimizer zero_grad as PyT optimizers Signed-off-by: Tim Moon <[email protected]>
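For reference, torch.optim optimizers expose zero_grad(set_to_none=True); a minimal sketch of the call-site parity this commit aims for (the model and optimizer below are placeholders, not TE's fused optimizer):

import torch

model = torch.nn.Linear(16, 16)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

loss = model(torch.randn(4, 16)).sum()
loss.backward()
optimizer.step()

# Same call works whether the optimizer is a stock PyTorch one or a drop-in
# replacement that mirrors the torch.optim.Optimizer interface.
optimizer.zero_grad(set_to_none=True)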
* Remove dependency on transformer_engine::Tensor in attention.cu Signed-off-by: Kshitij Janardan Lakhani <[email protected]> * Templatize thd_partition_indices_kernel and thd_read_half_tensor_kernel kernels ONLY for invoking recompilation and not directly using the pre-compiled symbols in libtransformer.so Signed-off-by: Kshitij Janardan Lakhani <[email protected]> * Modify attention.cu for thd templatized kernels. Remove dependency on common.h Signed-off-by: Kshitij Janardan Lakhani <[email protected]> * Move thd structs from libtransformer.so to framework extensions include header Code cleanup Signed-off-by: Kshitij Janardan Lakhani <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Consolidate and move thd_utils from common to framework extensions Signed-off-by: Kshitij Janardan Lakhani <[email protected]> * Remove template decorators around thd_partition_indices_kernel and thd_read_half_tensor_kernel Signed-off-by: Kshitij Janardan Lakhani <[email protected]> Code clean up Signed-off-by: Kshitij Janardan Lakhani <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: Kshitij Janardan Lakhani <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* fix Signed-off-by: Pawel Gadzinski <[email protected]> * reshape inp Signed-off-by: Pawel Gadzinski <[email protected]> --------- Signed-off-by: Pawel Gadzinski <[email protected]>
* non-exit tests Signed-off-by: Pawel Gadzinski <[email protected]> * fix Signed-off-by: Pawel Gadzinski <[email protected]> * fix Signed-off-by: Pawel Gadzinski <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: Pawel Gadzinski <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* minor fixes for attention Signed-off-by: Charlene Yang <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: Charlene Yang <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Fix a crash with module._apply(lambda t: t.cpu()) Signed-off-by: Guyue Huang <[email protected]> * Add comments Signed-off-by: Guyue Huang <[email protected]> * Make sure tensor is moved to dst device before quantizer quantizes Signed-off-by: Guyue Huang <[email protected]> --------- Signed-off-by: Guyue Huang <[email protected]> Co-authored-by: Tim Moon <[email protected]>
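The crashing pattern named in the commit is PyTorch's generic one; a minimal repro-style sketch, with a plain nn.Linear standing in for a TE module holding quantized tensors:

import torch

module = torch.nn.Linear(8, 8)
if torch.cuda.is_available():
    module = module.cuda()

# Equivalent to module.cpu(): _apply maps the function over every parameter and buffer,
# so custom tensor subclasses must survive the round trip.
module._apply(lambda t: t.cpu())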
* add remove_caches api Signed-off-by: Youngeun Kwon <[email protected]> * Update transformer_engine/pytorch/tensor/float8_tensor.py Co-authored-by: Tim Moon <[email protected]> Signed-off-by: Youngeun Kwon <[email protected]> * explicit delete Signed-off-by: Youngeun Kwon <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: Youngeun Kwon <[email protected]> Co-authored-by: Tim Moon <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Added parallel cross entropy loss implementation using online softmax Signed-off-by: Selvaraj Anandaraj <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Added tests Signed-off-by: Selvaraj Anandaraj <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Added reshape of loss output Signed-off-by: Selvaraj Anandaraj <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Added to test list Signed-off-by: Selvaraj Anandaraj <[email protected]> * Added Triton dependency Signed-off-by: Selvaraj Anandaraj <[email protected]> * Added copyright Signed-off-by: Selvaraj Anandaraj <[email protected]> * Fixed lint errors Signed-off-by: Selvaraj Anandaraj <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Update setup.py Co-authored-by: Kirthi Shankar Sivamani <[email protected]> Signed-off-by: Selvaraj Anandaraj <[email protected]> * Fixed lint and triton failure Signed-off-by: Selvaraj Anandaraj <[email protected]> * Removed flattening for scalars Signed-off-by: Selvaraj Anandaraj <[email protected]> * Skip tests on Blackwell due to TE CI caveat Signed-off-by: Selvaraj Anandaraj <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Added reason arg Signed-off-by: Selvaraj Anandaraj <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Do not register Triton dependency with setuptools Signed-off-by: Tim Moon <[email protected]> --------- Signed-off-by: Selvaraj Anandaraj <[email protected]> Signed-off-by: Selvaraj Anandaraj <[email protected]> Signed-off-by: Tim Moon <[email protected]> Co-authored-by: Selvaraj Anandaraj <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Kirthi Shankar Sivamani <[email protected]> Co-authored-by: Tim Moon <[email protected]> Co-authored-by: Tim Moon <[email protected]>
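As background, online softmax computes the normalizer in a single streaming pass by carrying a running max and a running sum of exponentials, so the full softmax never has to be materialized. The sketch below illustrates the idea in plain PyTorch; it is not the Triton kernel added by this commit.

import torch

def online_softmax_cross_entropy(logits: torch.Tensor, target: int, chunk: int = 1024) -> torch.Tensor:
    """Cross entropy over one row of logits, streaming in chunks.

    Maintains a running max m and running sum s of exp(x - m), so
    CE = logsumexp(logits) - logits[target] can be formed at the end.
    """
    m = torch.tensor(float("-inf"))
    s = torch.tensor(0.0)
    target_logit = logits[target]
    for start in range(0, logits.numel(), chunk):
        block = logits[start:start + chunk]
        new_m = torch.maximum(m, block.max())
        s = s * torch.exp(m - new_m) + torch.exp(block - new_m).sum()
        m = new_m
    return (m + torch.log(s)) - target_logit

logits = torch.randn(50000)
reference = torch.nn.functional.cross_entropy(logits.unsqueeze(0), torch.tensor([7]))
assert torch.allclose(online_softmax_cross_entropy(logits, 7), reference, atol=1e-4)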
* Added TMA alignment check to cast_fp8_1D Signed-off-by: Oleg Goncharov <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Use tensor const-ref instead of tensor const-ptr Signed-off-by: Tim Moon <[email protected]> --------- Signed-off-by: Oleg Goncharov <[email protected]> Signed-off-by: Tim Moon <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Tim Moon <[email protected]> Co-authored-by: Tim Moon <[email protected]>
* Skip context parallelism tests if not enough GPUs Signed-off-by: Tim Moon <[email protected]> * Apply suggestions from code review Signed-off-by: Tim Moon <[email protected]> --------- Signed-off-by: Tim Moon <[email protected]> Signed-off-by: Tim Moon <[email protected]>
Signed-off-by: Tim Moon <[email protected]>
…p (#1452) * Support vectorized local reduction for p2p-based ReduceScatter overlap Signed-off-by: Sangkug Lym <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * cleanup Signed-off-by: Sangkug Lym <[email protected]> --------- Signed-off-by: Sangkug Lym <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* TP-RS local reduction: fix lint err Signed-off-by: Sangkug Lym <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: Sangkug Lym <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Fix quantized tensor shape Signed-off-by: Kirthi Shankar Sivamani <[email protected]> * add shape to make_like; add test for chunk Signed-off-by: Kirthi Shankar Sivamani <[email protected]> * Fix typo from suggestion Signed-off-by: Kirthi Shankar Sivamani <[email protected]> --------- Signed-off-by: Kirthi Shankar Sivamani <[email protected]>
…e (#1516) * Enforce torch 2.0 and run attn tests with torch.compile Signed-off-by: Kirthi Shankar Sivamani <[email protected]> * replace torch.compile with jit_fuser Signed-off-by: Kirthi Shankar Sivamani <[email protected]> * Fixes Signed-off-by: Kirthi Shankar Sivamani <[email protected]> --------- Signed-off-by: Kirthi Shankar Sivamani <[email protected]>
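A rough sketch of the compile-or-script fallback that a jit_fuser-style helper provides (the real helper lives in transformer_engine.pytorch; the version gate and names here are illustrative only):

import torch

if int(torch.__version__.split(".")[0]) >= 2:
    jit_fuser = torch.compile      # TorchDynamo/Inductor path on torch 2.x
else:
    jit_fuser = torch.jit.script   # TorchScript fallback on older torch

@jit_fuser
def gelu_bias(x: torch.Tensor, bias: torch.Tensor) -> torch.Tensor:
    # Small pointwise function worth fusing into a single kernel.
    return torch.nn.functional.gelu(x + bias)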
* delete extra tensor objects after restoring float8 tensors Signed-off-by: Sudhakar Singh <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * nit fix Signed-off-by: Sudhakar Singh <[email protected]> * fix the leak in float8tensor and mxfloat8tensor classes Signed-off-by: Sudhakar Singh <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * uncomment the fix Signed-off-by: Sudhakar Singh <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix lint Signed-off-by: Sudhakar Singh <[email protected]> --------- Signed-off-by: Sudhakar Singh <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
…rt (#1528) Set flag in norm modules for Mcore sequence-parallel support Signed-off-by: Tim Moon <[email protected]>
* Support THD + ring attention for self attn Signed-off-by: Reese Wang <[email protected]> * Consolidate reorder strategy Signed-off-by: Reese Wang <[email protected]> * Fix dataclass frozen issue Signed-off-by: Reese Wang <[email protected]> * Remove redundant code Signed-off-by: Reese Wang <[email protected]> * Use AttnBiasType, AttnMaskType, QKVLayout in cpp_extension Signed-off-by: Reese Wang <[email protected]> * Fix lint Signed-off-by: Reese Wang <[email protected]> * Refine P2P helper check_supported Signed-off-by: Reese Wang <[email protected]> * Add segment_ids/pos check Signed-off-by: Reese Wang <[email protected]> * Fixup Signed-off-by: Reese Wang <[email protected]> * Add dual chunk swap example Signed-off-by: Reese Wang <[email protected]> * Align different reorder code structure Signed-off-by: Reese Wang <[email protected]> --------- Signed-off-by: Reese Wang <[email protected]> Co-authored-by: Phuong Nguyen <[email protected]>
Signed-off-by: Vasudevan Rengasamy <[email protected]>
Added constexpr checks of tensor boundaries Signed-off-by: Oleg Goncharov <[email protected]>
Signed-off-by: Kirthi Shankar Sivamani <[email protected]>
* Expose only required symbols from libtransformer_engine.so during linking for pytorch Signed-off-by: Kshitij Janardan Lakhani <[email protected]> * Augment libtransformer_engine.version for jax compatibility Signed-off-by: Kshitij Janardan Lakhani <[email protected]> * Augment the libtransformer_engine.version to ensure compatibility with CPP tests Remove getenv from the .version file Combine system.cpp and system.h Signed-off-by: Kshitij Janardan Lakhani <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Nit: Remove commented code for not including common.h Signed-off-by: Kshitij Janardan Lakhani <[email protected]> * Replace explicit getenv instantiations with a helper template Use filesystem calls in file_exists() Signed-off-by: Kshitij Janardan Lakhani <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Revert comment to falsy instead of false Co-authored-by: Tim Moon <[email protected]> Signed-off-by: Kshitij Lakhani <[email protected]> --------- Signed-off-by: Kshitij Janardan Lakhani <[email protected]> Signed-off-by: Kshitij Lakhani <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Tim Moon <[email protected]>
Don't set data to null Signed-off-by: Kirthi Shankar Sivamani <[email protected]>
Signed-off-by: Kirthi Shankar Sivamani <[email protected]>
Fix incorrect docstrings in tensor saving functions Signed-off-by: Tim Moon <[email protected]>
* fix recompilation of out and lse correction in p2p+bshd/sbhd Signed-off-by: Xiaowei Ren <[email protected]> * fix recompilation of get_seq_chunk_ids_for_reordering Signed-off-by: Xiaowei Ren <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix recomplilation of reorder_seq_chunks_for_a2a Signed-off-by: Xiaowei Ren <[email protected]> * recover a change Signed-off-by: Xiaowei Ren <[email protected]> * typo fix Signed-off-by: Xiaowei Ren <[email protected]> * minor change to softmax_lse correction Signed-off-by: Xiaowei Ren <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * cache cu_seqlens for BSHD/SBHD format Signed-off-by: Xiaowei Ren <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * do not need to allocate out buffer for BSHD/SBHD Signed-off-by: Xiaowei Ren <[email protected]> * code refactoring Signed-off-by: Xiaowei Ren <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * minor fix Signed-off-by: Xiaowei Ren <[email protected]> * refactor init out correction Signed-off-by: Xiaowei Ren <[email protected]> * fix a docstring Signed-off-by: Xiaowei Ren <[email protected]> * typo fix Signed-off-by: Xiaowei Ren <[email protected]> * code refactoring Signed-off-by: Xiaowei Ren <[email protected]> * fix init out correct dtype Signed-off-by: Xiaowei Ren <[email protected]> * add pad_between_seqs to DPA API Signed-off-by: Xiaowei Ren <[email protected]> * add pad_between_seqs to the API of MHA and transformer layer Signed-off-by: Xiaowei Ren <[email protected]> * add pad_between_seqs to the API of MHA and transformer layer Signed-off-by: Xiaowei Ren <[email protected]> --------- Signed-off-by: Xiaowei Ren <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* check in per-tensor current scaling full recipe Signed-off-by: zhongboz <[email protected]> [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Signed-off-by: zhongboz <[email protected]> setup basics of current scaling quantizer in python level Signed-off-by: zhongboz <[email protected]> [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Signed-off-by: zhongboz <[email protected]> add test case for current scaling dequantize Signed-off-by: zhongboz <[email protected]> [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Signed-off-by: zhongboz <[email protected]> finish linear layer fwd bwd test, determined error with bf16 Signed-off-by: zhongboz <[email protected]> [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Signed-off-by: zhongboz <[email protected]> achieved zero tolerance for Linear by specify gemm use_split_accumulator config Signed-off-by: zhongboz <[email protected]> enable layernormlinear with current scaling, pass bitwise test Signed-off-by: zhongboz <[email protected]> refactor test case code Signed-off-by: zhongboz <[email protected]> make current scaling quantizers distrbuted, pass distributed linear&layernormlinear tests Signed-off-by: zhongboz <[email protected]> bug fix: use cached fp8 recipe in backward Signed-off-by: zhongboz <[email protected]> fix layernorm_mlp with current scaling, fix activation_helper with current scaling Signed-off-by: zhongboz <[email protected]> support detailed numerical settings from recipe to quantization kernel Signed-off-by: zhongboz <[email protected]> resolving MR comments Signed-off-by: zhongboz <[email protected]> recipe naming Signed-off-by: zhongboz <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * resolve mr comments, remove IS_CURRENT_SCALING template from kernels Signed-off-by: zhongboz <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * resolve mr comments, make current scaling c++ test cases Signed-off-by: zhongboz <[email protected]> * add current scaling to test_numerics.py, skip act recomp and grouped linear Signed-off-by: zhongboz <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * add benchmark for quantizer Signed-off-by: zhongboz <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * add benchmarks for linear layer Signed-off-by: zhongboz <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * bug fix, typo Signed-off-by: zhongboz <[email protected]> * resolve more mr comments Signed-off-by: zhongboz <[email protected]> * avoid potential race condition by not using from_blob to construct amax tensor in C++ Signed-off-by: zhongboz <[email protected]> * resolve more comments Signed-off-by: zhongboz <[email protected]> * Debug linter warnings and license check Signed-off-by: Tim Moon <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Debug import error in FP8 tensor test Signed-off-by: Tim Moon <[email protected]> * Debug compilation error with CUDA 12.1 for Turing Signed-off-by: Tim Moon <[email 
protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * resolve mr comments, fix activation cast fusion Signed-off-by: zhongboz <[email protected]> * resolve comments, add NVTEQuantizationParams for compute scale Signed-off-by: zhongboz <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * remove is_current_scaling check totally from common folder Signed-off-by: zhongboz <[email protected]> * remove benchmarks, will contribute in another repo Signed-off-by: zhongboz <[email protected]> * adjust cs default recipe config Signed-off-by: zhongboz <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * adjust comments in test Signed-off-by: zhongboz <[email protected]> * Remove current scaling mode from core lib Signed-off-by: Tim Moon <[email protected]> * Refactor current-scaling-specific logic in core C++ lib Move amax and scale update functions out of casting functions, and put into dedicated current-scaling source file. Add general API for accessing quantization config object. Signed-off-by: Tim Moon <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Add missing header in C++ tests Signed-off-by: Tim Moon <[email protected]> * Disable test config with FP8 transpose on Blackwell Signed-off-by: Tim Moon <[email protected]> * Fix compilation error in C++ test Signed-off-by: Tim Moon <[email protected]> --------- Signed-off-by: zhongboz <[email protected]> Signed-off-by: Tim Moon <[email protected]> Co-authored-by: zhongboz <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Tim Moon <[email protected]> Co-authored-by: Tim Moon <[email protected]>
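For readers unfamiliar with the recipe, per-tensor current scaling derives the FP8 scale from the amax of the tensor being quantized right now, rather than from an amax history as in delayed scaling. A numerical sketch of that idea (not the TE kernel; torch.float8_e4m3fn requires a recent PyTorch):

import torch

FP8_E4M3_MAX = 448.0  # largest finite value representable in float8_e4m3fn

def quantize_current_scaling(x: torch.Tensor):
    amax = x.abs().max().float()
    scale = FP8_E4M3_MAX / torch.clamp(amax, min=1e-12)
    x_fp8 = (x.float() * scale).to(torch.float8_e4m3fn)
    return x_fp8, scale  # the scale (or its inverse) is kept for dequantization in the GEMM

x = torch.randn(128, 128)
x_fp8, scale = quantize_current_scaling(x)
x_dequant = x_fp8.float() / scale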
* Verified TE2.0 with offloading Signed-off-by: Selvaraj Anandaraj <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Skipping tests for Ampere and removed child class preparing Signed-off-by: Selvaraj Anandaraj <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * offloading support for MXFP8 dtype Signed-off-by: Selvaraj Anandaraj <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Changed quantized tensor detection mechanism Signed-off-by: Selvaraj Anandaraj <[email protected]> * Fix mxfp8 offload, lint errors, and var name Signed-off-by: Kirthi Shankar Sivamani <[email protected]> * Supported disabling offloading for quantized tensors Signed-off-by: Selvaraj Anandaraj <[email protected]> * bug fix Signed-off-by: Selvaraj Anandaraj <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fixed bugs Signed-off-by: Selvaraj Anandaraj <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Added support for None in list of Quantized data tensors Signed-off-by: root <[email protected]> * Hopper backward compatibility cleanup Signed-off-by: Selvaraj Anandaraj <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Coding style nit Signed-off-by: Selvaraj Anandaraj <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Added guards Signed-off-by: Selvaraj Anandaraj <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: Selvaraj Anandaraj <[email protected]> Signed-off-by: Kirthi Shankar Sivamani <[email protected]> Co-authored-by: Selvaraj Anandaraj <[email protected]> Co-authored-by: Kirthi Shankar Sivamani <[email protected]>
Internal quantizer for input to the modules Signed-off-by: Przemek Tredak <[email protected]>
Remove Megatron-LM convergence test Signed-off-by: Tim Moon <[email protected]> Co-authored-by: Kirthi Shankar Sivamani <[email protected]>
Signed-off-by: Tim Moon <[email protected]>
Revert "Use internal quantizer for input to the modules (#1551)" This reverts commit b3e7035. Signed-off-by: Przemek Tredak <[email protected]>
…… (#1540) Remove xla_ignore_channel_id check and ignore Scan loop warning in unit test Signed-off-by: Reese Wang <[email protected]>
* fix dtypes in fused attn bwd for FP8 Signed-off-by: Charlene Yang <[email protected]> * add comments for dtypes Signed-off-by: Charlene Yang <[email protected]> * remove redundant qkv_dtype in fwd Signed-off-by: Charlene Yang <[email protected]> * remove Nones in bwd returns Signed-off-by: Charlene Yang <[email protected]> --------- Signed-off-by: Charlene Yang <[email protected]>
* Explicitly use python3 and pip3 Signed-off-by: Tim Moon <[email protected]> * Run pre-commit as Python module Signed-off-by: Tim Moon <[email protected]> * Replace some missed references to "python" or "pip" Signed-off-by: Tim Moon <[email protected]> --------- Signed-off-by: Tim Moon <[email protected]> Signed-off-by: Tim Moon <[email protected]>
Make ffi compatible with jax 0.4 Signed-off-by: Reese Wang <[email protected]> Co-authored-by: Phuong Nguyen <[email protected]>
* Delete row-wise data in single-GPU linear forward Signed-off-by: Tim Moon <[email protected]> * Debug Python->C++ parsing of transpose-only Float8Tensors Signed-off-by: Tim Moon <[email protected]> * Debug tensor shape calculation without row-wise data Signed-off-by: Tim Moon <[email protected]> * Debug correctness issues with only column-wise data Signed-off-by: Tim Moon <[email protected]> * Only cache column-wise input in LayerNormLinear Signed-off-by: Tim Moon <[email protected]> * Support MXFP8 all-gather with only column-wise data Signed-off-by: Tim Moon <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix moe cases, lint, rm unused ctx Signed-off-by: Kirthi Shankar Sivamani <[email protected]> * Fix CPU activation offloading and use consistent logic for save/restore Signed-off-by: Kirthi Shankar Sivamani <[email protected]> * Fix tests Signed-off-by: Kirthi Shankar Sivamani <[email protected]> * Fix typo Signed-off-by: Kirthi Shankar Sivamani <[email protected]> * RM stray file Signed-off-by: Kirthi Shankar Sivamani <[email protected]> * Fix distributed and cpp tests Signed-off-by: Kirthi Shankar Sivamani <[email protected]> * Fix norm cpp tests Signed-off-by: Kirthi Shankar Sivamani <[email protected]> * Rm stray file Signed-off-by: Kirthi Shankar Sivamani <[email protected]> * RM stray file Signed-off-by: Kirthi Shankar Sivamani <[email protected]> * Fix MXFP8 AG Signed-off-by: Kirthi Shankar Sivamani <[email protected]> * Fix FP8 with sequence parallelism Signed-off-by: Kirthi Shankar Sivamani <[email protected]> * Fix UB bulk dgrad Signed-off-by: Kirthi Shankar Sivamani <[email protected]> --------- Signed-off-by: Tim Moon <[email protected]> Signed-off-by: Kirthi Shankar Sivamani <[email protected]> Co-authored-by: Kirthi Shankar Sivamani <[email protected]>
…e (#1558) * add tex.bgrad_quantize support for CS Signed-off-by: zhongboz <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Remove unused import Signed-off-by: Tim Moon <[email protected]> --------- Signed-off-by: zhongboz <[email protected]> Signed-off-by: Tim Moon <[email protected]> Co-authored-by: zhongboz <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Tim Moon <[email protected]> Co-authored-by: Tim Moon <[email protected]>
update FE to 1.11 Signed-off-by: Charlene Yang <[email protected]>
fix cpu device import error Signed-off-by: Hongxiao Bai <[email protected]> Co-authored-by: Tim Moon <[email protected]>
* Add options to comm overlap tests Signed-off-by: Vasudevan Rengasamy <[email protected]> * Fix Typo Signed-off-by: Vasudevan Rengasamy <[email protected]> * Update tests/pytorch/distributed/run_layer_with_overlap.py Signed-off-by: Tim Moon <[email protected]> --------- Signed-off-by: Vasudevan Rengasamy <[email protected]> Signed-off-by: Tim Moon <[email protected]> Co-authored-by: Tim Moon <[email protected]>
* Create pytorch/dot_product_attention module and pytorch/d_p_a/utils.py Move attention logging into a separate class in pytorch/d_p_a/utils.py Signed-off-by: Kshitij Janardan Lakhani <[email protected]> * Create FlashAttentionUtils class in pytorch/d_p_a/utils/py for versioning info Move versioning info out of pytorch/attention.py Signed-off-by: Kshitij Janardan Lakhani <[email protected]> * Move AttentionParams and get_attention_backend from attention.py to d_p_a/utils.py Fix tests and imports for the above refactor change Signed-off-by: Kshitij Janardan Lakhani <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Move get_qkv_layout(), get_full_mask(), get_alibi(), get_attention_quantizers() to d_p_a/utils.py Signed-off-by: Kshitij Janardan Lakhani <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Move tensor packing and unpacking helper functions from pyt/attention.py to d_p_a/utils.py Signed-off-by: Kshitij Janardan Lakhani <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Move cumulative seqlens and indices methods from pyt/attention.py to d_p_a/utils.py Rename cumulative functions from using _cu_ to using _cumul_ to differentiate from CUDA cu calls protocol Rename tensor packaging methods with leading underscore to make them as internal to file Signed-off-by: Kshitij Janardan Lakhani <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Remove unnecessary imports in pytorch/attention.py and d_p_a/utils.py Signed-off-by: Kshitij Janardan Lakhani <[email protected]> * Create d_p_a/inference.py and move InferenceParams from pyt/attention.py to it Modify tests and other files to import InferenceParams correctly Signed-off-by: Kshitij Janardan Lakhani <[email protected]> Modify docs api for InferenceParams Signed-off-by: Kshitij Janardan Lakhani <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Create d_p_a/rope.py and move RoPE methods from pytorch/attention.py to it Signed-off-by: Kshitij Janardan Lakhani <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Code cleanup Signed-off-by: Kshitij Janardan Lakhani <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix qa testing induced bug Code clean up Signed-off-by: Kshitij Janardan Lakhani <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix incorrect pack_tensor arg type Code clean up Signed-off-by: Kshitij Janardan Lakhani <[email protected]> * nit: Resolve lint errors Signed-off-by: Kshitij Janardan Lakhani <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Remove typedef FAUtils for FlashAttentionUtils Use attn_log instead of att_log Signed-off-by: Kshitij Janardan Lakhani <[email protected]> Fix lint error Signed-off-by: Kshitij Janardan Lakhani <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * nit: Fix the function name from get_cumul to the earlier get_cu Signed-off-by: Kshitij Janardan Lakhani <[email 
protected]> * nit: Fix typos, explicit imports and remove extra comments Signed-off-by: Kshitij Janardan Lakhani <[email protected]> --------- Signed-off-by: Kshitij Janardan Lakhani <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Charlene Yang <[email protected]>
…554) * support tp-comm-overlap in Current Scaling recipe Signed-off-by: Li Tao <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * clean Signed-off-by: Li Tao <[email protected]> * fix test recipe argument to generalize to MXFP8 Signed-off-by: Li Tao <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Reduce duplicated transpose in certain cases Signed-off-by: Li Tao <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Use per_tensor_scaling() to judge DS or CS Signed-off-by: Li Tao <[email protected]> * minor fixes Signed-off-by: Li Tao <[email protected]> * change comment description Signed-off-by: Li Tao <[email protected]> * add multi-layer unit test for tp overlap Signed-off-by: Li Tao <[email protected]> * support test case that run for several times Signed-off-by: Li Tao <[email protected]> * avoid save ub tensor in prepare_for_saving Signed-off-by: Li Tao <[email protected]> * fix Signed-off-by: Li Tao <[email protected]> * switch to a simple fix Signed-off-by: Li Tao <[email protected]> * formatting Signed-off-by: Li Tao <[email protected]> * simply test cases; avoid additional clone() Signed-off-by: Li Tao <[email protected]> * fall back to get_buffer in layernormmlp Signed-off-by: Li Tao <[email protected]> * use 2 layers for fp8 tpoverlap multi-layer test for better tolerance, limit max gpus for test Signed-off-by: zhongboz <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: Li Tao <[email protected]> Signed-off-by: zhongboz <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: zhongboz <[email protected]>
* Add issue template Signed-off-by: Kirthi Shankar Sivamani <[email protected]> * Fixes Signed-off-by: Kirthi Shankar Sivamani <[email protected]> * Make GPU info section Signed-off-by: Kirthi Shankar Sivamani <[email protected]> --------- Signed-off-by: Kirthi Shankar Sivamani <[email protected]>
* Do not create multiple cublas handle Signed-off-by: Przemek Tredak <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix for multiple GPUs per thread Signed-off-by: Przemek Tredak <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix multithreaded execution Signed-off-by: Przemek Tredak <[email protected]> * Fix from conlfict Signed-off-by: Kirthi Shankar Sivamani <[email protected]> --------- Signed-off-by: Przemek Tredak <[email protected]> Signed-off-by: Kirthi Shankar Sivamani <[email protected]> Co-authored-by: Kirthi Shankar Sivamani <[email protected]>
* DistOpt support with offloading Signed-off-by: Selvaraj Anandaraj <[email protected]> * Added distopt support for TE2.0 Signed-off-by: Selvaraj Anandaraj <[email protected]> * Restricted this to MCore DistOpt only Signed-off-by: Selvaraj Anandaraj <[email protected]> * Added guards Signed-off-by: Selvaraj Anandaraj <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Update transformer_engine/pytorch/module/linear.py Co-authored-by: Kirthi Shankar Sivamani <[email protected]> Signed-off-by: Selvaraj Anandaraj <[email protected]> * Update transformer_engine/pytorch/module/layernorm_linear.py Co-authored-by: Kirthi Shankar Sivamani <[email protected]> Signed-off-by: Selvaraj Anandaraj <[email protected]> --------- Signed-off-by: Selvaraj Anandaraj <[email protected]> Signed-off-by: Selvaraj Anandaraj <[email protected]> Co-authored-by: Selvaraj Anandaraj <[email protected]> Co-authored-by: Kirthi Shankar Sivamani <[email protected]>
* [QA] Add error handling -Standardize test failure handling using the unified 'test_fail' function and 'error_exit' function. Signed-off-by: Linxi Ding <[email protected]> * Update script to use explicit python3, pip3, and python3 -m pytest calls - Change pip to pip3. - Change python to python3. - Change pytest to python3 -m pytest. Signed-off-by: Linxi Ding <[email protected]> --------- Signed-off-by: Linxi Ding <[email protected]>
* Update full recompute feature to save recipe. The recompute context uses the same recipe and fp8 settings as the original fwd pass. Signed-off-by: Keith Wyss <[email protected]> * Formatted python code. Signed-off-by: Keith Wyss <[email protected]> * Simplify code by relying on recipe in ctx Signed-off-by: Keith Wyss <[email protected]> * MR feedback: import style Signed-off-by: Keith Wyss <[email protected]> --------- Signed-off-by: Keith Wyss <[email protected]> Co-authored-by: Kirthi Shankar Sivamani <[email protected]>
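Conceptually, this change means the FP8 recipe active during the original forward pass is reused when activations are recomputed in backward, so both passes quantize with identical settings. The sketch below only illustrates that usage pattern; the exact API surface (te.checkpoint arguments, recipe options) is assumed from the public TE interface and may differ by version.

import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling, Format

recipe = DelayedScaling(fp8_format=Format.HYBRID)
layer = te.Linear(1024, 1024)
inp = torch.randn(8, 1024, device="cuda", requires_grad=True)

def fwd(x):
    return layer(x)

with te.fp8_autocast(enabled=True, fp8_recipe=recipe):
    # Activation recompute in backward should re-enter FP8 with the same recipe
    # that was active here, per the commit description above.
    out = te.checkpoint(fwd, inp)
out.sum().backward()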
* add paged attention; test_kv_cache_accuray and test_paged_attn pass Signed-off-by: Charlene Yang <[email protected]> * remove unnecessary change from last commit Signed-off-by: Charlene Yang <[email protected]> * test_fused_attn pass Signed-off-by: Charlene Yang <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * remove unnecessary import in test_numerics Signed-off-by: Charlene Yang <[email protected]> * add license for test Signed-off-by: Charlene Yang <[email protected]> * fix lint Signed-off-by: Charlene Yang <[email protected]> * add to L0 test Signed-off-by: Charlene Yang <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * update license for test_paged_attn Signed-off-by: Charlene Yang <[email protected]> * update kv_cache_manager license Signed-off-by: Charlene Yang <[email protected]> * fix build issue from previous merge Signed-off-by: Charlene Yang <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * WIP: minor fix/preparation for inference/cuda graph Signed-off-by: Charlene Yang <[email protected]> * WIP: non-paged Signed-off-by: Charlene Yang <[email protected]> * WIP: non-paged, bshd/sbhd Signed-off-by: Charlene Yang <[email protected]> * WIP: non-paged, thd, no CG Signed-off-by: Charlene Yang <[email protected]> * WIP: non-paged, thd, CG Signed-off-by: Charlene Yang <[email protected]> * WIP: non-paged, CG Signed-off-by: Charlene Yang <[email protected]> * WIP: non-paged, using paged kernel Signed-off-by: Charlene Yang <[email protected]> * WIP: restructure kernels Signed-off-by: Charlene Yang <[email protected]> * WIP: paged, CG Signed-off-by: Charlene Yang <[email protected]> * WIP: padding + BRCM Signed-off-by: Charlene Yang <[email protected]> * WIP: restructure IP, clean up Signed-off-by: Charlene Yang <[email protected]> * WIP: fix non-CG, fused Signed-off-by: Charlene Yang <[email protected]> * WIP: fix last commit Signed-off-by: Charlene Yang <[email protected]> * WIP: unfused, non-CG Signed-off-by: Charlene Yang <[email protected]> * WIP: flash-attn, non-CG Signed-off-by: Charlene Yang <[email protected]> * WIP: flash_attn_with_kvcache Signed-off-by: Charlene Yang <[email protected]> * commit two files missed by bcef6b34 Signed-off-by: Charlene Yang <[email protected]> * WIP: thd_bshd_bshd Signed-off-by: Charlene Yang <[email protected]> * WIP: fix last commit Signed-off-by: Charlene Yang <[email protected]> * WIP: fix 1c31b68d Signed-off-by: Charlene Yang <[email protected]> * WIP: add bshd_2sbhd, sbhd_2bshd Signed-off-by: Charlene Yang <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * WIP: some cleanup Signed-off-by: Charlene Yang <[email protected]> * WIP: all qkv_format combinations and merge CM files Signed-off-by: Charlene Yang <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * WIP: some lint fixes Signed-off-by: Charlene Yang <[email protected]> * WIP: add docstring for IP Signed-off-by: Charlene Yang <[email protected]> * fix sequences_pre Signed-off-by: Charlene Yang <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * WIP: minor fixes for multi-layer Signed-off-by: Charlene Yang <[email protected]> * WIP: initial multi-layer 
test Signed-off-by: Charlene Yang <[email protected]> * WIP: minor clean up Signed-off-by: Charlene Yang <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * WIP: clean up Signed-off-by: Charlene Yang <[email protected]> * WIP: switch to flash_attn_varlen_func Signed-off-by: Charlene Yang <[email protected]> * WIP: fix unfused for separate q/kv format Signed-off-by: Charlene Yang <[email protected]> * WIP: fix fused for separate q/kv formats Signed-off-by: Charlene Yang <[email protected]> * WIP: flash attn + TELayer + 2 layers Signed-off-by: Charlene Yang <[email protected]> * WIP: unfused + TL + 2layers Signed-off-by: Charlene Yang <[email protected]> * WIP: all modules/backend Signed-off-by: Charlene Yang <[email protected]> * WIP: minor cleanup Signed-off-by: Charlene Yang <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * WIP: FlashAttention on Hopper with 2.7.3 Signed-off-by: Charlene Yang <[email protected]> * WIP: FlashAttention + v3 from 39e7179 Signed-off-by: Charlene Yang <[email protected]> * WIP: FlashAttention + v3 + FP8 + WIP Signed-off-by: Charlene Yang <[email protected]> * WIP: add backend support table Signed-off-by: Charlene Yang <[email protected]> * WIP: clean up Signed-off-by: Charlene Yang <[email protected]> * WIP: separate use_flash_attention_2 and _3 Signed-off-by: Charlene Yang <[email protected]> * WIP: tweaks to paged attn script Signed-off-by: Charlene Yang <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * WIP: enable/disable certain cases for fused attn Signed-off-by: Charlene Yang <[email protected]> * WIP: small fixes for lint and cg Signed-off-by: Charlene Yang <[email protected]> * WIP: minor fixes for attn/infer Signed-off-by: Charlene Yang <[email protected]> * WIP: fix CP Signed-off-by: Charlene Yang <[email protected]> * WIP: readd page info to FADescriptor_v1 Signed-off-by: Charlene Yang <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * minor tweak to test_numerics.py Signed-off-by: Charlene Yang <[email protected]> * fix 9.5/9.7 sq/skv + mask logic Signed-off-by: Charlene Yang <[email protected]> * clean up Signed-off-by: Charlene Yang <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * minor fix for FA3 Signed-off-by: Charlene Yang <[email protected]> * more minor fixes for FA3 Signed-off-by: Charlene Yang <[email protected]> * test page_size=1 for FA3 Signed-off-by: Charlene Yang <[email protected]> * fix t3hd/th3d strides Signed-off-by: Charlene Yang <[email protected]> * fix ckpt recompute and fa3 k_scale Signed-off-by: Charlene Yang <[email protected]> * raise dynamo recompile limit for test Signed-off-by: Charlene Yang <[email protected]> * remove thunder test from L0 Signed-off-by: Charlene Yang <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix FA selection logic Signed-off-by: Charlene Yang <[email protected]> * fix FA3 q_descale shape Signed-off-by: Charlene Yang <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * remove page_table from IP.step() returns Signed-off-by: Charlene Yang <[email protected]> * fix FP8 FlashAttn DPA 
fp8_dpa tests Signed-off-by: Charlene Yang <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix CP Signed-off-by: Charlene Yang <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * minor tweaks Signed-off-by: Charlene Yang <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * update FA3 note and L3 test Signed-off-by: Charlene Yang <[email protected]> * fix lint Signed-off-by: Charlene Yang <[email protected]> * remove redundant import in test Signed-off-by: Charlene Yang <[email protected]> * adopt new FA3 APIs from FA2.7.3+/hopper for CP and non-CP Signed-off-by: Charlene Yang <[email protected]> * fix lint Signed-off-by: Charlene Yang <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * relax tols for TransformerLayers Signed-off-by: Charlene Yang <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix merge Signed-off-by: Charlene Yang <[email protected]> * fix merge 2 Signed-off-by: Charlene Yang <[email protected]> * fix FA import comments Signed-off-by: Charlene Yang <[email protected]> * relax tols for Ampere Signed-off-by: Charlene Yang <[email protected]> * fix fa3 version and reduce messaging Signed-off-by: Charlene Yang <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * update FA3 to its latest commit on main Signed-off-by: Charlene Yang <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * add default values to IP and assertion to graph.py Signed-off-by: Charlene Yang <[email protected]> * add more comments in attention Signed-off-by: Charlene Yang <[email protected]> * use custom_cache_manager instead of cache_manager Signed-off-by: Charlene Yang <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: Charlene Yang <[email protected]> Signed-off-by: Charlene Yang <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
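As a toy illustration of the page-table indirection behind paged KV caching (all names below are hypothetical and unrelated to the InferenceParams API added here): the physical cache is a pool of fixed-size pages, and each sequence maps its logical token positions onto pages it has been allocated.

import torch

num_pages, page_size, num_heads, head_dim = 64, 16, 8, 64
# Physical cache: a pool of fixed-size pages shared by all sequences.
k_pool = torch.zeros(num_pages, page_size, num_heads, head_dim)
v_pool = torch.zeros_like(k_pool)
# Logical view: each sequence owns a list of page indices into the pool.
page_table = {0: [3, 17], 1: [5]}  # seq_id -> pages, allocated on demand

def append_kv(seq_id: int, pos: int, k: torch.Tensor, v: torch.Tensor) -> None:
    """Write one token's K/V into the page that logically holds position pos."""
    page = page_table[seq_id][pos // page_size]
    slot = pos % page_size
    k_pool[page, slot] = k
    v_pool[page, slot] = v

append_kv(seq_id=0, pos=20, k=torch.randn(num_heads, head_dim), v=torch.randn(num_heads, head_dim))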
* Fix softmax shape for THD format. Signed-off-by: Michael Goldfarb <[email protected]>
* Do not apply bias when apply_bias is False Signed-off-by: Przemek Tredak <[email protected]> * Bwd fix for LNMLP and tests Signed-off-by: Przemek Tredak <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix for the dbias calculation Signed-off-by: Przemek Tredak <[email protected]> * Improve tests and cleaning the logic Signed-off-by: Przemek Tredak <[email protected]> * Tightened test tolerances a little Signed-off-by: Przemek Tredak <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Revert "Tightened test tolerances a little" This reverts commit 2e20a92c884a84759006541adc1d638ab91dde62. Signed-off-by: Przemek Tredak <[email protected]> * Update tests/pytorch/test_numerics.py Co-authored-by: Tim Moon <[email protected]> Signed-off-by: Przemyslaw Tredak <[email protected]> * Fix the Gelu Aux type Signed-off-by: Przemek Tredak <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Remove use_fc1_bias option Signed-off-by: Przemek Tredak <[email protected]> --------- Signed-off-by: Przemek Tredak <[email protected]> Signed-off-by: Przemyslaw Tredak <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Tim Moon <[email protected]>
Description
Please include a brief summary of the changes, relevant motivation and context.
Fixes # (issue)
Type of change
Changes
Please list the changes introduced in this PR:
Checklist: