
Ifu dev 20250318 v2.2 #212


Draft · wants to merge 71 commits into base: dev

Conversation

VeeraRajasekhar (Contributor)

Description

Please include a brief summary of the changes, relevant motivation and context.

Fixes # (issue)

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • Change A
  • Change B

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

ptrendx and others added 30 commits February 14, 2025 17:11
Signed-off-by: Przemek Tredak <[email protected]>
…x FP8 related codes (#1468)

* add prob permute; fix fp8tensor

Signed-off-by: Hongxiao Bai <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* revert unnecessary changes in UT

Signed-off-by: Hongxiao Bai <[email protected]>

* remove unnecessary probs dtype convert

Signed-off-by: Hongxiao Bai <[email protected]>

* keep the output nums if probs is not provided

Signed-off-by: Hongxiao Bai <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* refine the doc string

Signed-off-by: Hongxiao Bai <[email protected]>

* fix lint

Signed-off-by: Hongxiao Bai <[email protected]>

* use fp32 compute type

Signed-off-by: Hongxiao Bai <[email protected]>

* style fix

Signed-off-by: Hongxiao Bai <[email protected]>

* fix empty input return

Signed-off-by: Hongxiao Bai <[email protected]>

* separate prob related functions out

Signed-off-by: Hongxiao Bai <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Hongxiao Bai <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Xin Yao <[email protected]>
Co-authored-by: Phuong Nguyen <[email protected]>
flax module with compute dtype inferred from the inputs

Signed-off-by: Phuong Nguyen <[email protected]>
* Fix issues for MCore DDP.

Signed-off-by: Dennis Liu <[email protected]>

* Remove force data release for CPU offloading.

Signed-off-by: Dennis Liu <[email protected]>

* Add preserved attributes.

Signed-off-by: Dennis Liu <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add main_grad to preserved attributes.

Signed-off-by: Dennis Liu <[email protected]>

* Change prepare_for_saving to original tensor and add .data to CPU hook.

Signed-off-by: Dennis Liu <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update.

Signed-off-by: Dennis Liu <[email protected]>

* Fix for LayernormLinear in FP8.

Signed-off-by: Dennis Liu <[email protected]>

---------

Signed-off-by: Dennis Liu <[email protected]>
Co-authored-by: Xin Yao <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Fix typo

Signed-off-by: Tim Moon <[email protected]>
Co-authored-by: Kirthi Shankar Sivamani <[email protected]>
* fix fuse_wgrad_accumulation for GroupedLinear

Signed-off-by: Xin Yao <[email protected]>

* fix fuse_wgrad_accumulation for GroupedLinear

Signed-off-by: Xin Yao <[email protected]>

* update tests

Signed-off-by: Xin Yao <[email protected]>

---------

Signed-off-by: Xin Yao <[email protected]>
Co-authored-by: Tim Moon <[email protected]>
* Fix te sequential for older pytorch versions

Signed-off-by: Kirthi Shankar Sivamani <[email protected]>

* Fixes

Signed-off-by: Kirthi Shankar Sivamani <[email protected]>

---------

Signed-off-by: Kirthi Shankar Sivamani <[email protected]>
* commit some debug code

Signed-off-by: Xiaowei Ren <[email protected]>

* add more debug info

Signed-off-by: Xiaowei Ren <[email protected]>

* debug code commit and typo fix

Signed-off-by: Xiaowei Ren <[email protected]>

* a typo fix

Signed-off-by: Xiaowei Ren <[email protected]>

* remove debug info

Signed-off-by: Xiaowei Ren <[email protected]>

* do not return lse

Signed-off-by: Xiaowei Ren <[email protected]>

* add amax_per_step for quantizers of CP

Signed-off-by: Xiaowei Ren <[email protected]>

* fix FP8 + CP

Signed-off-by: Xiaowei Ren <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* bug fix

Signed-off-by: Xiaowei Ren <[email protected]>

* bug fix

Signed-off-by: Xiaowei Ren <[email protected]>

* dtype fix

Signed-off-by: Xiaowei Ren <[email protected]>

* bug fix

Signed-off-by: Xiaowei Ren <[email protected]>

---------

Signed-off-by: Xiaowei Ren <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Xiaowei Ren <[email protected]>
… (#1466)

Use same API in optimizer zero_grad as PyT optimizers

Signed-off-by: Tim Moon <[email protected]>
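
For reference, the PyTorch-style semantics being matched here, as a hedged sketch (not the TE optimizer code): `zero_grad(set_to_none=True)` either frees the gradient buffers or zeroes them in place.

```python
import torch

def zero_grad(params, set_to_none: bool = True) -> None:
    """Illustrative only: mirror torch.optim.Optimizer.zero_grad semantics."""
    for p in params:
        if p.grad is None:
            continue
        if set_to_none:
            p.grad = None          # free the buffer; next backward re-allocates it
        else:
            p.grad.detach_()
            p.grad.zero_()         # keep the buffer, fill it with zeros
```
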
* Remove dependency on transformer_engine::Tensor in attention.cu

Signed-off-by: Kshitij Janardan Lakhani <[email protected]>

* Templatize thd_partition_indices_kernel and thd_read_half_tensor_kernel kernels ONLY for invoking recompilation and not directly using the pre-compiled symbols in libtransformer.so

Signed-off-by: Kshitij Janardan Lakhani <[email protected]>

* Modify attention.cu for thd templatized kernels. Remove dependency on common.h

Signed-off-by: Kshitij Janardan Lakhani <[email protected]>

* Move thd structs from libtransformer.so to framework extensions include header

Code cleanup

Signed-off-by: Kshitij Janardan Lakhani <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Consolidate and move thd_utils from common to framework extensions

Signed-off-by: Kshitij Janardan Lakhani <[email protected]>

* Remove template decorators around thd_partition_indices_kernel and thd_read_half_tensor_kernel

Signed-off-by: Kshitij Janardan Lakhani <[email protected]>

Code clean up

Signed-off-by: Kshitij Janardan Lakhani <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Kshitij Janardan Lakhani <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* fix

Signed-off-by: Pawel Gadzinski <[email protected]>

* reshape inp

Signed-off-by: Pawel Gadzinski <[email protected]>

---------

Signed-off-by: Pawel Gadzinski <[email protected]>
* non-exit tests

Signed-off-by: Pawel Gadzinski <[email protected]>

* fix

Signed-off-by: Pawel Gadzinski <[email protected]>

* fix

Signed-off-by: Pawel Gadzinski <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Pawel Gadzinski <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* minor fixes for attention

Signed-off-by: Charlene Yang <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Charlene Yang <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Fix a crash with module._apply(lambda t: t.cpu())

Signed-off-by: Guyue Huang <[email protected]>

* Add comments

Signed-off-by: Guyue Huang <[email protected]>

* Make sure tensor is moved to dst device before quantizer quantizes

Signed-off-by: Guyue Huang <[email protected]>

---------

Signed-off-by: Guyue Huang <[email protected]>
Co-authored-by: Tim Moon <[email protected]>
* add remove_caches api

Signed-off-by: Youngeun Kwon <[email protected]>

* Update transformer_engine/pytorch/tensor/float8_tensor.py

Co-authored-by: Tim Moon <[email protected]>
Signed-off-by: Youngeun Kwon <[email protected]>

* explicit delete

Signed-off-by: Youngeun Kwon <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Youngeun Kwon <[email protected]>
Co-authored-by: Tim Moon <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Added parallel cross entropy loss implementation using online softmax

Signed-off-by: Selvaraj Anandaraj <[email protected]>
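
Since the commit above names the technique, here is a minimal PyTorch sketch of cross entropy computed via an online (streaming) softmax over vocab chunks; the actual change is a Triton kernel, and all names below are illustrative rather than the implementation added in this PR.

```python
import torch

def cross_entropy_online_softmax(logits: torch.Tensor, target: torch.Tensor,
                                 chunk: int = 1024) -> torch.Tensor:
    """Per-token CE via a streaming logsumexp: one pass over vocab chunks,
    keeping only a running max and a rescaled running denominator."""
    n, v = logits.shape
    running_max = torch.full((n,), float("-inf"), device=logits.device, dtype=logits.dtype)
    running_sum = torch.zeros(n, device=logits.device, dtype=logits.dtype)
    for start in range(0, v, chunk):
        block = logits[:, start:start + chunk]
        block_max = block.max(dim=1).values
        new_max = torch.maximum(running_max, block_max)
        # Rescale the accumulated denominator to the new max, then add this chunk.
        running_sum = running_sum * torch.exp(running_max - new_max) \
            + torch.exp(block - new_max.unsqueeze(1)).sum(dim=1)
        running_max = new_max
    logsumexp = running_max + torch.log(running_sum)
    target_logit = logits.gather(1, target.unsqueeze(1)).squeeze(1)
    return (logsumexp - target_logit).mean()
```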

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Added tests

Signed-off-by: Selvaraj Anandaraj <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Added reshape of loss output

Signed-off-by: Selvaraj Anandaraj <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Added to test list

Signed-off-by: Selvaraj Anandaraj <[email protected]>

* Added Triton dependency

Signed-off-by: Selvaraj Anandaraj <[email protected]>

* Added copyright

Signed-off-by: Selvaraj Anandaraj <[email protected]>

* Fixed lint errors

Signed-off-by: Selvaraj Anandaraj <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update setup.py

Co-authored-by: Kirthi Shankar Sivamani <[email protected]>
Signed-off-by: Selvaraj Anandaraj <[email protected]>

* Fixed lint and triton failure

Signed-off-by: Selvaraj Anandaraj <[email protected]>

* Removed flattening for scalars

Signed-off-by: Selvaraj Anandaraj <[email protected]>

* Skip tests on Blackwell due to TE CI caveat

Signed-off-by: Selvaraj Anandaraj <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Added reason arg

Signed-off-by: Selvaraj Anandaraj <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Do not register Triton dependency with setuptools

Signed-off-by: Tim Moon <[email protected]>

---------

Signed-off-by: Selvaraj Anandaraj <[email protected]>
Signed-off-by: Selvaraj Anandaraj <[email protected]>
Signed-off-by: Tim Moon <[email protected]>
Co-authored-by: Selvaraj Anandaraj <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Kirthi Shankar Sivamani <[email protected]>
Co-authored-by: Tim Moon <[email protected]>
Co-authored-by: Tim Moon <[email protected]>
* Added TMA alignment check to cast_fp8_1D

Signed-off-by: Oleg Goncharov <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Use tensor const-ref instead of tensor const-ptr

Signed-off-by: Tim Moon <[email protected]>

---------

Signed-off-by: Oleg Goncharov <[email protected]>
Signed-off-by: Tim Moon <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Tim Moon <[email protected]>
Co-authored-by: Tim Moon <[email protected]>
* Skip context parallelism tests if not enough GPUs

Signed-off-by: Tim Moon <[email protected]>
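
The skip pattern being added is the standard pytest one; a hedged sketch in which the GPU threshold and test name are made up for illustration:

```python
import pytest
import torch

@pytest.mark.skipif(torch.cuda.device_count() < 4,
                    reason="context parallelism test needs at least 4 GPUs")
def test_context_parallel_attention():
    ...
```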

* Apply suggestions from code review

Signed-off-by: Tim Moon <[email protected]>

---------

Signed-off-by: Tim Moon <[email protected]>
Signed-off-by: Tim Moon <[email protected]>
…p (#1452)

* Support vectorized local reduction for p2p-based ReduceScatter overlap

Signed-off-by: Sangkug Lym <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* cleanup

Signed-off-by: Sangkug Lym <[email protected]>

---------

Signed-off-by: Sangkug Lym <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* TP-RS local reduction: fix lint err

Signed-off-by: Sangkug Lym <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Sangkug Lym <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Fix quantized tensor shape

Signed-off-by: Kirthi Shankar Sivamani <[email protected]>

* add shape to make_like; add test for chunk

Signed-off-by: Kirthi Shankar Sivamani <[email protected]>

* Fix typo from suggestion

Signed-off-by: Kirthi Shankar Sivamani <[email protected]>

---------

Signed-off-by: Kirthi Shankar Sivamani <[email protected]>
…e (#1516)

* Enforce torch 2.0 and run attn tests with torch.compile

Signed-off-by: Kirthi Shankar Sivamani <[email protected]>

* replace torch.compile with jit_fuser

Signed-off-by: Kirthi Shankar Sivamani <[email protected]>

* Fixes

Signed-off-by: Kirthi Shankar Sivamani <[email protected]>

---------

Signed-off-by: Kirthi Shankar Sivamani <[email protected]>
* delete extra tensor objects after restoring float8 tensors

Signed-off-by: Sudhakar Singh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* nit fix

Signed-off-by: Sudhakar Singh <[email protected]>

* fix the leak in float8tensor and mxfloat8tensor classes

Signed-off-by: Sudhakar Singh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* uncomment the fix

Signed-off-by: Sudhakar Singh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix lint

Signed-off-by: Sudhakar Singh <[email protected]>

---------

Signed-off-by: Sudhakar Singh <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
…rt (#1528)

Set flag in norm modules for Mcore sequence-parallel support

Signed-off-by: Tim Moon <[email protected]>
* Support THD + ring attention for self attn

Signed-off-by: Reese Wang <[email protected]>

* Consolidate reorder strategy

Signed-off-by: Reese Wang <[email protected]>

* Fix dataclass frozen issue

Signed-off-by: Reese Wang <[email protected]>

* Remove redundant code

Signed-off-by: Reese Wang <[email protected]>

* Use AttnBiasType, AttnMaskType, QKVLayout in cpp_extension

Signed-off-by: Reese Wang <[email protected]>

* Fix lint

Signed-off-by: Reese Wang <[email protected]>

* Refine P2P helper check_supported

Signed-off-by: Reese Wang <[email protected]>

* Add segment_ids/pos check

Signed-off-by: Reese Wang <[email protected]>

* Fixup

Signed-off-by: Reese Wang <[email protected]>

* Add dual chunk swap example

Signed-off-by: Reese Wang <[email protected]>

* Align different reorder code structure

Signed-off-by: Reese Wang <[email protected]>

---------

Signed-off-by: Reese Wang <[email protected]>
Co-authored-by: Phuong Nguyen <[email protected]>
Added constexpr checks of tensor boundaries

Signed-off-by: Oleg Goncharov <[email protected]>
* Expose only required symbols from libtransformer_engine.so during linking for pytorch

Signed-off-by: Kshitij Janardan Lakhani <[email protected]>

* Augment libtransformer_engine.version for jax compatibility

Signed-off-by: Kshitij Janardan Lakhani <[email protected]>

* Augment the libtransformer_engine.version to ensure compatibility with CPP tests
Remove getenv from the .version file
Combine system.cpp and system.h

Signed-off-by: Kshitij Janardan Lakhani <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Nit: Remove commented code for not including common.h

Signed-off-by: Kshitij Janardan Lakhani <[email protected]>

* Replace explicit getenv instantiations with a helper template
Use filesystem calls in file_exists()

Signed-off-by: Kshitij Janardan Lakhani <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Revert comment to falsy instead of false

Co-authored-by: Tim Moon <[email protected]>
Signed-off-by: Kshitij Lakhani <[email protected]>

---------

Signed-off-by: Kshitij Janardan Lakhani <[email protected]>
Signed-off-by: Kshitij Lakhani <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Tim Moon <[email protected]>
ksivaman and others added 30 commits March 7, 2025 11:32
Don't set data to null

Signed-off-by: Kirthi Shankar Sivamani <[email protected]>
Signed-off-by: Kirthi Shankar Sivamani <[email protected]>
Fix incorrect docstrings in tensor saving functions

Signed-off-by: Tim Moon <[email protected]>
* fix recompilation of out and lse correction in p2p+bshd/sbhd

Signed-off-by: Xiaowei Ren <[email protected]>

* fix recompilation of get_seq_chunk_ids_for_reordering

Signed-off-by: Xiaowei Ren <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix recompilation of reorder_seq_chunks_for_a2a

Signed-off-by: Xiaowei Ren <[email protected]>

* recover a change

Signed-off-by: Xiaowei Ren <[email protected]>

* typo fix

Signed-off-by: Xiaowei Ren <[email protected]>

* minor change to softmax_lse correction

Signed-off-by: Xiaowei Ren <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* cache cu_seqlens for BSHD/SBHD format

Signed-off-by: Xiaowei Ren <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* do not need to allocate out buffer for BSHD/SBHD

Signed-off-by: Xiaowei Ren <[email protected]>

* code refactoring

Signed-off-by: Xiaowei Ren <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* minor fix

Signed-off-by: Xiaowei Ren <[email protected]>

* refactor init out correction

Signed-off-by: Xiaowei Ren <[email protected]>

* fix a docstring

Signed-off-by: Xiaowei Ren <[email protected]>

* typo fix

Signed-off-by: Xiaowei Ren <[email protected]>

* code refactoring

Signed-off-by: Xiaowei Ren <[email protected]>

* fix init out correct dtype

Signed-off-by: Xiaowei Ren <[email protected]>

* add pad_between_seqs to DPA API

Signed-off-by: Xiaowei Ren <[email protected]>

* add pad_between_seqs to the API of MHA and transformer layer

Signed-off-by: Xiaowei Ren <[email protected]>

* add pad_between_seqs to the API of MHA and transformer layer

Signed-off-by: Xiaowei Ren <[email protected]>

---------

Signed-off-by: Xiaowei Ren <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* check in per-tensor current scaling full recipe

Signed-off-by: zhongboz <[email protected]>
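
For orientation, a minimal sketch of what per-tensor current scaling means: the scale is derived from the tensor's own amax at quantization time, rather than from an amax history as in delayed scaling. This is illustrative only and omits the distributed amax reduction, use_split_accumulator GEMM settings, and recipe plumbing that the commits below add.

```python
import torch

FP8_E4M3_MAX = 448.0  # largest representable E4M3 value

def quantize_current_scaling(x: torch.Tensor, eps: float = 1e-12):
    """Per-tensor current scaling: scale computed from this tensor's amax."""
    amax = x.abs().amax().float()
    scale = FP8_E4M3_MAX / torch.clamp(amax, min=eps)
    x_fp8 = (x.float() * scale).to(torch.float8_e4m3fn)   # quantized data
    return x_fp8, scale.reciprocal()                       # keep 1/scale for dequantization

def dequantize(x_fp8: torch.Tensor, scale_inv: torch.Tensor) -> torch.Tensor:
    return x_fp8.float() * scale_inv
```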

[pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

Signed-off-by: zhongboz <[email protected]>

setup basics of current scaling quantizer in python level

Signed-off-by: zhongboz <[email protected]>

[pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

Signed-off-by: zhongboz <[email protected]>

add test case for current scaling dequantize

Signed-off-by: zhongboz <[email protected]>

[pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

Signed-off-by: zhongboz <[email protected]>

finish linear layer fwd bwd test, determined error with bf16

Signed-off-by: zhongboz <[email protected]>

[pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

Signed-off-by: zhongboz <[email protected]>

achieved zero tolerance for Linear by specifying gemm use_split_accumulator config

Signed-off-by: zhongboz <[email protected]>

enable layernormlinear with current scaling, pass bitwise test

Signed-off-by: zhongboz <[email protected]>

refactor test case code

Signed-off-by: zhongboz <[email protected]>

make current scaling quantizers distributed, pass distributed linear & layernormlinear tests

Signed-off-by: zhongboz <[email protected]>

bug fix: use cached fp8 recipe in backward

Signed-off-by: zhongboz <[email protected]>

fix layernorm_mlp with current scaling, fix activation_helper with current scaling

Signed-off-by: zhongboz <[email protected]>

support detailed numerical settings from recipe to quantization kernel

Signed-off-by: zhongboz <[email protected]>

resolving MR comments

Signed-off-by: zhongboz <[email protected]>

recipe naming

Signed-off-by: zhongboz <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* resolve mr comments, remove IS_CURRENT_SCALING template from kernels

Signed-off-by: zhongboz <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* resolve mr comments, make current scaling c++ test cases

Signed-off-by: zhongboz <[email protected]>

* add current scaling to test_numerics.py, skip act recomp and grouped linear

Signed-off-by: zhongboz <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add benchmark for quantizer

Signed-off-by: zhongboz <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add benchmarks for linear layer

Signed-off-by: zhongboz <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* bug fix, typo

Signed-off-by: zhongboz <[email protected]>

* resolve more mr comments

Signed-off-by: zhongboz <[email protected]>

* avoid potential race condition by not using from_blob to construct amax tensor in C++

Signed-off-by: zhongboz <[email protected]>

* resolve more comments

Signed-off-by: zhongboz <[email protected]>

* Debug linter warnings and license check

Signed-off-by: Tim Moon <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Debug import error in FP8 tensor test

Signed-off-by: Tim Moon <[email protected]>

* Debug compilation error with CUDA 12.1 for Turing

Signed-off-by: Tim Moon <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* resolve mr comments, fix activation cast fusion

Signed-off-by: zhongboz <[email protected]>

* resolve comments, add NVTEQuantizationParams for compute scale

Signed-off-by: zhongboz <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* remove is_current_scaling check totally from common folder

Signed-off-by: zhongboz <[email protected]>

* remove benchmarks, will contribute in another repo

Signed-off-by: zhongboz <[email protected]>

* adjust cs default recipe config

Signed-off-by: zhongboz <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* adjust comments in test

Signed-off-by: zhongboz <[email protected]>

* Remove current scaling mode from core lib

Signed-off-by: Tim Moon <[email protected]>

* Refactor current-scaling-specific logic in core C++ lib

Move amax and scale update functions out of casting functions, and put into dedicated current-scaling source file. Add general API for accessing quantization config object.

Signed-off-by: Tim Moon <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add missing header in C++ tests

Signed-off-by: Tim Moon <[email protected]>

* Disable test config with FP8 transpose on Blackwell

Signed-off-by: Tim Moon <[email protected]>

* Fix compilation error in C++ test

Signed-off-by: Tim Moon <[email protected]>

---------

Signed-off-by: zhongboz <[email protected]>
Signed-off-by: Tim Moon <[email protected]>
Co-authored-by: zhongboz <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Tim Moon <[email protected]>
Co-authored-by: Tim Moon <[email protected]>
* Verified TE2.0 with offloading

Signed-off-by: Selvaraj Anandaraj <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Skipping tests for Ampere and removed child class preparing

Signed-off-by: Selvaraj Anandaraj <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* offloading support for MXFP8 dtype

Signed-off-by: Selvaraj Anandaraj <[email protected]>
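
As background for the offloading commits in this block, the general mechanism in stock PyTorch looks like the following hedged sketch: saved-tensor hooks that round-trip activations through host memory. The TE changes extend this idea to quantized tensors such as MXFP8, which carry separate data and scale components that each need to be moved; the helper names below are illustrative.

```python
import torch

def pack_to_cpu(t: torch.Tensor):
    # Stash the saved activation on the host; remember its original device.
    return t.device, t.to("cpu", non_blocking=True)

def unpack_from_cpu(packed):
    device, cpu_tensor = packed
    return cpu_tensor.to(device, non_blocking=True)

# Offload everything autograd saves for backward while the context is active:
# with torch.autograd.graph.saved_tensors_hooks(pack_to_cpu, unpack_from_cpu):
#     y = model(x)
# y.sum().backward()
```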

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Changed quantized tensor detection mechanism

Signed-off-by: Selvaraj Anandaraj <[email protected]>

* Fix mxfp8 offload, lint errors, and var name

Signed-off-by: Kirthi Shankar Sivamani <[email protected]>

* Supported disabling offloading for quantized tensors

Signed-off-by: Selvaraj Anandaraj <[email protected]>

* bug fix

Signed-off-by: Selvaraj Anandaraj <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fixed bugs

Signed-off-by: Selvaraj Anandaraj <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Added support for None in list of Quantized data tensors

Signed-off-by: root <[email protected]>

* Hopper backward compatibility cleanup

Signed-off-by: Selvaraj Anandaraj <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Coding style nit

Signed-off-by: Selvaraj Anandaraj <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Added guards

Signed-off-by: Selvaraj Anandaraj <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Selvaraj Anandaraj <[email protected]>
Signed-off-by: Kirthi Shankar Sivamani <[email protected]>
Co-authored-by: Selvaraj Anandaraj <[email protected]>
Co-authored-by: Kirthi Shankar Sivamani <[email protected]>
Internal quantizer for input to the modules

Signed-off-by: Przemek Tredak <[email protected]>
Remove Megatron-LM convergence test

Signed-off-by: Tim Moon <[email protected]>
Co-authored-by: Kirthi Shankar Sivamani <[email protected]>
Revert "Use internal quantizer for input to the modules (#1551)"

This reverts commit b3e7035.

Signed-off-by: Przemek Tredak <[email protected]>
…… (#1540)

Remove xla_ignore_channel_id check and ignore Scan loop warning in unit test

Signed-off-by: Reese Wang <[email protected]>
* fix dtypes in fused attn bwd for FP8

Signed-off-by: Charlene Yang <[email protected]>

* add comments for dtypes

Signed-off-by: Charlene Yang <[email protected]>

* remove redundant qkv_dtype in fwd

Signed-off-by: Charlene Yang <[email protected]>

* remove Nones in bwd returns

Signed-off-by: Charlene Yang <[email protected]>

---------

Signed-off-by: Charlene Yang <[email protected]>
* Explicitly use python3 and pip3

Signed-off-by: Tim Moon <[email protected]>

* Run pre-commit as Python module

Signed-off-by: Tim Moon <[email protected]>

* Replace some missed references to "python" or "pip"

Signed-off-by: Tim Moon <[email protected]>

---------

Signed-off-by: Tim Moon <[email protected]>
Signed-off-by: Tim Moon <[email protected]>
Make ffi compatible with jax 0.4

Signed-off-by: Reese Wang <[email protected]>
Co-authored-by: Phuong Nguyen <[email protected]>
* Delete row-wise data in single-GPU linear forward

Signed-off-by: Tim Moon <[email protected]>

* Debug Python->C++ parsing of transpose-only Float8Tensors

Signed-off-by: Tim Moon <[email protected]>

* Debug tensor shape calculation without row-wise data

Signed-off-by: Tim Moon <[email protected]>

* Debug correctness issues with only column-wise data

Signed-off-by: Tim Moon <[email protected]>

* Only cache column-wise input in LayerNormLinear

Signed-off-by: Tim Moon <[email protected]>

* Support MXFP8 all-gather with only column-wise data

Signed-off-by: Tim Moon <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix moe cases, lint, rm unused ctx

Signed-off-by: Kirthi Shankar Sivamani <[email protected]>

* Fix CPU activation offloading and use consistent logic for save/restore

Signed-off-by: Kirthi Shankar Sivamani <[email protected]>

* Fix tests

Signed-off-by: Kirthi Shankar Sivamani <[email protected]>

* Fix typo

Signed-off-by: Kirthi Shankar Sivamani <[email protected]>

* RM stray file

Signed-off-by: Kirthi Shankar Sivamani <[email protected]>

* Fix distributed and cpp tests

Signed-off-by: Kirthi Shankar Sivamani <[email protected]>

* Fix norm cpp tests

Signed-off-by: Kirthi Shankar Sivamani <[email protected]>

* Rm stray file

Signed-off-by: Kirthi Shankar Sivamani <[email protected]>

* RM stray file

Signed-off-by: Kirthi Shankar Sivamani <[email protected]>

* Fix MXFP8 AG

Signed-off-by: Kirthi Shankar Sivamani <[email protected]>

* Fix FP8 with sequence parallelism

Signed-off-by: Kirthi Shankar Sivamani <[email protected]>

* Fix UB bulk dgrad

Signed-off-by: Kirthi Shankar Sivamani <[email protected]>

---------

Signed-off-by: Tim Moon <[email protected]>
Signed-off-by: Kirthi Shankar Sivamani <[email protected]>
Co-authored-by: Kirthi Shankar Sivamani <[email protected]>
…e (#1558)

* add tex.bgrad_quantize support for CS

Signed-off-by: zhongboz <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Remove unused import

Signed-off-by: Tim Moon <[email protected]>

---------

Signed-off-by: zhongboz <[email protected]>
Signed-off-by: Tim Moon <[email protected]>
Co-authored-by: zhongboz <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Tim Moon <[email protected]>
Co-authored-by: Tim Moon <[email protected]>
update FE to 1.11

Signed-off-by: Charlene Yang <[email protected]>
fix cpu device import error

Signed-off-by: Hongxiao Bai <[email protected]>
Co-authored-by: Tim Moon <[email protected]>
* Add options to comm overlap tests

Signed-off-by: Vasudevan Rengasamy <[email protected]>

* Fix Typo

Signed-off-by: Vasudevan Rengasamy <[email protected]>

* Update tests/pytorch/distributed/run_layer_with_overlap.py

Signed-off-by: Tim Moon <[email protected]>

---------

Signed-off-by: Vasudevan Rengasamy <[email protected]>
Signed-off-by: Tim Moon <[email protected]>
Co-authored-by: Tim Moon <[email protected]>
* Create pytorch/dot_product_attention module and pytorch/d_p_a/utils.py
Move attention logging into a separate class in pytorch/d_p_a/utils.py

Signed-off-by: Kshitij Janardan Lakhani <[email protected]>

* Create FlashAttentionUtils class in pytorch/d_p_a/utils.py for versioning info
Move versioning info out of pytorch/attention.py

Signed-off-by: Kshitij Janardan Lakhani <[email protected]>

* Move AttentionParams and get_attention_backend from attention.py to d_p_a/utils.py
Fix tests and imports for the above refactor change

Signed-off-by: Kshitij Janardan Lakhani <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Move get_qkv_layout(), get_full_mask(), get_alibi(), get_attention_quantizers() to d_p_a/utils.py

Signed-off-by: Kshitij Janardan Lakhani <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Move tensor packing and unpacking helper functions from pyt/attention.py to d_p_a/utils.py

Signed-off-by: Kshitij Janardan Lakhani <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Move cumulative seqlens and indices methods from pyt/attention.py to d_p_a/utils.py
Rename cumulative functions from using _cu_ to using _cumul_ to differentiate from the CUDA cu* call convention
Rename tensor packaging methods with a leading underscore to mark them as internal to the file

Signed-off-by: Kshitij Janardan Lakhani <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Remove unnecessary imports in pytorch/attention.py and d_p_a/utils.py

Signed-off-by: Kshitij Janardan Lakhani <[email protected]>

* Create d_p_a/inference.py and move InferenceParams from pyt/attention.py to it
Modify tests and other files to import InferenceParams correctly

Signed-off-by: Kshitij Janardan Lakhani <[email protected]>

Modify docs api for InferenceParams

Signed-off-by: Kshitij Janardan Lakhani <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Create d_p_a/rope.py and move RoPE methods from  pytorch/attention.py to it

Signed-off-by: Kshitij Janardan Lakhani <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Code cleanup

Signed-off-by: Kshitij Janardan Lakhani <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix qa testing induced bug
Code clean up

Signed-off-by: Kshitij Janardan Lakhani <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix incorrect pack_tensor arg type
Code clean up

Signed-off-by: Kshitij Janardan Lakhani <[email protected]>

* nit: Resolve lint errors

Signed-off-by: Kshitij Janardan Lakhani <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Remove typedef FAUtils for FlashAttentionUtils
Use attn_log instead of att_log

Signed-off-by: Kshitij Janardan Lakhani <[email protected]>

Fix lint error

Signed-off-by: Kshitij Janardan Lakhani <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* nit: Fix the function name from get_cumul to the earlier get_cu

Signed-off-by: Kshitij Janardan Lakhani <[email protected]>

* nit: Fix typos, explicit imports and remove extra comments

Signed-off-by: Kshitij Janardan Lakhani <[email protected]>

---------

Signed-off-by: Kshitij Janardan Lakhani <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Charlene Yang <[email protected]>
…554)

* support tp-comm-overlap in Current Scaling recipe

Signed-off-by: Li Tao <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* clean

Signed-off-by: Li Tao <[email protected]>

* fix test recipe argument to generalize to MXFP8

Signed-off-by: Li Tao <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Reduce duplicated transpose in certain cases

Signed-off-by: Li Tao <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Use per_tensor_scaling() to judge DS or CS

Signed-off-by: Li Tao <[email protected]>

* minor fixes

Signed-off-by: Li Tao <[email protected]>

* change comment description

Signed-off-by: Li Tao <[email protected]>

* add multi-layer unit test for tp overlap

Signed-off-by: Li Tao <[email protected]>

* support test case that run for several times

Signed-off-by: Li Tao <[email protected]>

* avoid saving ub tensor in prepare_for_saving

Signed-off-by: Li Tao <[email protected]>

* fix

Signed-off-by: Li Tao <[email protected]>

* switch to a simple fix

Signed-off-by: Li Tao <[email protected]>

* formatting

Signed-off-by: Li Tao <[email protected]>

* simplify test cases; avoid additional clone()

Signed-off-by: Li Tao <[email protected]>

* fall back to get_buffer in layernormmlp

Signed-off-by: Li Tao <[email protected]>

* use 2 layers for fp8 tpoverlap multi-layer test for better tolerance, limit max gpus for test

Signed-off-by: zhongboz <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Li Tao <[email protected]>
Signed-off-by: zhongboz <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: zhongboz <[email protected]>
* Add issue template

Signed-off-by: Kirthi Shankar Sivamani <[email protected]>

* Fixes

Signed-off-by: Kirthi Shankar Sivamani <[email protected]>

* Make GPU info section

Signed-off-by: Kirthi Shankar Sivamani <[email protected]>

---------

Signed-off-by: Kirthi Shankar Sivamani <[email protected]>
* Do not create multiple cublas handle

Signed-off-by: Przemek Tredak <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix for multiple GPUs per thread

Signed-off-by: Przemek Tredak <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix multithreaded execution

Signed-off-by: Przemek Tredak <[email protected]>

* Fix from conflict

Signed-off-by: Kirthi Shankar Sivamani <[email protected]>

---------

Signed-off-by: Przemek Tredak <[email protected]>
Signed-off-by: Kirthi Shankar Sivamani <[email protected]>
Co-authored-by: Kirthi Shankar Sivamani <[email protected]>
* DistOpt support with offloading

Signed-off-by: Selvaraj Anandaraj <[email protected]>

* Added distopt support for TE2.0

Signed-off-by: Selvaraj Anandaraj <[email protected]>

* Restricted this to MCore DistOpt only

Signed-off-by: Selvaraj Anandaraj <[email protected]>

* Added guards

Signed-off-by: Selvaraj Anandaraj <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update transformer_engine/pytorch/module/linear.py

Co-authored-by: Kirthi Shankar Sivamani <[email protected]>
Signed-off-by: Selvaraj Anandaraj <[email protected]>

* Update transformer_engine/pytorch/module/layernorm_linear.py

Co-authored-by: Kirthi Shankar Sivamani <[email protected]>
Signed-off-by: Selvaraj Anandaraj <[email protected]>

---------

Signed-off-by: Selvaraj Anandaraj <[email protected]>
Signed-off-by: Selvaraj Anandaraj <[email protected]>
Co-authored-by: Selvaraj Anandaraj <[email protected]>
Co-authored-by: Kirthi Shankar Sivamani <[email protected]>
* [QA] Add error handling

- Standardize test failure handling using the unified 'test_fail' and 'error_exit' functions.

Signed-off-by: Linxi Ding <[email protected]>

* Update script to use explicit python3, pip3, and python3 -m pytest calls

- Change pip to pip3.
- Change python to python3.
- Change pytest to python3 -m pytest.

Signed-off-by: Linxi Ding <[email protected]>

---------

Signed-off-by: Linxi Ding <[email protected]>
* Update full recompute feature to save recipe.

The recompute context uses the same recipe
and fp8 settings as the original fwd pass.

Signed-off-by: Keith Wyss <[email protected]>
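
A hedged sketch of the idea described above (not the actual TE checkpoint code): the forward pass records which recipe and FP8 setting were active, and the backward-time recomputation re-enters the same autocast. The class name and argument handling here are illustrative.

```python
import torch
import transformer_engine.pytorch as te

class _RecomputeWithSavedRecipe(torch.autograd.Function):
    """Illustrative recompute wrapper that reuses the forward-pass FP8 recipe."""

    @staticmethod
    def forward(ctx, run_fn, fp8_enabled, fp8_recipe, *args):
        # Assumes all *args are tensors; remember the recipe/settings for backward.
        ctx.run_fn, ctx.fp8_enabled, ctx.fp8_recipe = run_fn, fp8_enabled, fp8_recipe
        ctx.save_for_backward(*args)
        with torch.no_grad(), te.fp8_autocast(enabled=fp8_enabled, fp8_recipe=fp8_recipe):
            return run_fn(*args)

    @staticmethod
    def backward(ctx, *grads):
        inputs = [a.detach().requires_grad_(a.requires_grad) for a in ctx.saved_tensors]
        # Recompute under the *same* recipe and fp8 settings as the original forward.
        with torch.enable_grad(), te.fp8_autocast(enabled=ctx.fp8_enabled,
                                                  fp8_recipe=ctx.fp8_recipe):
            outputs = ctx.run_fn(*inputs)
        torch.autograd.backward(outputs, grads)
        return (None, None, None) + tuple(i.grad for i in inputs)
```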

* Formatted python code.

Signed-off-by: Keith Wyss <[email protected]>

* Simplify code by relying on recipe in ctx

Signed-off-by: Keith Wyss <[email protected]>

* MR feedback: import style

Signed-off-by: Keith Wyss <[email protected]>

---------

Signed-off-by: Keith Wyss <[email protected]>
Co-authored-by: Kirthi Shankar Sivamani <[email protected]>
* add paged attention; test_kv_cache_accuray and test_paged_attn pass

Signed-off-by: Charlene Yang <[email protected]>
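
To make the feature concrete, a toy sketch of a paged KV cache: keys and values live in fixed-size pages, and a per-sequence page table maps logical token positions to physical pages. The class and layout below are illustrative, not TE's kv_cache_manager.

```python
import torch

class PagedKVCache:
    """Toy paged KV cache with a per-sequence page table. Illustrative only."""

    def __init__(self, num_pages: int, page_size: int, num_heads: int, head_dim: int,
                 device: str = "cpu"):
        self.page_size = page_size
        self.k_pages = torch.zeros(num_pages, page_size, num_heads, head_dim, device=device)
        self.v_pages = torch.zeros_like(self.k_pages)
        self.free_pages = list(range(num_pages))
        self.page_table: dict[int, list[int]] = {}   # seq_id -> physical page ids
        self.seq_len: dict[int, int] = {}

    def append(self, seq_id: int, k: torch.Tensor, v: torch.Tensor):
        """Append one token's K/V (shape [num_heads, head_dim]) for a sequence."""
        pos = self.seq_len.get(seq_id, 0)
        if pos % self.page_size == 0:                 # current page full: grab a new one
            self.page_table.setdefault(seq_id, []).append(self.free_pages.pop())
        page = self.page_table[seq_id][pos // self.page_size]
        slot = pos % self.page_size
        self.k_pages[page, slot] = k
        self.v_pages[page, slot] = v
        self.seq_len[seq_id] = pos + 1

    def gather(self, seq_id: int):
        """Materialize contiguous K/V (real kernels index the pages directly)."""
        pages = self.page_table[seq_id]
        n = self.seq_len[seq_id]
        k = self.k_pages[pages].reshape(-1, *self.k_pages.shape[2:])[:n]
        v = self.v_pages[pages].reshape(-1, *self.v_pages.shape[2:])[:n]
        return k, v
```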

* remove unnecessary change from last commit

Signed-off-by: Charlene Yang <[email protected]>

* test_fused_attn pass

Signed-off-by: Charlene Yang <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* remove unnecessary import in test_numerics

Signed-off-by: Charlene Yang <[email protected]>

* add license for test

Signed-off-by: Charlene Yang <[email protected]>

* fix lint

Signed-off-by: Charlene Yang <[email protected]>

* add to L0 test

Signed-off-by: Charlene Yang <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update license for test_paged_attn

Signed-off-by: Charlene Yang <[email protected]>

* update kv_cache_manager license

Signed-off-by: Charlene Yang <[email protected]>

* fix build issue from previous merge

Signed-off-by: Charlene Yang <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* WIP: minor fix/preparation for inference/cuda graph

Signed-off-by: Charlene Yang <[email protected]>

* WIP: non-paged

Signed-off-by: Charlene Yang <[email protected]>

* WIP: non-paged, bshd/sbhd

Signed-off-by: Charlene Yang <[email protected]>

* WIP: non-paged, thd, no CG

Signed-off-by: Charlene Yang <[email protected]>

* WIP: non-paged, thd, CG

Signed-off-by: Charlene Yang <[email protected]>

* WIP: non-paged, CG

Signed-off-by: Charlene Yang <[email protected]>

* WIP: non-paged, using paged kernel

Signed-off-by: Charlene Yang <[email protected]>

* WIP: restructure kernels

Signed-off-by: Charlene Yang <[email protected]>

* WIP: paged, CG

Signed-off-by: Charlene Yang <[email protected]>

* WIP: padding + BRCM

Signed-off-by: Charlene Yang <[email protected]>

* WIP: restructure IP, clean up

Signed-off-by: Charlene Yang <[email protected]>

* WIP: fix non-CG, fused

Signed-off-by: Charlene Yang <[email protected]>

* WIP: fix last commit

Signed-off-by: Charlene Yang <[email protected]>

* WIP: unfused, non-CG

Signed-off-by: Charlene Yang <[email protected]>

* WIP: flash-attn, non-CG

Signed-off-by: Charlene Yang <[email protected]>

* WIP: flash_attn_with_kvcache

Signed-off-by: Charlene Yang <[email protected]>

* commit two files missed by bcef6b34

Signed-off-by: Charlene Yang <[email protected]>

* WIP: thd_bshd_bshd

Signed-off-by: Charlene Yang <[email protected]>

* WIP: fix last commit

Signed-off-by: Charlene Yang <[email protected]>

* WIP: fix 1c31b68d

Signed-off-by: Charlene Yang <[email protected]>

* WIP: add bshd_2sbhd, sbhd_2bshd

Signed-off-by: Charlene Yang <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* WIP: some cleanup

Signed-off-by: Charlene Yang <[email protected]>

* WIP: all qkv_format combinations and merge CM files

Signed-off-by: Charlene Yang <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* WIP: some lint fixes

Signed-off-by: Charlene Yang <[email protected]>

* WIP: add docstring for IP

Signed-off-by: Charlene Yang <[email protected]>

* fix sequences_pre

Signed-off-by: Charlene Yang <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* WIP: minor fixes for multi-layer

Signed-off-by: Charlene Yang <[email protected]>

* WIP: initial multi-layer test

Signed-off-by: Charlene Yang <[email protected]>

* WIP: minor clean up

Signed-off-by: Charlene Yang <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* WIP: clean up

Signed-off-by: Charlene Yang <[email protected]>

* WIP: switch to flash_attn_varlen_func

Signed-off-by: Charlene Yang <[email protected]>

* WIP: fix unfused for separate q/kv format

Signed-off-by: Charlene Yang <[email protected]>

* WIP: fix fused for separate q/kv formats

Signed-off-by: Charlene Yang <[email protected]>

* WIP: flash attn + TELayer + 2 layers

Signed-off-by: Charlene Yang <[email protected]>

* WIP: unfused + TL + 2layers

Signed-off-by: Charlene Yang <[email protected]>

* WIP: all modules/backend

Signed-off-by: Charlene Yang <[email protected]>

* WIP: minor cleanup

Signed-off-by: Charlene Yang <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* WIP: FlashAttention on Hopper with 2.7.3

Signed-off-by: Charlene Yang <[email protected]>

* WIP: FlashAttention + v3 from 39e7179

Signed-off-by: Charlene Yang <[email protected]>

* WIP: FlashAttention + v3 + FP8 + WIP

Signed-off-by: Charlene Yang <[email protected]>

* WIP: add backend support table

Signed-off-by: Charlene Yang <[email protected]>

* WIP: clean up

Signed-off-by: Charlene Yang <[email protected]>

* WIP: separate use_flash_attention_2 and _3

Signed-off-by: Charlene Yang <[email protected]>

* WIP: tweaks to paged attn script

Signed-off-by: Charlene Yang <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* WIP: enable/disable certain cases for fused attn

Signed-off-by: Charlene Yang <[email protected]>

* WIP: small fixes for lint and cg

Signed-off-by: Charlene Yang <[email protected]>

* WIP: minor fixes for attn/infer

Signed-off-by: Charlene Yang <[email protected]>

* WIP: fix CP

Signed-off-by: Charlene Yang <[email protected]>

* WIP: readd page info to FADescriptor_v1

Signed-off-by: Charlene Yang <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* minor tweak to test_numerics.py

Signed-off-by: Charlene Yang <[email protected]>

* fix 9.5/9.7 sq/skv + mask logic

Signed-off-by: Charlene Yang <[email protected]>

* clean up

Signed-off-by: Charlene Yang <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* minor fix for FA3

Signed-off-by: Charlene Yang <[email protected]>

* more minor fixes for FA3

Signed-off-by: Charlene Yang <[email protected]>

* test page_size=1 for FA3

Signed-off-by: Charlene Yang <[email protected]>

* fix t3hd/th3d strides

Signed-off-by: Charlene Yang <[email protected]>

* fix ckpt recompute and fa3 k_scale

Signed-off-by: Charlene Yang <[email protected]>

* raise dynamo recompile limit for test

Signed-off-by: Charlene Yang <[email protected]>

* remove thunder test from L0

Signed-off-by: Charlene Yang <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix FA selection logic

Signed-off-by: Charlene Yang <[email protected]>

* fix FA3 q_descale shape

Signed-off-by: Charlene Yang <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* remove page_table from IP.step() returns

Signed-off-by: Charlene Yang <[email protected]>

* fix FP8 FlashAttn DPA fp8_dpa tests

Signed-off-by: Charlene Yang <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix CP

Signed-off-by: Charlene Yang <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* minor tweaks

Signed-off-by: Charlene Yang <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update FA3 note and L3 test

Signed-off-by: Charlene Yang <[email protected]>

* fix lint

Signed-off-by: Charlene Yang <[email protected]>

* remove redundant import in test

Signed-off-by: Charlene Yang <[email protected]>

* adopt new FA3 APIs from FA2.7.3+/hopper for CP and non-CP

Signed-off-by: Charlene Yang <[email protected]>

* fix lint

Signed-off-by: Charlene Yang <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* relax tols for TransformerLayers

Signed-off-by: Charlene Yang <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix merge

Signed-off-by: Charlene Yang <[email protected]>

* fix merge 2

Signed-off-by: Charlene Yang <[email protected]>

* fix FA import comments

Signed-off-by: Charlene Yang <[email protected]>

* relax tols for Ampere

Signed-off-by: Charlene Yang <[email protected]>

* fix fa3 version and reduce messaging

Signed-off-by: Charlene Yang <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update FA3 to its latest commit on main

Signed-off-by: Charlene Yang <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add default values to IP and assertion to graph.py

Signed-off-by: Charlene Yang <[email protected]>

* add more comments in attention

Signed-off-by: Charlene Yang <[email protected]>

* use custom_cache_manager instead of cache_manager

Signed-off-by: Charlene Yang <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Charlene Yang <[email protected]>
Signed-off-by: Charlene Yang <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Fix softmax shape for THD format.

Signed-off-by: Michael Goldfarb <[email protected]>
* Do not apply bias when apply_bias is False

Signed-off-by: Przemek Tredak <[email protected]>

* Bwd fix for LNMLP and tests

Signed-off-by: Przemek Tredak <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix for the dbias calculation

Signed-off-by: Przemek Tredak <[email protected]>

* Improve tests and cleaning the logic

Signed-off-by: Przemek Tredak <[email protected]>

* Tightened test tolerances a little

Signed-off-by: Przemek Tredak <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Revert "Tightened test tolerances a little"

This reverts commit 2e20a92c884a84759006541adc1d638ab91dde62.

Signed-off-by: Przemek Tredak <[email protected]>

* Update tests/pytorch/test_numerics.py

Co-authored-by: Tim Moon <[email protected]>
Signed-off-by: Przemyslaw Tredak <[email protected]>

* Fix the Gelu Aux type

Signed-off-by: Przemek Tredak <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Remove use_fc1_bias option

Signed-off-by: Przemek Tredak <[email protected]>

---------

Signed-off-by: Przemek Tredak <[email protected]>
Signed-off-by: Przemyslaw Tredak <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Tim Moon <[email protected]>