add multi-item scoring #1015


Merged: 41 commits merged into flashinfer-ai:main on Apr 30, 2025

Conversation

@arde171 (Contributor) commented Apr 11, 2025

Co-authored with Qingquan Song (@qingquansong) and Ziang Li (@zianglih)

Multi-item scoring

  1. Concatenate the member prefix with all ranking candidate items of the same member, separated by delimiter tokens:
    <member prefix (profile & history)> + <delimiter> + item 1 + <delimiter> + item 2 + ... + <delimiter> + item N + <delimiter>
  2. Extract the logits of the hidden states at the token before each delimiter, and from those extract the log-probs of the given label tokens. For each prompt, the output returned is a 2D list with shape N * K, where N is the number of candidate items the prompt contains and K is the number of choices provided to the server engine (e.g., 2 for ["Yes", "No"]). This is mainly done in the logit processor; see the sketch after this list.
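As a rough illustration of the two steps above, here is a minimal sketch using toy token ids and random logits. The delimiter id, label ids, and vocabulary size are all hypothetical, and in reality the scoring happens inside the server's logit processor rather than as standalone code like this:

```python
import torch

DELIM = 32000                 # hypothetical delimiter token id
LABEL_IDS = [9642, 2822]      # hypothetical token ids for ["Yes", "No"]

# Step 1: build one prompt = member prefix + delimiter-separated items.
prefix = [101, 102, 103]                        # <member prefix (profile & history)>
items = [[5, 6, 7], [8, 9], [10, 11, 12, 13]]   # 3 items of length 3, 2, 4
tokens = list(prefix)
for item in items:
    tokens.append(DELIM)      # delimiter opening each item
    tokens.extend(item)
tokens.append(DELIM)          # trailing delimiter closing the last item

# Step 2: take the logits at the token *before* each item-closing delimiter
# and read off the log-probs of the label tokens.
token_ids = torch.tensor(tokens)
logits = torch.randn(len(tokens), 32064)        # stand-in for the model's logits
delim_pos = (token_ids == DELIM).nonzero(as_tuple=True)[0]
score_pos = delim_pos[1:] - 1                   # last token of each item
log_probs = torch.log_softmax(logits[score_pos], dim=-1)
scores = log_probs[:, LABEL_IDS]                # shape N x K = 3 x 2
```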

The PR optimizes the multi-item scoring attention by passing four new args and using them to check the masking condition. The provided args are:

prefix_len_ptr : Optional[torch.Tensor]
    The prefix length. A uint32 1D tensor indicating the prefix length of each prompt; the tensor size equals the batch size.
token_pos_in_items_ptr : Optional[torch.Tensor]
    A uint16 1D tensor (it will be converted to uint16 in flashinfer) indicating the position of each token within its item, starting from 0 at each delimiter. E.g., if we have 3 items of length 3, 2, and 4 respectively for this member, this vector will look like
    `[0, 1, 2, 3, 0, 1, 2, 0, 1, 2, 3, 4, 0]` with the 4 delimiters indexed as 0. For batch size > 1,
    we concatenate the per-prompt vectors into one 1D tensor with zero padding so that each prompt's segment has the same length; the padding amount is
    `token_pos_in_items_len` minus the length of that prompt's raw `token_pos_in_items_ptr`.
token_pos_in_items_len : Optional[int]
    The zero-padded length of each prompt's segment in `token_pos_in_items_ptr`, used to handle the batch size > 1 case. Still using the 3, 2, 4 example above:
    if we set `token_pos_in_items_len` to 20, the vector becomes `[0, 1, 2, 3, 0, 1, 2, 0, 1, 2, 3, 4, 0, 0, 0, 0, 0, 0, 0, 0]`
    with 7 padded zeros. (Note there are 8 zeros at the end, of which the first is the delimiter token at the end of the prompt.)
max_item_len_ptr : Optional[torch.Tensor]
    A uint16 1D tensor containing the maximum token length over all items for each prompt.
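For concreteness, here is a hedged host-side sketch of how these tensors could be assembled for the 3, 2, 4 example above. The helper name is ours, not the PR's API, and we use int32 for portability; the docstrings above call for uint32/uint16, so cast as needed:

```python
import torch

def build_multi_item_args(prefix_len: int, item_lens: list[int],
                          token_pos_in_items_len: int):
    # prefix_len_ptr: one entry per prompt (documented dtype: uint32).
    prefix_len_ptr = torch.tensor([prefix_len], dtype=torch.int32)

    # token_pos_in_items_ptr: 0 at each delimiter, then 1..len within the item,
    # plus a trailing delimiter, zero-padded to token_pos_in_items_len.
    pos: list[int] = []
    for n in item_lens:
        pos.append(0)                 # delimiter opening this item
        pos.extend(range(1, n + 1))   # positions of the item's tokens
    pos.append(0)                     # trailing delimiter ending the prompt
    pos += [0] * (token_pos_in_items_len - len(pos))
    token_pos_in_items_ptr = torch.tensor(pos, dtype=torch.int32)

    # max_item_len_ptr: max item length per prompt (documented dtype: uint16).
    max_item_len_ptr = torch.tensor([max(item_lens)], dtype=torch.int32)
    return prefix_len_ptr, token_pos_in_items_ptr, max_item_len_ptr

p, t, m = build_multi_item_args(prefix_len=128, item_lens=[3, 2, 4],
                                token_pos_in_items_len=20)
# t: [0,1,2,3, 0,1,2, 0,1,2,3,4, 0, 0,0,0,0,0,0,0]  (7 padded zeros)
```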

Optimizations

  1. Implement an efficient multi-item scoring mask for FA2 and FA3.
  2. Enhance FA3 to support batch indices for the multi-item scoring mask.
  3. Implement tile skipping for FA2 and FA3 multi-item scoring.
  4. Optimize the mask by preloading it into L1 cache and thread registers.
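For intuition, a pure-Python reference of the mask these kernels compute might look like the following. This is an unoptimized sketch of our reading of the scheme (each item attends to the shared prefix plus causally to itself, never to other items); it is not the kernel code, and the function name is ours:

```python
import torch

def multi_item_mask(prefix_len: int, token_pos_in_items: list[int]) -> torch.Tensor:
    """Boolean [L, L] mask, True where attention is allowed (reference only)."""
    L = prefix_len + len(token_pos_in_items)
    # Assign an item id to every position: the prefix gets -1, and each
    # delimiter (marked 0 in token_pos_in_items) opens a new item.
    item_id = [-1] * prefix_len
    current = -1
    for pos in token_pos_in_items:
        if pos == 0:
            current += 1
        item_id.append(current)
    mask = torch.zeros(L, L, dtype=torch.bool)
    for q in range(L):
        for k in range(q + 1):  # causal: keys cannot come after the query
            # Allowed if the key is in the shared prefix or in the query's item.
            if item_id[k] == -1 or item_id[k] == item_id[q]:
                mask[q, k] = True
    return mask

# Items of length 3, 2, 4 with a 2-token prefix:
m = multi_item_mask(2, [0, 1, 2, 3, 0, 1, 2, 0, 1, 2, 3, 4, 0])
```

The tile-skipping optimization in items 3 and 4 follows from this structure: whole key tiles that fall entirely in other items can be skipped rather than masked element by element.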

@qingquansong (Contributor):

Hey @yzh119 as discussed, here's the PR for multi-item scoring masked attention. Please feel free to leave comments and provide suggestions if there could be better ways to help upstream the change. Thank you in advance!

@yzh119 yzh119 self-requested a review April 11, 2025 18:11
@yzh119 (Collaborator) left a comment:

Overall LGTM; I left some comments on the additional parameters.

btw, some unit tests failed (https://github.com/flashinfer-ai/flashinfer/blob/9220fb3443b5a5d274f00ca5552f798e225239b7/tests/test_block_sparse.py) because of the pybind interface change; would you mind fixing the sparse APIs as well (https://github.com/flashinfer-ai/flashinfer/blob/main/flashinfer/sparse.py)?

@@ -108,6 +112,10 @@ struct PagedParams {
uint32_t* total_num_rows;
uint32_t padded_batch_size;
bool partition_kv;
uint32_t* prefix_len_ptr;
@yzh119 (Collaborator):
Can you move them to additional params? I prefer managing all optional parameters as additional ones rather than default ones, which is easier to maintain.

Examples include:

additional_tensor_names = ["maybe_custom_mask", "maybe_alibi_slopes"]
(we use the prefix maybe_ to indicate these components are optional and have type std::optional<...>).

@arde171 (Contributor, Author):
@yzh119 as suggested, I moved the multi-item scoring parameters to additional params.

@@ -66,6 +67,11 @@ struct RaggedParams {
int window_left;

bool causal;

@yzh119 (Collaborator):
Ditto, better to move to additional params.

@arde171 (Contributor, Author):
Done.

@@ -43,6 +43,7 @@ struct RaggedParams {
IdType* kv_lens;
IdType* head_indices;
IdType* work_indptr;
IdType* batch_indices;
@yzh119 (Collaborator):
Thanks for doing this, yes we have to add it.

@@ -786,6 +786,73 @@ __device__ __forceinline__ void logits_mask(
}
}

template <typename KTraits, typename Params>
__device__ __forceinline__ void logits_mask_customized(
A contributor commented:
Better rename this function to "logits_mask_multi_item_scoring". Or move its body to the previous "logits_mask".

@arde171 (Contributor, Author):
Fixed.

@@ -2114,9 +2216,18 @@ __device__ __forceinline__ void BatchPrefillWithPagedKVCacheDevice(
: chunk_size) /
CTA_TILE_KV;

const uint32_t unified_num_iterations =
@zianglih (Contributor) commented on Apr 16, 2025:
Maybe rename MIS "num_iterations_full" to "num_iterations", and MIS "num_iterations" to "num_iterations_prefix" to avoid redundancy.

@arde171 (Contributor, Author):
Done.

@qingquansong (Contributor):

Hey @yzh119 , @arde171 has resolved the comments, could you help take another look? Thank you!

@yzh119 (Collaborator) left a comment:
Hi @arde171 @qingquansong @zianglih, thanks for the great contribution. The PR looks good to me in general, and I have added some commits to address the conflicts with mainline. Let's merge this PR first and then move forward.

The remaining possible improvements include (not necessarily in this PR):

  1. Account for the skipped blocks in the scheduler (plan function); otherwise the ahead-of-time scheduler may misestimate the execution time of a tile.
  2. Further modularize the template so that the multi-item scoring attention pattern becomes a special form of attention variant; currently we still insert some special-case code into the template to handle this pattern, but we hope to fully decouple attention variants from the templates themselves.

@yzh119 yzh119 merged commit 6c6f1a5 into flashinfer-ai:main Apr 30, 2025
2 checks passed
yzh119 added a commit that referenced this pull request May 23, 2025
## 📌 Description

- `batch_indices_offset` (introduced in #1015) was not passed to the fp8 attention kernels; this PR fixes the issue.
- Adds the fp8 kernels to the AOT generators.

## 🔍 Related Issues

#1064 

Edenzzzz pushed a commit to Edenzzzz/flashinfer that referenced this pull request Jun 6, 2025 (flashinfer-ai#1087): the same `batch_indices_offset` fix as above.