
[FlexAttention] FlexDecoding accuracy discrepancy between XPU and CUDA while compiling torch.ops.higher_order.flex_attention #3588

Closed as duplicate of #3631

Description

@hoshibara

Describe the bug

Issue Description:
While running the XPU FlexDecoding unit test (UT), we found a test failure caused by a tensor mismatch.

python test/inductor/test_flex_decoding.py TestFlexDecoding.test_paged_attention_page_size_float16_score_mod1_head_dims1_page_size_256

We captured the compiled artifacts on both XPU and CUDA, unified their inputs and outputs, and found that running the Triton code generated by the UT produces different results on the two backends.

triton-code.zip
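
For reference, below is a minimal sketch of the kind of cross-backend comparison involved. It assumes a PyTorch build where both an XPU and a CUDA device are available; the `rel_bias` score modification is a hypothetical stand-in, not the exact `score_mod1` or paged-attention setup used by the failing UT.

```python
# Minimal cross-backend comparison sketch (not the exact failing UT).
# Assumes a PyTorch build with both XPU and CUDA devices available.
import torch
from torch.nn.attention.flex_attention import flex_attention

def rel_bias(score, b, h, q_idx, kv_idx):
    # Hypothetical score_mod; the failing test uses its own score_mod1.
    return score + (q_idx - kv_idx)

def run(device):
    torch.manual_seed(0)
    # Generate inputs on CPU so both backends see identical data.
    q, k, v = (torch.randn(1, 4, 512, 64, dtype=torch.float16) for _ in range(3))
    q, k, v = (t.to(device) for t in (q, k, v))
    compiled = torch.compile(flex_attention)
    return compiled(q, k, v, score_mod=rel_bias).cpu()

out_xpu = run("xpu")
out_cuda = run("cuda")
# The UT fails at the analogous check with a tensor mismatch.
torch.testing.assert_close(out_xpu, out_cuda, atol=2e-2, rtol=2e-2)
```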

Environment details

You can refer to this issue to set up a reproduction environment:
#3518
