[FlexAttention] FlexDecoding accuracy discrepancy between XPU and CUDA while compiling `torch.ops.higher_order.flex_attention`

### Describe the bug

While running the XPU FlexDecoding unit tests (UT), we found that the following test fails with a tensor mismatch:
```bash
python test/inductor/test_flex_decoding.py TestFlexDecoding.test_paged_attention_page_size_float16_score_mod1_head_dims1_page_size_256
```
We captured the compiled results on both XPU and CUDA and unified their outputs. Running the Triton code generated by the UT produces different results on the two backends.

[triton-code.zip](https://github.com/user-attachments/files/19050895/triton-code.zip)
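
For anyone reproducing the comparison, here is a minimal sketch of an elementwise check between the two outputs. It is not the check the UT performs: `compare_outputs` is a hypothetical helper, the `rtol`/`atol` defaults are placeholders rather than the test's tolerances, and it assumes each output has already been flattened to a plain list of floats (for example via `tensor.flatten().tolist()`). The criterion has the same form as `torch.allclose`.

```python
import math


def compare_outputs(ref, test, rtol=1e-3, atol=1e-3):
    """Compare two flat sequences of floats elementwise.

    An element passes when |test - ref| <= atol + rtol * |ref|.
    Returns (max_abs_err, mismatched_indices). NaN differences are
    always reported as mismatches and do not affect max_abs_err.
    """
    if len(ref) != len(test):
        raise ValueError(f"length mismatch: {len(ref)} vs {len(test)}")

    max_abs_err = 0.0
    mismatches = []
    for i, (r, t) in enumerate(zip(ref, test)):
        err = abs(t - r)
        tol = atol + rtol * abs(r)
        # `not (err <= tol)` is also True when err is NaN.
        if not (err <= tol):
            mismatches.append(i)
        if not math.isnan(err) and err > max_abs_err:
            max_abs_err = err
    return max_abs_err, mismatches


if __name__ == "__main__":
    # Placeholder values standing in for the CUDA (reference) and XPU outputs.
    cuda_out = [1.0, 2.0, 3.0]
    xpu_out = [1.0, 2.0005, 3.5]
    max_err, bad = compare_outputs(cuda_out, xpu_out)
    print(f"max abs error: {max_err}, mismatched indices: {bad}")
```

Reporting the mismatched indices alongside the maximum error can help tell whether the discrepancy is confined to particular elements or spread across the whole output.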


### Environment details

You can refer to this issue to set up a reproduction environment:
https://github.com/intel/intel-xpu-backend-for-triton/issues/3518
