Description
Describe the bug
Issue Description:
While running the FlexDecoding unit tests (UT) on XPU, we found that the following test fails with a tensor mismatch:
python test/inductor/test_flex_decoding.py TestFlexDecoding.test_paged_attention_page_size_float16_score_mod1_head_dims1_page_size_256
We captured the compiled results on both XPU and CUDA, unified their outputs, and found that running the Triton code generated by the UT produces different results on the two devices.
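To make the mismatch easier to report, the two captured outputs can be summarized numerically. The following is only a minimal sketch, not the check the UT itself performs: `summarize_mismatch` is a hypothetical helper, the default tolerances are placeholders rather than the test's actual values, and the elementwise rule `|actual - expected| <= atol + rtol * |expected|` is the same form `torch.allclose` uses. It works on flat Python lists, so a tensor would first be converted with something like `out.float().cpu().flatten().tolist()`.

```python
import math


def summarize_mismatch(actual, expected, rtol=1e-3, atol=1e-3):
    """Summarize elementwise differences between two flat float sequences.

    Hypothetical helper for illustration. An element counts as mismatched
    when |actual - expected| > atol + rtol * |expected|. NaN values always
    count as mismatched, but are ignored when tracking the largest difference.
    """
    if len(actual) != len(expected):
        raise ValueError("sequences must have the same length")

    mismatched = 0
    max_abs_diff = 0.0
    worst_index = None

    for i, (a, e) in enumerate(zip(actual, expected)):
        diff = abs(a - e)
        # A NaN diff fails this comparison, so it is counted as a mismatch.
        if not diff <= atol + rtol * abs(e):
            mismatched += 1
        # Track the largest finite difference and where it occurs.
        if not math.isnan(diff) and diff > max_abs_diff:
            max_abs_diff = diff
            worst_index = i

    return {
        "mismatched": mismatched,
        "total": len(actual),
        "max_abs_diff": max_abs_diff,
        "worst_index": worst_index,
    }


if __name__ == "__main__":
    # Placeholder values standing in for the flattened XPU and CUDA outputs.
    xpu_out = [1.0, 2.0, 3.0]
    cuda_out = [1.0, 2.0, 3.5]
    print(summarize_mismatch(xpu_out, cuda_out))
```

Reporting the mismatch count together with the largest difference and its index helps distinguish a small precision drift from a genuinely wrong result in part of the output.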
Environment details
You can refer to this issue to set up a reproduction environment:
#3518