Skip to content

[windows] Fix assembler error in pal trap handler. #163

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

stellaraccident
Copy link
Contributor

@stellaraccident stellaraccident commented May 10, 2025

This has been showing up in real use as an error printed to the console complaining that the ".not_s_trap" label cannot be found on device initialization.

Tracked back to this commit: 7b72c1b

In the referenced original source, the .not_s_trap label contained conditional code for gfx94x which was removed. Outside of that case, it jumps to .no_skip_debug_trap, which we use here.

Note that I think this was broken since the above commit was landed. However, there does not seem to be error handling around the comgr call to assemble and install the trap handler, so this only manifests as an LLVM error printed to the console. I'm not sure what testing is done on this, but this means that if it was just an unattended pipeline, the bug may have been missed. Recommend fixing error handling to fail on error.

Tested by running hipblaslt tests on Windows/gfx1151 in TheRock and verifying that the error was not printed and tests run correctly.

jaydeeppatel1111 and others added 30 commits December 12, 2024 06:17
…eam.

Change-Id: I9c0f94a9531555278a51202ec7203961e1344c2e
Remove static functions in graph

Change-Id: I4df94915f81f250acaea60398aea32ef0ed658e2
Change-Id: I95421c87904dd62d7ee214539a57c7bda1097ff4
Change-Id: Id624b2c91e6b701bc0ee561a0c193f2c66654890
Support gfx9-4-generic target to cover mi3XX.
Support features sramecc and xnack in generic target.
Improve some code formats.
Add more log on compiler.

Change-Id: I6b3c6af55c60cffd43ce6f17b75998f751b75713
Change-Id: I5ac63a6626af8c2b4ac382c52dfe1aaf0b3716b8
Change-Id: I9028df0bb73289791d169e7f064a1d0f615236a5
…rams APIs

Change-Id: Ieccecfe6173cc68fd3c01f86c99f7cc09fe194a3
Change-Id: Ia21afddf5223ecd132a06f37bb430961fb7a9341
Change-Id: Id4df63b8ae64a1113f85d89aa250ac9f7cc8b9bb
…aphExec

Change-Id: I54d67a1665355579bc249d8ff4f9806e9ee14588
Change-Id: I2aac9d211f64b3d6c121d8b010d215dcbdeac3aa
- added wptr and rptr to ClPrint in dispatchBarrierPacket and dispatchBarrierValuePacket

Change-Id: I8a62289deb23c9f657a9b0ac6138bb55eafecba2
Change-Id: I60777ef5c56b60dd8100d0d794ca10fb3b96a555
Change-Id: I2093a39d79a46da7e102266c04c2a71e03dcb88e
Use shared mutex for events validation

Change-Id: Iff291c758d9edd65717c506150f3b9d39e5306ba
Change-Id: I9c91f5d945a8d8bd2b2f55e3d11ede66afe4eef7
Change-Id: I6e58dfbe4ba13db8717edc36020fefabc9ddbe23
The vector with all kernels is preallocated on the executable init.
Thus, reduce the scope  of global lock to the binary creation only.

Change-Id: I73035013a6562175069137e895bba815f466ee35
Change-Id: I17fdaf7ac323507f99a7c071066944296537489c
NOPTION is meant for component options or alias runtime options so
the option group must not be OA_RUNTIME or OA_MISC_ALIAS must be set,
otherwise we incorrectly assume that it has an option variable and
attempting to write to it causes corruption of OptionVariables.

Change-Id: Iafb5a8f743e5ed0f87be36061c44578178f6cfde
Change-Id: Id33144623555a5d25e029ca644f6274610dcd0ad
Updated CHANGELOG to include the performance fix for
kernel launch latency with increasing number of idle streams.

Change-Id: I509e14cb8f8cd3abe61c6ede78808e96ef8f06e1
Change-Id: I9a764ac99cd03d0a18ebc99cdd0313301e35565b
Support different address modes in X, Y, Z directions

Change-Id: If1db5a8af33c92dd14b48968c3e8eceb97daea6c
… node should be same.

Change-Id: I6ebc21cc42e41ad5d952a69fb3b3cb095f32cffb
Uri decoder logic currently silently ignores processing of memory uri.
This patch enables the existing logic to handle the processing of offset
and size related to loaded code-object having memory URI.

Change-Id: If03579cefb11d91f667410464dc89404df9270a3
…1976)

Change-Id: I45ae4711a047f4484a018b9409c9f6ecf09720ce
Change-Id: I9d69695e4b6668e6de00f1f6b060862872358340
…or dispatch packets

 - Added DEBUG_CLR_SKIP_RELEASE_SCOPE flag to force release scope to
   SCOPE_NONE in AQL packet header

Change-Id: Ife02cddb9d5cd4749103ce585d3d5fe9024c6868
Zhang, Victor and others added 7 commits May 7, 2025 09:52
* SWDEV-528142 - add error check for KernelParameters::capture

* Update kernel.cpp

---------

Co-authored-by: victzhan <[email protected]>
Convention is to always link against .so.* at runtime.
Having it link against .so will break on systems that package
the .so files in their dev/devel package.

This issue was found when building ROCm 6.4 for Fedora.

Commiting on behalf of GitHub user Mystro256
* SWDEV-528913 - support gfx950 in rocsetting

---------

Co-authored-by: Jimbo Xie <[email protected]>
* SWDEV-527299 - Support HIP_POINTER_ATTRIBUTE_CONTEXT

As HIP enables UVA by default, it seems we can simply expose the context to support this feature.
This has been showing up in real use as an error printed to the console complaining that the ".not_s_trap" label cannot be found on device initialization.

Tracked back to this commit: ROCm@7b72c1b

In [the referenced original source](https://github.com/ROCm/ROCR-Runtime/tree/amd-staging/runtime/hsa-runtime/core/runtime/trap_handler), the `.not_s_trap` label contained conditional code for gfx94x which was removed. Outside of that case, it jumps to `.no_skip_debug_trap`, which we use here.

Tested by running hipblaslt tests on Windows/gfx1151 in TheRock and verifying that the error was not printed and tests run correctly.
amd-justchen
amd-justchen previously approved these changes May 10, 2025
Copy link

@amd-justchen amd-justchen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Explanation makes sense to me

lancesix
lancesix previously approved these changes May 11, 2025
Copy link
Contributor

@lancesix lancesix left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, thanks for catching that. I thought I had a similar patch somewhere in the pipeline to fix this, but this one is equivalent.

FWIW, this LGTM.

stellaraccident added a commit to ROCm/TheRock that referenced this pull request May 11, 2025
@stellaraccident
Copy link
Contributor Author

Do you all have a special procedure for landing patches? I'm getting "Cannot change this locked branch".

@jayhawk-commits

@stellaraccident stellaraccident changed the base branch from amd-staging to develop May 11, 2025 05:22
@stellaraccident stellaraccident dismissed stale reviews from lancesix and amd-justchen May 11, 2025 05:22

The base branch was changed.

@stellaraccident
Copy link
Contributor Author

Hi, thanks for catching that. I thought I had a similar patch somewhere in the pipeline to fix this, but this one is equivalent.

FWIW, this LGTM.

Indeed - looks fixed on develop. I got confused and thought this was one of the oddball repos that was developing on a staging branch. Anyway, closing.

@lamb-j
Copy link
Contributor

lamb-j commented May 16, 2025

@lancesix does https://github.com/AMD-ROCm-Internal/clr/commit/c35e9643ecaba41a69da3fc4ac01203586454176 fix this issue?

It looks like we're still using .not_s_trap instead of .no_skip_debug_trap

@lancesix
Copy link
Contributor

It should fix the assembling issue, doesn't it? .not_s_trap and .no_skip_debug_trap do refer to the same point in the code. I just used another label to make it slightly clearer. Some pseudo code for what this does is:

if (ttmp1.trap_id != 0)
  {
    // we entered the trap handler because we executed `s_trap X`, X being available in ttmp1
    ttmp_0_1.PC += 4; // the wave's saved PC points to the s_trap instruction itself, advance past it.

    if (ttmp1.trap_id == debug_trap && !ttmp11.debug_enabled)
      {
        // this is __builtin_debugtrap (), but the debugger is not attached, just skip this trap.
       // Logically return from the trap handler, we need to jump to .exit to do that.
       return;
      }
    .no_skip_debug_trap
  }
.not_s_trap

// continue processing the trap

@lamb-j
Copy link
Contributor

lamb-j commented May 16, 2025

Great, thanks for clarifying @lancesix!

@lamb-j lamb-j closed this May 16, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.