Add CoreCLR support for android GC bridge #116310

Merged
merged 29 commits into from
Jul 2, 2025

Conversation

BrzVlad
Member

@BrzVlad BrzVlad commented Jun 4, 2025

This change adds runtime support for the GCBridge API described in #115506, for use on Android. It includes most of the initial work from #114184.

When the GCBridge feature is used, JavaMarshal.Initialize is called at application startup. This provides the runtime with a native callback (markCrossReferences) to be called during a GC. During the GC, we compute the set of strongly connected components (SCCs) containing bridge objects that are dead in the .NET space. These SCCs are passed to the callback so that the .NET Android implementation can reproduce the links between the Java counterparts and determine whether each .NET object needs to be collected. (The constraint is that the C# peer keeps the Java peer alive and vice versa. We make no effort to handle finalization, so a resurrected C# object can have its Java peer collected.) Once the .NET Android runtime has done the Java collection, it reports back to the runtime with the list of bridge objects that can be freed, along with the previously passed SCC-related pointers to be freed.
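To make the hand-off more concrete, here is a minimal C++ sketch of what a client might do with the data it receives: walk the cross references and record a link between the corresponding groups of Java peers. The struct and function names (`Component`, `CrossReference`, `MarkCrossReferencesArgs`, `buildPeerLinks`) are illustrative assumptions, not the runtime's actual definitions; #115506 has the real API shape.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical layout: one SCC is a list of opaque per-object context pointers.
struct Component {
    std::size_t count;
    void** contexts;
};

// Hypothetical layout: a directed edge between two SCCs, identified by index.
struct CrossReference {
    std::size_t source;
    std::size_t destination;
};

// Hypothetical payload handed to the mark-cross-references callback.
struct MarkCrossReferencesArgs {
    std::size_t componentCount;
    Component* components;
    std::size_t crossReferenceCount;
    CrossReference* crossReferences;
};

// For each SCC, list the SCCs it must keep alive on the Java side.
std::vector<std::vector<std::size_t>> buildPeerLinks(const MarkCrossReferencesArgs& args) {
    std::vector<std::vector<std::size_t>> links(args.componentCount);
    for (std::size_t i = 0; i < args.crossReferenceCount; ++i) {
        const CrossReference& xref = args.crossReferences[i];
        assert(xref.source < args.componentCount);
        assert(xref.destination < args.componentCount);
        links[xref.source].push_back(xref.destination);
    }
    return links;
}
```

In this sketch the client would mirror each entry of `links` as a temporary Java-side reference before running the Java collection, then report back which contexts are still reachable.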

A bridge object is an object that can have a Java peer. The CoreCLR runtime has no insight into this; the only thing it understands is cross-reference handles. These are GCHandles with an additional pointer associated with them, so that information related to the Java peer can be attached. Objects that have a cross-reference handle allocated always survive the current GC, because we can't collect them until we get permission from the Java world. Once the cross-reference GCHandle is freed, the associated object becomes ordinary, detached from any Java peer, and is free for collection in the .NET heap.
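The lifetime rule for cross-reference handles can be modelled in a few lines. This is a toy C++ model under stated assumptions: `HandleTable`, `CrossReferenceHandle` and `survivesCollection` are hypothetical names, and the real handle table in the GC is organised very differently.

```cpp
#include <cassert>
#include <unordered_map>

struct Object {
    bool marked = false;  // set by the (imaginary) regular mark phase
};

// Hypothetical model: an ordinary handle target plus one extra pointer
// that the client uses for Java-peer bookkeeping.
struct CrossReferenceHandle {
    Object* target;
    void* context;
};

class HandleTable {
public:
    int allocate(Object* target, void* context) {
        int id = nextId_++;
        handles_[id] = CrossReferenceHandle{target, context};
        return id;
    }

    // Freeing the handle detaches the object from its Java peer.
    void release(int id) { handles_.erase(id); }

    bool hasCrossReference(const Object* obj) const {
        for (const auto& entry : handles_) {
            if (entry.second.target == obj) return true;
        }
        return false;
    }

    // An object with a cross-reference handle always survives the current GC,
    // even if nothing in the .NET heap marked it.
    bool survivesCollection(const Object* obj) const {
        return obj->marked || hasCrossReference(obj);
    }

private:
    std::unordered_map<int, CrossReferenceHandle> handles_;
    int nextId_ = 0;
};
```

Once `release` is called, the object in this model is kept alive only by ordinary marking, which mirrors the "becomes ordinary" behaviour described above.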

At the end of the mark phase, during GC, we iterate over all cross-reference handles. When we encounter a handle whose target hasn't yet been marked, we add it to a list (these objects will have to be marked so they remain alive after this collection, given that we need to probe the Java world first). Once we have obtained the set of dead bridge objects, we apply the Tarjan algorithm (ported directly from Mono's implementation). The algorithm operates on the dead object graph reachable from the initial set of dead bridge objects. To implement this secondary scanning mechanism, we hijack the object header of each object we reach with a ScanData that contains all the information relevant to the SCC-building algorithm. Once we have finished building the SCCs, while still in the GC, we call back into .NET Android via TriggerClientBridgeProcessing, which ends up calling the mark cross reference callback provided to JavaMarshal.Initialize. This callback has to dispatch the necessary work to another thread, since it needs to return quickly so that the C# GC can continue its execution.
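For readers unfamiliar with the SCC step, here is the textbook recursive form of Tarjan's algorithm in C++ on a plain adjacency-list graph. It is only a sketch of the technique: the `TarjanScc` type is hypothetical, it keeps per-node state in side arrays rather than in the object header, and it is not the code ported from Mono.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Textbook Tarjan SCC over a directed graph given as adjacency lists.
// After construction, component[v] is the SCC id of node v.
struct TarjanScc {
    const std::vector<std::vector<int>>& graph;
    std::vector<int> index;      // discovery order, -1 if unvisited
    std::vector<int> lowlink;    // smallest index reachable from the node
    std::vector<int> component;  // resulting SCC id per node
    std::vector<bool> onStack;
    std::vector<int> stack;
    int nextIndex = 0;
    int componentCount = 0;

    explicit TarjanScc(const std::vector<std::vector<int>>& g)
        : graph(g), index(g.size(), -1), lowlink(g.size(), 0),
          component(g.size(), -1), onStack(g.size(), false) {
        for (int v = 0; v < static_cast<int>(g.size()); ++v) {
            if (index[v] == -1) visit(v);
        }
    }

    void visit(int v) {
        index[v] = lowlink[v] = nextIndex++;
        stack.push_back(v);
        onStack[v] = true;

        for (int w : graph[v]) {
            if (index[w] == -1) {
                visit(w);
                lowlink[v] = std::min(lowlink[v], lowlink[w]);
            } else if (onStack[w]) {
                lowlink[v] = std::min(lowlink[v], index[w]);
            }
        }

        // v is the root of an SCC: pop the whole component off the stack.
        if (lowlink[v] == index[v]) {
            int w;
            do {
                w = stack.back();
                stack.pop_back();
                onStack[w] = false;
                component[w] = componentCount;
            } while (w != v);
            ++componentCount;
        }
    }
};
```

For example, on the graph `0 -> 1 -> 2 -> 0` with an extra edge `2 -> 3`, nodes 0, 1 and 2 end up in one component and node 3 in another.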

Because the world gets resumed before it has been decided whether the bridge objects are alive or dead, resolving the Target of a weak reference has to wait for the Java bridge processing to finish. Aside from the general problem of resurrecting a C# peer whose Java peer has been collected, this mechanism will be used internally by .NET Android to manually manage the liveness of these bridge objects, in the scenario of calling Dispose on an object. This synchronisation will be at the core of the .NET Android runtime interop. To implement this, weak refs for bridge objects are not nulled during GC (these objects are promoted during collection) but rather at the finishing stage of bridge processing. This change is conservative and adds bridge waiting only for WeakReference, not for GCHandle, following the existing approach in COM.
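The waiting behaviour can be pictured with a small C++ sketch built on a mutex and a condition variable. The `BridgeState` and `BridgeWeakReference` classes are hypothetical stand-ins; the actual runtime synchronisation is not shown in this description and may look quite different.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <mutex>

struct Object {};

// Hypothetical tracker for "is bridge processing still running?".
class BridgeState {
public:
    void beginProcessing() {
        std::lock_guard<std::mutex> lock(mutex_);
        inProgress_ = true;
    }

    void finishProcessing() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            inProgress_ = false;
        }
        finished_.notify_all();
    }

    // Blocks the caller until no bridge processing is in flight.
    void waitUntilFinished() {
        std::unique_lock<std::mutex> lock(mutex_);
        finished_.wait(lock, [this] { return !inProgress_; });
    }

private:
    std::mutex mutex_;
    std::condition_variable finished_;
    bool inProgress_ = false;
};

// Hypothetical weak reference to a bridge object.
class BridgeWeakReference {
public:
    BridgeWeakReference(BridgeState& state, Object* target)
        : state_(state), target_(target) {}

    // Resolving the target waits for bridge processing to finish first.
    Object* target() {
        state_.waitUntilFinished();
        return target_.load();
    }

    // Called at the finishing stage of bridge processing for dead objects,
    // not during the GC itself.
    void clear() { target_.store(nullptr); }

private:
    BridgeState& state_;
    std::atomic<Object*> target_;
};
```

In this model the finishing stage calls `clear()` on the weak references of dead bridge objects and only then calls `finishProcessing()`, so a reader never observes a target whose fate is still undecided.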

This PR adds a few tests to the runtime tests. The tests have a native counterpart that acts as the client bridge; it does no real work and just sleeps for a random interval in place of the Java collection. Each test creates a set of objects with certain links between them, creates weak refs to the BridgeObjects, and then drops all other references. Depending on the graph built, it expects a certain number of SCCs and cross refs to be constructed by the Tarjan algorithm, and then reports all bridge objects as alive or dead. The test also checks that the Target of every weak ref is the expected one.
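As a rough illustration of the kind of expectation such a test encodes, the C++ helper below counts distinct cross references between components once every node has been assigned an SCC id. `countCrossReferences` is a hypothetical helper, and it is a simplification: it treats every node as a bridge object and counts every inter-component edge, which the real bridge does not necessarily do.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <utility>
#include <vector>

// Counts distinct (source SCC, destination SCC) pairs among the graph's edges.
// graph[v] lists the nodes v points to; component[v] is the SCC id of node v.
std::size_t countCrossReferences(const std::vector<std::vector<int>>& graph,
                                 const std::vector<int>& component) {
    assert(graph.size() == component.size());
    std::set<std::pair<int, int>> xrefs;
    for (std::size_t v = 0; v < graph.size(); ++v) {
        for (int w : graph[v]) {
            if (component[v] != component[w]) {
                // Several object-level edges between the same two SCCs
                // collapse into a single cross reference.
                xrefs.insert({component[v], component[w]});
            }
        }
    }
    return xrefs.size();
}
```

A test in this style would build a known graph, compute the expected counts by hand, and compare them with what the bridge reports.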

The gcbridge doesn't consume much memory. A collection for a heavier app can end up with hundreds of SCCs and xrefs. For such a scenario, the gcbridge is expected to consume hundreds of KBs. Most of this memory is taken by ScanData, ColorData and the stacks used by the Tarjan algorithm. These data structures have their capacity increased when necessary, so for most collections there is no new memory allocation; the existing storage is reused. A few other data structures, like the xrefs arrays and the data passed to the bridge client, are allocated from scratch at each collection. While the gc bridge can end up consuming hundreds of KB in heavy scenarios, and maybe a few MB in extreme theoretical cases, less than 10% of this memory is expected to be allocated during collection; the rest should be reused.
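The reuse pattern described here (grow when needed, keep the storage between collections) can be sketched in C++ with a buffer that is reset rather than freed. `ScanEntry` and `ScanBuffer` are hypothetical names chosen for the example, and `std::vector` stands in for whatever dynamic array the runtime actually uses.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-object record used by the SCC computation.
struct ScanEntry {
    void* object;
    int index;
    int lowIndex;
};

// Buffer whose capacity only grows; storage is kept across collections.
class ScanBuffer {
public:
    void add(void* object) {
        entries_.push_back(ScanEntry{object, -1, -1});
    }

    // Called at the end of a collection: drops the entries but keeps the
    // allocation, so the next collection of similar size allocates nothing.
    void reset() { entries_.clear(); }

    std::size_t size() const { return entries_.size(); }
    std::size_t capacity() const { return entries_.capacity(); }

private:
    std::vector<ScanEntry> entries_;
};
```

After a warm-up collection, later collections of the same or smaller size fit into the retained capacity, which matches the expectation that most collections allocate nothing new for these structures.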

@github-actions github-actions bot added the needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners label Jun 4, 2025
@BrzVlad BrzVlad added area-Interop-coreclr and removed needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners labels Jun 4, 2025
BrzVlad added 13 commits June 12, 2025 09:54
From Aaron's implementation
Checking if the object is promoted was validating the next object header in debug builds. During bridge tarjan computation, we patch the object header for some objects in order to store data used by the bridge algorithm, so we need to disable this validation.
HANDLE_MAX_INTERNAL_TYPES value

new instead of malloc

assert for allocation failure

Reuse memory for ColorData and ScanData between collections. We still do alloc/free for other type of data, for example for arrays representing edges between SCCs.

Actually print class name when enabling tarjan bridge logs.

Add separate IsPromoted method to the gc interface

Rename TriggerGCBridge to TriggerClientBridgeProcessing to be more specific about what it is doing.
@jkotas
Member

jkotas commented Jun 26, 2025

The new test is failing in coreclr-release-outerloop-nightly. It should be fixed (e.g. you can catch PNSE exception during initialization when not on Android and skip the rest of the test).

The test used to fail and the JavaMarshal api would crash due to missing internal calls.
@BrzVlad
Member Author

BrzVlad commented Jun 26, 2025

Implemented test skip when PNSE is thrown and fixed the API so it throws this exception rather than crashing as it was doing before. Changed so we throw PlatformNotSupportedException rather than NotSupportedException since it seems more explicit. cc @AaronRobinsonMSFT if you think we should keep the previous exception type.

@jkotas
Member

jkotas commented Jun 26, 2025

/azp run coreclr-release-outerloop-nightly

Azure Pipelines successfully started running 1 pipeline(s).

@Maoni0
Member

Maoni0 commented Jun 27, 2025

sorry, I didn't get a chance to look at the commits related to my feedback till now. the only comment I have is please move the GetHighPrecisionTimeStamp impl to gccommon.cpp so all the code in the gc dir can share it instead of multiple files defining their own duplicated copies. see log_init_error_to_host as an example.

@vitek-karas
Member

@Maoni0 @jkotas - is this ready? Could it be approved and merged?

Member

@jkotas jkotas left a comment

It would be best for @Maoni0 and @AaronRobinsonMSFT to sign-off on this one, but they are oof currently. If there is any additional feedback, it can be incorporated once they are back.

@mangod9
Member

mangod9 commented Jun 30, 2025

/azp run runtime-coreclr outerloop

Azure Pipelines successfully started running 1 pipeline(s).

@mangod9
Member

mangod9 commented Jun 30, 2025

I have just triggered an outerloop run given the surface area of this change. If that looks good, we can certainly merge and make follow-up changes later if needed.

@mangod9
Member

mangod9 commented Jul 1, 2025

outerloop is good. I assume the browser-wasm failures are known issues?

@BrzVlad
Member Author

BrzVlad commented Jul 1, 2025

Yes, failures are unrelated

8 participants