CoordinatedShutdown: return non-zero (error) exit code when ActorSystem is removed due to DOWNing #7517

Open
Aaronontheweb opened this issue Mar 11, 2025 · 2 comments

@Aaronontheweb
Member

Is your feature request related to a problem? Please describe.

This issue was raised by a support customer. TL;DR: if Akka.NET could force the process to exit with a non-zero exit code when the Split Brain Resolver (SBR) kicks in and the node is downed, Windows Services could detect the failure and restart the service more easily.

Describe the solution you'd like

We'd need some way of passing a parameter such as int exitCode into the CoordinatedShutdown routine when we invoke it, if one doesn't already exist. That value could be propagated all the way through to the clr-exit stage, if it's enabled, and the process exit code could be set there.

Describe alternatives you've considered

N/A

Additional context

This might be useful in non-Windows contexts too, such as K8s supervision.

@Aaronontheweb
Member Author

Looks like we already pass a Reason parameter to CoordinatedShutdown when we exit, and we have one specifically for DOWNing:

public class ClusterDowningReason : Reason
{
    public static readonly Reason Instance = new ClusterDowningReason();
    private ClusterDowningReason()
    {
    }
}

This will get invoked by the ClusterDaemon here:

protected override void PostStop()
{
    _clusterPromise.TrySetResult(Done.Instance);
    if (_settings.RunCoordinatedShutdownWhenDown)
    {
        // if it was stopped due to leaving CoordinatedShutdown was started earlier
        _coordShutdown.Run(CoordinatedShutdown.ClusterDowningReason.Instance);
    }
}

And we know that this setting takes effect when nodes are downed because of this test:

[Fact]
public async Task A_cluster_must_terminate_ActorSystem_via_Down_CoordinatedShutdown()
{
    var sys3 = ActorSystem.Create("ClusterSpec3", ConfigurationFactory.ParseString(@"
        akka.actor.provider = ""cluster""
        akka.remote.dot-netty.tcp.port = 0
        akka.coordinated-shutdown.terminate-actor-system = on
        akka.cluster.run-coordinated-shutdown-when-down = on
        akka.loglevel=DEBUG
    ").WithFallback(Akka.TestKit.Configs.TestConfigs.DefaultConfig));
    try
    {
        var probe = CreateTestProbe(sys3);
        Cluster.Get(sys3).Subscribe(probe.Ref, typeof(ClusterEvent.IMemberEvent));
        await probe.ExpectMsgAsync<ClusterEvent.CurrentClusterState>();
        await Cluster.Get(sys3).JoinAsync(Cluster.Get(sys3).SelfAddress).ShouldCompleteWithin(10.Seconds());
        await probe.ExpectMsgAsync<ClusterEvent.MemberUp>();
        Cluster.Get(sys3).Down(Cluster.Get(sys3).SelfAddress);
        await probe.ExpectMsgAsync<ClusterEvent.MemberDowned>();
        await probe.ExpectMsgAsync<ClusterEvent.MemberRemoved>();
        await sys3.WhenTerminated.ShouldCompleteWithin(10.Seconds());
        Cluster.Get(sys3).IsTerminated.Should().BeTrue();
        // the shutdown reason recorded by CoordinatedShutdown is the DOWNing one
        CoordinatedShutdown.Get(sys3).ShutdownReason.Should().BeOfType<CoordinatedShutdown.ClusterDowningReason>();
    }
    finally
    {
        Shutdown(sys3);
    }
}

That gives us a good basis, I think, for implementing this.

@Aaronontheweb
Member Author

We'd need to set the exit code somewhere in here:

internal static void InitClrHook(ActorSystem system, Config conf, CoordinatedShutdown coord)
{
    var runByClrShutdownHook = conf.GetBoolean("run-by-clr-shutdown-hook", false);
    if (runByClrShutdownHook)
    {
        var exitTask = TerminateOnClrExit(coord);
        // run all hooks during termination sequence
        AppDomain.CurrentDomain.ProcessExit += exitTask;
        system.WhenTerminated.ContinueWith(_ =>
        {
            AppDomain.CurrentDomain.ProcessExit -= exitTask;
        });
        coord.AddClrShutdownHook(() =>
        {
            coord._runningClrHook = true;
            return Task.Run(() =>
            {
                if (!system.WhenTerminated.IsCompleted)
                {
                    coord.Log.Info("Starting coordinated shutdown from CLR termination hook.");
                    try
                    {
                        coord.Run(ClrExitReason.Instance).Wait(coord.TotalTimeout);
                    }
                    catch (Exception ex)
                    {
                        coord.Log.Warning("CoordinatedShutdown from CLR shutdown failed: {0}", ex.Message);
                    }
                }
                return Done.Instance;
            });
        });
    }
}

private static EventHandler TerminateOnClrExit(CoordinatedShutdown coord)
{
    return (_, _) =>
    {
        // have to block, because if this method exits the process exits.
        coord.RunClrHooks().Wait(coord.TotalTimeout);
    };
}

My guess is that we could tag some of the Reasons with an interface, INaughtyExitCode or something, and return an error code based on that.
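
To make the tagging idea concrete, here is a rough, self-contained sketch. Nothing in it is existing Akka.NET API: the Reason, ClusterDowningReason and ClrExitReason classes are stand-ins so the snippet compiles on its own, and IExitCodeReason, the ExitCodes helper and the exit code value 1 are all hypothetical names and choices. The only real API used is Environment.ExitCode from the .NET base class library.

```csharp
using System;

// Stand-in for CoordinatedShutdown.Reason so this sketch compiles without Akka.NET.
public abstract class Reason { }

// Hypothetical marker interface (the "INaughtyExitCode" idea):
// a Reason that wants the process to end with a specific exit code.
public interface IExitCodeReason
{
    int ExitCode { get; }
}

// Stand-in for the DOWNing reason, tagged with the hypothetical interface.
public sealed class ClusterDowningReason : Reason, IExitCodeReason
{
    public static readonly Reason Instance = new ClusterDowningReason();
    private ClusterDowningReason() { }

    // Assumed value; any non-zero code would signal failure to a supervisor.
    public int ExitCode => 1;
}

// Stand-in for an untagged reason, which should keep the default exit code of 0.
public sealed class ClrExitReason : Reason
{
    public static readonly Reason Instance = new ClrExitReason();
    private ClrExitReason() { }
}

public static class ExitCodes
{
    // Tagged reasons supply their own code; everything else (including null) maps to 0.
    public static int For(Reason reason) =>
        reason is IExitCodeReason tagged ? tagged.ExitCode : 0;

    // Sets the code the process reports when it eventually exits.
    // Only overrides the default for non-zero codes.
    public static void Apply(Reason reason)
    {
        var code = For(reason);
        if (code != 0)
        {
            Environment.ExitCode = code;
        }
    }
}
```

Under these assumptions, the shutdown path would call something like ExitCodes.Apply(reason) with the Reason that was passed to Run, before the process exits. Untagged reasons leave the exit code alone, so existing behaviour would not change unless a Reason opts in.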

@Aaronontheweb Aaronontheweb modified the milestones: 1.5.40, 1.5.41 Mar 18, 2025
@Aaronontheweb Aaronontheweb modified the milestones: 1.5.41, 1.5.42 May 6, 2025