Purging Hanging Replicas (On Startup and via Command) #155
Conversation
There seems to be an edge case when I use … in the tests. I've wrapped the Dispose with try/catch for now.
I'll take a look at this later tonight. @davidfowl - you should look too 😆 since you asked for it!
@rynowak @davidfowl Are you guys going to look at it? Just want to know whether to fix the merge conflicts or not :)
Sorry for the delay, we do want this feature, and thanks for the time you put into this.
    }
}

public ValueTask<IDictionary<string, string>> SerializeReplica(ReplicaEvent replicaEvent)
Should this just be its own data type instead of piggy-backing on ReplicaEvent?
This doesn't roundtrip since we're just using the service name instead of the service definition.
The service name, or the PID if it's a process. But yeah, most of the fields are not necessary.
I could 1) just use a dictionary, or 2) use a string (that contains the container name or PID). In both cases, the serialize and deserialize will become obsolete and you would only have a HandleStaleReplica method.
It could also just be its own type. It's a little strange to use ReplicaEvent because you're not going to rehydrate all of its fields, just the basic stuff like the PID. I think either using a dictionary or using a new type would both be improvements.
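For illustration, the "own type" option could look like a minimal record holding only what a purge needs. The type name `StaleReplicaRecord` and its fields are hypothetical here, not the actual Tye schema:

```csharp
using System.Text.Json;

// Hypothetical minimal type for persisting stale replicas; records give us
// value equality, which makes round-trip checks straightforward.
public record StaleReplicaRecord(string Name, string ServiceType, int? Pid, string? ContainerId);

public static class StaleReplicaJson
{
    public static string Serialize(StaleReplicaRecord record) =>
        JsonSerializer.Serialize(record);

    public static StaleReplicaRecord? Deserialize(string json) =>
        JsonSerializer.Deserialize<StaleReplicaRecord>(json);
}
```

Unlike reusing ReplicaEvent, every field here round-trips by construction.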
src/Microsoft.Tye.Hosting/TyeHost.cs
Outdated
var runners = CreateRunners(_application, _args, app.Logger, app.Configuration);

using var replicaRegistry = new ReplicaRegistry(_application, runners);
return replicaRegistry.Reset();
meta-question:
All this work is done to preserve the layering in which each kind of runner knows how to clean up its own services, and that is expressed in the abstraction now.
As a result, we've got to read all of the configuration, create the runners, and create a registry just so we can loop through the events and kill processes.
I don't think it's terribly wrong, it's just different from what I expected when I think about "cleanup" of lingering services.
# Conflicts: # src/Microsoft.Tye.Hosting/TyeHost.cs
@rynowak do you think it's cleaner to pass a storage interface to all of the runners to save and restore?
{
    var startedTask = new TaskCompletionSource<bool>();
    var alreadyStarted = 0;
    var totalReplicas = host.Application.Services.Sum(s => s.Value.Description.Replicas);
It seems like there's a logic error here - shouldn't the success condition be totalReplicas - alreadyStarted == 0?
It's counting up to totalReplicas, and Description.Replicas is the desired state, so it's correct.
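The counting-up logic being discussed can be sketched as follows. This is a simplified stand-in (the class name and shape are illustrative, not the test helper's actual code): each Started event increments a counter, and the host is considered ready once the count reaches the total desired replicas.

```csharp
using System.Threading;

// Illustrative counter: success is reached when the number of started
// replicas climbs up to the desired total, not when a difference hits zero.
public class ReplicaStartCounter
{
    private readonly int _totalReplicas;
    private int _alreadyStarted;

    public ReplicaStartCounter(int totalReplicas) => _totalReplicas = totalReplicas;

    // Returns true when the replica that just started was the last one expected.
    // Interlocked.Increment makes this safe when events fire concurrently.
    public bool OnReplicaStarted() =>
        Interlocked.Increment(ref _alreadyStarted) == _totalReplicas;
}
```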
test/E2ETest/TestHelpers.cs
Outdated
Interlocked.Decrement(ref alreadyStopped);
}

if (alreadyStopped == 0)
The name of this seems misleading; it seems like it should be remaining.
@@ -33,5 +43,91 @@ public static string GetSolutionRootDirectory(string solution)

    throw new Exception($"Solution file {solution}.sln could not be found in {applicationBasePath} or its parent directories.");
}

public static async Task StartHostAndWaitForReplicasToStart(TyeHost host)
I like the approach used here better than polling with random HTTP clients and retries 👍
test/E2ETest/TyePurgeTests.cs
Outdated
{
    var pids = GetAllPids(host.Application);

    Assert.True(Directory.Exists(tyeDir.FullName));
Can we also verify that the directory gets deleted?
But yes, that might be a better idea.
# Conflicts: # src/Microsoft.Tye.Hosting/TyeHost.cs # test/E2ETest/TestHelpers.cs
@rynowak @davidfowl I don't mind calling the … Here's a funny story. Turns out that it was partly due to this …
Instead of awaiting … Be careful with those new C# 8 features 😆
src/tye/Program.PurgeCommand.cs
Outdated
{
    static partial class Program
    {
        public static Command CreatePurgeCommand(string[] args)
I'm not a fan of this command? Or maybe I'm just not a fan of the fact that we need to call into the runner to do a purge because the logic is there.
I think it's the latter. The other question is whether we'd auto-purge if you do run again.
@davidfowl It sounds reasonable that the entity which knows how to launch a certain type of replica would know how to kill replicas of that type.
I don't think I like the idea of having a class like ReplicaKiller that switch/cases through the types of replicas.
The best alternative that I can think of is: you can have a virtual Kill method in the ReplicaStatus class (ProcessStatus and DockerStatus provide their own implementations). Then, the WriteReplicaEvent in ReplicaRegistry would serialize the entire replica (as JSON?) together with its type. GetEvents will return a list of ReplicaStatus (instead of a list of StoreEvent) and the purge method will loop over all ReplicaStatuses and call .Kill().
How does it sound?
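A minimal sketch of the virtual Kill idea, assuming hypothetical `*Sketch` types (not the actual Tye classes). For illustration the overrides return the cleanup command they would run rather than executing it; a real implementation would shell out via something like ProcessUtil:

```csharp
// Each status type knows how to kill its own kind of replica, so no central
// switch/case over replica types is needed.
public abstract class ReplicaStatusSketch
{
    public abstract string Kill();
}

public class ProcessStatusSketch : ReplicaStatusSketch
{
    public int Pid { get; init; }

    // A process replica is killed by PID.
    public override string Kill() => $"kill {Pid}";
}

public class DockerStatusSketch : ReplicaStatusSketch
{
    public string ContainerId { get; init; } = "";

    // A docker replica is removed by container id.
    public override string Kill() => $"docker rm -f {ContainerId}";
}
```

The purge loop then just calls `.Kill()` on each deserialized status, without knowing which concrete type it holds.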
Maybe a cleaner design is just to pass the registry to the docker runner and process runner. I recall guiding you in this direction, but it seems like it might be much simpler now to delete the state when replicas are spun down in both the docker and process runner.
@davidfowl I could do it but how does it address your initial concern about the purge command or the fact that the runners contain the purge logic? Unless I'm misunderstanding something.
Let's stick to auto-purge on run and remove the purge command for now.
@davidfowl Sure thing. I've removed the purge command, but for now I've left the PurgeAsync method in the host.
This is because the purging logic is in the runners, and the host has access to create the runners, unless we change something structurally.
This can also be changed if I implement this:

> The best alternative that I can think of is, you can have a virtual Kill method in the ReplicaStatus class (ProcessStatus and DockerStatus provide their own implementation)
By the way, I've made this modification:
while (!dockerInfo.StoppingTokenSource.Token.IsCancellationRequested)
{
var logsRes = await ProcessUtil.RunAsync("docker", $"logs -f {containerId}",
outputDataReceived: data => service.Logs.OnNext($"[{replica}]: {data}"),
errorDataReceived: data => service.Logs.OnNext($"[{replica}]: {data}"),
throwOnError: false,
cancellationToken: dockerInfo.StoppingTokenSource.Token);
if (logsRes.ExitCode != 0)
{
break;
}
}
to #209 (added a break from the loop if the exit code is non-zero).
Turns out that if you kill the container externally, it enters an infinite loop.
It caused my tests to fail but I also think that it's a good check anyway.
What do you think?
- There should be no more purge command
> Turns out that if you kill the container externally, it enters an infinite loop.
> It caused my tests to fail but I also think that it's a good check anyway.
> What do you think?
This seems fine if it normally exits with a zero exit code when the container is restarting and failing.
Sweet!
using Microsoft.Tye.Hosting.Model;
using Microsoft.Extensions.Logging;
using System.Threading;
using System.Collections.Concurrent;
nit: Sort usings
private readonly ConcurrentDictionary<string, SemaphoreSlim> _fileWriteSemaphores;
private readonly string _tyeFolderPath;

public ReplicaRegistry(Model.Application application, ILogger logger)
Directly pass in the directory, not the application.
public class ReplicaRegistry : IDisposable
{
    private readonly ILogger _logger;
    private readonly ConcurrentDictionary<string, SemaphoreSlim> _fileWriteSemaphores;
Make these normal locks, not semaphores.
var contents = JsonSerializer.Serialize(replicaRecord, new JsonSerializerOptions { WriteIndented = false });
var semaphore = GetSempahoreForStore(storeName);

semaphore.Wait();
Why a semaphore vs a lock? Let's make this a regular lock object.
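The suggestion could look like the sketch below (class and method names are hypothetical): keep a plain lock object per store in a ConcurrentDictionary instead of a SemaphoreSlim, since the critical section is synchronous file I/O and nothing awaits while holding it.

```csharp
using System;
using System.Collections.Concurrent;

// Hypothetical per-store locking: a plain object + lock statement replaces
// SemaphoreSlim when the guarded work is fully synchronous.
public class StoreLocks
{
    private readonly ConcurrentDictionary<string, object> _locks = new();

    public void WithStoreLock(string storeName, Action criticalSection)
    {
        // GetOrAdd ensures exactly one gate object per store name.
        var gate = _locks.GetOrAdd(storeName, _ => new object());
        lock (gate)
        {
            criticalSection();
        }
    }
}
```

A SemaphoreSlim would only be needed if the code had to await inside the critical section, which `lock` does not allow.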
public static async Task StartHostAndWaitForReplicasToStart(TyeHost host)
{
    var startedTask = new TaskCompletionSource<bool>();
This should pass TaskCreationOptions.RunContinuationsAsynchronously.
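The suggested fix looks like this (helper name is illustrative): passing RunContinuationsAsynchronously means SetResult does not run awaiting continuations inline on the thread that completes the source, which avoids deadlocks and surprises when the completion happens inside an event handler.

```csharp
using System.Threading.Tasks;

public static class StartedSignal
{
    // Without this option, a thread calling SetResult can end up synchronously
    // executing whatever code is awaiting startedTask.Task.
    public static TaskCompletionSource<bool> Create() =>
        new TaskCompletionSource<bool>(TaskCreationOptions.RunContinuationsAsynchronously);
}
```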
This is a copy + improvement of davidfowl/Micronetes#48.
If the host closes unexpectedly, it leaves processes and containers hanging, which might cause errors like this, or complain that a port is in use in subsequent runs.
To address this, I register all ReplicaState.Started events in a file (TBD: use SQLite or some other established db) and delete (docker rm or pkill) the registered replicas whenever the host starts. I also expose a command (tye purge) that does the same thing.
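A minimal sketch of the described approach, assuming a simple JSON-lines file; the type names, file format, and record shape here are illustrative, and the PR's actual implementation may differ. Each started replica is appended to the file, and on the next host start the file is read back to decide what to clean up. For testability the sketch returns the cleanup commands instead of executing them:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text.Json;

public static class ReplicaRegistrySketch
{
    // Kind is "process" or "docker"; Id is a PID or container id respectively.
    public record StartedReplica(string Name, string Kind, string Id);

    // Append one JSON line per started replica.
    public static void RecordStarted(string path, StartedReplica replica) =>
        File.AppendAllText(path, JsonSerializer.Serialize(replica) + Environment.NewLine);

    // Read the recorded replicas and produce the cleanup command for each:
    // kill by PID for processes, docker rm -f for containers.
    public static IEnumerable<string> GetPurgeCommands(string path) =>
        !File.Exists(path)
            ? Enumerable.Empty<string>()
            : File.ReadAllLines(path)
                .Where(line => !string.IsNullOrWhiteSpace(line))
                .Select(line => JsonSerializer.Deserialize<StartedReplica>(line)!)
                .Select(r => r.Kind == "docker" ? $"docker rm -f {r.Id}" : $"kill {r.Id}");
}
```

On startup the host would run these commands and then truncate the file, so each run starts from a clean slate.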