Skip to content
This repository was archived by the owner on Nov 20, 2023. It is now read-only.

Purging Hanging Replicas (On Startup and via Command) #155

Merged
merged 21 commits into from
Mar 28, 2020

Conversation

areller
Copy link
Contributor

@areller areller commented Mar 20, 2020

This is copy + improvement on davidfowl/Micronetes#48

If the host closes unexpectedly, it leaves processes and containers hanging and it might cause errors like this

image

Or complain that a port is in use in subsequent runs.

To address this I register all ReplicaState.Started events in a file (TBD: Use Sqlite or some other established db) and delete (docker rm or pkill) the registered replicas whenever the host starts.

I also expose a command (tye purge) that does the same thing.

@areller areller force-pushed the clean-previous-run branch from ad2712d to 694723f Compare March 20, 2020 04:19
@areller
Copy link
Contributor Author

areller commented Mar 20, 2020

There seems to be an edge case when I use

using var tempDirectory = TempDirectory.Create();

In the tests.
Sometimes the dispose fails with UnauthorizedAccessException

I've wrapped the Dispose with try/catch for now

@rynowak
Copy link
Member

rynowak commented Mar 20, 2020

I'll take a look at this later tonite. @davidfowl - you should look too 😆 since you asked for it!

@areller
Copy link
Contributor Author

areller commented Mar 23, 2020

@rynowak @davidfowl Are you guys gonna look at it. Just want to know whether to fix the merge conflicts or not :)

Copy link
Member

@rynowak rynowak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delay, we do want this feature, and thanks for the time you put into this.

}
}

public ValueTask<IDictionary<string, string>> SerializeReplica(ReplicaEvent replicaEvent)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this just be its own data type instead of piggy-backing on ReplicaEvent?

This doesn't roundtrip since we're just using the service name instead of the service definition.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The service name of the PID if it's a process.
But yeah, most of the fields are not necessary.
I could 1) Just use a dictionary or 2) Use a string (that contains the container name or PID).
In both cases, the serialize and deserialize will become obsolete and you would only have a HandleStaleReplica method

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could also just be its own type. It's a little strange to use ReplicaEvent because you're not going to rehydrate all of its fields, just the basic stuff like the PID. I think either using a dictionary or using a new type would both be improvments.

var runners = CreateRunners(_application, _args, app.Logger, app.Configuration);

using var replicaRegistry = new ReplicaRegistry(_application, runners);
return replicaRegistry.Reset();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

meta-question:

All this work is done to preserve the layering that each kind of runner knows how to clean up it's own services, and that is expressed in the abstraction now.

As a result, we've got to read all of the configuration, and create the runners, and create a registry just so we can loop through the events and kill processes.

I don't think it's terribly wrong, it's just different from what I expected when I think about "cleanup" of lingering services.

@areller areller force-pushed the clean-previous-run branch from bc4da7e to 0eb52e2 Compare March 24, 2020 21:37
@areller areller force-pushed the clean-previous-run branch from 0eb52e2 to 15622f0 Compare March 24, 2020 21:42
@davidfowl
Copy link
Member

@rynowak yo think it’s cleaner to pass a storage interface to all of the runners to save and restore?

@areller areller force-pushed the clean-previous-run branch from ed87291 to c3ecf8c Compare March 25, 2020 02:46
{
var startedTask = new TaskCompletionSource<bool>();
var alreadyStarted = 0;
var totalReplicas = host.Application.Services.Sum(s => s.Value.Description.Replicas);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like there's a logic error here - shouldn't the success condition be totalReplicas - alreadStarted == 0?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's counting up to totalReplicas and these description.replicas is the desired state, it's correct.

Interlocked.Decrement(ref alreadyStopped);
}

if (alreadyStopped == 0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The name of this seems misleading, like it should be remaining

@@ -33,5 +43,91 @@ public static string GetSolutionRootDirectory(string solution)

throw new Exception($"Solution file {solution}.sln could not be found in {applicationBasePath} or its parent directories.");
}

public static async Task StartHostAndWaitForReplicasToStart(TyeHost host)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the approach used here better than polling with random HTTP clients and retries 👍

{
var pids = GetAllPids(host.Application);

Assert.True(Directory.Exists(tyeDir.FullName));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we also verify that the directory gets deleted

@rynowak
Copy link
Member

rynowak commented Mar 25, 2020

@rynowak yo think it’s cleaner to pass a storage interface to all of the runners to save and restore?

Maybe you want to make that suggestion to @areller since this is his PR :)

@rynowak
Copy link
Member

rynowak commented Mar 25, 2020

But yes, that might be a better idea.

areller added 2 commits March 25, 2020 21:37
# Conflicts:
#	src/Microsoft.Tye.Hosting/TyeHost.cs
#	test/E2ETest/TestHelpers.cs
@areller
Copy link
Contributor Author

areller commented Mar 26, 2020

@rynowak @davidfowl I don't mind calling the WriteEvent function directly from the runners if you want, but I'm not sure there is a benefit to it.
The benefit of doing it via listening to the ReplicaEvents is that it saves you maintenance in the case you want to edit/refactor the runners or in the case you want to add more tasks that should run when a replica is started/stopped.

Here's a funny story.
I've spent over an hour trying to find a race condition that caused the purge tests to fail sometimes.

Turns out that it was partly due to this

using var replicaRegistry = new ReplicaRegistry(_application, app.Logger, runners);
return replicaRegistry.Delete();

Instead of awaiting Delete() I used to return the task directly, but it meant that replicaRegistry got cleaned before Delete() was completed which triggered some other race condition.

Be careful with those new C# 8 features 😆

{
static partial class Program
{
public static Command CreatePurgeCommand(string[] args)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not a fan of this command? Or maybe I'm just not a fan of the fact that we need to call into the runner to to a purge because the logic is there.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's the latter. The other question is whether we'd auto-purge if you do run again.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@davidfowl It sounds reasonable that the entity which knows how to launch a certain type of replica would know how to kill replicas of that type.
I don't think I like the idea of having a class like ReplicaKiller that switch/cases through the types of replicas.

The best alternative that I can think of is, you can have a virtual Kill method in the ReplicaStatus class (ProcessStatus and DockerStatus provide their own implementation)

Then, the WriteReplicaEvent in ReplicaRegistry would serialize the entire replica (as JSON?) together with its type.
GetEvents will return a list of ReplicaStatus (instead of list of StoreEvent) and the purge method will loop over all ReplicaStatuses and call .Kill()

How does it sound?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe a cleaner design is just to pass the registry to the docker runner and process runner. I recall me guiding you this direction but it seems like it might be much simpler now to delete the state when replicas are spun down in both the docker and process runner.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@davidfowl I could do it but how does it address your initial concern about the purge command or the fact that the runners contain the purge logic? Unless I'm misunderstanding something.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets stick to auto purge on run and remove the purge command for now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@davidfowl Sure thing. I've removed the purge command but for now I've left the PurgeAsyc method in the host.
This is because the purging logic is in the runners and the host has access to create the runners, unless we change something structurally.
This can also be changed if I implement this

The best alternative that I can think of is, you can have a virtual Kill method in the ReplicaStatus class (ProcessStatus and DockerStatus provide their own implementation)

By the way, I've made this modification

while (!dockerInfo.StoppingTokenSource.Token.IsCancellationRequested)
{
    var logsRes = await ProcessUtil.RunAsync("docker", $"logs -f {containerId}",
        outputDataReceived: data => service.Logs.OnNext($"[{replica}]: {data}"),
        errorDataReceived: data => service.Logs.OnNext($"[{replica}]: {data}"),
        throwOnError: false,
        cancellationToken: dockerInfo.StoppingTokenSource.Token);

    if (logsRes.ExitCode != 0)
    {
        break;
    }
}

to #209 (added a break from the loop if the exit code is non zero)

Turns out that if you kill the container externally, it enters an infinite loop.
It caused my tests to fail but I also think that it's a good check anyway.
What do you think?

Copy link
Member

@davidfowl davidfowl Mar 27, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • There should be no more purge command

Turns out that if you kill the container externally, it enters an infinite loop.
It caused my tests to fail but I also think that it's a good check anyway.
What do you think?

This seems fine if it normally exit with a zero exit code, when the container is restarting and failing.

@areller areller requested a review from jkotalik as a code owner March 27, 2020 14:24
@davidfowl
Copy link
Member

Sweet!

using Microsoft.Tye.Hosting.Model;
using Microsoft.Extensions.Logging;
using System.Threading;
using System.Collections.Concurrent;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Sort usings

private readonly ConcurrentDictionary<string, SemaphoreSlim> _fileWriteSemaphores;
private readonly string _tyeFolderPath;

public ReplicaRegistry(Model.Application application, ILogger logger)
Copy link
Member

@davidfowl davidfowl Mar 28, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Directly pass in the directory, not the application.

public class ReplicaRegistry : IDisposable
{
private readonly ILogger _logger;
private readonly ConcurrentDictionary<string, SemaphoreSlim> _fileWriteSemaphores;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make these normal locks, not semaphores.

var contents = JsonSerializer.Serialize(replicaRecord, new JsonSerializerOptions { WriteIndented = false });
var semaphore = GetSempahoreForStore(storeName);

semaphore.Wait();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why a semaphore vs a lock? Lets make this a regular lock object.


public static async Task StartHostAndWaitForReplicasToStart(TyeHost host)
{
var startedTask = new TaskCompletionSource<bool>();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should pass TaskCreationOptions.RunContinuationsAsynchronously

@davidfowl davidfowl merged commit 42bcc8a into dotnet:master Mar 28, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants