Purging Hanging Replicas (On Startup and via Command) #155

areller · 2020-03-20T02:47:03Z

This is a copy of, and an improvement on, davidfowl/Micronetes#48

If the host closes unexpectedly, it leaves processes and containers hanging. This can cause errors on subsequent runs, such as a complaint that a port is already in use.

To address this I register all ReplicaState.Started events in a file (TBD: Use Sqlite or some other established db) and delete (docker rm or pkill) the registered replicas whenever the host starts.

I also expose a command (tye purge) that does the same thing.
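
To make the idea concrete, here is a minimal sketch of a file-backed registry, not the code in this PR. It assumes one JSON record per line; the class name `SimpleReplicaRegistry`, the file name `replicas.jsonl`, and the `kill` callback are hypothetical stand-ins for whatever the runners actually do (docker rm or pkill).

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text.Json;

// Hypothetical, simplified registry: one JSON record per line in a single file.
sealed class SimpleReplicaRegistry
{
    private readonly string _filePath;
    private readonly object _gate = new object();

    public SimpleReplicaRegistry(string directory)
    {
        Directory.CreateDirectory(directory);
        // The file name is an assumption made for this sketch.
        _filePath = Path.Combine(directory, "replicas.jsonl");
    }

    // Called when a replica starts; the record holds just enough to kill it later
    // (for example a PID or a container name).
    public void Record(IDictionary<string, string> replica)
    {
        var line = JsonSerializer.Serialize(replica);
        lock (_gate)
        {
            File.AppendAllText(_filePath, line + Environment.NewLine);
        }
    }

    // Called on host startup: hands every stale record to `kill`, then deletes the file.
    // Returns the number of records that were processed.
    public int Purge(Action<IDictionary<string, string>> kill)
    {
        lock (_gate)
        {
            if (!File.Exists(_filePath))
            {
                return 0;
            }

            var count = 0;
            foreach (var line in File.ReadAllLines(_filePath))
            {
                if (string.IsNullOrWhiteSpace(line))
                {
                    continue;
                }

                var record = JsonSerializer.Deserialize<Dictionary<string, string>>(line);
                if (record is null)
                {
                    continue;
                }

                kill(record);
                count++;
            }

            File.Delete(_filePath);
            return count;
        }
    }
}
```

On startup the host would call `Purge` before launching anything, so a crash in the previous run leaves at most one run's worth of stale records behind.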

areller · 2020-03-20T04:24:27Z

There seems to be an edge case when I use

using var tempDirectory = TempDirectory.Create();

in the tests.
Sometimes Dispose fails with an UnauthorizedAccessException.

I've wrapped the Dispose call in a try/catch for now.

rynowak · 2020-03-20T23:56:29Z

I'll take a look at this later tonite. @davidfowl - you should look too 😆 since you asked for it!

areller · 2020-03-23T23:17:07Z

@rynowak @davidfowl Are you guys gonna look at it? Just want to know whether to fix the merge conflicts or not :)

rynowak

Sorry for the delay, we do want this feature, and thanks for the time you put into this.

src/Microsoft.Tye.Hosting/DockerRunner.cs

src/Microsoft.Tye.Hosting/ReplicaRegistry.cs

src/Microsoft.Tye.Hosting/IReplicaInstantiator.cs

rynowak · 2020-03-24T04:17:17Z

src/Microsoft.Tye.Hosting/DockerRunner.cs

+            }
+        }
+
+        public ValueTask<IDictionary<string, string>> SerializeReplica(ReplicaEvent replicaEvent)


Should this just be its own data type instead of piggy-backing on ReplicaEvent?

This doesn't roundtrip since we're just using the service name instead of the service definition.

The service name, or the PID if it's a process.
But yeah, most of the fields are not necessary.
I could 1) Just use a dictionary or 2) Use a string (that contains the container name or PID).
In both cases, the serialize and deserialize will become obsolete and you would only have a HandleStaleReplica method

It could also just be its own type. It's a little strange to use ReplicaEvent because you're not going to rehydrate all of its fields, just the basic stuff like the PID. I think either using a dictionary or using a new type would be an improvement.

src/Microsoft.Tye.Hosting/ProcessRunner.cs

src/Microsoft.Tye.Hosting/ReplicaRegistry.cs

src/Microsoft.Tye.Hosting/ReplicaStateRecorder.cs

rynowak · 2020-03-24T04:48:21Z

src/Microsoft.Tye.Hosting/TyeHost.cs

+            var runners = CreateRunners(_application, _args, app.Logger, app.Configuration);
+
+            using var replicaRegistry = new ReplicaRegistry(_application, runners);
+            return replicaRegistry.Reset();


meta-question:

All this work is done to preserve the layering in which each kind of runner knows how to clean up its own services, and that is expressed in the abstraction now.

As a result, we've got to read all of the configuration, and create the runners, and create a registry just so we can loop through the events and kill processes.

I don't think it's terribly wrong, it's just different from what I expected when I think about "cleanup" of lingering services.

davidfowl · 2020-03-25T02:15:35Z

@rynowak you think it’s cleaner to pass a storage interface to all of the runners to save and restore?

rynowak · 2020-03-25T05:27:04Z

test/E2ETest/TestHelpers.cs

+        {
+            var startedTask = new TaskCompletionSource<bool>();
+            var alreadyStarted = 0;
+            var totalReplicas = host.Application.Services.Sum(s => s.Value.Description.Replicas);


It seems like there's a logic error here - shouldn't the success condition be totalReplicas - alreadyStarted == 0?

It counts up to totalReplicas, and Description.Replicas is the desired state, so it's correct.

rynowak · 2020-03-25T05:28:30Z

test/E2ETest/TestHelpers.cs

+                    Interlocked.Decrement(ref alreadyStopped);
+                }
+
+                if (alreadyStopped == 0)


The name of this seems misleading, like it should be remaining

rynowak · 2020-03-25T05:28:55Z

test/E2ETest/TestHelpers.cs

@@ -33,5 +43,91 @@ public static string GetSolutionRootDirectory(string solution)

            throw new Exception($"Solution file {solution}.sln could not be found in {applicationBasePath} or its parent directories.");
        }
+
+        public static async Task StartHostAndWaitForReplicasToStart(TyeHost host)


I like the approach used here better than polling with random HTTP clients and retries 👍

rynowak · 2020-03-25T05:29:28Z

test/E2ETest/TyePurgeTests.cs

+            {
+                var pids = GetAllPids(host.Application);
+
+                Assert.True(Directory.Exists(tyeDir.FullName));


Can we also verify that the directory gets deleted?

rynowak · 2020-03-25T05:32:02Z

@rynowak you think it’s cleaner to pass a storage interface to all of the runners to save and restore?

Maybe you want to make that suggestion to @areller since this is his PR :)

rynowak · 2020-03-25T05:32:21Z

But yes, that might be a better idea.

areller · 2020-03-26T02:24:51Z

@rynowak @davidfowl I don't mind calling the WriteEvent function directly from the runners if you want, but I'm not sure there is a benefit to it.
The benefit of doing it via listening to the ReplicaEvents is that it saves you maintenance in the case you want to edit/refactor the runners or in the case you want to add more tasks that should run when a replica is started/stopped.

Here's a funny story.
I've spent over an hour trying to find a race condition that caused the purge tests to fail sometimes.

Turns out that it was partly due to this

using var replicaRegistry = new ReplicaRegistry(_application, app.Logger, runners);
return replicaRegistry.Delete();

Instead of awaiting Delete(), I returned the task directly. Because of the using declaration, replicaRegistry was disposed as soon as the method returned, before Delete() had completed, which triggered some other race condition.

Be careful with those new C# 8 features 😆
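
For anyone reading along, the pitfall can be reproduced in isolation. The sketch below is only an illustration: `FakeRegistry`, `DeleteAsync`, and `UsingPitfall` are hypothetical names standing in for ReplicaRegistry and its Delete method, and the delay simulates file I/O.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical stand-in for ReplicaRegistry; only the dispose timing matters here.
sealed class FakeRegistry : IDisposable
{
    public bool Disposed { get; private set; }

    // Returns true if the registry was already disposed when the work finished.
    public async Task<bool> DeleteAsync()
    {
        await Task.Delay(50); // simulate slow file I/O
        return Disposed;
    }

    public void Dispose() => Disposed = true;
}

static class UsingPitfall
{
    // Returns the task directly: the using declaration disposes the registry
    // as soon as this method returns, while DeleteAsync is still running.
    public static Task<bool> PurgeWithoutAwait()
    {
        using var registry = new FakeRegistry();
        return registry.DeleteAsync();
    }

    // Awaits inside the scope: Dispose runs only after DeleteAsync completes.
    public static async Task<bool> PurgeWithAwait()
    {
        using var registry = new FakeRegistry();
        return await registry.DeleteAsync();
    }
}
```

Awaiting `PurgeWithoutAwait()` yields true (the registry was disposed mid-operation), while `PurgeWithAwait()` yields false. The same applies to a classic using block; the C# 8 declaration form just makes the scope less visible.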

davidfowl · 2020-03-26T18:06:05Z

src/tye/Program.PurgeCommand.cs

+{
+    static partial class Program
+    {
+        public static Command CreatePurgeCommand(string[] args)


I'm not a fan of this command? Or maybe I'm just not a fan of the fact that we need to call into the runner to do a purge because the logic is there.

I think it's the latter. The other question is whether we'd auto-purge if you do run again.

@davidfowl It sounds reasonable that the entity which knows how to launch a certain type of replica would know how to kill replicas of that type.
I don't think I like the idea of having a class like ReplicaKiller that switch/cases through the types of replicas.

The best alternative that I can think of is, you can have a virtual Kill method in the ReplicaStatus class (ProcessStatus and DockerStatus provide their own implementation)

Then, the WriteReplicaEvent in ReplicaRegistry would serialize the entire replica (as JSON?) together with its type.
GetEvents will return a list of ReplicaStatus (instead of list of StoreEvent) and the purge method will loop over all ReplicaStatuses and call .Kill()

How does it sound?

Maybe a cleaner design is just to pass the registry to the docker runner and process runner. I recall guiding you in this direction, but it seems like it might be much simpler now to delete the state when replicas are spun down in both the docker and process runner.

@davidfowl I could do it but how does it address your initial concern about the purge command or the fact that the runners contain the purge logic? Unless I'm misunderstanding something.

Let's stick to auto purge on run and remove the purge command for now.

@davidfowl Sure thing. I've removed the purge command but for now I've left the PurgeAsync method in the host.
This is because the purging logic is in the runners and the host has access to create the runners, unless we change something structurally.
This can also be changed if I implement this

The best alternative that I can think of is, you can have a virtual Kill method in the ReplicaStatus class (ProcessStatus and DockerStatus provide their own implementation)

By the way, I've made this modification

while (!dockerInfo.StoppingTokenSource.Token.IsCancellationRequested)
{
    var logsRes = await ProcessUtil.RunAsync("docker", $"logs -f {containerId}",
        outputDataReceived: data => service.Logs.OnNext($"[{replica}]: {data}"),
        errorDataReceived: data => service.Logs.OnNext($"[{replica}]: {data}"),
        throwOnError: false,
        cancellationToken: dockerInfo.StoppingTokenSource.Token);

    if (logsRes.ExitCode != 0)
    {
        break;
    }
}

to #209 (added a break from the loop if the exit code is non zero)

Turns out that if you kill the container externally, it enters an infinite loop.
It caused my tests to fail but I also think that it's a good check anyway.
What do you think?

There should be no more purge command

Turns out that if you kill the container externally, it enters an infinite loop.
It caused my tests to fail but I also think that it's a good check anyway.
What do you think?

This seems fine if it normally exits with a zero exit code when the container is restarting and failing.

src/Microsoft.Tye.Hosting/ReplicaStateRecorder.cs

src/Microsoft.Tye.Hosting/ReplicaRegistry.cs

src/Microsoft.Tye.Hosting/TyeHost.cs

davidfowl · 2020-03-28T02:13:45Z

Sweet!

davidfowl · 2020-03-28T02:19:18Z

src/Microsoft.Tye.Hosting/ReplicaRegistry.cs

+using Microsoft.Tye.Hosting.Model;
+using Microsoft.Extensions.Logging;
+using System.Threading;
+using System.Collections.Concurrent;


nit: Sort usings

davidfowl · 2020-03-28T02:19:31Z

src/Microsoft.Tye.Hosting/ReplicaRegistry.cs

+        private readonly ConcurrentDictionary<string, SemaphoreSlim> _fileWriteSemaphores;
+        private readonly string _tyeFolderPath;
+
+        public ReplicaRegistry(Model.Application application, ILogger logger)


Directly pass in the directory, not the application.

davidfowl · 2020-03-28T02:20:34Z

src/Microsoft.Tye.Hosting/ReplicaRegistry.cs

+    public class ReplicaRegistry : IDisposable
+    {
+        private readonly ILogger _logger;
+        private readonly ConcurrentDictionary<string, SemaphoreSlim> _fileWriteSemaphores;


Make these normal locks, not semaphores.

davidfowl · 2020-03-28T02:21:09Z

src/Microsoft.Tye.Hosting/ReplicaRegistry.cs

+            var contents = JsonSerializer.Serialize(replicaRecord, new JsonSerializerOptions { WriteIndented = false });
+            var semaphore = GetSempahoreForStore(storeName);
+
+            semaphore.Wait();


Why a semaphore vs a lock? Let's make this a regular lock object.

davidfowl · 2020-03-28T02:33:57Z

test/E2ETest/TestHelpers.cs

+
+        public static async Task StartHostAndWaitForReplicasToStart(TyeHost host)
+        {
+            var startedTask = new TaskCompletionSource<bool>();


This should pass TaskCreationOptions.RunContinuationsAsynchronously
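
A rough sketch of what that could look like, combined with the count-up check discussed for this helper. `StartCounter` and `OnReplicaStarted` are hypothetical names, not the code in TestHelpers.cs. Without RunContinuationsAsynchronously, code awaiting the task may run inline on the thread that calls TrySetResult, which here would be whichever thread raises the replica event.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper: completes once `total` start notifications have arrived.
sealed class StartCounter
{
    private readonly int _total;
    private int _started;

    // RunContinuationsAsynchronously keeps awaiting code from running inline
    // on the thread that completes the task.
    private readonly TaskCompletionSource<bool> _tcs =
        new TaskCompletionSource<bool>(TaskCreationOptions.RunContinuationsAsynchronously);

    public StartCounter(int total) => _total = total;

    public Task Completion => _tcs.Task;

    // Safe to call concurrently from several event callbacks.
    public void OnReplicaStarted()
    {
        // Compare the value returned by Interlocked.Increment rather than
        // re-reading the field, so exactly one caller observes the final count.
        if (Interlocked.Increment(ref _started) == _total)
        {
            _tcs.TrySetResult(true);
        }
    }
}
```

A test would subscribe `OnReplicaStarted` to the started events, start the host, and then await `Completion`.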

areller added 4 commits March 19, 2020 22:41

purge hanging replicas

0838693

merge with master

c66997f

use dictionary for event serialization

7513773

fix tests throwing unauthorized access

694723f

areller force-pushed the clean-previous-run branch from ad2712d to 694723f Compare March 20, 2020 04:19

areller mentioned this pull request Mar 20, 2020

Cleaning Previous Run Leftovers (e.g. Containers) davidfowl/Micronetes#48

Closed

rynowak reviewed Mar 24, 2020

areller added 3 commits March 24, 2020 16:26

Merge branch 'master' into clean-previous-run

aee2129

# Conflicts: # src/Microsoft.Tye.Hosting/TyeHost.cs

merge with master works

0c38bd7

license headers

f100f75

areller force-pushed the clean-previous-run branch from bc4da7e to 0eb52e2 Compare March 24, 2020 21:37

fix PR comments

15622f0

areller force-pushed the clean-previous-run branch from 0eb52e2 to 15622f0 Compare March 24, 2020 21:42

areller added 3 commits March 24, 2020 22:20

subscribing to replica events instead of waiting for arbitrary time in tests

1c1ed5c

only return list of running containers from DockerAssert

4c5f7c6

formatting and build warnings

c3ecf8c

areller force-pushed the clean-previous-run branch from ed87291 to c3ecf8c Compare March 25, 2020 02:46

rynowak reviewed Mar 25, 2020

areller added 3 commits March 25, 2020 21:12

fix edge cases that caused tests to fail sometimes

497a6f2

use IDictionary instead of deserializing to ReplicaStatus when removing stale replicas

8b8e46f

format

8f462ac

areller added 2 commits March 25, 2020 21:37

Merge branch 'master' into clean-previous-run

23fa349

# Conflicts: # src/Microsoft.Tye.Hosting/TyeHost.cs # test/E2ETest/TestHelpers.cs

fix merge warnings/errors and format

bd0e31c

davidfowl reviewed Mar 26, 2020

areller added 3 commits March 26, 2020 16:50

Merge branch 'master' into clean-previous-run

01e0a51

exit logs loop in DockerRunner if container is killed

2716f41

remove purge command

74f7198

areller requested a review from jkotalik as a code owner March 27, 2020 14:24

davidfowl reviewed Mar 27, 2020

src/Microsoft.Tye.Hosting/ReplicaStateRecorder.cs (outdated, resolved)

davidfowl reviewed Mar 27, 2020

src/Microsoft.Tye.Hosting/ReplicaRegistry.cs (resolved)

davidfowl reviewed Mar 27, 2020

src/Microsoft.Tye.Hosting/TyeHost.cs (outdated, resolved)

davidfowl reviewed Mar 27, 2020

src/Microsoft.Tye.Hosting/TyeHost.cs (outdated, resolved)

areller added 2 commits March 27, 2020 20:14

merge with master

bda5168

runners interact directly with replica registry

c008e52

davidfowl reviewed Mar 28, 2020

davidfowl approved these changes Mar 28, 2020

davidfowl merged commit 42bcc8a into dotnet:master Mar 28, 2020

rynowak mentioned this pull request Mar 29, 2020

Regular status updates from the Tye team #251

Closed
