Description
This is consistently reproducible for me now. When a container is unpaused, its etcd keys are first unwatched and possibly deleted before the unpause operation completes. Concurrently with the unpause, a container remove can happen (I'm still trying to figure out what triggers this; my theory is that unwatching the container keys leads whatever is watching them to request that the container be deleted). No matter the result of the unpause operation, the future that requested it always completes successfully, so the etcd key for the container is re-written as part of the .map after the unpause future completes. Since the container was deleted, that key is no longer tracked and will live indefinitely unless the component is restarted. This is a critical bug for me moving forward with the scheduler: it's easily reproducible, and it can block activations from processing for an affected action when the memory queue still thinks a container is running that isn't.
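To make the "always successful future" point concrete, here is a minimal, self-contained Scala sketch of the pattern I believe is at play (the function names are mine, not the invoker's): if the layer that runs docker unpause recovers the failure itself, for example just to emit a metric, the future it hands back is successful and every downstream .map still runs.

import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._
import ExecutionContext.Implicits.global

// Hypothetical stand-in for the docker CLI call; here it always fails.
def runDockerUnpause(id: String): Future[Unit] =
  Future.failed(new RuntimeException(s"docker unpause $id failed"))

// If the lower layer handles the failure itself (e.g. only emits a metric) and
// recovers to a successful value, callers can no longer observe the error.
def resume(id: String): Future[Unit] =
  runDockerUnpause(id).recover {
    case t => println(s"emit unpause error metric: ${t.getMessage}")
  }

// The downstream .map then always runs; in the proxy this is where the
// existingContainers key for the (already removed) container gets written back.
val done = resume("ca688d9b746b").map { _ =>
  println("re-register existingContainers key in etcd")
}

Await.ready(done, 5.seconds)

Running something like this prints both the metric line and the re-register line, which mirrors the ordering I see in the logs below.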
Relevant logs (a couple of them I added locally to follow the exact code path better):
[2022-09-14T00:00:12.526Z] [INFO] [#tid_sid_unknown] [ContainerManager] Choose a warmed container whisk/warmed
[2022-09-14T00:00:12.580Z] [INFO] [#tid_sid_unknown] [FunctionPullingContainerPool] Choose warmed container
[2022-09-14T00:00:12.581Z] [INFO] [#tid_sid_containerCreation] [DockerClientWithFileAccess] running /usr/bin/docker unpause ca688d9b746b1da6a992e8c43e03a6df44a56608ec20ce3cccff07dd0940a22d (timeout: 10 seconds)
#these two unwatch logs show what I described above: the unregister of the data can happen before the register of the data that happens when the unpause completes
[2022-09-14T00:00:12.581Z] [INFO] [#tid_sid_unknown] [WatcherService] unwatch endpoint: UnwatchEndpoint(whisk/warmed
[2022-09-14T00:00:12.581Z] [INFO] [#tid_sid_unknown] [WatcherService] unwatch endpoint: UnwatchEndpoint(whisk/namespace
[2022-09-14T00:00:12.581Z] [INFO] [#tid_sid_invokerNanny] [DockerClientWithFileAccess] running /usr/bin/docker rm -f ca688d9b746b1da6a992e8c43e03a6df44a56608ec20ce3cccff07dd0940a22d (timeout: 1 minute)
#added this log locally indicating the etcd key is removed
[2022-09-14T00:00:12.585Z] [INFO] [#tid_sid_unknown] [EtcdWorker] successfully removed whisk/warmed
[2022-09-14T00:00:12.588Z] [INFO] [#tid_sid_unknown] [EtcdWorker] successfully removed whisk/namespace
[2022-09-14T00:00:12.595Z] [INFO] [#tid_DgxZq5kjsHIwixkSsI5onK8zYSgpOkxM] [MemoryQueue] Remove containerId because ca688d9b746b1da6a992e8c43e03a6df44a56608ec20ce3cccff07dd0940a22d is not alive
#unpause operation completes
[2022-09-14T00:00:12.705Z] [INFO] [#tid_sid_containerCreation] [FunctionPullingContainerProxy] Resumed container ContainerId(ca688d9b746b1da6a992e8c43e03a6df44a56608ec20ce3cccff07dd0940a22d)
[2022-09-14T00:00:12.706Z] [INFO] [#tid_sid_unknown] [WatcherService] watch endpoint: WatchEndpoint(whisk/namespace
#key is added back to etcd for deleted container
[2022-09-14T00:00:12.711Z] [INFO] [#tid_sid_unknown] [EtcdWorker] successfully stored whisk/namespace
#added this log locally to denote when the memory queue sees an existing container and adds it to its existing set
[2022-09-14T00:00:12.711Z] [INFO] [#tid_sid_unknown] [MemoryQueue] [redacted:Running] start tracking container ca688d9b746b1da6a992e8c43e03a6df44a56608ec20ce3cccff07dd0940a22d
[2022-09-14T00:11:09.557Z] [INFO] [#tid_sid_unknown] [MemoryQueue] [redacted:Running] The queue is timed out but there are still 0 activation messages or (running: 1 -> Set(ca688d9b746b1da6a992e8c43e03a6df44a56608ec20ce3cccff07dd0940a22d), in-progress: 0 -> Set()) containers
Relevant code:
when(Paused) {
  case Event(job: Initialize, data: WarmData) =>
    implicit val transId = job.transId
    val parent = context.parent
    cancelTimer(IdleTimeoutName)
    cancelTimer(KeepingTimeoutName)
    data.container
      .resume()
      .map { _ =>
        logging.info(this, s"Resumed container ${data.container.containerId}")
        // put existing key again
        dataManagementService ! RegisterData(
          s"${ContainerKeys.existingContainers(
            data.invocationNamespace,
            data.action.fullyQualifiedName(true),
            data.action.rev,
            Some(instance),
            Some(data.container.containerId))}",
          "")
        parent ! Resumed(data)
        // the new queue may locates on an different scheduler, so recreate the activation client when necessary
        // since akka port will no be used, we can put any value except 0 here
        data.clientProxy ! RequestActivation(
          newScheduler = Some(SchedulerEndpoints(job.schedulerHost, job.rpcPort, 10)))
        startSingleTimer(UnusedTimeoutName, StateTimeout, unusedTimeout)
        timedOut = false
      }
      .recover {
        case t: Throwable =>
          logging.error(this, s"Failed to resume container ${data.container.containerId}, error: $t")
          parent ! ResumeFailed(data)
          self ! ResumeFailed(data)
      }

    // always clean data in etcd regardless of success and failure
    dataManagementService ! UnregisterData(
      ContainerKeys.warmedContainers(
        data.invocationNamespace,
        data.action.fullyQualifiedName(false),
        data.revision,
        instance,
        data.container.containerId))
    goto(Running)
You can see here that dataManagementService sends the request to unregister the etcd keys without being chained to the unpause operation, so it can happen before the docker unpause occurs. The .map is always entered even if the docker unpause fails (verifiable through the unpause error metric emitted from LoggingMarkers.INVOKER_DOCKER_CMD). The only thing left for me to really figure out is why docker rm occurs at the same time as the unpause operation. Maybe it's as simple as the unpause failing for whatever reason and triggering the removal, in which case the fix may be as simple as making sure the etcd write doesn't occur if the unpause fails; but I'm not yet convinced the docker rm isn't caused by some other side effect.
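For what it's worth, here is a rough, self-contained sketch of the direction I have in mind (stand-in names, not a tested patch against FunctionPullingContainerProxy): only write the existingContainers key back when the resume actually succeeded, and order the warmed-key cleanup after the unpause attempt instead of firing it concurrently.

import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._
import scala.util.{Failure, Success}
import ExecutionContext.Implicits.global

// Hypothetical stand-ins for the pieces involved; names are mine.
def dockerUnpause(id: String): Future[Unit] =
  Future.failed(new RuntimeException("unpause failed"))
def registerExistingKey(id: String): Unit = println(s"register existingContainers key for $id")
def unregisterWarmedKey(id: String): Unit = println(s"unregister warmed key for $id")
def notifyResumeFailed(id: String): Unit = println(s"send ResumeFailed for $id")

def resumeAndTrack(id: String): Future[Unit] =
  dockerUnpause(id)
    .transform {
      case Success(_) =>
        // Only re-write the etcd key once the container is known to be running again.
        registerExistingKey(id)
        Success(())
      case Failure(t) =>
        // Keep the failure visible so nothing downstream re-registers a dead container.
        notifyResumeFailed(id)
        Failure(t)
    }
    .andThen {
      // Still always clean the warmed key, but after the unpause attempt rather than
      // concurrently with it, so the unwatch can no longer race ahead of the resume.
      case _ => unregisterWarmedKey(id)
    }

Await.ready(resumeAndTrack("ca688d9b746b"), 5.seconds)

Whether this fully addresses the problem depends on what actually triggers the docker rm, which I still need to confirm.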