Description
I had an action queue get into a stuck state after about two seconds of etcd downtime, while other actions recovered gracefully. What appears to happen is that the queue endpoint key expires in etcd and no longer exists, but the controller never learns this and keeps routing activations to the scheduler endpoint it last saw for that queue (I believe a WatchEndpointRemoved event should be sent to the controllers in this case, but that doesn't seem to have happened). The QueueManager on the scheduler that receives the activation has no local queue for the action, so it tries to remotely resolve the queue through etcd, but the queue endpoint no longer exists there and it falls into this code path:
case t =>
  logging.warn(this, s"[${msg.activationId}] activation has been dropped (${t.getMessage})")
  completeErrorActivation(msg, "The activation has not been processed: failed to get the queue endpoint.")
}}
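For reference, my mental model of the failure is a purely watch-driven endpoint cache: entries are only removed when a WatchEndpointRemoved event arrives, so a single missed event during the etcd outage leaves a stale entry around indefinitely. A minimal, hypothetical sketch of that pattern (EndpointCache and the method names are mine, not OpenWhisk's):

import scala.collection.concurrent.TrieMap

// Hypothetical model of a watch-driven endpoint cache: entries are only removed
// when a WatchEndpointRemoved-style event is delivered. If that event is missed
// (e.g. during a brief etcd outage), the stale entry is never cleaned up.
object StaleCacheModel {
  final case class SchedulerEndpoint(host: String, port: Int)

  class EndpointCache {
    private val endpoints = TrieMap.empty[String, SchedulerEndpoint]

    // Called on a watch "put" event for the queue endpoint key.
    def onEndpointAdded(action: String, ep: SchedulerEndpoint): Unit =
      endpoints.put(action, ep)

    // Called on a watch "removed" event; never invoked if that event is lost.
    def onEndpointRemoved(action: String): Unit =
      endpoints.remove(action)

    // The controller keeps routing to whatever is cached, even after the key is gone in etcd.
    def lookup(action: String): Option[SchedulerEndpoint] =
      endpoints.get(action)
  }

  def main(args: Array[String]): Unit = {
    val cache = new EndpointCache
    cache.onEndpointAdded("myAction", SchedulerEndpoint("scheduler0", 8080))
    // etcd outage: the key's lease expires and the key is deleted, but the removal
    // event never reaches the controller, so the cache still answers with the old endpoint:
    println(cache.lookup("myAction")) // Some(SchedulerEndpoint(scheduler0,8080))
  }
}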
All requests for this action are then dropped in the QueueManager until the schedulers are restarted. Is there any way to make this more resilient, so that if something gets stuck in this edge case we can recover without requiring a restart? @style95
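Not a concrete proposal, but the kind of fallback I'm imagining looks roughly like the sketch below (getQueueEndpointFromEtcd, createLocalQueue and resolveOrRecover are placeholders, not actual QueueManager methods): retry the etcd lookup with a short backoff and, if the endpoint really is gone, recreate the queue instead of dropping every activation until a restart.

import scala.concurrent.{ExecutionContext, Future}
import scala.concurrent.duration._
import akka.actor.ActorSystem
import akka.pattern.after

// Hypothetical recovery sketch only; whether recreating the queue locally is safe
// depends on QueueManager internals I haven't verified.
object QueueRecoverySketch {
  def resolveOrRecover(action: String, retries: Int = 3, delay: FiniteDuration = 1.second)(
    implicit system: ActorSystem,
    ec: ExecutionContext): Future[String] = {

    def attempt(remaining: Int): Future[String] =
      getQueueEndpointFromEtcd(action).recoverWith {
        case _ if remaining > 0 =>
          // Transient etcd hiccup: back off and retry the lookup.
          after(delay, system.scheduler)(attempt(remaining - 1))
        case _ =>
          // The endpoint key is really gone: recreate the queue rather than
          // dropping all activations for this action until a restart.
          createLocalQueue(action)
      }

    attempt(retries)
  }

  // Placeholders standing in for the real etcd lookup and queue creation.
  def getQueueEndpointFromEtcd(action: String)(implicit ec: ExecutionContext): Future[String] =
    Future.failed(new NoSuchElementException(s"no queue endpoint for $action"))

  def createLocalQueue(action: String)(implicit ec: ExecutionContext): Future[String] =
    Future.successful(s"queue for $action recreated on this scheduler")
}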
Additional logs for the timeline:
The way I know connectivity to etcd failed is that the controller emits this log for about two seconds, after which all activations for this action begin to fail while every other action goes back to normal:
[WARN] [#tid_T8Gk2BdDdf4PIjq8W8Ta12kD0ZAOMgwE] [ActionsApi] No scheduler endpoint available [marker:controller_loadbalancer_error:2:0]
Then, on the invoker, this log is emitted by the existing containers for the action until the schedulers are restarted:
[ActivationClientProxy] The queue of action [REDACTED] does not exist. Check for queues in other schedulers.