Stuck orchestrations at random on control-queue #903
Comments
Thanks @sainipankaj90k. Just noting here for visibility that we agreed internally to work together on a private release with extra logs to help get to the bottom of this issue. We can keep this issue open while the investigation is active.
Hi @davidmrdavid, wondering if you have any updates here, as we are experiencing similar issues?
So far, I have been monitoring the situation on the box and found that C# awaited tasks sometimes get stuck. (It sounds very strange, but I have seen it first-hand.) They get stuck at various calls, e.g., fetching the app lease from storage, or stopping/starting the task hub worker. Through our local testing it is not predictable, but we have reproduced it. In DTF, it is usually found stuck in `GetMessagesAsync` of `ControlQueue.cs`, which also awaits C# tasks. My hypothesis so far is that these stuck awaits are what cause the stuck control queues. Right now I am planning to use `ConfigureAwait(false)` in our code, and if that succeeds, we should use it in this DTF library as well.
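For illustration, a minimal sketch of the kind of change we are considering in our own wrapper code (the type and method names here are hypothetical, not from DTF):

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical wrapper used in our own service code, shown only to illustrate
// where we plan to add ConfigureAwait(false).
public class AppLeaseClient
{
    private readonly Func<CancellationToken, Task<string>> fetchLeaseFromStorage;

    public AppLeaseClient(Func<CancellationToken, Task<string>> fetchLeaseFromStorage)
        => this.fetchLeaseFromStorage = fetchLeaseFromStorage;

    public async Task<string> AcquireAppLeaseAsync(CancellationToken token)
    {
        // ConfigureAwait(false) lets the continuation resume on any thread-pool
        // thread instead of waiting for the original synchronization context.
        return await this.fetchLeaseFromStorage(token).ConfigureAwait(false);
    }
}
```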
Please don't do this. It is likely to cause your orchestrations to get stuck 100% of the time, since orchestration code MUST always run in the orchestration's synchronization context. The problem you're experiencing sounds more like an issue with the DurableTask.AzureStorage partition manager. Which version of the DurableTask.AzureStorage package are you using?
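For context, here is a minimal DurableTask.Core-style sketch (the orchestration and activity are hypothetical) of the kind of await that must stay on the orchestration's synchronization context; adding `ConfigureAwait(false)` inside `RunTask` would break the single-threaded replay model:

```csharp
using System.Threading.Tasks;
using DurableTask.Core;

// Hypothetical activity, used only for this example.
public class GetGreetingActivity : TaskActivity<string, string>
{
    protected override string Execute(TaskContext context, string name) => $"Hello, {name}!";
}

// Hypothetical orchestration, used only for this example.
public class GreetingOrchestration : TaskOrchestration<string, string>
{
    public override async Task<string> RunTask(OrchestrationContext context, string input)
    {
        // This await must NOT use ConfigureAwait(false): the continuation has to
        // resume on the orchestration's own synchronization context so that replay
        // stays deterministic and single-threaded.
        return await context.ScheduleTask<string>(typeof(GetGreetingActivity), input);
    }
}
```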
Interesting. |
We are on Microsoft.Azure.DurableTask.AzureStorage 1.12.0 |
Okay, and just to confirm, are you still encountering this problem periodically with the latest version(s)? |
Yes, we are encountering this with the latest versions too. Another observation: we run multiple task hub workers (think of them as microservices) on each node, so this could be an awaiting-thread issue causing a deadlock. So far I see no other reason for an async request to get stuck, since none of our logic would deliberately keep it on hold.
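To make that hosting model concrete, a rough sketch of how each of our nodes runs several workers in one process, assuming the DurableTask.AzureStorage 1.x settings API (task hub names and the connection-string variable are placeholders):

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using DurableTask.AzureStorage;
using DurableTask.Core;

// Rough sketch of our hosting model: several TaskHubWorkers in one process,
// roughly one per "microservice". Names and connection string are placeholders.
var workers = new List<TaskHubWorker>();
foreach (var hubName in new[] { "ServiceAHub", "ServiceBHub", "ServiceCHub" })
{
    var service = new AzureStorageOrchestrationService(new AzureStorageOrchestrationServiceSettings
    {
        TaskHubName = hubName,
        StorageConnectionString = Environment.GetEnvironmentVariable("STORAGE_CONNECTION_STRING"),
    });

    var worker = new TaskHubWorker(service);
    // Orchestration/activity registration per service omitted here,
    // e.g. worker.AddTaskOrchestrations(...) and worker.AddTaskActivities(...).
    await worker.StartAsync();
    workers.Add(worker);
}
```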
Hi,
We are facing this issue at random; it is not very frequent, roughly once every 15 days.
All of a sudden, some control queue gets stuck, and the orchestrations queued on it are not processed until we restart the node itself.
This does not happen near a service shutdown or similar event.
Also, the lease ownership status for the affected control queue shows success throughout the impacted period.
There is no predictable repro so far, but it recurs quite consistently at the frequency mentioned above.
Thanks,
Pankaj