Stuck orchestrations at random on control-queue #903
Comments
Thanks @sainipankaj90k. Just noting here for visibility that we agreed internally to work together on a private release with extra logs to help get to the bottom of this issue. We can keep this issue open while the investigation is active.
Hi @davidmrdavid, wondering if you have any updates here, as we are experiencing similar issues?
So far, I have been monitoring the situation on the box and found that C# awaited tasks sometimes get stuck. (It sounds very strange, but I have seen it first-hand.) They get stuck at various calls, e.g., fetching the app lease from storage, or stopping/starting the task hub worker. Through our local testing it is not predictable, but we have reproduced it. In DTF, it is usually found stuck in `GetMessagesAsync` of `ControlQueue.cs`, which also awaits C# tasks. My hypothesis so far is that these stuck awaits are what cause the stuck control queues. Right now I am planning to use `ConfigureAwait(false)` in our code, and if that succeeds, we should use it in this DTF library as well.
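For illustration, a minimal sketch of the kind of change we are considering in our own wrapper code (the type and method names here are hypothetical, not from DTF):

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical wrapper used in our own service code, shown only to illustrate
// where we plan to add ConfigureAwait(false).
public class AppLeaseClient
{
    private readonly Func<CancellationToken, Task<string>> fetchLeaseFromStorage;

    public AppLeaseClient(Func<CancellationToken, Task<string>> fetchLeaseFromStorage)
        => this.fetchLeaseFromStorage = fetchLeaseFromStorage;

    public async Task<string> AcquireAppLeaseAsync(CancellationToken token)
    {
        // ConfigureAwait(false) lets the continuation resume on any thread-pool
        // thread instead of waiting for the original synchronization context.
        return await this.fetchLeaseFromStorage(token).ConfigureAwait(false);
    }
}
```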
Please don't do this. It is likely to cause your orchestrations to get stuck 100% of the time, since orchestration code MUST always run in the orchestration's synchronization context. The problem you're experiencing sounds more like an issue with the DurableTask.AzureStorage partition manager. Which version of the DurableTask.AzureStorage package are you using?
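For context, here is a minimal DurableTask.Core-style sketch (the orchestration and activity are hypothetical) of the kind of await that must stay on the orchestration's synchronization context; adding `ConfigureAwait(false)` inside `RunTask` would break the single-threaded replay model:

```csharp
using System.Threading.Tasks;
using DurableTask.Core;

// Hypothetical activity, used only for this example.
public class GetGreetingActivity : TaskActivity<string, string>
{
    protected override string Execute(TaskContext context, string name) => $"Hello, {name}!";
}

// Hypothetical orchestration, used only for this example.
public class GreetingOrchestration : TaskOrchestration<string, string>
{
    public override async Task<string> RunTask(OrchestrationContext context, string input)
    {
        // This await must NOT use ConfigureAwait(false): the continuation has to
        // resume on the orchestration's own synchronization context so that replay
        // stays deterministic and single-threaded.
        return await context.ScheduleTask<string>(typeof(GetGreetingActivity), input);
    }
}
```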
Interesting. |
We are on Microsoft.Azure.DurableTask.AzureStorage 1.12.0 |
Okay, and just to confirm, are you still encountering this problem periodically with the latest version(s)? |
Yes, we are encountering this with the latest versions too. Another observation: we run multiple task hub workers (think of them as microservices) on each node, so this could be an awaiting-thread issue causing a deadlock. So far I see no other reason for an async request to get stuck, since none of our logic would deliberately keep it on hold.
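To make that hosting model concrete, a rough sketch of how each of our nodes runs several workers in one process, assuming the DurableTask.AzureStorage 1.x settings API (task hub names and the connection-string variable are placeholders):

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using DurableTask.AzureStorage;
using DurableTask.Core;

// Rough sketch of our hosting model: several TaskHubWorkers in one process,
// roughly one per "microservice". Names and connection string are placeholders.
var workers = new List<TaskHubWorker>();
foreach (var hubName in new[] { "ServiceAHub", "ServiceBHub", "ServiceCHub" })
{
    var service = new AzureStorageOrchestrationService(new AzureStorageOrchestrationServiceSettings
    {
        TaskHubName = hubName,
        StorageConnectionString = Environment.GetEnvironmentVariable("STORAGE_CONNECTION_STRING"),
    });

    var worker = new TaskHubWorker(service);
    // Orchestration/activity registration per service omitted here,
    // e.g. worker.AddTaskOrchestrations(...) and worker.AddTaskActivities(...).
    await worker.StartAsync();
    workers.Add(worker);
}
```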
Hi,
We are facing this issue at random; it is not very frequent, roughly once every 15 days.
All of a sudden, some control queue gets stuck, and the orchestrations queued on it are not processed until we restart the node itself.
This does not happen near a service shutdown or similar event.
Also, the lease ownership status for the affected control queue shows success throughout the impacted period.
There is no predictable repro so far, but it recurs quite consistently at the frequency mentioned above.
Thanks,
Pankaj