[Bug]: unrecoverable The collection's state is no longer correct, some Entities return errors until DAB is restarted #2694

golfalot · 2025-05-20T17:49:57Z

What happened?

Background

We are running the official image mcr.microsoft.com/azure-databases/data-api-builder:1.4.27
This issue is not new to this release; it is long-standing across multiple releases of DAB, on both .NET 6 and .NET 8
We are querying a single CosmosDB database. The DAB configuration Entities map to around 20 Cosmos containers.
All the queries are point reads by primary key
The CosmosDB is not being altered during the period the error occurs

Error message
"Operations that change non-concurrent collections must have exclusive access. A concurrent update was performed on this collection and corrupted its state. The collection's state is no longer correct."

Symptoms

The request to /graphql returns HTTP 200, but only some Entities are resolved in the data property, and the errors field is populated
Once the error has occurred, the Entities that failed continue to fail consistently until the container is restarted/redeployed
Curiously, the Entities in the error state include some which have not been queried.

"errors": [
        {
            "message": "Operations that change non-concurrent collections must have exclusive access. A concurrent update was performed on this collection and corrupted its state. The collection's state is no longer correct.",
            "locations": [
                {
                    "line": 46,
                    "column": 3
                }
            ],
            "path": [
                "copernicusSlope_by_pk"
            ]
        },
        {
            "message": "Operations that change non-concurrent collections must have exclusive access. A concurrent update was performed on this collection and corrupted its state. The collection's state is no longer correct.",
            "locations": [
                {
                    "line": 102,
                    "column": 3
                }
            ],
            "path": [
                "hadUKgroundfrost_by_pk"
            ]
        },
...
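
Because the response is still HTTP 200, a caller can only notice this state by inspecting the body. The following is a minimal sketch of such a check, assuming the standard GraphQL response shape shown above (a top-level `errors` array). The `GraphQLResponseCheck` class and `HasErrors` method are hypothetical names for illustration and are not part of DAB.

```csharp
using System;
using System.Text.Json;

// Hypothetical helper, not part of DAB: detects a GraphQL response that
// came back with HTTP 200 but still reports errors in its body.
public static class GraphQLResponseCheck
{
    // Returns true when the body has a non-empty top-level "errors" array.
    public static bool HasErrors(string responseBody)
    {
        using JsonDocument doc = JsonDocument.Parse(responseBody);
        JsonElement root = doc.RootElement;

        // TryGetProperty is only valid on JSON objects, so check the kind first.
        return root.ValueKind == JsonValueKind.Object
            && root.TryGetProperty("errors", out JsonElement errors)
            && errors.ValueKind == JsonValueKind.Array
            && errors.GetArrayLength() > 0;
    }
}
```

A monitoring probe could run a small known query and alert when `HasErrors` stays true across several consecutive calls, which would approximate the persistent state described here.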

Repeatability

We cannot reproduce at will; hundreds of thousands of requests are successfully handled over a period of a week or two
It appears to coincide with peak concurrent requests, in the region of 200-300 requests per minute

Desired behaviour

The odd error here and there is acceptable, but getting stuck in a persistent errored state whilst still returning HTTP 200 response codes is difficult to manage operationally
Possibly give some consideration to a /health API that can report on this persistent error state

Version

1.4.27.0

What database are you using?

CosmosDB NoSQL

What hosting model are you using?

Container Apps

Which API approach are you accessing DAB through?

GraphQL

Relevant log output

[{
        "severityLevel": "Error",
        "outerId": "0",
        "message": "Operations that change non-concurrent collections must have exclusive access. A concurrent update was performed on this collection and corrupted its state. The collection's state is no longer correct.",
        "type": "System.InvalidOperationException",
        "id": "60417033",
        "parsedStack": [{
                "assembly": "System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e",
                "method": "System.ThrowHelper.ThrowInvalidOperationException_ConcurrentOperationsNotSupported",
                "level": 0,
                "line": 0
            }, {
                "assembly": "System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e",
                "method": "System.Collections.Generic.Dictionary`2.FindValue",
                "level": 1,
                "line": 0
            }, {
                "assembly": "Azure.DataApiBuilder.Core, Version=1.4.27.0, Culture=neutral, PublicKeyToken=null",
                "method": "Azure.DataApiBuilder.Core.Resolvers.CosmosQueryEngine+<GetPartitionKeyPath>d__16.MoveNext",
                "level": 2,
                "line": 0
            }, {
                "assembly": "System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e",
                "method": "System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw",
                "level": 3,
                "line": 0
            }, {
                "assembly": "System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e",
                "method": "System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess",
                "level": 4,
                "line": 0
            }, {
                "assembly": "System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e",
                "method": "System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification",
                "level": 5,
                "line": 0
            }, {
                "assembly": "Azure.DataApiBuilder.Core, Version=1.4.27.0, Culture=neutral, PublicKeyToken=null",
                "method": "Azure.DataApiBuilder.Core.Resolvers.CosmosQueryEngine+<GetIdAndPartitionKey>d__17.MoveNext",
                "level": 6,
                "line": 344,
                "fileName": "/_/src/Core/Resolvers/CosmosQueryEngine.cs"
            }, {
                "assembly": "System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e",
                "method": "System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw",
                "level": 7,
                "line": 0
            }, {
                "assembly": "System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e",
                "method": "System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess",
                "level": 8,
                "line": 0
            }, {
                "assembly": "System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e",
                "method": "System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification",
                "level": 9,
                "line": 0
            }, {
                "assembly": "Azure.DataApiBuilder.Core, Version=1.4.27.0, Culture=neutral, PublicKeyToken=null",
                "method": "Azure.DataApiBuilder.Core.Resolvers.CosmosQueryEngine+<ExecuteAsync>d__8.MoveNext",
                "level": 10,
                "line": 81,
                "fileName": "/_/src/Core/Resolvers/CosmosQueryEngine.cs"
            }, {
                "assembly": "System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e",
                "method": "System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw",
                "level": 11,
                "line": 0
            }, {
                "assembly": "System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e",
                "method": "System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess",
                "level": 12,
                "line": 0
            }, {
                "assembly": "System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e",
                "method": "System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification",
                "level": 13,
                "line": 0
            }, {
                "assembly": "Azure.DataApiBuilder.Core, Version=1.4.27.0, Culture=neutral, PublicKeyToken=null",
                "method": "Azure.DataApiBuilder.Service.Services.ExecutionHelper+<ExecuteQueryAsync>d__5.MoveNext",
                "level": 14,
                "line": 79,
                "fileName": "/_/src/Core/Services/ExecutionHelper.cs"
            }, {
                "assembly": "System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e",
                "method": "System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw",
                "level": 15,
                "line": 0
            }, {
                "assembly": "System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e",
                "method": "System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess",
                "level": 16,
                "line": 0
            }, {
                "assembly": "System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e",
                "method": "System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification",
                "level": 17,
                "line": 0
            }, {
                "assembly": "Azure.DataApiBuilder.Core, Version=1.4.27.0, Culture=neutral, PublicKeyToken=null",
                "method": "ResolverTypeInterceptor+<>c__DisplayClass5_1+<<-ctor>b__5>d.MoveNext",
                "level": 18,
                "line": 23,
                "fileName": "/_/src/Core/Services/ResolverTypeInterceptor.cs"
            }, {
                "assembly": "System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e",
                "method": "System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw",
                "level": 19,
                "line": 0
            }, {
                "assembly": "System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e",
                "method": "System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess",
                "level": 20,
                "line": 0
            }, {
                "assembly": "System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e",
                "method": "System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification",
                "level": 21,
                "line": 0
            }, {
                "assembly": "HotChocolate.Execution, Version=12.22.6.0, Culture=neutral, PublicKeyToken=null",
                "method": "HotChocolate.Execution.Processing.Tasks.ResolverTask+<ExecuteResolverPipelineAsync>d__58.MoveNext",
                "level": 22,
                "line": 0
            }, {
                "assembly": "System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e",
                "method": "System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw",
                "level": 23,
                "line": 0
            }, {
                "assembly": "System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e",
                "method": "System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess",
                "level": 24,
                "line": 0
            }, {
                "assembly": "System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e",
                "method": "System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification",
                "level": 25,
                "line": 0
            }, {
                "assembly": "HotChocolate.Execution, Version=12.22.6.0, Culture=neutral, PublicKeyToken=null",
                "method": "HotChocolate.Execution.Processing.Tasks.ResolverTask+<TryExecuteAsync>d__57.MoveNext",
                "level": 26,
                "line": 0
            }
        ]
    }
]

Code of Conduct

I agree to follow this project's Code of Conduct

michaelstaib · 2025-05-20T21:22:22Z

I analyzed this a bit and it seems that it could be caused by the metastore that is accessed within the Cosmos provider ... the stack trace points to a concurrency issue when accessing the dictionary.

GetIdAndPartitionKey accesses the metastore, which is backed by a plain dictionary:

        public string? GetPartitionKeyPath(string database, string container)
        {
            _partitionKeyPaths.TryGetValue($"{database}/{container}", out string? partitionKeyPath);
            return partitionKeyPath;
        }

        /// <inheritdoc />
        public void SetPartitionKeyPath(string database, string container, string partitionKeyPath)
        {
            if (!_partitionKeyPaths.TryAdd($"{database}/{container}", partitionKeyPath))
            {
                _partitionKeyPaths[$"{database}/{container}"] = partitionKeyPath;
            }
        }

This dictionary should be a concurrent dictionary.
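
A minimal sketch of what that change could look like, assuming the field is simply swapped for a `ConcurrentDictionary<string, string>`. The `PartitionKeyPathStore` class name is hypothetical and only there to make the snippet self-contained; the two methods mirror the ones quoted above, and the actual fix in the linked pull request may differ.

```csharp
using System.Collections.Concurrent;

// Hypothetical stand-in for the metadata provider; only the two methods
// quoted above are reproduced here.
public sealed class PartitionKeyPathStore
{
    // ConcurrentDictionary allows concurrent readers and writers without
    // corrupting its internal state, unlike Dictionary<TKey, TValue>.
    private readonly ConcurrentDictionary<string, string> _partitionKeyPaths = new();

    public string? GetPartitionKeyPath(string database, string container)
    {
        _partitionKeyPaths.TryGetValue($"{database}/{container}", out string? partitionKeyPath);
        return partitionKeyPath;
    }

    public void SetPartitionKeyPath(string database, string container, string partitionKeyPath)
    {
        // The indexer setter adds the key or overwrites the existing value,
        // so the separate TryAdd call is no longer needed.
        _partitionKeyPaths[$"{database}/{container}"] = partitionKeyPath;
    }
}
```

With a plain `Dictionary`, a write racing with other writes or reads can leave the internal buckets inconsistent, after which lookups such as `FindValue` keep throwing, which would match the "fails until restart" behaviour reported.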

michaelstaib · 2025-05-20T21:24:23Z

@Aniruddh25 I am closing the issue on the HotChocolate repo, but I can hop on a call if you guys need some pointers, as I have analyzed the issue.

golfalot added the bug (Something isn't working) and triage (issues to be triaged) labels May 20, 2025

golfalot mentioned this issue May 20, 2025

unrecoverable error - A concurrent update was performed on this collection and corrupted its state ChilliCream/graphql-platform#8299

Closed

michaelstaib linked a pull request May 21, 2025 that will close this issue

Fixed CosmosSqlMetadataProvider concurrency issue. #2695

Open
