[Bug]: unrecoverable The collection's state is no longer correct, some Entities return errors until DAB is restarted #2694

Open
1 task done
golfalot opened this issue May 20, 2025 · 2 comments · May be fixed by #2695
Labels
bug Something isn't working triage issues to be triaged

Comments

@golfalot

What happened?

Background

  • We are running the official image mcr.microsoft.com/azure-databases/data-api-builder:1.4.27
  • This issue is not new to this release; it is long-standing across multiple DAB releases, on both .NET 6 and .NET 8
  • We are querying a single CosmosDB database; the DAB configuration Entities map to around 20 Cosmos containers
  • All the queries are point reads by primary key
  • The CosmosDB is not being altered during the period the error occurs

Error message
"Operations that change non-concurrent collections must have exclusive access. A concurrent update was performed on this collection and corrupted its state. The collection's state is no longer correct."

Symptoms

  1. The request to /graphql returns HTTP 200, but only some Entities are resolved in the data property, and the errors field is populated
  2. Once the error has occurred, the Entities that failed keep failing consistently until the container is restarted/redeployed
  3. Curiously, the Entities in the error state include some that have not been queried.
"errors": [
        {
            "message": "Operations that change non-concurrent collections must have exclusive access. A concurrent update was performed on this collection and corrupted its state. The collection's state is no longer correct.",
            "locations": [
                {
                    "line": 46,
                    "column": 3
                }
            ],
            "path": [
                "copernicusSlope_by_pk"
            ]
        },
        {
            "message": "Operations that change non-concurrent collections must have exclusive access. A concurrent update was performed on this collection and corrupted its state. The collection's state is no longer correct.",
            "locations": [
                {
                    "line": 102,
                    "column": 3
                }
            ],
            "path": [
                "hadUKgroundfrost_by_pk"
            ]
        },
...

Repeatability

  • We cannot reproduce it at will; hundreds of thousands of requests are handled successfully over a period of a week or two
  • It appears to coincide with peak concurrent requests, in the region of 200-300 requests per minute

Desired behaviour

  • The odd error here and there is acceptable, but getting stuck in a persistent errored state while still returning HTTP 200 response codes is difficult to manage operationally
  • Possibly consider a /health API that can report this persistent error state
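
Until the service itself reports this state, a caller or an external probe could detect it by inspecting the errors array of an otherwise successful response. The following is a minimal sketch, assuming the response shape shown under Symptoms; the names GraphQLResponseProbe and HasCorruptedStateError are hypothetical and not part of DAB.

```csharp
#nullable enable
using System;
using System.Text.Json;

// Hypothetical helper (not part of DAB): scans a GraphQL response body for
// the "non-concurrent collections" error reported in this issue.
public static class GraphQLResponseProbe
{
    // Leading part of the message thrown by System.Collections.Generic.Dictionary.
    private const string Marker =
        "Operations that change non-concurrent collections must have exclusive access";

    public static bool HasCorruptedStateError(string responseBody)
    {
        using JsonDocument doc = JsonDocument.Parse(responseBody);
        JsonElement root = doc.RootElement;

        // An HTTP 200 response can still carry a populated "errors" array.
        if (root.ValueKind != JsonValueKind.Object
            || !root.TryGetProperty("errors", out JsonElement errors)
            || errors.ValueKind != JsonValueKind.Array)
        {
            return false;
        }

        foreach (JsonElement error in errors.EnumerateArray())
        {
            if (error.ValueKind != JsonValueKind.Object
                || !error.TryGetProperty("message", out JsonElement message)
                || message.ValueKind != JsonValueKind.String)
            {
                continue;
            }

            string? text = message.GetString();
            if (text is not null && text.Contains(Marker, StringComparison.Ordinal))
            {
                return true;
            }
        }

        return false;
    }
}
```

A probe built on this could mark the replica unhealthy so that the platform restarts it, which matches the manual recovery described above. It works around the symptom only and does not address the cause.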

Version

1.4.27.0

What database are you using?

CosmosDB NoSQL

What hosting model are you using?

Container Apps

Which API approach are you accessing DAB through?

GraphQL

Relevant log output

[{
        "severityLevel": "Error",
        "outerId": "0",
        "message": "Operations that change non-concurrent collections must have exclusive access. A concurrent update was performed on this collection and corrupted its state. The collection's state is no longer correct.",
        "type": "System.InvalidOperationException",
        "id": "60417033",
        "parsedStack": [{
                "assembly": "System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e",
                "method": "System.ThrowHelper.ThrowInvalidOperationException_ConcurrentOperationsNotSupported",
                "level": 0,
                "line": 0
            }, {
                "assembly": "System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e",
                "method": "System.Collections.Generic.Dictionary`2.FindValue",
                "level": 1,
                "line": 0
            }, {
                "assembly": "Azure.DataApiBuilder.Core, Version=1.4.27.0, Culture=neutral, PublicKeyToken=null",
                "method": "Azure.DataApiBuilder.Core.Resolvers.CosmosQueryEngine+<GetPartitionKeyPath>d__16.MoveNext",
                "level": 2,
                "line": 0
            }, {
                "assembly": "System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e",
                "method": "System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw",
                "level": 3,
                "line": 0
            }, {
                "assembly": "System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e",
                "method": "System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess",
                "level": 4,
                "line": 0
            }, {
                "assembly": "System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e",
                "method": "System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification",
                "level": 5,
                "line": 0
            }, {
                "assembly": "Azure.DataApiBuilder.Core, Version=1.4.27.0, Culture=neutral, PublicKeyToken=null",
                "method": "Azure.DataApiBuilder.Core.Resolvers.CosmosQueryEngine+<GetIdAndPartitionKey>d__17.MoveNext",
                "level": 6,
                "line": 344,
                "fileName": "/_/src/Core/Resolvers/CosmosQueryEngine.cs"
            }, {
                "assembly": "System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e",
                "method": "System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw",
                "level": 7,
                "line": 0
            }, {
                "assembly": "System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e",
                "method": "System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess",
                "level": 8,
                "line": 0
            }, {
                "assembly": "System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e",
                "method": "System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification",
                "level": 9,
                "line": 0
            }, {
                "assembly": "Azure.DataApiBuilder.Core, Version=1.4.27.0, Culture=neutral, PublicKeyToken=null",
                "method": "Azure.DataApiBuilder.Core.Resolvers.CosmosQueryEngine+<ExecuteAsync>d__8.MoveNext",
                "level": 10,
                "line": 81,
                "fileName": "/_/src/Core/Resolvers/CosmosQueryEngine.cs"
            }, {
                "assembly": "System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e",
                "method": "System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw",
                "level": 11,
                "line": 0
            }, {
                "assembly": "System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e",
                "method": "System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess",
                "level": 12,
                "line": 0
            }, {
                "assembly": "System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e",
                "method": "System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification",
                "level": 13,
                "line": 0
            }, {
                "assembly": "Azure.DataApiBuilder.Core, Version=1.4.27.0, Culture=neutral, PublicKeyToken=null",
                "method": "Azure.DataApiBuilder.Service.Services.ExecutionHelper+<ExecuteQueryAsync>d__5.MoveNext",
                "level": 14,
                "line": 79,
                "fileName": "/_/src/Core/Services/ExecutionHelper.cs"
            }, {
                "assembly": "System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e",
                "method": "System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw",
                "level": 15,
                "line": 0
            }, {
                "assembly": "System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e",
                "method": "System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess",
                "level": 16,
                "line": 0
            }, {
                "assembly": "System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e",
                "method": "System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification",
                "level": 17,
                "line": 0
            }, {
                "assembly": "Azure.DataApiBuilder.Core, Version=1.4.27.0, Culture=neutral, PublicKeyToken=null",
                "method": "ResolverTypeInterceptor+<>c__DisplayClass5_1+<<-ctor>b__5>d.MoveNext",
                "level": 18,
                "line": 23,
                "fileName": "/_/src/Core/Services/ResolverTypeInterceptor.cs"
            }, {
                "assembly": "System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e",
                "method": "System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw",
                "level": 19,
                "line": 0
            }, {
                "assembly": "System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e",
                "method": "System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess",
                "level": 20,
                "line": 0
            }, {
                "assembly": "System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e",
                "method": "System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification",
                "level": 21,
                "line": 0
            }, {
                "assembly": "HotChocolate.Execution, Version=12.22.6.0, Culture=neutral, PublicKeyToken=null",
                "method": "HotChocolate.Execution.Processing.Tasks.ResolverTask+<ExecuteResolverPipelineAsync>d__58.MoveNext",
                "level": 22,
                "line": 0
            }, {
                "assembly": "System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e",
                "method": "System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw",
                "level": 23,
                "line": 0
            }, {
                "assembly": "System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e",
                "method": "System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess",
                "level": 24,
                "line": 0
            }, {
                "assembly": "System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e",
                "method": "System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification",
                "level": 25,
                "line": 0
            }, {
                "assembly": "HotChocolate.Execution, Version=12.22.6.0, Culture=neutral, PublicKeyToken=null",
                "method": "HotChocolate.Execution.Processing.Tasks.ResolverTask+<TryExecuteAsync>d__57.MoveNext",
                "level": 26,
                "line": 0
            }
        ]
    }
]

Code of Conduct

  • I agree to follow this project's Code of Conduct
@michaelstaib
Collaborator

I analyzed this a bit, and it seems that it could be caused by the metastore that is accessed within the Cosmos provider. The stack trace points to a concurrency issue when accessing the dictionary.

GetIdAndPartitionKey accesses the metastore, which stores the partition key paths in a dictionary:

        public string? GetPartitionKeyPath(string database, string container)
        {
            _partitionKeyPaths.TryGetValue($"{database}/{container}", out string? partitionKeyPath);
            return partitionKeyPath;
        }

        /// <inheritdoc />
        public void SetPartitionKeyPath(string database, string container, string partitionKeyPath)
        {
            if (!_partitionKeyPaths.TryAdd($"{database}/{container}", partitionKeyPath))
            {
                _partitionKeyPaths[$"{database}/{container}"] = partitionKeyPath;
            }
        }

This dictionary should be a concurrent dictionary.
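
A minimal sketch of that change, assuming the backing field is currently a plain Dictionary<string, string>. The wrapper class name PartitionKeyPathCache is hypothetical, and the actual fix in DAB may look different.

```csharp
#nullable enable
using System.Collections.Concurrent;

// Hypothetical stand-in for the metadata store that owns _partitionKeyPaths.
public sealed class PartitionKeyPathCache
{
    // ConcurrentDictionary supports concurrent readers and writers without
    // external locking, unlike Dictionary<TKey, TValue>.
    private readonly ConcurrentDictionary<string, string> _partitionKeyPaths = new();

    public string? GetPartitionKeyPath(string database, string container)
    {
        _partitionKeyPaths.TryGetValue($"{database}/{container}", out string? partitionKeyPath);
        return partitionKeyPath;
    }

    public void SetPartitionKeyPath(string database, string container, string partitionKeyPath)
    {
        // The indexer setter adds or overwrites atomically, so the
        // TryAdd-then-assign fallback is no longer needed.
        _partitionKeyPaths[$"{database}/{container}"] = partitionKeyPath;
    }
}
```

With a plain Dictionary, a write that races with another read or write can corrupt the internal buckets for good, which would explain why the errors persist until the process is restarted.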

@michaelstaib
Collaborator

@Aniruddh25 I am closing the issue on the HotChocolate repo, but I can hop on a call if you guys need some pointers, as I have analyzed the issue.


2 participants