[BUG] Detector creation gets stuck on clusters with large shards and heavy ingestion #870

Closed
@eirsep

What is the bug?

curl localhost:9200/_cat/tasks?v | less
action                                                     task_id                        parent_task_id                 type      start_time    timestamp running_time ip            node
cluster:admin/opensearch/securityanalytics/detector/write  NS5L3EYoSM2ED7Ivqq1snQ:990205  -                              transport 1706578051254 01:27:31  5.2h         10.212.107.51 5112ba5b511cfd4495
cluster:admin/opensearch/securityanalytics/detector/write  NS5L3EYoSM2ED7Ivqq1snQ:991197  -                              transport 1706578171258 01:29:31  5.2h         10.212.107.51 5112ba5b511cfd4495
cluster:admin/opensearch/securityanalytics/rule/search     wWwwf7eRSD2oo8KulgOF7Q:917083  -                              transport 1706578176167 01:29:36  5.2h         10.212.27.178 a173576e5c9b149d2e
cluster:admin/opendistro/alerting/monitor/write            NS5L3EYoSM2ED7Ivqq1snQ:991834  -                              transport 1706578242304 01:30:42  5.2h         10.212.107.51 5112ba5b511cfd4495
cluster:admin/opensearch/securityanalytics/detector/write  6x4YBILlRNqCh-H5SEGz4g:929275  -                              transport 1706578277667 01:31:17  5.2h         10.212.98.228 f8a85ed4b86db333fc
cluster:admin/opensearch/securityanalytics/mapping/get     NS5L3EYoSM2ED7Ivqq1snQ:992971  -                              transport 1706578360527 01:32:40  5.1h         10.212.107.51 5112ba5b511cfd4495
indices:admin/mappings/get   

How can one reproduce the bug?

The detector creation flow contains a few blocking calls (invocations of actionGet()) that cause deadlocks. On clusters with heavy ingestion and large shards the problem is magnified: the cluster chokes and runs out of resources while threads sit stuck in deadlocks.

What is the expected behavior?
The code should be event-driven, using the listener-based SPIs exposed by the OpenSearch transport client.
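
To illustrate the direction, here is a minimal, self-contained sketch of listener-based chaining. It does not use the plugin's code or the real OpenSearch classes: `Listener` is a hypothetical stand-in for OpenSearch's `ActionListener`, and `getMappings`, `indexMonitor`, and `createDetector` are made-up step names. A single-thread executor stands in for a bounded thread pool. Each step hands its result to a callback instead of blocking, so the chain completes even when it is started from the pool's only thread, where a blocking wait on the next step could never return.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class ListenerSketch {
    // Hypothetical stand-in for OpenSearch's ActionListener<T>.
    interface Listener<T> {
        void onResponse(T response);
        void onFailure(Exception e);
    }

    // Stands in for a bounded transport thread pool (assumption for the demo).
    private final ExecutorService pool;

    ListenerSketch(ExecutorService pool) {
        this.pool = pool;
    }

    // Hypothetical async step: the result is delivered to the listener.
    void getMappings(String index, Listener<String> listener) {
        pool.execute(() -> listener.onResponse("mappings-of-" + index));
    }

    // Hypothetical async step that depends on the previous result.
    void indexMonitor(String mappings, Listener<String> listener) {
        pool.execute(() -> listener.onResponse("monitor-for-" + mappings));
    }

    // Event-driven flow: no thread waits for another step to finish.
    void createDetector(String index, Listener<String> listener) {
        getMappings(index, new Listener<String>() {
            @Override
            public void onResponse(String mappings) {
                indexMonitor(mappings, listener);
            }

            @Override
            public void onFailure(Exception e) {
                listener.onFailure(e);
            }
        });
    }

    // Runs the flow from inside the single pool thread and returns the result.
    static String demo() {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            ListenerSketch sketch = new ListenerSketch(pool);
            CompletableFuture<String> result = new CompletableFuture<>();
            pool.execute(() -> sketch.createDetector("logs", new Listener<String>() {
                @Override
                public void onResponse(String response) {
                    result.complete(response);
                }

                @Override
                public void onFailure(Exception e) {
                    result.completeExceptionally(e);
                }
            }));
            // Only the caller outside the pool waits, and with a timeout.
            return result.get(5, TimeUnit.SECONDS);
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

If `createDetector` instead waited for the result of `getMappings` while running on the pool's only thread, the queued task would never run. That is the kind of deadlock described above, reduced to a pool of size one.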
