Skip to content

Cardinality Aggregation dynamic pruning null pointer exception #17739

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
bowenlan-amzn opened this issue Mar 31, 2025 · 7 comments · Fixed by #17775
Closed

Cardinality Aggregation dynamic pruning null pointer exception #17739

bowenlan-amzn opened this issue Mar 31, 2025 · 7 comments · Fixed by #17775
Assignees
Labels
Search:Aggregations v3.0.0 Issues and PRs related to version 3.0.0

Comments

@bowenlan-amzn
Copy link
Member

We got a report of null pointer exception when performing dynamic pruning in cardinality aggregation.

The exception looks like below

Caused by: java.lang.NullPointerException: Cannot read field \"doc\" because \"top\" is null
at org.opensearch.search.aggregations.metrics.CardinalityAggregator$PruningCollector.prune(CardinalityAggregator.java:379)
at org.opensearch.search.aggregations.metrics.CardinalityAggregator$PruningCollector.collect(CardinalityAggregator.java:371)
at org.opensearch.search.aggregations.LeafBucketCollector.collect(LeafBucketCollector.java:123)
at org.apache.lucene.search.BooleanScorer$DocIdStreamView.forEach(BooleanScorer.java:178)
at org.apache.lucene.search.LeafCollector.collect(LeafCollector.java:106)
at org.apache.lucene.search.BooleanScorer.scoreWindowIntoBitSetAndReplay(BooleanScorer.java:257)
at org.apache.lucene.search.BooleanScorer.scoreWindowMultipleScorers(BooleanScorer.java:309)
at org.apache.lucene.search.BooleanScorer.scoreWindow(BooleanScorer.java:362)
at org.apache.lucene.search.BooleanScorer.score(BooleanScorer.java:374)
at org.apache.lucene.search.BulkScorer.score(BulkScorer.java:38)
at org.opensearch.search.aggregations.metrics.CardinalityAggregator.tryScoreWithPruningCollector(CardinalityAggregator.java:229)\n\t... 41 more\n

Notice the code path starts from BooleanScorer. This is missed during development where we focused mostly just on DefaultBulkScorer. Although DefaultBulkScorer guarantee no NPE at that place, but the other implementations of BulkScorer may not have that guarantees.

The fix is to simply add a null check.

@bowenlan-amzn bowenlan-amzn self-assigned this Mar 31, 2025
@bowenlan-amzn bowenlan-amzn converted this from a draft issue Mar 31, 2025
@bowenlan-amzn bowenlan-amzn removed their assignment Mar 31, 2025
@jainankitk
Copy link
Contributor

@bowenlan-amzn - Thanks for reporting this issue. We should prioritize the fix for this NPE issue. Is someone working on it already? If yes, I can assign this issue

@bowenlan-amzn
Copy link
Member Author

@jainankitk no one has started on this for now.
I'm cutting this to see if anyone is interested to pick up the small fix. If we don't see any interest this week, I will get this going.

@ajleong623
Copy link
Contributor

I would like to work on this.

@ajleong623
Copy link
Contributor

ajleong623 commented Apr 3, 2025

@bowenlan-amzn I made a draft pull request, but I am curious about what request caused this null pointer exception? And what were the documents?

@bowenlan-amzn
Copy link
Member Author

bowenlan-amzn commented Apr 4, 2025

@ajleong623 I don't have a dataset for reproduction as it's from an internal customer.

The request looks like this

POST <index>/_search
{
    "size": 0,
    "aggs": {
        "<agg-name>": {
            "cardinality": {
                "field": "<keyword_1>"
            }
        }
    },
    "query": {
        "bool": {
            "filter": {
                "terms": {
                    "<keyword_2>": [
                        "<term a>",
                        "<term b>",
                        "<term c>"
                    ]
                }
            }
        }
    }
}

I realise you probably anyway need to have a unit test case to hit the null check added in the PR so to pass the code coverage check.
Please see if you can come up with some dataset that can cause the NPE when doing this above query.

@ajleong623
Copy link
Contributor

ajleong623 commented Apr 5, 2025

@bowenlan-amzn Ok, So I have been trying to write tests using the boolean query and have been looking through the logic of how dynamic pruning happens and what could cause the DISIPriorityQueue to become empty while documents are still being collected (I believe this is why the error occurred unless there was somehow a null value pushed onto the queue). However, I also noticed that the query you provided above would be evaluated in the conjunction scorer because multiple filter boolean queries would end of being conjunctive. But BooleanScorer (shown in the stack trace) is used for disjunctive queries.

Is it certain that the query was in that structure? Also, when I tried experimenting with disjunctive queries by using the SHOULD occur value, the DISIPriorityQueue pushed documentIDs with values of -1 onto the queue, and from my experimenting, this will happen if I used a disjunctive query with only should boolean occurs for the clauses. When the -1 are pushed onto the queue, the queue cannot be emptied. I also tried to make a TermInSetQuery, and the same thing happened to the priority queue. Let me know if you have any other ideas on how I should try to reproduce the NPE. Are there any other boolean query structures I should try? How could I empty the queue but still have documents being collected when the queue is the size of the amount of terms in the query during construction of the collector?

@bowenlan-amzn
Copy link
Member Author

bowenlan-amzn commented Apr 9, 2025

Based on the exception trace, I looked into the BooleanScorer's implementation.
When I try to reproduce the exception, I realise BooleanScorer doens't even use competitiveIterator provided by pruning collector, so competitiveIterator should be never advanced and would stay at docId -1, which means prune would never prune anything either, theoretically.
I'm not sure why the size of queue in dynamic pruning would ever decrease when BooleanScorer is used to score a segment.

For unit test coverage, @ajleong623 maybe try create a customized BulkScorer to just hit the null check path. Or try ask maintainer to merge the change overriding the coverage check. Since how we end up having a NPE is still a mystery.

@github-project-automation github-project-automation bot moved this from 🆕 New to ✅ Done in Search Project Board Apr 11, 2025
@github-project-automation github-project-automation bot moved this from In Progress to Done in Performance Roadmap Apr 11, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Search:Aggregations v3.0.0 Issues and PRs related to version 3.0.0
Projects
Status: Done
Status: Done
Development

Successfully merging a pull request may close this issue.

5 participants