Description
Describe the bug
Issue was found by @dbwiddis while reviewing #16818 , raised this bug issue to track and proper fix this:
A race condition exists in the QueryRewriteContext class in the method executeAsyncActions(). When multiple threads concurrently interact with the asyncActions list (via the registerAsyncAction() and executeAsyncActions() methods), it can result in actions being lost, or failure notifications not being delivered to the listener. This issue can occur under certain conditions where actions are added to the list while the method executeAsyncActions() is in progress.
Related component
Search:Query Capabilities
To Reproduce
- Thread 1 calls executeAsyncActions() and retrieves the number of registered async actions (e.g., 3 actions) on line 113, and stores this number in the countDown object.
- Thread 2 (in a different part of the system) calls registerAsyncAction() and adds a 4th action to the asyncActions list.
- Thread 1 proceeds and copies the current asyncActions list (which now contains 4 actions) into biConsumers. The count of actions in biConsumers is now 4, but the count stored in countDown is still 3, reflecting the actions at the time executeAsyncActions() started.
- Thread 3 (in another part of the system) calls registerAsyncAction() again, adding a 5th action to the asyncActions list.
- Thread 1 clears the asyncActions list, so the 5th action added in step 4 is lost.
- Thread 1 starts iterating over biConsumers (which contains only 4 actions, not 5) and processes the actions.
- As countDown reaches 0 after processing the 3rd action, the listener's onResponse() is triggered, but the 4th action is still executed.
- If the 4th action fails, the failure is ignored because the listener is already completed (onResponse() has been triggered).
- Expected Behavior
- Action 5 should not be lost, even if executeAsyncActions() is in progress when new actions are added to the list.
- If any action fails (including the 4th action), the failure should be properly reported via the onFailure() method of the listener.
- Actual Behavior
- Action 5 is lost because it was added to asyncActions after executeAsyncActions() had already copied the list to biConsumers.
- If the 4th action fails, the failure is not reported to the user because the listener has already been notified with onResponse() when countDown reaches 0.
Thread 1 (executeAsyncActions) Thread 2 (registerAsyncAction) Thread 3 (registerAsyncAction)
| Call executeAsyncActions() | |
| Retrieve countDown = 3 | |
| Copy asyncActions (3 items) -> biConsumers | |
| Thread 1 stores countDown, iterates biConsumers| |
| ---------------------------------------------> | Add Action 4 to asyncActions |
| Process Action 1 | |
| Process Action 2 | |
| Process Action 3 | |
| Action 3 -> countDown reaches 0, listener | |
| onResponse() | |
| Continue executing Action 4 | Add Action 5 to asyncActions |
| Clear asyncActions (loses Action 5) | |
| ---------------------------------------------> | |
| Fail Action 4 -> No failure notification | |
Root Cause
The issue arises from the following race condition:
Thread 1 is executing executeAsyncActions(), but other threads (Thread 2 and Thread 3) are concurrently adding new actions to asyncActions.
The list is copied to biConsumers before the list is cleared, leading to discrepancies between the actions being iterated over and the actual actions that should have been processed.
The countDown value is based on the state of asyncActions when executeAsyncActions() starts, but the list may have changed by the time the actions are processed, leading to missing actions and incomplete failure reporting.
Expected behavior
1. Quick Fix: Synchronization Around Copying and Clearing
Problem: The race condition occurs because the
asyncActions
list can be modified whileexecuteAsyncActions()
is processing it.Solution:
Wrap the
copy
andclear
operations ofasyncActions
in a synchronized block. This will ensure that no other thread can modify the list while it's being copied or cleared.Also wrap the
registerAsyncAction
method (or just the part where actions are added to the list) with a synchronized block to prevent concurrent modifications.javaCopysynchronized (asyncActions) { // Copy and clear the list in executeAsyncActions List<BiConsumer<Client, ActionListener<?>>> biConsumers = new ArrayList<>(asyncActions); asyncActions.clear(); }
synchronized (asyncActions) {
// Add new actions in registerAsyncAction
asyncActions.add(asyncAction);
}
Further Improvement:
- Move the calculation of
countDown
to after the list is copied, right after the synchronized block. This ensures you have full control over the list at that point and avoids any race when counting actions.
- Move the calculation of
2. Alternate Fix: Synchronize registerAsyncAction
and Refactor into a New Method
- Problem: Modifying
asyncActions
inregisterAsyncAction
while it is being copied and cleared causes the race condition. - Solution:
Add the
synchronized
keyword toregisterAsyncAction
to prevent concurrent modifications to the list while it's being processed.Create a new synchronized helper method,
copyAndClearAsyncActions()
, which handles copying the list, clearing it, and returning the copied list. This method will encapsulate the synchronization logic.javaCopyprivate synchronized List<BiConsumer<Client, ActionListener<?>>> copyAndClearAsyncActions() {
List<BiConsumer<Client, ActionListener<?>>> copy = new ArrayList<>(asyncActions);
asyncActions.clear();
return copy;
}// In executeAsyncActions
List<BiConsumer<Client, ActionListener<?>>> biConsumers = copyAndClearAsyncActions();
Move the countdown calculation after calling
copyAndClearAsyncActions()
, so you are working with the list in a consistent state.
3. Simpler Alternative: Use AtomicReference
to Manage the List (proposed by @msfroh )
- Problem: The current approach with a plain list allows race conditions when modifying the list.
- Solution:
Replace
asyncActions
with anAtomicReference<List<BiConsumer<Client, ActionListener<?>>>>
to safely manage concurrent access to the list without explicit synchronization.Use the
AtomicReference
to atomically set and get the list. This eliminates the need to copy the list manually and avoids the race condition.javaCopyprivate final AtomicReference<List<BiConsumer<Client, ActionListener<?>>>> asyncActionsRef = new AtomicReference<>(new ArrayList<>());
// In registerAsyncAction
asyncActionsRef.get().add(asyncAction);// In executeAsyncActions
List<BiConsumer<Client, ActionListener<?>>> asyncActions = asyncActionsRef.getAndSet(new ArrayList<>());
Benefits:
- You no longer need to manually synchronize or copy the list.
AtomicReference
ensures that updates to the list are thread-safe without explicit locks.- Eliminates the risk of race conditions and makes the code simpler and more efficient.
Comparison of Solutions
Fix | Pros | Cons |
---|---|---|
Quick Fix (Synchronization) | - Simple to implement.- No major refactoring needed. | - May introduce performance overhead due to synchronization.- Synchronization may become complex with multiple threads accessing shared resources. |
Alternate Fix (Method Refactor) | - Clearer separation of concerns with a new method.- Keeps synchronization in one place. | - Adds additional methods and complexity.- Slightly more verbose than the quick fix. |
Simpler Alternative (AtomicReference) | - Thread-safe without needing explicit synchronization.- Eliminates race conditions with minimal code changes. | - Requires replacing the list with an AtomicReference.- Might need some adaptation in other parts of the code. |
Recommendation
- The simpler alternative using
AtomicReference
seems like the best option. It requires minimal changes, eliminates the need for manual synchronization, and provides better thread safety. This approach is also cleaner and more efficient, as it avoids the overhead associated with usingsynchronized
blocks.
Additional Details
Plugins
core
Additional context
discussion was raised from https://github.com/opensearch-project/OpenSearch/pull/16818/files#r1919283967
Metadata
Metadata
Assignees
Type
Projects
Status