Bug fix: Fixing stale cache issue post snapshot restore #5307
Conversation
Thanks @nagarajg17 for the PR! Can you take a look at the failing tests?
The unit tests are failing because they perform the security index initialization using the code from here. To solve the issue, one solution may be to differentiate between when the security index is created using a CreateIndexRequest vs. when it is created from a RestoreSnapshotRequest. I was able to write a quick demo to differentiate between these, but the code looks a bit strange because of needing to rewrite the name of the thread. I borrowed the idea from existing code in the security plugin here.
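One way such a differentiation could look is to check the cluster state for an in-progress restore. A minimal sketch, assuming the check runs against the current ClusterState (the method name and the securityIndexName parameter are illustrative, not the actual demo code):

import org.opensearch.cluster.ClusterState;
import org.opensearch.cluster.RestoreInProgress;

// Hypothetical helper: returns true if the security index is part of an
// in-progress snapshot restore recorded in the cluster state.
private boolean isRestoredFromSnapshot(ClusterState state, String securityIndexName) {
    final RestoreInProgress restoreInProgress = state.custom(RestoreInProgress.TYPE);
    if (restoreInProgress == null) {
        return false; // no restore is running at all
    }
    for (RestoreInProgress.Entry entry : restoreInProgress) {
        if (entry.indices().contains(securityIndexName)) {
            return true; // the security index is being restored from a snapshot
        }
    }
    return false;
}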
I am not sure if the described behavior should be considered a bug. Also, all other operations which write to the security index do not automatically trigger a reload of the security configuration. In all cases, a subsequent call to https://github.com/opensearch-project/security/blob/main/src/main/java/org/opensearch/security/rest/SecurityConfigUpdateAction.java is performed to update the effective configuration.
Compare 4935f77 to e4cc98b
Codecov Report
Attention: Patch coverage is …
Additional details and impacted files
@@ Coverage Diff @@
## main #5307 +/- ##
==========================================
+ Coverage 71.69% 71.74% +0.05%
==========================================
Files 384 384
Lines 23866 23884 +18
Branches 3646 3649 +3
==========================================
+ Hits 17111 17136 +25
+ Misses 4924 4913 -11
- Partials 1831 1835 +4
@shikharj05 I have fixed the failing tests. Please take a look. Thanks @cwperks for the suggestion on fixing the failing UTs.
private boolean isSecurityIndexRestoredFromSnapshot(Index index) { |
Can we move this into SnapshotRestoreHelper and keep setCurrentThreadName private?
src/main/java/org/opensearch/security/configuration/ConfigurationRepository.java
final String threadName = Thread.currentThread().getName();

try {
    setCurrentThreadName("[" + ThreadPool.Names.GENERIC + "]");
This is a bit unconventional. Why is this necessary?
Without this, all the UTs started failing with the error from https://github.com/opensearch-project/OpenSearch/blob/main/server/src/main/java/org/opensearch/cluster/service/ClusterApplierService.java#L443
It's due to a difference in the way the security index is created in tests vs. the actual flow; @cwperks has explained it here. For the same reason, SnapshotRestoreHelper does it in a similar way: https://github.com/opensearch-project/security/blob/main/src/main/java/org/opensearch/security/support/SnapshotRestoreHelper.java#L74-L87
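For reference, a minimal sketch of the rename-and-restore pattern used there (the body of the try block is a placeholder, not the actual code):

final String originalThreadName = Thread.currentThread().getName();
try {
    // Temporarily pose as a generic pool thread so that OpenSearch's
    // thread-name assertions allow the blocking index read.
    setCurrentThreadName("[" + ThreadPool.Names.GENERIC + "]");
    // ... read the security index here ...
} finally {
    // Always restore the original name, even if the read throws.
    setCurrentThreadName(originalThreadName);
}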
What's the stack trace of the error?
Usually, these checks are there for a reason. OpenSearch uses quite a few tightly bounded thread pools for its async processing architecture.
The cluster state applier thread pool has a max size of 1.
Any blocking operation on these threads carries a high risk of causing deadlocks, as the thread pool gets exhausted (its only thread is blocked and cannot be returned to the pool). As soon as another operation triggered by this tries to modify the cluster state again, that operation will wait forever for the cluster state applier thread to become available again.
These situations are extremely nasty, as the conditions are often non-deterministic. In some cases you will face a deadlock, in others you won't. Debugging this is no fun ;-)
That's why the OpenSearch core code contains assertions like this.
Possible solutions might be:
- Refactor the whole code to be async
- Offload blocking operations to a separate thread
BTW, I asked a question here to clarify whether we actually face a bug.
Would be cool if we could clarify this :)
I'll re-run the scenario and post the error message shortly, but I also wanted to propose another strategy.
The error seen in the UTs was due to the security index being initialized as empty for test cases subclassing AbstractSecurityUnitTest. The goal of this block is to differentiate between the security index being restored from a snapshot vs being created with a CreateIndex request. Should we consider simplifying this logic to check if the security index is not empty and then trigger the config reload?
In either case, I think the thread renaming is necessary here.
The snapshot restore flow is different from the index update flow. What do you suggest?
Possibly I am missing an important detail here. Could you elaborate on the essential difference between snapshot restore and a normal index update, such that the configuration should be automatically reloaded on snapshot restore but not on index update?
What approach would we want to take to fix this error?
I guess the easiest way would be to offload the reloading to a separate thread. I see no reason why it needs to be synchronous with the cluster state applier thread.
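A minimal sketch of that offloading, assuming access to the node's ThreadPool and to a reload entry point on the repository (reloadConfiguration and its argument are illustrative, not the actual API):

// Run the reload on the generic thread pool instead of blocking the
// single cluster state applier thread.
threadPool.generic().execute(() -> {
    try {
        // Hypothetical reload call; the real entry point may differ.
        configurationRepository.reloadConfiguration(Arrays.asList(CType.values()));
    } catch (Exception e) {
        log.error("Failed to reload security configuration after snapshot restore", e);
    }
});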
Could you elaborate on the essential difference between snapshot restore and a normal index update, such that the configuration should be automatically reloaded on snapshot restore but not on index update?
Restore from a snapshot is a more common operational action than a direct index update, but you are right that the same applies to both cases.
For index updates, TransportConfigUpdateAction takes care of reloading the configuration. Do we want to use this only for snapshot restore? Since a snapshot restore creates a new index, the TransportConfigUpdateAction flow won't be invoked for it.
For index updates, TransportConfigUpdateAction takes care of reloading the configuration.
I am sorry, but I do think this is not accurate :-) TransportConfigUpdateAction is never automatically called if you consider basic index operations.
If you do an index update (via PUT /.../_index/..., via PUT /_bulk, or via one of their transport counterparts), only the index will be updated. The security plugin does not reload the security configuration; it only does so if, after the update, an additional call to TransportConfigUpdateAction is performed.
You can see that in the code of the securityadmin tool:
security/src/main/java/org/opensearch/security/tools/SecurityAdmin.java
Lines 1362 to 1392 in 6e78dd9
private static int upload(RestHighLevelClient tc, String index, String cd, int expectedNodeCount, boolean resolveEnvVars)
    throws IOException {
    boolean success = uploadFile(tc, cd + "config.yml", index, "config", resolveEnvVars);
    success = uploadFile(tc, cd + "roles.yml", index, "roles", resolveEnvVars) && success;
    success = uploadFile(tc, cd + "roles_mapping.yml", index, "rolesmapping", resolveEnvVars) && success;
    success = uploadFile(tc, cd + "internal_users.yml", index, "internalusers", resolveEnvVars) && success;
    success = uploadFile(tc, cd + "action_groups.yml", index, "actiongroups", resolveEnvVars) && success;
    success = uploadFile(tc, cd + "tenants.yml", index, "tenants", resolveEnvVars) && success;
    success = uploadFile(tc, cd + "nodes_dn.yml", index, "nodesdn", resolveEnvVars, true) && success;
    if (new File(cd + "audit.yml").exists()) {
        success = uploadFile(tc, cd + "audit.yml", index, "audit", resolveEnvVars) && success;
    }
    if (new File(cd + "allowlist.yml").exists()) {
        success = uploadFile(tc, cd + "allowlist.yml", index, "allowlist", resolveEnvVars) && success;
    }
    if (!success) {
        System.out.println("ERR: cannot upload configuration, see errors above");
        return -1;
    }
    Response cur = tc.getLowLevelClient()
        .performRequest(new Request("PUT", "/_plugins/_security/configupdate?config_types=" + Joiner.on(",").join(getTypes())));
    success = checkConfigUpdateResponse(cur, expectedNodeCount, getTypes().length) && success;
    System.out.println("Done with " + (success ? "success" : "failures"));
    return (success ? 0 : -1);
}
In lines 1364 to 1379, the index is modified. Only afterwards, in lines 1386 and 1387, is there a separate call to PUT /_plugins/_security/configupdate.
Of course, this pattern could also be applied to snapshot restore operations: just do the snapshot restore and call PUT /_plugins/_security/configupdate afterwards.
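A minimal sketch of that pattern, assuming client is a RestHighLevelClient as in the snippet above (the repository and snapshot names are placeholders):

// 1. Restore the snapshot (placeholder repository/snapshot names).
Request restore = new Request("POST", "/_snapshot/my_repo/my_snapshot/_restore?wait_for_completion=true");
client.getLowLevelClient().performRequest(restore);

// 2. Explicitly tell the security plugin on all nodes to reload its
//    in-memory configuration from the restored index.
Request configUpdate = new Request(
    "PUT",
    "/_plugins/_security/configupdate?config_types=config,roles,rolesmapping,internalusers,actiongroups,tenants"
);
client.getLowLevelClient().performRequest(configUpdate);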
My general opinion on this is:
- We should strive for consistent implementations. Consistent code is easier for developers to understand and thus more efficient to maintain. Of course, inconsistencies are sometimes inevitable, but we should have good reasons for them and should name those reasons (possibly even as a comment in the code).
- Following the KISS principle, simple solutions should be preferred over complex or challenging ones. At the moment, the proposed solution performs complex (and possibly time-consuming) operations on the very low-level cluster state applier thread. This involves accessing the index itself, a task the cluster state applier thread was not created for (by default it refuses such operations; compare the calls to setCurrentThreadName()). There might be ways to solve this, but research and tests are necessary to find them, which is itself an indicator of complexity. On the other hand, there's the very simple solution, which is extremely easy to understand and very robust: just call the additional TransportConfigUpdateAction after any manipulation of the index. That solution is available already today, without any code change.
Note: I won't veto a solution with a built-in automatic refresh of the configuration. However, I strongly believe that we should be able to give good reasons for introducing an inconsistency in behavior between snapshot restore and index update. We should know how to react if a user afterwards files a bug report: "If I do a PUT /.opendistro_security/_doc/..., the security configuration is not automatically refreshed. However, it is automatically refreshed for snapshot restore. Why isn't it happening for the normal PUT?"
Note 2: I will be out of office for the next two days.
FWIW, I have only ever operationally done a direct index update to the security index once, to resolve an issue with a corrupt index. In that case, there was a separate bug which led to invalid values in the index.
Restore from a snapshot is a more common operational action that could be automated, and I think it's ok to focus on this scenario.
Signed-off-by: Nagaraj G <[email protected]>
Description
Bug fix: Fixing stale cache issue post snapshot restore
Currently, when a snapshot that includes the security index is restored on a domain, the security index is updated, but the in-memory cache is not reloaded with data from the security index. Because of this, AuthNZ for API requests and dashboard access is incorrect: it can either grant access when it shouldn't or deny access when it shouldn't.
This change addresses the issue by adding a listener for index shard update events which reloads the cache with data from the security index. This ensures the cache is always updated when the security index is updated.
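A minimal sketch of how such a listener could be wired up via the plugin's onIndexModule hook; securityIndexName, threadPool, configurationRepository, and reloadConfiguration are illustrative names, not necessarily the PR's actual code:

@Override
public void onIndexModule(IndexModule indexModule) {
    // Only watch the security index itself.
    if (securityIndexName.equals(indexModule.getIndex().getName())) {
        indexModule.addIndexEventListener(new IndexEventListener() {
            @Override
            public void afterIndexShardStarted(IndexShard indexShard) {
                // A shard of the security index has (re)started, e.g. after a
                // snapshot restore: reload the in-memory configuration on the
                // generic pool, not on the cluster state applier thread.
                threadPool.generic().execute(
                    () -> configurationRepository.reloadConfiguration(Arrays.asList(CType.values()))
                );
            }
        });
    }
}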
Issues Resolved
[Bug] Stale cache post snapshot restore
Testing
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.