Bug fix: Fixing stale cache issue post snapshot restore #5307
Conversation
Thanks @nagarajg17 for the PR! Can you take a look at the failing tests?
The unit tests are failing because they perform the security index initialization using the code from here. To solve the issue, one solution may be to differentiate between when the security index is created using a CreateIndexRequest vs. when it is created from a RestoreSnapshotRequest. I was able to write a quick demo to differentiate between these, but the code looks a bit strange because of needing to rewrite the name of the thread. I borrowed the idea from existing code in the security plugin here.
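One way such a differentiation could look is to check the cluster state for an in-progress restore. A minimal sketch, assuming the check runs against the current ClusterState (the method name and the securityIndexName parameter are illustrative, not the actual demo code):

import org.opensearch.cluster.ClusterState;
import org.opensearch.cluster.RestoreInProgress;

// Hypothetical helper: returns true if the security index is part of an
// in-progress snapshot restore recorded in the cluster state.
private boolean isRestoredFromSnapshot(ClusterState state, String securityIndexName) {
    final RestoreInProgress restoreInProgress = state.custom(RestoreInProgress.TYPE);
    if (restoreInProgress == null) {
        return false; // no restore is running at all
    }
    for (RestoreInProgress.Entry entry : restoreInProgress) {
        if (entry.indices().contains(securityIndexName)) {
            return true; // the security index is being restored from a snapshot
        }
    }
    return false;
}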
I am not sure if the described behavior should be considered a bug. Also, all other operations which write to the security index do not automatically trigger a reload of the security configuration. In all cases, a subsequent call to https://github.com/opensearch-project/security/blob/main/src/main/java/org/opensearch/security/rest/SecurityConfigUpdateAction.java is performed to update the effective configuration.
Compare 4935f77 to e4cc98b
Codecov Report
Attention: Patch coverage is …
Additional details and impacted files
@@ Coverage Diff @@
## main #5307 +/- ##
==========================================
+ Coverage 71.69% 71.74% +0.05%
==========================================
Files 384 384
Lines 23866 23884 +18
Branches 3646 3649 +3
==========================================
+ Hits 17111 17136 +25
+ Misses 4924 4913 -11
- Partials 1831 1835 +4
@shikharj05 I have fixed the failing tests. Please take a look. Thanks @cwperks for the suggestion on fixing the failing UTs.
private boolean isSecurityIndexRestoredFromSnapshot(Index index) { |
Can we move this into SnapshotRestoreHelper and keep setCurrentThreadName private?
src/main/java/org/opensearch/security/configuration/ConfigurationRepository.java
final String threadName = Thread.currentThread().getName();

try {
    setCurrentThreadName("[" + ThreadPool.Names.GENERIC + "]");
This is a bit unconventional. Why is this necessary?
Without this, all the UTs started failing with the error from https://github.com/opensearch-project/OpenSearch/blob/main/server/src/main/java/org/opensearch/cluster/service/ClusterApplierService.java#L443
It's due to a difference in the way the security index is created in tests vs. the actual flow; @cwperks has explained it here. For the same reason, SnapshotRestoreHelper does it in a similar way: https://github.com/opensearch-project/security/blob/main/src/main/java/org/opensearch/security/support/SnapshotRestoreHelper.java#L74-L87
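For reference, a minimal sketch of the rename-and-restore pattern used there (the body of the try block is a placeholder, not the actual code):

final String originalThreadName = Thread.currentThread().getName();
try {
    // Temporarily pose as a generic pool thread so that OpenSearch's
    // thread-name assertions allow the blocking index read.
    setCurrentThreadName("[" + ThreadPool.Names.GENERIC + "]");
    // ... read the security index here ...
} finally {
    // Always restore the original name, even if the read throws.
    setCurrentThreadName(originalThreadName);
}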
What's the stack trace of the error?
Usually, these checks are there for a reason. OpenSearch uses quite a few tightly bounded thread pools for its async processing architecture.
The cluster state applier thread pool has a max size of 1.
Any blocking operation on these threads carries a high risk of causing deadlocks, as the thread pool gets exhausted (its only thread is blocked and cannot be returned to the pool). As soon as another operation triggered by this tries to modify the cluster state again, that operation will wait forever for the cluster state applier thread to become available again.
These situations are extremely nasty, as the conditions are often non-deterministic. In some cases you will face a deadlock, in others you won't. Debugging this is no fun ;-)
That's why the OpenSearch core code contains assertions like this.
Possible solutions might be:
- Refactor the whole code to be async
- Offload blocking operations to a separate thread
BTW, I asked a question here to clarify whether we actually face a bug.
Would be cool if we could clarify this :)
I'll re-run the scenario and post the error message shortly, but I also wanted to propose another strategy.
The error seen in the UTs was due to the security index being initialized as empty for test cases subclassing AbstractSecurityUnitTest. The goal of this block is to differentiate between the security index being restored from a snapshot vs being created with a CreateIndex request. Should we consider simplifying this logic to check if the security index is not empty and then trigger the config reload?
In either case, I think the thread renaming is necessary here.
The snapshot restore flow is different from the index update flow. What do you suggest?
Possibly I am missing an important detail here. Could you elaborate on the essential difference between snapshot restore and a normal index update, such that the configuration should be automatically reloaded on snapshot restore but not on index update?
What approach would we want to take to fix this error?
I guess the easiest way would be to offload the reloading to a separate thread. I see no reason why it needs to be synchronous with the cluster state applier thread.
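A minimal sketch of that offloading, assuming access to the node's ThreadPool and to a reload entry point on the repository (reloadConfiguration and its argument are illustrative, not the actual API):

// Run the reload on the generic thread pool instead of blocking the
// single cluster state applier thread.
threadPool.generic().execute(() -> {
    try {
        // Hypothetical reload call; the real entry point may differ.
        configurationRepository.reloadConfiguration(Arrays.asList(CType.values()));
    } catch (Exception e) {
        log.error("Failed to reload security configuration after snapshot restore", e);
    }
});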
Could you elaborate on the essential difference between snapshot restore and a normal index update, such that the configuration should be automatically reloaded on snapshot restore but not on index update?
Restore from a snapshot is a more common operational action than a direct index update, but you are right that the same applies to both cases.
For index updates, TransportConfigUpdateAction takes care of reloading the configuration. Do we want to use this only for snapshot restore? Since a snapshot restore creates a new index, the TransportConfigUpdateAction flow won't be invoked for it.
For index updates, TransportConfigUpdateAction takes care of reloading the configuration.
I am sorry, but I do think this is not accurate :-) TransportConfigUpdateAction is never automatically called if you consider basic index operations.
If you do an index update (via PUT /.../_index/..., via PUT /_bulk, or via one of their transport counterparts), only the index will be updated. The security plugin does not reload the security configuration; it only does so if, after the update, an additional call to TransportConfigUpdateAction is performed.
You can see that in the code of the securityadmin tool:
security/src/main/java/org/opensearch/security/tools/SecurityAdmin.java
Lines 1362 to 1392 in 6e78dd9
private static int upload(RestHighLevelClient tc, String index, String cd, int expectedNodeCount, boolean resolveEnvVars)
    throws IOException {
    boolean success = uploadFile(tc, cd + "config.yml", index, "config", resolveEnvVars);
    success = uploadFile(tc, cd + "roles.yml", index, "roles", resolveEnvVars) && success;
    success = uploadFile(tc, cd + "roles_mapping.yml", index, "rolesmapping", resolveEnvVars) && success;
    success = uploadFile(tc, cd + "internal_users.yml", index, "internalusers", resolveEnvVars) && success;
    success = uploadFile(tc, cd + "action_groups.yml", index, "actiongroups", resolveEnvVars) && success;
    success = uploadFile(tc, cd + "tenants.yml", index, "tenants", resolveEnvVars) && success;
    success = uploadFile(tc, cd + "nodes_dn.yml", index, "nodesdn", resolveEnvVars, true) && success;
    if (new File(cd + "audit.yml").exists()) {
        success = uploadFile(tc, cd + "audit.yml", index, "audit", resolveEnvVars) && success;
    }
    if (new File(cd + "allowlist.yml").exists()) {
        success = uploadFile(tc, cd + "allowlist.yml", index, "allowlist", resolveEnvVars) && success;
    }
    if (!success) {
        System.out.println("ERR: cannot upload configuration, see errors above");
        return -1;
    }
    Response cur = tc.getLowLevelClient()
        .performRequest(new Request("PUT", "/_plugins/_security/configupdate?config_types=" + Joiner.on(",").join(getTypes())));
    success = checkConfigUpdateResponse(cur, expectedNodeCount, getTypes().length) && success;
    System.out.println("Done with " + (success ? "success" : "failures"));
    return (success ? 0 : -1);
}
In lines 1364 to 1379, the index is modified. Only afterwards, in lines 1386 and 1387, is there a separate call to PUT /_plugins/_security/configupdate.
Of course, this pattern could also be applied to snapshot restore operations: just do the snapshot restore and call PUT /_plugins/_security/configupdate afterwards.
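A minimal sketch of that pattern, assuming client is a RestHighLevelClient as in the snippet above (the repository and snapshot names are placeholders):

// 1. Restore the snapshot (placeholder repository/snapshot names).
Request restore = new Request("POST", "/_snapshot/my_repo/my_snapshot/_restore?wait_for_completion=true");
client.getLowLevelClient().performRequest(restore);

// 2. Explicitly tell the security plugin on all nodes to reload its
//    in-memory configuration from the restored index.
Request configUpdate = new Request(
    "PUT",
    "/_plugins/_security/configupdate?config_types=config,roles,rolesmapping,internalusers,actiongroups,tenants"
);
client.getLowLevelClient().performRequest(configUpdate);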
My general opinion on this is:
- We should strive for consistent implementations. Consistent code is easier for developers to understand and thus more efficient to maintain. Of course, inconsistencies are sometimes inevitable, but we should have good reasons for them and should name those reasons (possibly even as a comment in the code).
- Following the KISS principle, simple solutions should be preferred over complex or challenging ones. At the moment, the proposed solution performs complex (and possibly time-consuming) operations on the very low-level cluster state applier thread. This involves accessing the index itself, a task the cluster state applier thread was not created for (by default it refuses such operations; compare the calls to setCurrentThreadName()). There might be ways to solve this, but research and tests are necessary to find them, which is itself an indicator of complexity. On the other hand, there's the very simple solution, which is extremely easy to understand and very robust: just call the additional TransportConfigUpdateAction after any manipulation of the index. That solution is available already today, without any code change.
Note: I won't veto a solution with a built-in automatic refresh of the configuration. However, I strongly believe that we should be able to give good reasons for introducing an inconsistency in behavior between snapshot restore and index update. We should know how to react if a user afterwards files a bug report: "If I do a PUT /.opendistro_security/_doc/..., the security configuration is not automatically refreshed. However, it is automatically refreshed for snapshot restore. Why isn't it happening for the normal PUT?"
Note 2: I will be out of office for the next two days.
FWIW, I have only ever operationally done a direct index update to the security index once, to resolve an issue with a corrupt index. In that case, there was a separate bug which led to invalid values in the index.
Restore from a snapshot is a more common operational action that could be automated, and I think it's ok to focus on this scenario.
Signed-off-by: Nagaraj G <[email protected]>
Description
Bug fix: Fixing stale cache issue post snapshot restore
Currently, when a snapshot that includes the security index is restored on a domain, the security index is updated, but the in-memory cache is not reloaded with data from the security index. Because of this, AuthNZ for API requests and dashboard access is incorrect: it can either grant access when it shouldn't or deny access when it shouldn't.
This change addresses the issue by adding a listener for index shard update events which reloads the cache with data from the security index. This ensures the cache is always updated when the security index is updated.
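A minimal sketch of how such a listener could be wired up via the plugin's onIndexModule hook; securityIndexName, threadPool, configurationRepository, and reloadConfiguration are illustrative names, not necessarily the PR's actual code:

@Override
public void onIndexModule(IndexModule indexModule) {
    // Only watch the security index itself.
    if (securityIndexName.equals(indexModule.getIndex().getName())) {
        indexModule.addIndexEventListener(new IndexEventListener() {
            @Override
            public void afterIndexShardStarted(IndexShard indexShard) {
                // A shard of the security index has (re)started, e.g. after a
                // snapshot restore: reload the in-memory configuration on the
                // generic pool, not on the cluster state applier thread.
                threadPool.generic().execute(
                    () -> configurationRepository.reloadConfiguration(Arrays.asList(CType.values()))
                );
            }
        });
    }
}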
Issues Resolved
[Bug] Stale cache post snapshot restore
Testing
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.