Commit 5a5c970

avoid expensive Solr join when guest users search (affect IP Groups) #10554
1 parent d72d347 commit 5a5c970

8 files changed: +52 -0 lines changed

@@ -0,0 +1,3 @@
+An experimental feature flag called "avoid-expensive-solr-join" has been added to change the way Solr queries are constructed for guest (unauthenticated) users. It is hoped that it will help with performance, reducing load on Solr.
+
+From a search perspective, it disables IP Groups (collections, datasets, and files will not be discoverable through IP Group membership) but it removes an expensive Solr join for the most common users, which are guests. After turning on this feature, you must perform a full reindex.

doc/sphinx-guides/source/admin/ip-groups.rst

+5
@@ -41,3 +41,8 @@ It is not recommended to delete an IP Group that has been assigned roles. If you
 To delete an IP Group with an alias of "ipGroup1", use the curl command below:

 ``curl -X DELETE http://localhost:8080/api/admin/groups/ip/ipGroup1``
+
+Related Settings
+----------------
+
+Be aware that enabling the feature flag ``avoid-expensive-solr-join`` will effectively prevent collections, datasets, and files from being found by members of IP Groups when searching, rendering IP Groups much less useful. See :ref:`feature-flags` in the Installation Guide for details.

doc/sphinx-guides/source/developers/performance.rst

+4
@@ -118,6 +118,10 @@ Solr

 While in the past Solr performance hasn't been much of a concern, in recent years we've noticed performance problems when Harvard Dataverse is under load. Improvements were made in `PR #10050 <https://github.com/IQSS/dataverse/pull/10050>`_, for example.

+We are tracking performance problems in `#10469 <https://github.com/IQSS/dataverse/issues/10469>`_.
+
+In a meeting with a Solr expert on 2024-05-10 we were advised to avoid joins as much as possible. (It was acknowledged that many Solr users make use of joins because they have to, like we do, to keep some documents private.) Toward that end we have added a feature flag called ``avoid-expensive-solr-join`` as explained under :ref:`feature-flags`.
+
 Datasets with Large Numbers of Files or Versions
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

doc/sphinx-guides/source/installation/config.rst

+3
@@ -3248,6 +3248,9 @@ please find all known feature flags below. Any of these flags can be activated u
     * - api-session-auth
       - Enables API authentication via session cookie (JSESSIONID). **Caution: Enabling this feature flag exposes the installation to CSRF risks!** We expect this feature flag to be temporary (only used by frontend developers, see `#9063 <https://github.com/IQSS/dataverse/issues/9063>`_) and for the feature to be removed in the future.
       - ``Off``
+    * - avoid-expensive-solr-join
+      - Changes the way Solr queries are constructed for guest (unauthenticated) users. From a search perspective, it disables :doc:`IP Groups </admin/ip-groups>` (collections, datasets, and files will not be discoverable through IP Group membership) but it removes an expensive Solr join for the most common users, which are guests, to help improve overall performance. After turning on this feature, you must perform a full reindex. See :doc:`/admin/solr-search-index`.
+      - ``Off``

 **Note:** Feature flags can be set via any `supported MicroProfile Config API source`_, e.g. the environment variable
 ``DATAVERSE_FEATURE_XXX`` (e.g. ``DATAVERSE_FEATURE_API_SESSION_AUTH=1``). These environment variables can be set in your shell before starting Payara. If you are using :doc:`Docker for development </container/dev-usage>`, you can set them in the `docker compose <https://docs.docker.com/compose/environment-variables/set-environment-variables/>`_ file.
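
The ``DATAVERSE_FEATURE_XXX`` naming in the note above follows the MicroProfile Config rule for environment variables: characters that are not alphanumeric or ``_`` become ``_``, and the result is upper-cased. The sketch below applies that rule to the new flag. The class name ``FeatureFlagEnvName`` and method ``toEnvVar`` are hypothetical helpers for illustration and are not part of Dataverse; the ``dataverse.feature.`` prefix is taken from the ``FeatureFlags`` Javadoc in this commit.

```java
import java.util.Locale;

/** Illustrative helper only; not part of the Dataverse code base. */
final class FeatureFlagEnvName {

    // Property prefix as stated in the FeatureFlags Javadoc.
    private static final String PREFIX = "dataverse.feature.";

    /**
     * Derives an environment variable name from a flag name by replacing
     * every character that is not alphanumeric or '_' with '_' and
     * upper-casing the result.
     */
    static String toEnvVar(String flagName) {
        String property = PREFIX + flagName;
        return property.replaceAll("[^A-Za-z0-9_]", "_").toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // Prints the variable one would export before starting Payara.
        System.out.println(toEnvVar("avoid-expensive-solr-join") + "=1");
    }
}
```

For the existing ``api-session-auth`` flag the same helper yields ``DATAVERSE_FEATURE_API_SESSION_AUTH``, matching the example in the note.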

src/main/java/edu/harvard/iq/dataverse/search/IndexServiceBean.java

+10
@@ -12,6 +12,7 @@
 import edu.harvard.iq.dataverse.datavariable.VariableMetadataUtil;
 import edu.harvard.iq.dataverse.datavariable.VariableServiceBean;
 import edu.harvard.iq.dataverse.harvest.client.HarvestingClient;
+import edu.harvard.iq.dataverse.settings.FeatureFlags;
 import edu.harvard.iq.dataverse.settings.JvmSettings;
 import edu.harvard.iq.dataverse.settings.SettingsServiceBean;
 import edu.harvard.iq.dataverse.util.FileUtil;
@@ -214,6 +215,9 @@ public Future<String> indexDataverse(Dataverse dataverse, boolean processPaths)
         solrInputDocument.addField(SearchFields.DATAVERSE_CATEGORY, dataverse.getIndexableCategoryName());
         if (dataverse.isReleased()) {
             solrInputDocument.addField(SearchFields.PUBLICATION_STATUS, PUBLISHED_STRING);
+            if (FeatureFlags.AVOID_EXPENSIVE_SOLR_JOIN.enabled()) {
+                solrInputDocument.addField(SearchFields.PUBLIC_OBJECT, true);
+            }
             solrInputDocument.addField(SearchFields.RELEASE_OR_CREATE_DATE, dataverse.getPublicationDate());
         } else {
             solrInputDocument.addField(SearchFields.PUBLICATION_STATUS, UNPUBLISHED_STRING);
@@ -887,6 +891,9 @@ public SolrInputDocuments toSolrDocs(IndexableDataset indexableDataset, Set<Long

         if (state.equals(indexableDataset.getDatasetState().PUBLISHED)) {
             solrInputDocument.addField(SearchFields.PUBLICATION_STATUS, PUBLISHED_STRING);
+            if (FeatureFlags.AVOID_EXPENSIVE_SOLR_JOIN.enabled()) {
+                solrInputDocument.addField(SearchFields.PUBLIC_OBJECT, true);
+            }
             // solrInputDocument.addField(SearchFields.RELEASE_OR_CREATE_DATE,
             // dataset.getPublicationDate());
         } else if (state.equals(indexableDataset.getDatasetState().WORKING_COPY)) {
@@ -1400,6 +1407,9 @@ public SolrInputDocuments toSolrDocs(IndexableDataset indexableDataset, Set<Long
                 if (indexableDataset.getDatasetState().equals(indexableDataset.getDatasetState().PUBLISHED)) {
                     fileSolrDocId = solrDocIdentifierFile + fileEntityId;
                     datafileSolrInputDocument.addField(SearchFields.PUBLICATION_STATUS, PUBLISHED_STRING);
+                    if (FeatureFlags.AVOID_EXPENSIVE_SOLR_JOIN.enabled()) {
+                        datafileSolrInputDocument.addField(SearchFields.PUBLIC_OBJECT, true);
+                    }
                     // datafileSolrInputDocument.addField(SearchFields.PERMS, publicGroupString);
                     addDatasetReleaseDateToSolrDoc(datafileSolrInputDocument, dataset);
                     // has this published file been deleted from the current draft version?
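
The three hunks above share one pattern: a content document gets ``publicObject_b=true`` only when the object is published and the flag is on. Below is a minimal, self-contained sketch of that pattern, using a plain ``Map`` in place of ``SolrInputDocument``. The class ``PublicObjectIndexingSketch``, the method ``contentDocument``, and the literal values ``"publicationStatus"``, ``"Published"``, and ``"Unpublished"`` are assumptions for illustration and are not confirmed by this diff.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative sketch only; a Map stands in for SolrInputDocument. */
final class PublicObjectIndexingSketch {

    // Assumed field name for the publication status; not shown in this diff.
    static final String PUBLICATION_STATUS = "publicationStatus";
    // Field name as defined in SearchFields.PUBLIC_OBJECT by this commit.
    static final String PUBLIC_OBJECT = "publicObject_b";

    /** Builds the fields relevant to guest search for one content document. */
    static Map<String, Object> contentDocument(boolean published, boolean flagEnabled) {
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put(PUBLICATION_STATUS, published ? "Published" : "Unpublished");
        // Only published objects are marked public, and only when the flag is on.
        if (published && flagEnabled) {
            doc.put(PUBLIC_OBJECT, true);
        }
        return doc;
    }

    public static void main(String[] args) {
        System.out.println(contentDocument(true, true));
        System.out.println(contentDocument(true, false));
    }
}
```

Because the field is written only at index time, documents indexed before the flag was enabled do not carry it, which is consistent with the requirement to run a full reindex after turning the flag on.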

src/main/java/edu/harvard/iq/dataverse/search/SearchFields.java

+9
@@ -217,6 +217,15 @@ public class SearchFields {
     public static final String DEFINITION_POINT_DVOBJECT_ID = "definitionPointDvObjectId";
     public static final String DISCOVERABLE_BY = "discoverableBy";

+    /**
+     * publicObject_b is an experimental field tied to the
+     * avoid-expensive-solr-join feature flag. Unlike discoverableBy, which
+     * is a field on permission documents, publicObject_b is a field on content
+     * documents (dvObjects). By indexing publicObject_b=true, we can let guests
+     * search on it, avoiding an expensive join for those (common) users.
+     */
+    public static final String PUBLIC_OBJECT = "publicObject_b";
+
     /**
      * i.e. "Unpublished", "Draft" (multivalued)
      */

src/main/java/edu/harvard/iq/dataverse/search/SearchServiceBean.java

+9
@@ -16,6 +16,7 @@
 import edu.harvard.iq.dataverse.authorization.users.PrivateUrlUser;
 import edu.harvard.iq.dataverse.authorization.users.User;
 import edu.harvard.iq.dataverse.engine.command.DataverseRequest;
+import edu.harvard.iq.dataverse.settings.FeatureFlags;
 import edu.harvard.iq.dataverse.util.BundleUtil;
 import edu.harvard.iq.dataverse.util.SystemConfig;
 import java.io.IOException;
@@ -1006,6 +1007,14 @@ private String getPermissionFilterQuery(DataverseRequest dataverseRequest, SolrQ
         // Yes, see if GuestUser is part of any groups such as IP Groups.
         // ----------------------------------------------------
         if (user instanceof GuestUser) {
+            if (FeatureFlags.AVOID_EXPENSIVE_SOLR_JOIN.enabled()) {
+                /**
+                 * Instead of doing an expensive join, narrow down to only
+                 * public objects. This field is indexed on the content document
+                 * itself, rather than a permission document.
+                 */
+                return SearchFields.PUBLIC_OBJECT + ":" + true;
+            }
             String groupsFromProviders = "";
             Set<Group> groups = groupService.collectAncestors(groupService.groupsFor(dataverseRequest));
             StringBuilder sb = new StringBuilder();
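
The early return in this hunk can be read as a small decision: guests with the flag enabled get a plain field filter, and everyone else keeps the existing join-based filter. The sketch below isolates that decision under stated assumptions. ``GuestPermissionFilterSketch`` and ``permissionFilter`` are hypothetical names, and the join-based filter is passed in as an opaque string because its exact form is not shown in this diff.

```java
/** Illustrative sketch only; not the Dataverse implementation. */
final class GuestPermissionFilterSketch {

    // Field name as defined in SearchFields.PUBLIC_OBJECT by this commit.
    static final String PUBLIC_OBJECT = "publicObject_b";

    /**
     * Returns the Solr filter to apply for the current user.
     *
     * @param isGuest         whether the request comes from an unauthenticated user
     * @param flagEnabled     whether avoid-expensive-solr-join is turned on
     * @param joinBasedFilter the existing join-based permission filter (opaque here)
     */
    static String permissionFilter(boolean isGuest, boolean flagEnabled, String joinBasedFilter) {
        if (isGuest && flagEnabled) {
            // Field lives on the content document, so no join is needed.
            return PUBLIC_OBJECT + ":" + true;
        }
        return joinBasedFilter;
    }

    public static void main(String[] args) {
        // "<join filter>" is a placeholder, not a real Solr query.
        System.out.println(permissionFilter(true, true, "<join filter>"));
        System.out.println(permissionFilter(true, false, "<join filter>"));
    }
}
```

Since the guest branch returns before group membership is collected, IP Group membership no longer influences guest search results when the flag is on, which is the trade-off the documentation changes in this commit describe.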

src/main/java/edu/harvard/iq/dataverse/settings/FeatureFlags.java

+9
@@ -36,6 +36,15 @@ public enum FeatureFlags {
      * @since Dataverse @TODO:
      */
     API_BEARER_AUTH("api-bearer-auth"),
+    /**
+     * For Guest users, don't use a join when searching Solr. Disables the IP
+     * Groups feature from a search perspective. Requires a reindex.
+     *
+     * @apiNote Raise flag by setting
+     * "dataverse.feature.avoid-expensive-solr-join"
+     * @since Dataverse @TODO:
+     */
+    AVOID_EXPENSIVE_SOLR_JOIN("avoid-expensive-solr-join"),
     ;

     final String flag;
