Commit 3c55c3f

pdurbin and landreev authored
avoid expensive Solr join for public dvObjects in search (experimental) (#10555)
* avoid expensive Solr join when guest users search (affects IP Groups) #10554
* fix copy/paste error, target doc for file, not dataset #10554
* Checking a few experimental changes into the branch: Jim's soft commit fixes from 10547; a quick experiment, replacing the join on public objects with a boolean publicObject_b:true for logged-in users as well (with a join added just for their own personal documents; groups are ignored for now). #10554
* Step 3 of the performance improvement effort relying on a boolean "publicObject" flag for published documents: now for logged-in users, AND with support for groups. Group support is experimental, but appears to be working. #10554
* Modified the implementation for the guest user, to support IP groups. #10554
* Removed the few autocommit-related changes previously borrowed from 10547, to keep things separate and clear, for testing etc. #10554
* Reorganized the optimized code in SearchServiceBean; combined the code blocks for the guest and authenticated users. #10554
* updated the release note. #10554
* Removed the warning from the IP groups guide about the effect of the new search optimization feature, since it was no longer true. #10554
* Updated the section of the guide describing the new Solr optimization feature flags. #10554
* Updated the performance section of the guide. #10554
* Modified IndexServiceBean to use the new feature flag, which has been separated from the flag that enables the search-side optimization; fixed the groups sub-query for the guest user. #10554
* cosmetic #10554
* doc tweaks #10554
* no-op code cleanup, correct case of publicObject_b #10554

Co-authored-by: Leonid Andreev <[email protected]>
1 parent 23a4d9b commit 3c55c3f

7 files changed, +181 -47 lines changed

@@ -0,0 +1,5 @@
1+
Two experimental feature flags, "add-publicobject-solr-field" and "avoid-expensive-solr-join", have been added to change how Solr documents are indexed for public objects and how Solr queries are constructed to accommodate access to restricted content (drafts, etc.). They are intended to improve performance, especially on large instances and under load.
2+
3+
Before the search feature flag ("avoid-expensive...") can be turned on, the indexing flag must be enabled, and a full reindex performed. Otherwise publicly available objects are NOT going to be shown in search results.
4+
5+
For details see https://dataverse-guide--10555.org.readthedocs.build/en/10555/installation/config.html#feature-flags and #10555.

doc/sphinx-guides/source/developers/performance.rst

+4
@@ -118,6 +118,10 @@ Solr
118118

119119
While in the past Solr performance hasn't been much of a concern, in recent years we've noticed performance problems when Harvard Dataverse is under load. Improvements were made in `PR #10050 <https://github.com/IQSS/dataverse/pull/10050>`_, for example.
120120

121+
We are tracking performance problems in `#10469 <https://github.com/IQSS/dataverse/issues/10469>`_.
122+
123+
In a meeting with a Solr expert on 2024-05-10 we were advised to avoid joins as much as possible. (It was acknowledged that many Solr users make use of joins because they have to, like we do, to keep some documents private.) Toward that end we have added two feature flags called ``avoid-expensive-solr-join`` and ``add-publicobject-solr-field`` as explained under :ref:`feature-flags`. We confirmed experimentally that performing the join on all the public objects (published collections, datasets and files), i.e., the bulk of the content in the search index, was indeed very expensive, particularly on an instance the size of the IQSS production archive and under indexing load. We also confirmed that the join was unnecessary for these objects and were able to replace it with a boolean field stored directly in the indexed documents, which is what the two feature flags above enable. As of this writing, however, this mechanism should still be considered experimental.
124+
121125
Datasets with Large Numbers of Files or Versions
122126
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
123127

doc/sphinx-guides/source/installation/config.rst

+6
@@ -3268,6 +3268,12 @@ please find all known feature flags below. Any of these flags can be activated u
32683268
* - api-session-auth
32693269
- Enables API authentication via session cookie (JSESSIONID). **Caution: Enabling this feature flag exposes the installation to CSRF risks!** We expect this feature flag to be temporary (only used by frontend developers, see `#9063 <https://github.com/IQSS/dataverse/issues/9063>`_) and for the feature to be removed in the future.
32703270
- ``Off``
3271+
* - avoid-expensive-solr-join
3272+
- Changes the way Solr queries are constructed for public content (published Collections, Datasets and Files). It removes a very expensive Solr join on all such documents, improving overall performance, especially for large instances under heavy load. Before this feature flag is enabled, the corresponding indexing feature (see next feature flag) must be turned on and a full reindex performed (otherwise public objects are not going to be shown in search results). See :doc:`/admin/solr-search-index`.
3273+
- ``Off``
3274+
* - add-publicobject-solr-field
3275+
- Adds an extra boolean field ``publicObject_b:true`` for public content (published Collections, Datasets and Files). Once the content has been reindexed with this field, searches can rely on it to avoid a very expensive Solr join on all such documents, significantly improving overall performance (this is enabled by the feature flag above, ``avoid-expensive-solr-join``). The two flags are separate so that an instance can reindex its holdings before enabling the optimization in searches, thus preventing public objects from temporarily disappearing from search results while the reindexing is in progress.
3276+
- ``Off``
32713277

32723278
**Note:** Feature flags can be set via any `supported MicroProfile Config API source`_, e.g. the environment variable
32733279
``DATAVERSE_FEATURE_XXX`` (e.g. ``DATAVERSE_FEATURE_API_SESSION_AUTH=1``). These environment variables can be set in your shell before starting Payara. If you are using :doc:`Docker for development </container/dev-usage>`, you can set them in the `docker compose <https://docs.docker.com/compose/environment-variables/set-environment-variables/>`_ file.
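
As an illustration of the naming convention in the note above, here is a minimal sketch that derives the environment variable name for a feature flag. It assumes the usual MicroProfile Config mapping (every character that is not alphanumeric or ``_`` becomes ``_``, then the result is upper-cased) and the ``dataverse.feature.`` property prefix; the class and method names are hypothetical and not part of Dataverse.

```java
import java.util.Locale;

// Hypothetical helper, not part of the Dataverse code base.
final class FeatureFlagEnvName {

    // Property prefix as documented for feature flags ("dataverse.feature.<flag>").
    static final String PREFIX = "dataverse.feature.";

    // Assumption: MicroProfile Config style mapping from property name to
    // environment variable name (non-alphanumeric characters become '_',
    // then everything is upper-cased).
    static String envVarFor(String flagName) {
        String property = PREFIX + flagName;
        return property.replaceAll("[^A-Za-z0-9_]", "_").toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(envVarFor("add-publicobject-solr-field"));
        System.out.println(envVarFor("avoid-expensive-solr-join"));
    }
}
```

Under these assumptions the two new flags would map to ``DATAVERSE_FEATURE_ADD_PUBLICOBJECT_SOLR_FIELD`` and ``DATAVERSE_FEATURE_AVOID_EXPENSIVE_SOLR_JOIN``, in line with the ``DATAVERSE_FEATURE_API_SESSION_AUTH`` example above.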

src/main/java/edu/harvard/iq/dataverse/search/IndexServiceBean.java

+10
@@ -12,6 +12,7 @@
1212
import edu.harvard.iq.dataverse.datavariable.VariableMetadataUtil;
1313
import edu.harvard.iq.dataverse.datavariable.VariableServiceBean;
1414
import edu.harvard.iq.dataverse.harvest.client.HarvestingClient;
15+
import edu.harvard.iq.dataverse.settings.FeatureFlags;
1516
import edu.harvard.iq.dataverse.settings.JvmSettings;
1617
import edu.harvard.iq.dataverse.settings.SettingsServiceBean;
1718
import edu.harvard.iq.dataverse.util.FileUtil;
@@ -214,6 +215,9 @@ public Future<String> indexDataverse(Dataverse dataverse, boolean processPaths)
214215
solrInputDocument.addField(SearchFields.DATAVERSE_CATEGORY, dataverse.getIndexableCategoryName());
215216
if (dataverse.isReleased()) {
216217
solrInputDocument.addField(SearchFields.PUBLICATION_STATUS, PUBLISHED_STRING);
218+
if (FeatureFlags.ADD_PUBLICOBJECT_SOLR_FIELD.enabled()) {
219+
solrInputDocument.addField(SearchFields.PUBLIC_OBJECT, true);
220+
}
217221
solrInputDocument.addField(SearchFields.RELEASE_OR_CREATE_DATE, dataverse.getPublicationDate());
218222
} else {
219223
solrInputDocument.addField(SearchFields.PUBLICATION_STATUS, UNPUBLISHED_STRING);
@@ -878,6 +882,9 @@ public SolrInputDocuments toSolrDocs(IndexableDataset indexableDataset, Set<Long
878882

879883
if (state.equals(indexableDataset.getDatasetState().PUBLISHED)) {
880884
solrInputDocument.addField(SearchFields.PUBLICATION_STATUS, PUBLISHED_STRING);
885+
if (FeatureFlags.ADD_PUBLICOBJECT_SOLR_FIELD.enabled()) {
886+
solrInputDocument.addField(SearchFields.PUBLIC_OBJECT, true);
887+
}
881888
// solrInputDocument.addField(SearchFields.RELEASE_OR_CREATE_DATE,
882889
// dataset.getPublicationDate());
883890
} else if (state.equals(indexableDataset.getDatasetState().WORKING_COPY)) {
@@ -1391,6 +1398,9 @@ public SolrInputDocuments toSolrDocs(IndexableDataset indexableDataset, Set<Long
13911398
if (indexableDataset.getDatasetState().equals(indexableDataset.getDatasetState().PUBLISHED)) {
13921399
fileSolrDocId = solrDocIdentifierFile + fileEntityId;
13931400
datafileSolrInputDocument.addField(SearchFields.PUBLICATION_STATUS, PUBLISHED_STRING);
1401+
if (FeatureFlags.ADD_PUBLICOBJECT_SOLR_FIELD.enabled()) {
1402+
datafileSolrInputDocument.addField(SearchFields.PUBLIC_OBJECT, true);
1403+
}
13941404
// datafileSolrInputDocument.addField(SearchFields.PERMS, publicGroupString);
13951405
addDatasetReleaseDateToSolrDoc(datafileSolrInputDocument, dataset);
13961406
// has this published file been deleted from the current draft version?
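
The three hunks above follow the same pattern for collections, datasets and files: the boolean field is added only when the object is published and the indexing flag is on. The sketch below restates that pattern in isolation. It uses a plain ``Map`` in place of ``SolrInputDocument``, passes the flag as a boolean parameter, and assumes the literal values ``publicationStatus``, ``Published`` and ``Unpublished``; the class name and document ids are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for the indexing logic; a Map replaces SolrInputDocument.
final class PublicObjectIndexingSketch {

    static final String PUBLICATION_STATUS = "publicationStatus"; // assumed field name
    static final String PUBLIC_OBJECT = "publicObject_b";         // as in SearchFields.PUBLIC_OBJECT

    static Map<String, Object> toDoc(String id, boolean released, boolean addPublicObjectField) {
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("id", id);
        if (released) {
            doc.put(PUBLICATION_STATUS, "Published"); // assumed value of PUBLISHED_STRING
            // Only published objects get the flag, and only when the
            // indexing feature flag is enabled.
            if (addPublicObjectField) {
                doc.put(PUBLIC_OBJECT, true);
            }
        } else {
            doc.put(PUBLICATION_STATUS, "Unpublished"); // assumed value of UNPUBLISHED_STRING
        }
        return doc;
    }

    public static void main(String[] args) {
        System.out.println(toDoc("dataset_1", true, true));
        System.out.println(toDoc("dataset_1", true, false));
        System.out.println(toDoc("dataset_2_draft", false, true));
    }
}
```

Documents indexed while the flag was off carry no ``publicObject_b`` field at all, which is why a full reindex is required before the search-side flag is switched on.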

src/main/java/edu/harvard/iq/dataverse/search/SearchFields.java

+9
@@ -217,6 +217,15 @@ public class SearchFields {
217217
public static final String DEFINITION_POINT_DVOBJECT_ID = "definitionPointDvObjectId";
218218
public static final String DISCOVERABLE_BY = "discoverableBy";
219219

220+
/**
221+
* publicObject_b is an experimental field tied to the
222+
* avoid-expensive-solr-join feature flag. Rather than discoverableBy which
223+
* is a field on permission documents, publicObject_b is a field on content
224+
* documents (dvObjects). By indexing publicObject_b=true, we can let guests
225+
* search on it, avoiding an expensive join for those (common) users.
226+
*/
227+
public static final String PUBLIC_OBJECT = "publicObject_b";
228+
220229
/**
221230
* i.e. "Unpublished", "Draft" (multivalued)
222231
*/

src/main/java/edu/harvard/iq/dataverse/search/SearchServiceBean.java

+125-47
@@ -16,6 +16,7 @@
1616
import edu.harvard.iq.dataverse.authorization.users.PrivateUrlUser;
1717
import edu.harvard.iq.dataverse.authorization.users.User;
1818
import edu.harvard.iq.dataverse.engine.command.DataverseRequest;
19+
import edu.harvard.iq.dataverse.settings.FeatureFlags;
1920
import edu.harvard.iq.dataverse.util.BundleUtil;
2021
import edu.harvard.iq.dataverse.util.SystemConfig;
2122
import java.io.IOException;
@@ -1001,14 +1002,132 @@ private String getPermissionFilterQuery(DataverseRequest dataverseRequest, SolrQ
10011002
user = GuestUser.get();
10021003
}
10031004

1005+
AuthenticatedUser au = null;
1006+
Set<Group> groups;
1007+
1008+
if (user instanceof GuestUser) {
1009+
// Yes, GuestUser may be part of one or more groups; such as IP Groups.
1010+
groups = groupService.collectAncestors(groupService.groupsFor(dataverseRequest));
1011+
} else {
1012+
if (!(user instanceof AuthenticatedUser)) {
1013+
logger.severe("Should never reach here. A User must be an AuthenticatedUser or a Guest");
1014+
throw new IllegalStateException("A User must be an AuthenticatedUser or a Guest");
1015+
}
1016+
1017+
au = (AuthenticatedUser) user;
1018+
1019+
// ----------------------------------------------------
1020+
// (3) Is this a Super User?
1021+
// If so, they can see everything
1022+
// ----------------------------------------------------
1023+
if (au.isSuperuser()) {
1024+
// Somewhat dangerous because this user (a superuser) will be able
1025+
// to see everything in Solr with no regard to permissions. But it's
1026+
// been this way since Dataverse 4.0. So relax. :)
1027+
1028+
return dangerZoneNoSolrJoin;
1029+
}
1030+
1031+
// ----------------------------------------------------
1032+
// (4) User is logged in AND onlyDatatRelatedToMe == true
1033+
// Yes, give back everything -> the settings will be in
1034+
// the filterqueries given to search
1035+
// ----------------------------------------------------
1036+
if (onlyDatatRelatedToMe == true) {
1037+
if (systemConfig.myDataDoesNotUsePermissionDocs()) {
1038+
logger.fine("old 4.2 behavior: MyData is not using Solr permission docs");
1039+
return dangerZoneNoSolrJoin;
1040+
} else {
1041+
// fall-through
1042+
logger.fine("new post-4.2 behavior: MyData is using Solr permission docs");
1043+
}
1044+
}
1045+
1046+
// ----------------------------------------------------
1047+
// (5) Work with Authenticated User who is not a Superuser
1048+
// ----------------------------------------------------
1049+
1050+
groups = groupService.collectAncestors(groupService.groupsFor(dataverseRequest));
1051+
}
1052+
1053+
if (FeatureFlags.AVOID_EXPENSIVE_SOLR_JOIN.enabled()) {
1054+
/**
1055+
* Instead of doing a super expensive join, we will rely on the
1056+
* new boolean field publicObject_b:true for public objects. This field
1057+
* is indexed on the content document itself, rather than a permission
1058+
* document. An additional join will be added only for any extra,
1059+
* more restricted groups that the user may be part of.
1060+
* **Note the experimental nature of this optimization**.
1061+
*/
1062+
StringBuilder sb = new StringBuilder();
1063+
StringBuilder sbgroups = new StringBuilder();
1064+
1065+
// All users, guests and authenticated, should see all the
1066+
// documents marked as publicObject_b:true, at least:
1067+
sb.append(SearchFields.PUBLIC_OBJECT + ":" + true);
1068+
1069+
// One or more groups *may* also be available for this user. Once again,
1070+
// do note that Guest users may be part of some groups, such as
1071+
// IP groups.
1072+
1073+
int groupCounter = 0;
1074+
1075+
// An AuthenticatedUser should also be able to see all the content
1076+
// on which they have direct permissions:
1077+
if (au != null) {
1078+
groupCounter++;
1079+
sbgroups.append(IndexServiceBean.getGroupPerUserPrefix() + au.getId());
1080+
}
1081+
1082+
// In addition to the user referenced directly, we will also
1083+
// add joins on all the non-public groups that may exist for the
1084+
// user:
1085+
for (Group group : groups) {
1086+
String groupAlias = group.getAlias();
1087+
if (groupAlias != null && !groupAlias.isEmpty() && !groupAlias.startsWith("builtIn")) {
1088+
groupCounter++;
1089+
if (groupCounter > 1) {
1090+
sbgroups.append(" OR ");
1091+
}
1092+
sbgroups.append(IndexServiceBean.getGroupPrefix() + groupAlias);
1093+
}
1094+
}
1095+
1096+
if (groupCounter > 1) {
1097+
// If there is more than one group, the parentheses must be added:
1098+
sbgroups.insert(0, "(");
1099+
sbgroups.append(")");
1100+
}
1101+
1102+
if (groupCounter > 0) {
1103+
// If there are any groups for this user, an extra join must be
1104+
// added to the query, and the extra sub-query must be added to
1105+
// the combined Solr query:
1106+
sb.append(" OR {!join from=" + SearchFields.DEFINITION_POINT + " to=id v=$q1}");
1107+
// Add the subquery to the combined Solr query:
1108+
solrQuery.setParam("q1", SearchFields.DISCOVERABLE_BY + ":" + sbgroups.toString());
1109+
logger.info("The sub-query q1 set to " + SearchFields.DISCOVERABLE_BY + ":" + sbgroups.toString());
1110+
}
1111+
1112+
String ret = sb.toString();
1113+
logger.info("Returning experimental query: " + ret);
1114+
return ret;
1115+
}
1116+
1117+
// END OF EXPERIMENTAL OPTIMIZATION
1118+
1119+
// Old, un-optimized way of handling permissions.
1120+
// Largely left intact, minus the lookups that have already been performed
1121+
// above.
1122+
10041123
// ----------------------------------------------------
10051124
// (1) Is this a GuestUser?
1006-
// Yes, see if GuestUser is part of any groups such as IP Groups.
10071125
// ----------------------------------------------------
10081126
if (user instanceof GuestUser) {
1009-
String groupsFromProviders = "";
1010-
Set<Group> groups = groupService.collectAncestors(groupService.groupsFor(dataverseRequest));
1127+
10111128
StringBuilder sb = new StringBuilder();
1129+
1130+
String groupsFromProviders = "";
10121131
for (Group group : groups) {
10131132
logger.fine("found group " + group.getIdentifier() + " with alias " + group.getAlias());
10141133
String groupAlias = group.getAlias();
@@ -1025,51 +1144,11 @@ private String getPermissionFilterQuery(DataverseRequest dataverseRequest, SolrQ
10251144
return guestWithGroups;
10261145
}
10271146

1028-
// ----------------------------------------------------
1029-
// (2) Retrieve Authenticated User
1030-
// ----------------------------------------------------
1031-
if (!(user instanceof AuthenticatedUser)) {
1032-
logger.severe("Should never reach here. A User must be an AuthenticatedUser or a Guest");
1033-
throw new IllegalStateException("A User must be an AuthenticatedUser or a Guest");
1034-
}
1035-
1036-
AuthenticatedUser au = (AuthenticatedUser) user;
1037-
1038-
// if (addFacets) {
1039-
// // Logged in user, has publication status facet
1040-
// //
1041-
// solrQuery.addFacetField(SearchFields.PUBLICATION_STATUS);
1042-
// }
1043-
1044-
// ----------------------------------------------------
1045-
// (3) Is this a Super User?
1046-
// Yes, give back everything
1047-
// ----------------------------------------------------
1048-
if (au.isSuperuser()) {
1049-
// Somewhat dangerous because this user (a superuser) will be able
1050-
// to see everything in Solr with no regard to permissions. But it's
1051-
// been this way since Dataverse 4.0. So relax. :)
1052-
1053-
return dangerZoneNoSolrJoin;
1054-
}
1055-
1056-
// ----------------------------------------------------
1057-
// (4) User is logged in AND onlyDatatRelatedToMe == true
1058-
// Yes, give back everything -> the settings will be in
1059-
// the filterqueries given to search
1060-
// ----------------------------------------------------
1061-
if (onlyDatatRelatedToMe == true) {
1062-
if (systemConfig.myDataDoesNotUsePermissionDocs()) {
1063-
logger.fine("old 4.2 behavior: MyData is not using Solr permission docs");
1064-
return dangerZoneNoSolrJoin;
1065-
} else {
1066-
logger.fine("new post-4.2 behavior: MyData is using Solr permission docs");
1067-
}
1068-
}
1069-
10701147
// ----------------------------------------------------
10711148
// (5) Work with Authenticated User who is not a Superuser
1072-
// ----------------------------------------------------
1149+
// ----------------------------------------------------
1150+
// It was already confirmed above that if the user is not a GuestUser, we
1151+
// have an AuthenticatedUser au which is not null.
10731152
/**
10741153
* @todo all this code needs cleanup and clarification.
10751154
*/
@@ -1100,7 +1179,6 @@ private String getPermissionFilterQuery(DataverseRequest dataverseRequest, SolrQ
11001179
* a given "content document" (dataset version, etc) in Solr.
11011180
*/
11021181
String groupsFromProviders = "";
1103-
Set<Group> groups = groupService.collectAncestors(groupService.groupsFor(dataverseRequest));
11041182
StringBuilder sb = new StringBuilder();
11051183
for (Group group : groups) {
11061184
logger.fine("found group " + group.getIdentifier() + " with alias " + group.getAlias());
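
To make the shape of the optimized query easier to see, here is a simplified, standalone sketch of the string-building logic in the experimental branch above. It is not the actual implementation: the value ``definitionPointDocId`` for ``SearchFields.DEFINITION_POINT`` and the prefixes ``group_`` and ``group_user`` for ``IndexServiceBean.getGroupPrefix()`` and ``getGroupPerUserPrefix()`` are assumptions, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified re-statement of the experimental filter query logic.
final class PermissionFilterSketch {

    static final String PUBLIC_OBJECT = "publicObject_b";          // from SearchFields
    static final String DISCOVERABLE_BY = "discoverableBy";        // from SearchFields
    static final String DEFINITION_POINT = "definitionPointDocId"; // assumed value
    static final String GROUP_PREFIX = "group_";                   // assumed value
    static final String GROUP_PER_USER_PREFIX = "group_user";      // assumed value

    // Builds the sub-query passed as the "q1" parameter; returns "" when the
    // user has no direct or group-based permissions to join on.
    static String subQuery(Long userId, List<String> groupAliases) {
        List<String> terms = new ArrayList<>();
        if (userId != null) {
            terms.add(GROUP_PER_USER_PREFIX + userId);
        }
        for (String alias : groupAliases) {
            // Built-in groups are skipped, as in the code above.
            if (alias != null && !alias.isEmpty() && !alias.startsWith("builtIn")) {
                terms.add(GROUP_PREFIX + alias);
            }
        }
        if (terms.isEmpty()) {
            return "";
        }
        String joined = String.join(" OR ", terms);
        // Parentheses are only needed when there is more than one term.
        return DISCOVERABLE_BY + ":" + (terms.size() > 1 ? "(" + joined + ")" : joined);
    }

    // Every user sees public objects; the join is added only when a
    // non-empty sub-query exists.
    static String filterQuery(String subQuery) {
        String fq = PUBLIC_OBJECT + ":true";
        if (!subQuery.isEmpty()) {
            fq += " OR {!join from=" + DEFINITION_POINT + " to=id v=$q1}";
        }
        return fq;
    }

    public static void main(String[] args) {
        String q1 = subQuery(7L, Arrays.asList("ipGroup1", "builtInExample"));
        System.out.println("fq = " + filterQuery(q1));
        System.out.println("q1 = " + q1);
    }
}
```

The point of the design is that a guest with no IP groups gets a filter with no join at all, and every other user gets a join restricted to their own user and group permission documents rather than to the bulk of the public content.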

src/main/java/edu/harvard/iq/dataverse/settings/FeatureFlags.java

+22
@@ -36,6 +36,28 @@ public enum FeatureFlags {
3636
* @since Dataverse @TODO:
3737
*/
3838
API_BEARER_AUTH("api-bearer-auth"),
39+
/**
40+
* For published (public) objects, don't use a join when searching Solr.
41+
* Experimental! Requires a reindex with the following feature flag enabled,
42+
* in order to add the boolean publicObject_b:true field to all the public
43+
* Solr documents.
44+
*
45+
* @apiNote Raise flag by setting
46+
* "dataverse.feature.avoid-expensive-solr-join"
47+
* @since Dataverse 6.3
48+
*/
49+
AVOID_EXPENSIVE_SOLR_JOIN("avoid-expensive-solr-join"),
50+
/**
51+
* With this flag enabled, the boolean field publicObject_b:true will be
52+
* added to all the indexed Solr documents for publicly-available collections,
53+
* datasets and files. This flag makes it possible to rely on it in searches,
54+
* instead of the very expensive join (the feature flag above).
55+
*
56+
* @apiNote Raise flag by setting
57+
* "dataverse.feature.add-publicobject-solr-field"
58+
* @since Dataverse 6.3
59+
*/
60+
ADD_PUBLICOBJECT_SOLR_FIELD("add-publicobject-solr-field"),
3961
;
4062

4163
final String flag;
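
The relationship between the two new flags can be illustrated with a small, standalone enum. This is only a sketch: it reads plain JVM system properties and accepts just ``true`` and ``1``, whereas the real ``FeatureFlags`` enum resolves values through the settings layer and MicroProfile Config. The enum name and the ``searchFlagWithoutIndexFlag`` check are hypothetical and do not exist in Dataverse.

```java
// Hypothetical sketch; the real FeatureFlags enum does not read system
// properties directly.
enum FeatureFlagSketch {
    AVOID_EXPENSIVE_SOLR_JOIN("avoid-expensive-solr-join"),
    ADD_PUBLICOBJECT_SOLR_FIELD("add-publicobject-solr-field");

    final String flag;

    FeatureFlagSketch(String flag) {
        this.flag = flag;
    }

    String propertyName() {
        return "dataverse.feature." + flag;
    }

    // Simplification: only "true" (any case) and "1" count as enabled.
    boolean enabled() {
        String value = System.getProperty(propertyName());
        return value != null && (value.equalsIgnoreCase("true") || value.equals("1"));
    }

    // Hypothetical sanity check for the documented prerequisite: the search
    // flag should not be on while the indexing flag is off. It cannot tell
    // whether a full reindex has actually been performed.
    static boolean searchFlagWithoutIndexFlag() {
        return AVOID_EXPENSIVE_SOLR_JOIN.enabled() && !ADD_PUBLICOBJECT_SOLR_FIELD.enabled();
    }

    public static void main(String[] args) {
        for (FeatureFlagSketch f : values()) {
            System.out.println(f.propertyName() + " enabled: " + f.enabled());
        }
        System.out.println("search flag without index flag: " + searchFlagWithoutIndexFlag());
    }
}
```

Run with, for example, ``-Ddataverse.feature.avoid-expensive-solr-join=true`` to see the misconfiguration check report ``true``.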
