Solr support in AtoM #1817

Status: Open. Wants to merge 106 commits into base branch qa/2.x.


@anvit (Contributor) commented May 16, 2024

Work in progress branch for adding support for Solr for searching within AtoM

Completed:

  • Docker configuration that starts Solr and ZooKeeper containers. (Solr uses ZooKeeper to coordinate and sync multiple Solr nodes when run in cloud mode.)
  • A Solr plugin (arSolrPlugin) that serves as the Solr equivalent of arElasticSearchPlugin. It talks to Solr and provides functions for indexing and searching.
  • A solr:populate task (arSolrPopulateTask) that indexes AtoM data into Solr. The indexed data can be viewed and searched in the Solr dashboard at http://localhost:8983/solr.
  • A set of classes that act as the equivalent of Elastica within AtoM, located in the arSolrPlugin/lib/client folder. The query classes set up query parameters for API requests to Solr. arSolrClient accepts the configuration it needs to communicate with Solr and has methods for sending the different API requests.
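
To illustrate the split described in the last item, here is a minimal sketch in Python rather than AtoM's PHP. The class names, method names, and the `atom` collection name are hypothetical and not taken from the branch. The query object only produces request parameters, and the client only turns them into a URL for Solr's `select` handler.

```python
from urllib.parse import urlencode


class TermQuery:
    """Hypothetical query class: holds a field/value pair and emits Solr params."""

    def __init__(self, field, value):
        self.field = field
        self.value = value

    def get_query_params(self):
        # Standard query parser syntax "field:value"; the value is quoted so
        # that whitespace is treated as part of the term.
        return {"q": f'{self.field}:"{self.value}"'}


class SolrClient:
    """Hypothetical client: knows where Solr lives, not how queries are built."""

    def __init__(self, host="localhost", port=8983, collection="atom"):
        # "atom" is an assumed collection name used only for this example.
        self.base_url = f"http://{host}:{port}/solr/{collection}"

    def build_select_url(self, query, rows=10):
        # Merge the query's own params with paging params, then URL-encode.
        params = dict(query.get_query_params(), rows=rows)
        return f"{self.base_url}/select?{urlencode(params)}"


client = SolrClient()
url = client.build_select_url(TermQuery("slug", "example-fonds"))
print(url)
```

Keeping parameter generation separate from transport means the query classes can be unit tested without a running Solr instance.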

Work in progress:

  • arSolrSearchTask is a CLI task that allows searching the Solr index with a few query types. Since queries can get fairly complicated, especially Boolean queries, it was meant for quick CLI testing until Solr is officially supported by the AtoM interface, and so it isn't very customizable. However, it could be useful for writing tests in the future.
  • Unit tests have been added for several Solr query classes. Solr's Boolean Query, Result and Result Set, and the Solr Client do not have any tests yet.

TODO

Within arSolrPlugin

High priority (essential for browse or search actions):

  • Add a class for handling nested search: Currently there is no class for handling nested search among the query classes we have for Solr. Solr doesn't have a built-in nested query like ElasticSearch does, since it doesn't treat nested fields in a special way. This means that while it may be possible to perform those searches with a simple boolean query that targets the nested fields, we would need to ensure we're matching results within the same nested unit (for instance, when searching for date ranges, we must not mix a start date from one event with an end date from a different event of the same information object).
  • Add authentication to Solr Client (arSolrPlugin): Currently the username and password are ignored, as the current Solr setup doesn't configure them either.
  • Change getDateRangeQuery's Nested Query call (arSolrPluginQuery): Since there is no nested query class for Solr yet, this call will need to be updated once that functionality is in place.
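
The date range problem from the nested search item can be shown with a small, self-contained sketch. It is written in Python with made-up event data and hypothetical function names, and it does not talk to Solr. It only contrasts matching against flattened start/end values with matching inside a single event.

```python
from datetime import date

# One information object with two events (made-up data for illustration).
events = [
    {"start": date(1900, 1, 1), "end": date(1910, 12, 31)},
    {"start": date(1950, 1, 1), "end": date(1960, 12, 31)},
]


def matches_flattened(events, range_start, range_end):
    """Flattened check: any start and any end may satisfy the range,
    even when they come from different events."""
    starts = [e["start"] for e in events]
    ends = [e["end"] for e in events]
    return any(s <= range_end for s in starts) and any(e >= range_start for e in ends)


def matches_nested(events, range_start, range_end):
    """Nested check: a single event must overlap the range by itself."""
    return any(e["start"] <= range_end and e["end"] >= range_start for e in events)


# The search range 1920-1930 overlaps neither event.
search_range = (date(1920, 1, 1), date(1930, 12, 31))
print(matches_flattened(events, *search_range))  # True: a false positive
print(matches_nested(events, *search_range))     # False: the desired result
```

The flattened check pairs the 1900 start date with the 1960 end date, which is exactly the mix-up described above. Solr's nested (child) documents with the block join query parsers are one possible way to get the second behaviour, though whether that fits AtoM's schema has not been tested here.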

Medium priority (not essential for basic search but still important):

  • updateByQuery method/function (arSolrPlugin.class): This class will need a method to handle updating specific documents by query.
  • Create Diacritics analyzer (arSolrPlugin.class)
  • Create Brazilian Portuguese analyzer (arSolrPlugin.class): Solr doesn't have a default pt_BR analyzer but has specific filter classes we can use.
  • Ensure PDFs are also indexed by Solr (arSolrPlugin.class): This will need Apache Tika to work with external documents.
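
For the updateByQuery item, one possible approach (an assumption, not something implemented in this branch) is to run the query first, collect the matching ids, and then send Solr atomic updates for those ids. The Python sketch below only builds the JSON payload. The function name, ids, field name, and values are hypothetical.

```python
import json


def build_atomic_updates(doc_ids, field, value):
    """Build a Solr atomic-update payload that sets `field` to `value`
    for every id. The ids would come from a prior search request."""
    return [{"id": doc_id, field: {"set": value}} for doc_id in doc_ids]


# Hypothetical ids returned by an earlier query against the index.
matched_ids = ["101", "102", "103"]
payload = build_atomic_updates(matched_ids, "publicationStatusId", 160)
print(json.dumps(payload, indent=2))
```

A payload like this would be posted as JSON to the collection's update handler. Atomic updates place requirements on the schema (for example, fields being stored or having docValues), so this would need checking against the AtoM field definitions.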

Low priority (used by CLI tasks or other non search specific actions within AtoM):

Lowest priority (good to have features):

  • Add support for Solr server mode: Currently the Docker config, as well as a couple of collection-based things, assume that Solr will only be run in cloud mode. (Cloud mode uses multiple Solr nodes, which is most similar to how ElasticSearch is usually configured with AtoM. Server mode has a single node, uses slightly different API endpoints for a few requests, and doesn't need ZooKeeper.)
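
As an example of the endpoint difference, creating an index goes through the Collections API in cloud mode and through the CoreAdmin API on a single node. The Python sketch below picks the URL by mode. The function name, the `mode` values, and the `atom` index name are assumptions for illustration, and real requests would need further parameters (for example a configset).

```python
from urllib.parse import urlencode


def create_index_url(base_url, name, mode="cloud"):
    """Return the admin URL that creates a collection (cloud mode)
    or a core (server mode) called `name`."""
    if mode == "cloud":
        path = "/admin/collections"
        params = {"action": "CREATE", "name": name, "numShards": 1}
    elif mode == "server":
        path = "/admin/cores"
        params = {"action": "CREATE", "name": name}
    else:
        raise ValueError(f"Unknown Solr mode: {mode}")
    return f"{base_url}{path}?{urlencode(params)}"


base = "http://localhost:8983/solr"
cloud_url = create_index_url(base, "atom", mode="cloud")
server_url = create_index_url(base, "atom", mode="server")
print(cloud_url)
print(server_url)
```

Routing the mode decision through one helper like this would keep the rest of the client unaware of which mode is in use.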

Outside arSolrPlugin

  • AtoM extensively references Elastica, and the arElasticSearchPlugin is also deeply integrated into it. As of now, this is a list of all of the places outside the plugin itself that would need updates:

  • apps/qubit/modules/digitalobject/actions/imageflowComponent.class.php uses arElasticSearchPluginQuery, QubitSearch.

  • apps/qubit/modules/clipboard/actions/viewAction.class.php uses Elastica ResultSet, Response, Query, QueryTerms, QubitSearchPager, arElasticSearchPluginConfiguration.

  • apps/qubit/modules/default/actions/moveAction.class.php uses Elastica Query, BoolQuery, QueryTerm, QubitSearchPager, arElasticSearchPluginUtil, arElasticSearchPluginConfiguration.

  • apps/qubit/modules/default/actions/fullTreeViewAction.class.php uses Elastica QueryTerm, Elastica ResultSet (as arguments to methods), has several method names which reference ElasticSearch, arElasticSearchPluginQuery.

  • apps/qubit/modules/default/actions/browseAction.class.php uses arElasticSearchPluginQuery, arElasticSearchPluginConfiguration, QubitSearch.

  • 👆🏼 NOTE: replace L#134-L#147 (the section that essentially removes must clauses for i18n.languages queries) with a call to the removeMustWithTermField method in arSolrBoolQuery

  • apps/qubit/modules/repository/actions/holdingsAction.class.php uses Elastica QueryBool, QueryMatchAll, QueryTerm, Query, QubitSearch, arElasticSearchPluginConfiguration.

  • apps/qubit/modules/repository/actions/browseAction.class.php uses Elastica QueryMatchAll, Query, QueryTerm, arElasticSearchPluginUtil, QubitSearch.

  • apps/qubit/modules/repository/actions/maintainedActorsAction.class.php uses Elastica Query, QueryTerm, QubitSearch, QubitSearchPager, arElasticSearchPluginConfiguration.

  • apps/qubit/modules/taxonomy/actions/indexAction.class.php uses Elastica Query, BoolQuery, QueryTerm, arElasticSearchPluginUtil, arElasticSearchPluginConfiguration, QubitSearch, QubitSearchPager.

  • apps/qubit/modules/actor/actions/browseAction.class.php uses Elastica BoolQuery, QueryTerm, QueryExists, NestedQuery, arElasticSearchPluginUtil, QubitSearch, QubitSearchPager.

  • apps/qubit/modules/actor/actions/relatedInformationObjectsAction.class.php uses Elastica Query, BoolQuery, QueryTerm, NestedQuery, QubitSearchPager, QubitSearch, arElasticSearchPluginConfiguration.

  • apps/qubit/modules/search/actions/errorAction.class.php uses Elastica Exception, references ElasticSearch in error message.

  • apps/qubit/modules/search/actions/indexAction.class.php uses Elastica QueryTerm, QubitSearch, arElasticSearchPluginUtil.

  • apps/qubit/modules/search/actions/autocompleteAction.class.php uses Elastica Search, MultiSearch, Query, BoolQuery, Match, Term, QubitSearch.

  • apps/qubit/modules/search/actions/descriptionUpdatesAction.class.php uses Elastica Query, BoolQuery, QueryTerm, QueryRange, QubitSearch, QubitSearchPager, arElasticSearchPluginConfiguration.

  • apps/qubit/modules/term/actions/navigateRelatedComponent.class.php uses Elastica QueryTerm, QubitSearch, arElasticSearchPluginQuery.

  • apps/qubit/modules/term/actions/indexAction.class.php uses Elastica QueryTerms, Query, BoolQuery, QueryTerm, QubitSearch, QubitSearchPager.

  • apps/qubit/modules/informationobject/actions/inventoryAction.class.php uses Elastica BoolQuery, Query, QueryTerm, QueryTerms, QubitSearch, QubitSearchPager, arElasticSearchPluginConfiguration.

  • apps/qubit/modules/informationobject/actions/autocompleteAction.class.php uses Elastica Query, BoolQuery, MatchAll, QueryTerm, arElasticSearchPluginUtil, QubitSearch, QubitSearchPager.

  • lib/filter/QubitMeta.class.php references Elastica Exception.

  • lib/QubitLftSyncer.class.php uses Elastica Bulk, QueryTerm, Document, QubitSearch, arElasticSearchPluginQuery.

  • lib/search/QubitSearchPager.class.php uses Elastica ResultSet.

  • lib/helper/QubitHelper.php references Elastica Result.

  • lib/job/arUpdateEsActorRelationsJob.class.php references Elastica exception, QubitSearch, arElasticSearchActorPdo.

  • lib/job/arActorExportJob.class.php uses Elastica QueryTerms, arElasticSearchPluginUtil, QubitSearch.

  • lib/job/arRepositoryCsvExportJob.class.php uses Elastica QueryTerms, arElasticSearchPluginQuery, arElasticSearchPluginUtil, QubitSearch.

  • lib/job/arUpdatePublicationStatusJob.class.php uses Elastica AbstractScript, QueryTerm, QubitSearch.

  • lib/job/arInformationObjectExportJob.class.php uses Elastica QueryTerm, QueryTerms, arElasticSearchPluginUtil, arElasticSearchPluginQuery, QubitSearch.

  • lib/task/tools/updatePublicationStatusTask.class.php uses Elastica AbstractScript, QueryTerm, QubitSearch.

  • lib/task/propel/propelGenerateSlugsTask.class.php uses Elastica Query, BoolQuery, QueryTerm, QubitSearch.

  • lib/model/QubitInformationObject.php uses Elastica BoolQuery, Query, QueryMatch, QubitSearch.

  • lib/model/QubitTerm.php uses Elastica BoolQuery, QueryTerm, QubitSearch.

  • lib/task/search/arSearchStatusTask.class.php uses arElasticSearchPluginConfiguration, looks for class names starting with arElasticSearch in objectsAvailableToIndex.

  • lib/task/tools/installTask.class.php uses arElasticSearchPluginConfiguration.

  • lib/job/arUpdateEsIoDocumentsJob.class.php uses arElasticSearchInformationObject.

  • lib/job/arUpdateEsActorRelationsJob.class.php uses arElasticSearchActorPdo.

  • lib/job/arActorExportJob.class.php uses arElasticSearchPluginUtil, arElasticSearchPluginQuery.

  • lib/arInstall.class.php references arElasticSearchPlugin's search.yml and uses arElasticSearchConfigHandler.

  • lib/task/import/csvImportTask.class.php uses arElasticSearchInformationObjectPdo, QubitSearch.

  • lib/QubitMetsParser.class.php uses arElasticSearchPluginUtil.

  • lib/search/QubitSearch.class.php uses arElasticSearchPlugin.

  • lib/search/QubitSearchEngine.class.php references ElasticSearch.

  • lib/QubitFlatfileImport.class.php references ElasticSearch.

  • lib/task/propel/propelGenerateSlugsTask.class.php references ElasticSearch

  • config/ProjectConfiguration.class.php sets up arElasticSearchPlugin.

  • plugins/qbAclPlugin/lib/QubitAclSearch.class.php uses Elastica Query, BoolQuery, QueryTerm.

  • plugins/sfSkosPlugin/test/unit/importTest.php uses Elastica Exception, QubitSearch.

  • plugins/arRestApiPlugin/lib/QubitApiAction.class.php uses Elastica Query.

  • plugins/arRestApiPlugin/modules/api/actions/informationobjectsBrowseAction.class.php uses arElasticSearchPluginConfiguration, arElasticSearchPluginQuery.

  • plugins/qtAccessionPlugin/modules/accession/actions/browseAction.class.php uses Elastica Query, BoolQuery, QueryMatchAll, QubitSearch, QubitSearchPager, arElasticSearchPluginUtil, arElasticSearchPluginConfiguration.

  • test/unit/escapeTermTest.php tests arElasticSearchPluginUtil::escapeTerm
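
For the note on browseAction.class.php in the list above, the behaviour expected from removeMustWithTermField can be sketched as follows. This is a Python illustration based only on the method's description (remove any term queries with a given field from the must clause). The classes here are simplified stand-ins, not the arSolrBoolQuery implementation, and the field values are made up.

```python
class TermQuery:
    """Simplified stand-in for a term query."""

    def __init__(self, field, value):
        self.field = field
        self.value = value


class BoolQuery:
    """Simplified stand-in for a bool query that only tracks must clauses."""

    def __init__(self):
        self.must = []

    def add_must(self, query):
        self.must.append(query)
        return self

    def remove_must_with_term_field(self, field):
        # Keep every must clause except term queries on the given field.
        self.must = [
            q for q in self.must
            if not (isinstance(q, TermQuery) and q.field == field)
        ]
        return self


query = BoolQuery()
query.add_must(TermQuery("i18n.languages", "en"))
query.add_must(TermQuery("repository.id", 42))
query.remove_must_with_term_field("i18n.languages")

remaining = [q.field for q in query.must]
print(remaining)  # ['repository.id']
```

With a method like this on the bool query, the action no longer needs to loop over the clauses itself.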


In addition to the list above, other tasks that would need to be completed in order to switch to Solr:

  • Make the Solr plugin a default plugin that is enabled out of the box
  • Update installTask to set up a config file for Solr in the root config folder (similar to ES), and change arSolrPluginConfiguration to point to this file
  • Create a new Vagrant setup for development with Solr
  • Update AtoM Docs: New documentation would need to be added that details installation and configuration. ElasticSearch advanced queries would also no longer work, but their documentation could be replaced with documentation for Solr's query syntax, which would allow performing complex custom queries.
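
Documentation of Solr's query syntax would probably need to cover escaping, since the standard query parser gives special meaning to a number of characters. The sketch below is a hypothetical Python helper, not the logic of arElasticSearchPluginUtil::escapeTerm, that backslash-escapes those characters in user input.

```python
import re

# Characters and operators with special meaning in Solr's standard query parser.
_SPECIAL = re.compile(r'([+\-!(){}\[\]^"~*?:\\/]|&&|\|\|)')


def escape_term(term):
    """Prefix each special character or operator with a backslash."""
    return _SPECIAL.sub(r"\\\1", term)


escaped = escape_term("title:(a+b)")
print(escaped)
```

Without escaping, a search string such as `title:(a+b)` would be parsed as a fielded query rather than matched as literal text.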

@anvit anvit added this to the 2.9.0 milestone May 16, 2024
@anvit anvit force-pushed the dev/solr-plugin-wip branch from 81d9072 to aa74860 Compare May 16, 2024 17:36
@anvit anvit requested a review from a team May 16, 2024 17:38
@anvit anvit self-assigned this May 16, 2024
@anvit anvit force-pushed the dev/solr-plugin-wip branch from aa74860 to 2ea4787 Compare May 16, 2024 17:42
@melaniekung melaniekung force-pushed the dev/solr-plugin-wip branch 4 times, most recently from 5102508 to db7a3b7 Compare May 23, 2024 14:20
@anvit anvit force-pushed the dev/solr-plugin-wip branch from bfeeaad to 9b2384c Compare May 23, 2024 19:02
@melaniekung melaniekung force-pushed the dev/solr-plugin-wip branch 4 times, most recently from ce00129 to 7e5d6d7 Compare May 31, 2024 09:30
@melaniekung melaniekung force-pushed the dev/solr-plugin-wip branch 6 times, most recently from 2443cec to 2990cdb Compare June 6, 2024 07:10
@anvit anvit force-pushed the dev/solr-plugin-wip branch 4 times, most recently from 5f1265a to b663d51 Compare June 15, 2024 00:17
@anvit anvit force-pushed the dev/solr-plugin-wip branch 2 times, most recently from da343ae to ca14a5b Compare June 18, 2024 22:55
@melaniekung melaniekung force-pushed the dev/solr-plugin-wip branch from 2fd485a to 3d605c3 Compare June 19, 2024 13:33
anvit and others added 23 commits August 29, 2024 13:58
  • Update arSolrRangeQuery and ArSolrRangeQueryTest to account for types
  • Update arSolrBoolQuery to use the query params for each of the clauses instead of using edismax queries, to extend support to query types that do not use edismax. Also change the _addQuery method to allow all queries that are instances of arSolrAbstractQuery instead of just arSolrQuery
  • Add support for sorting and aggregations to arSolrBoolQuery. TODO: Add tests for arSolrBoolQuery
  • Add arSolrTermsQuery and associated tests. Also fix typo in a property name in arSolrTermQuery.
  • Add arSolrIdsQuery and associated tests
  • Add a method to arSolrBoolQuery that sets the types for its child queries. Also add a method for setting filters for bool queries.
  • Add a method that appends types to the aggregations before the query params are generated.
  • Add a method to remove any term queries with a given field from the must clause in arSolrBoolQuery.
  • Change the generateQueryString method in arSolrPluginUtil to create a Solr query from the input string and rename the method to generateQuery.
  • Set the version param from hit in arSolrResult
  • Change arSolrQuery to arSolrStringQuery to avoid confusion with Elastica's Query
  • Refactor code in arSolrPlugin that talked to Solr into arSolrClient. Also rename the query folder to client for clarity, and fix a bug that was skipping over autocomplete fields in arSolrResult.
@anvit anvit force-pushed the dev/solr-plugin-wip branch from fd9068d to d1ceb6c Compare August 29, 2024 20:59
@anvit anvit force-pushed the dev/solr-plugin-wip branch from 23f69d1 to f0876e4 Compare August 29, 2024 21:58
  • Update arSolrPluginQuery to use a single bool query directly instead of using a query container and a separate boolean query.
  • Add updateDocument and updateDocumentById to arSolrClient which enable updating existing documents in the solr index. Also add functions to arSolrPlugin to call these from AtoM.
@anvit anvit removed this from the 2.9.0 milestone Dec 4, 2024

3 participants