Feature: Enhance SOLR integration and add a Schema API #54

Merged
merged 25 commits into development from feature/add-model-based-search
Mar 2, 2024

syphax-bouazzouni commented Feb 3, 2024

Prerequisites

Goals

  • Make it easier to configure SOLR programmatically.
  • Make it possible to index any model on demand.

Context

SOLR is the indexing tool that we use for our search features. It works by defining a collection (a table, in database terms) and, for each collection, a schema that declares the properties to index: their type, whether each is a list, and so on, plus dynamic or special fields to handle fuzzy search and similar features.

Previously, we had to define the collection and the schema in XML configuration files up front, and could not change them from the code afterwards. This limited what we could do and made it hard to add new search features to our system: the files were static, so you had to update the schema and recreate the collection configuration files each time you wanted to add something to the index.

This PR integrates the SOLR Schema API into this project, which lets us create or delete a collection and update a collection schema dynamically. The following actions were implemented (see the full list in the SOLR::Admin, SOLR::Schema and SOLR::SchemaGenerator modules in the code):

  • SOLR Administration API
    fetch_all_collections
    create_collection(name = @collection_name, num_shards = 1, replication_factor = 1)
    delete_collection(collection_name = @collection_name)
    collection_exists?(collection_name)
  • SOLR Schema API
    add_field(name, type, indexed: true, stored: true, multi_valued: false)
    add_dynamic_field(name, type, indexed: true, stored: true, multi_valued: false)
    add_copy_field(source, dest)
    add_field_type(type_definition)
    delete_field(name)
    update_schema(schema_json)
    fetch_schema
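
For readers unfamiliar with the underlying Solr endpoints, here is a minimal sketch of the kind of HTTP requests that methods such as create_collection, add_field and add_copy_field could translate to. It is not the code of this PR: the helper names and the `SOLR_URL` value are assumptions made for illustration, and the Collections API call assumes Solr runs in SolrCloud mode. Only the endpoint paths and the JSON command keys (`add-field`, `add-copy-field`, `multiValued`) come from Solr itself.

```ruby
require 'json'
require 'net/http'
require 'uri'

# Assumed address of a local Solr instance (hypothetical, adjust as needed).
SOLR_URL = 'http://localhost:8983/solr'

# Build the Collections API URL that creates a collection.
# Helper name and keyword arguments are illustrative only.
def create_collection_uri(name, num_shards: 1, replication_factor: 1, base: SOLR_URL)
  uri = URI("#{base}/admin/collections")
  uri.query = URI.encode_www_form(
    action: 'CREATE', name: name,
    numShards: num_shards, replicationFactor: replication_factor
  )
  uri
end

# Build the body of a Schema API "add-field" command.
# Ruby-style multi_valued is mapped to Solr's camelCase multiValued.
def add_field_command(name, type, indexed: true, stored: true, multi_valued: false)
  { 'add-field' => { 'name' => name.to_s, 'type' => type,
                     'indexed' => indexed, 'stored' => stored,
                     'multiValued' => multi_valued } }
end

# Build the body of a Schema API "add-copy-field" command.
def add_copy_field_command(source, dest)
  { 'add-copy-field' => { 'source' => source.to_s, 'dest' => dest.to_s } }
end

# POST one command to <base>/<collection>/schema.
# Not called here, since it needs a running Solr instance.
def post_schema_command(collection, command, base: SOLR_URL)
  uri = URI("#{base}/#{collection}/schema")
  Net::HTTP.post(uri, JSON.generate(command), 'Content-Type' => 'application/json')
end
```

With a Solr instance running, `Net::HTTP.get_response(create_collection_uri('term_search'))` followed by `post_schema_command('term_search', add_field_command(:prefLabel, 'text_general'))` would create the collection and then add one field to its schema.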

In addition to implementing the SOLR Schema and Admin APIs, we added a DSL to the Goo model to enable indexing for any model, in either schemaless mode or custom schema mode:

  • Schemaless mode: generates a collection and a schema from the settings/metadata of the model properties. It uses SOLR dynamic fields to make this possible. Example:
  class TermSearch < Goo::Base::Resource
    model :term_search, name_with: :prefLabel
    attribute :prefLabel, enforce: [:existence], fuzzy_search: true # fuzzy_search enables autocomplete on this field
    attribute :synonym, enforce: [:list]
    attribute :definition
    attribute :submissionAcronym, enforce: [:existence]
    attribute :submissionId, enforce: [:existence, :integer]
    attribute :private, enforce: [:boolean], default: false, index: false # this field will not be indexed
    attribute :semanticType
    attribute :cui

    enable_indexing(:my_collection) # generates the collection `my_collection` and indexes all the attributes of the model (except `:private` here)
  end
  • Custom Schema mode: generates a collection and a schema defined directly in the code, in the model declaration. Example:
  class TermSearch < Goo::Base::Resource
    model :term_search, name_with: :id
    attribute :prefLabel, enforce: [:existence]
    attribute :synonym, enforce: [:list] # array of strings
    attribute :definition  # array of strings
    attribute :submissionAcronym, enforce: [:existence]
    attribute :submissionId, enforce: [:existence, :integer]
    attribute :semanticType
    attribute :cui

    enable_indexing(:term_search) do
      schema_generator.add_field(:prefLabel, 'text_general', indexed: true, stored: true, multi_valued: false)
      schema_generator.add_field(:synonym, 'text_general', indexed: true, stored: true, multi_valued: true)
      schema_generator.add_field(:definition, 'string', indexed: true, stored: true, multi_valued: true)
      schema_generator.add_field(:submissionAcronym, 'string', indexed: true, stored: true, multi_valued: false)
      schema_generator.add_field(:submissionId, 'pint', indexed: true, stored: true, multi_valued: false)
      schema_generator.add_field(:cui, 'text_general', indexed: true, stored: true, multi_valued: true)
      schema_generator.add_field(:semanticType, 'text_general', indexed: true, stored: true, multi_valued: true)

      # Copy fields for fuzzy and autocomplete search
      schema_generator.add_copy_field('prefLabel', '_text_')
      schema_generator.add_copy_field('prefLabel', 'prefLabel_Exact')
      schema_generator.add_copy_field('prefLabel', 'prefLabel_Suggest')
      schema_generator.add_copy_field('prefLabel', 'prefLabel_SuggestEdge')
      schema_generator.add_copy_field('prefLabel', 'prefLabel_SuggestNgram')

      schema_generator.add_copy_field('synonym', '_text_')
      schema_generator.add_copy_field('synonym', 'synonym_Exact')
      schema_generator.add_copy_field('synonym', 'synonym_Suggest')
      schema_generator.add_copy_field('synonym', 'synonym_SuggestEdge')
      schema_generator.add_copy_field('synonym', 'synonym_SuggestNgram')

      schema_generator.add_copy_field('notation', '_text_')
    end
  end
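
The copy-field declarations in the custom schema example follow a regular pattern: each searchable attribute is copied to `_text_` and to one field per suffix (`Exact`, `Suggest`, `SuggestEdge`, `SuggestNgram`). As a rough sketch, that pattern could be generated rather than written out by hand. The `search_copy_fields` helper and the `RecordingGenerator` stand-in below are hypothetical and are not part of this PR; the stand-in only mimics the `add_copy_field(source, dest)` signature so the example runs without Goo or SOLR.

```ruby
# Suffixes taken from the copy-field declarations in the example above.
SEARCH_SUFFIXES = %w[Exact Suggest SuggestEdge SuggestNgram].freeze

# Hypothetical helper: returns the [source, dest] pairs for one attribute,
# i.e. the catch-all _text_ field plus one destination per suffix.
def search_copy_fields(field)
  source = field.to_s
  [[source, '_text_']] + SEARCH_SUFFIXES.map { |suffix| [source, "#{source}_#{suffix}"] }
end

# Stand-in for schema_generator that only records the calls it receives.
class RecordingGenerator
  attr_reader :copy_fields

  def initialize
    @copy_fields = []
  end

  def add_copy_field(source, dest)
    @copy_fields << [source, dest]
  end
end

generator = RecordingGenerator.new
%i[prefLabel synonym].each do |field|
  search_copy_fields(field).each { |source, dest| generator.add_copy_field(source, dest) }
end

# Print the recorded declarations, one per line.
generator.copy_fields.each { |source, dest| puts "#{source} -> #{dest}" }
```

Whether such a helper can be called from inside the real enable_indexing block depends on how that block is evaluated, which is not shown here, so treat this as an illustration of the naming pattern only.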

Changes

  • add an abstraction to SOLR integration and add Schema API (b16ffbd)
  • add SOLR Schema API tests (0389943)
  • update SOLR backend configuration and init (f3815c4)
  • use the new Solr connector in the model search interface (9ca2e1c)
  • update search test to cover the new automatic indexing and unindexing (459c4ff)

codecov bot commented Feb 3, 2024

Codecov Report

Attention: Patch coverage is 81.01604%, with 71 lines in your changes missing coverage. Please review.

Project coverage is 86.36%. Comparing base (1be1c83) to head (84aa3db).

Files                                         Patch %   Lines
lib/goo/search/solr/solr_schema.rb            72.34%    26 Missing ⚠️
lib/goo/search/search.rb                      82.79%    16 Missing ⚠️
lib/goo/search/solr/solr_admin.rb             69.76%    13 Missing ⚠️
lib/goo/search/solr/solr_query.rb             87.50%    6 Missing ⚠️
lib/goo/search/solr/solr_schema_generator.rb  85.36%    6 Missing ⚠️
lib/goo.rb                                    92.30%    2 Missing ⚠️
lib/goo/search/solr/solr_connector.rb         91.66%    2 Missing ⚠️
Additional details and impacted files
@@               Coverage Diff               @@
##           development      #54      +/-   ##
===============================================
+ Coverage        85.93%   86.36%   +0.43%     
===============================================
  Files               41       46       +5     
  Lines             2702     3022     +320     
===============================================
+ Hits              2322     2610     +288     
- Misses             380      412      +32     


syphax-bouazzouni force-pushed the feature/add-model-based-search branch from f83d7c4 to bde2bdf on February 3, 2024 14:21
syphax-bouazzouni changed the title from "Feature: Enhance SOLR integration and add a Schema API" to "[WIP] Feature: Enhance SOLR integration and add a Schema API" on Feb 3, 2024
syphax-bouazzouni changed the title from "[WIP] Feature: Enhance SOLR integration and add a Schema API" to "Feature: Enhance SOLR integration and add a Schema API" on Feb 5, 2024
syphax-bouazzouni force-pushed the feature/add-model-based-search branch from e1ad9ac to a74645b on February 9, 2024 09:22

mdorf commented Feb 15, 2024

@syphax-bouazzouni, this looks great! Really useful feature for us as well! It would be great to test it against the latest Solr 9.5.0 to make sure the code is compatible. I am wrapping up my work on a few search enhancements for the RADx project, which we plan to deploy to production shortly. The next major task is upgrading Solr to the latest version, which may very well coincide with merging this pull request. Are you planning to make any other significant changes to this feature (you mentioned that this is a "first iteration")? Thank you!

syphax-bouazzouni force-pushed the feature/add-model-based-search branch from a74645b to ec99ed8 on February 16, 2024 06:16
syphax-bouazzouni force-pushed the feature/add-model-based-search branch from f25886a to ca32e8c on February 17, 2024 09:46
syphax-bouazzouni force-pushed the feature/add-model-based-search branch 2 times, most recently from bc4f5a1 to 1fc6730, on February 25, 2024 12:57
syphax-bouazzouni force-pushed the feature/add-model-based-search branch from 1fc6730 to 51fbfe4 on February 25, 2024 12:59

syphax-bouazzouni commented Mar 2, 2024

Hello, @mdorf, I tested it locally with Solr 9.5.0 and all the tests are green.
I removed the "first iteration" term, as I implemented everything I had in mind for now: administering SOLR in code, staying backward compatible with the existing setup, and indexing models automatically on save/delete.

syphax-bouazzouni force-pushed the feature/add-model-based-search branch from 1836ce1 to 84aa3db on March 2, 2024 19:37
syphax-bouazzouni merged commit 6c51346 into development on Mar 2, 2024
49 of 50 checks passed
syphax-bouazzouni added a commit that referenced this pull request Apr 3, 2024
* add an abstraction to SOLR integration and add Schema API

* add SOLR Schema API tests

* update SOLR backend configuration and init

* use the new Solr connector in the model search interface

* update search test to cover the new automatic indexing and unindexing

* handle the solr container initialization when running docker for tests

* add omit_norms option for SolrSchemaGenerator

* fix solr schema initial dynamic fields declaration and replace the usage of mapping-ISOLatin1Accent

* delay the schema generation to after model declarations or on demand

* add solr edismax filters tests

* fix indexBatch and unindexBatch tests

* add security checks to the index and unindex functions

* change dynamic fields names to have less code migration

* update clear_all_schema to remove all copy and normal fields

* add an option to force solr initialization if wanted

* handle indexing embed objects of a model

* add index update option

* fix clear all schema to just remove all the fields and recreate them

* add index_enabled?  helper for models

* perform a status test  when initializing the solr connector

* extract init_search_connection function from init_search_connections

* fix typo in indexOptimize call

* add solr search using  HTTP post instead of GET for large queries

mdorf commented Apr 9, 2024

@syphax-bouazzouni, I am working on merging this functionality into our develop branch. It's a bit tricky given that the pull request is not against our own repo. Probably will need to do a lot of manual merging. Were you planning on submitting this pull request against the ncbo repo? If so, should I wait for that or just proceed with my manual merging? Thank you!


jonquet commented Apr 10, 2024

Hello, the proposal was to move and create PRs directly on the OntoPortal repo now that everyone is positioned under it. I think @syphax-bouazzouni has scheduled some time to create PRs related to our work soon.

@syphax-bouazzouni
Author

> @syphax-bouazzouni, I am working on merging this functionality into our develop branch. It's a bit tricky given that the pull request is not against our own repo. Probably will need to do a lot of manual merging. Were you planning on submitting this pull request against the ncbo repo? If so, should I wait for that or just proceed with my manual merging? Thank you!

Hello @mdorf, I would suggest waiting at least one month before merging this, as it has only been tested in our development environment and will ship in our next release; see ontoportal-lirmm/ontologies_api#73

Once it is deployed to our production environment and tested, I will open a PR.

Does that work for you?


mdorf commented Apr 10, 2024

@syphax-bouazzouni, @jonquet, no problem. In general, we are very interested in this feature to facilitate the functionality sought by the RADx project, in which our Solr index would be packaged in a way to be accessible by the third-party API. See bmir-radx/radx-project#49. However, based on my conversation with @alexskr, this is not immediately urgent. A month is definitely reasonable for us to wait to be able to merge this feature in its more stable and tested iteration.

syphax-bouazzouni added a commit that referenced this pull request May 22, 2024
…DF 3.0 and SOLR API (#58)

* Feature: Add Virtuoso, AllegroGraph and GraphDB integration to GOO (#48)

* simplify the test configuration init

* add docker based tests rake task to run test against 4s, ag, gb, vo

* remove faraday gem usage

* update test CI to test against all the supported backends with different slice sizes

* add high level helper to know which backend we are currently using

* extract sparql processor module from where module

* handle language_match? value to upcase by default

* add support for virtuoso and graphdb sparql client

* replace delete sparql query by delete graph in the model complex test

* add some new edge case tests to test_where.rb and test_schemaless

* make test_chunks_write.rb tests support multiple backends

* replace native insert_data with execute_append_request in model save

* remove add_rules as it seems to no more be used

* move expand_equivalent_predicates from loader to builder module

* build two different queries depending on which backend is used

* update mapper to handle the two different queries depending on the backend used

* simplify the loader code, by removing inferable variables

* refactor and simplify map_attributes method

* fix test chunks write concurrency issues

* Refactor: clean model settings module code (#52)

* remove old file no more used

* extract attribute settings module from the model settings module

* remove the inmutable feature as deprecated and not used

* rename callbacks method names

* Feature: Add after_save and after_destroy hooks to models  (#53)

* remove old file no more used

* extract attribute settings module from the model settings module

* remove the inmutable feature as deprecated and not used

* rename callbacks method names

* add hooks module

* Feature: update rdf gem to latest version  (#56)

* un pin rdf version, to use the latest and add rdf vocab and xml

* update URI class monkey patch because Addressable does no more exist

* RDF::SKOS is replaced with RDF::Vocab::SKOS in the latest version of RDF

* pin rdf version to 3.2.11 the latest version that support ruby 2.7

* monkey patch Literal::DateTime format to be supported by 4store

* remove  addressable dependency

* Fix: saving a model removing unmodified attributes after consecutive save

* Fix: enforce to use str() when doing a filter with a string value  (#57)

* enforce to use str() when doing a filter with a string

* update agraph version to 8.1.0

* Fix: monkey patch RDF to not remove xsd:string by default

* Feature: Enhance SOLR integration and add a Schema API (#54)

* add an abstraction to SOLR integration and add Schema API

* add SOLR Schema API tests

* update SOLR backend configuration and init

* use the new Solr connector in the model search interface

* update search test to cover the new automatic indexing and unindexing

* handle the solr container initialization when running docker for tests

* add omit_norms option for SolrSchemaGenerator

* fix solr schema initial dynamic fields declaration and replace the usage of mapping-ISOLatin1Accent

* delay the schema generation to after model declarations or on demand

* add solr edismax filters tests

* fix indexBatch and unindexBatch tests

* add security checks to the index and unindex functions

* change dynamic fields names to have less code migration

* update clear_all_schema to remove all copy and normal fields

* add an option to force solr initialization if wanted

* handle indexing embed objects of a model

* add index update option

* fix clear all schema to just remove all the fields and recreate them

* add index_enabled?  helper for models

* perform a status test  when initializing the solr connector

* extract init_search_connection function from init_search_connections

* fix typo in indexOptimize call

* add solr search using  HTTP post instead of GET for large queries

* make indexed resource_id case insensitive (#59)

* Fix: Invalidating cache after insertion of a new  element (#60)

* create a test to reproduce the cache invalidate on insert bug

* use again insert_data instead of execute_append_request because the first invalidate the cache

* update sparql client to  version 3.2.0

* handle the case virtuoso insert data  bug

* use development branch of sparql-client

* fix search resource_id case insensitive by using string_ci instead