Feature: Enhance SOLR integration and add a Schema API #54

syphax-bouazzouni · 2024-02-03T13:42:14Z

Prerequisites

Feature: Add after_save and after_destroy hooks to models #53
SOLR >= 8.11, started in cloud mode (SolrCloud)

Goals

Make it easier to configure SOLR programmatically.
Make it possible to index any model on demand.

Context

SOLR is the indexing tool we use for our search features. It works by defining collections (the equivalent of tables in the database world). For each collection, a schema declares the properties to index, giving each one a type and stating whether it is a list, along with dynamic or special fields that support features such as fuzzy search.

Previously, we had to define the collection and its schema in XML configuration files at startup, and could not change them from code afterwards. This limited what we could do and made it hard to add new search features: the files were static, so every addition to the index meant updating the schema and recreating the collection configuration files.

This PR integrates the SOLR Schema API into this project, which lets us create or delete a collection and update a collection schema dynamically. The following actions were implemented (see the full list in the SOLR::Admin, SOLR::Schema and SOLR::SchemaGenerator modules in the code):

SOLR Administration API

    fetch_all_collections 
    
    create_collection(name = @collection_name, num_shards = 1, replication_factor = 1)
   
    delete_collection(collection_name = @collection_name)
    
    collection_exists?(collection_name)
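
For context, these administration calls correspond to actions of the SolrCloud Collections API (LIST, CREATE, DELETE). The sketch below only builds the request URLs and sends nothing; `collections_api_url` and the `localhost` base URL are illustrative assumptions, not the helpers implemented in this PR.

```ruby
require 'uri'

# Hypothetical helper (not part of this PR): builds a Collections API URL
# for a SolrCloud node reachable at base_url.
def collections_api_url(base_url, action, params = {})
  query = URI.encode_www_form({ action: action }.merge(params))
  "#{base_url}/admin/collections?#{query}"
end

base = 'http://localhost:8983/solr' # assumed local SolrCloud node

list_url   = collections_api_url(base, 'LIST')
create_url = collections_api_url(base, 'CREATE',
                                 name: 'term_search', numShards: 1, replicationFactor: 1)
delete_url = collections_api_url(base, 'DELETE', name: 'term_search')

puts list_url
puts create_url
puts delete_url
```

The `numShards` and `replicationFactor` parameters play the role of the `num_shards` and `replication_factor` arguments of `create_collection` above.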

SOLR Schema API

    add_field(name, type, indexed: true, stored: true, multi_valued: false)
    
    add_dynamic_field(name, type, indexed: true, stored: true, multi_valued: false)
    
    add_copy_field(source, dest)
    
    add_field_type(type_definition)

    delete_field(name)

    update_schema(schema_json)
   
    fetch_schema
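
As a rough illustration of what a call such as `add_field` has to send, the Solr Schema API accepts JSON commands posted to `/solr/<collection>/schema`. The sketch below only builds the `add-field` command body; `add_field_command` is a hypothetical name, and the actual SOLR::Schema implementation may structure this differently.

```ruby
require 'json'

# Hypothetical helper: builds the JSON body of a Schema API "add-field" command.
# Note that Solr expects the camelCase key "multiValued".
def add_field_command(name, type, indexed: true, stored: true, multi_valued: false)
  {
    'add-field' => {
      'name' => name.to_s,
      'type' => type,
      'indexed' => indexed,
      'stored' => stored,
      'multiValued' => multi_valued
    }
  }
end

command = add_field_command(:synonym, 'text_general', multi_valued: true)
body = JSON.generate(command)
puts body
```

The resulting body would be sent with a `Content-Type: application/json` header; `add-dynamic-field`, `add-copy-field`, `add-field-type` and `delete-field` are sibling commands of the same endpoint.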

In addition to implementing the SOLR Schema and Admin APIs, we added a DSL to the Goo model to enable indexing for any model, in either schemaless mode or custom schema mode.

Schemaless mode: this generates a collection and a schema from the settings/metadata of the model properties, using SOLR dynamic fields to make this possible. An example is shown below.

  class TermSearch < Goo::Base::Resource
    model :term_search, name_with: :prefLabel
    attribute :prefLabel, enforce: [:existence], fuzzy_search: true # fuzzy search enables autocomplete for this field
    attribute :synonym, enforce: [:list]
    attribute :definition
    attribute :submissionAcronym, enforce: [:existence]
    attribute :submissionId, enforce: [:existence, :integer]
    attribute :private, enforce: [:boolean], default: false, index: false # this field will not be indexed
    attribute :semanticType
    attribute :cui

    enable_indexing(:my_collection) # will generate the collection called `my_collection` and index all the attributes of the model (except `:private` here)
  end
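
To make the idea behind schemaless mode more concrete, here is a minimal sketch of how attribute settings could be mapped onto dynamic field names. The suffixes follow the dynamic fields shipped in Solr's `_default` configset (`*_t`, `*_txt`, `*_i`, `*_is`, `*_b`, `*_bs`); `dynamic_field_name` and the `SUFFIXES` table are assumptions made for illustration, and the names this PR actually generates may differ.

```ruby
# Suffixes taken from Solr's _default configset; assumed here, not read from the PR.
SUFFIXES = {
  text:    { single: '_t', multi: '_txt' },
  integer: { single: '_i', multi: '_is' },
  boolean: { single: '_b', multi: '_bs' }
}.freeze

# Hypothetical mapping from Goo-style attribute settings to a dynamic field name.
# Returns nil when the attribute opts out of indexing with index: false.
def dynamic_field_name(attribute, enforce: [], index: true)
  return nil unless index

  kind  = (enforce & %i[integer boolean]).first || :text
  arity = enforce.include?(:list) ? :multi : :single
  "#{attribute}#{SUFFIXES[kind][arity]}"
end

puts dynamic_field_name(:prefLabel, enforce: [:existence])
puts dynamic_field_name(:synonym, enforce: [:list])
puts dynamic_field_name(:submissionId, enforce: %i[existence integer])
p dynamic_field_name(:private, enforce: [:boolean], index: false)
```

Because the suffix alone selects the field type, no per-attribute schema entry is needed, which is what lets a model be indexed without declaring a schema.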

Custom schema mode: this generates a collection and a schema defined directly in code, in the model declaration. An example is shown below.

class TermSearch < Goo::Base::Resource
    model :term_search, name_with: :id
    attribute :prefLabel, enforce: [:existence]
    attribute :synonym, enforce: [:list] # array of strings
    attribute :definition  # array of strings
    attribute :submissionAcronym, enforce: [:existence]
    attribute :submissionId, enforce: [:existence, :integer]
    attribute :semanticType
    attribute :cui

    enable_indexing(:term_search) do
      schema_generator.add_field(:prefLabel, 'text_general', indexed: true, stored: true, multi_valued: false)
      schema_generator.add_field(:synonym, 'text_general', indexed: true, stored: true, multi_valued: true)
      schema_generator.add_field(:definition, 'string', indexed: true, stored: true, multi_valued: true)
      schema_generator.add_field(:submissionAcronym, 'string', indexed: true, stored: true, multi_valued: false)
      schema_generator.add_field(:submissionId, 'pint', indexed: true, stored: true, multi_valued: false)
      schema_generator.add_field(:cui, 'text_general', indexed: true, stored: true, multi_valued: true)
      schema_generator.add_field(:semanticType, 'text_general', indexed: true, stored: true, multi_valued: true)

  
      # Copy fields for fuzzy and autocomplete search
      schema_generator.add_copy_field('prefLabel', '_text_')
      schema_generator.add_copy_field('prefLabel', 'prefLabel_Exact')
      schema_generator.add_copy_field('prefLabel', 'prefLabel_Suggest')
      schema_generator.add_copy_field('prefLabel', 'prefLabel_SuggestEdge')
      schema_generator.add_copy_field('prefLabel', 'prefLabel_SuggestNgram')

      schema_generator.add_copy_field('synonym', '_text_')
      schema_generator.add_copy_field('synonym', 'synonym_Exact')
      schema_generator.add_copy_field('synonym', 'synonym_Suggest')
      schema_generator.add_copy_field('synonym', 'synonym_SuggestEdge')
      schema_generator.add_copy_field('synonym', 'synonym_SuggestNgram')

      schema_generator.add_copy_field('notation', '_text_')
    end
end
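
The copy fields declared above map onto the Schema API `add-copy-field` command, whose `dest` may be a single field or an array of fields. The sketch below groups (source, destination) pairs into one command per source; `copy_field_commands` is a hypothetical helper, and whether this PR batches copy fields this way is an assumption.

```ruby
require 'json'

# Hypothetical helper: groups [source, dest] pairs into one
# "add-copy-field" command per source field.
def copy_field_commands(pairs)
  pairs.group_by(&:first).map do |source, group|
    { 'add-copy-field' => { 'source' => source, 'dest' => group.map(&:last) } }
  end
end

pairs = [
  %w[prefLabel _text_],
  %w[prefLabel prefLabel_Exact],
  %w[synonym _text_]
]

commands = copy_field_commands(pairs)
commands.each { |command| puts JSON.generate(command) }
```

Solr requires each destination to exist in the schema, either as an explicit field or through a matching dynamic field, so destinations such as `prefLabel_Exact` need a corresponding definition.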

Changes

add an abstraction to SOLR integration and add Schema API (b16ffbd)
add SOLR Schema API tests (0389943)
update SOLR backend configuration and init (f3815c4)
use the new Solr connector in the model search interface (9ca2e1c)
update search test to cover the new automatic indexing and unindexing (459c4ff)

codecov · 2024-02-03T13:42:23Z

Codecov Report

Attention: Patch coverage is 81.01604%, with 71 lines in your changes missing coverage. Please review.

Project coverage is 86.36%. Comparing base (1be1c83) to head (84aa3db).

Files	Patch %	Lines
lib/goo/search/solr/solr_schema.rb	72.34%	26 Missing ⚠️
lib/goo/search/search.rb	82.79%	16 Missing ⚠️
lib/goo/search/solr/solr_admin.rb	69.76%	13 Missing ⚠️
lib/goo/search/solr/solr_query.rb	87.50%	6 Missing ⚠️
lib/goo/search/solr/solr_schema_generator.rb	85.36%	6 Missing ⚠️
lib/goo.rb	92.30%	2 Missing ⚠️
lib/goo/search/solr/solr_connector.rb	91.66%	2 Missing ⚠️

Additional details and impacted files

@@               Coverage Diff               @@
##           development      #54      +/-   ##
===============================================
+ Coverage        85.93%   86.36%   +0.43%     
===============================================
  Files               41       46       +5     
  Lines             2702     3022     +320     
===============================================
+ Hits              2322     2610     +288     
- Misses             380      412      +32

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.


mdorf · 2024-02-15T21:57:24Z

@syphax-bouazzouni, this looks great! Really useful feature for us as well! It would be great to test it against the latest Solr 9.5.0 to make sure the code is compatible. I am wrapping up my work on a few search enhancements for the RADx project, which we plan to deploy to production shortly. The next major task is upgrading Solr to the latest version, which may very well coincide with merging this pull request. Are you planning to make any other significant changes to this feature (you mentioned that this is a "first iteration")? Thank you!

syphax-bouazzouni · 2024-03-02T18:49:02Z

9.5.0

Hello, @mdorf, I tested it locally with Solr 9.5.0 and all the tests are green.
I removed the "first iteration" term, as I have implemented everything I had in mind for now: administering SOLR in code, staying backward compatible with the existing setup, and indexing models automatically on save/delete.

* add an abstraction to SOLR integeration and add Schema API
* add SOLR Schema API tests
* update SOLR backend configuration and init
* use the new Solr connector in the model search interface
* update search test to cover the new automatic indexing and unindexing
* handle the solr container initialization when running docker for tests
* add omit_norms options for SolrSchemaGenerator
* fix solr schema initial dynamic fields declaration and replace the usage of mapping-ISOLatin1Accent
* delay the schema generation to after model declarations or in demand
* add solr edismax fitlers tests
* fix indexBatch and unindexBatch tests
* add security checks to the index and unindex functions
* change dynamic fields names to have less code migration
* update clear_all_schema to remove all copy and normal fields
* add an option to force solr initialization if wanted
* handle indexing embed objects of a model
* add index update option
* fix clear all schema to just remove all the fields and recreate them
* add index_enabled? helper for models
* perform a status test when initializing the solr connector
* extract init_search_connection function from init_search_connections
* fix typo in indexOptimize call
* add solr search using HTTP post instead of GET for large queries

mdorf · 2024-04-09T22:11:48Z

@syphax-bouazzouni, I am working on merging this functionality into our develop branch. It's a bit tricky given that the pull request is not against our own repo. Probably will need to do a lot of manual merging. Were you planning on submitting this pull request against the ncbo repo? If so, should I wait for that or just proceed with my manual merging? Thank you!

jonquet · 2024-04-10T05:07:21Z

Hello, the proposal was to create PRs directly on the OntoPortal repo, now that everyone is positioned under it. I think @syphax-bouazzouni has scheduled some time to create PRs related to our work soon.

syphax-bouazzouni · 2024-04-10T16:48:03Z


Hello @mdorf, I would suggest waiting at least one month before merging this, as it has only been tested in our development environment and will ship in our next release; see ontoportal-lirmm/ontologies_api#73

Once deployed to our production environment and tested, I will do a PR.

Does that work for you?

mdorf · 2024-04-10T19:52:21Z

@syphax-bouazzouni, @jonquet, no problem. In general, we are very interested in this feature to facilitate the functionality sought by the RADx project, in which our Solr index would be packaged so that it is accessible to a third-party API. See bmir-radx/radx-project#49. However, based on my conversation with @alexskr, this is not immediately urgent. A month is definitely a reasonable wait for us to be able to merge this feature in a more stable and tested iteration.


syphax-bouazzouni added 5 commits February 3, 2024 14:11

add an abstraction to SOLR integeration and add Schema API

b16ffbd

add SOLR Schema API tests

0389943

update SOLR backend configuration and init

f3815c4

use the new Solr connector in the model search interface

9ca2e1c

update search test to cover the new automatic indexing and unindexing

459c4ff

handle the solr container initialization when running docker for tests

bde2bdf

syphax-bouazzouni force-pushed the feature/add-model-based-search branch from f83d7c4 to bde2bdf Compare February 3, 2024 14:21

syphax-bouazzouni changed the title ~~Feature: Enhance SOLR integration and add a Schema API~~ [WIP] Feature: Enhance SOLR integration and add a Schema API Feb 3, 2024

syphax-bouazzouni mentioned this pull request Feb 3, 2024

Feature: Migrate SOLR configuration files to use SOLR Schema API ontoportal-lirmm/ontologies_linked_data#126

Merged

add omit_norms options for SolrSchemaGenerator

df31c26

syphax-bouazzouni mentioned this pull request Feb 5, 2024

Solr v8 schema update ncbo/ontologies_linked_data#177

Closed

syphax-bouazzouni added 5 commits February 5, 2024 19:27

fix solr schema initial dynamic fields declaration and replace the usage of mapping-ISOLatin1Accent

adcf79d

delay the schema generation to after model declarations or in demand

67023bf

add solr edismax fitlers tests

bccf3e7

fix indexBatch and unindexBatch tests

d4f5e1d

add security checks to the index and unindex functions

8308441

syphax-bouazzouni mentioned this pull request Feb 5, 2024

Feature: use the new SOLR Schema API instead of SOLR config files ontoportal-lirmm/ontologies_api#68

Merged

syphax-bouazzouni changed the title ~~[WIP] Feature: Enhance SOLR integration and add a Schema API~~ Feature: Enhance SOLR integration and add a Schema API Feb 5, 2024

syphax-bouazzouni added 2 commits February 9, 2024 02:49

change dynamic fields names to have less code migration

877bbe1

update clear_all_schema to remove all copy and normal fields

9d5c23d

syphax-bouazzouni force-pushed the feature/add-model-based-search branch from e1ad9ac to a74645b Compare February 9, 2024 09:22

syphax-bouazzouni added 5 commits February 16, 2024 07:10

add an option to force solr initialization if wanted

23a0824

handle indexing embed objects of a model

664127c

add index update option

6a29f05

fix clear all schema to just remove all the fields and recreate them

d00f9a5

add index_enabled? helper for models

ec99ed8

syphax-bouazzouni force-pushed the feature/add-model-based-search branch from a74645b to ec99ed8 Compare February 16, 2024 06:16

perform a status test when initializing the solr connector

ca32e8c

syphax-bouazzouni force-pushed the feature/add-model-based-search branch from f25886a to ca32e8c Compare February 17, 2024 09:46

syphax-bouazzouni added 2 commits February 23, 2024 09:40

Merge branch 'development' into feature/add-model-based-search

83ac6f6

extract init_search_connection function from init_search_connections

e461c2d

syphax-bouazzouni force-pushed the feature/add-model-based-search branch 2 times, most recently from bc4f5a1 to 1fc6730 Compare February 25, 2024 12:57

fix typo in indexOptimize call

51fbfe4

syphax-bouazzouni force-pushed the feature/add-model-based-search branch from 1fc6730 to 51fbfe4 Compare February 25, 2024 12:59

Merge branch 'development' into feature/add-model-based-search

10b90c1

add solr search using HTTP post instead of GET for large queries

84aa3db

syphax-bouazzouni force-pushed the feature/add-model-based-search branch from 1836ce1 to 84aa3db Compare March 2, 2024 19:37

syphax-bouazzouni merged commit 6c51346 into development Mar 2, 2024
49 of 50 checks passed

alexskr mentioned this pull request Mar 22, 2024

adopt SOLR Schema API for configuring SOLR in bioportal ncbo/bioportal-project#310

Open

syphax-bouazzouni mentioned this pull request Apr 3, 2024

Merge to master: Release 2.4.0 - Multi-backend stores integrations, RDF 3.0 and SOLR API #58

Merged

syphax-bouazzouni mentioned this pull request May 1, 2024

Think about adding a search function in goo agroportal/project-management#160

Closed
