Solr Schema Extension for Analyzers

rdelbru edited this page Oct 16, 2011 · 9 revisions

This document describes how we extended the Solr schema to support the extended SIREn index and query analyzers.

The main difference between the original Lucene/Solr analyzers and the SIREn analyzers is that SIREn analyzers rely on secondary analyzers, one per datatype. The top-level analyzers for indexing and querying therefore need to be configured with multiple secondary analyzers.

Another difference is that we need two (or more) different query analyzers for the same SIREn FieldType. For example, SIREn allows a single SIREn FieldType to be searched with either a keyword query or an ntriple query, and each of them requires a different analyzer.

SIREn FieldType

The SIREn FieldType defines the top-level index and query analyzers. The configuration of these analyzers is defined in an external XML configuration file referenced by the analyzerConfig attribute, e.g., analyzerConfig="ntriple-analyzers.xml". The SIREn FieldType also requires a second XML configuration file, referenced by the datatypeConfig attribute, that defines the analyzers for each datatype.

<fieldType name="ntriple" class="siren.SirenType" analyzerConfig="ntriple-analyzers.xml"
                                                  datatypeConfig="ntriple-datatypes.xml"/>
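
For illustration, a field in the Solr schema could then reference this FieldType like any other Solr type. This is only a sketch: the field name and the indexed/stored values are assumptions chosen for the example and are not taken from the SIREn distribution.

```xml
<!-- Hypothetical field declaration; the name "content" and the
     indexed/stored values are illustrative assumptions. -->
<field name="content" type="ntriple" indexed="true" stored="false"/>
```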

SIREn Analyzer Configuration File

The SIREn Analyzer Configuration File contains the configuration of the top-level index and query analyzers for a SIREn FieldType. This file is parsed by a simplified version of Solr's IndexSchema class, which allows the definition of multiple query analyzers, e.g., one for keyword queries and one for ntriple queries.

<analyzer type="index">
  ...
</analyzer>
<analyzer type="keyword-query">
  ...
</analyzer>
<analyzer type="ntriple-query">
  ...
</analyzer>

UPDATE

The analyzer for ntriple-query has been removed. In practice, it only ever declared the NTripleQueryTokenizerFactory as the single element of the analyzer. This tokenizer is tied to the NTriple query parser, and it is unlikely that anything else needs to be defined there. The analysis of the different parts of an NTriple query is configured through the Datatypes. To simplify the configuration, the ntriple-query analyzer is therefore no longer declared in the file; the tokenizer is instantiated directly (hardcoded) within the NTripleQParser class.
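
After this update, the analyzer configuration file is reduced to two entries. The fragment below sketches that shape; the analyzer bodies stay elided as in the earlier listing, and any enclosing root element is left out because it is not specified here.

```xml
<!-- Top-level analyzer used at indexing time. -->
<analyzer type="index">
  ...
</analyzer>
<!-- Top-level analyzer used for keyword queries. -->
<analyzer type="keyword-query">
  ...
</analyzer>
<!-- No "ntriple-query" analyzer: its tokenizer is created inside NTripleQParser. -->
```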

SIREn Datatype

The SIREn Datatype is a modified version of the Solr FieldType for defining the index and query analyzers of each datatype.

All Datatypes are defined in a single Datatype Configuration file. This file is parsed by a modified version of Solr's IndexSchema class.

Different kinds of Datatype are needed. For example, we must have a TextDatatype and a TrieDatatype. The latter is necessary for Trie Range Queries. This special Datatype does not contain analyzer definitions; instead, it contains parameters for configuring trie indexing and querying, such as the primitive type and the precision.
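
To make the distinction concrete, here is a minimal sketch of what two entries in the Datatype Configuration file might look like. It is not the actual SIREn syntax: the datatype element name, the name values, and the class names siren.TextDatatype and siren.TrieDatatype are assumptions modeled on the FieldType declaration above, and the type and precisionStep attributes are borrowed from Solr's own TrieField. The tokenizer and filter factories are standard Solr classes used purely as placeholders.

```xml
<!-- Hypothetical sketch: element, attribute, and class names are assumptions. -->

<!-- A text datatype carries its own index and query analyzers. -->
<datatype name="text" class="siren.TextDatatype">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</datatype>

<!-- A trie datatype declares no analyzers, only trie parameters:
     the primitive type and the precision step. -->
<datatype name="int" class="siren.TrieDatatype" type="integer" precisionStep="8"/>
```

In such a layout, the top-level analyzers would look up the secondary analyzers by datatype name, while a trie datatype would derive its tokenization from its parameters alone.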