Solr Schema
This document explains how to register a new field type for N-Triple indexing and querying in the Solr schema.xml.
The example schema.xml contains the definition of a field type named ntriple.
Note that in its definition:
<fieldType name="ntriple" class="solr.TextField" omitNorms="true">
the parameter omitNorms is set to true. Disabling length normalisation on a SIREn field is recommended because of the particular way SIREn indexes data within a field.
For index-time analysis, you have to define a TupleTokenizerFactory followed by a list of optional TokenFilterFactories that are applied in the listed order. The TupleTokenizerFactory creates and configures a TupleTokenizer.
<tokenizer class="org.sindice.siren.solr.analysis.TupleTokenizerFactory"
subschema="ntriple-schema.xml"
literal-fieldtype="ntriple-literal"/>
The TupleTokenizerFactory requires two parameters, subschema and literal-fieldtype. The subschema parameter points to a secondary schema.xml file that contains the definition of the Literal analyzer. The literal-fieldtype parameter names the field type in that file that defines the Literal analyzer.
The TokenTypeFilterFactory creates a TokenTypeFilter.
This filter can be configured to remove different token types created by SIREn. By default, it removes bnode, dot, datatype and language tag tokens. This behaviour can be changed with four optional parameters: bnode, datatype, languageTag and dot. Setting one of these parameters to 0 stops the TokenTypeFilter from filtering out the corresponding token type. In the following example, the bnode token type is kept and indexed by SIREn:
<filter class="org.sindice.siren.solr.analysis.TokenTypeFilterFactory"
bnode="0"/>
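Several token types can be kept at once. The following is a minimal sketch, assuming that the optional parameters can be combined on a single filter element: bnode, datatype and language tag tokens are kept, while dot tokens are still removed by default.

```xml
<!-- Sketch: keep bnode, datatype and language tag tokens.
     The dot parameter is not set, so dot tokens are still filtered out. -->
<filter class="org.sindice.siren.solr.analysis.TokenTypeFilterFactory"
        bnode="0"
        datatype="0"
        languageTag="0"/>
```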
The URIEncodingFilterFactory creates a URIEncodingFilter.
The URIEncodingFilter decodes special characters in URIs, such as '?' or '<', that are percent-encoded, i.e. written as a '%' followed by two hexadecimal digits (with the exception of the space character, which can be encoded as '+').
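As an illustration, the filter could be declared as shown here. The package name is an assumption, taken from the package of the TupleTokenizerFactory shown earlier, and the URI in the comment is a made-up example.

```xml
<!-- Assumption: the factory lives in the same package as TupleTokenizerFactory. -->
<!-- Hypothetical input: in http://example.org/search%3Fq the sequence %3F
     is the percent-encoded form of '?'. -->
<filter class="org.sindice.siren.solr.analysis.URIEncodingFilterFactory"/>
```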
The URILocalnameFilterFactory creates a URILocalnameFilter.
The URILocalnameFilter extracts the localname of a URI and breaks it into smaller components based on delimiters, such as uppercase letters or digits. By default, it does not tokenise a localname longer than 64 characters. This limit can be changed with the parameter maxLength.
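For instance, the limit could be raised as in the sketch below. The package name is assumed to match that of the TupleTokenizerFactory, and the value 128 is an arbitrary illustration, not a recommended setting.

```xml
<!-- Assumption: same package as TupleTokenizerFactory.
     maxLength="128" is an illustrative value; the default is 64. -->
<filter class="org.sindice.siren.solr.analysis.URILocalnameFilterFactory"
        maxLength="128"/>
```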
The URITrailingSlashFilterFactory creates a URITrailingSlashFilter.
The URITrailingSlashFilter removes the trailing slash of a URI.
The MailtoFilterFactory creates a MailtoFilter.
The MailtoFilter tokenises the URIs with a mailto scheme.
The SirenDeltaPayloadFilterFactory creates a SirenDeltaPayloadFilter.
The SirenDeltaPayloadFilter must be the last filter in the list. This filter is in charge of encoding the SIREn metadata (tuple and cell ids) into the payload of each token.
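Put together, the index-time analyzer could look like the sketch below. It lists the factories in the order in which they are described in this document, which is not necessarily the order used in the example schema.xml. The package of every filter other than the TokenTypeFilterFactory is an assumption based on the package of the TupleTokenizerFactory. The only constraint stated here is that the SirenDeltaPayloadFilterFactory comes last.

```xml
<analyzer type="index">
  <!-- Tokenizer and its two required parameters -->
  <tokenizer class="org.sindice.siren.solr.analysis.TupleTokenizerFactory"
             subschema="ntriple-schema.xml"
             literal-fieldtype="ntriple-literal"/>
  <!-- Optional filters, applied in the listed order (package names assumed) -->
  <filter class="org.sindice.siren.solr.analysis.TokenTypeFilterFactory"/>
  <filter class="org.sindice.siren.solr.analysis.URIEncodingFilterFactory"/>
  <filter class="org.sindice.siren.solr.analysis.URILocalnameFilterFactory"/>
  <filter class="org.sindice.siren.solr.analysis.URITrailingSlashFilterFactory"/>
  <filter class="org.sindice.siren.solr.analysis.MailtoFilterFactory"/>
  <!-- Must be the last filter: encodes tuple and cell ids into token payloads -->
  <filter class="org.sindice.siren.solr.analysis.SirenDeltaPayloadFilterFactory"/>
</analyzer>
```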
The analysis of Literals is orthogonal to the analysis of tuples. The Literal analyzer is defined in the secondary schema (ntriple-schema.xml):
<fieldType name="ntriple-literal" class="org.apache.solr.schema.SubTextField">
All field types in the secondary schema must be defined as class="org.apache.solr.schema.SubTextField".
The ASCIIFoldingExpansionFilterFactory creates an ASCIIFoldingExpansionFilter.
This filter expands accented tokens with a non-accented form. For example, if a literal contains the token 'café', it will create an additional token 'cafe' at the same position as the token 'café'.
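A Literal analyzer using this filter might be declared in ntriple-schema.xml as in the following sketch. The choice of solr.StandardTokenizerFactory and solr.LowerCaseFilterFactory is an assumption made for illustration, as is the package of the ASCIIFoldingExpansionFilterFactory. The example schema may use a different chain.

```xml
<fieldType name="ntriple-literal" class="org.apache.solr.schema.SubTextField">
  <analyzer type="index">
    <!-- Assumed tokenizer and lowercase filter, for illustration only -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Adds a non-accented token (e.g. 'cafe') at the position of 'café';
         package name assumed -->
    <filter class="org.sindice.siren.solr.analysis.ASCIIFoldingExpansionFilterFactory"/>
  </analyzer>
</fieldType>
```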
For query-time analysis, SIREn relies on a MultiQueryAnalyzerWrapper.
<analyzer type="query" class="org.sindice.siren.solr.analysis.MultiQueryAnalyzerWrapper"/>
The MultiQueryAnalyzerWrapper is still in an experimental phase and, due to code restrictions in Solr, it contains a hard-coded reference to the secondary schema.xml (ntriple-schema.xml).
The MultiQueryAnalyzerWrapper delegates the analysis of each type of query to the corresponding analyzer defined in ntriple-schema.xml.
There are two types of queries:
- N-Triple query
- Keyword query
For N-Triple queries, three query-time analyzers are defined: ntriple-main, ntriple-literal and ntriple-uri. For keyword queries, a single query-time analyzer is defined: ntriple-keyword. The field type names of these analyzers must be prefixed with ntriple- in order to be associated with the ntriple field type that is defined in the schema.xml.
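The naming convention can be pictured with the following skeleton of the query-time field types in ntriple-schema.xml. It is only an outline under stated assumptions: the contents of each analyzer are not described in this document, so they are left as placeholder comments, and the skeleton is not loadable as-is.

```xml
<!-- Query-time analyzers for N-Triple queries -->
<fieldType name="ntriple-main" class="org.apache.solr.schema.SubTextField">
  <analyzer type="query">
    <!-- tokenizer and filters go here -->
  </analyzer>
</fieldType>
<fieldType name="ntriple-literal" class="org.apache.solr.schema.SubTextField">
  <analyzer type="query">
    <!-- tokenizer and filters go here -->
  </analyzer>
</fieldType>
<fieldType name="ntriple-uri" class="org.apache.solr.schema.SubTextField">
  <analyzer type="query">
    <!-- tokenizer and filters go here -->
  </analyzer>
</fieldType>

<!-- Query-time analyzer for keyword queries -->
<fieldType name="ntriple-keyword" class="org.apache.solr.schema.SubTextField">
  <analyzer type="query">
    <!-- tokenizer and filters go here -->
  </analyzer>
</fieldType>
```

Each name starts with ntriple-, which is what ties these analyzers to the ntriple field type in the main schema.xml.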