Solr Schema
This document explains how to register a new field type for N-Triple indexing and querying in the Solr schema.xml.
The example schema.xml contains the definition of a field type named 'ntriple'.
Note that in its definition:
<fieldType name="ntriple" class="solr.TextField" omitNorms="true">
the parameter omitNorms is set to true. Disabling length normalisation on a SIREn field is recommended because of the particular way SIREn indexes data within a field.
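Once the field type is registered, a field can refer to it by name. The following is a minimal sketch rather than a copy of the example schema.xml: the field name and the indexed/stored settings are assumptions chosen for illustration.

```xml
<!-- Hypothetical field declaration: the name "ntriple" and the
     indexed/stored values are illustrative assumptions. -->
<field name="ntriple" type="ntriple" indexed="true" stored="false"/>
```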
For index-time analysis, you have to define a TupleTokenizerFactory followed by a list of optional TokenFilterFactories that are applied in the listed order. The TupleTokenizerFactory creates and configures a TupleTokenizer.
<tokenizer class="org.sindice.siren.solr.analysis.TupleTokenizerFactory"
subschema="ntriple-schema.xml"
literal-fieldtype="ntriple-literal"/>
The TupleTokenizerFactory requires two parameters, subschema and literal-fieldtype. The subschema parameter points to a secondary schema.xml file that contains the definition of the Literal analyzer. The literal-fieldtype parameter indicates which field type in that file defines the Literal analyzer.
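To make the relationship between the two parameters concrete, here is a sketch of what a secondary schema file could look like. It is not the ntriple-schema.xml shipped with SIREn: the schema name and the choice of tokenizer and filter (standard Solr factories) are assumptions, and only the field type name 'ntriple-literal' is taken from the tokenizer configuration.

```xml
<!-- Hypothetical content for ntriple-schema.xml (illustrative only). -->
<schema name="ntriple-subschema">
  <types>
    <!-- The name must match the literal-fieldtype parameter. -->
    <fieldType name="ntriple-literal" class="solr.TextField">
      <analyzer type="index">
        <!-- Assumed analysis chain for literals: standard tokenisation
             followed by lowercasing. -->
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>
  </types>
</schema>
```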
TokenTypeFilterFactory
The TokenTypeFilterFactory creates a TokenTypeFilter.
This filter can be configured to remove different token types created by SIREn. By default, it removes bnode, dot, datatype and language tag tokens. This behaviour can be changed with four optional parameters: bnode, datatype, languageTag and dot. Setting one of these parameters to 0 stops the TokenTypeFilter from filtering out the corresponding token type. In the following example, the bnode token type is kept and indexed by SIREn:
<filter class="org.sindice.siren.solr.analysis.TokenTypeFilterFactory"
bnode="0"/>
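The parameters can presumably be combined on the same filter element. As a sketch under that assumption, the following configuration would keep datatype and language tag tokens while still removing bnode and dot tokens (the defaults):

```xml
<!-- Keep datatype and language tag tokens; bnode and dot tokens are
     still removed because their parameters are left at the default. -->
<filter class="org.sindice.siren.solr.analysis.TokenTypeFilterFactory"
        datatype="0"
        languageTag="0"/>
```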
URIEncodingFilterFactory
The URIEncodingFilterFactory creates a URIEncodingFilter.
The URIEncodingFilter decodes special characters in URIs, such as '?' or '<', that are percent-encoded, i.e. written as a '%' followed by two hexadecimal digits (with the exception of SPACE, which can be encoded with '+').
URILocalnameFilterFactory
The URILocalnameFilterFactory creates a URILocalnameFilter.
The URILocalnameFilter extracts the localname of a URI and breaks it into smaller components based on delimiters, such as uppercase letters or digits. By default, it does not tokenise a localname longer than 64 characters. This limit can be changed with the parameter maxLength.
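As an illustration, the limit could be raised as follows. This is a sketch: the package name is assumed to be the same as for the other SIREn factories shown in this document, and the value 128 is arbitrary.

```xml
<!-- Assumed package name; maxLength="128" is an illustrative value
     that replaces the default limit of 64 characters. -->
<filter class="org.sindice.siren.solr.analysis.URILocalnameFilterFactory"
        maxLength="128"/>
```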
URITrailingSlashFilterFactory
The URITrailingSlashFilterFactory creates a URITrailingSlashFilter.
The URITrailingSlashFilter removes the trailing slash of a URI.
MailtoFilterFactory
The MailtoFilterFactory creates a MailtoFilter.
The MailtoFilter tokenises URIs that use the mailto scheme.
SirenDeltaPayloadFilterFactory
The SirenDeltaPayloadFilterFactory creates a SirenDeltaPayloadFilter.
The SirenDeltaPayloadFilter must be the last filter in the list. It is in charge of encoding the SIREn metadata (tuple and cell ids) into the payload of each token.
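Putting the pieces together, an index-time analyzer could be sketched as below. Two things are assumptions here: the package name of every factory other than TupleTokenizerFactory and TokenTypeFilterFactory, and the relative order of the optional filters, which simply follows the order of this document. The only ordering constraint stated above is that SirenDeltaPayloadFilterFactory comes last.

```xml
<analyzer type="index">
  <!-- Mandatory tokenizer, configured as described above. -->
  <tokenizer class="org.sindice.siren.solr.analysis.TupleTokenizerFactory"
             subschema="ntriple-schema.xml"
             literal-fieldtype="ntriple-literal"/>
  <!-- Optional filters, applied in the listed order (order assumed). -->
  <filter class="org.sindice.siren.solr.analysis.TokenTypeFilterFactory"/>
  <filter class="org.sindice.siren.solr.analysis.URIEncodingFilterFactory"/>
  <filter class="org.sindice.siren.solr.analysis.URILocalnameFilterFactory"/>
  <filter class="org.sindice.siren.solr.analysis.URITrailingSlashFilterFactory"/>
  <filter class="org.sindice.siren.solr.analysis.MailtoFilterFactory"/>
  <!-- Must be the last filter: encodes tuple and cell ids into payloads. -->
  <filter class="org.sindice.siren.solr.analysis.SirenDeltaPayloadFilterFactory"/>
</analyzer>
```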
For query-time analysis, SIREn relies on a MultiQueryAnalyzerWrapper.
<analyzer type="query" class="org.sindice.siren.solr.analysis.MultiQueryAnalyzerWrapper"/>
The MultiQueryAnalyzerWrapper is still experimental and, due to code restrictions in Solr, it contains a hard-coded reference to the secondary schema.xml (ntriple-schema.xml).
This MultiQueryAnalyzerWrapper delegates the analysis