rdelbru edited this page Sep 8, 2011 · 14 revisions

SIREn Schema

This document explains how to register a new field type for N-Triple indexing and querying in the Solr schema.xml.

N-Triple Field Type

The example schema.xml contains the definition of a field type named 'ntriple'.

Note that in its definition:

    <fieldType name="ntriple" class="solr.TextField" omitNorms="true">

the parameter omitNorms is set to true. Deactivating length normalisation on a SIREn field is recommended because of the particular way SIREn indexes data within a field.

Index-Time Analysis

TupleTokenizerFactory

For index-time analysis, you have to define a TupleTokenizerFactory followed by a list of optional TokenFilterFactories that are applied in the listed order. The TupleTokenizerFactory creates and configures a TupleTokenizer.

<tokenizer class="org.sindice.siren.solr.analysis.TupleTokenizerFactory"
           subschema="ntriple-schema.xml"
           literal-fieldtype="ntriple-literal"/>

The TupleTokenizerFactory requires two parameters, subschema and literal-fieldtype. The subschema parameter names a secondary schema.xml file that contains the definition of the Literal analyzer. The literal-fieldtype parameter names the field type, within that secondary schema, that defines the Literal analyzer.
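As a rough sketch of what such a secondary schema might contain, assuming it follows the usual Solr schema layout: the file name ntriple-schema.xml and the field type name ntriple-literal come from the tokenizer definition above, while the schema name, the version, and the choice of a whitespace tokenizer with a lowercase filter are illustrative assumptions, not the configuration shipped with SIREn.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- ntriple-schema.xml: illustrative sketch only -->
<schema name="ntriple-subschema" version="1.2">
  <types>
    <!-- The name must match the literal-fieldtype parameter of the TupleTokenizerFactory -->
    <fieldType name="ntriple-literal" class="solr.TextField">
      <analyzer>
        <!-- Assumed analysis chain for literals: split on whitespace, then lowercase -->
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>
  </types>
</schema>
```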

TokenTypeFilterFactory

The TokenTypeFilterFactory creates a TokenTypeFilter.

This filter can be configured to remove different token types created by SIREn. By default, it removes bnode, dot, datatype and language tag tokens. This behaviour can be changed with four optional parameters: bnode, datatype, languageTag and dot. When one of these parameters is set to 0, the TokenTypeFilter no longer filters out the corresponding token type. In the following example, the bnode token type is kept and indexed by SIREn:

<filter class="org.sindice.siren.solr.analysis.TokenTypeFilterFactory"
        bnode="0"/>
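A variation on the same declaration, assuming that several of the four parameters can be set on a single filter: here datatype and language tag tokens would be kept, while bnode and dot tokens are still removed by default.

```xml
<!-- Sketch: keep datatype and language tag tokens; bnode and dot keep their default (removed) -->
<filter class="org.sindice.siren.solr.analysis.TokenTypeFilterFactory"
        datatype="0"
        languageTag="0"/>
```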

URIEncodingFilterFactory

The URIEncodingFilterFactory creates a URIEncodingFilter.

The URIEncodingFilter decodes encoded special characters in URIs, such as '?' or '<'. Each special character is encoded as a '%' followed by two hexadecimal digits; the exception is the space character, which can also be encoded as '+'.

URILocalnameFilterFactory

The URILocalnameFilterFactory creates a URILocalnameFilter.

The URILocalnameFilter extracts the localname of a URI and breaks it into smaller components based on delimiters, such as uppercase letters or digits. By default, it does not tokenise a localname longer than 64 characters. This limit can be changed with the parameter maxLength.
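A minimal sketch of raising that limit. The maxLength parameter is the one described above; the fully qualified class name assumes that this factory lives in the same package as the TupleTokenizerFactory and TokenTypeFilterFactory, and the value 128 is arbitrary.

```xml
<!-- Sketch: tokenise localnames up to 128 characters (package name is an assumption) -->
<filter class="org.sindice.siren.solr.analysis.URILocalnameFilterFactory"
        maxLength="128"/>
```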

URITrailingSlashFilterFactory

The URITrailingSlashFilterFactory creates a URITrailingSlashFilter.

The URITrailingSlashFilter removes the trailing slash of a URI.

MailtoFilterFactory

The MailtoFilterFactory creates a MailtoFilter.

The MailtoFilter tokenises URIs that use the mailto scheme.

SirenDeltaPayloadFilterFactory

The SirenDeltaPayloadFilterFactory creates a SirenDeltaPayloadFilter.

It is mandatory to have the SirenDeltaPayloadFilter as the last filter of the list. This filter is in charge of encoding the SIREn metadata (tuple and cell ids) into the payload of each token.
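Putting the pieces together, an index-time analyzer could look roughly like the sketch below. The tokenizer and the TokenTypeFilterFactory declarations are taken from the examples above. The package names of the other factories are assumed to match, and the order of the optional filters simply follows the order of this page. Only the position of the SirenDeltaPayloadFilterFactory, last in the list, is a stated requirement.

```xml
<analyzer type="index">
  <!-- Mandatory tokenizer, with its two required parameters -->
  <tokenizer class="org.sindice.siren.solr.analysis.TupleTokenizerFactory"
             subschema="ntriple-schema.xml"
             literal-fieldtype="ntriple-literal"/>
  <!-- Optional filters, applied in the listed order (order and package names are assumptions) -->
  <filter class="org.sindice.siren.solr.analysis.TokenTypeFilterFactory"/>
  <filter class="org.sindice.siren.solr.analysis.URIEncodingFilterFactory"/>
  <filter class="org.sindice.siren.solr.analysis.URILocalnameFilterFactory"/>
  <filter class="org.sindice.siren.solr.analysis.URITrailingSlashFilterFactory"/>
  <filter class="org.sindice.siren.solr.analysis.MailtoFilterFactory"/>
  <!-- Must be last: encodes the tuple and cell ids into each token's payload -->
  <filter class="org.sindice.siren.solr.analysis.SirenDeltaPayloadFilterFactory"/>
</analyzer>
```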

Query-Time Analysis

For query-time analysis, SIREn relies on a MultiQueryAnalyzerWrapper.

<analyzer type="query" class="org.sindice.siren.solr.analysis.MultiQueryAnalyzerWrapper"/>

The MultiQueryAnalyzerWrapper is still in an experimental phase and, due to code restrictions in Solr, it contains a hard-coded reference to the secondary schema.xml (ntriple-schema.xml).
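For orientation, the sketch below shows where the query analyzer sits inside the 'ntriple' field type, next to the index analyzer. The index analyzer is reduced here to its two mandatory elements, and the package name of the SirenDeltaPayloadFilterFactory is an assumption.

```xml
<fieldType name="ntriple" class="solr.TextField" omitNorms="true">
  <analyzer type="index">
    <tokenizer class="org.sindice.siren.solr.analysis.TupleTokenizerFactory"
               subschema="ntriple-schema.xml"
               literal-fieldtype="ntriple-literal"/>
    <!-- Optional filters omitted for brevity; the payload filter stays last -->
    <filter class="org.sindice.siren.solr.analysis.SirenDeltaPayloadFilterFactory"/>
  </analyzer>
  <!-- Query-time analysis is delegated to the wrapper class -->
  <analyzer type="query" class="org.sindice.siren.solr.analysis.MultiQueryAnalyzerWrapper"/>
</fieldType>
```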

This MultiQueryAnalyzerWrapper delegates the analysis