scampi edited this page Sep 15, 2011 · 14 revisions

SIREn Schema

This document explains how to register a new field type for N-Triple indexing and querying in the Solr schema.xml.

N-Triple Field Type

In the example schema.xml, you can find the definition of a field type named ntriple.

Note that in its definition:

    <fieldType name="ntriple" class="solr.TextField" omitNorms="true">

the parameter omitNorms is set to true. Deactivating length normalisation on a SIREn field is recommended because of the particular way SIREn indexes data within a field.

Index-Time Analysis

TupleTokenizerFactory

For index-time analysis, you have to define a TupleTokenizerFactory followed by a list of optional TokenFilterFactories that are applied in the listed order. The TupleTokenizerFactory creates and configures a TupleTokenizer.

<tokenizer class="org.sindice.siren.solr.analysis.TupleTokenizerFactory"
           subschema="ntriple-schema.xml"
           literal-fieldtype="ntriple-literal"/>

The TupleTokenizerFactory requires two parameters, subschema and literal-fieldtype. The subschema parameter points to a secondary schema.xml file that contains the definition of the Literal analyzer. The literal-fieldtype parameter indicates which field type in that secondary schema defines the Literal analyzer.

TokenTypeFilterFactory

The TokenTypeFilterFactory creates a TokenTypeFilter.

This filter can be configured to remove different token types created by SIREn. By default, it removes bnode, dot, datatype and language tag tokens. This behaviour can be changed with four optional parameters: bnode, datatype, languageTag and dot. When one of these parameters is set to 0, the TokenTypeFilter does not filter out the corresponding token type. In the following example, the bnode token type is kept and indexed by SIREn:

<filter class="org.sindice.siren.solr.analysis.TokenTypeFilterFactory"
        bnode="0"/>
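Several of these parameters can be combined on the same filter. The fragment below is a minimal sketch that uses only the parameter names listed above: it keeps bnode, datatype and language tag tokens, and leaves dot unset so that the default (removal) still applies to dot tokens.

```xml
<!-- Keep bnode, datatype and language tag tokens in the index. -->
<!-- The dot parameter is omitted, so dot tokens are still removed by default. -->
<filter class="org.sindice.siren.solr.analysis.TokenTypeFilterFactory"
        bnode="0"
        datatype="0"
        languageTag="0"/>
```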

URIEncodingFilterFactory

The URIEncodingFilterFactory creates a URIEncodingFilter.

The URIEncodingFilter decodes percent-encoded special characters in URIs, such as '?' or '<', that is, characters encoded as a '%' followed by two hexadecimal digits. The space character, which can be encoded as '+' or '%20', is the exception.

URILocalnameFilterFactory

The URILocalnameFilterFactory creates a URILocalnameFilter.

The URILocalnameFilter extracts the localname of a URI and breaks it into smaller components based on delimiters, such as uppercase letters or digits. By default, it does not tokenise a localname longer than 64 characters. This limit can be changed with the parameter maxLength.
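As an illustration, the limit could be raised as follows. This is a sketch only: the fully qualified class name is an assumption based on the package of the other SIREn factories shown on this page, and the value 128 is an arbitrary example.

```xml
<!-- Assumption: the factory lives in the same package as TupleTokenizerFactory. -->
<!-- maxLength="128" is an arbitrary example value; the default is 64. -->
<filter class="org.sindice.siren.solr.analysis.URILocalnameFilterFactory"
        maxLength="128"/>
```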

URITrailingSlashFilterFactory

The URITrailingSlashFilterFactory creates a URITrailingSlashFilter.

The URITrailingSlashFilter removes the trailing slash of a URI.

MailtoFilterFactory

The MailtoFilterFactory creates a MailtoFilter.

The MailtoFilter tokenises the URIs with a mailto scheme.

SirenDeltaPayloadFilterFactory

The SirenDeltaPayloadFilterFactory creates a SirenDeltaPayloadFilter.

It is mandatory to have the SirenDeltaPayloadFilter as the last filter of the list. This filter is in charge of encoding the SIREn metadata (tuple and cell ids) into the payload of each token.
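Putting the pieces of this section together, an index-time analyzer could look like the sketch below. The tokenizer and TokenTypeFilterFactory declarations are taken from the snippets above. The fully qualified class names of the other filters are assumptions (same package as the tokenizer), and the relative order of the URI filters simply follows the order in which this page introduces them. The only ordering constraint stated here is that SirenDeltaPayloadFilterFactory comes last.

```xml
<analyzer type="index">
  <tokenizer class="org.sindice.siren.solr.analysis.TupleTokenizerFactory"
             subschema="ntriple-schema.xml"
             literal-fieldtype="ntriple-literal"/>
  <filter class="org.sindice.siren.solr.analysis.TokenTypeFilterFactory"/>
  <!-- Assumed class names below: package inferred from the factories above. -->
  <filter class="org.sindice.siren.solr.analysis.URIEncodingFilterFactory"/>
  <filter class="org.sindice.siren.solr.analysis.URILocalnameFilterFactory"/>
  <filter class="org.sindice.siren.solr.analysis.URITrailingSlashFilterFactory"/>
  <filter class="org.sindice.siren.solr.analysis.MailtoFilterFactory"/>
  <!-- Must be the last filter: encodes tuple and cell ids into token payloads. -->
  <filter class="org.sindice.siren.solr.analysis.SirenDeltaPayloadFilterFactory"/>
</analyzer>
```

Check the example schema.xml shipped with SIREn for the exact class names and the recommended filter order.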

Literal Analysis

The analysis of Literals is orthogonal to the analysis of tuples. The Literal analyzer is defined in the secondary schema (ntriple-schema.xml):

<fieldType name="ntriple-literal" class="org.apache.solr.schema.SubTextField">

All field types in the secondary schema must be defined as class="org.apache.solr.schema.SubTextField".

ASCIIFoldingExpansionFilterFactory

The ASCIIFoldingExpansionFilterFactory creates an ASCIIFoldingExpansionFilter.

This filter expands accented tokens with a non-accented form. For example, if a literal contains the token 'café', it will create an additional token 'cafe' at the same position as the token 'café'.
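A Literal analyzer using this filter could be declared in the secondary schema roughly as follows. This is a hedged sketch: the tokenizer and lowercase filter are standard Solr factories picked for illustration and are not prescribed by SIREn, the package of ASCIIFoldingExpansionFilterFactory is assumed, and only the index-time analyzer is shown.

```xml
<fieldType name="ntriple-literal" class="org.apache.solr.schema.SubTextField">
  <analyzer type="index">
    <!-- Illustrative choice: any Solr tokenizer could be used here. -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Assumed package; adds a non-accented token next to each accented one. -->
    <filter class="org.sindice.siren.solr.analysis.ASCIIFoldingExpansionFilterFactory"/>
  </analyzer>
</fieldType>
```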

Query-Time Analysis

For query-time analysis, SIREn relies on a MultiQueryAnalyzerWrapper.

<analyzer type="query" class="org.sindice.siren.solr.analysis.MultiQueryAnalyzerWrapper"/>

The MultiQueryAnalyzerWrapper is still in an experimental phase and, due to code restrictions in Solr, it contains a hard-coded reference to the secondary schema.xml (ntriple-schema.xml).

The MultiQueryAnalyzerWrapper delegates the analysis of the different types of queries to the different analyzers defined in ntriple-schema.xml.

There are two types of queries:

  • N-Triple query
  • Keyword query

For N-Triple queries, there are three query-time analyzers, defined in the following field types: ntriple-main, ntriple-literal and ntriple-uri. For keyword queries, there is a single query-time analyzer, defined in the field type ntriple-keyword. The names of these field types must be prefixed with ntriple- in order to be associated with the ntriple field type that is defined in the schema.xml.

  • ntriple-main: This field type defines the tokenizer for N-Triple queries. The analysis of URIs and Literals is defined in the other field types, i.e., ntriple-uri and ntriple-literal.
  • ntriple-uri: This field type defines the analyzer to use for URIs found in an N-Triple query.
  • ntriple-literal: This field type defines the analyzer to use for Literals found in an N-Triple query.
  • ntriple-keyword: This field type defines the analyzer to use whenever a keyword query is issued.
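The four field types listed above could be laid out in ntriple-schema.xml along the lines of the skeleton below. It only illustrates the naming convention and the required SubTextField class. The analyzer contents are deliberately left as comments, since the actual tokenizers and filters are defined in the ntriple-schema.xml shipped with SIREn and are not reproduced here.

```xml
<!-- Skeleton only: each analyzer body is a placeholder, not SIREn's actual configuration. -->
<fieldType name="ntriple-main" class="org.apache.solr.schema.SubTextField">
  <analyzer type="query">
    <!-- tokenizer for N-Triple queries goes here -->
  </analyzer>
</fieldType>

<fieldType name="ntriple-uri" class="org.apache.solr.schema.SubTextField">
  <analyzer type="query">
    <!-- analysis chain for URIs found in an N-Triple query -->
  </analyzer>
</fieldType>

<fieldType name="ntriple-literal" class="org.apache.solr.schema.SubTextField">
  <analyzer type="query">
    <!-- analysis chain for Literals found in an N-Triple query -->
  </analyzer>
</fieldType>

<fieldType name="ntriple-keyword" class="org.apache.solr.schema.SubTextField">
  <analyzer type="query">
    <!-- analysis chain for keyword queries -->
  </analyzer>
</fieldType>
```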