INDRA Indexer is divided into two modules:
- indra-preprocessing
- indra-index
The corpus pre-processor is responsible for defining the tokenisation strategy and the tokens' subsequent transformations. It defines, for example, whether *United States of America* corresponds to a single token or to multiple tokens. Stemming and lowercasing are two other popular transformations supported by the pre-processor; the full list of parameters is shown in the table below.
Parameter | Description/Options |
---|---|
input format | Wikipedia-dump format or plain text from one or multiple files. |
language | 14 supported languages. |
set of stopwords | a set of tokens to be removed. |
set of multi-word expressions | sequences of tokens that should be treated as a single token. |
apply lowercase | lowercases the tokens. |
apply stemmer | applies the Porter Stemmer to the tokens. |
remove accents | removes accents from words. |
replace numbers | replaces numbers with a placeholder token. |
min | minimum acceptable token length. |
max | maximum acceptable token length. |
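The transformations above can be sketched as a simple pipeline. This is an illustrative stand-in, not Indra's actual implementation: the parameter names, the `<number>` placeholder, and the crude suffix stripping (in place of the Porter stemmer) are all assumptions, and accent removal is omitted for brevity.

```python
import re

def preprocess(text, stopwords=frozenset(), multiword=(),
               lowercase=True, stem=False, replace_numbers=True,
               min_len=1, max_len=100):
    """Sketch of the pre-processing pipeline; parameters mirror the table above."""
    tokens = re.findall(r"\w+", text)
    if lowercase:
        tokens = [t.lower() for t in tokens]
    # Merge configured multi-word expressions into single tokens.
    for mwe in multiword:
        parts = mwe.lower().split() if lowercase else mwe.split()
        joined = "_".join(parts)
        merged, i = [], 0
        while i < len(tokens):
            if tokens[i:i + len(parts)] == parts:
                merged.append(joined)
                i += len(parts)
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    if stem:
        # Crude suffix stripping as a stand-in for the Porter stemmer.
        tokens = [re.sub(r"(ing|ed|s)$", "", t) if len(t) > 4 else t
                  for t in tokens]
    if replace_numbers:
        tokens = ["<number>" if t.isdigit() else t for t in tokens]
    tokens = [t for t in tokens if t not in stopwords]
    return [t for t in tokens if min_len <= len(t) <= max_len]
```

For example, `preprocess("The United States of America has 50 states", stopwords={"the", "has"}, multiword=["United States of America"])` yields `["united_states_of_america", "<number>", "states"]`.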
The indra-index module is responsible for generating word-embedding models and loading them into the Indra data sources. It defines a unified interface to generate predictive models (e.g. Skip-gram and GloVe) and count-based models (e.g. LSA and ESA), whose implementations come from the DeepLearning4J and S-Space libraries, respectively. In addition to unifying the interface, indra-index integrates the corpus pre-processor module.
The final generated model stores the set of applied transformations as metadata. At consumption time, Indra applies the same set of options to guarantee consistency. For instance, if a given model was generated by applying the stemmer and lowercasing to the tokens, the same transformations are applied to the query terms before lookup.
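This consistency guarantee can be sketched as follows. The metadata field names and the toy stemmer are illustrative assumptions, not Indra's actual schema:

```python
# Sketch: transformation options are stored alongside the model and
# re-applied to query terms at lookup time, so a query is transformed
# exactly as the corpus tokens were at indexing time.
model = {
    "vectors": {"hous": [0.1, 0.2], "car": [0.3, 0.4]},
    "metadata": {"lowercase": True, "stemmer": True},
}

def lookup(model, term):
    meta = model["metadata"]
    if meta.get("lowercase"):
        term = term.lower()
    if meta.get("stemmer"):
        # Toy stand-in for the stemmer used at indexing time.
        term = term.rstrip("e")  # "House" -> "house" -> "hous"
    return model["vectors"].get(term)
```

Without re-applying the transformations, a query for "House" would miss the stored key `hous` entirely.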
Indra loads the generated models into three types of data sources: Annoy indexes (for dense vector models), Lucene indexes (for sparse vector models) or MongoDB (deprecated).
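Conceptually, the Annoy index answers nearest-neighbour queries over the dense vectors. A brute-force stand-in for the same operation (exact, where Annoy is approximate and scales to millions of vectors) looks like this; the toy vocabulary is purely illustrative:

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def nearest(vectors, query, k=2):
    """Return the k words whose vectors are closest to `query`.
    Annoy serves this query approximately via random-projection trees."""
    ranked = sorted(vectors, key=lambda w: cosine(vectors[w], query),
                    reverse=True)
    return ranked[:k]

vecs = {"king": [1.0, 0.9], "queen": [0.9, 1.0], "apple": [-1.0, 0.2]}
```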
Please cite Indra if you use it in your experiments or projects.
```bibtex
@InProceedings{indra,
  author    = {Sales, Juliano Efson and Souza, Leonardo and Barzegar, Siamak and Davis, Brian and Freitas, Andr{\'e} and Handschuh, Siegfried},
  title     = {Indra: A Word Embedding and Semantic Relatedness Server},
  booktitle = {Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)},
  month     = {May},
  year      = {2018},
  address   = {Miyazaki, Japan},
  publisher = {European Language Resources Association (ELRA)},
}
```
- Andre Freitas
- Brian Davis
- Juliano Sales
- Leonardo Souza
- Siamak Barzegar
- Siegfried Handschuh