INDRA Indexer is divided into two modules:
- indra-preprocessing
- indra-index
The corpus pre-processor is responsible for defining the tokenisation strategy and the tokens' subsequent transformations. It defines, for example, whether *United States of America* corresponds to a single token or to multiple tokens. Stemming and lowercasing are two other popular transformations supported by the pre-processor; the full list of parameters is shown in the table below.
Parameter | Description/Options |
---|---|
input format | Wikipedia-dump format or plain text from one or multiple files. |
language | 14 supported languages. |
set of stopwords | a set of tokens to be removed. |
set of multi-word expressions | sequences of tokens that should be treated as a single token. |
apply lowercase | lowercases the tokens. |
apply stemmer | applies the Porter Stemmer to the tokens. |
remove accents | removes accents from words. |
replace numbers | replaces numbers with a placeholder token. |
min | minimum acceptable token length. |
max | maximum acceptable token length. |
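The transformations above can be sketched as a simple pipeline. This is an illustrative stand-in, not Indra's actual implementation: the parameter names, the `<number>` placeholder, and the crude suffix stripping (in place of the Porter stemmer) are all assumptions, and accent removal is omitted for brevity.

```python
import re

def preprocess(text, stopwords=frozenset(), multiword=(),
               lowercase=True, stem=False, replace_numbers=True,
               min_len=1, max_len=100):
    """Sketch of the pre-processing pipeline; parameters mirror the table above."""
    tokens = re.findall(r"\w+", text)
    if lowercase:
        tokens = [t.lower() for t in tokens]
    # Merge configured multi-word expressions into single tokens.
    for mwe in multiword:
        parts = mwe.lower().split() if lowercase else mwe.split()
        joined = "_".join(parts)
        merged, i = [], 0
        while i < len(tokens):
            if tokens[i:i + len(parts)] == parts:
                merged.append(joined)
                i += len(parts)
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    if stem:
        # Crude suffix stripping as a stand-in for the Porter stemmer.
        tokens = [re.sub(r"(ing|ed|s)$", "", t) if len(t) > 4 else t
                  for t in tokens]
    if replace_numbers:
        tokens = ["<number>" if t.isdigit() else t for t in tokens]
    tokens = [t for t in tokens if t not in stopwords]
    return [t for t in tokens if min_len <= len(t) <= max_len]
```

For example, `preprocess("The United States of America has 50 states", stopwords={"the", "has"}, multiword=["United States of America"])` yields `["united_states_of_america", "<number>", "states"]`.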
The indra-index module is responsible for generating word-embedding models and loading them into the Indra data sources. It defines a unified interface to generate predictive models (e.g. Skip-gram and GloVe) and count-based models (e.g. LSA and ESA), whose implementations come from the DeepLearning4J and S-Space libraries, respectively. In addition to unifying the interface, indra-index integrates the corpus pre-processor module.
The final generated model stores the set of applied transformations as metadata. At consumption time, Indra applies the same set of options to guarantee consistency. For instance, if a given model was generated by applying the stemmer and lowercasing to the tokens, the same transformations are applied to the query terms before lookup.
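This consistency guarantee can be sketched as follows. The metadata field names and the toy stemmer are illustrative assumptions, not Indra's actual schema:

```python
# Sketch: transformation options are stored alongside the model and
# re-applied to query terms at lookup time, so a query is transformed
# exactly as the corpus tokens were at indexing time.
model = {
    "vectors": {"hous": [0.1, 0.2], "car": [0.3, 0.4]},
    "metadata": {"lowercase": True, "stemmer": True},
}

def lookup(model, term):
    meta = model["metadata"]
    if meta.get("lowercase"):
        term = term.lower()
    if meta.get("stemmer"):
        # Toy stand-in for the stemmer used at indexing time.
        term = term.rstrip("e")  # "House" -> "house" -> "hous"
    return model["vectors"].get(term)
```

Without re-applying the transformations, a query for "House" would miss the stored key `hous` entirely.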
Indra loads the generated models into three types of data sources: Annoy indexes (for dense vector models), Lucene indexes (for sparse vector models) or MongoDB (deprecated).
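Conceptually, the Annoy index answers nearest-neighbour queries over the dense vectors. A brute-force stand-in for the same operation (exact, where Annoy is approximate and scales to millions of vectors) looks like this; the toy vocabulary is purely illustrative:

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def nearest(vectors, query, k=2):
    """Return the k words whose vectors are closest to `query`.
    Annoy serves this query approximately via random-projection trees."""
    ranked = sorted(vectors, key=lambda w: cosine(vectors[w], query),
                    reverse=True)
    return ranked[:k]

vecs = {"king": [1.0, 0.9], "queen": [0.9, 1.0], "apple": [-1.0, 0.2]}
```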
Please cite Indra if you use it in your experiments or projects.
```bibtex
@InProceedings{indra,
  author    = {Sales, Juliano Efson and Souza, Leonardo and Barzegar, Siamak and Davis, Brian and Freitas, Andr{\'e} and Handschuh, Siegfried},
  title     = {Indra: A Word Embedding and Semantic Relatedness Server},
  booktitle = {Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)},
  month     = {May},
  year      = {2018},
  address   = {Miyazaki, Japan},
  publisher = {European Language Resources Association (ELRA)},
}
```
- Andre Freitas
- Brian Davis
- Juliano Sales
- Leonardo Souza
- Siamak Barzegar
- Siegfried Handschuh