This project is a concurrent information retrieval application based on the Standard Boolean Model. More precisely, it processes queries in parallel over the Cranfield collection of text documents, using atomic memory transactions implemented in C++.
Based on Boolean logic and classical set theory, the Boolean Model represents documents and queries as sets of terms. As a result, retrieval is based on whether or not documents contain the query terms.
For example, given a set of documents Doci and a query Q:
- Doc1 -> {word1, word2, word3}
- Doc2 -> {word2, word3}
- Doc3 -> {word3}
- Q -> {word1, word2, word3}
The Boolean model would evaluate the documents as follows:
- Doc1 -> score = 3 (contains 3 terms)
- Doc2 -> score = 2 (contains 2 terms)
- Doc3 -> score = 1 (contains 1 term)
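The scoring in the example above can be sketched in C++ as follows. This is only a minimal illustration, assuming the score is the number of query terms that a document contains. The names `TermSet` and `overlap_score` are hypothetical and are not taken from the project's code.

```cpp
#include <cstddef>
#include <set>
#include <string>

// Hypothetical alias: a document or a query viewed as a set of terms.
using TermSet = std::set<std::string>;

// Counts how many of the query terms occur in the document.
// Assumption: this count is the score used in the example above.
std::size_t overlap_score(const TermSet& doc, const TermSet& query) {
    std::size_t score = 0;
    for (const auto& term : query) {
        if (doc.count(term) > 0) {
            ++score;
        }
    }
    return score;
}
```

Calling `overlap_score` with the term sets of Doc1, Doc2 and Doc3 against Q reproduces the scores listed above.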
The test collection of Cranfield includes 1400 abstracts of aeronautical journal articles, a set of 225 queries, and exhaustive relevance evaluations of all (query, document) pairs.
Initially, the Cranfield collection was stored in two files:
- cran.all.1400, which contains 1400 abstracts of aeronautical journal articles
- cran.qry, which contains the 225 queries
In order to facilitate parallel processing, the documents and queries are split into separate text files: 1400 for the documents and 225 for the queries.
Apart from splitting, the SnowballAnalyzer and StopAnalyzer classes of Apache Lucene are used for stemming and stop-word removal.