
Document indexing and retrieval using the Lucene Search API


rajulkumar/LuceneSearchEngine


Index and Retrieve using Lucene
-------------------------------
This is an implementation of an indexing and retrieval system using Lucene, an information retrieval library.
Indexing builds an inverted index of a document or a set of documents using Lucene's indexing functions with the Simple Analyzer. It also generates a term frequency file that lists every term in the corpus with its frequency, sorted by frequency.
Retrieval scores all the documents against a query using Lucene's scoring function. The documents are then ranked, and the requested number of top-ranked documents is written to a result file.
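
To illustrate what the indexing step produces, here is a minimal plain-Java sketch of an inverted index. It does not use Lucene and is not the code in Indexer.java. The tokenizer only approximates the Simple Analyzer's behaviour (split at non-letters, lowercase), and the class name, method names and sample documents are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical illustration only; the real project delegates this work to Lucene.
class InvertedIndexSketch {

    // Split text at non-letter characters and lowercase each token,
    // approximating what Lucene's SimpleAnalyzer does.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    // Map each term to the sorted set of document IDs that contain it.
    static Map<String, Set<String>> buildIndex(Map<String, String> docs) {
        Map<String, Set<String>> index = new TreeMap<>();
        for (Map.Entry<String, String> doc : docs.entrySet()) {
            for (String term : tokenize(doc.getValue())) {
                index.computeIfAbsent(term, k -> new TreeSet<>()).add(doc.getKey());
            }
        }
        return index;
    }

    public static void main(String[] args) {
        // Made-up documents, not taken from the CACM corpus.
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("doc1", "Lucene builds an inverted index.");
        docs.put("doc2", "An index maps terms to documents.");
        for (Map.Entry<String, Set<String>> e : buildIndex(docs).entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

A real Lucene index also stores positions, frequencies and norms for scoring; the sketch keeps only the term-to-document mapping.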

Build
-----
Developed in Java using jdk1.8.0_05.
Compiled for jre8.
Lucene's libraries are used for indexing and scoring. The jars are provided in the lib folder at this location:
 lucene-analyzers-common-4.7.2.jar
 lucene-core-4.7.2.jar
 lucene-queryparser-4.7.2.jar

Compile and Run
----------------
For Windows:
1. Execute executeIndexer.bat or executeRetrieve.bat
2. Provide the location of file or directory of the files to be indexed. e.g. C:/IR/Assignment/InR_Lucene or C:/IR/Assignment/InR_Lucene/CACM-0001.html
3. Provide the location where the index is to be created. e.g. C:/IR/Assignment/InR_Lucene/Index
4. Provide the location where the term frequency file is to be created. e.g. C:/IR/Assignment/InR_Lucene
5. Response files are generated at the given location.

For Linux/Unix*:
1. Execute executeIndexer.sh or executeRetrieve.sh
2. Provide the location of file or directory of the files to be indexed. e.g. /IR/Assignment/InR_Lucene or /IR/Assignment/InR_Lucene/CACM-0001.html
3. Provide the location where the index is to be created. e.g. /IR/Assignment/InR_Lucene/Index
4. Provide the location where the term frequency file is to be created. e.g. /IR/Assignment/InR_Lucene
5. Response files are generated at the given location.

* Not tested on Linux/Unix

Alternatively, the Java files Indexer.java and Retrieve.java can be compiled from this location with:
javac -cp ./lib/lucene-analyzers-common-4.7.2.jar;./lib/lucene-core-4.7.2.jar;./lib/lucene-queryparser-4.7.2.jar src/com/search/Indexer.java
javac -cp ./lib/lucene-analyzers-common-4.7.2.jar;./lib/lucene-core-4.7.2.jar;./lib/lucene-queryparser-4.7.2.jar src/com/search/Retrieve.java

The compiled classes can then be run from this location with:
java -cp ./lib/lucene-analyzers-common-4.7.2.jar;./lib/lucene-core-4.7.2.jar;./lib/lucene-queryparser-4.7.2.jar;./src/ com/search/Indexer
java -cp ./lib/lucene-analyzers-common-4.7.2.jar;./lib/lucene-core-4.7.2.jar;./lib/lucene-queryparser-4.7.2.jar;./src/ com/search/Retrieve

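Provide the required paths as described above.
The classpath separator ';' used in these commands is for Windows; on Linux/Unix use ':' instead.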

* The index generated on the given CACM corpus is available at the ./Index location.
* Change or clear the index location if the same files are indexed again into the same location, as this may cause multiple entries for the same doc and give undesired results.

Results/response of execution
-----------------------------
1. The list of terms and their frequencies, sorted by frequency, is in the file "TermFreq.out" at the same location.
   Each entry is stored as <rank>. <term>: <term frequency>

2. The plot of the Zipfian curve for the terms in the corpus is in the file "Zipfian curve.pdf"

3. The list of docs ranked by score for each query is at the location ./Lucene_query_results, in files named <query>_<timestamp>.out

4. The table comparing the total number of docs retrieved per query by Lucene's scoring function and by a search engine built using BM25 is in the file "LuceneVsBM25.pdf"
   at the same location
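
As an illustration of the term frequency format and the Zipfian curve mentioned above, here is a minimal plain-Java sketch. It is not the project's code: the class and method names are hypothetical, the alphabetical tie-break is an assumption, and the sample tokens are made up. Zipf's law predicts that frequency is roughly proportional to 1/rank, so the slope of ln(frequency) against ln(rank) should be close to -1 for natural-language text.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; the real project reads term statistics from the Lucene index.
class TermFreqSketch {

    // Count tokens and return (term, count) entries, highest count first.
    // Ties are broken alphabetically; that tie-break is an assumption.
    static List<Map.Entry<String, Integer>> sortedCounts(List<String> tokens) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : tokens) {
            counts.merge(token, 1, Integer::sum);
        }
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byFreq = Integer.compare(b.getValue(), a.getValue());
            return byFreq != 0 ? byFreq : a.getKey().compareTo(b.getKey());
        });
        return entries;
    }

    // Format entries as "<rank>. <term>: <term frequency>", with ranks starting at 1.
    static List<String> rankedLines(List<Map.Entry<String, Integer>> entries) {
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < entries.size(); i++) {
            Map.Entry<String, Integer> e = entries.get(i);
            lines.add((i + 1) + ". " + e.getKey() + ": " + e.getValue());
        }
        return lines;
    }

    // Least-squares slope of ln(frequency) against ln(rank).
    // Expects at least two positive frequencies, ordered by rank.
    static double logLogSlope(List<Integer> frequenciesByRank) {
        int n = frequenciesByRank.size();
        double sumX = 0, sumY = 0, sumXY = 0, sumXX = 0;
        for (int i = 0; i < n; i++) {
            double x = Math.log(i + 1);
            double y = Math.log(frequenciesByRank.get(i));
            sumX += x;
            sumY += y;
            sumXY += x * y;
            sumXX += x * x;
        }
        return (n * sumXY - sumX * sumY) / (n * sumXX - sumX * sumX);
    }

    public static void main(String[] args) {
        // Made-up tokens, not taken from the CACM corpus.
        List<String> tokens = Arrays.asList("the", "index", "the", "term", "index", "the");
        List<Map.Entry<String, Integer>> entries = sortedCounts(tokens);
        for (String line : rankedLines(entries)) {
            System.out.println(line);
        }
        List<Integer> freqs = new ArrayList<>();
        for (Map.Entry<String, Integer> e : entries) {
            freqs.add(e.getValue());
        }
        System.out.println("log-log slope: " + logLogSlope(freqs));
    }
}
```

On a tiny sample like this the slope is not meaningful; it only becomes informative on a corpus-sized term list such as the one in TermFreq.out.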

References
-----------
1. JavaSE Documentation: https://docs.oracle.com/javase/8/docs/
2. Stack Overflow forum: http://stackoverflow.com
3. Lucene javadoc for V4.7.2: http://lucene.apache.org/core/4_7_2/core/index.html
