rajulkumar/LuceneSearchEngine
Index and Retrieve using Lucene
-------------------------------

This is an implementation of an indexing and retrieval system using Lucene, an information retrieval library.

The indexer builds an inverted index of a single document or a set of documents using Lucene's indexing functions with SimpleAnalyzer. It also writes a term frequency file that lists every term in the corpus with its frequency, sorted by frequency.

For retrieval, Lucene's scoring function scores the documents against a query. The documents are ranked by score, and the requested number of top-ranked documents is written to a file as the result.

Build
-----

Developed in Java using jdk1.8.0_05. Compiled for jre8.
Lucene's libraries are used for indexing and scoring. The jars are provided in the lib folder at this location:

    lucene-analyzers-common-4.7.2.jar
    lucene-core-4.7.2.jar
    lucene-queryparser-4.7.2.jar

Compile and Run
----------------

For windows:
1. Execute executeIndexer.bat or executeRetrieve.bat.
2. Provide the location of the file, or the directory of files, to be indexed.
   e.g. C:/IR/Assignment/InR_Lucene or C:/IR/Assignment/InR_Lucene/CACM-0001.html
3. Provide the location where the index is to be created.
   e.g. C:/IR/Assignment/InR_Lucene/Index
4. Provide the location where the term frequency file is to be created.
   e.g. C:/IR/Assignment/InR_Lucene
5. Response files are generated at the given location.

For linux/unix*:
1. Execute executeIndexer.sh or executeRetrieve.sh.
2. Provide the location of the file, or the directory of files, to be indexed.
   e.g. /IR/Assignment/InR_Lucene or /IR/Assignment/InR_Lucene/CACM-0001.html
3. Provide the location where the index is to be created.
   e.g. /IR/Assignment/InR_Lucene/Index
4. Provide the location where the term frequency file is to be created.
   e.g. /IR/Assignment/InR_Lucene
5. Response files are generated at the given location.

* Not tested on linux/unix.

Alternatively, the Java files Indexer.java and Retrieve.java can be compiled from this location with:

    javac -cp ./lib/lucene-analyzers-common-4.7.2.jar;./lib/lucene-core-4.7.2.jar;./lib/lucene-queryparser-4.7.2.jar src/com/search/Indexer.java
    javac -cp ./lib/lucene-analyzers-common-4.7.2.jar;./lib/lucene-core-4.7.2.jar;./lib/lucene-queryparser-4.7.2.jar src/com/search/Retrieve.java

They can then be run from the same location with:

    java -cp ./lib/lucene-analyzers-common-4.7.2.jar;./lib/lucene-core-4.7.2.jar;./lib/lucene-queryparser-4.7.2.jar;./src/ com/search/Indexer
    java -cp ./lib/lucene-analyzers-common-4.7.2.jar;./lib/lucene-core-4.7.2.jar;./lib/lucene-queryparser-4.7.2.jar;./src/ com/search/Retrieve

The ; classpath separator in these commands is the Windows form; on linux/unix, use : instead. Provide the required paths as described above.

* The index generated on the given CACM corpus is available at ./Index.
* Change or clear the index location before indexing the same files into the same location again; otherwise the index may contain multiple entries for the same doc, giving undesired results.

Results/response of execution
-----------------------------

1. The list of terms and term frequencies, sorted by frequency, is in the file "TermFreq.out" at the same location.
   Each entry is stored as <rank>. <term>: <term frequency>
2. The plot of the Zipfian curve for the terms obtained in the corpus is in the file "Zipfian curve.pdf".
3. The list of docs ranked by score for each query is at the location ./Lucene_query_results, in the files <query>_<timestamp>.out
4. The table comparing the total number of docs retrieved per query by Lucene's scoring function and by a search engine built using BM25 is in the file "LuceneVsBM25.pdf" at the same location.

References
-----------

1. Java SE documentation: https://docs.oracle.com/javase/8/docs/
2. Stack Overflow forum: http://stackoverflow.com
3. Lucene javadoc for v4.7.2: http://lucene.apache.org/core/4_7_2/core/index.html
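
Illustrative sketch
-------------------

To make the inverted index and the TermFreq.out line format described above more concrete, here is a minimal sketch in plain Java. It is not the repository's Indexer.java and it does not call Lucene. The class name TermFrequencySketch and its methods are hypothetical, the tokenizer only approximates SimpleAnalyzer (split on non-letters, then lowercase), and the alphabetical tie-break between equally frequent terms is an assumption, since the README does not say how ties are ordered.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only: no Lucene calls, not the project's code.
final class TermFrequencySketch {

    // Approximates SimpleAnalyzer: break text at non-letter characters, then lowercase.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.split("[^\\p{L}]+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    // Inverted index: term -> (document id -> occurrences of the term in that document).
    static Map<String, Map<String, Integer>> buildIndex(Map<String, String> docs) {
        Map<String, Map<String, Integer>> index = new TreeMap<>();
        for (Map.Entry<String, String> doc : docs.entrySet()) {
            for (String term : tokenize(doc.getValue())) {
                index.computeIfAbsent(term, t -> new TreeMap<>())
                     .merge(doc.getKey(), 1, Integer::sum);
            }
        }
        return index;
    }

    // Corpus-wide frequencies, highest first, one line per term in the form
    // "<rank>. <term>: <term frequency>".
    static List<String> rankedTermFrequencies(Map<String, Map<String, Integer>> index) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, Map<String, Integer>> posting : index.entrySet()) {
            int total = 0;
            for (int count : posting.getValue().values()) {
                total += count;
            }
            totals.put(posting.getKey(), total);
        }
        List<Map.Entry<String, Integer>> ranked = new ArrayList<>(totals.entrySet());
        // The sort is stable, so equally frequent terms stay in alphabetical order (assumed tie-break).
        ranked.sort((a, b) -> Integer.compare(b.getValue(), a.getValue()));

        List<String> lines = new ArrayList<>();
        int rank = 1;
        for (Map.Entry<String, Integer> entry : ranked) {
            lines.add(rank++ + ". " + entry.getKey() + ": " + entry.getValue());
        }
        return lines;
    }

    public static void main(String[] args) {
        Map<String, String> docs = new TreeMap<>();
        docs.put("doc1", "Lucene builds an inverted index.");
        docs.put("doc2", "The index maps terms to documents; Lucene scores them.");
        for (String line : rankedTermFrequencies(buildIndex(docs))) {
            System.out.println(line);
        }
    }
}
```

The per-document counts are kept in the index so that the corpus-wide totals can be derived from it in a second pass. Whether the project computes its term frequencies this way or reads them from Lucene's index is not stated in this README.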
About
Document indexing and retrieval using Lucene Search API