A MapReduce-based distributed search engine for efficiently discovering and indexing academic papers from USENIX conferences. The system uses parallel, distributed processing to handle large-scale datasets, improving retrieval speed and search accuracy.
- Web Crawler – Iteratively collects URLs of academic papers.
- Downloader – Fetches and preprocesses text from crawled URLs.
- Indexer – Builds an inverted index for fast and efficient querying (a map/reduce sketch follows this list).
- Querier – Retrieves academic papers matching search terms and ranks results by relevance.
- Scalable Deployment – Optimized for AWS EC2 instances, enabling dynamic scaling.
- Performance Optimizations – Implements batch processing, network traffic reduction, and serialization improvements to enhance efficiency.
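To make the indexing stage concrete, here is a minimal sketch of what the Indexer's map and reduce steps could look like. The function signatures and the `{ term: { url: count } }` posting format are illustrative assumptions, not the project's actual API.

```js
// Hedged sketch of an inverted-index MapReduce pair; the signatures and
// posting format are assumptions for illustration.

// map: emit one (term, url) pair for every term occurrence in a document
function map(url, text) {
  const pairs = [];
  for (const term of text.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean)) {
    pairs.push({ [term]: url });
  }
  return pairs;
}

// reduce: collapse a term's emitted URLs into per-URL occurrence counts,
// which a querier can use to rank results by relevance
function reduce(term, urls) {
  const postings = {};
  for (const url of urls) {
    postings[url] = (postings[url] || 0) + 1;
  }
  return { [term]: postings };
}
```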
- Socket Hang-Ups – Resolved by batching coordination messages, reducing the number of round trips between nodes (see the batching sketch after this list).
- Serialization Errors – Resolved by filtering special characters before serialization, ensuring clean data handling.
- Load Balancing – Investigating hashing improvements for better task distribution across nodes (see the hashing sketch below).
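One way the batch-processing fix could be structured: buffer outgoing key-value pairs and flush them to a peer in a single request once a threshold is reached, rather than opening a connection per pair. `sendBatch` below is a hypothetical stand-in for the project's node-to-node transport, not its real API.

```js
// Hedged sketch of batching coordination traffic. `sendBatch` is a
// hypothetical async transport call, assumed for illustration.
class BatchSender {
  constructor(sendBatch, limit = 100) {
    this.sendBatch = sendBatch; // async (pairs) => void
    this.limit = limit;
    this.buffer = [];
  }

  async push(key, value) {
    this.buffer.push([key, value]);
    // flush once the buffer is full, so many pairs share one round trip
    if (this.buffer.length >= this.limit) await this.flush();
  }

  async flush() {
    if (this.buffer.length === 0) return;
    const pairs = this.buffer;
    this.buffer = [];
    await this.sendBatch(pairs);
  }
}
```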
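On the load-balancing item, consistent hashing is one common direction for such hashing improvements: keys and nodes are hashed onto the same ring, so adding or removing a node only remaps the keys nearest to it. The sketch below is purely illustrative and not necessarily the project's current scheme.

```js
const crypto = require('crypto');

// Illustrative consistent-hash ring: a key belongs to the first node whose
// ring position is >= the key's position, wrapping around at the end.
function ringPos(s) {
  const hex = crypto.createHash('sha256').update(s).digest('hex');
  return parseInt(hex.slice(0, 8), 16);
}

function nodeFor(key, nodes) {
  const ring = nodes
    .map((n) => ({ node: n, pos: ringPos(n) }))
    .sort((a, b) => a.pos - b.pos);
  const k = ringPos(key);
  const owner = ring.find((e) => e.pos >= k) || ring[0]; // wrap around
  return owner.node;
}
```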
- Run Crawler: `npm run test test/crawlv3.test.js`
- Download Text: `npm run test test/downloadText.test.js`
- Index Data: `npm run test test/invert.test.js`
- Query Papers: `npm run test test/query.test.js`
- Implement breadth-first search crawling with cycle detection (see the sketch after this list).
- Enhance parallel downloading and adaptive bandwidth management.
- Optimize indexing with advanced NLP techniques for better academic term recognition.
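For the first roadmap item, here is a minimal sketch of breadth-first crawling with cycle detection, assuming a hypothetical `fetchLinks(url)` helper that returns the outgoing links of a page:

```js
// Hedged sketch: the FIFO frontier gives breadth-first order; the visited
// set provides cycle detection by never re-enqueueing a URL. `fetchLinks`
// is a hypothetical helper standing in for the crawler's fetch/parse logic.
async function bfsCrawl(seedUrl, fetchLinks, maxPages = 1000) {
  const visited = new Set([seedUrl]);
  const frontier = [seedUrl];
  const pages = [];

  while (frontier.length > 0 && pages.length < maxPages) {
    const url = frontier.shift();
    pages.push(url);
    for (const link of await fetchLinks(url)) {
      if (!visited.has(link)) {
        visited.add(link);
        frontier.push(link);
      }
    }
  }
  return pages;
}
```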