Phantom Search Engine is a lightweight, distributed web search engine designed to provide fast and relevant search results.
- Distributed crawler system for efficient web crawling
- Multithreaded crawling for concurrent processing
- TF-IDF-based indexing for faster search and retrieval (a minimal sketch follows this list)
- Query engine for processing user queries and returning relevant results
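The indexing feature is easiest to picture with a small sketch. The corpus, tokenization, and index layout below are illustrative assumptions, not Phantom's actual implementation; they only show the general TF-IDF weighting idea.

```python
import math
from collections import Counter, defaultdict

def build_tfidf_index(docs):
    """Build a toy inverted index of TF-IDF weights: {term: {doc_id: weight}}.

    `docs` maps doc_id -> list of tokens. This only illustrates the general
    technique; Phantom's real index layout and weighting may differ.
    """
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))

    index = defaultdict(dict)
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        for term, count in tf.items():
            # Term frequency scaled by a smoothed inverse document frequency.
            idf = math.log((1 + n_docs) / (1 + df[term])) + 1
            index[term][doc_id] = (count / len(tokens)) * idf
    return index

# Tiny hypothetical corpus.
docs = {
    "page1": ["phantom", "search", "engine"],
    "page2": ["distributed", "web", "crawler", "search"],
}
print(build_tfidf_index(docs)["search"])
```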
- Python 3.8 or higher
- pip
- Clone the repository:
git clone https://github.com/AnsahMohammad/Phantom.git
cd Phantom
- Create a virtual environment and activate it:
python3 -m venv .env
source .env/bin/activate
- Install the necessary dependencies:
pip install -r requirements.txt
- Build the files:
./build.sh
- Open the Search Engine GUI:
python phantom.py
- Run the build.sh script:
./build.sh
This script performs the following actions:
- Creates a virtual environment and activates it.
- Installs the necessary dependencies from the requirements.txt file.
- Runs the Phantom crawler with the specified parameters.
- Downloads the necessary NLTK packages, stopwords and punkt (see the snippet after this list).
- Runs the Phantom indexing module.
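The NLTK step amounts to the standard download calls shown below; build.sh may invoke them differently, so treat this as a sketch rather than a copy of the script.

```python
import nltk

# Fetch the corpora used for stopword removal and tokenization.
nltk.download("stopwords")
nltk.download("punkt")
```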
- Start the query engine locally in the terminal by running the search.sh file (a minimal scoring sketch follows):
./search.sh
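Conceptually, the query engine looks up each query term in the index and accumulates TF-IDF weights per document. The index layout and scoring below are assumptions carried over from the earlier sketch, not Phantom's actual query code.

```python
def search(index, query_tokens, top_k=10):
    """Rank documents by the summed TF-IDF weight of matching query terms."""
    scores = {}
    for term in query_tokens:
        for doc_id, weight in index.get(term, {}).items():
            scores[doc_id] = scores.get(doc_id, 0.0) + weight
    # Highest-scoring documents first; zero-score documents never appear.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top_k]

# Tiny hypothetical index: {term: {doc_id: tf-idf weight}}.
index = {
    "search": {"page1": 0.4, "page2": 0.2},
    "distributed": {"page2": 0.5},
}
print(search(index, ["distributed", "search"]))
```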
- Run the crawl.sh file after updating the necessary parameters (a concurrent-fetch sketch follows this list)
- Run the local_search.sh file to index the crawled sites and run the query engine on them
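The multithreaded crawling mentioned in the features can be pictured as a simple concurrent fetch loop. The seed URLs, `requests`/`BeautifulSoup` usage, and worker count below are assumptions for illustration, not Phantom's crawler code.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urljoin

import requests                 # assumed available via requirements.txt
from bs4 import BeautifulSoup   # assumed available via requirements.txt

def fetch(url):
    """Download a page and return its title and outgoing links."""
    response = requests.get(url, timeout=10)
    soup = BeautifulSoup(response.text, "html.parser")
    title = soup.title.string if soup.title else ""
    links = [urljoin(url, a["href"]) for a in soup.find_all("a", href=True)]
    return title, links

# Fetch a handful of seed pages concurrently.
seeds = ["https://example.com", "https://example.org"]
with ThreadPoolExecutor(max_workers=4) as pool:
    for url, (title, links) in zip(seeds, pool.map(fetch, seeds)):
        print(url, title, len(links))
```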
Note: Read the documentation here
We welcome contributions! Please see our CONTRIBUTING.md for details on how to contribute to this project.
This project is licensed under the terms of the Apache License. See the LICENSE file for details.
- Restrict sending to the DB when title and content are EMPTY
- Do not show results when the score is 0
- Error handling
- Consistency in logs
- Enable local DB
- Distributed query processing
- Caching locally
- Two-layer crawling
- Optimize the scheduler by storing visited nodes
- Use a unified crawler system in a master-slave architecture
- Create Storage abstraction classes for local and remote clients
- TF-IDF only on title
- Better similarity measure on content
- Generalize Storage Class
- Optimize the deployment
- Remove the nltk processing
- Refactor the codebase
- Migrate from local_db to cloud (Phase 1)
- Optimize the user interface
- Replace content with metadata (perhaps?)
- Extract background worker sites from env
- AI support Beta
- Template optimizations
- Extract timestamp and sort accordingly
- Remote crawler service (use background workers)
- Analyze the extractable metadata
- Error logger to Supabase for analytics
- Don't download every time the query engine is started
- Crawler doesn't follow the schema of remote_db
- Tracking variables on the server
- UI reorganization
- Title TF-IDF
- Join contents with .join(" ")
- Optimize parser to extract data effectively
- Add tests
Track uptime here: Uptime