Phantom Search Engine

Phantom Search Engine is a lightweight, distributed web search engine designed to provide fast and relevant search results.

Phantom Demo

Features

  • Distributed crawler system for efficient web crawling
  • Multithreaded crawling for concurrent processing
  • TF-IDF based indexing for faster search and retrieval (a minimal scoring sketch follows this list)
  • Query engine for processing user queries and returning relevant results
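
To make the indexing and query engine features concrete, here is a minimal, self-contained TF-IDF sketch in Python. It is an illustration only, not Phantom's actual code: the toy corpus and the tf_idf and search functions are hypothetical names, and the scoring is a standard smoothed TF-IDF rather than the project's exact formula.

import math

# Toy corpus standing in for crawled pages (hypothetical).
DOCS = {
    "page_a": "phantom is a lightweight search engine",
    "page_b": "the crawler fetches pages and builds the search index",
}

def tf_idf(term, tokens, corpus):
    # Term frequency within this document.
    tf = tokens.count(term) / len(tokens)
    # Document frequency across the corpus, smoothed to avoid log(0).
    df = sum(1 for text in corpus.values() if term in text.split())
    idf = math.log(len(corpus) / (1 + df)) + 1
    return tf * idf

def search(query, corpus):
    # Score every document against every query term, then rank.
    scores = {}
    for doc_id, text in corpus.items():
        tokens = text.split()
        scores[doc_id] = sum(tf_idf(t, tokens, corpus) for t in query.lower().split())
    # Hide zero-score results, echoing the 0.9.2 changelog entry below.
    return sorted(((s, d) for d, s in scores.items() if s > 0), reverse=True)

print(search("search engine", DOCS))

The sketch only shows the shape of the computation; per the 0.9 notes below, Phantom applies TF-IDF to titles rather than full page content.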

Getting Started

Prerequisites

  • Python 3.8 or higher
  • pip

Installation

  1. Clone the repository:
git clone https://github.com/AnsahMohammad/Phantom.git
cd Phantom
  2. Create a virtual environment and activate it:
python3 -m venv .env
source .env/bin/activate
  3. Install the necessary dependencies:
pip install -r requirements.txt
  4. Build the files:
./build.sh
  5. Open the search engine GUI:
python phantom.py

Building from Source

  1. Run the build.sh script:
./build.sh

This script performs the following actions:

  • Creates a virtual environment and activates it.
  • Installs the necessary dependencies from the requirements.txt file.
  • Runs the Phantom crawler with the specified parameters.
  • Downloads the necessary NLTK packages, stopwords and punkt (a sketch of this step follows this section).
  • Runs the Phantom indexing module.
  2. Start the query engine locally in the terminal by running the search.sh file:
./search.sh
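
For reference, the NLTK download step that build.sh performs can also be run by hand. This uses the standard nltk.download API and is just a sketch of that single step:

import nltk

# Fetch the two corpora the build step lists above.
nltk.download("stopwords")  # common-word lists used to filter out noise words
nltk.download("punkt")      # pretrained tokenizer models for splitting text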

Alternative Method

  1. Update the necessary parameters in crawl.sh, then run it
  2. Run local_search.sh to index the crawled sites and start the query engine on them

Note: Read the documentation here

Contributing

We welcome contributions! Please see our CONTRIBUTING.md for details on how to contribute to this project.

License

This project is licensed under the terms of the Apache License. See the LICENSE file for details.

Development and Maintenance

0.9.2

  • Don't send pages to the DB when both title and content are empty
  • Don't show results whose score is 0

0.9.1

  • Error handling
  • Consistency in logs
  • Enable local DB

0.10+

  • Distributed query processing
  • Local caching
  • Two-layer crawling
  • Optimize the scheduler by storing visited nodes (see the sketch after this list)
  • Use a unified crawler system in a master-slave architecture
  • Create storage abstraction classes for local and remote clients
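
As a hypothetical illustration of the visited-node idea above (not Phantom's actual scheduler), a crawl frontier can remember what it has queued and skip repeats:

from collections import deque

# Minimal crawl frontier that stores visited URLs (illustrative only).
class Scheduler:
    def __init__(self, seeds):
        self.frontier = deque(seeds)
        self.visited = set(seeds)

    def next_url(self):
        # Next URL to crawl, or None when the frontier is empty.
        return self.frontier.popleft() if self.frontier else None

    def add(self, url):
        # Drop URLs that were already scheduled or crawled.
        if url not in self.visited:
            self.visited.add(url)
            self.frontier.append(url)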

0.9

  • TF-IDF only on title
  • Better similarity measure on content
  • Generalize Storage Class

0.8

  • Optimize the deployment
  • Remove the NLTK processing
  • Refactor the codebase
  • Migrate from local_db to cloud Phase-1
  • Optimize the user interface

0.7

  • Replace content with meta data (perhaps?)
  • Extract background worker sites from env
  • AI support Beta
  • Template optimizations

0.6

  • Extract timestamp and sort accordingly
  • Remote crawler service (use background workers)
  • Analyze the extractable metadata
  • Error Logger to supabase for analytics

0.5 and earlier

  • Don't download every time the query engine is started
  • Crawler doesn't follow the schema of remote_db
  • Tracking variables on the server
  • UI Re-org
  • Title TF-IDF
  • Join contents with .join(" ")
  • Optimize parser to extract data effectively
  • Add tests

Track uptime here: Uptime