Phantom Search Engine is a robust, scalable, and efficient web search engine that delivers fast and relevant search results. It is built with a focus on performance, scalability, and accuracy, and is designed to handle large amounts of data while responding quickly to user queries.
The Phantom Search Engine consists of several main components:
- Crawler System: The Crawler System is responsible for crawling the web and fetching the content of web pages. It includes a multithreaded crawler for concurrent crawling and a distributed crawler system for large-scale crawling.
- Phantom Indexer: The Phantom Indexer processes the fetched data to create an index for faster search and retrieval. It uses the TF-IDF (Term Frequency-Inverse Document Frequency) algorithm to measure the importance of a term in a document in a corpus.
- Phantom Query Engine: The Phantom Query Engine is a crucial component that takes a user's search query and returns the most relevant documents from the database. It uses the TF-IDF algorithm to rank the documents based on their relevance to the query.
These components are designed to work together seamlessly to provide a comprehensive search engine solution. The following sections provide a detailed overview of each component and how they interact with each other.
This documentation is intended to provide a comprehensive understanding of the Phantom Search Engine's architecture, functionality, and usage. It is designed to be a valuable resource for developers, users, and anyone interested in understanding the inner workings of a web search engine.
There are two ways to crawl websites and save the indexes:
- Multithreaded approach
- Distributed Crawler system
The multithreaded crawler is implemented in the `Phantom` class in the `phantom/phantom.py` file. It uses multiple threads to crawl websites concurrently, which significantly speeds up the crawling process.
Here's a brief overview of how it works:
- The `Phantom` class is initialized with a list of URLs to crawl, the number of threads to use, and other optional parameters like whether to show logs, print logs, and a burnout time after which the crawler stops.
- The `run` method starts the crawling process. It generates the specified number of threads and starts them. Each thread runs the `crawler` method with a unique ID and a randomly chosen URL from the provided list.
- The `crawler` method is the heart of the crawler. It starts with a queue containing the initial URL and continuously pops URLs from the queue, fetches their content, and adds their neighbors (links on the page) to the queue. It also keeps track of visited URLs to avoid revisiting them. The content of each visited URL is stored in a `Storage` object (see the sketch after this list).
- The `Parser` class is used to fetch and parse the content of a URL. It uses the BeautifulSoup library to parse the HTML content, extract the text and the links, and clean the URLs.
- The `Storage` class is used to store the crawled data. It stores the data in a dictionary and can save it to a JSON file.
- The `stop` method can be used to stop the crawling process. It sets a `kill` flag that causes the `crawler` methods to stop, waits for all threads to finish, and then saves the crawled data and prints some statistics.
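Conceptually, the crawl loop is a breadth-first traversal of the link graph. The following is a minimal, self-contained sketch of that idea; it is not the actual `crawler` method, and the fetching and parsing done by `Parser` are simplified here to plain `requests` and BeautifulSoup calls.

```python
# Minimal sketch of the queue-based crawl loop described above.
# Not the actual `crawler` method: fetching/parsing are simplified.
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup


def crawl(start_url, max_pages=50):
    queue = deque([start_url])   # URLs waiting to be visited
    visited = set()              # URLs already crawled
    storage = {}                 # url -> extracted text

    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)

        try:
            html = requests.get(url, timeout=5).text
        except requests.RequestException:
            continue

        soup = BeautifulSoup(html, "html.parser")
        storage[url] = soup.get_text(" ", strip=True)

        # Add neighbors (links on the page) to the queue.
        for link in soup.find_all("a", href=True):
            neighbor = urljoin(url, link["href"])
            if neighbor.startswith("http") and neighbor not in visited:
                queue.append(neighbor)

    return storage
```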
You can start the program by running the `phantom/phantom.py` script. It uses `phantom_engine.py` to crawl the sites using multiple threads.
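Usage might look roughly like the following; the parameter names are illustrative assumptions based on the description above, not the exact signature of the `Phantom` constructor.

```python
# Hypothetical usage of the multithreaded crawler; the exact parameter
# names may differ from the actual Phantom constructor.
from phantom.phantom import Phantom

crawler = Phantom(
    urls=["https://example.com", "https://en.wikipedia.org"],
    num_threads=4,       # number of concurrent crawler threads
    show_logs=True,      # optional logging flag
    burnout=300,         # stop crawling after 300 seconds
)

crawler.run()   # spawn the threads and start crawling
crawler.stop()  # set the kill flag, join threads, save data, print stats
```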
The distributed crawler system uses a master-slave architecture to coordinate multiple crawlers. The master node is implemented in the `phantom_master.py` file, and the slave nodes are implemented in the `phantom_child.py` file. They communicate using sockets.
The `phantom_master.py` file contains the `Server` class, which is the master node in the distributed crawler system. It manages the slave nodes (crawlers) and assigns them websites to crawl.
Here's a brief overview:
- The `Server` class is initialized with the host and port to listen on, the number of clients (crawlers) to accept, and a burnout time after which the crawlers stop.
- The `run` method starts the server. It creates a socket, binds it to the specified host and port, and starts listening for connections. It accepts connections from the crawlers, starts a new thread to handle each crawler, and adds the crawler to its list of clients.
- The `handle_client` method is used to handle a crawler. It continuously receives requests from the crawler and processes them. If a crawler sends a "close" request, it removes the crawler from its list of clients. If a crawler sends a "status" request, it updates its status.
- The `status` method is used to print the status of the server and the crawlers. It prints the list of crawlers and their statuses.
- The `send_message` method is used to send a message to a specific crawler. If an error occurs while sending the message, it removes the crawler from its list of clients.
- The `assign_sites` method is used to assign websites to the crawlers. It either assigns each website to a different crawler or assigns all websites to all crawlers, depending on the `remove_exist` parameter.
- The `generate` method is used to generate the websites to crawl. It asks the user to enter the websites, assigns them to the crawlers, and starts the crawlers.
- The `start` method is used to start the server. It starts the server in a new thread and then enters a command loop where it waits for user commands. The user can enter commands to get the status of the server, broadcast a message to all crawlers, send a message to a specific crawler, stop the server, generate websites, assign websites to crawlers, and merge the crawled data.
- The `merge` method is used to merge the data crawled by the crawlers. It merges the index and title data from all crawlers into a single index file and a single title file and deletes the old files.
- The `stop` method is used to stop the server. It sends a "stop" message to all crawlers, stops the server thread, and closes the server socket.
You can start the server by running the `phantom_master.py` script. It will start listening for connections from crawlers, and you can then enter commands to control the crawlers.
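Programmatically, starting the master node might look roughly like this; the parameter names are assumptions based on the description above rather than the exact `Server` signature.

```python
# Hypothetical example of running the master node; parameter names are
# assumptions based on the description above.
from phantom_master import Server

server = Server(
    host="0.0.0.0",   # address to listen on
    port=5000,        # port to listen on
    num_clients=3,    # number of crawler (slave) connections to accept
    burnout=600,      # seconds after which the crawlers stop
)

# start() launches the listening socket in a background thread and then
# enters the interactive command loop (status, broadcast, generate, merge, ...).
server.start()
```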
The `phantom_child.py` file contains the `Crawler` and `Storage` classes, which implement the slave nodes in the distributed crawler system.
Here's a brief overview:
- The `Crawler` class is initialized with the host and port of the server. It creates a socket and connects to the server. It also initializes several other attributes, such as its ID, a list of threads, and several flags.
- The `connect` method is used to connect the crawler to the server. It starts a new thread to listen to the server and enters a command loop where it waits for user commands. The user can enter commands to stop the crawler, send a message to the server, get the status of the crawler, toggle the running state of the crawler, and store the crawled data.
- The `listen_to_server` method is used to listen to the server. It continuously receives messages from the server and processes them. If the server sends a "stop" message, it stops the crawler. If the server sends a "setup" message, it sets up the crawler. If the server sends a "status" message, it prints the status of the crawler. If the server sends an "append" message, it adds URLs to the queue. If the server sends a "restart" message, it reinitializes the crawler. If the server sends a "crawl" message, it starts crawling.
- The `setup` method is used to set up the crawler. It sets the URL to crawl and the burnout time.
- The `add_queue` method is used to add URLs to the queue.
- The `initialize` method is used to initialize the crawler. It initializes several attributes, such as the list of local URLs, the queue, the start time, and the parser.
- The `crawl` method is used to start crawling. It continuously pops URLs from the queue, parses them, and adds the parsed data to the storage. It also adds the neighbors of the current URL to the queue. If the burnout time is reached, it stops crawling.
- The `store` method is used to store the crawled data. It saves the index and title data to the storage.
- The `stop` method is used to stop the crawler. It sets the kill flag, clears the traversed URLs, joins all threads, sends a "close" message to the server, and closes the client socket.
- The `send` method is used to send a message to the server.
- The `status` method is used to print the status of the crawler.
- The `Storage` class is used to store the crawled data. It is initialized with a filename and a dictionary to store the data. The `add` method is used to add data to the storage, and the `save` method is used to save the data to a file.
You can start a crawler by creating an instance of the `Crawler` class and calling the `connect` method. The crawler will connect to the server and start listening for commands.
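A minimal sketch of starting a slave node, assuming the `Crawler` constructor takes the server's host and port as described above:

```python
# Hypothetical example of starting a slave node; argument names are
# assumptions based on the description above.
from phantom_child import Crawler

crawler = Crawler(host="127.0.0.1", port=5000)  # address of the master node

# connect() opens the socket to the server, starts the listener thread,
# and enters the crawler's own command loop.
crawler.connect()
```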
The `PhantomIndexer` class in the provided code is an implementation of an indexer. An indexer is a program that processes data (in this case, text documents) to create an index for faster search and retrieval. The index created by the `PhantomIndexer` is based on the TF-IDF (Term Frequency-Inverse Document Frequency) algorithm, a common algorithm used in information retrieval.
Here's a brief overview of the `PhantomIndexer` class:
- The `__init__` method initializes the indexer. It takes as input the name of the input file (`filename`) and the name of the output file (`out`). It also initializes several other attributes, such as the total number of documents (`documents`), the term frequency (`tf`), the inverse document frequency (`idf`), and the TF-IDF (`tfidf`).
- The `calculate_tf` method calculates the term frequency for each term in each document. The term frequency is the number of times a term appears in a document.
- The `calculate_idf` method calculates the inverse document frequency for each term. The inverse document frequency is a measure of how much information the term provides, i.e., if it's common or rare across all documents.
- The `calculate_tfidf` method calculates the TF-IDF for each term in each document. The TF-IDF is the product of the term frequency and the inverse document frequency. It is a measure of the importance of a term in a document in a corpus (see the sketch after this list).
- The `process` method processes the data. It tokenizes the text, removes stop words, stems the words, and calculates the TF-IDF.
- The `save` method saves the TF-IDF and IDF to a file.
- The `log` method is used to log messages.
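As a reference for the calculation described above, here is a small self-contained sketch of the TF-IDF formula. It is not the actual `PhantomIndexer` code (which also tokenizes, removes stop words, and stems the text), just the bare math on a toy corpus.

```python
# Minimal illustration of the TF-IDF computation described above.
# Not the actual PhantomIndexer code: no tokenization, stop-word
# removal, or stemming is performed here.
import math
from collections import Counter

documents = {
    "doc1": ["phantom", "search", "engine", "search"],
    "doc2": ["distributed", "crawler", "search"],
}

# Term frequency: how often a term appears in each document.
tf = {doc: Counter(words) for doc, words in documents.items()}

# Inverse document frequency: rare terms carry more information.
num_docs = len(documents)
all_terms = {term for words in documents.values() for term in words}
idf = {
    term: math.log(num_docs / sum(term in words for words in documents.values()))
    for term in all_terms
}

# TF-IDF: the product of the two measures.
tfidf = {
    doc: {term: count * idf[term] for term, count in counts.items()}
    for doc, counts in tf.items()
}

print(tfidf["doc1"]["search"])   # 2 * log(2/2) = 0.0 (term appears in every doc)
print(tfidf["doc2"]["crawler"])  # 1 * log(2/1) ≈ 0.69 (term appears in one doc)
```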
The `PhantomIndexer` class is used as follows:
- An instance of the `PhantomIndexer` class is created with the input file name and the output file name.
- The `process` method is called to process the data and calculate the TF-IDF.
- The `save` method is called to save the TF-IDF and IDF to a file.
The output of the `PhantomIndexer` is a JSON file that contains the TF-IDF and IDF for each term in each document. This file can be used for fast search and retrieval of documents.
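A typical run might look roughly like this; the import path and file names are placeholders, not the project's actual values.

```python
# Hypothetical usage; the import path and file names are placeholders
# and may differ in the actual project.
from indexer import PhantomIndexer

indexer = PhantomIndexer(filename="index", out="indexed")
indexer.process()  # tokenize, remove stop words, stem, compute TF-IDF
indexer.save()     # write the TF-IDF and IDF data to the output JSON file
```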
The `Phantom_Query` class in the provided code is an implementation of a query engine. A query engine is a crucial component of a search engine that takes a user's search query and returns the most relevant documents from the database. The `Phantom_Query` class uses the TF-IDF (Term Frequency-Inverse Document Frequency) algorithm to rank the documents based on their relevance to the query.
Here's a brief overview of the `Phantom_Query` class:
- The `__init__` method initializes the query engine. It takes as input the name of the input file (`filename`) and the name of the titles file (`titles`). It also initializes several other attributes, such as the inverse document frequency (`idf`), the TF-IDF (`tfidf`), and a lookup set of all terms in the corpus.
- The `query` method takes a user's search query and returns the most relevant documents. It first splits the query into terms and filters out the terms that are not in the lookup set. It then calculates the TF-IDF for each term in the query. Next, it calculates the score for each document by summing the product of the TF-IDF of each term in the document and the TF-IDF of the same term in the query. Finally, it ranks the documents based on their scores and returns the top `count` documents (see the sketch after this list).
- The `run` method starts the query engine. It continuously prompts the user to enter a query and prints the results of the query.
- The `log` method is used to log messages.
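The scoring step described above is essentially a dot product between the query's TF-IDF vector and each document's TF-IDF vector. Here is a small self-contained sketch of that idea with made-up numbers; it is not the actual `query` implementation.

```python
# Minimal sketch of the document scoring described above: each document's
# score is the dot product of its TF-IDF vector with the query's TF-IDF
# vector. The data below is made up for illustration.
def score_documents(query_tfidf, doc_tfidf, count=2):
    scores = {}
    for doc, terms in doc_tfidf.items():
        scores[doc] = sum(
            terms.get(term, 0.0) * weight for term, weight in query_tfidf.items()
        )
    # Rank documents by score and return the top `count`.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:count]


query_tfidf = {"phantom": 0.7, "crawler": 0.4}
doc_tfidf = {
    "doc1": {"phantom": 0.9, "search": 0.3},
    "doc2": {"crawler": 0.8, "distributed": 0.5},
}

print(score_documents(query_tfidf, doc_tfidf))
# [('doc1', 0.63), ('doc2', 0.32)] -- doc1 ranks first
```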
The `Phantom_Query` class is used as follows:
- An instance of the `Phantom_Query` class is created with the input file name and the titles file name.
- The `run` method is called to start the query engine.
The output of the `Phantom_Query` class is a list of tuples, where each tuple contains the document ID, the score, and the title of a document. The list is sorted in descending order of the scores, so the first tuple corresponds to the most relevant document.
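An interactive session might be started roughly like this; the import path and file names are placeholders.

```python
# Hypothetical usage; the import path and file names are placeholders
# and may differ in the actual project.
from query_engine import Phantom_Query

engine = Phantom_Query(filename="indexed", titles="titles")
engine.run()  # prompts for queries and prints (doc ID, score, title) tuples
```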
In this application, `supabase` is used as the database. To leverage supabase, you will have to create an account and create two tables:
- Table `index` with the following fields:
  - url (text)
  - content (json)
  - title (text)
- Table `query` with the following field:
  - query (text)

The `query` table stores the queries that are made so they can be analysed later.
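For reference, writing to these two tables with the supabase-py client looks roughly like this; the project URL, API key, and values below are placeholders.

```python
# Sketch of writing to the two tables with the supabase-py client.
# The project URL, API key, and row values are placeholders.
from supabase import create_client

supabase = create_client("https://your-project.supabase.co", "your-anon-key")

# Store a crawled page in the `index` table.
supabase.table("index").insert({
    "url": "https://example.com",
    "content": {"text": "Example page content"},
    "title": "Example Domain",
}).execute()

# Log a search query in the `query` table for later analysis.
supabase.table("query").insert({"query": "phantom search engine"}).execute()
```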