PhishSense extracts features from URLs labeled as legitimate or phishing and builds a structured dataset for training machine learning models. It consists of Python scripts that read a CSV file of URLs and their types, then extract features from each URL using web scraping. The extracted features are saved to a CSV file that can be used to train phishing-detection models.
Make sure you have Python installed on your system. Additionally, install the required Python packages using the following command:
pip install -r requirements.txt
- Clone the Repository:
  git clone https://github.com/yourusername/PhishSense.git
  cd PhishSense
- Prepare the CSV File:
  Create a CSV file (your_csv_file.csv) with columns 'url' and 'type', where 'type' indicates whether the URL is legitimate (0) or phishing (1):
  url,type
  http://example.com,0
  http://phishing.com,1
- Run the Main Script:
  Execute the main script (main.py) with the input CSV file and the desired output file for the extracted features. Optionally, specify the start and end lines (inclusive) to read from the input CSV file:
  python main.py --input your_csv_file.csv --output extracted_features.csv --start-line 1 --end-line 100
  This reads the CSV, extracts features from each URL, and saves the results to a new CSV file (extracted_features.csv).
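The inclusive line range can be pictured with a small sketch. This is not the code in main.py; it only illustrates one plausible reading of --start-line and --end-line. The function name `read_url_rows` is hypothetical, and the sketch assumes that line numbers are 1-based and count data rows only (the header is excluded).

```python
import csv
import io
from itertools import islice


def read_url_rows(csv_file, start_line=1, end_line=None):
    """Return (url, type) pairs for data rows start_line..end_line, inclusive.

    Assumption: line numbers are 1-based and count data rows only,
    so the 'url,type' header is not line 1.
    """
    if start_line < 1:
        raise ValueError("start_line must be >= 1")
    reader = csv.DictReader(csv_file)
    # islice uses a 0-based start and an exclusive stop, so the inclusive
    # 1-based range [start_line, end_line] maps to [start_line - 1, end_line).
    selected = islice(reader, start_line - 1, end_line)
    return [(row["url"], int(row["type"])) for row in selected]


# In-memory stand-in for your_csv_file.csv
sample = io.StringIO(
    "url,type\n"
    "http://example.com,0\n"
    "http://phishing.com,1\n"
    "http://example.org,0\n"
)
rows = read_url_rows(sample, start_line=2, end_line=3)
print(rows)
```

Leaving `end_line` as `None` reads through to the end of the file, which is convenient for processing a large input in resumable batches.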
The system generates a CSV file (extracted_features.csv) containing the extracted features for each URL, including the URL itself, title, number of links, and the type of the website (legitimate or phishing). This file can be used as a labeled dataset for training machine learning models.
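To make the output columns concrete, here is a minimal sketch of how a title and a link count could be derived from a page's HTML. It uses only the standard library and works on an HTML string, so it says nothing about how PhishSense actually fetches or parses pages. The names `PageFeatureParser`, `extract_features`, and the `num_links` key are hypothetical, and counting only `<a>` tags that carry an `href` attribute is an assumption about what "number of links" means.

```python
from html.parser import HTMLParser


class PageFeatureParser(HTMLParser):
    """Collects the <title> text and counts <a href=...> tags."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.num_links = 0
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a" and any(name == "href" for name, _ in attrs):
            # Assumption: only anchors with an href count as links.
            self.num_links += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_features(url, html, label):
    """Build one dataset row from a URL, its HTML, and its 0/1 label."""
    parser = PageFeatureParser()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return {
        "url": url,
        "title": parser.title.strip(),
        "num_links": parser.num_links,
        "type": label,
    }


page = (
    "<html><head><title> Example Domain </title></head><body>"
    '<a href="/a">A</a><a href="https://example.com/b">B</a>'
    '<a name="anchor">no href</a>'
    "</body></html>"
)
features = extract_features("http://example.com", page, 0)
print(features)
```

Rows shaped like this can be written out with `csv.DictWriter`, one row per URL.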
Watch this video to see PhishSense's interface and trained machine learning models in action.
PhishSense.Demo.1.1.mp4
- Ensure that the URLs in the input CSV file are accessible, as the system makes web requests to extract features.
- The machine learning model training part is not included in this system. You can use the generated extracted_features.csv file to train your own machine learning model for phishing detection.
The Jupyter notebook (Machine Learning Models.ipynb) contains code for training and evaluating machine learning models for phishing detection. It includes the following sections:
- Setup: Installation of necessary libraries and modules.
- Data Loading and Preprocessing: Loading the dataset and preprocessing steps such as standard scaling.
- Data Visualization: Visualizing the dataset using Principal Component Analysis (PCA).
- Support Vector Machine (SVM): Training, evaluation, and visualization of results for the SVM model.
- Neural Networks: Building, training, evaluation, and visualization of results for the neural network model.
- Random Forest: Training, evaluation, and visualization of results for the random forest model.
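As a rough illustration of the standard scaling step listed above, the sketch below rescales each numeric feature column to zero mean and unit variance. It is not taken from the notebook, which may well rely on a library scaler instead. The function name `standard_scale` is hypothetical, and using the population standard deviation and leaving constant columns at zero are assumptions.

```python
from statistics import fmean, pstdev


def standard_scale(rows):
    """Scale each column of a list of numeric rows to zero mean, unit variance."""
    columns = list(zip(*rows))
    means = [fmean(col) for col in columns]
    # A constant column has zero spread; dividing by 1.0 instead avoids a
    # ZeroDivisionError and leaves that column at 0.0 (an assumption).
    scales = [pstdev(col) or 1.0 for col in columns]
    return [
        [(value - mean) / scale for value, mean, scale in zip(row, means, scales)]
        for row in rows
    ]


# Two hypothetical numeric features per URL; the second one is constant.
feature_rows = [[10.0, 1.0], [20.0, 1.0], [30.0, 1.0]]
scaled = standard_scale(feature_rows)
print(scaled)
```

Scaling matters most for the SVM and neural network sections, since both are sensitive to feature magnitudes; tree-based models such as random forests are largely unaffected by it.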