Added PDF malware detection pipeline
DarshAgrawal14 committed Oct 14, 2024
1 parent 795189f commit 4da5e2c
Showing 6 changed files with 11,424 additions and 0 deletions.
10,027 changes: 10,027 additions & 0 deletions Cybersecurity_Tools/PDF_Malware_Detection/Dataset/PDFMalware2022.csv

Large diffs are not rendered by default.

62 changes: 62 additions & 0 deletions Cybersecurity_Tools/PDF_Malware_Detection/README.md
@@ -0,0 +1,62 @@
# PDF Malware Detection

This project implements a machine learning-based system for detecting potential malware in PDF files. It includes feature extraction from PDF files, model training, and a prediction script for classifying PDFs as potentially malicious or benign.

## Components

1. **Feature Extraction** (`pdf_feature_extraction.py`)
   - Extracts various features from PDF files using PyMuPDF and pdfid.
   - Features include metadata, structural elements, and presence of potentially risky elements.

2. **Model Training** (`pdf_malware_dataset_training.py`)
   - Prepares the dataset, handles data cleaning and preprocessing.
   - Trains a Random Forest classifier for malware detection.
   - Includes code for hyperparameter tuning (commented out).

3. **Prediction Script** (`predict_malware.py`)
   - Uses the trained model to predict whether a given PDF file is potentially malicious.
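To make "presence of potentially risky elements" concrete, here is a minimal, standard-library-only sketch of keyword counting in the spirit of what pdfid reports. It is not the code in `pdf_feature_extraction.py`: the `RISKY_KEYWORDS` list and the `count_risky_keywords` helper are illustrative assumptions, and the sketch ignores obfuscated names and compressed object streams, which a real extractor has to handle.

```python
import re

# Name tokens that PDF triage tools commonly treat as risk indicators.
# This list is an illustrative assumption, not the project's feature set.
RISKY_KEYWORDS = (
    "/JS",
    "/JavaScript",
    "/OpenAction",
    "/AA",
    "/Launch",
    "/EmbeddedFile",
)


def count_risky_keywords(pdf_bytes: bytes) -> dict:
    """Count how often each keyword appears as a whole PDF name token."""
    counts = {}
    for keyword in RISKY_KEYWORDS:
        # Require that the next byte is not a letter or digit, so that
        # "/EmbeddedFile" is not also counted inside "/EmbeddedFiles".
        pattern = re.escape(keyword.encode("ascii")) + rb"(?![A-Za-z0-9])"
        counts[keyword] = len(re.findall(pattern, pdf_bytes))
    return counts


# A tiny hand-written fragment standing in for a real file's raw bytes.
sample = (
    b"%PDF-1.4\n"
    b"1 0 obj\n<< /OpenAction 2 0 R /AA << /O 3 0 R >> >>\nendobj\n"
    b"2 0 obj\n<< /S /JavaScript /JS (app.alert(1)) >>\nendobj\n"
)
features = count_risky_keywords(sample)
print(features)
```

For a real file, the bytes would come from `open(path, "rb").read()`. Each count can then serve as one numeric column in the feature vector passed to the classifier.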

## Setup

1. Install required dependencies:
```
pip install numpy pandas matplotlib scikit-learn imbalanced-learn PyMuPDF pdfid joblib
```

2. Ensure you have the dataset file `PDFMalware2022.csv` in the `Dataset` folder.

## Usage

### Training the Model

1. Run the `pdf_malware_dataset_training.py` script to train the model:
```
python pdf_malware_dataset_training.py
```
This will create a `random_forest_model.pkl` file containing the trained model.
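As a rough illustration of the train-and-save step, the sketch below fits a Random Forest with scikit-learn and round-trips it through joblib. It is not the contents of `pdf_malware_dataset_training.py`: the synthetic feature matrix stands in for the cleaned `PDFMalware2022.csv` columns, the hyperparameters are placeholder assumptions, and the model is written to a temporary directory rather than the working directory.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the preprocessed dataset (assumption): six numeric
# features and a binary label, 0 = benign and 1 = malicious.
rng = np.random.default_rng(42)
X = rng.normal(size=(400, 6))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# Hold out a stratified test split so both classes appear in each part.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Placeholder hyperparameters; the project's tuned values are not shown here.
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print(f"Held-out accuracy: {test_accuracy:.3f}")

# Persist the fitted model and load it back, as a prediction script would.
model_path = Path(tempfile.mkdtemp()) / "random_forest_model.pkl"
joblib.dump(model, model_path)
reloaded = joblib.load(model_path)
```

Because joblib files are pickle-based, a `.pkl` model should only be loaded from a source you trust.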

### Predicting Malware

1. Use the `predict_malware` function in `predict_malware.py` to classify a PDF file:
```python
from predict_malware import predict_malware

result = predict_malware("path/to/your/pdf_file.pdf")
print("Prediction (0: Benign, 1: Malicious):", result)
```

2. Alternatively, run the script directly:
```
python predict_malware.py path/to/your/pdf_file.pdf
```

## Note

This project is for educational and research purposes only. It should not be used as a sole means of determining file safety. Always use caution when dealing with potentially malicious files and consult with cybersecurity professionals for comprehensive security measures.

## Future Improvements

- Implement more advanced feature extraction techniques.
- Explore other machine learning algorithms for potentially better performance.
- Add a user-friendly interface for easier interaction with the prediction system.
- Incorporate regular model updates with new malware samples to keep the detection current.