Training the Model

ShellSweepX: Train The Model

Overview

ShellSweepX includes a machine learning component that detects web shells in web server directories. It converts file content into numerical features with TF-IDF (Term Frequency-Inverse Document Frequency) vectorization, then uses a logistic regression classifier to label each file as a likely web shell or as benign.

How It Works

Data Collection:
- The system scans specified directories for both web shell samples and benign files.
- File contents are extracted and stored for processing.
Text Preprocessing:
- TF-IDF vectorization is applied to convert the text content of files into numerical features.
- Common words (stop words) are filtered out so that tokens appearing in nearly every web file do not add noise to the features.
Model Training:
- A logistic regression classifier is trained on the vectorized data.
- The dataset is split into training and testing sets for validation.
Model Persistence:
- The trained model and TF-IDF vectorizer are saved for later use in prediction tasks.
Prediction:
- New files can be analyzed using the trained model to determine if they are likely to be web shells.
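
The steps above can be sketched end to end as follows. This is a minimal illustration and not the ShellSweepX training script itself: the in-memory sample strings, the 75/25 split, `max_iter=1000`, and the file names `model.joblib` and `vectorizer.joblib` are all assumptions made for the example. The vectorizer is fitted on the training portion only, which is one reasonable ordering of the preprocessing and split steps.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Toy samples standing in for file contents read from disk (illustrative only).
# Label 1 = web shell, label 0 = benign.
webshell_texts = [
    "<?php eval($_POST['cmd']); ?>",
    "<?php system($_GET['c']); ?>",
    "<?php echo shell_exec($_REQUEST['x']); ?>",
    "<?php passthru(base64_decode($_POST['p'])); ?>",
]
benign_texts = [
    "<?php echo 'Hello, world'; ?>",
    "<?php include 'header.php'; echo $title; ?>",
    "<html><body><h1>Welcome</h1></body></html>",
    "<?php $total = $price * $quantity; echo $total; ?>",
]
texts = webshell_texts + benign_texts
labels = [1] * len(webshell_texts) + [0] * len(benign_texts)

# Split into training and testing sets (the ratio is an assumption).
train_texts, test_texts, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=42
)

# Text preprocessing: TF-IDF turns each file's text into a numeric vector.
vectorizer = TfidfVectorizer(max_features=5000)
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

# Model training: logistic regression on the vectorized data.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)

# Model persistence: save both the classifier and the fitted vectorizer.
out_dir = tempfile.mkdtemp()
model_path = os.path.join(out_dir, "model.joblib")            # hypothetical name
vectorizer_path = os.path.join(out_dir, "vectorizer.joblib")  # hypothetical name
joblib.dump(model, model_path)
joblib.dump(vectorizer, vectorizer_path)

# Prediction: reload the saved artifacts and score unseen content.
loaded_model = joblib.load(model_path)
loaded_vectorizer = joblib.load(vectorizer_path)
new_features = loaded_vectorizer.transform(["<?php system($_POST['cmd']); ?>"])
prediction = loaded_model.predict(new_features)[0]
probability = loaded_model.predict_proba(new_features)[0][1]
```

Saving the vectorizer alongside the model matters because new files must be transformed with the same vocabulary and IDF weights that the classifier was trained on.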

Model Details

Algorithm: Logistic Regression
Feature Extraction: TF-IDF Vectorization
Max Features: 5000
Stop Words: Custom list of common words in web development
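
As a rough illustration of how these settings map onto scikit-learn, the sketch below configures a `TfidfVectorizer` with a 5000-feature cap and a custom stop-word list. The contents of `CUSTOM_STOP_WORDS` are hypothetical; the actual list used by ShellSweepX is not reproduced here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stop words: placeholders for the project's real list.
CUSTOM_STOP_WORDS = ["php", "html", "body", "div", "echo", "function", "return"]

vectorizer = TfidfVectorizer(max_features=5000, stop_words=CUSTOM_STOP_WORDS)

docs = [
    "<?php echo 'hello'; ?>",
    "<?php eval($_POST['cmd']); ?>",
]
features = vectorizer.fit_transform(docs)

# Tokens that survived stop-word filtering, in sorted order.
vocabulary = sorted(vectorizer.vocabulary_)
```

`max_features` keeps only the most frequent terms across the corpus, so the feature matrix never has more than 5000 columns regardless of how many distinct tokens the files contain.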

Usage Guide

Setting Up the Environment

Ensure you have Python 3.x installed.
Install required libraries:
```
pip install scikit-learn joblib
```

Preparing the Data

Create two directories:
- webshells/: Place known web shell samples here.
- benign/: Place benign web files (e.g., normal PHP, ASP, JSP files) here.

Update the webshell_directories list in the script:

webshell_directories = [
    '/path/to/your/webshells/',
    # Add more directories if needed
]
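
A list like this might be consumed by a helper that walks each directory and reads every file into memory. The function below is a sketch under that assumption; `read_files` is a hypothetical name, not necessarily what the script uses, and the demo directory is created only so the example runs on its own.

```python
import os
import tempfile


def read_files(directories):
    """Return the text content of every readable file under the given directories."""
    contents = []
    for directory in directories:
        for root, _dirs, files in os.walk(directory):
            for name in files:
                path = os.path.join(root, name)
                try:
                    # errors="ignore" drops bytes that are not valid UTF-8,
                    # which can occur in obfuscated samples.
                    with open(path, "r", encoding="utf-8", errors="ignore") as handle:
                        contents.append(handle.read())
                except OSError:
                    # Unreadable file (permissions, broken link): skip it.
                    continue
    return contents


# Demo: a temporary directory with one sample file stands in for a real path.
demo_dir = tempfile.mkdtemp()
with open(os.path.join(demo_dir, "sample.php"), "w", encoding="utf-8") as handle:
    handle.write("<?php echo 'hi'; ?>")

samples = read_files([demo_dir])
```

Because `os.walk` recurses, samples organized in subdirectories are picked up without listing each subdirectory separately.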

Running the Script

Save the script as train_model.py.
Run the script:
```
python train_model.py
```
