
Training the Model

Michael Haag edited this page Jul 23, 2024 · 1 revision

ShellSweepX: Train The Model

Overview

ShellSweepX includes a machine-learning system for detecting web shells in web server directories. It combines TF-IDF (Term Frequency-Inverse Document Frequency) vectorization with a logistic regression classifier: file contents are converted into numerical features, and the classifier labels each file as a likely web shell or as benign.

How It Works

  1. Data Collection:

    • The system scans specified directories for both web shell samples and benign files.
    • File contents are extracted and stored for processing.
  2. Text Preprocessing:

    • TF-IDF vectorization is applied to convert the text content of files into numerical features.
    • Common words (the custom stop-word list noted under Model Details) are filtered out so that tokens shared by nearly all web files do not add noise to the features.
  3. Model Training:

    • A logistic regression classifier is trained on the vectorized data.
    • The dataset is split into training and testing sets for validation.
  4. Model Persistence:

    • The trained model and TF-IDF vectorizer are saved for later use in prediction tasks.
  5. Prediction:

    • New files can be analyzed using the trained model to determine if they are likely to be web shells.
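The five steps above can be sketched end to end as follows. This is a minimal illustration, not the project's actual script: the in-memory sample strings, the artifact file names (model.joblib, vectorizer.joblib), the 75/25 split, and the choice to fit the vectorizer on the training split only are all assumptions made for the example.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# 1. Data collection: hypothetical toy strings standing in for file contents.
webshell_samples = [
    "<?php eval($_POST['cmd']); ?>",
    "<?php system($_GET['c']); ?>",
    "<?php passthru(base64_decode($_REQUEST['x'])); ?>",
    "<?php echo shell_exec($_GET['cmd']); ?>",
]
benign_samples = [
    "<?php echo 'Hello, world'; ?>",
    "<?php include 'header.php'; echo $title; ?>",
    "<html><body><h1>Welcome</h1></body></html>",
    "<?php function add($a, $b) { return $a + $b; } ?>",
]
texts = webshell_samples + benign_samples
labels = [1] * len(webshell_samples) + [0] * len(benign_samples)

# 3a. Split before vectorizing so the test set stays unseen (an assumption).
train_texts, test_texts, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=42
)

# 2. Text preprocessing: TF-IDF features, capped at 5000 as on this page.
vectorizer = TfidfVectorizer(max_features=5000)
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

# 3b. Model training and a simple held-out accuracy check.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)

# 4. Model persistence: hypothetical file names in a temporary directory.
out_dir = tempfile.mkdtemp()
model_path = os.path.join(out_dir, "model.joblib")
vectorizer_path = os.path.join(out_dir, "vectorizer.joblib")
joblib.dump(model, model_path)
joblib.dump(vectorizer, vectorizer_path)

# 5. Prediction: reload both artifacts and score new content.
loaded_model = joblib.load(model_path)
loaded_vectorizer = joblib.load(vectorizer_path)
new_content = ["<?php eval($_POST['x']); ?>"]
prediction = loaded_model.predict(loaded_vectorizer.transform(new_content))
print("test accuracy:", test_accuracy)
print("prediction (1 = web shell, 0 = benign):", int(prediction[0]))
```

The vectorizer has to be saved alongside the model because prediction must reuse the exact vocabulary learned during training. With a corpus this small the accuracy figure is not meaningful; it only shows where the validation step fits.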

Model Details

  • Algorithm: Logistic Regression
  • Feature Extraction: TF-IDF Vectorization
  • Max Features: 5000
  • Stop Words: Custom list of common words in web development
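As a rough sketch, these settings map onto scikit-learn's TfidfVectorizer as shown below. The stop-word list here is hypothetical, since the script's actual custom list is not reproduced on this page, and the two sample documents are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stop words; the real custom list is not shown on this page.
custom_stop_words = ["html", "body", "div", "echo", "function", "return"]

# max_features=5000 keeps at most the 5000 most frequent terms.
vectorizer = TfidfVectorizer(max_features=5000, stop_words=custom_stop_words)

# Made-up sample documents standing in for file contents.
documents = [
    "<html><body><div>Welcome</div></body></html>",
    "<?php echo shell_exec($_GET['cmd']); ?>",
]
features = vectorizer.fit_transform(documents)

# The learned vocabulary excludes every stop word.
vocabulary = set(vectorizer.vocabulary_)
print(sorted(vocabulary))
print(features.shape)
```

Stop words should be lowercase single tokens, because the vectorizer lowercases and tokenizes text before the list is applied.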

Usage Guide

Setting Up the Environment

  1. Ensure you have Python 3.x installed.
  2. Install required libraries:
    pip install scikit-learn joblib
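
To confirm the install worked, one option is to import both packages and print their versions. This is only a sanity check and is not part of the training script.

```python
# Sanity check: both packages import and report a version string.
import joblib
import sklearn

versions = {"scikit-learn": sklearn.__version__, "joblib": joblib.__version__}
for name, version in versions.items():
    print(f"{name} {version}")
```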
    

Preparing the Data

  1. Create two directories:

    • webshells/: Place known web shell samples here.
    • benign/: Place benign web files (e.g., ordinary PHP, ASP, or JSP files) here.
  2. Update the webshell_directories list in the script:

    webshell_directories = [
        '/path/to/your/webshells/',
        # Add more directories if needed
    ]
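
For orientation, here is one way a script might turn these directories into labeled training data. It is a sketch under assumptions: benign_directories, read_files, and build_dataset are hypothetical names that do not appear on this page, and the real script may read files differently.

```python
import os

webshell_directories = ["/path/to/your/webshells/"]
# Hypothetical counterpart list for benign files; the name is an assumption.
benign_directories = ["/path/to/your/benign/"]


def read_files(directories):
    """Return the text of every readable file under the given directories."""
    contents = []
    for directory in directories:
        # os.walk yields nothing for a missing directory, so no error is raised.
        for root, _dirs, files in os.walk(directory):
            for name in sorted(files):
                path = os.path.join(root, name)
                try:
                    # errors="ignore" drops bytes that are not valid UTF-8.
                    with open(path, "r", encoding="utf-8", errors="ignore") as handle:
                        contents.append(handle.read())
                except OSError:
                    # Skip unreadable files instead of aborting the run.
                    continue
    return contents


def build_dataset(webshell_dirs, benign_dirs):
    """Label web shell files 1 and benign files 0."""
    webshells = read_files(webshell_dirs)
    benign = read_files(benign_dirs)
    texts = webshells + benign
    labels = [1] * len(webshells) + [0] * len(benign)
    return texts, labels


texts, labels = build_dataset(webshell_directories, benign_directories)
print(f"Loaded {len(texts)} files ({sum(labels)} web shells)")
```

With the placeholder paths left unchanged, this loads zero files, which is a quick signal that the directory lists still need updating.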

Running the Script

  1. Save the script as train_model.py.
  2. Run the script:
    python train_model.py