Training the Model
Michael Haag edited this page Jul 23, 2024
ShellSweepX includes a machine learning component that detects web shells in web server directories. It combines TF-IDF (Term Frequency-Inverse Document Frequency) vectorization with a logistic regression classifier: file contents are converted into numerical features, and each file is then classified as a likely web shell or as benign.
Data Collection:
- The system scans specified directories for both web shell samples and benign files.
- File contents are extracted and stored for processing.
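The collection step above can be sketched roughly as follows. This is an illustration, not the actual ShellSweepX code: the function name `collect_samples`, the label values (1 for web shell, 0 for benign), and the choice to skip unreadable files are all assumptions.

```python
import os
import tempfile


def collect_samples(directories, label):
    """Return (texts, labels) for every readable file under the given directories."""
    texts, labels = [], []
    for directory in directories:
        for root, _dirs, files in os.walk(directory):
            for name in sorted(files):
                path = os.path.join(root, name)
                try:
                    # errors="ignore" drops undecodable bytes instead of raising
                    with open(path, "r", encoding="utf-8", errors="ignore") as fh:
                        content = fh.read()
                except OSError:
                    continue  # unreadable file: skip it
                texts.append(content)
                labels.append(label)
    return texts, labels


# Demo on throwaway directories so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    webshell_dir = os.path.join(tmp, "webshells")
    benign_dir = os.path.join(tmp, "benign")
    os.makedirs(webshell_dir)
    os.makedirs(benign_dir)
    with open(os.path.join(webshell_dir, "shell.php"), "w", encoding="utf-8") as fh:
        fh.write("<?php eval($_POST['cmd']); ?>")
    with open(os.path.join(benign_dir, "index.php"), "w", encoding="utf-8") as fh:
        fh.write("<?php echo 'Hello'; ?>")

    shell_texts, shell_labels = collect_samples([webshell_dir], label=1)
    benign_texts, benign_labels = collect_samples([benign_dir], label=0)

texts = shell_texts + benign_texts
labels = shell_labels + benign_labels
print(len(texts), labels)
```

Reading with `errors="ignore"` is one way to cope with binary or oddly encoded files; a real scanner might prefer to log such files rather than silently skip bytes.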
Text Preprocessing:
- TF-IDF vectorization converts the text content of each file into numerical features.
- A custom stop-word list of common web development terms is filtered out to reduce noise, so that tokens appearing in almost every file do not dominate the features.
Model Training:
- A logistic regression classifier is trained on the vectorized data.
- The dataset is split into training and testing sets for validation.
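A minimal training sketch under stated assumptions: the toy samples, the 75/25 split, the stratification, and the choice to fit the vectorizer on the training split only are illustrative decisions, not details confirmed by this page.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Toy corpus: 1 = web shell, 0 = benign (labels are an assumption).
texts = [
    "<?php eval($_POST['cmd']); ?>",
    "<?php system($_GET['cmd']); ?>",
    "<?php eval(base64_decode($_REQUEST['x'])); ?>",
    "<?php passthru($_POST['exec']); ?>",
    "<?php echo 'Hello, world'; ?>",
    "<html><body><h1>Welcome</h1></body></html>",
    "<?php include 'header.php'; echo $title; ?>",
    "<p>Contact us at the address below.</p>",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Hold out 25% of the samples, keeping both classes in each split.
train_texts, test_texts, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=42
)

# Fit TF-IDF on the training split only, then reuse it for the test split.
vectorizer = TfidfVectorizer(max_features=5000)
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print("held-out accuracy:", accuracy_score(y_test, y_pred))
```

With only eight samples the accuracy figure is meaningless; the point is the order of operations. Fitting the vectorizer before splitting would let test-set vocabulary leak into training.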
Model Persistence:
- The trained model and TF-IDF vectorizer are saved for later use in prediction tasks.
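Persistence could look like the sketch below, which uses `joblib.dump` and `joblib.load`. The file names are hypothetical, and a temporary directory is used only so the example cleans up after itself.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

texts = [
    "<?php eval($_POST['cmd']); ?>",
    "<?php system($_GET['cmd']); ?>",
    "<?php echo 'Hello, world'; ?>",
    "<html><body>Welcome</body></html>",
]
labels = [1, 1, 0, 0]

vectorizer = TfidfVectorizer(max_features=5000)
model = LogisticRegression(max_iter=1000)
model.fit(vectorizer.fit_transform(texts), labels)

sample = ["<?php system($_GET['cmd']); ?>"]

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical file names; save both objects, since the model is
    # useless without the vectorizer that produced its feature columns.
    model_path = os.path.join(tmp, "webshell_model.joblib")
    vectorizer_path = os.path.join(tmp, "tfidf_vectorizer.joblib")
    joblib.dump(model, model_path)
    joblib.dump(vectorizer, vectorizer_path)

    loaded_model = joblib.load(model_path)
    loaded_vectorizer = joblib.load(vectorizer_path)

original = model.predict(vectorizer.transform(sample))
restored = loaded_model.predict(loaded_vectorizer.transform(sample))
print(original, restored)
```

Only load joblib files from sources you trust, because they are pickle-based and can execute code when loaded.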
Prediction:
- New files can be analyzed using the trained model to determine if they are likely to be web shells.
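Scoring a new file might be sketched like this. The helper name `webshell_probability` and the label convention (1 = web shell) are assumptions, and a tiny model is trained inline only to keep the example self-contained; in practice the saved model and vectorizer would be loaded instead.

```python
import os
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression


def webshell_probability(path, model, vectorizer):
    """Return the model's estimated probability that the file is a web shell."""
    with open(path, "r", encoding="utf-8", errors="ignore") as fh:
        features = vectorizer.transform([fh.read()])
    # predict_proba columns follow model.classes_; pick the column for label 1.
    shell_column = list(model.classes_).index(1)
    return float(model.predict_proba(features)[0][shell_column])


# Stand-in for a previously trained and saved model.
train_texts = [
    "<?php eval($_POST['cmd']); ?>",
    "<?php system($_GET['cmd']); ?>",
    "<?php echo 'Hello, world'; ?>",
    "<html><body>Welcome</body></html>",
]
train_labels = [1, 1, 0, 0]
vectorizer = TfidfVectorizer(max_features=5000)
model = LogisticRegression(max_iter=1000)
model.fit(vectorizer.fit_transform(train_texts), train_labels)

with tempfile.TemporaryDirectory() as tmp:
    suspicious = os.path.join(tmp, "upload.php")
    harmless = os.path.join(tmp, "index.html")
    with open(suspicious, "w", encoding="utf-8") as fh:
        fh.write("<?php eval($_POST['cmd']); ?>")
    with open(harmless, "w", encoding="utf-8") as fh:
        fh.write("<html><body>Welcome</body></html>")

    shell_score = webshell_probability(suspicious, model, vectorizer)
    benign_score = webshell_probability(harmless, model, vectorizer)

print(f"upload.php: {shell_score:.2f}  index.html: {benign_score:.2f}")
```

Returning a probability rather than a hard label lets the caller choose its own alert threshold.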
Model Details:
- Algorithm: Logistic Regression
- Feature Extraction: TF-IDF Vectorization
- Max Features: 5000
- Stop Words: Custom list of common words in web development
Training Steps:
- Ensure you have Python 3.x installed.
- Install required libraries:
    pip install scikit-learn joblib
- Create two directories:
  - webshells/: Place known webshell samples here.
  - benign/: Place benign web files (e.g., normal PHP, ASP, JSP files) here.
- Update the webshell_directories list in the script:
    webshell_directories = [
        '/path/to/your/webshells/',
        # Add more directories if needed
    ]
- Save the script as train_model.py.
- Run the script:
    python train_model.py
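Before running the training script, a quick pre-flight check can catch a missing library or an empty sample directory. The sketch below is a hypothetical helper, not part of ShellSweepX; it assumes the `webshells/` and `benign/` directories sit in the current working directory.

```python
import importlib.util
import os


def missing_packages(names=("sklearn", "joblib")):
    """Return the import names that cannot be found in this environment."""
    # Note: the pip package "scikit-learn" is imported as "sklearn".
    return [name for name in names if importlib.util.find_spec(name) is None]


def count_files(directory):
    """Count files under a directory, or return 0 if it does not exist."""
    if not os.path.isdir(directory):
        return 0
    return sum(len(files) for _root, _dirs, files in os.walk(directory))


missing = missing_packages()
if missing:
    print("Missing packages:", ", ".join(missing))

for directory in ("webshells", "benign"):
    total = count_files(directory)
    status = "ok" if total > 0 else "empty or missing"
    print(f"{directory}/: {total} file(s) ({status})")
```

Both classes need at least a few samples each; a classifier trained on only one class cannot be fitted.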