Intrusion and Vulnerability Detection in Software-Defined Networks (SDN) by Team ML-IDS
Original and modified data for the experiments can be found here: https://drive.google.com/drive/folders/1hp8FB270BEYhK2dAIZJsBHIfNV8Fhwy1?usp=drive_link
This repository contains the research work conducted by Nasik Sami Khan, Md. Shamim Towhid, and Md Mahibul Hasan from the Department of Computer Science at the University of Regina, Canada. The work addresses the challenge posed by ULAK on the ITU platform: intrusion detection in SDN environments. It focuses on developing effective intrusion detection systems (IDS) for Software-Defined Networks (SDNs) using machine learning techniques.
The transition from conventional networking architectures to SDNs has brought about significant advancements in network management. However, the centralization of control within SDNs poses a security risk, necessitating robust intrusion detection systems. This research explores the development of multiclass classifiers capable of identifying various intrusion types in SDN-enabled networks. A comprehensive dataset, including Normal flow data, DDoS flow data, Malware flow data, and web-based flow data, is provided by ULAK to facilitate research. Machine learning techniques are employed to create effective intrusion detection models, contributing to the protection of SDN-based networks against a wide range of threats.
- SDN
- IDS
- Data Imbalance
- Machine Learning
- Ensemble Techniques
The provided dataset consists of 1.78 million rows with 77 distinct columns, including a label column representing the output variable. The dataset is characterized by a class imbalance, requiring specialized techniques for fair treatment of all classes. Feature selection is performed using the Random Forest classifier, resulting in a subset of 28 significant features.
Feature selection follows an embedded method: each attribute's significance is evaluated with the Random Forest classifier, and the result is validated with Principal Component Analysis (PCA). The final subset of 28 features enhances predictive ability and interpretability.
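A minimal sketch of importance-based selection with a Random Forest, assuming scikit-learn and a synthetic stand-in for the flow data (76 feature columns, matching the 77 columns minus the label). The sample size, forest settings, and the use of impurity-based importances are assumptions for illustration, not the notebook's exact procedure.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the flow data: 76 feature columns plus a label vector.
X, y = make_classification(
    n_samples=2000, n_features=76, n_informative=20,
    n_classes=4, random_state=0,
)

# Embedded selection: fit the forest, then rank features by impurity-based importance.
forest = RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1)
forest.fit(X, y)

top_k = 28
ranked = np.argsort(forest.feature_importances_)[::-1]  # most important first
selected = np.sort(ranked[:top_k])                       # column indices to keep

X_selected = X[:, selected]
print(X_selected.shape)  # (2000, 28)
```

With a pandas DataFrame, the same indices can be mapped back to column names to keep the selected subset readable.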
Data preprocessing includes cleaning, conversion to numeric values, handling class imbalance, and scaling feature values. The dataset is split into training and validation sets to ensure consistent and well-prepared data for model training and evaluation. To handle the imbalance, we down-sampled the Benign (majority) class and generated synthetic samples for the minority classes using the SMOTE technique. We separated the data into 3 groups based on the class sample distribution.
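The balancing step can be sketched as follows. This is a simplified, SMOTE-style interpolation written with NumPy and scikit-learn's `NearestNeighbors` so that it runs without extra packages; it is not the notebook's code, and in practice the `SMOTE` class from imbalanced-learn is the usual choice. The helper names `downsample` and `smote_like`, the class sizes, and `k` are hypothetical.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def downsample(X, y, label, n_keep, rng):
    """Keep a random subset of n_keep rows for `label`; other classes are untouched."""
    idx = np.flatnonzero(y == label)
    keep = rng.choice(idx, size=n_keep, replace=False)
    mask = y != label
    mask[keep] = True
    return X[mask], y[mask]

def smote_like(X_min, n_new, k, rng):
    """Create n_new synthetic rows by interpolating between minority-class neighbours."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    # First column is the query point itself, so drop it.
    neighbours = nn.kneighbors(X_min, return_distance=False)[:, 1:]
    base = rng.integers(0, len(X_min), size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))  # interpolation factor in [0, 1)
    return X_min[base] + gap * (X_min[picked] - X_min[base])

rng = np.random.default_rng(0)
X = rng.normal(size=(1100, 5))
y = np.array([0] * 1000 + [1] * 100)  # class 0 plays the Benign majority

X, y = downsample(X, y, label=0, n_keep=300, rng=rng)
X_new = smote_like(X[y == 1], n_new=200, k=5, rng=rng)

X_bal = np.vstack([X, X_new])
y_bal = np.concatenate([y, np.ones(len(X_new), dtype=y.dtype)])
print(np.bincount(y_bal))  # [300 300]
```

Resampling of this kind is normally applied to the training split only, so that validation and test scores reflect the original class distribution.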
The model employs a stacking ensemble that combines Random Forest, AdaBoost, and XGBoost classifiers as base models. A Random Forest serves as the meta-model, leveraging the strengths of the individual models to enhance accuracy and robustness.
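A rough sketch of such a stacking setup using scikit-learn's `StackingClassifier`. To keep the example dependent on scikit-learn only, `GradientBoostingClassifier` is swapped in as a stand-in for XGBoost; the synthetic data and all hyperparameters are assumptions, and the notebook may wire the base models to the meta-model differently.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.model_selection import train_test_split

# Synthetic stand-in with 28 features, mirroring the selected feature count.
X, y = make_classification(
    n_samples=600, n_features=28, n_informative=10,
    n_classes=3, random_state=0,
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

base_models = [
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ("ada", AdaBoostClassifier(n_estimators=50, random_state=0)),
    # Stand-in for XGBoost so the sketch needs only scikit-learn.
    ("gb", GradientBoostingClassifier(n_estimators=50, random_state=0)),
]

# Out-of-fold predictions of the base models (cv=5) become the meta-model's inputs.
stack = StackingClassifier(
    estimators=base_models,
    final_estimator=RandomForestClassifier(n_estimators=50, random_state=0),
    cv=5,
)
stack.fit(X_train, y_train)
print(stack.score(X_val, y_val))  # validation accuracy
```

Training the meta-model on out-of-fold predictions, rather than on predictions for rows the base models have already seen, limits leakage from the base models into the meta-model.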
The ensemble model demonstrates promising results, with an average F1-score of 97.77% during 5-fold cross-validation on the validation set. Challenges include handling rare security threats and complex attack patterns. The model's performance is compared with existing methods: a hybrid approach of CNN and Random Forest, a standalone Random Forest, and a standalone XGBoost model.
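For reference, an average F1-score over 5 folds can be computed along these lines. The averaging mode is not stated above, so macro averaging (`scoring="f1_macro"`) is an assumption here, as are the synthetic data and the single Random Forest standing in for the full ensemble.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(
    n_samples=500, n_features=28, n_informative=10,
    n_classes=3, random_state=0,
)

model = RandomForestClassifier(n_estimators=50, random_state=0)

# Stratified folds keep the class proportions similar in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="f1_macro")

print(scores)         # one F1-score per fold
print(scores.mean())  # average F1-score across the 5 folds
```

Macro averaging weights every class equally, which makes weak performance on rare attack classes visible; `f1_weighted` would instead weight classes by their support.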
We are working on building a hierarchical model to tackle the data imbalance issue in intrusion detection. So far, it has given promising results on the validation set, the test set, and a separate dataset.
Classification report on test set:
Classification report on different dataset (CIC_IDS_2017):
Comparison with baseline models:
We are optimistic about using few-shot learning for this problem, given the nature of the challenge. We also plan to explore custom ensemble models and voting mechanisms to address class imbalance.
The research highlights the importance of intrusion detection in SDN and proposes an ensemble model for effective threat detection. Future work includes addressing challenges with rare classes, exploring alternate feature engineering, optimizing threshold values, and refining the ensemble model architecture.
- Model Training: The Stacking_Model Train and Test Updated.ipynb notebook contains the code for training the ensemble model. The dataset is loaded, features are selected, and the data is preprocessed. The base classifiers (Random Forest, XGBoost, AdaBoost) are trained, and their predictions are used as input to the meta-model (Random Forest). The F1-score is used as the evaluation metric.
- Model Evaluation: The model is evaluated on a validation set, and the classification report and confusion matrix are generated. The best meta-model is saved for future use.
- Model Testing: The trained ensemble model is loaded, and a test dataset is used to make predictions. The classification report and confusion matrix for the test set are generated.
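The evaluation outputs mentioned above can be produced with scikit-learn's metrics module. The sketch below uses synthetic data and a single Random Forest in place of the trained ensemble; both are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=600, n_features=28, n_informative=10,
    n_classes=3, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Per-class precision, recall, and F1-score, plus macro and weighted averages.
print(classification_report(y_test, y_pred, digits=4))

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred)
print(cm)
```

Reading the confusion matrix row by row shows which attack classes are mistaken for each other, which the aggregate F1-score alone does not reveal.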
- Stacking_Model Train and Test Updated.ipynb: Jupyter notebook containing the code for model training and testing.
- *.pkl: Pickle files containing the trained models, scaler, and label encoder.

Just run the notebook and replace the test file with your own test files.
- Clone the repository:
  git clone https://github.com/ITU-AI-ML-in-5G-Challenge/Intrusion-and-Vulnerability-Detection-in-Software-Defined-Networks-SDN-Team-ML-IDS.git
- Open and run the Stacking_Model Train and Test Updated.ipynb notebook using Jupyter Notebook or Google Colab.
- After training, you can use the saved model to make predictions on your own test data. Update the model_path and data_path variables in the provided testing script.
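A hedged sketch of what the prediction step might look like. The layout of the repository's pickle files is not documented here, so this example first writes its own small bundle (model, scaler, label encoder in one dictionary) to a temporary directory and then loads it back; the file names, the dictionary keys, and the CSV input format are all hypothetical. Only `model_path` and `data_path` are names taken from the instructions above.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder, StandardScaler

workdir = Path(tempfile.mkdtemp())
model_path = workdir / "bundle.pkl"     # hypothetical file name
data_path = workdir / "test_flows.csv"  # hypothetical file name

# --- Build a tiny stand-in bundle so the sketch runs end to end ---
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 4))
labels = np.where(X_train[:, 0] > 0, "DDoS", "Benign")
encoder = LabelEncoder().fit(labels)
scaler = StandardScaler().fit(X_train)
model = RandomForestClassifier(n_estimators=20, random_state=0)
model.fit(scaler.transform(X_train), encoder.transform(labels))
with open(model_path, "wb") as fh:
    pickle.dump({"model": model, "scaler": scaler, "encoder": encoder}, fh)
np.savetxt(data_path, rng.normal(size=(10, 4)), delimiter=",")

# --- The part that mirrors the testing step ---
with open(model_path, "rb") as fh:
    bundle = pickle.load(fh)

X_test = np.loadtxt(data_path, delimiter=",")
# Apply the same scaling as at training time, then decode class indices to names.
y_pred = bundle["model"].predict(bundle["scaler"].transform(X_test))
predicted_labels = bundle["encoder"].inverse_transform(y_pred)
print(predicted_labels[:5])
```

Pickle files can execute arbitrary code when loaded, so only load `.pkl` files from a source you trust. The test data must also contain the same features, in the same order, as the data the scaler and model were fitted on.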
The model achieves an average F1-score of 0.9777 on the validation set and 0.8612 on the test set. Detailed classification reports and confusion matrices are provided in the notebook.
- Nasik Sami Khan
- Md. Shamim Towhid
- Md Mahibul Hasan