Intrusion-and-Vulnerability-Detection-in-Software-Defined-Networks-SDN-Team-ML-IDS

Intrusion and Vulnerability Detection in Software-Defined Networks (SDN) by Team ML-IDS

Original and modified data for the experiments can be found here: https://drive.google.com/drive/folders/1hp8FB270BEYhK2dAIZJsBHIfNV8Fhwy1?usp=drive_link

Overview

This repository contains research conducted by Nasik Sami Khan, Md. Shamim Towhid, and Md Mahibul Hasan of the Department of Computer Science at the University of Regina, Canada. The work addresses the challenge posted by ULAK on the ITU platform on intrusion detection in SDN environments. It focuses on developing effective intrusion detection systems for Software-Defined Networks (SDNs) using machine learning techniques.

Abstract

The transition from conventional networking architectures to SDNs has brought about significant advancements in network management. However, the centralization of control within SDNs poses a security risk, necessitating robust intrusion detection systems. This research explores the development of multiclass classifiers capable of identifying various intrusion types in SDN-enabled networks. A comprehensive dataset, including Normal flow data, DDoS flow data, Malware flow data, and web-based flow data, is provided by ULAK to facilitate research. Machine learning techniques are employed to create effective intrusion detection models, contributing to the protection of SDN-based networks against a wide range of threats.

Key Words

SDN
IDS
Data Imbalance
Machine Learning
Ensemble Techniques

Research Methodology

Dataset

The provided dataset consists of 1.78 million rows and 77 distinct columns, including a label column that holds the output variable. The dataset is class-imbalanced, so specialized techniques are needed to treat all classes fairly. Feature selection is performed using the Random Forest classifier, resulting in a subset of 28 significant features.

Feature Selection

The feature selection process evaluates each attribute's significance using the Random Forest classifier, followed by Principal Component Analysis (PCA) for validation. The final subset of 28 features enhances predictive ability and interpretability. We used an embedded method based on Random Forest to curate the 28 features.
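The ranking step described above can be sketched as follows. This is a minimal illustration, assuming scikit-learn's impurity-based `feature_importances_` as the ranking criterion; the helper name `select_top_features`, the synthetic data, and the hyperparameters are hypothetical and are not taken from the notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier


def select_top_features(X, y, feature_names, k=28, random_state=0):
    """Rank features by Random Forest impurity importance and keep the top k."""
    forest = RandomForestClassifier(n_estimators=50, random_state=random_state)
    forest.fit(X, y)
    # Indices sorted from most to least important.
    order = np.argsort(forest.feature_importances_)[::-1]
    top = order[:k]
    return [feature_names[i] for i in top], forest.feature_importances_[top]


# Synthetic stand-in for the real data: 76 input columns and 4 classes.
X, y = make_classification(
    n_samples=1000, n_features=76, n_informative=20, n_classes=4, random_state=0
)
names = [f"f{i}" for i in range(X.shape[1])]
selected, scores = select_top_features(X, y, names, k=28)
```

On the real dataset, `names` would be the column names of the loaded table and `selected` the 28 columns passed on to preprocessing.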

Data Preprocessing

Data preprocessing includes cleaning, conversion to numeric values, handling class imbalance, and scaling feature values. The dataset is split into training and validation sets to ensure consistent and well-prepared data for model training and evaluation. To handle the imbalance, we down-sampled the Benign (majority) class and generated synthetic samples for the minority classes using the SMOTE technique. We separated the data into 3 groups based on the class sample distribution.
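The two resampling steps can be illustrated with a small NumPy sketch. It is a hand-written, SMOTE-style interpolation for illustration only, not the implementation used in the notebook (the `SMOTE` class in imbalanced-learn is the usual library route). The function names, class labels, and target counts below are hypothetical.

```python
import numpy as np


def downsample(X, y, label, n_keep, rng):
    """Randomly keep at most n_keep rows of one class; other classes are untouched."""
    idx = np.flatnonzero(y == label)
    if len(idx) <= n_keep:
        return X, y
    drop = rng.choice(idx, size=len(idx) - n_keep, replace=False)
    mask = np.ones(len(y), dtype=bool)
    mask[drop] = False
    return X[mask], y[mask]


def smote_like(X, y, label, n_new, k, rng):
    """Add n_new synthetic rows for `label`, each interpolated between a real
    sample and one of its k nearest same-class neighbours."""
    pts = X[y == label]
    # Pairwise squared distances within the class (suitable for small classes only).
    dist = ((pts[:, None, :] - pts[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbour
    k = min(k, len(pts) - 1)
    neighbours = np.argsort(dist, axis=1)[:, :k]
    base = rng.integers(0, len(pts), size=n_new)
    pick = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))  # position along the segment base -> neighbour
    synthetic = pts[base] + gap * (pts[pick] - pts[base])
    new_y = np.full(n_new, label, dtype=y.dtype)
    return np.vstack([X, synthetic]), np.concatenate([y, new_y])


rng = np.random.default_rng(0)
X = rng.normal(size=(1100, 5))
y = np.array([0] * 1000 + [1] * 100)  # 0 = majority class, 1 = a minority class

X, y = downsample(X, y, label=0, n_keep=300, rng=rng)
X, y = smote_like(X, y, label=1, n_new=200, k=5, rng=rng)
counts = {int(c): int(n) for c, n in zip(*np.unique(y, return_counts=True))}
```

After both steps the toy data holds 300 rows per class. Resampling should be applied to the training split only, so that validation and test data keep their original distribution.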

Model Architecture

The model employs an ensemble approach, combining Random Forest, AdaBoost, and XGBoost classifiers as base models. A second Random Forest serves as the meta-model, leveraging the strengths of the individual models to enhance accuracy and robustness. We used the stacking method of ensembling to tackle this problem.
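A stacked ensemble of this shape can be sketched with scikit-learn's `StackingClassifier`. This is an illustration under stated assumptions rather than the notebook's code: `GradientBoostingClassifier` stands in for XGBoost so that the sketch depends only on scikit-learn (`xgboost.XGBClassifier` could be substituted in the same slot), and the data and hyperparameters are synthetic placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 28 selected features, 4 traffic classes.
X, y = make_classification(
    n_samples=600, n_features=28, n_informative=10, n_classes=4, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

base_models = [
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ("ada", AdaBoostClassifier(n_estimators=50, random_state=0)),
    # Stand-in for XGBoost (assumption made for this sketch).
    ("gb", GradientBoostingClassifier(n_estimators=50, random_state=0)),
]

# The meta-model is trained on out-of-fold class probabilities of the base models.
stack = StackingClassifier(
    estimators=base_models,
    final_estimator=RandomForestClassifier(n_estimators=50, random_state=0),
    cv=5,
)
stack.fit(X_train, y_train)
val_pred = stack.predict(X_val)
```

Training the meta-model on out-of-fold predictions, which `cv=5` arranges here, keeps it from simply memorising base-model outputs on data those models have already seen.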

Result Analysis

The ensemble model demonstrates promising results, with an average F1-score of 97.77% during 5-fold cross-validation on the validation set. Challenges include handling rare security threats and complex attack patterns. The model's performance is compared with existing methods: a hybrid approach of CNN and Random Forest, a standalone Random Forest, and a standalone XGBoost model.
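For readers who want to reproduce this kind of figure, a fold-averaged F1-score can be computed along the following lines. The sketch assumes macro averaging across classes, which the text above does not specify, and uses a plain Random Forest on synthetic data in place of the full ensemble.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the prepared feature matrix and labels.
X, y = make_classification(
    n_samples=500, n_features=28, n_informative=10, n_classes=4, random_state=0
)

# Stratified folds keep the class proportions similar in every fold.
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
model = RandomForestClassifier(n_estimators=50, random_state=0)

# One macro-averaged F1-score per fold, then the mean over the 5 folds.
fold_f1 = cross_val_score(model, X, y, cv=folds, scoring="f1_macro")
mean_f1 = fold_f1.mean()
```

With imbalanced classes the averaging scheme matters: macro averaging weights every class equally, while `scoring="f1_weighted"` weights classes by their support and is usually closer to plain accuracy.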

Stacking model performance on test data

Stacking model performance on separate test data (CICIDS_2017) to demonstrate model generalization:

Previous Baseline models

Standalone Random Forest Baseline

Standalone XGBoost Baseline

CNN+Random Forest

Computational expense comparison

Model training time (sec):

Inference time per sample (sec):

Ongoing research direction

We are working on building a hierarchical model to tackle the data imbalance issue in intrusion detection. So far, it has given promising results on the validation set, the test set, and a separate dataset.

Model Architecture Summary:

Result Summary:

Classification report on test set:

Classification report on different dataset (CIC_IDS_2017):

Comparison with baseline models:

Performance on CICIDS_2017 dataset:

Future Work:

We are optimistic about using few-shot learning for this problem, given the nature of the challenge. We also plan to explore custom ensemble models and voting mechanisms to address class imbalance.

Conclusion

The research highlights the importance of intrusion detection in SDN and proposes an ensemble model for effective threat detection. Future work includes addressing challenges with rare classes, exploring alternate feature engineering, optimizing threshold values, and refining the ensemble model architecture.

How to Use

Overview

Model Training: The Stacking_Model Train and Test Updated.ipynb notebook contains the code for training the ensemble model. The dataset is loaded, features are selected, and the data is preprocessed. The base classifiers (Random Forest, XGBoost, AdaBoost) are trained, and their predictions are used as input to the meta-model (Random Forest). The F1-score is used as the evaluation metric.
Model Evaluation: The model is evaluated on a validation set, and the classification report and confusion matrix are generated. The best meta-model is saved for future use.
Model Testing: The trained ensemble model is loaded, and a test dataset is used to make predictions. The classification report and confusion matrix for the test set are generated.

Files

Stacking_Model Train and Test Updated.ipynb: Jupyter notebook containing the code for model training and testing.
*.pkl: Pickle files containing the trained models, scaler, and label encoder. To test on your own data, run the notebook and replace the test file with your own test files.

Usage

Clone the repository:

git clone https://github.com/ITU-AI-ML-in-5G-Challenge/Intrusion-and-Vulnerability-Detection-in-Software-Defined-Networks-SDN-Team-ML-IDS.git

Open and run the Stacking_Model Train and Test Updated.ipynb notebook using Jupyter Notebook or Google Colab.
After training, you can use the saved model to make predictions on your own test data. Update the model_path and data_path variables in the provided testing script.
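A minimal sketch of the prediction step is given below. The layout of the pickled object is an assumption: the helper expects a dict with the hypothetical keys `scaler`, `model`, and `label_encoder`, which may not match what the notebook actually saves, so inspect the loaded object first and adapt the keys.

```python
import pickle
from pathlib import Path


def load_artifacts(model_path):
    """Unpickle the object saved by the notebook. Only load files you trust."""
    with Path(model_path).open("rb") as fh:
        return pickle.load(fh)


def predict_labels(artifacts, rows):
    """Scale the rows, predict, and map encoded predictions back to class names.

    Assumes `artifacts` is a dict with 'scaler', 'model' and 'label_encoder'
    entries (hypothetical layout, not confirmed by the repository).
    """
    scaled = artifacts["scaler"].transform(rows)
    encoded = artifacts["model"].predict(scaled)
    return artifacts["label_encoder"].inverse_transform(encoded)


# Example wiring (paths are placeholders to be replaced with your own):
# artifacts = load_artifacts(model_path)
# rows = pandas.read_csv(data_path)[selected_feature_columns]
# labels = predict_labels(artifacts, rows)
```

The test rows must contain the same 28 features, in the same order, that the scaler and models were fitted on.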

Results

The model achieves an average F1-score of 0.9777 on the validation set and 0.8612 on the test set. Detailed classification reports and confusion matrices are provided in the notebook.

Contributors

Nasik Sami Khan
Md. Shamim Towhid
Md Mahibul Hasan
