Advanced Persistent Threat (APT) Detection System

A comprehensive security solution that combines machine learning, behavioral analytics, and real-time monitoring to detect sophisticated cyber threats.

Overview

The APT Detection System is designed to identify and respond to advanced persistent threats using a multi-layered approach that combines traditional machine learning with behavioral analytics. The system integrates with various data sources, maps detected threats to the MITRE ATT&CK framework, and provides an intuitive dashboard for security analysts.

Key Capabilities

  • Hybrid Detection Engine: Combines LightGBM and Bi-LSTM models with behavioral analytics
  • Real-Time Monitoring: Integrates with EDR, SIEM systems, and Kafka streams
  • Behavioral Analysis: Establishes baselines and detects anomalies in entity behavior
  • MITRE ATT&CK Mapping: Automatically maps threats to the MITRE ATT&CK framework
  • Interactive Dashboard: Provides comprehensive visualization and analysis tools
  • Security Event Simulation: Generates realistic security events for testing and development

Architecture

The system employs a modular architecture with two main workflows:

  1. Model Training Pipeline: Processes historical data to train detection models
  2. Real-time Detection Pipeline: Analyzes incoming data streams to identify threats

System Architecture Diagram

graph TD;
    %% Data Sources and Ingestion
    DS[Data Sources] -->|Kafka, EDR, SIEM| DI[Data Ingestion]
    
    %% Training Pipeline
    DP[Data Preprocessing] --> FS[Feature Selection]
    FS --> DB[Data Balancing]
    DB --> MT[Model Training]
    MT --> LGB[LightGBM Model]
    MT --> LSTM[Bi-LSTM Model]
    LGB --> HM[Hybrid Model]
    LSTM --> HM
    
    %% Real-time Pipeline
    DI --> FE[Feature Extraction]
    FE --> HM
    FE --> AD[Anomaly Detection]
    
    %% Behavioral Analytics
    BA[Behavioral Analytics] --> BE[Baseline Establishment]
    BE --> AD
    
    %% Detection and Alerts
    HM --> MLD[ML-based Detection]
    AD --> BBD[Behavior-based Detection]
    MLD --> AG[Alert Generation]
    BBD --> AG
    AG --> MITRE[MITRE ATT&CK Mapping]
    AG --> REDIS[Redis Storage]
    MITRE --> DASH[Dashboard Visualization]
    REDIS --> DASH

Architecture Components

  • Data Ingestion: Collects data from multiple sources (Kafka, EDR, SIEM)
  • Data Preprocessing: Cleans and normalizes training data
  • Feature Selection: Uses HHOSSSA algorithm to select relevant features
  • Data Balancing: Applies HHOSSSA-SMOTE to address class imbalance
  • Model Training: Trains LightGBM and Bi-LSTM models
  • Hybrid Model: Combines predictions from multiple models
  • Behavioral Analytics: Establishes normal behavior baselines
  • Anomaly Detection: Identifies deviations from normal behavior
  • Alert Generation: Creates alerts based on detection results
  • Alert Storage: Stores alerts in Redis for persistence and sharing between processes
  • MITRE ATT&CK Mapping: Maps threats to the MITRE framework
  • Dashboard: Visualizes alerts and provides analysis tools

Installation

Prerequisites

  • Python 3.8 or higher
  • Java Development Kit (JDK) 11 or higher (for Kafka integration)
  • Wazuh Server (optional, for EDR integration)
  • Elasticsearch (optional, for SIEM integration)

Core Installation

  1. Clone the repository:

    git clone https://github.com/Ap6pack/APT-Detection-System.git
    cd APT-Detection-System
  2. Create and activate a virtual environment:

    python3 -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install dependencies:

    pip install -r requirements.txt

Optional Integrations

Kafka Integration

  1. Download and install Kafka from the official Apache website

    • Select the latest stable release (e.g., 3.5.x)
    • Download the binary distribution (e.g., kafka_2.13-3.5.1.tgz)
    • Extract the archive to your preferred location

    Note: Zookeeper is included in the Kafka distribution. You don't need to download Zookeeper separately.

  2. Automatic Kafka Management: The system now automatically manages Kafka and ZooKeeper:

    • When you run python main.py --all, the system will:
      • Check if Kafka and ZooKeeper are running
      • Start them automatically if they're not running
      • Handle cluster ID mismatches by cleaning up and restarting Kafka when needed
      • Create the necessary Kafka topic if it doesn't exist

    Note: You no longer need to manually start Kafka and ZooKeeper before running the system.

  3. Manual Kafka Management (if needed): If you prefer to manage Kafka manually, you can still do so:

    # Navigate to your Kafka installation directory (version will vary)
    cd kafka_2.13-3.8.0
    
    # Start Zookeeper first
    ./bin/zookeeper-server-start.sh config/zookeeper.properties
    
    # In a new terminal, start Kafka
    ./bin/kafka-server-start.sh config/server.properties
    
    # Create the topic (if it doesn't exist)
    ./bin/kafka-topics.sh --create --topic apt_topic --bootstrap-server localhost:9092 --partitions 1 --replication-factor 1
    
    # Verify the topic was created
    ./bin/kafka-topics.sh --list --bootstrap-server localhost:9092
  4. Troubleshooting Kafka:

    • If Kafka server quits unexpectedly, check the following:
      • Ensure Zookeeper is running before starting Kafka
      • Increase memory allocation: export KAFKA_HEAP_OPTS="-Xmx512M -Xms512M"
      • Check for port conflicts on 9092 (Kafka) and 2181 (Zookeeper)
      • Verify logs in logs/server.log for specific error messages
      • For cluster ID mismatches, the system will automatically handle cleanup and restart
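
For reference, here is a minimal sketch of the kind of liveness check behind the automatic management described in step 2, assuming only the default ports; the system's actual logic lives in main.py:

    import socket

    def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
        """Return True if a TCP connection to host:port succeeds."""
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    # Default ports: Kafka on 9092, ZooKeeper on 2181.
    kafka_up = is_port_open("localhost", 9092)
    zookeeper_up = is_port_open("localhost", 2181)
    print(f"Kafka: {'up' if kafka_up else 'down'}; ZooKeeper: {'up' if zookeeper_up else 'down'}")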

Wazuh EDR Integration

  1. Install Wazuh Server following the official documentation
  2. Configure the Wazuh API in the config.yaml file

Elasticsearch SIEM Integration

  1. Install Elasticsearch following the official documentation
  2. Configure Elasticsearch in the config.yaml file

Redis Integration (Recommended)

The system uses Redis for robust alert storage, enabling data sharing between different processes and providing persistence across system restarts.

  1. Install Redis:

    # On Ubuntu/Debian
    sudo apt-get update
    sudo apt-get install redis-server
    
    # On CentOS/RHEL
    sudo yum install redis
    
    # On macOS with Homebrew
    brew install redis
  2. Start Redis server:

    # On Linux
    sudo systemctl start redis-server
    
    # On macOS
    brew services start redis
    
    # Or run directly
    redis-server
  3. Verify Redis is running:

    redis-cli ping

    You should receive a response of PONG.

  4. Test the Redis integration:

    python test_redis.py
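
For a sense of how alerts are shared through Redis, here is a minimal sketch using redis-py; the key name apt:alerts and the alert fields are hypothetical (the project's actual schema lives in redis_storage.py):

    import json
    import redis

    r = redis.Redis(host="localhost", port=6379, db=0)

    # Store an alert as JSON on a list (key name is hypothetical).
    alert = {"entity": "host-01", "severity": "high", "technique": "T1003"}
    r.rpush("apt:alerts", json.dumps(alert))

    # Any other process can read the same list back.
    latest = json.loads(r.lindex("apt:alerts", -1))
    print(latest["entity"], latest["severity"])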

Configuration

The system is configured through the config.yaml file. Key configuration sections include:

Model Configuration

model_paths:
  base_dir: models/
  lightgbm: lightgbm_model.pkl
  bilstm: bilstm_model.h5

training_params:
  lightgbm:
    num_leaves: 31
    learning_rate: 0.05
    n_estimators: 100
  bilstm:
    epochs: 5
    batch_size: 32
    lstm_units: 64
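
As an illustration, here is a minimal sketch of feeding the training_params above into LightGBM, assuming PyYAML and the lightgbm package; the project's actual training code lives in models/train_models.py:

    import yaml
    from lightgbm import LGBMClassifier

    with open("config.yaml") as f:
        config = yaml.safe_load(f)

    params = config["training_params"]["lightgbm"]
    model = LGBMClassifier(
        num_leaves=params["num_leaves"],        # 31
        learning_rate=params["learning_rate"],  # 0.05
        n_estimators=params["n_estimators"],    # 100
    )
    # model.fit(X_train, y_train) once features and labels are prepared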

Data Source Configuration

data_sources:
  wazuh:
    enabled: false
    api_url: "https://wazuh.example.com:55000"
    username: "wazuh-api-user"
    password: "wazuh-api-password"
    verify_ssl: false
    fetch_interval: 60

  elasticsearch:
    enabled: false
    hosts: ["localhost:9200"]
    index_pattern: "winlogbeat-*"
    username: "elastic"
    password: "changeme"
    verify_certs: false
    fetch_interval: 60

Behavioral Analytics Configuration

settings:
  behavioral_analytics:
    baseline_period_days: 7
    anomaly_threshold: 0.8
    time_window_minutes: 10
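
A minimal sketch of how these settings might be applied, assuming events carry timestamps; the interpretation shown is illustrative, not the module's exact logic:

    from datetime import datetime, timedelta

    import yaml

    with open("config.yaml") as f:
        ba = yaml.safe_load(f)["settings"]["behavioral_analytics"]

    # Behavior observed over the baseline period defines "normal" for an entity.
    baseline_start = datetime.utcnow() - timedelta(days=ba["baseline_period_days"])

    # Recent activity is aggregated and scored in fixed-size time windows.
    window = timedelta(minutes=ba["time_window_minutes"])
    print(f"Baseline since {baseline_start:%Y-%m-%d}, scoring in {window} windows")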

Usage

Running the System

The system can be run in different modes depending on your requirements:

Production Mode

# Using the production script
./run_production.sh

# Or manually
python main.py --production

This will:

  • Start the real-time detection engine with real data sources
  • Start the dashboard
  • Disable simulation mode
  • Configure the system for production use

For detailed production deployment instructions, see PRODUCTION.md.

Development/Testing Modes

Complete System (with Simulation)

python main.py --all

This will:

  • Start Kafka and ZooKeeper automatically if they're not running
  • Train models if they don't exist
  • Start the real-time detection engine
  • Start the dashboard
  • Start the simulation system

Training Models Only

python main.py --train

Running Prediction Engine Only

python main.py --predict

This will start the real-time detection engine and automatically manage Kafka and ZooKeeper.

Running Dashboard Only

python main.py --dashboard

Running Simulation Only

python main.py --simulation

This will start the simulation system and automatically manage Kafka and ZooKeeper.

Running Simulation with Dashboard

python main.py --simulation --dashboard

This will start both the simulation system and the dashboard, and automatically manage Kafka and ZooKeeper.

Testing with Sample Data

To test the system with sample data, you can use the provided script to produce test messages to Kafka:

python produce_messages.py
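
The sketch below shows roughly what such a producer does, using kafka-python and the apt_topic topic created during Kafka setup; the event fields are illustrative, not the script's actual schema:

    import json
    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )

    # Send one illustrative event to the topic the detection engine consumes.
    event = {"entity": "host-01", "event_type": "process_start", "process": "powershell.exe"}
    producer.send("apt_topic", event)
    producer.flush()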

Running the Standalone Simulation

The simulation system can also be run standalone:

# Run with default configuration
python simulation_runner.py

# Run with a specific event rate
python simulation_runner.py --rate 10

# Run with a specific realism level
python simulation_runner.py --realism advanced

# Run a specific scenario
python simulation_runner.py --scenario data_exfiltration

# Run for a specific duration (in minutes)
python simulation_runner.py --duration 30

Dashboard

The system includes a comprehensive real-time dashboard for monitoring and analysis, accessible at http://localhost:5000 when the dashboard is running.

Dashboard Features

  • Overview: Alert statistics, timeline, and top entities with real-time alert streaming
  • Real-Time Alert Stream: Live WebSocket-based alert feed with smooth animations and connection monitoring
  • Attack Timeline: Interactive visualization showing attack progression over time with MITRE ATT&CK technique mapping
  • Alerts: Detailed alert information with filtering and MITRE ATT&CK mapping
  • Enhanced Metrics: Comprehensive model performance and noise reduction analysis
  • Entity Analysis: Entity behavior statistics and anomaly detection
  • Models: Status of machine learning models and behavioral baselines
  • Connectors: Status and configuration of data source connectors

Real-Time Capabilities

  • WebSocket Integration: Real-time alert streaming with <1 second latency
  • Live Visualizations: Interactive charts that update automatically
  • Attack Timeline Analysis: Time-based filtering and attack phase detection
  • Connection Monitoring: Visual status indicators for system health
  • Professional Animations: Smooth transitions and visual feedback
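
As a hedged illustration of the streaming pattern, here is a minimal server-side sketch; Flask-SocketIO, the new_alert event name, and the payload are assumptions, not the dashboard's actual contract:

    from flask import Flask
    from flask_socketio import SocketIO

    app = Flask(__name__)
    socketio = SocketIO(app)

    def push_alert(alert: dict) -> None:
        """Broadcast a new alert to all connected dashboard clients."""
        socketio.emit("new_alert", alert)

    if __name__ == "__main__":
        socketio.run(app, host="0.0.0.0", port=5000)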

Project Structure

APT_Detection_System/
├── config.yaml                      # Configuration file
├── main.py                          # Main application entry point
├── requirements.txt                 # Project dependencies
├── README.md                        # Documentation
├── PRODUCTION.md                    # Production deployment guide
├── run_production.sh                # Production startup script
├── apt_detection.log                # Application log file
├── REDIS_INTEGRATION.md             # Redis integration documentation
├── redis_storage.py                 # Redis storage module
├── dashboard/                       # Dashboard application
│   ├── app.py                       # Flask application
│   └── templates/                   # HTML templates
├── data_preprocessing/              # Data preprocessing modules
│   ├── data_cleaning.py
│   ├── feature_engineering.py
│   ├── load_dataset.py
│   └── preprocess.py
├── data_balancing/                  # Data balancing modules
│   └── hhosssa_smote.py             # SMOTE implementation
├── evaluation/                      # Model evaluation modules
│   ├── cross_validation.py
│   └── evaluation_metrics.py
├── feature_selection/               # Feature selection modules
│   └── hhosssa_feature_selection.py
├── models/                          # Model definitions and saved models
│   ├── bilstm_model.py
│   ├── hybrid_classifier.py
│   ├── lightgbm_model.py
│   └── train_models.py
├── real_time_detection/             # Real-time detection modules
│   ├── behavioral_analytics.py
│   ├── data_ingestion.py
│   ├── kafka_consumer.py
│   ├── mitre_attack_mapping.py
│   ├── prediction_engine.py
│   ├── redis_integration.py         # Redis storage integration
│   └── connectors/                  # Data source connectors
│       ├── connector_manager.py
│       ├── elasticsearch_connector.py
│       └── wazuh_connector.py
├── simulation/                      # Security event simulation system
│   ├── config.py                    # Simulation configuration
│   ├── simulator.py                 # Main simulation coordinator
│   ├── entities/                    # Simulated entities
│   │   ├── entity.py                # Base entity class
│   │   ├── host.py                  # Host entity
│   │   └── user.py                  # User entity
│   ├── generators/                  # Event generators
│   │   ├── base_generator.py        # Base generator class
│   │   ├── network_events.py        # Network event generator
│   │   ├── endpoint_events.py       # Endpoint event generator
│   │   └── user_events.py           # User event generator
│   ├── scenarios/                   # Attack scenarios
│   │   ├── base_scenario.py         # Base scenario class
│   │   └── basic_scenarios.py       # Basic attack scenarios
│   └── output/                      # Output adapters
│       ├── base_output.py           # Base output class
│       ├── redis_output.py          # Redis output adapter
│       └── kafka_output.py          # Kafka output adapter
├── simulation_runner.py             # Standalone simulation runner
├── visualization.py                 # Visualization utilities
│
│ # Testing and Development Files
├── test_mitre_attack.py             # Tests MITRE ATT&CK mapping
├── test_redis.py                    # Tests Redis integration
├── produce_messages.py              # Generates test Kafka messages
├── sample_alert.json                # Sample alert format
└── synthetic_apt_dataset.csv        # Sample dataset (excluded from Git)

Production vs. Testing Files

The repository is organized to clearly separate production code from testing utilities:

Production Files

  • Core system modules (main.py, config.yaml, etc.)
  • All modules in the core package directories (real_time_detection/, models/, etc.)
  • Redis integration (redis_storage.py, real_time_detection/redis_integration.py)
  • Dashboard application (dashboard/)

Testing/Development Files

  • Test scripts (test_redis.py, test_mitre_attack.py)
  • Sample data generation (produce_messages.py)
  • Sample data files (sample_alert.json, synthetic_apt_dataset.csv)

Large files like model binaries (*.h5, *.pkl) and the dataset CSV are excluded from the Git repository via .gitignore but can be generated using the provided scripts.

Technical Details

Machine Learning Models

The system uses two primary models:

  1. LightGBM: A gradient boosting framework that uses tree-based learning algorithms
  2. Bi-LSTM: A bidirectional long short-term memory neural network for sequence analysis

These models are combined into a hybrid classifier that leverages the strengths of both approaches.
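
A minimal sketch of one way such a combination can work; the equal weighting here is an assumption, and the project's actual logic lives in models/hybrid_classifier.py:

    def hybrid_predict(lgbm_model, bilstm_model, X_tabular, X_sequence,
                       weight=0.5, threshold=0.5):
        """Blend LightGBM and Bi-LSTM threat probabilities into one verdict."""
        p_lgbm = lgbm_model.predict_proba(X_tabular)[:, 1]  # P(threat), tree model
        p_lstm = bilstm_model.predict(X_sequence).ravel()   # P(threat), sequence model
        p_hybrid = weight * p_lgbm + (1 - weight) * p_lstm
        return (p_hybrid >= threshold).astype(int), p_hybrid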

Behavioral Analytics

The behavioral analytics module:

  1. Establishes baselines of normal behavior for entities
  2. Detects anomalies using the Isolation Forest algorithm (see the sketch after this list)
  3. Analyzes entity behavior patterns over time
  4. Identifies anomalous features contributing to alerts
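
A minimal sketch of the Isolation Forest step named above, using scikit-learn; the feature layout is illustrative, and how raw scores map onto the configured anomaly_threshold is left to behavioral_analytics.py:

    import numpy as np
    from sklearn.ensemble import IsolationForest

    # Baseline: per-window feature vectors of an entity's normal behavior
    # (e.g., event count, distinct processes, bytes transferred).
    baseline = np.random.rand(500, 3)   # placeholder for real baseline features

    model = IsolationForest(contamination=0.01, random_state=42).fit(baseline)

    # Negative decision_function scores fall outside the learned baseline.
    recent = np.random.rand(10, 3)      # placeholder for recent activity
    is_anomaly = model.decision_function(recent) < 0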

MITRE ATT&CK Integration

The system maps detected threats to the MITRE ATT&CK framework, providing:

  1. Technique identification
  2. Tactic categorization
  3. Mitigation recommendations
  4. Contextual information for security analysts
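
A minimal sketch of the mapping idea, with a hypothetical lookup table; the real mapping lives in real_time_detection/mitre_attack_mapping.py, though the two entries shown use genuine MITRE ATT&CK IDs:

    # Hypothetical lookup from detection labels to MITRE ATT&CK context.
    TECHNIQUE_MAP = {
        "credential_dumping": {
            "technique_id": "T1003",
            "technique": "OS Credential Dumping",
            "tactic": "Credential Access",
            "mitigation": "Restrict access to LSASS; enable Credential Guard.",
        },
        "lateral_movement_smb": {
            "technique_id": "T1021.002",
            "technique": "Remote Services: SMB/Windows Admin Shares",
            "tactic": "Lateral Movement",
            "mitigation": "Limit admin shares; segment the network.",
        },
    }

    def map_to_mitre(detection_label: str) -> dict:
        """Attach MITRE ATT&CK context to an alert, or an empty dict if unmapped."""
        return TECHNIQUE_MAP.get(detection_label, {})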

Security Considerations

The APT Detection System is designed with security in mind:

  • Authentication: Secure authentication for all data source connections
  • Data Encryption: Encryption of sensitive configuration data
  • Input Validation: Validation of all input data to prevent injection attacks
  • Secure Defaults: Secure default configurations for all components

Contributing

Contributions are welcome! Please follow these steps:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

  • MITRE ATT&CK® for the comprehensive threat intelligence framework
  • The open-source community for the various libraries and tools used in this project
