From 3784565b9a305451fad5b36ab8a6d1dc790e169c Mon Sep 17 00:00:00 2001 From: schlippe Date: Mon, 18 Mar 2024 18:12:00 +0100 Subject: [PATCH 01/21] Add basic structure for new dataset --- content/all_datasets.md | 1 + .../datasets/user_computer_associations.md | 57 +++++++++++++++++++ content/related_work.md | 4 +- 3 files changed, 60 insertions(+), 2 deletions(-) create mode 100644 content/datasets/user_computer_associations.md diff --git a/content/all_datasets.md b/content/all_datasets.md index 502b402..e6e2a2c 100644 --- a/content/all_datasets.md +++ b/content/all_datasets.md @@ -50,6 +50,7 @@ before-content: gh_buttons.html | [Unified Host and Network Data Set](../datasets/unified_host_and_network_dataset) | - | Selection of network and host events collected from operational environment, but without any attacks | 2017 | Enterprise IT | Windows, Linux | 🟥 | NetFlows, Windows events | - | - | | [Unraveled](../datasets/unraveled) | Host & Network | Large dataset with intricate labeling, though the focus seems to be on network flows. Mapping will be annoying. | 2021 | Enterprise IT | Windows, Ubuntu, Kali | 🟩 | pcaps, misc. logs (syslog, audit, auth, Snort) | - | 22 GB | | [UNSW-NB15](../datasets/unsw_nb15) | Network | Custom network undergoing a variety of attacks using IXIA PerfectStorm hardware. Mostly geared towards anomaly-based NIDS | 2015 | Undisclosed | Undisclosed | 🟩 | pcaps, custom network features | >100 GB | - | +| [User-Computer Associations in Time](../datasets/user_computer_associations) | | | | | | | | | | | [VAST Challenge 2011](../datasets/vast_2011) | Network | Originated from a challenge about data analytics, focus on network but also contains host logs. Labeling is a bit lacking | 2011 | Enterprise IT | Windows | 🟨 | pcaps, Windows events, misc. 
logs (firewall, Snort, Nessus) | 940 MB | 9,3 GB | | [VAST Challenge 2012](../datasets/vast_2012) | Network | Originated from a challenge about data analytics, focus an a large network being the victim of a botnet | 2012 | Enterprise IT | Undisclosed | 🟨 | Snort alerts, firewall logs | 186 MB | 2,9 GB | diff --git a/content/datasets/user_computer_associations.md b/content/datasets/user_computer_associations.md new file mode 100644 index 0000000..c223d08 --- /dev/null +++ b/content/datasets/user_computer_associations.md @@ -0,0 +1,57 @@ +--- +title: NEW_ENTRY_NAME +--- + +- [Overview](#overview) +- [Environment](#environment) +- [Activity](#activity) +- [Contained Data](#contained-data) +- [Papers](#papers) +- [Links](#links) +- [Data Examples](#data-examples) + +| | | +|--------------------------|----------| +| **Network Log Source** | | +| **Network Logs Labeled** | | +| **Host Log Source** | | +| **Host Logs Labeled** | | +| | | +| **Overall Setting** | | +| **OS Types** | | +| **Number of Machines** | | +| **Total Runtime** | | +| **Year of Collection** | | +| **Attack Categories** | | +| **User Emulation** | | +| | | +| **Packed Size** | | +| **Unpacked Size** | | +| **Download Link** | | + +*** + +### Overview +A general description of the dataset, giving a brief overview over origin, intended usage and some properties of the dataset. + +### Environment +A description of the environment the dataset originated from, including networks, operating systems, running services, etc. + +### Activity +What kind of activity, benign and malicious, was performed during the period of data collection. + +### Contained Data +What kind of data was collected and how it is present in the dataset, including any processing and labeling. + +### Papers +- List of related papers, ideally as DOI links + +### Links +- List of related links, such as homepages and download sources + +### Data Examples +Snippet from the dataset, ideally one for each data type. 
+Note that multi-word annotations (like `json lines`) will not render properly on GitHub Pages. +``` +data example +``` \ No newline at end of file diff --git a/content/related_work.md b/content/related_work.md index cce1a59..10c1726 100644 --- a/content/related_work.md +++ b/content/related_work.md @@ -137,7 +137,7 @@ Referenced datasets: - UNM system call dataset - [Unified Host and Network dataset](/intrusion-detection-datasets/content/datasets/unified_host_and_network_dataset) - [UNSW-NB15](/intrusion-detection-datasets/content/datasets/unsw_nb15) -- User-Computer Authentication Associations in Time +- [User-Computer Authentication Associations in Time](/intrusion-detection-datasets/content/datasets/user_computer_associations) - [Vast Challenge 2012]((/intrusion-detection-datasets/content/datasets/vast_2012)) - Vast Challenge 2013 @@ -175,7 +175,7 @@ Referenced datasets: - [ISCX IDS 2012](/intrusion-detection-datasets/content/datasets/iscx_ids_2012) - ISOT - [KDD Cup 1999](/intrusion-detection-datasets/content/datasets/kdd_cup_1999) -- Kent 2016 +- [Kent 2016 (alias for: Comprehensive, Multi-Source Cybersecurity Events)](/intrusion-detection-datasets/content/datasets/comp_multi_source_cybersec_events) - [Kyoto Honeypot](/intrusion-detection-datasets/content/datasets/kyoto_honeypot) - Lawrence Berkeley National Laboratory Traces - NDSec-1 From d09fbc1a61732de9eff995bee53a164469253bb0 Mon Sep 17 00:00:00 2001 From: schlippe Date: Mon, 18 Mar 2024 18:13:08 +0100 Subject: [PATCH 02/21] Remove BotsV3 from list --- content/all_datasets.md | 1 - content/datasets/botsv3.md | 2 +- 2 files changed, 1 insertion(+), 2 deletions(-) diff --git a/content/all_datasets.md b/content/all_datasets.md index e6e2a2c..ed0bbda 100644 --- a/content/all_datasets.md +++ b/content/all_datasets.md @@ -14,7 +14,6 @@ before-content: gh_buttons.html | [AIT Log Dataset](../datasets/ait_log_dataset) | Host & Network | Huge variety of labeled logs collected from multiple simulation runs of an 
enterprise network under attack. With user emulation. but only Linux machines | 2023 | Enterprise IT | Linux | 🟩 | pcaps, Suricata alerts, misc. logs (Apache, auth, dns, vpn, audit, suricata, syslog) | 130 GB | 206 GB | | [ASNM Datasets](../datasets/asnm_datasets) | Network | Specialized features extracted from instances of remote buffer overflow attacks for the purpose of anomaly-based detection | 2009-2018 | Mixed | Windows, Linux | 🟩 | Custom NetFlows | 21 MB | 95 GB | | [AWSCTD](../datasets/awsctd) | Host | Syscalls collected from ~10k malware samples running on Windows 7, no user emulation | 2018 | Single OS | Windows | 🟩 | Sequences of syscall numbers | 10 MB | 558 MB | -| [BotsV3](../datasets/botsv3) [_ON HOLD_] | | _Requires usage of Splunk + a bunch of extensions, postponed_ | 2020 | | | | | 17 GB | - | | [CDX CTF 2009](../datasets/cdx_2009) | Network | Dataset captured from a CTF event, generally intended to provide methods for reliable generating labeled datasets from such events | 2009 | Enterprise IT | Windows, Linux | 🟨 | pcaps, Snort IDS alerts, Apache logs, Splunk logs | 12 GB | 15,3 GB | | [CIC-IDS2017](../datasets/cic_ids2017) | Network | Simulation of medium-sized company network under attack, focuses solely on network traffic | 2017 | Enterprise IT | Windows, Linux | 🟩 | pcaps, NetFlows, custom network features | 48,4 GB | 50 GB | | [CIDD](../datasets/cidd) | - | Spin on the DARPA'98 dataset, correlating user behavior over different systems/environments for behavior-based IDSs | 2012 | Military IT | Unix | 🟩 | Sequences of user "audits" | - | 22 GB | diff --git a/content/datasets/botsv3.md b/content/datasets/botsv3.md index ca80ea4..21811f5 100644 --- a/content/datasets/botsv3.md +++ b/content/datasets/botsv3.md @@ -1,5 +1,5 @@ --- -title: BOTSv3 +title: BOTSv3 [UNLISTED ENTRY] --- - [Overview](#overview) From aee4de2c1955a741526e8eb6a91907f49790164e Mon Sep 17 00:00:00 2001 From: Philipp Boenninghausen Date: Tue, 19 Mar 2024 10:24:36 +0100 
Subject: [PATCH 03/21] Add related entries --- content/datasets/comp_multi_source_cybersec_events.md | 5 +++++ content/datasets/unified_host_and_network_dataset.md | 5 +++++ 2 files changed, 10 insertions(+) diff --git a/content/datasets/comp_multi_source_cybersec_events.md b/content/datasets/comp_multi_source_cybersec_events.md index 2a8a4b5..8fa4f1f 100644 --- a/content/datasets/comp_multi_source_cybersec_events.md +++ b/content/datasets/comp_multi_source_cybersec_events.md @@ -82,6 +82,11 @@ or correlated. - [Homepage](https://csr.lanl.gov/data/cyber1/) - Data access must be requested here +### Related Entries +- Other LANL datasets: + - [Unified Host and Network Dataset](unified_host_and_network_dataset.md) + - [User-Computer Authentication Associations in Time](user_computer_associations.md) + ### Example Data Authentication events in `auth.txt` diff --git a/content/datasets/unified_host_and_network_dataset.md b/content/datasets/unified_host_and_network_dataset.md index 8a7f504..11f99d7 100644 --- a/content/datasets/unified_host_and_network_dataset.md +++ b/content/datasets/unified_host_and_network_dataset.md @@ -78,6 +78,11 @@ Note: Their service is currently unavailable, preventing downloads :( - [Homepage](https://csr.lanl.gov/data/2017/) +### Related Entries +- Other LANL datasets: + - [Comprehensive, Multi-Source Cybersecurity Events](comp_multi_source_cybersec_events.md) + - [User-Computer Authentication Associations in Time](user_computer_associations.md) + ### Data Examples Example network event data From 1621b8cd8971da2f3592a65b4b712320a4d0d561 Mon Sep 17 00:00:00 2001 From: Philipp Boenninghausen Date: Tue, 19 Mar 2024 11:01:48 +0100 Subject: [PATCH 04/21] Add information about User-Computer Associations in Time dataset --- content/all_datasets.md | 2 +- .../datasets/user_computer_associations.md | 80 ++++++++++++------- 2 files changed, 53 insertions(+), 29 deletions(-) diff --git a/content/all_datasets.md b/content/all_datasets.md index 
ed0bbda..f93aeae 100644 --- a/content/all_datasets.md +++ b/content/all_datasets.md @@ -49,7 +49,7 @@ before-content: gh_buttons.html | [Unified Host and Network Data Set](../datasets/unified_host_and_network_dataset) | - | Selection of network and host events collected from operational environment, but without any attacks | 2017 | Enterprise IT | Windows, Linux | 🟥 | NetFlows, Windows events | - | - | | [Unraveled](../datasets/unraveled) | Host & Network | Large dataset with intricate labeling, though the focus seems to be on network flows. Mapping will be annoying. | 2021 | Enterprise IT | Windows, Ubuntu, Kali | 🟩 | pcaps, misc. logs (syslog, audit, auth, Snort) | - | 22 GB | | [UNSW-NB15](../datasets/unsw_nb15) | Network | Custom network undergoing a variety of attacks using IXIA PerfectStorm hardware. Mostly geared towards anomaly-based NIDS | 2015 | Undisclosed | Undisclosed | 🟩 | pcaps, custom network features | >100 GB | - | -| [User-Computer Associations in Time](../datasets/user_computer_associations) | | | | | | | | | | +| [User-Computer Associations in Time](../datasets/user_computer_associations) | - | Large number of authentication events over a period of 9 months, but with very little detail and without any attacks | 2014 | Enterprise IT | Undisclosed | 🟥 | Custom auth event logs | 2,3 GB | - | | [VAST Challenge 2011](../datasets/vast_2011) | Network | Originated from a challenge about data analytics, focus on network but also contains host logs. Labeling is a bit lacking | 2011 | Enterprise IT | Windows | 🟨 | pcaps, Windows events, misc. 
logs (firewall, Snort, Nessus) | 940 MB | 9,3 GB | | [VAST Challenge 2012](../datasets/vast_2012) | Network | Originated from a challenge about data analytics, focus an a large network being the victim of a botnet | 2012 | Enterprise IT | Undisclosed | 🟨 | Snort alerts, firewall logs | 186 MB | 2,9 GB | diff --git a/content/datasets/user_computer_associations.md b/content/datasets/user_computer_associations.md index c223d08..7134fa3 100644 --- a/content/datasets/user_computer_associations.md +++ b/content/datasets/user_computer_associations.md @@ -1,5 +1,5 @@ --- -title: NEW_ENTRY_NAME +title: User-Computer Authentication Associations in Time --- - [Overview](#overview) @@ -10,48 +10,72 @@ title: NEW_ENTRY_NAME - [Links](#links) - [Data Examples](#data-examples) -| | | -|--------------------------|----------| -| **Network Log Source** | | -| **Network Logs Labeled** | | -| **Host Log Source** | | -| **Host Logs Labeled** | | -| | | -| **Overall Setting** | | -| **OS Types** | | -| **Number of Machines** | | -| **Total Runtime** | | -| **Year of Collection** | | -| **Attack Categories** | | -| **User Emulation** | | -| | | -| **Packed Size** | | -| **Unpacked Size** | | -| **Download Link** | | +| | | +|--------------------------|-----------------------| +| **Network Log Source** | - | +| **Network Logs Labeled** | - | +| **Host Log Source** | Authentication events | +| **Host Logs Labeled** | No | +| | | +| **Overall Setting** | Enterprise IT | +| **OS Types** | Undisclosed | +| **Number of Machines** | 22,284 | +| **Total Runtime** | 9 months | +| **Year of Collection** | 2014 | +| **Attack Categories** | none | +| **User Emulation** | Real users | +| | | +| **Packed Size** | 2,3 GB | +| **Unpacked Size** | n/a | +| **Download Link** | see below | *** ### Overview -A general description of the dataset, giving a brief overview over origin, intended usage and some properties of the dataset. 
+The "User-Computer Authentication Associations in Time" dataset contains 708,304,516 anonymized authentication events collected over a period of nine consecutive months from the Los Alamos National Laboratory (LANL) enterprise network. +This data was used to study the concept of "Credential Hopping" - where an attacker uses stored/cached credentials on a machine to authenticate on another machine in the network - and which strategies can be taken to mitigate the associated risks. +For this purpose, an "authentication graph" was built from these events and analyzed and/or modified using various assumptions and methodologies. +As it does not contain any known malicious events, it is most likely of little interest for intrusion detection research. ### Environment -A description of the environment the dataset originated from, including networks, operating systems, running services, etc. +Details other than "LANL enterprise network" are not provided, except for some statistical values. +There are a total of 11,362 distinct users and 22,284 distinct computers. ### Activity -What kind of activity, benign and malicious, was performed during the period of data collection. +There is no mention of any malicious activity during the collection period. ### Contained Data -What kind of data was collected and how it is present in the dataset, including any processing and labeling. +The dataset consists of 708,304,516 authentication events, one event per line, each described with three values: +- an epoch timestamp (epoch 1 is the start of the collection period; an exact time is not provided) +- the user who successfully authenticated, in the form of "U" plus a unique number for that user (e.g., U1337) +- the computer the given user authenticated on, in the form of "C" plus a unique number for that computer (e.g., C42) + +The authors note that authentication events for some centralized computers, specifically the Active Directory servers, have been removed.
+Access to the dataset can easily be requested on the homepage linked below (the email address you need to supply does not have to be valid). +It can either be downloaded as one text file containing the full nine months (2,3 GB), or nine text files with 30 days of events each. ### Papers -- List of related papers, ideally as DOI links +- [Connected Components and Credential Hopping in Authentication Graphs](https://doi.org/10.1109/SITIS.2014.95) ### Links -- List of related links, such as homepages and download sources +- [Homepage](https://csr.lanl.gov/data/auth/) + +### Related Entries +- Other LANL datasets: + - [Comprehensive, Multi-Source Cybersecurity Events](comp_multi_source_cybersec_events.md) + - [Unified Host and Network Dataset](unified_host_and_network_dataset.md) ### Data Examples -Snippet from the dataset, ideally one for each data type. -Note that multi-word annotations (like `json lines`) will not render properly on GitHub Pages. +The first 10 lines of the dataset ``` -data example +1,U1,C1 +1,U1,C2 +2,U2,C3 +3,U3,C4 +6,U4,C5 +7,U4,C5 +7,U5,C6 +8,U6,C7 +11,U7,C8 +12,U8,C9 ``` \ No newline at end of file From e85b29315820d375195163bc0da85498e08ccedf Mon Sep 17 00:00:00 2001 From: Philipp Boenninghausen Date: Tue, 19 Mar 2024 11:16:34 +0100 Subject: [PATCH 05/21] Add links to related work --- content/related_work.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/content/related_work.md b/content/related_work.md index cce1a59..8c5bc5f 100644 --- a/content/related_work.md +++ b/content/related_work.md @@ -75,7 +75,6 @@ Referenced datasets: - ISOT - [KDD Cup 1999](/intrusion-detection-datasets/content/datasets/kdd_cup_1999) - [Kyoto Honeypot](/intrusion-detection-datasets/content/datasets/kyoto_honeypot) -- Lawrence Berkeley National Laboratory Traces - [NSL-KDD](/intrusion-detection-datasets/content/datasets/nsl_kdd_dataset) - [Twente 2009](/intrusion-detection-datasets/content/datasets/twente_2009) - [UNSW 
NB15](/intrusion-detection-datasets/content/datasets/unsw_nb15) @@ -84,6 +83,7 @@ Referenced datasets: Referenced collections: - CAIDA - DEFCON CTF Archive +- [Lawrence Berkeley National Laboratory Traces](#the-internet-traffic-archive) (alias for: The Internet Traffic Archive) - MAWILab - UMass Trace Repository @@ -106,12 +106,12 @@ Referenced datasets: - [ISCX IDS 2012](/intrusion-detection-datasets/content/datasets/iscx_ids_2012) - [KDD Cup 1999](/intrusion-detection-datasets/content/datasets/kdd_cup_1999) - [Kyoto Honeypot](/intrusion-detection-datasets/content/datasets/kyoto_honeypot) -- Lawrence Berkeley National Laboratory Traces - [NSL-KDD](/intrusion-detection-datasets/content/datasets/nsl_kdd_dataset) - [Twente 2009](/intrusion-detection-datasets/content/datasets/twente_2009) Referenced Collections: - CAIDA +- [Lawrence Berkeley National Laboratory Traces](#the-internet-traffic-archive) (alias for: The Internet Traffic Archive) - DEFCON CTF Archive ### A Survey of Intrusion Detection Systems leveraging Host Data (2019) @@ -177,7 +177,6 @@ Referenced datasets: - [KDD Cup 1999](/intrusion-detection-datasets/content/datasets/kdd_cup_1999) - Kent 2016 - [Kyoto Honeypot](/intrusion-detection-datasets/content/datasets/kyoto_honeypot) -- Lawrence Berkeley National Laboratory Traces - NDSec-1 - [NGIDS-DS](/intrusion-detection-datasets/content/datasets/ngids_dataset) - [NSL-KDD](/intrusion-detection-datasets/content/datasets/nsl_kdd_dataset) @@ -203,6 +202,7 @@ Referenced collections: - [IMPACT](#impact) - [Internet Traffic Archive](#the-internet-traffic-archive) - Kaggle +- [Lawrence Berkeley National Laboratory Traces](#the-internet-traffic-archive) (alias for: The Internet Traffic Archive) - [Malware Traffic Analysis](#malware-traffic-analysis) - Mid-Atlantic CCDC - MAWILab @@ -298,7 +298,6 @@ Datasets that are suitable for this purpose are mentioned as a secondary talking Referenced datasets: - [DARPA'98 Intrusion Detection 
Program](/intrusion-detection-datasets/content/datasets/darpa98) - [ISCX IDS 2012](/intrusion-detection-datasets/content/datasets/iscx_ids_2012) -- Lawrence Berkeley National Laboratory Traces - [KDD Cup 1999](/intrusion-detection-datasets/content/datasets/kdd_cup_1999) - [NSL-KDD](/intrusion-detection-datasets/content/datasets/nsl_kdd_dataset) - TUIDS @@ -306,6 +305,7 @@ Referenced datasets: Referenced collections: - CAIDA - DEFCON CTF archive +- [Lawrence Berkeley National Laboratory Traces](#the-internet-traffic-archive) (alias for: The Internet Traffic Archive) ## Other collections From c87b8dbe8008f9350521f03b29b9ec71d720127e Mon Sep 17 00:00:00 2001 From: Philipp Boenninghausen Date: Tue, 19 Mar 2024 11:23:26 +0100 Subject: [PATCH 06/21] Reformat link --- content/related_work.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/content/related_work.md b/content/related_work.md index 10c1726..6a1a73b 100644 --- a/content/related_work.md +++ b/content/related_work.md @@ -175,7 +175,7 @@ Referenced datasets: - [ISCX IDS 2012](/intrusion-detection-datasets/content/datasets/iscx_ids_2012) - ISOT - [KDD Cup 1999](/intrusion-detection-datasets/content/datasets/kdd_cup_1999) -- [Kent 2016 (alias for: Comprehensive, Multi-Source Cybersecurity Events)](/intrusion-detection-datasets/content/datasets/comp_multi_source_cybersec_events) +- [Kent 2016](/intrusion-detection-datasets/content/datasets/comp_multi_source_cybersec_events) (alias for: Comprehensive, Multi-Source Cybersecurity Events) - [Kyoto Honeypot](/intrusion-detection-datasets/content/datasets/kyoto_honeypot) - Lawrence Berkeley National Laboratory Traces - NDSec-1 From 76ed60616bfaab72358e8f25e66178aabad19ce9 Mon Sep 17 00:00:00 2001 From: Philipp Boenninghausen Date: Tue, 19 Mar 2024 11:43:25 +0100 Subject: [PATCH 07/21] Add missing year of publication --- content/datasets/user_computer_associations.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git 
a/content/datasets/user_computer_associations.md b/content/datasets/user_computer_associations.md index 7134fa3..178bef6 100644 --- a/content/datasets/user_computer_associations.md +++ b/content/datasets/user_computer_associations.md @@ -55,7 +55,7 @@ Access to the dataset can easily be requested on the homepage linked below (the It can either be downloaded as one text file containing the full nine months (2,3 GB), or nine text files with 30 days of events each. ### Papers -- [Connected Components and Credential Hopping in Authentication Graphs](https://doi.org/10.1109/SITIS.2014.95) +- [Connected Components and Credential Hopping in Authentication Graphs (2014)](https://doi.org/10.1109/SITIS.2014.95) ### Links - [Homepage](https://csr.lanl.gov/data/auth/) From dbf5facd1ae604dccd8d18070830a3f2d70d131a Mon Sep 17 00:00:00 2001 From: Philipp Boenninghausen Date: Tue, 19 Mar 2024 11:50:24 +0100 Subject: [PATCH 08/21] Add basic structure for new dataset --- content/all_datasets.md | 1 + content/datasets/gure_kddcup.md | 58 +++++++++++++++++++++++++++++++++ content/related_work.md | 6 ++-- 3 files changed, 62 insertions(+), 3 deletions(-) create mode 100644 content/datasets/gure_kddcup.md diff --git a/content/all_datasets.md b/content/all_datasets.md index 502b402..a6c8ff6 100644 --- a/content/all_datasets.md +++ b/content/all_datasets.md @@ -27,6 +27,7 @@ before-content: gh_buttons.html | [DARPA TC3](../datasets/darpa_tc3) | Host | Custom event logs from network under attack, designed to facilitate provenance tracking | 2018 | Undisclosed | Undisclosed | 🟨 | Custom event logs | 115 GB | - | | [DARPA TC5](../datasets/darpa_tc5) | Host | Custom event logs from network under attack from APT groups, designed to facilitate provenance tracking | 2019 | Undisclosed | Undisclosed | 🟨 | Custom event logs | - | - | | [EVTX to MITRE ATT&CK](../datasets/evtx_to_mitre_attck) | Host | Small dataset providing various events corresponding to certain MITRE tactics/techniques | 2022 | 
Single OS | Windows | 🟩 | Windows events | <1 GB | <1 GB | +| [gureKDDCup](../datasets/gure_kddcup) | | | | | | | | | | | [ISCX Intrusion Detection Evaluation](../datasets/iscx_ids_2012) | Network | Focus on realistic traffic generation in a company network, combined with some basic attacks | 2012 | Enterprise IT | Windows, Ubuntu | 🟩 | pcaps | 84 GB | 87 GB | | [KDD Cup 1999](../datasets/kdd_cup_1999) | Network | Network connection events derived from simulated U.S. Air Force network under attack. No longer appropriate to use for multiple reasons | 1999 | Military IT | Unix | 🟩 | Connection records | 18 MB | 743 MB | | [Kyoto Honeypot](../datasets/kyoto_honeypot) | Network | Collection of features derived from attack traffic targeting honeypots over the span of 9 years | 2006-2015 | Diverse | Windows, Unix, MacOS | 🟩 | Custom network features | 20 GB | - | diff --git a/content/datasets/gure_kddcup.md b/content/datasets/gure_kddcup.md new file mode 100644 index 0000000..7451c2c --- /dev/null +++ b/content/datasets/gure_kddcup.md @@ -0,0 +1,58 @@ +--- +title: gureKDDCup +--- + +- [Overview](#overview) +- [Environment](#environment) +- [Activity](#activity) +- [Contained Data](#contained-data) +- [Papers](#papers) +- [Links](#links) +- [Data Examples](#data-examples) + +| | | +|--------------------------|------------------------------------------------------------------------------------------------------------| +| **Network Log Source** | | +| **Network Logs Labeled** | | +| **Host Log Source** | | +| **Host Logs Labeled** | | +| | | +| **Overall Setting** | | +| **OS Types** | | +| **Number of Machines** | | +| **Total Runtime** | | +| **Year of Collection** | | +| **Attack Categories** | | +| **User Emulation** | | +| | | +| **Packed Size** | | +| **Unpacked Size** | | +| **Download Link** | [goto](http://www.sc.ehu.es/acwaldap/gureKddcup/gureKDDCup/gureKddcup/complete_database/gureKddcup.tar.gz) | + +*** + +### Overview +A general description of the dataset, 
giving a brief overview over origin, intended usage and some properties of the dataset. + +### Environment +A description of the environment the dataset originated from, including networks, operating systems, running services, etc. + +### Activity +What kind of activity, benign and malicious, was performed during the period of data collection. + +### Contained Data +What kind of data was collected and how it is present in the dataset, including any processing and labeling. + +### Papers +- [Service-independent payload analysis to improve intrusion detection in network traffic (2008)](https://dl.acm.org/doi/10.5555/2449288.2449315) + +### Links +- [Homepage](http://www.sc.ehu.es/acwaldap/gureKddcup/galdetegia_jaso.php) +- [Documentation](https://addi.ehu.es/bitstream/handle/10810/20608/20160601_Txostena_gurekddcup_InigoPeronaBalda.pdf?sequence=1) + +### Data Examples +Snippet from the dataset, ideally one for each data type. +Note that multi-word annotations (like `json lines`) will not render properly on GitHub Pages. 
+``` +data example +``` \ No newline at end of file diff --git a/content/related_work.md b/content/related_work.md index cce1a59..de64a43 100644 --- a/content/related_work.md +++ b/content/related_work.md @@ -70,7 +70,7 @@ Referenced datasets: - [CTU 13](/intrusion-detection-datasets/content/datasets/ctu_13) - CIC DoS - [DARPA'98 Intrusion Detection Program](/intrusion-detection-datasets/content/datasets/darpa98) -- Gure-KDD-Cup +- [gureKDDCup](/intrusion-detection-datasets/content/datasets/gure_kddcup) - [ISCX IDS 2012](/intrusion-detection-datasets/content/datasets/iscx_ids_2012) - ISOT - [KDD Cup 1999](/intrusion-detection-datasets/content/datasets/kdd_cup_1999) @@ -130,7 +130,7 @@ Referenced datasets: - [Comprehensive Multi-Source Cybersecurity Events](/intrusion-detection-datasets/content/datasets/comp_multi_source_cybersec_events) - [CTU 13](/intrusion-detection-datasets/content/datasets/ctu_13) - [DARPA'98 Intrusion Detection Program](/intrusion-detection-datasets/content/datasets/darpa98) -- GURE-KDD +- [gureKDDCup](/intrusion-detection-datasets/content/datasets/gure_kddcup) - [KDD Cup 1999](/intrusion-detection-datasets/content/datasets/kdd_cup_1999) - Malware Capture Facility Project - [NSL-KDD](/intrusion-detection-datasets/content/datasets/nsl_kdd_dataset) @@ -282,7 +282,7 @@ Sahu, S. K., Sarangi, S., & Jena, S. K. (2014, February). A detail analysis on i This paper shortly analyzed three papers the authors deem suitable to test their novel preprocessing techniques, which are supposed to improve the performance of various data mining algorithms. 
Referenced datasets: -- GURE-KDD +- [gureKDDCup](/intrusion-detection-datasets/content/datasets/gure_kddcup) - [KDD Cup 1999](/intrusion-detection-datasets/content/datasets/kdd_cup_1999) - [NSL-KDD](/intrusion-detection-datasets/content/datasets/nsl_kdd_dataset) From 1779ab305e3807427c16967fdf56ad608b61965c Mon Sep 17 00:00:00 2001 From: Philipp Boenninghausen Date: Wed, 20 Mar 2024 09:14:37 +0100 Subject: [PATCH 09/21] Add missing ToC entry --- content/datasets/user_computer_associations.md | 1 + 1 file changed, 1 insertion(+) diff --git a/content/datasets/user_computer_associations.md b/content/datasets/user_computer_associations.md index 178bef6..ec6d390 100644 --- a/content/datasets/user_computer_associations.md +++ b/content/datasets/user_computer_associations.md @@ -8,6 +8,7 @@ title: User-Computer Authentication Associations in Time - [Contained Data](#contained-data) - [Papers](#papers) - [Links](#links) +- [Related Entries](#related-entries) - [Data Examples](#data-examples) | | | From a6f12cfdc866d3756a7a4ce84acd2b7a1959d8a6 Mon Sep 17 00:00:00 2001 From: Philipp Boenninghausen Date: Wed, 20 Mar 2024 10:01:46 +0100 Subject: [PATCH 10/21] Correct minor factual error --- content/datasets/kdd_cup_1999.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/content/datasets/kdd_cup_1999.md b/content/datasets/kdd_cup_1999.md index 484b10e..863033c 100644 --- a/content/datasets/kdd_cup_1999.md +++ b/content/datasets/kdd_cup_1999.md @@ -68,7 +68,7 @@ The raw DARPA data, which comes in the form of binary TCP dumps, is divided and million connection records) of training data, and two weeks (~two million connection records) of test data. A connection record is defined as "a sequence of TCP packets starting and ending at some well-defined times, between which data flows to and from a source IP address to a target IP address under some well-defined protocol". 
-Each of these connection records contains 41 features (description linked below), including a label indicating whether +Each of these connection records contains 41 features (description linked below), plus a label as 42nd field indicating whether this event is normal or malicious, which in the latter case also references the specific attack that event belongs to. The KDD'99 dataset fixes some issues present in its DARPA foundation, which was severely affected by simulation From d34a447e5f8f833e5717e531e679827c3b767caf Mon Sep 17 00:00:00 2001 From: Philipp Boenninghausen Date: Wed, 20 Mar 2024 10:01:59 +0100 Subject: [PATCH 11/21] Add information about gureKDDCup dataset --- content/all_datasets.md | 2 +- content/datasets/gure_kddcup.md | 98 ++++++++++++++++++++++++++------- 2 files changed, 78 insertions(+), 22 deletions(-) diff --git a/content/all_datasets.md b/content/all_datasets.md index a6c8ff6..fc4cbdd 100644 --- a/content/all_datasets.md +++ b/content/all_datasets.md @@ -27,7 +27,7 @@ before-content: gh_buttons.html | [DARPA TC3](../datasets/darpa_tc3) | Host | Custom event logs from network under attack, designed to facilitate provenance tracking | 2018 | Undisclosed | Undisclosed | 🟨 | Custom event logs | 115 GB | - | | [DARPA TC5](../datasets/darpa_tc5) | Host | Custom event logs from network under attack from APT groups, designed to facilitate provenance tracking | 2019 | Undisclosed | Undisclosed | 🟨 | Custom event logs | - | - | | [EVTX to MITRE ATT&CK](../datasets/evtx_to_mitre_attck) | Host | Small dataset providing various events corresponding to certain MITRE tactics/techniques | 2022 | Single OS | Windows | 🟩 | Windows events | <1 GB | <1 GB | -| [gureKDDCup](../datasets/gure_kddcup) | | | | | | | | | | +| [gureKDDCup](../datasets/gure_kddcup) | Network | An extension of the KDDCup 1999 dataset, adding additional information about payloads to each connection record | 2008 | Military IT | Unix | 🟩 | Connection records with payload information | 10 GB | - 
| | [ISCX Intrusion Detection Evaluation](../datasets/iscx_ids_2012) | Network | Focus on realistic traffic generation in a company network, combined with some basic attacks | 2012 | Enterprise IT | Windows, Ubuntu | 🟩 | pcaps | 84 GB | 87 GB | | [KDD Cup 1999](../datasets/kdd_cup_1999) | Network | Network connection events derived from simulated U.S. Air Force network under attack. No longer appropriate to use for multiple reasons | 1999 | Military IT | Unix | 🟩 | Connection records | 18 MB | 743 MB | | [Kyoto Honeypot](../datasets/kyoto_honeypot) | Network | Collection of features derived from attack traffic targeting honeypots over the span of 9 years | 2006-2015 | Diverse | Windows, Unix, MacOS | 🟩 | Custom network features | 20 GB | - | diff --git a/content/datasets/gure_kddcup.md b/content/datasets/gure_kddcup.md index 7451c2c..6203ba9 100644 --- a/content/datasets/gure_kddcup.md +++ b/content/datasets/gure_kddcup.md @@ -8,51 +8,107 @@ title: gureKDDCup - [Contained Data](#contained-data) - [Papers](#papers) - [Links](#links) +- [Related Entries](#related-entries) - [Data Examples](#data-examples) | | | |--------------------------|------------------------------------------------------------------------------------------------------------| -| **Network Log Source** | | -| **Network Logs Labeled** | | -| **Host Log Source** | | -| **Host Logs Labeled** | | +| **Network Log Source** | Connection records with payload | +| **Network Logs Labeled** | Yes | +| **Host Log Source** | - | +| **Host Logs Labeled** | - | | | | -| **Overall Setting** | | -| **OS Types** | | -| **Number of Machines** | | -| **Total Runtime** | | -| **Year of Collection** | | -| **Attack Categories** | | -| **User Emulation** | | +| **Overall Setting** | Military IT | +| **OS Types** | Linux 2.0.27
SunOS 4.1.4
Sun Solaris 2.5.1
Windows NT | +| **Number of Machines** | 1000s | +| **Total Runtime** | Nine weeks | +| **Year of Collection** | 1998 | +| **Attack Categories** | DoS
Remote to Local
User to Root
Surveillance/Probing | +| **User Emulation** | Scripts for traffic generation, actual humans for performing complex tasks | | | | -| **Packed Size** | | -| **Unpacked Size** | | +| **Packed Size** | 10 GB | +| **Unpacked Size** | n/a | | **Download Link** | [goto](http://www.sc.ehu.es/acwaldap/gureKddcup/gureKDDCup/gureKddcup/complete_database/gureKddcup.tar.gz) | *** ### Overview -A general description of the dataset, giving a brief overview over origin, intended usage and some properties of the dataset. +The gureKDDCup dataset is an extension of the well-known KDDCup 1999 dataset, which consists of connection records, adding payload information to each record. +Consequently, it is also based on the DARPA'98 Intrusion Detection Program; +information about both of these datasets can be found in the [Related Entries](#related-entries) section. +Note that the authors did not directly copy the KDDCup 1999 dataset, but instead recreated it using the same methodology, including additional information in the process. ### Environment -A description of the environment the dataset originated from, including networks, operating systems, running services, etc. +The simulated Air Force base consists of a small number of hosts, leveraging "custom software" to appear as if they were +1000s of hosts with different IP addresses. ### Activity -What kind of activity, benign and malicious, was performed during the period of data collection. +Within the network, automated users perform an array of tasks such as sending mails, browsing, or using services like +FTP, telnet or SNMP. +The total duration of this simulation was nine weeks. +Any protective devices such as firewalls are omitted, as "the focus was on detecting attacks, and not preventing +attacks". +All attacks are performed from the outside of this network, and a sniffer is located at the entry point of the network +to capture this traffic. 
+Attacks belong to one of four categories: + +- DoS +- Remote to Local +- User to Root +- Surveillance/Probing ### Contained Data -What kind of data was collected and how it is present in the dataset, including any processing and labeling. +The raw DARPA data, which comes in the form of binary TCP dumps, is transformed into connection records, mimicking the methodology of the KDDCup 1999 dataset. +This entire process is documented extensively in a separate document, which is linked below. +A connection record is defined as "a sequence of TCP packets starting and ending at some well-defined times, between +which data flows to and from a source IP address to a target IP address under some well-defined protocol". +Just as with the KDDCup dataset, each record contains 41 features (described in section C.2 of the documentation), with a 42nd label indicating whether this event is normal or malicious, which in the latter case also references the specific attack that event belongs to. + +As mentioned, the distinguishing factor here is the inclusion of additional payload information. +That is, for each connection record, three additional files are generated: +- `*.a`: sent packets' payloads, sorted by time +- `*.b`: received packets' payloads, sorted by time +- `*.c`: all packet payloads of the connection, sorted by time + +The filename before the extension is equal to the number of the associated connection record. +Data is divided into seven weeks, each of which contains five folders, one for every workday (Monday to Friday). +Each of those contains the following data: +- `gureKddcup.list`: Connection records for that day. +The first 6 attributes are: connection_number, start_time, orig_port, resp_port, orig_ip, resp_ip (information to identify the connection), followed by the cited 41 attributes plus class (see data example below) +- `a-matched`: All sent packets' payloads of that day's connections, one file per connection record. 
+Each filename corresponds to a connection_number in the list of connection records. +- `b-matched`: All received packets' payloads of that day's connections, one file per connection record. +Each filename corresponds to a connection_number in the list of connection records. +- `c-matched`: All packets' payloads of that day's connections, one file per connection record. +Each filename corresponds to a connection_number in the list of connection records. + +The authors also supply a subset of this data called gureKddcup6percent. +It provides the same information in the same layout but, as the name suggests, contains only 6% of the original connection records plus associated payloads. +This sample contains all no-flood attacks, and a random selection of normal connections. ### Papers - [Service-independent payload analysis to improve intrusion detection in network traffic (2008)](https://dl.acm.org/doi/10.5555/2449288.2449315) ### Links -- [Homepage](http://www.sc.ehu.es/acwaldap/gureKddcup/galdetegia_jaso.php) +- [Homepage](http://www.sc.ehu.es/acwaldap/gureKddcup/galdetegia_jaso.php) (form does not have to be filled out) - [Documentation](https://addi.ehu.es/bitstream/handle/10810/20608/20160601_Txostena_gurekddcup_InigoPeronaBalda.pdf?sequence=1) +- [Link Hub](http://www.sc.ehu.es/acwaldap/) (in case the homepage link becomes unavailable) + +### Related Entries +- [DARPA'98 Intrusion Detection Program](darpa98.md) +- [KDD Cup 1999](kdd_cup_1999.md) ### Data Examples -Snippet from the dataset, ideally one for each data type. -Note that multi-word annotations (like `json lines`) will not render properly on GitHub Pages. 
+Connection records taken from `gureKddcup/Week6/Thursday/gureKddcup.list/gureKddcup-matched.list` ``` -data example +64558768 899989341.327858 8 0 197.218.177.69 172.16.114.115 0.000000 icmp 8 SH 10 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0 12 0.000000 0.000000 0.120000 1.000000 0.000000 0.000000 0.000000 0.000000 ipsweep +64558769 899989341.638201 4136 80 172.16.113.84 192.43.70.122 0.039594 tcp 80 SF 160 479 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0 9 0.000000 0.000000 0.000000 1.000000 0.000000 0.111111 0.000000 0.000000 +64558771 899989342.617289 1904 161 194.27.251.21 192.168.1.1 0.000000 udp 161 S0 105 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 38 28 0.736842 0.263158 0.000000 0.000000 0.368421 0.500000 0.000000 0.000000 +64558772 899989342.617289 161 1904 192.168.1.1 194.27.251.21 0.045382 udp 161 SF 0 146 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1.000000 1.000000 0.000000 0.000000 1.000000 0.000000 0.000000 39 29 0.743590 0.256410 0.010000 0.000000 0.384615 0.517241 0.000000 0.000000 +64558773 899989343.121947 49724 928 206.48.44.18 172.16.112.50 0.000449 tcp 928 REJ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 25 0 0.000000 1.000000 0.250000 0.000000 0.000000 0.000000 1.000000 0.000000 portsweep +64558774 899989343.345483 4141 25 172.16.113.84 194.7.248.153 2.057617 tcp 25 SF 3044 325 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0 12 0.000000 0.000000 0.000000 1.000000 0.000000 0.166667 0.000000 0.000000 +64558776 899989345.407192 4144 25 172.16.113.84 196.37.75.158 3.208491 tcp 25 SF 3047 331 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 2 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0 14 0.000000 0.000000 0.000000 1.000000 0.000000 0.142857 0.000000 0.000000 +64558777 
899989346.151906 49724 91 206.48.44.18 172.16.112.50 0.000430 tcp 91 REJ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 26 0 0.000000 1.000000 0.260000 0.000000 0.000000 0.000000 1.000000 0.000000 portsweep +64558778 899989346.203066 26326 25 197.182.91.233 172.16.112.207 0.905250 tcp 25 SF 4536 329 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 2 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0 13 0.000000 0.000000 0.000000 1.000000 0.000000 0.153846 0.000000 0.000000 +64558779 899989346.716433 8 0 197.218.177.69 172.16.114.116 0.000000 icmp 8 SH 10 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0 13 0.000000 0.000000 0.130000 1.000000 0.000000 0.000000 0.000000 0.000000 ipsweep ``` \ No newline at end of file From 5c12c5525ba25c6359ac3eb783fdcf004a393974 Mon Sep 17 00:00:00 2001 From: Philipp Boenninghausen Date: Wed, 20 Mar 2024 10:19:30 +0100 Subject: [PATCH 12/21] Add basic structure for new dataset --- content/all_datasets.md | 1 + content/datasets/cic_dos.md | 57 +++++++++++++++++++++++++++++++++++++ 2 files changed, 58 insertions(+) create mode 100644 content/datasets/cic_dos.md diff --git a/content/all_datasets.md b/content/all_datasets.md index 502b402..7db3328 100644 --- a/content/all_datasets.md +++ b/content/all_datasets.md @@ -16,6 +16,7 @@ before-content: gh_buttons.html | [AWSCTD](../datasets/awsctd) | Host | Syscalls collected from ~10k malware samples running on Windows 7, no user emulation | 2018 | Single OS | Windows | 🟩 | Sequences of syscall numbers | 10 MB | 558 MB | | [BotsV3](../datasets/botsv3) [_ON HOLD_] | | _Requires usage of Splunk + a bunch of extensions, postponed_ | 2020 | | | | | 17 GB | - | | [CDX CTF 2009](../datasets/cdx_2009) | Network | Dataset captured from a CTF event, generally intended to provide methods for reliable generating labeled datasets from such events | 2009 | Enterprise IT | Windows, Linux | 🟨 | 
pcaps, Snort IDS alerts, Apache logs, Splunk logs | 12 GB | 15,3 GB | +| [CIC DoS](../datasets/cic_dos) | | | | | | | | | | | [CIC-IDS2017](../datasets/cic_ids2017) | Network | Simulation of medium-sized company network under attack, focuses solely on network traffic | 2017 | Enterprise IT | Windows, Linux | 🟩 | pcaps, NetFlows, custom network features | 48,4 GB | 50 GB | | [CIDD](../datasets/cidd) | - | Spin on the DARPA'98 dataset, correlating user behavior over different systems/environments for behavior-based IDSs | 2012 | Military IT | Unix | 🟩 | Sequences of user "audits" | - | 22 GB | | [CLUE-LDS](../datasets/clue_lds) | - | Database of real user behavior without known attacks, for evaluation of methods detecting shifts in user behavior | 2022 | Enterprise Subsystem | - (hBox) | 🟥 | Custom event logs | 640 MB | 14,9 GB | diff --git a/content/datasets/cic_dos.md b/content/datasets/cic_dos.md new file mode 100644 index 0000000..5913bbe --- /dev/null +++ b/content/datasets/cic_dos.md @@ -0,0 +1,57 @@ +--- +title: CIC DoS +--- + +- [Overview](#overview) +- [Environment](#environment) +- [Activity](#activity) +- [Contained Data](#contained-data) +- [Papers](#papers) +- [Links](#links) +- [Data Examples](#data-examples) + +| | | +|--------------------------|----------| +| **Network Log Source** | | +| **Network Logs Labeled** | | +| **Host Log Source** | | +| **Host Logs Labeled** | | +| | | +| **Overall Setting** | | +| **OS Types** | | +| **Number of Machines** | | +| **Total Runtime** | | +| **Year of Collection** | | +| **Attack Categories** | | +| **User Emulation** | | +| | | +| **Packed Size** | | +| **Unpacked Size** | | +| **Download Link** | | + +*** + +### Overview +A general description of the dataset, giving a brief overview over origin, intended usage and some properties of the dataset. + +### Environment +A description of the environment the dataset originated from, including networks, operating systems, running services, etc. 
+ +### Activity +What kind of activity, benign and malicious, was performed during the period of data collection. + +### Contained Data +What kind of data was collected and how it is present in the dataset, including any processing and labeling. + +### Papers +- [Detecting HTTP-based Application Layer DoS Attacks on Web Servers in the Presence of Sampling (2017)](https://doi.org/10.1016/j.comnet.2017.03.018) + +### Links +- [Homepage](https://www.unb.ca/cic/datasets/dos-dataset.html) + +### Data Examples +Snippet from the dataset, ideally one for each data type. +Note that multi-word annotations (like `json lines`) will not render properly on GitHub Pages. +``` +data example +``` \ No newline at end of file From e01bc0f9be644aa318edb22e1c60aaa53fd7755d Mon Sep 17 00:00:00 2001 From: Philipp Boenninghausen Date: Wed, 20 Mar 2024 10:30:21 +0100 Subject: [PATCH 13/21] Update links and fix minor typos --- content/related_work.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/content/related_work.md b/content/related_work.md index cce1a59..9c7ea74 100644 --- a/content/related_work.md +++ b/content/related_work.md @@ -3,10 +3,10 @@ title: Related Work --- This page lists publications and collections covering IDS datasets. -Related publications, sorted by year or release, are any academic work that at least partially covers the topic of available IDS datasets. +Related publications, sorted by year of release, are any academic work that at least partially covers the topic of available IDS datasets. Collections, sorted alphabetically, simply features agglomerations of IDS-related datasets not backed by a scientific publication. -Each entry consists of citation and a brief description of the survey's scope of selected datasets. +Each entry consists of a citation and a brief description of the survey's scope of selected datasets. 
Additionally, for publications, all datasets discussed in the survey are also listed, linking to their respective entries on this website, if available. ## Contents @@ -68,7 +68,7 @@ Referenced datasets: - [CIC-IDS 2017](/intrusion-detection-datasets/content/datasets/cic_ids2017) - [CSE-CIC-IDS 2018](/intrusion-detection-datasets/content/datasets/cse_cic_ids2018) - [CTU 13](/intrusion-detection-datasets/content/datasets/ctu_13) -- CIC DoS +- [CIC DoS](/intrusion-detection-datasets/content/datasets/cic_dos) - [DARPA'98 Intrusion Detection Program](/intrusion-detection-datasets/content/datasets/darpa98) - Gure-KDD-Cup - [ISCX IDS 2012](/intrusion-detection-datasets/content/datasets/iscx_ids_2012) @@ -165,7 +165,7 @@ Referenced datasets: - Booters Dataset - ISCX Botnet 2014 - [CDX CTF 2009](/intrusion-detection-datasets/content/datasets/cdx_2009) -- CIC DoS +- [CIC DoS](/intrusion-detection-datasets/content/datasets/cic_dos) - [CIC-IDS 2017](/intrusion-detection-datasets/content/datasets/cic_ids2017) - CIDDS-001 & 002 - [CSE-CIC-IDS 2018](/intrusion-detection-datasets/content/datasets/cse_cic_ids2018) From 6c5beee268493da4547abb5260d6fd4e6c20b558 Mon Sep 17 00:00:00 2001 From: schlippe Date: Sun, 31 Mar 2024 10:28:42 +0200 Subject: [PATCH 14/21] Refer to original dataset instead of repeating text --- content/datasets/darpa98.md | 17 ++--------------- content/datasets/gure_kddcup.md | 17 ++--------------- 2 files changed, 4 insertions(+), 30 deletions(-) diff --git a/content/datasets/darpa98.md b/content/datasets/darpa98.md index 4098e4a..7a61f4d 100644 --- a/content/datasets/darpa98.md +++ b/content/datasets/darpa98.md @@ -43,24 +43,11 @@ Due to its age and a number of flaws, it should be used with reservations, if at ### Environment -The simulated Air Force base consists of a small number of hosts, leveraging "custom software" to appear as if they were -1000s of hosts with different IP addresses. 
+Refer to the underlying [DARPA'98 Intrusion Detection Program](darpa98.md). ### Activity -Within the network, automated users perform an array of tasks such as sending mails, browsing, or using services like -FTP, telnet or SNMP. -The total duration of this simulation was nine weeks. -Any protective devices such as firewalls are omitted, as "the focus was on detecting attacks, and not preventing -attacks". -All attacks are performed from the outside of this network, and a sniffer is located at the entry point of the network -to capture this traffic. -Attacks belong to one of four categories: - -- DoS -- Remote to Local -- User to Root -- Surveillance/Probing +Refer to the underlying [DARPA'98 Intrusion Detection Program](darpa98.md). ### Contained Data diff --git a/content/datasets/gure_kddcup.md b/content/datasets/gure_kddcup.md index 6203ba9..9a695a2 100644 --- a/content/datasets/gure_kddcup.md +++ b/content/datasets/gure_kddcup.md @@ -39,23 +39,10 @@ information about both of these datasets can be found in the [Related Entries](# Note that the authors did not directly copy the KDDCup 1999 dataset, but instead recreated it using the same methodology, including additional information in the process. ### Environment -The simulated Air Force base consists of a small number of hosts, leveraging "custom software" to appear as if they were -1000s of hosts with different IP addresses. +Refer to the underlying [DARPA'98 Intrusion Detection Program](darpa98.md). ### Activity -Within the network, automated users perform an array of tasks such as sending mails, browsing, or using services like -FTP, telnet or SNMP. -The total duration of this simulation was nine weeks. -Any protective devices such as firewalls are omitted, as "the focus was on detecting attacks, and not preventing -attacks". -All attacks are performed from the outside of this network, and a sniffer is located at the entry point of the network -to capture this traffic. 
-Attacks belong to one of four categories: - -- DoS -- Remote to Local -- User to Root -- Surveillance/Probing +Refer to the underlying [DARPA'98 Intrusion Detection Program](darpa98.md). ### Contained Data The raw DARPA data, which comes in the form of binary TCP dumps, is transformed into connection records, mimicking the methodology of the KDDCup 1999 dataset. From 20af38ea61632ab874fb8488f261f11d57d316fe Mon Sep 17 00:00:00 2001 From: schlippe Date: Sun, 31 Mar 2024 10:34:46 +0200 Subject: [PATCH 15/21] Refer to original dataset instead of repeating text --- content/datasets/darpa98.md | 17 +++++++++++++++-- content/datasets/kdd_cup_1999.md | 17 ++--------------- 2 files changed, 17 insertions(+), 17 deletions(-) diff --git a/content/datasets/darpa98.md b/content/datasets/darpa98.md index 7a61f4d..4098e4a 100644 --- a/content/datasets/darpa98.md +++ b/content/datasets/darpa98.md @@ -43,11 +43,24 @@ Due to its age and a number of flaws, it should be used with reservations, if at ### Environment -Refer to the underlying [DARPA'98 Intrusion Detection Program](darpa98.md). +The simulated Air Force base consists of a small number of hosts, leveraging "custom software" to appear as if they were +1000s of hosts with different IP addresses. ### Activity -Refer to the underlying [DARPA'98 Intrusion Detection Program](darpa98.md). +Within the network, automated users perform an array of tasks such as sending mails, browsing, or using services like +FTP, telnet or SNMP. +The total duration of this simulation was nine weeks. +Any protective devices such as firewalls are omitted, as "the focus was on detecting attacks, and not preventing +attacks". +All attacks are performed from the outside of this network, and a sniffer is located at the entry point of the network +to capture this traffic. 
+Attacks belong to one of four categories: + +- DoS +- Remote to Local +- User to Root +- Surveillance/Probing ### Contained Data diff --git a/content/datasets/kdd_cup_1999.md b/content/datasets/kdd_cup_1999.md index 863033c..43a6cd2 100644 --- a/content/datasets/kdd_cup_1999.md +++ b/content/datasets/kdd_cup_1999.md @@ -43,24 +43,11 @@ Like the dataset it is based on, due to its age and a number of flaws, it should ### Environment -The simulated Air Force base consists of a small number of hosts, leveraging "custom software" to appear as if they were -1000s of hosts with different IP addresses. +Refer to the underlying [DARPA'98 Intrusion Detection Program](darpa98.md). ### Activity -Within the network, automated users perform an array of tasks such as sending mails, browsing, or using services like -FTP, telnet or SNMP. -The total duration of this simulation was nine weeks. -Any protective devices such as firewalls are omitted, as "the focus was on detecting attacks, and not preventing -attacks". -All attacks are performed from the outside of this network, and a sniffer is located at the entry point of the network -to capture this traffic. -Attacks belong to one of four categories: - -- DoS -- Remote to Local -- User to Root -- Surveillance/Probing +Refer to the underlying [DARPA'98 Intrusion Detection Program](darpa98.md). 
### Contained Data From 85ba9b0c34b15d8eb48191a10a4d277601b092cf Mon Sep 17 00:00:00 2001 From: schlippe Date: Tue, 2 Apr 2024 13:27:40 +0200 Subject: [PATCH 16/21] Add information about CIC DoS --- content/all_datasets.md | 2 +- content/datasets/cic_dos.md | 76 ++++++++++++++++++++++--------------- 2 files changed, 47 insertions(+), 31 deletions(-) diff --git a/content/all_datasets.md b/content/all_datasets.md index 7db3328..762c285 100644 --- a/content/all_datasets.md +++ b/content/all_datasets.md @@ -16,7 +16,7 @@ before-content: gh_buttons.html | [AWSCTD](../datasets/awsctd) | Host | Syscalls collected from ~10k malware samples running on Windows 7, no user emulation | 2018 | Single OS | Windows | 🟩 | Sequences of syscall numbers | 10 MB | 558 MB | | [BotsV3](../datasets/botsv3) [_ON HOLD_] | | _Requires usage of Splunk + a bunch of extensions, postponed_ | 2020 | | | | | 17 GB | - | | [CDX CTF 2009](../datasets/cdx_2009) | Network | Dataset captured from a CTF event, generally intended to provide methods for reliable generating labeled datasets from such events | 2009 | Enterprise IT | Windows, Linux | 🟨 | pcaps, Snort IDS alerts, Apache logs, Splunk logs | 12 GB | 15,3 GB | -| [CIC DoS](../datasets/cic_dos) | | | | | | | | | | +| [CIC DoS](../datasets/cic_dos) | Network | Dataset focusing on different DoS attacks targeting the application layer (instead of network layer), but no longer available | 2017 | Enterprise IT | Linux | 🟩 | Network traffic (unknown format) | - | 4,6 GB | | [CIC-IDS2017](../datasets/cic_ids2017) | Network | Simulation of medium-sized company network under attack, focuses solely on network traffic | 2017 | Enterprise IT | Windows, Linux | 🟩 | pcaps, NetFlows, custom network features | 48,4 GB | 50 GB | | [CIDD](../datasets/cidd) | - | Spin on the DARPA'98 dataset, correlating user behavior over different systems/environments for behavior-based IDSs | 2012 | Military IT | Unix | 🟩 | Sequences of user "audits" | - | 22 GB | | 
[CLUE-LDS](../datasets/clue_lds) | - | Database of real user behavior without known attacks, for evaluation of methods detecting shifts in user behavior | 2022 | Enterprise Subsystem | - (hBox) | 🟥 | Custom event logs | 640 MB | 14,9 GB | diff --git a/content/datasets/cic_dos.md b/content/datasets/cic_dos.md index 5913bbe..f3c8b84 100644 --- a/content/datasets/cic_dos.md +++ b/content/datasets/cic_dos.md @@ -8,40 +8,60 @@ title: CIC DoS - [Contained Data](#contained-data) - [Papers](#papers) - [Links](#links) -- [Data Examples](#data-examples) - -| | | -|--------------------------|----------| -| **Network Log Source** | | -| **Network Logs Labeled** | | -| **Host Log Source** | | -| **Host Logs Labeled** | | -| | | -| **Overall Setting** | | -| **OS Types** | | -| **Number of Machines** | | -| **Total Runtime** | | -| **Year of Collection** | | -| **Attack Categories** | | -| **User Emulation** | | -| | | -| **Packed Size** | | -| **Unpacked Size** | | -| **Download Link** | | +- [Related Entries](#related-entries) + +| | | +|--------------------------|-----------------------| +| **Network Log Source** | Unknown | +| **Network Logs Labeled** | Presumably | +| **Host Log Source** | - | +| **Host Logs Labeled** | - | +| | | +| **Overall Setting** | Single OS | +| **OS Types** | Apache Linux | +| **Number of Machines** | 1 | +| **Total Runtime** | 24 hours | +| **Year of Collection** | 2017 | +| **Attack Categories** | Application-layer DoS | +| **User Emulation** | n/a | +| | | +| **Packed Size** | n/a | +| **Unpacked Size** | 4,6 GB | +| **Download Link** | Currently unavailable | *** ### Overview -A general description of the dataset, giving a brief overview over origin, intended usage and some properties of the dataset. +The Canadian Institute for Cybersecurity (CIC) DoS dataset focuses on Denial-of-Services attacks targeting the application layer (as opposed to the network layer). 
+The authors argue that these types of DoS attacks commonly avoid traditional network-layer based detection mechanisms, requiring a novel approach. +Specifically, they focus mostly on low-volume DoS attacks, which are characterized by "small amounts of attack traffic transmitted strategically to a victim", whereas high-volume attacks are more similar to traditional DoS attacks, relying on flooding the application layer with requests. +As part of this research, and due to the lack of usable datasets of this kind, the authors introduce the CIC DoS dataset, which consists of 24 hours of traffic collected from a webserver being the victim of such attacks. +However, the dataset is no longer available for unknown reasons, making it both difficult and somewhat pointless to provide a lot of detailed information here. ### Environment -A description of the environment the dataset originated from, including networks, operating systems, running services, etc. +The victim setup consists of a webserver running Apache Linux v2.2.22, PHP5 and Drupal v7 as a content management system. +Further details are not available. ### Activity -What kind of activity, benign and malicious, was performed during the period of data collection. +The declared goal of executed attacks was to render services on the server side unresponsive while being as stealthy and resource-efficient as possible, including stopping attacks as soon as servers became unresponsive. +The authors state that attacks were selected to match the most common types of application layer DoS, resulting in a mix of high- and log-volume attacks. 
+These attacks were executed leveraging a several publicly available tools such as [Goldeneye](https://github.com/jseidl/GoldenEye) or [Slowloris](https://github.com/gkbrk/slowloris), for a total of eight attacks: +- High-volume HTTP attacks: + - DoS improved GET + - DDoS GET + - DoS GET +- Low-volume HTTP attacks + - slow-send body (twice with different tools) + - slow-send headers (twice with different tools) + - slow-read + +Additional details can be found in chapter 6 of the cited paper. ### Contained Data -What kind of data was collected and how it is present in the dataset, including any processing and labeling. +Traffic from executed attacks was intermixed with benign traces from the [ISCX Intrusion Detection Evaluation Dataset](iscx_ids_2012.md). +Attack traffic was presumably modified to target servers from the ISCX environment, for a total of 24 hours of attack traffic. +In which format (pcaps, NetFlows, custom features, etc.) this data is available is unknown and also not detailed in the paper. +I would assume data is labeled, but obviously have no way to confirm this. ### Papers - [Detecting HTTP-based Application Layer DoS Attacks on Web Servers in the Presence of Sampling (2017)](https://doi.org/10.1016/j.comnet.2017.03.018) @@ -49,9 +69,5 @@ What kind of data was collected and how it is present in the dataset, including ### Links - [Homepage](https://www.unb.ca/cic/datasets/dos-dataset.html) -### Data Examples -Snippet from the dataset, ideally one for each data type. -Note that multi-word annotations (like `json lines`) will not render properly on GitHub Pages. 
-``` -data example -``` \ No newline at end of file +### Related Entries +- [ISCX Intrusion Detection Evaluation Dataset](iscx_ids_2012.md) \ No newline at end of file From fe6e7803b57008ac97db863ef4a1b306b01d5990 Mon Sep 17 00:00:00 2001 From: schlippe Date: Tue, 2 Apr 2024 13:46:11 +0200 Subject: [PATCH 17/21] Add basic structure for new dataset --- content/all_datasets.md | 1 + content/datasets/cic_ddos.md | 58 ++++++++++++++++++++++++++++++++++++ 2 files changed, 59 insertions(+) create mode 100644 content/datasets/cic_ddos.md diff --git a/content/all_datasets.md b/content/all_datasets.md index 13ac171..795043d 100644 --- a/content/all_datasets.md +++ b/content/all_datasets.md @@ -15,6 +15,7 @@ before-content: gh_buttons.html | [ASNM Datasets](../datasets/asnm_datasets) | Network | Specialized features extracted from instances of remote buffer overflow attacks for the purpose of anomaly-based detection | 2009-2018 | Mixed | Windows, Linux | 🟩 | Custom NetFlows | 21 MB | 95 GB | | [AWSCTD](../datasets/awsctd) | Host | Syscalls collected from ~10k malware samples running on Windows 7, no user emulation | 2018 | Single OS | Windows | 🟩 | Sequences of syscall numbers | 10 MB | 558 MB | | [CDX CTF 2009](../datasets/cdx_2009) | Network | Dataset captured from a CTF event, generally intended to provide methods for reliable generating labeled datasets from such events | 2009 | Enterprise IT | Windows, Linux | 🟨 | pcaps, Snort IDS alerts, Apache logs, Splunk logs | 12 GB | 15,3 GB | +| [CIC-DDoS2019](../datasets/cic_ddos) | | | | | | | | | | | [CIC-IDS2017](../datasets/cic_ids2017) | Network | Simulation of medium-sized company network under attack, focuses solely on network traffic | 2017 | Enterprise IT | Windows, Linux | 🟩 | pcaps, NetFlows, custom network features | 48,4 GB | 50 GB | | [CIDD](../datasets/cidd) | - | Spin on the DARPA'98 dataset, correlating user behavior over different systems/environments for behavior-based IDSs | 2012 | Military IT | Unix | 🟩 | 
Sequences of user "audits" | - | 22 GB | | [CLUE-LDS](../datasets/clue_lds) | - | Database of real user behavior without known attacks, for evaluation of methods detecting shifts in user behavior | 2022 | Enterprise Subsystem | - (hBox) | 🟥 | Custom event logs | 640 MB | 14,9 GB | diff --git a/content/datasets/cic_ddos.md b/content/datasets/cic_ddos.md new file mode 100644 index 0000000..ed3cac6 --- /dev/null +++ b/content/datasets/cic_ddos.md @@ -0,0 +1,58 @@ +--- +title: CIC-DDos2019 +--- + +- [Overview](#overview) +- [Environment](#environment) +- [Activity](#activity) +- [Contained Data](#contained-data) +- [Papers](#papers) +- [Links](#links) +- [Data Examples](#data-examples) + +| | | +|--------------------------|---------------------------------------------------------------| +| **Network Log Source** | | +| **Network Logs Labeled** | | +| **Host Log Source** | | +| **Host Logs Labeled** | | +| | | +| **Overall Setting** | | +| **OS Types** | | +| **Number of Machines** | | +| **Total Runtime** | | +| **Year of Collection** | | +| **Attack Categories** | | +| **User Emulation** | | +| | | +| **Packed Size** | | +| **Unpacked Size** | | +| **Download Link** | [goto](http://205.174.165.80/CICDataset/CICDDoS2019/Dataset/) | + +*** + +### Overview +A general description of the dataset, giving a brief overview over origin, intended usage and some properties of the dataset. + +### Environment +A description of the environment the dataset originated from, including networks, operating systems, running services, etc. + +### Activity +What kind of activity, benign and malicious, was performed during the period of data collection. + +### Contained Data +What kind of data was collected and how it is present in the dataset, including any processing and labeling. 
+ +### Papers +- List of related papers, ideally as DOI links + +### Links +- [Homepage](https://www.unb.ca/cic/datasets/ddos-2019.html) +- [Download](http://205.174.165.80/CICDataset/CICDDoS2019/Dataset/) + +### Data Examples +Snippet from the dataset, ideally one for each data type. +Note that multi-word annotations (like `json lines`) will not render properly on GitHub Pages. +``` +data example +``` \ No newline at end of file From a6a93c972f4a7482f97e355b28a8f857e64fb521 Mon Sep 17 00:00:00 2001 From: schlippe Date: Wed, 3 Apr 2024 10:03:13 +0200 Subject: [PATCH 18/21] Fix minor typo --- content/datasets/cic_dos.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/content/datasets/cic_dos.md b/content/datasets/cic_dos.md index f3c8b84..ae1823b 100644 --- a/content/datasets/cic_dos.md +++ b/content/datasets/cic_dos.md @@ -45,7 +45,7 @@ Further details are not available. ### Activity The declared goal of executed attacks was to render services on the server side unresponsive while being as stealthy and resource-efficient as possible, including stopping attacks as soon as servers became unresponsive. The authors state that attacks were selected to match the most common types of application layer DoS, resulting in a mix of high- and log-volume attacks. 
-These attacks were executed leveraging a several publicly available tools such as [Goldeneye](https://github.com/jseidl/GoldenEye) or [Slowloris](https://github.com/gkbrk/slowloris), for a total of eight attacks:
+These attacks were executed leveraging several publicly available tools such as [Goldeneye](https://github.com/jseidl/GoldenEye) or [Slowloris](https://github.com/gkbrk/slowloris), for a total of eight attacks:
 - High-volume HTTP attacks:
   - DoS improved GET
   - DDoS GET

From 651f573a1553e4d4a252480b2ba750442d23170d Mon Sep 17 00:00:00 2001
From: schlippe
Date: Wed, 3 Apr 2024 10:08:32 +0200
Subject: [PATCH 19/21] Update log data description

---
 content/datasets/cse_cic_ids2018.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/content/datasets/cse_cic_ids2018.md b/content/datasets/cse_cic_ids2018.md
index 9b205a4..04d44af 100644
--- a/content/datasets/cse_cic_ids2018.md
+++ b/content/datasets/cse_cic_ids2018.md
@@ -13,8 +13,8 @@ title: CSE-CIC-IDS2018
 
 | | |
 |--------------------------|----------------------------------------------------------------------------------------------------------|
-| **Network Log Source** | pcaps, network features |
-| **Network Logs Labeled** | Only features are labeled |
+| **Network Log Source** | pcaps, NetFlows |
+| **Network Logs Labeled** | NetFlows are labeled |
 | **Host Log Source** | Ubuntu event logs, Windows event logs |
 | **Host Logs Labeled** | No |
 | | |

From 77673f52c4ba8cbb23b52c94ca81051cc9f6da8b Mon Sep 17 00:00:00 2001
From: schlippe
Date: Wed, 3 Apr 2024 12:22:56 +0200
Subject: [PATCH 20/21] Add information about CIC-DDoS2019

---
 content/all_datasets.md | 4 +--
 content/datasets/cic_ddos.md | 65 ++++++++++++++++++++++++------------
 2 files changed, 46 insertions(+), 23 deletions(-)

diff --git a/content/all_datasets.md b/content/all_datasets.md
index b6f84cf..544b774 100644
--- a/content/all_datasets.md
+++ b/content/all_datasets.md
@@ -16,12 +16,12 @@ before-content: gh_buttons.html
 |
[AWSCTD](../datasets/awsctd) | Host | Syscalls collected from ~10k malware samples running on Windows 7, no user emulation | 2018 | Single OS | Windows | 🟩 | Sequences of syscall numbers | 10 MB | 558 MB |
 | [CDX CTF 2009](../datasets/cdx_2009) | Network | Dataset captured from a CTF event, generally intended to provide methods for reliable generating labeled datasets from such events | 2009 | Enterprise IT | Windows, Linux | 🟨 | pcaps, Snort IDS alerts, Apache logs, Splunk logs | 12 GB | 15,3 GB |
 | [CIC DoS](../datasets/cic_dos) | Network | Dataset focusing on different DoS attacks targeting the application layer (instead of network layer), but no longer available | 2017 | Enterprise IT | Linux | 🟩 | Network traffic (unknown format) | - | 4,6 GB |
-| [CIC-DDoS2019](../datasets/cic_ddos) | | | | | | | | | |
+| [CIC-DDoS2019](../datasets/cic_ddos) | Network | Dataset focusing on various DDoS attacks, covering a broad range of categories. Includes beni | 2019 | Enterprise IT | Windows, Linux | 🟩 | Pcaps, NetFlows, Windows events, Ubuntu events | 24,4 GB | - |
 | [CIC-IDS2017](../datasets/cic_ids2017) | Network | Simulation of medium-sized company network under attack, focuses solely on network traffic | 2017 | Enterprise IT | Windows, Linux | 🟩 | pcaps, NetFlows, custom network features | 48,4 GB | 50 GB |
 | [CIDD](../datasets/cidd) | - | Spin on the DARPA'98 dataset, correlating user behavior over different systems/environments for behavior-based IDSs | 2012 | Military IT | Unix | 🟩 | Sequences of user "audits" | - | 22 GB |
 | [CLUE-LDS](../datasets/clue_lds) | - | Database of real user behavior without known attacks, for evaluation of methods detecting shifts in user behavior | 2022 | Enterprise Subsystem | - (hBox) | 🟥 | Custom event logs | 640 MB | 14,9 GB |
 | [Comprehensive, Multi-Source Cyber-Security Events](../datasets/comp_multi_source_cybersec_events) | Host & Network | Various events from production network with red team activity, but extremely limited
information per event | 2015 | Enterprise IT | Windows, Linux | 🟩 | Custom event logs (auth, proc, network flows, dns, redteam) | 12 GB | - |
-| [CSE-CIC-IDS2018](../datasets/cse_cic_ids2018) | Network | Simulation of large enterprise IT (450 machines) with user emulation and various attacks, includes host and network logs, but only the latter are labeled | 2018 | Enterprise IT | Windows, Linux, MacOS | 🟨 | pcaps, NetFlows, custom network features, Windows events, Ubuntu events | 220 GB | - |
+| [CSE-CIC-IDS2018](../datasets/cse_cic_ids2018) | Network | Simulation of large enterprise IT (450 machines) with user emulation and various attacks, includes host and network logs, but only the latter are labeled | 2018 | Enterprise IT | Windows, Linux, MacOS | 🟩 | pcaps, NetFlows, Windows events, Ubuntu events | 220 GB | - |
 | [CTU 13](../datasets/ctu_13) | Network | Collection of various botnet behavior combined with loads of background traffic, but very limited feature space | 2011 | Enterprise IT | Windows, Undisclosed | 🟩 | pcaps, NetFlows, Bro logs | - | 697 GB |
 | [DAPT 2020](../datasets/dapt2020) | Network | Focuses on attacks mimicking those of an APT group, executed in a rather small environment | 2020 | Enterprise IT | Undisclosed | 🟩 | NetFlows, misc. logs (DNS, syslog, auditd, apache, auth, various services) | 460 MB | - |
 | [DARPA'98 Intrusion Detection Program](../datasets/darpa98) | Network | Simulation of a small U.S. Air Force network under attack.
No longer appropriate to use for a multiple reasons | 1998 | Military IT | Unix | 🟩 | tcpdumps, host audit logs, file system dumps | 5 GB | - |
diff --git a/content/datasets/cic_ddos.md b/content/datasets/cic_ddos.md
index ed3cac6..665cfca 100644
--- a/content/datasets/cic_ddos.md
+++ b/content/datasets/cic_ddos.md
@@ -12,47 +12,70 @@ title: CIC-DDos2019
 
 | | |
 |--------------------------|---------------------------------------------------------------|
-| **Network Log Source** | |
-| **Network Logs Labeled** | |
-| **Host Log Source** | |
-| **Host Logs Labeled** | |
+| **Network Log Source** | Pcaps, NetFlows |
+| **Network Logs Labeled** | Flows are labeled |
+| **Host Log Source** | Windows event logs, Ubuntu event logs |
+| **Host Logs Labeled** | No |
 | | |
-| **Overall Setting** | |
-| **OS Types** | |
-| **Number of Machines** | |
-| **Total Runtime** | |
-| **Year of Collection** | |
-| **Attack Categories** | |
-| **User Emulation** | |
+| **Overall Setting** | Enterprise IT |
+| **OS Types** | Windows Vista/7/8.1/10
Ubuntu 16.04
Fortinet |
+| **Number of Machines** | 6 |
+| **Total Runtime** | ~16 hours |
+| **Year of Collection** | 2019 |
+| **Attack Categories** | Various DDoS attacks |
+| **User Emulation** | Yes, models complex behavior |
 | | |
-| **Packed Size** | |
-| **Unpacked Size** | |
+| **Packed Size** | 24,4 GB |
+| **Unpacked Size** | n/a |
 | **Download Link** | [goto](http://205.174.165.80/CICDataset/CICDDoS2019/Dataset/) |
 
 ***
 
 ### Overview
-A general description of the dataset, giving a brief overview over origin, intended usage and some properties of the dataset.
+The CIC-DDoS2019 dataset, developed by the Canadian Institute for Cybersecurity (CIC), was created to enable evaluation of new DDoS detection methods, which, according to the authors, was not possible with previously existing datasets containing DDoS attacks.
+The dataset is accompanied by a newly proposed taxonomy for DDoS attacks, dividing them into several subclasses.
+These attacks were then executed within a small testbed consisting of a victim network performing benign behavior and a separate attacker network.
+This simulation was run on two separate days, a training day and a testing day;
+data was collected in the form of pcaps, which were then processed into labeled NetFlows.
 
 ### Environment
-A description of the environment the dataset originated from, including networks, operating systems, running services, etc.
+The victim network consists of four Windows machines (Vista/7/8.1/10), an Ubuntu 16.04 web server, and a firewall.
+Information regarding software is not available; the IPs of individual machines can be found on the homepage.
+Attacks originate from a separate attacker network, which is not described in further detail.
 
 ### Activity
-What kind of activity, benign and malicious, was performed during the period of data collection.
+So-called B(enign)-Profiles are leveraged to define the normal behavior performed during the collection period;
+this simulates 25 distinct users interacting via HTTP, HTTPS, FTP, SSH, and email protocols.
+Statistics for these interactions have been derived from observing real human behavior.
+
+Executed attacks are based on the newly proposed taxonomy of DDoS attacks; for details, refer to Chapter 3 of the cited paper.
+On the first day (training day), 12 different DDoS attacks were executed at different points in time.
+On the second day (testing day), a subset of 5 of these attacks was executed, plus a sixth one that was not performed previously.
 
 ### Contained Data
-What kind of data was collected and how it is present in the dataset, including any processing and labeling.
+Attacks were exclusively executed within the collection period, i.e., no attack is running when data collection starts.
+Data is organized per day and consists of pcaps, which were then processed into NetFlows using CICFlowMeter and subsequently labeled.
+These flows are grouped by attack in separate `csv` files, but there are no flows available for benign behavior.
+While these could probably be extracted manually from the available pcaps, I'm honestly not quite sure why they weren't included in the first place.
+
+A detailed analysis of these flows, especially with respect to the effects of individual attacks on certain features, is available in Chapter 5 of the paper.
 
 ### Papers
-- List of related papers, ideally as DOI links
+- [Developing Realistic Distributed Denial of Service (DDoS) Attack Dataset and Taxonomy (2019)](https://doi.org/10.1109/CCST.2019.8888419)
 
 ### Links
 - [Homepage](https://www.unb.ca/cic/datasets/ddos-2019.html)
 - [Download](http://205.174.165.80/CICDataset/CICDDoS2019/Dataset/)
 
 ### Data Examples
-Snippet from the dataset, ideally one for each data type.
-Note that multi-word annotations (like `json lines`) will not render properly on GitHub Pages.
+Labeled flows taken from `CSVs/CSV-03-11/Portmap.csv`
 ```
-data example
+Unnamed: 0,Flow ID, Source IP, Source Port, Destination IP, Destination Port, Protocol, Timestamp, Flow Duration, Total Fwd Packets, Total Backward Packets,Total Length of Fwd Packets, Total Length of Bwd Packets, Fwd Packet Length Max, Fwd Packet Length Min, Fwd Packet Length Mean, Fwd Packet Length Std,Bwd Packet Length Max, Bwd Packet Length Min, Bwd Packet Length Mean, Bwd Packet Length Std,Flow Bytes/s, Flow Packets/s, Flow IAT Mean, Flow IAT Std, Flow IAT Max, Flow IAT Min,Fwd IAT Total, Fwd IAT Mean, Fwd IAT Std, Fwd IAT Max, Fwd IAT Min,Bwd IAT Total, Bwd IAT Mean, Bwd IAT Std, Bwd IAT Max, Bwd IAT Min,Fwd PSH Flags, Bwd PSH Flags, Fwd URG Flags, Bwd URG Flags, Fwd Header Length, Bwd Header Length,Fwd Packets/s, Bwd Packets/s, Min Packet Length, Max Packet Length, Packet Length Mean, Packet Length Std, Packet Length Variance,FIN Flag Count, SYN Flag Count, RST Flag Count, PSH Flag Count, ACK Flag Count, URG Flag Count, CWE Flag Count, ECE Flag Count, Down/Up Ratio, Average Packet Size, Avg Fwd Segment Size, Avg Bwd Segment Size, Fwd Header Length.1,Fwd Avg Bytes/Bulk, Fwd Avg Packets/Bulk, Fwd Avg Bulk Rate, Bwd Avg Bytes/Bulk, Bwd Avg Packets/Bulk,Bwd Avg Bulk Rate,Subflow Fwd Packets, Subflow Fwd Bytes, Subflow Bwd Packets, Subflow Bwd Bytes,Init_Win_bytes_forward, Init_Win_bytes_backward, act_data_pkt_fwd, min_seg_size_forward,Active Mean, Active Std, Active Max, Active Min,Idle Mean, Idle Std, Idle Max, Idle Min,SimillarHTTP, Inbound, Label
+[...]
+162471,172.16.0.5-192.168.50.4-932-44723-17,172.16.0.5,932,192.168.50.4,44723,17,2018-11-03 10:01:35.983831,1,2,0,458.0,0.0,229.0,229.0,229.0,0.0,0.0,0.0,0.0,0.0,4.58E8,2000000.0,1.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,40,0,2000000.0,0.0,229.0,229.0,229.0,0.0,0.0,0,0,0,0,0,0,0,0,0.0,343.5,229.0,0.0,40,0,0,0,0,0,0,2,458,0,0,-1,-1,1,20,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,Portmap
+61268,172.16.0.5-192.168.50.4-933-39983-17,172.16.0.5,933,192.168.50.4,39983,17,2018-11-03 10:01:35.984211,1,2,0,458.0,0.0,229.0,229.0,229.0,0.0,0.0,0.0,0.0,0.0,4.58E8,2000000.0,1.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,40,0,2000000.0,0.0,229.0,229.0,229.0,0.0,0.0,0,0,0,0,0,0,0,0,0.0,343.5,229.0,0.0,40,0,0,0,0,0,0,2,458,0,0,-1,-1,1,20,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,Portmap
+27258,172.16.0.5-192.168.50.4-934-26737-17,172.16.0.5,934,192.168.50.4,26737,17,2018-11-03 10:01:35.984213,1,2,0,458.0,0.0,229.0,229.0,229.0,0.0,0.0,0.0,0.0,0.0,4.58E8,2000000.0,1.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,40,0,2000000.0,0.0,229.0,229.0,229.0,0.0,0.0,0,0,0,0,0,0,0,0,0.0,343.5,229.0,0.0,40,0,0,0,0,0,0,2,458,0,0,-1,-1,1,20,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,Portmap
+85566,172.16.0.5-192.168.50.4-648-21313-17,172.16.0.5,648,192.168.50.4,21313,17,2018-11-03 10:01:35.984783,2,2,0,458.0,0.0,229.0,229.0,229.0,0.0,0.0,0.0,0.0,0.0,2.29E8,1000000.0,2.0,0.0,2.0,2.0,2.0,2.0,0.0,2.0,2.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,40,0,1000000.0,0.0,229.0,229.0,229.0,0.0,0.0,0,0,0,0,0,0,0,0,0.0,343.5,229.0,0.0,40,0,0,0,0,0,0,2,458,0,0,-1,-1,1,20,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,Portmap
+108025,172.16.0.5-192.168.50.4-935-15051-17,172.16.0.5,935,192.168.50.4,15051,17,2018-11-03
10:01:35.984786,0,2,0,530.0,0.0,265.0,265.0,265.0,0.0,0.0,0.0,0.0,0.0,Infinity,Infinity,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,40,0,0.0,0.0,265.0,265.0,265.0,0.0,0.0,0,0,0,0,0,0,0,0,0.0,397.5,265.0,0.0,40,0,0,0,0,0,0,2,530,0,0,-1,-1,1,20,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,Portmap
+87041,172.16.0.5-192.168.50.4-936-49469-17,172.16.0.5,936,192.168.50.4,49469,17,2018-11-03 10:01:35.985305,2,2,0,458.0,0.0,229.0,229.0,229.0,0.0,0.0,0.0,0.0,0.0,2.29E8,1000000.0,2.0,0.0,2.0,2.0,2.0,2.0,0.0,2.0,2.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,40,0,1000000.0,0.0,229.0,229.0,229.0,0.0,0.0,0,0,0,0,0,0,0,0,0.0,343.5,229.0,0.0,40,0,0,0,0,0,0,2,458,0,0,-1,-1,1,20,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,Portmap
 ```
\ No newline at end of file

From 45bb3aa60290e0f29e13ba387528284d26f15803 Mon Sep 17 00:00:00 2001
From: schlippe
Date: Wed, 3 Apr 2024 12:28:05 +0200
Subject: [PATCH 21/21] Add missing text

---
 content/all_datasets.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/content/all_datasets.md b/content/all_datasets.md
index 544b774..badcedc 100644
--- a/content/all_datasets.md
+++ b/content/all_datasets.md
@@ -16,7 +16,7 @@ before-content: gh_buttons.html
 | [AWSCTD](../datasets/awsctd) | Host | Syscalls collected from ~10k malware samples running on Windows 7, no user emulation | 2018 | Single OS | Windows | 🟩 | Sequences of syscall numbers | 10 MB | 558 MB |
 | [CDX CTF 2009](../datasets/cdx_2009) | Network | Dataset captured from a CTF event, generally intended to provide methods for reliable generating labeled datasets from such events | 2009 | Enterprise IT | Windows, Linux | 🟨 | pcaps, Snort IDS alerts, Apache logs, Splunk logs | 12 GB | 15,3 GB |
 | [CIC DoS](../datasets/cic_dos) | Network | Dataset focusing on different DoS attacks targeting the application layer (instead of network layer), but no longer available | 2017 | Enterprise IT | Linux | 🟩 | Network traffic (unknown format) | - | 4,6 GB |
-| [CIC-DDoS2019](../datasets/cic_ddos) |
Network | Dataset focusing on various DDoS attacks, covering a broad range of categories. Includes beni | 2019 | Enterprise IT | Windows, Linux | 🟩 | Pcaps, NetFlows, Windows events, Ubuntu events | 24,4 GB | - |
+| [CIC-DDoS2019](../datasets/cic_ddos) | Network | Dataset focusing on various DDoS attacks, covering a broad range of categories. Includes benign behavior, but only for Pcaps, not NetFlows | 2019 | Enterprise IT | Windows, Linux | 🟩 | Pcaps, NetFlows, Windows events, Ubuntu events | 24,4 GB | - |
 | [CIC-IDS2017](../datasets/cic_ids2017) | Network | Simulation of medium-sized company network under attack, focuses solely on network traffic | 2017 | Enterprise IT | Windows, Linux | 🟩 | pcaps, NetFlows, custom network features | 48,4 GB | 50 GB |
 | [CIDD](../datasets/cidd) | - | Spin on the DARPA'98 dataset, correlating user behavior over different systems/environments for behavior-based IDSs | 2012 | Military IT | Unix | 🟩 | Sequences of user "audits" | - | 22 GB |
 | [CLUE-LDS](../datasets/clue_lds) | - | Database of real user behavior without known attacks, for evaluation of methods detecting shifts in user behavior | 2022 | Enterprise Subsystem | - (hBox) | 🟥 | Custom event logs | 640 MB | 14,9 GB |