From 69b865d27356ffc867ce7862afdeb7c3680ea520 Mon Sep 17 00:00:00 2001
From: Philipp Boenninghausen
Date: Wed, 10 Apr 2024 10:40:17 +0200
Subject: [PATCH 01/16] Add basic structure for new dataset

---
 content/all_datasets.md   |  1 +
 content/datasets/unibs.md | 61 +++++++++++++++++++++++++++++++++++++++
 content/related_work.md   |  6 ++--
 3 files changed, 65 insertions(+), 3 deletions(-)
 create mode 100644 content/datasets/unibs.md

diff --git a/content/all_datasets.md b/content/all_datasets.md
index badcedc..40ccc50 100644
--- a/content/all_datasets.md
+++ b/content/all_datasets.md
@@ -49,6 +49,7 @@ before-content: gh_buttons.html
 | [Twente 2009](../datasets/twente_2009) | Network | Intricately labeled network flows + alerts collected from a single honeypot over the span of 6 days | 2009 | Single OS | Linux | 🟩 | NetFlows | 303 MB | 1,9 GB |
 | [Twente 2014](../datasets/twente_2014) | Network | Anonymized network flows and host logs from real network, but only those related to ssh authentication, focusing on detecting related brute force attacks | 2014 | Enterprise IT | Undisclosed | 🟩 | NetFlows | 2,42 GB | 5,8 GB |
 | [UGR'16](../datasets/ugr16) | Network | Network flows collected from real network over a long period of time, with some attack traffic injected | 2016 | Enterprise IT | Undisclosed | 🟩 | NetFlows | 236 GB | - |
+| [UNIBS](../datasets/unibs) | - | | | | | | | | |
 | [Unified Host and Network Data Set](../datasets/unified_host_and_network_dataset) | - | Selection of network and host events collected from operational environment, but without any attacks | 2017 | Enterprise IT | Windows, Linux | 🟥 | NetFlows, Windows events | - | - |
 | [Unraveled](../datasets/unraveled) | Host & Network | Large dataset with intricate labeling, though the focus seems to be on network flows. Mapping will be annoying. | 2021 | Enterprise IT | Windows, Ubuntu, Kali | 🟩 | pcaps, misc. logs (syslog, audit, auth, Snort) | - | 22 GB |
 | [UNSW-NB15](../datasets/unsw_nb15) | Network | Custom network undergoing a variety of attacks using IXIA PerfectStorm hardware. Mostly geared towards anomaly-based NIDS | 2015 | Undisclosed | Undisclosed | 🟩 | pcaps, custom network features | >100 GB | - |
diff --git a/content/datasets/unibs.md b/content/datasets/unibs.md
new file mode 100644
index 0000000..bdf9da9
--- /dev/null
+++ b/content/datasets/unibs.md
@@ -0,0 +1,61 @@
+---
+title: UNIBS
+---
+
+- [Overview](#overview)
+- [Environment](#environment)
+- [Activity](#activity)
+- [Contained Data](#contained-data)
+- [Papers](#papers)
+- [Links](#links)
+- [Data Examples](#data-examples)
+
+| | |
+|--------------------------|--------------------------------------------------------------------|
+| **Network Log Source** | |
+| **Network Logs Labeled** | |
+| **Host Log Source** | |
+| **Host Logs Labeled** | |
+| | |
+| **Overall Setting** | |
+| **OS Types** | |
+| **Number of Machines** | |
+| **Total Runtime** | |
+| **Year of Collection** | |
+| **Attack Categories** | |
+| **User Emulation** | |
+| | |
+| **Packed Size** | |
+| **Unpacked Size** | |
+| **Download Link** | [must be requested](http://netweb.ing.unibs.it/~ntw/tools/traces/) |
+
+***
+
+### Overview
+A general description of the dataset, giving a brief overview over origin, intended usage and some properties of the dataset.
+
+### Environment
+A description of the environment the dataset originated from, including networks, operating systems, running services, etc.
+
+### Activity
+What kind of activity, benign and malicious, was performed during the period of data collection.
+
+### Contained Data
+What kind of data was collected and how it is present in the dataset, including any processing and labeling.
+
+### Papers
+- [GT: picking up the truth from the ground for internet traffic (2009)](https://doi.org/10.1145/1629607.1629610)
+
+### Links
+- [Homepage](http://netweb.ing.unibs.it/~ntw/tools/traces/)
+
+### Data Examples
+Snippet from the dataset, ideally one for each data type.
+Note that multi-word annotations (like `json lines`) will not render properly on GitHub Pages.
+Wrapping these snippets with `raw`/`endraw` is not strictly required, but prevents Liquid from parsing anything it shouldn't.
+
+
+```
+data example
+```
+
\ No newline at end of file
diff --git a/content/related_work.md b/content/related_work.md
index 7a70a75..987d4df 100644
--- a/content/related_work.md
+++ b/content/related_work.md
@@ -78,7 +78,7 @@ Referenced datasets:
 - [NSL-KDD](/intrusion-detection-datasets/content/datasets/nsl_kdd_dataset)
 - [Twente 2009](/intrusion-detection-datasets/content/datasets/twente_2009)
 - [UNSW NB15](/intrusion-detection-datasets/content/datasets/unsw_nb15)
-- Mentioned, but not further detailed: Metrosec, UNIBS 2009, TUIDS, University of Napoli traffic dataset, CSIC 2010 HTTP dataset, UNM system call dataset
+- Mentioned, but not further detailed: Metrosec, [UNIBS](/intrusion-detection-datasets/content/datasets/unibs), TUIDS, University of Napoli traffic dataset, CSIC 2010 HTTP dataset, UNM system call dataset
 
 Referenced collections:
 - CAIDA
@@ -189,7 +189,7 @@ Referenced datasets:
 - TRAbID
 - TUIDS
 - [Twente 2009](/intrusion-detection-datasets/content/datasets/twente_2009)
-- UNIBS
+- [UNIBS](/intrusion-detection-datasets/content/datasets/unibs)
 - [Unified Host and Network dataset](/intrusion-detection-datasets/content/datasets/unified_host_and_network_dataset)
 - [UNSW-NB15](/intrusion-detection-datasets/content/datasets/unsw_nb15)
 
@@ -230,7 +230,7 @@ Referenced datasets:
 - [ISCX IDS 2012](/intrusion-detection-datasets/content/datasets/iscx_ids_2012)
 - [KDD Cup 1999](/intrusion-detection-datasets/content/datasets/kdd_cup_1999)
 - [NSL-KDD](/intrusion-detection-datasets/content/datasets/nsl_kdd_dataset)
-- UNIBS
+- [UNIBS](/intrusion-detection-datasets/content/datasets/unibs)
 
 Referenced collections:
 - CAIDA

From d3dacabef17a6f466d6dd30c9cd83b9a163185b5 Mon Sep 17 00:00:00 2001
From: Philipp Boenninghausen
Date: Wed, 10 Apr 2024 10:50:44 +0200
Subject: [PATCH 02/16] Add basic structure for new dataset

---
 content/all_datasets.md         |  1 +
 content/datasets/isot_botnet.md | 61 +++++++++++++++++++++++++++++++++
 content/related_work.md         |  4 +--
 3 files changed, 64 insertions(+), 2 deletions(-)
 create mode 100644 content/datasets/isot_botnet.md

diff --git a/content/all_datasets.md b/content/all_datasets.md
index badcedc..dda875e 100644
--- a/content/all_datasets.md
+++ b/content/all_datasets.md
@@ -30,6 +30,7 @@ before-content: gh_buttons.html
 | [EVTX to MITRE ATT&CK](../datasets/evtx_to_mitre_attck) | Host | Small dataset providing various events corresponding to certain MITRE tactics/techniques | 2022 | Single OS | Windows | 🟩 | Windows events | <1 GB | <1 GB |
 | [gureKDDCup](../datasets/gure_kddcup) | Network | An extension of the KDDCup 1999 dataset, adding additional information about payloads to each connection record | 2008 | Military IT | Unix | 🟩 | Connection records with payload information | 10 GB | - |
 | [ISCX Intrusion Detection Evaluation](../datasets/iscx_ids_2012) | Network | Focus on realistic traffic generation in a company network, combined with some basic attacks | 2012 | Enterprise IT | Windows, Ubuntu | 🟩 | pcaps | 84 GB | 87 GB |
+| [ISOT Botnet](../datasets/isot_botnet) | | | | | | | | | |
 | [KDD Cup 1999](../datasets/kdd_cup_1999) | Network | Network connection events derived from simulated U.S. Air Force network under attack. No longer appropriate to use for multiple reasons | 1999 | Military IT | Unix | 🟩 | Connection records | 18 MB | 743 MB |
 | [Kyoto Honeypot](../datasets/kyoto_honeypot) | Network | Collection of features derived from attack traffic targeting honeypots over the span of 9 years | 2006-2015 | Diverse | Windows, Unix, MacOS | 🟩 | Custom network features | 20 GB | - |
 | [LID-DS 2019](../datasets/lids_ds_2019) | Host | Contains system calls + associated data/metadata for a variety of Linux exploits, includes normal behavior | 2019 | Single OS | Ubuntu | 🟨 | Sequences of syscalls with extended information | 13 GB | - |
diff --git a/content/datasets/isot_botnet.md b/content/datasets/isot_botnet.md
new file mode 100644
index 0000000..7656922
--- /dev/null
+++ b/content/datasets/isot_botnet.md
@@ -0,0 +1,61 @@
+---
+title: ISOT BOTNET
+---
+
+- [Overview](#overview)
+- [Environment](#environment)
+- [Activity](#activity)
+- [Contained Data](#contained-data)
+- [Papers](#papers)
+- [Links](#links)
+- [Data Examples](#data-examples)
+
+| | |
+|--------------------------|--------------------------------------------------------------------------------|
+| **Network Log Source** | |
+| **Network Logs Labeled** | |
+| **Host Log Source** | |
+| **Host Logs Labeled** | |
+| | |
+| **Overall Setting** | |
+| **OS Types** | |
+| **Number of Machines** | |
+| **Total Runtime** | |
+| **Year of Collection** | |
+| **Attack Categories** | |
+| **User Emulation** | |
+| | |
+| **Packed Size** | |
+| **Unpacked Size** | |
+| **Download Link** | [goto](https://drive.google.com/file/d/1X1zPBJFPHU1ToQbpyd1Is1tJJuz2BeRd/view) |
+
+***
+
+### Overview
+A general description of the dataset, giving a brief overview over origin, intended usage and some properties of the dataset.
+
+### Environment
+A description of the environment the dataset originated from, including networks, operating systems, running services, etc.
+
+### Activity
+What kind of activity, benign and malicious, was performed during the period of data collection.
+
+### Contained Data
+What kind of data was collected and how it is present in the dataset, including any processing and labeling.
+
+### Papers
+- [Detecting P2P botnets through network behavior analysis and machine learning (2011)](https://doi.org/10.1109/PST.2011.5971980)
+
+### Links
+- [Documentation](https://onlineacademiccommunity.uvic.ca/isot/wp-content/uploads/sites/7295/2023/03/ISOT-Dataset-Overview-v0.5.pdf)
+
+### Data Examples
+Snippet from the dataset, ideally one for each data type.
+Note that multi-word annotations (like `json lines`) will not render properly on GitHub Pages.
+Wrapping these snippets with `raw`/`endraw` is not strictly required, but prevents Liquid from parsing anything it shouldn't.
+
+
+```
+data example
+```
+
\ No newline at end of file
diff --git a/content/related_work.md b/content/related_work.md
index 7a70a75..d6c8cfd 100644
--- a/content/related_work.md
+++ b/content/related_work.md
@@ -72,7 +72,7 @@ Referenced datasets:
 - [DARPA'98 Intrusion Detection Program](/intrusion-detection-datasets/content/datasets/darpa98)
 - [gureKDDCup](/intrusion-detection-datasets/content/datasets/gure_kddcup)
 - [ISCX IDS 2012](/intrusion-detection-datasets/content/datasets/iscx_ids_2012)
-- ISOT
+- [ISOT Botnet](/intrusion-detection-datasets/content/datasets/isot_botnet)
 - [KDD Cup 1999](/intrusion-detection-datasets/content/datasets/kdd_cup_1999)
 - [Kyoto Honeypot](/intrusion-detection-datasets/content/datasets/kyoto_honeypot)
 - [NSL-KDD](/intrusion-detection-datasets/content/datasets/nsl_kdd_dataset)
@@ -173,7 +173,7 @@ Referenced datasets:
 - DDoS 2016
 - IRSC
 - [ISCX IDS 2012](/intrusion-detection-datasets/content/datasets/iscx_ids_2012)
-- ISOT
+- [ISOT Botnet](/intrusion-detection-datasets/content/datasets/isot_botnet)
 - [KDD Cup 1999](/intrusion-detection-datasets/content/datasets/kdd_cup_1999)
 - [Kent 2016](/intrusion-detection-datasets/content/datasets/comp_multi_source_cybersec_events) (alias for: Comprehensive, Multi-Source Cybersecurity Events)
 - [Kyoto Honeypot](/intrusion-detection-datasets/content/datasets/kyoto_honeypot)

From e1d0df73ef066c4725dce394a36e586d951176d1 Mon Sep 17 00:00:00 2001
From: schlippe
Date: Thu, 18 Apr 2024 15:47:25 +0200
Subject: [PATCH 03/16] Add mention of issue listing datasets that need to be added

---
 content/contributing.md | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/content/contributing.md b/content/contributing.md
index 4ad010b..a303d1e 100644
--- a/content/contributing.md
+++ b/content/contributing.md
@@ -16,5 +16,7 @@ If you want to contribute a new dataset entry, please use this [template](https:
 A new entry should consist of said template filled out and named appropriately, placed in `/content/datasets/`.
 Additionally, a new row should be added to the list of all datasets in `/content/all_datasets.md`, adding information to each cell as needed.
 
+You can find a list of datasets that we are aware of, but which do not have an entry, in [this issue](https://github.com/fkie-cad/intrusion-detection-datasets/issues/13)
+
 On every page you will also find an "Edit Page" button at the bottom leading you to GitHub, where you will be prompted to fork this repository - saving you a few clicks when you want to edit an existing entry.
 While contributions should generally be aimed towards datasets, suggestions regarding the underlying structure (like the website itself) are of course also welcome.
\ No newline at end of file

From a187eea37c9992e2c2d16f643da230f2b1aa32d0 Mon Sep 17 00:00:00 2001
From: schlippe
Date: Thu, 18 Apr 2024 15:48:22 +0200
Subject: [PATCH 04/16] Fix typo

---
 content/contributing.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/content/contributing.md b/content/contributing.md
index a303d1e..7db54a0 100644
--- a/content/contributing.md
+++ b/content/contributing.md
@@ -16,7 +16,7 @@ If you want to contribute a new dataset entry, please use this [template](https:
 A new entry should consist of said template filled out and named appropriately, placed in `/content/datasets/`.
 Additionally, a new row should be added to the list of all datasets in `/content/all_datasets.md`, adding information to each cell as needed.
 
-You can find a list of datasets that we are aware of, but which do not have an entry, in [this issue](https://github.com/fkie-cad/intrusion-detection-datasets/issues/13)
+You can find a list of datasets that we are aware of, but which do not have an entry yet, in [this issue](https://github.com/fkie-cad/intrusion-detection-datasets/issues/13)
 
 On every page you will also find an "Edit Page" button at the bottom leading you to GitHub, where you will be prompted to fork this repository - saving you a few clicks when you want to edit an existing entry.
 While contributions should generally be aimed towards datasets, suggestions regarding the underlying structure (like the website itself) are of course also welcome.
\ No newline at end of file

From 7c0e534c88cca9da2301480636b42f7e65d278d1 Mon Sep 17 00:00:00 2001
From: schlippe
Date: Wed, 24 Apr 2024 13:26:06 +0200
Subject: [PATCH 05/16] Update UNIBS entry

---
 content/all_datasets.md   |  2 +-
 content/datasets/unibs.md | 14 +++++++-------
 2 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/content/all_datasets.md b/content/all_datasets.md
index 40ccc50..161499c 100644
--- a/content/all_datasets.md
+++ b/content/all_datasets.md
@@ -49,7 +49,7 @@ before-content: gh_buttons.html
 | [Twente 2009](../datasets/twente_2009) | Network | Intricately labeled network flows + alerts collected from a single honeypot over the span of 6 days | 2009 | Single OS | Linux | 🟩 | NetFlows | 303 MB | 1,9 GB |
 | [Twente 2014](../datasets/twente_2014) | Network | Anonymized network flows and host logs from real network, but only those related to ssh authentication, focusing on detecting related brute force attacks | 2014 | Enterprise IT | Undisclosed | 🟩 | NetFlows | 2,42 GB | 5,8 GB |
 | [UGR'16](../datasets/ugr16) | Network | Network flows collected from real network over a long period of time, with some attack traffic injected | 2016 | Enterprise IT | Undisclosed | 🟩 | NetFlows | 236 GB | - |
-| [UNIBS](../datasets/unibs) | - | | | | | | | | |
+| [UNIBS](../datasets/unibs) | - | | 2009 | | | | | | |
 | [Unified Host and Network Data Set](../datasets/unified_host_and_network_dataset) | - | Selection of network and host events collected from operational environment, but without any attacks | 2017 | Enterprise IT | Windows, Linux | 🟥 | NetFlows, Windows events | - | - |
 | [Unraveled](../datasets/unraveled) | Host & Network | Large dataset with intricate labeling, though the focus seems to be on network flows. Mapping will be annoying. | 2021 | Enterprise IT | Windows, Ubuntu, Kali | 🟩 | pcaps, misc. logs (syslog, audit, auth, Snort) | - | 22 GB |
 | [UNSW-NB15](../datasets/unsw_nb15) | Network | Custom network undergoing a variety of attacks using IXIA PerfectStorm hardware. Mostly geared towards anomaly-based NIDS | 2015 | Undisclosed | Undisclosed | 🟩 | pcaps, custom network features | >100 GB | - |
diff --git a/content/datasets/unibs.md b/content/datasets/unibs.md
index bdf9da9..8df48e5 100644
--- a/content/datasets/unibs.md
+++ b/content/datasets/unibs.md
@@ -12,18 +12,18 @@ title: UNIBS
 
 | | |
 |--------------------------|--------------------------------------------------------------------|
-| **Network Log Source** | |
-| **Network Logs Labeled** | |
-| **Host Log Source** | |
-| **Host Logs Labeled** | |
+| **Network Data Source** | |
+| **Network Data Labeled** | |
+| **Host Data Source** | |
+| **Host Data Labeled** | |
 | | |
 | **Overall Setting** | |
 | **OS Types** | |
 | **Number of Machines** | |
 | **Total Runtime** | |
 | **Year of Collection** | |
-| **Attack Categories** | |
-| **User Emulation** | |
+| **Attack Categories** | None |
+| **Benign Activity** | Real users |
 | | |
 | **Packed Size** | |
 | **Unpacked Size** | |
@@ -32,7 +32,7 @@ title: UNIBS
 ***
 
 ### Overview
-A general description of the dataset, giving a brief overview over origin, intended usage and some properties of the dataset.
+The University of Brescia (UNIBS) dataset was created to showcase the capabilities of the "GT" tool,
 
 ### Environment
 A description of the environment the dataset originated from, including networks, operating systems, running services, etc.

From f4be810b8c5e9b533da1ba4bf46a628e17858d54 Mon Sep 17 00:00:00 2001
From: schlippe
Date: Wed, 24 Apr 2024 14:43:51 +0200
Subject: [PATCH 06/16] Add information about UNIBS dataset

---
 content/all_datasets.md   |  2 +-
 content/datasets/unibs.md | 55 +++++++++++++++++++--------------------
 2 files changed, 28 insertions(+), 29 deletions(-)

diff --git a/content/all_datasets.md b/content/all_datasets.md
index 161499c..f12a549 100644
--- a/content/all_datasets.md
+++ b/content/all_datasets.md
@@ -49,7 +49,7 @@ before-content: gh_buttons.html
 | [Twente 2009](../datasets/twente_2009) | Network | Intricately labeled network flows + alerts collected from a single honeypot over the span of 6 days | 2009 | Single OS | Linux | 🟩 | NetFlows | 303 MB | 1,9 GB |
 | [Twente 2014](../datasets/twente_2014) | Network | Anonymized network flows and host logs from real network, but only those related to ssh authentication, focusing on detecting related brute force attacks | 2014 | Enterprise IT | Undisclosed | 🟩 | NetFlows | 2,42 GB | 5,8 GB |
 | [UGR'16](../datasets/ugr16) | Network | Network flows collected from real network over a long period of time, with some attack traffic injected | 2016 | Enterprise IT | Undisclosed | 🟩 | NetFlows | 236 GB | - |
-| [UNIBS](../datasets/unibs) | - | | 2009 | | | | | | |
+| [UNIBS](../datasets/unibs) | - | Traces annotated with additional application-level ground truth to showcase a novel toolset, but does not feature any attacks | 2009 | Enterprise IT | Undisclosed | 🟥 | NetFlows | - | 2,7 GB |
 | [Unified Host and Network Data Set](../datasets/unified_host_and_network_dataset) | - | Selection of network and host events collected from operational environment, but without any attacks | 2017 | Enterprise IT | Windows, Linux | 🟥 | NetFlows, Windows events | - | - |
 | [Unraveled](../datasets/unraveled) | Host & Network | Large dataset with intricate labeling, though the focus seems to be on network flows. Mapping will be annoying. | 2021 | Enterprise IT | Windows, Ubuntu, Kali | 🟩 | pcaps, misc. logs (syslog, audit, auth, Snort) | - | 22 GB |
 | [UNSW-NB15](../datasets/unsw_nb15) | Network | Custom network undergoing a variety of attacks using IXIA PerfectStorm hardware. Mostly geared towards anomaly-based NIDS | 2015 | Undisclosed | Undisclosed | 🟩 | pcaps, custom network features | >100 GB | - |
diff --git a/content/datasets/unibs.md b/content/datasets/unibs.md
index 8df48e5..04270a7 100644
--- a/content/datasets/unibs.md
+++ b/content/datasets/unibs.md
@@ -8,54 +8,53 @@ title: UNIBS
 - [Contained Data](#contained-data)
 - [Papers](#papers)
 - [Links](#links)
-- [Data Examples](#data-examples)
 
 | | |
 |--------------------------|--------------------------------------------------------------------|
-| **Network Data Source** | |
-| **Network Data Labeled** | |
-| **Host Data Source** | |
-| **Host Data Labeled** | |
+| **Network Data Source** | NetFlows |
+| **Network Data Labeled** | No |
+| **Host Data Source** | - |
+| **Host Data Labeled** | - |
 | | |
-| **Overall Setting** | |
-| **OS Types** | |
-| **Number of Machines** | |
-| **Total Runtime** | |
-| **Year of Collection** | |
+| **Overall Setting** | Enterprise IT |
+| **OS Types** | Undisclosed |
+| **Number of Machines** | 20 |
+| **Total Runtime** | 3 days |
+| **Year of Collection** | 2009 |
 | **Attack Categories** | None |
 | **Benign Activity** | Real users |
 | | |
-| **Packed Size** | |
-| **Unpacked Size** | |
+| **Packed Size** | - |
+| **Unpacked Size** | 2,7 GB |
 | **Download Link** | [must be requested](http://netweb.ing.unibs.it/~ntw/tools/traces/) |
 
 ***
 
 ### Overview
-The University of Brescia (UNIBS) dataset was created to showcase the capabilities of the "GT" tool,
+The University of Brescia (UNIBS) dataset was created to showcase the capabilities of the "GT" software, an open source toolset facilitating the association of application-level ground truth with network traffic traces.
+This is done by probing a monitored host's kernel to gather ground truth at the application level, which can then later be assigned to any collected traces with minimal CPU overhead.
+Beyond this, the dataset does not seem to serve a greater purpose, as it does not contain any malicious activity (that the authors are aware of) and is also anonymized.
 
 ### Environment
-A description of the environment the dataset originated from, including networks, operating systems, running services, etc.
+Traffic was collected from 20 workstations located in the campus network of the University of Brescia over the course of three consecutive days (2009-09-30 to 2009-10-02).
+Each workstation is running a "GT client daemon", information regarding network configuration or specific operating systems is not available.
 
 ### Activity
-What kind of activity, benign and malicious, was performed during the period of data collection.
+(Presumably) real users used a variety of traffic generating applications and protocols, namely:
+- Web (HTTP, HTTPS)
+- Mail (POP3, IMAP4, SMTP)
+- Skype
+- P2P (Bittorrent, Edonkey)
+- Other (FTP, SSH, MSN)
+Any further details are not available, most likely because the focus of this dataset was simply on correctly assigning flows to these services or protocols.
+Intentional malicious activity is not present.
 
 ### Contained Data
-What kind of data was collected and how it is present in the dataset, including any processing and labeling.
+Traffic was collected from the central faculty router via `tcpdump` and enriched with ground truth from the GT tool (in the form of related protocol and application).
+It is available in an anonymized and payload-stripped form, presumably in as NetFlows, but has to be requested via mail.
 
 ### Papers
 - [GT: picking up the truth from the ground for internet traffic (2009)](https://doi.org/10.1145/1629607.1629610)
 
 ### Links
-- [Homepage](http://netweb.ing.unibs.it/~ntw/tools/traces/)
-
-### Data Examples
-Snippet from the dataset, ideally one for each data type.
-Note that multi-word annotations (like `json lines`) will not render properly on GitHub Pages.
-Wrapping these snippets with `raw`/`endraw` is not strictly required, but prevents Liquid from parsing anything it shouldn't.
-
-
-```
-data example
-```
-
\ No newline at end of file
+- [Homepage](http://netweb.ing.unibs.it/~ntw/tools/traces/)
\ No newline at end of file

From 004b76950738b4e676c79bfe88e097d205e21598 Mon Sep 17 00:00:00 2001
From: schlippe
Date: Wed, 24 Apr 2024 14:44:54 +0200
Subject: [PATCH 07/16] Add missing newline

---
 content/datasets/unibs.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/content/datasets/unibs.md b/content/datasets/unibs.md
index 04270a7..092a1d6 100644
--- a/content/datasets/unibs.md
+++ b/content/datasets/unibs.md
@@ -46,6 +46,7 @@ Each workstation is running a "GT client daemon", information regarding network
 - Skype
 - P2P (Bittorrent, Edonkey)
 - Other (FTP, SSH, MSN)
+
 Any further details are not available, most likely because the focus of this dataset was simply on correctly assigning flows to these services or protocols.
 Intentional malicious activity is not present.
 
From 83407f5def47db9feb0eb06cbd9b3fadae2338bf Mon Sep 17 00:00:00 2001
From: schlippe
Date: Wed, 24 Apr 2024 18:53:39 +0200
Subject: [PATCH 08/16] Add information about ISOT Botnet dataset

---
 content/all_datasets.md         |  2 +-
 content/datasets/isot_botnet.md | 59 +++++++++++++++++----------------
 2 files changed, 31 insertions(+), 30 deletions(-)

diff --git a/content/all_datasets.md b/content/all_datasets.md
index dda875e..aacfedd 100644
--- a/content/all_datasets.md
+++ b/content/all_datasets.md
@@ -30,7 +30,7 @@ before-content: gh_buttons.html
 | [EVTX to MITRE ATT&CK](../datasets/evtx_to_mitre_attck) | Host | Small dataset providing various events corresponding to certain MITRE tactics/techniques | 2022 | Single OS | Windows | 🟩 | Windows events | <1 GB | <1 GB |
 | [gureKDDCup](../datasets/gure_kddcup) | Network | An extension of the KDDCup 1999 dataset, adding additional information about payloads to each connection record | 2008 | Military IT | Unix | 🟩 | Connection records with payload information | 10 GB | - |
 | [ISCX Intrusion Detection Evaluation](../datasets/iscx_ids_2012) | Network | Focus on realistic traffic generation in a company network, combined with some basic attacks | 2012 | Enterprise IT | Windows, Ubuntu | 🟩 | pcaps | 84 GB | 87 GB |
-| [ISOT Botnet](../datasets/isot_botnet) | | | | | | | | | |
+| [ISOT Botnet](../datasets/isot_botnet) | Network | Amalgamation of existing malicious botnet and normal traces to test novel botnet detection methods | 2010 | Enterprise IT | Undisclosed | 🟩 | pcaps | 3 GB | 10,6 GB |
 | [KDD Cup 1999](../datasets/kdd_cup_1999) | Network | Network connection events derived from simulated U.S. Air Force network under attack. No longer appropriate to use for multiple reasons | 1999 | Military IT | Unix | 🟩 | Connection records | 18 MB | 743 MB |
 | [Kyoto Honeypot](../datasets/kyoto_honeypot) | Network | Collection of features derived from attack traffic targeting honeypots over the span of 9 years | 2006-2015 | Diverse | Windows, Unix, MacOS | 🟩 | Custom network features | 20 GB | - |
 | [LID-DS 2019](../datasets/lids_ds_2019) | Host | Contains system calls + associated data/metadata for a variety of Linux exploits, includes normal behavior | 2019 | Single OS | Ubuntu | 🟨 | Sequences of syscalls with extended information | 13 GB | - |
diff --git a/content/datasets/isot_botnet.md b/content/datasets/isot_botnet.md
index 7656922..cc594e1 100644
--- a/content/datasets/isot_botnet.md
+++ b/content/datasets/isot_botnet.md
@@ -8,54 +8,55 @@ title: ISOT BOTNET
 - [Contained Data](#contained-data)
 - [Papers](#papers)
 - [Links](#links)
-- [Data Examples](#data-examples)
 
 | | |
 |--------------------------|--------------------------------------------------------------------------------|
-| **Network Log Source** | |
-| **Network Logs Labeled** | |
-| **Host Log Source** | |
-| **Host Logs Labeled** | |
+| **Network Data Source** | pcaps |
+| **Network Data Labeled** | Yes |
+| **Host Data Source** | - |
+| **Host Data Labeled** | - |
 | | |
-| **Overall Setting** | |
-| **OS Types** | |
-| **Number of Machines** | |
-| **Total Runtime** | |
-| **Year of Collection** | |
-| **Attack Categories** | |
-| **User Emulation** | |
+| **Overall Setting** | Enterprise IT |
+| **OS Types** | Undisclosed |
+| **Number of Machines** | 2000+ |
+| **Total Runtime** | n/a |
+| **Year of Collection** | 2004-2010 |
+| **Attack Categories** | Botnets (Storm, Waledac) |
+| **Benign Activity** | Real users |
 | | |
-| **Packed Size** | |
-| **Unpacked Size** | |
+| **Packed Size** | 3 GB |
+| **Unpacked Size** | 10,6 GB |
 | **Download Link** | [goto](https://drive.google.com/file/d/1X1zPBJFPHU1ToQbpyd1Is1tJJuz2BeRd/view) |
 
 ***
 
 ### Overview
-A general description of the dataset, giving a brief overview over origin, intended usage and some properties of the dataset.
+The ISOT Botnet dataset is an amalgamation of several individual datasets, two containing malicious botnet traffic, and five datasets consisting of benign traffic.
+Malicious data was taken from the "French Chapter" of the Honeynet project, while (anonymized) benign traces come from the LBNL Enterprise Trace Repository.
+The combination of these traces, after some preprocessing to make them appear as if they would stem from the same network, are then used to test several botnet detection methods leveraging network behavior analysis and machine learning.
+However, we were unable to find any information regarding the source of malicious traces, as linked pages no longer exist and further search remained fruitless.
 
 ### Environment
-A description of the environment the dataset originated from, including networks, operating systems, running services, etc.
+The merged dataset contains traces from 23 individual subnets, 22 with only benign traffic (stemming from the LBNL traces) and one with both malicious and benign traffic (merged traffic from both sources).
+The IPs of the latter subnet can be obtained from Table 2 of the linked documentation.
+Information regarding services, operating systems and so on are not available.
 
 ### Activity
-What kind of activity, benign and malicious, was performed during the period of data collection.
+Details regarding activity are not available;
+there might be some additional information hidden in LBNL publications, but we consider this to be out of scope.
 
 ### Contained Data
-What kind of data was collected and how it is present in the dataset, including any processing and labeling.
+As a first step to merge benign and malicious traces, the IP addresses of infected machines were mapped to two of the machines providing benign background traffic.
+Then, the authors used to the `TcpReplay` tool to replay all traces on the same network interface in order to homogenize the network behavior shown by individual datasets.
+These traces are simply available in the form of a single large pcap file with 1,675,424 unique flows, of which 3.33% are malicious.
+Labels are available via malicious traffic having a specific MAC, as per Table 2 of the linked documentation.
+
+It should be noted that the application of methods based on machine learning on merged datasets bears some additional risks;
+researchers must ensure that results are not a byproducts of anomalies that remained after the merging, which might not actually be caused by the malicious behavior, but rather the simple fact that these traces stemmed from separate environments.
 
 ### Papers
 - [Detecting P2P botnets through network behavior analysis and machine learning (2011)](https://doi.org/10.1109/PST.2011.5971980)
 
 ### Links
 - [Documentation](https://onlineacademiccommunity.uvic.ca/isot/wp-content/uploads/sites/7295/2023/03/ISOT-Dataset-Overview-v0.5.pdf)
-
-### Data Examples
-Snippet from the dataset, ideally one for each data type.
-Note that multi-word annotations (like `json lines`) will not render properly on GitHub Pages.
-Wrapping these snippets with `raw`/`endraw` is not strictly required, but prevents Liquid from parsing anything it shouldn't.
-
-
-```
-data example
-```
-
\ No newline at end of file
+- [LBNL/ICSI Enterprise Tracing Project](https://www.icir.org/enterprise-tracing/download.html)
\ No newline at end of file

From 0a9c1d2b803cf2ac1bcddf6b9bbbcb3e04a3e20b Mon Sep 17 00:00:00 2001
From: Philipp Boenninghausen
Date: Thu, 25 Apr 2024 15:09:54 +0200
Subject: [PATCH 09/16] Fix typos

---
 content/datasets/isot_botnet.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/content/datasets/isot_botnet.md b/content/datasets/isot_botnet.md
index cc594e1..7c92d28 100644
--- a/content/datasets/isot_botnet.md
+++ b/content/datasets/isot_botnet.md
@@ -52,7 +52,7 @@ These traces are simply available in the form of a single large pcap file with 1
 Labels are available via malicious traffic having a specific MAC, as per Table 2 of the linked documentation.

 It should be noted that the application of methods based on machine learning on merged datasets bears some additional risks;
-researchers must ensure that results are not a byproducts of anomalies that remained after the merging, which might not actually be caused by the malicious behavior, but rather the simple fact that these traces stemmed from separate environments.
+researchers must ensure that results are not a byproduct of anomalies that remained after the merging process, which might not actually be caused by the malicious behavior, but rather the simple fact that these traces stem from separate environments.

 ### Papers
 - [Detecting P2P botnets through network behavior analysis and machine learning (2011)](https://doi.org/10.1109/PST.2011.5971980)

From 4110f876f3cdff7c849b6ca26992f3094193caa0 Mon Sep 17 00:00:00 2001
From: Philipp Boenninghausen
Date: Thu, 25 Apr 2024 15:35:49 +0200
Subject: [PATCH 10/16] Fix typo

---
 content/datasets/unibs.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/content/datasets/unibs.md b/content/datasets/unibs.md
index 092a1d6..2b495fd 100644
--- a/content/datasets/unibs.md
+++ b/content/datasets/unibs.md
@@ -52,7 +52,7 @@ Intentional malicious activity is not present.

 ### Contained Data
 Traffic was collected from the central faculty router via `tcpdump` and enriched with ground truth from the GT tool (in the form of related protocol and application).
-It is available in an anonymized and payload-stripped form, presumably in as NetFlows, but has to be requested via mail.
+It is available in an anonymized and payload-stripped form, presumably as NetFlows, but has to be requested via mail.

 ### Papers
 - [GT: picking up the truth from the ground for internet traffic (2009)](https://doi.org/10.1145/1629607.1629610)

From 1052529a9a6b19d3f1f3b80c71e5ceaab39b090a Mon Sep 17 00:00:00 2001
From: Philipp Boenninghausen
Date: Thu, 25 Apr 2024 16:06:24 +0200
Subject: [PATCH 11/16] Update CSV file

---
 assets/data/datasets.csv | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/assets/data/datasets.csv b/assets/data/datasets.csv
index 545d930..3fa53a6 100644
--- a/assets/data/datasets.csv
+++ b/assets/data/datasets.csv
@@ -41,9 +41,11 @@ TUIDS;Yes;No;2012;2012;Enterprise IT;Undisclosed;pcaps, NetFlows;Features are la
 VAST Challenge 2012;Yes;No;2012;2012;Enterprise IT;Undisclosed;Firewall and IDS logs;Ground truth provided;-;-;BotNet;Presumably synthetic, but not detailed;186.0;2900.0;
 CTU 13;Yes;No;2011;2011;Enterprise IT;Windows, Undisclosed;pcaps, NetFlows;Yes, NetFlows are labeled;-;-;Various Botnet activity, (Neris, Rbot, Virut, Menti, Sogou, Murlo, NSIS.ay);Real background traffic;;697000.0;
 VAST Challenge 2011;Yes;No;2011;2011;Enterprise IT;Windows;Firewall, Snort;Ground truth provided;OS Security Events;Ground truth provided;Reconnaissance, DoS, Persistence;Present, but not further explained;940.0;9300.0;
+ISOT Botnet;Yes;No;2010;2010;Enterprise IT;Undisclosed;pcaps;Yes;-;-;Botnets (Storm, Waledac);Real users;3000.0;10600.0;
 CDX CTF 2009;Yes;No;2009;2009;Enterprise IT;Windows, Linux;pcaps, snort IDS alerts;Some ground truth provided;Apache web server logs, Splunk logs;No;n/a;Synthetic, via scripts;12000.0;15300.0;
 NSL-KDD;Yes;No;2009;2009;Military IT;Unix;Connection records;Yes;-;-;DoS, Remote to Local, User to Root, Surveillance/Probing;Scripts for synthetic traffic generation, real humans for performing complex tasks;6.0;19.0;
 Twente 2009;Yes;No;2009;2009;Single OS;Linux;NetFlows;Yes;-;-;Diverse;None;303.0;1900.0;
+UNIBS;No;No;2009;2009;Enterprise IT;Undisclosed;NetFlows;No;-;-;None;Real users;;2700.0;
 gureKDDCup;Yes;No;2008;2008;Military
IT;Unix;Connection records with payload;Yes;-;-;DoS, Remote to Local, User to Root, Surveillance/Probing;Scripts for synthetic traffic generation, real humans for performing complex tasks;10000.0;;
 KDD Cup 1999;Yes;No;1999;1999;Military IT;Unix;Connection records;Yes;-;-;DoS, Remote to Local, User to Root, Surveillance/Probing;Scripts for synthetic traffic generation, real humans for performing complex tasks;18.0;743.0;
 DARPA'98 Intrusion Detection Program;Yes;No;1998;1998;Military IT;Unix;tcpdumps;Ground truth provided;bsm audits, file system dumps;No;DoS, Remote to Local, User to Root, Surveillance/Probing;Scripts for synthetic traffic generation, real humans for performing complex tasks;5000.0;;

From ed88d4de01a35b8973a63663f6d73ba278a0328b Mon Sep 17 00:00:00 2001
From: Philipp Boenninghausen
Date: Mon, 3 Jun 2024 16:17:41 +0200
Subject: [PATCH 12/16] Add entry for UWF-ZeekData22

---
 content/all_datasets.md | 101 ++++++++++++-------------
 content/datasets/uwf_zeekdata22.md | 116 +++++++++++++++++++++++++++++
 2 files changed, 167 insertions(+), 50 deletions(-)
 create mode 100644 content/datasets/uwf_zeekdata22.md

diff --git a/content/all_datasets.md b/content/all_datasets.md
index 8e91263..a2ae445 100644
--- a/content/all_datasets.md
+++ b/content/all_datasets.md
@@ -6,56 +6,57 @@ full-width: true
 before-content: gh_buttons.html
 ---

-| Name | Network/Host Data | TL;DR | Year | Setting | OS Type | Labeled?¹ | Data Type/Source | Packed Size | Unpacked Size |
-|----------------------------------------------------------------------------------------------------|:-----------------:|-------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------:|---------------|-----------------------|:---------:|--------------------------------------------------------------------------------------|------------:|--------------:|
-| [AIT Alert
Dataset](../datasets/ait_alert_dataset) | Both | Alerts generated from the AIT log dataset, including labels. Only caveat is the lack of Windows machines | 2023 | Enterprise IT | Linux | 🟩 | Wazuh, Suricata and AMiner alerts | 96 MB | 2,9 GB |
-| [OTFR Security Datasets - LSASS Campaign](../datasets/otfr_lsass_campaign) | Both | Very small simulation focusing on exploiting Windows' LSASS.exe. Lacking documentation, no labels and no user behavior | 2023 | Single OS | Windows | 🟥 | pcaps, Windows events, Zeek logs | 423 MB | 1 GB |
-| [AIT Log Dataset](../datasets/ait_log_dataset) | Both | Huge variety of labeled logs collected from multiple simulation runs of an enterprise network under attack. With user emulation. but only Linux machines | 2022 | Enterprise IT | Linux | 🟩 | pcaps, Suricata alerts, misc. logs (Apache, auth, dns, vpn, audit, suricata, syslog) | 130 GB | 206 GB |
-| [CLUE-LDS](../datasets/clue_lds) | Host | Database of real user behavior without known attacks, for evaluation of methods detecting shifts in user behavior | 2022 | Subsystem | Undisclosed | 🟥 | Custom event logs | 640 MB | 14,9 GB |
-| [EVTX to MITRE ATT&CK](../datasets/evtx_to_mitre_attck) | Host | Small dataset providing various events corresponding to certain MITRE tactics/techniques | 2022 | Single OS | Windows | 🟩 | Windows events | <1 GB | <1 GB |
-| [OTFR Security Datasets - Atomic](../datasets/otfr_atomic) | Both | Various small datasets, each corresponding to a specific MITRE tactic/technique.
Lacks user simulation / underlying scenario and does not provide explicit labels | 2019-2022 | Single OS | Windows, Linux, Cloud | 🟨 | pcaps, Windows events, auditd logs, AWS CloudTrail logs | 125 MB | - |
-| [PWNJUTSU](../datasets/pwnjutsu) | Both | Rich collection of complex attacks executed by various red team participants each acting in a small network, but not labeled | 2022 | Miscellaneous | Windows, Linux | 🟥 | pcaps, Windows events, Sysmon, auditd, various logs (Apache, auth, dns, ssh, etc.) | 82 GB | - |
-| [NF-UQ-NIDS](../datasets/nf_uq_nids) | Network | Combination of four distinct network datasets using a newly proposed set of standardized features | 2021 | Miscellaneous | Windows, Linux, MacOS | 🟩 | Custom NetFlows | 2 GB | 14,8 GB |
-| [OTFR Security Datasets - Log4Shell](../datasets/otfr_log4shell) | Both | Very small simulation focusing on the Log4j vulnerability. Lacking documentation, no explicit labels and no user behavior | 2021 | Single OS | Linux | 🟨 | pcaps, Ubuntu events | <1 MB | 1 MB |
-| [OTFR Security Datasets - SimuLand Golden SAML](../datasets/otfr_golden_saml) | Host | Barely a dataset, only contains very few traces for some specific events. At most usable to test specific Windows detection rules. | 2021 | Enterprise IT | Windows | 🟩 | Windows Events | - | <1 MB |
-| [SOCBED Example Dataset](../datasets/socbed_dataset) | Both | Generated using the SOCBED framework, demonstrating reproducible dataset creation, though current attacks are on the basic side | 2021 | Enterprise IT | Windows, Linux | 🟥 | Windows events, Linux events, packetbeat | 78 MB | 1,3 GB |
-| [Unraveled](../datasets/unraveled) | Both | Large dataset with intricate labeling, though the focus seems to be on network flows. Mapping will be annoying. | 2021 | Enterprise IT | Windows, Linux | 🟩 | pcaps, misc.
logs (syslog, audit, auth, Snort) | - | 22 GB |
-| [DAPT 2020](../datasets/dapt2020) | Both | Focuses on attacks mimicking those of an APT group, executed in a rather small environment | 2020 | Enterprise IT | Undisclosed | 🟩 | NetFlows, misc. logs (DNS, syslog, auditd, apache, auth, various services) | 460 MB | - |
-| [OpTC](../datasets/optc) | Both | Huge amount of data and interesting attacks, but possibly hard to use due to uncommon event format and requiring semi-manual labeling | 2020 | Enterprise IT | Windows | 🟨 | Custom event logs, Zeek events | - | 1 TB |
-| [OTFR Security Datasets - APT 29](../datasets/otfr_apt_29) | Both | Replication of APT29 evaluation developed by MITRE. Well made and documented, but without labels or user behavior | 2020 | Enterprise IT | Windows, Linux | 🟥 | pcaps, Windows events, Zeek events | 126 MB | 2 GB |
-| [CICDDoS2019](../datasets/cic_ddos) | Network | Dataset focusing on various DDoS attacks, covering a broad range of categories. Includes benign behavior, but only for Pcaps, not NetFlows | 2019 | Enterprise IT | Windows, Linux | 🟩 | Pcaps, NetFlows, Windows events, Ubuntu events | 24,4 GB | - |
-| [DARPA TC5](../datasets/darpa_tc5) | Host | Custom event logs from network under attack from APT groups, designed to facilitate provenance tracking | 2019 | Undisclosed | Undisclosed | 🟨 | Custom event logs | - | - |
-| [LID-DS 2019](../datasets/lids_ds_2019) | Host | Contains system calls + associated data/metadata for a variety of Linux exploits, includes normal behavior | 2019 | Single OS | Linux | 🟨 | Sequences of syscalls with extended information | 13 GB | - |
-| [OTFR Security Datasets - APT 3](../datasets/otfr_apt_3) | Host | Replication of APT3 evaluation developed by MITRE.
Lacking documentation, no labels and no user behavior | 2019 | Enterprise IT | Windows, Linux | 🟥 | Windows events | 30 MB | 855 MB |
-| [ASNM Datasets](../datasets/asnm_datasets) | Network | Specialized features extracted from instances of remote buffer overflow attacks for the purpose of anomaly-based detection | 2009-2018 | Miscellaneous | Windows, Linux | 🟩 | Custom NetFlows | 21 MB | 95 GB |
-| [AWSCTD](../datasets/awsctd) | Host | Syscalls collected from ~10k malware samples running on Windows 7, no user emulation | 2018 | Single OS | Windows | 🟩 | Sequences of syscall numbers | 10 MB | 558 MB |
-| [CSE-CIC-IDS2018](../datasets/cse_cic_ids2018) | Both | Simulation of large enterprise IT (450 machines) with user emulation and various attacks, includes host and network logs, but only the latter are labeled | 2018 | Enterprise IT | Windows, Linux, MacOS | 🟩 | pcaps, NetFlows, Windows events, Ubuntu events | 220 GB | - |
-| [DARPA TC3](../datasets/darpa_tc3) | Host | Custom event logs from network under attack, designed to facilitate provenance tracking | 2018 | Undisclosed | Undisclosed | 🟨 | Custom event logs | 115 GB | - |
-| [NGIDS-DS](../datasets/nigds_dataset) | Both | Enterprise network undergoing variety of attacks using IXIA PerfectStorm hardware.
Seems to lack host user behavior, does not provide raw host logs | 2018 | Enterprise IT | Linux | 🟩 | pcaps, custom host features | 941 MB | 13,4 GB |
-| [CIC DoS](../datasets/cic_dos) | Network | Dataset focusing on different DoS attacks targeting the application layer (instead of network layer), but no longer available | 2017 | Enterprise IT | Linux | 🟩 | Network traffic (unknown format) | - | 4,6 GB |
-| [CIC-IDS2017](../datasets/cic_ids2017) | Network | Simulation of medium-sized company network under attack, focuses solely on network traffic | 2017 | Enterprise IT | Windows, Linux | 🟩 | pcaps, NetFlows, custom network features | 48,4 GB | 50 GB |
-| [Unified Host and Network Data Set](../datasets/unified_host_and_network_dataset) | Both | Selection of network and host events collected from operational environment, but without any attacks | 2017 | Enterprise IT | Windows, Linux | 🟥 | NetFlows, Windows events | - | - |
-| [UGR'16](../datasets/ugr16) | Network | Network flows collected from real network over a long period of time, with some attack traffic injected | 2016 | Enterprise IT | Undisclosed | 🟩 | NetFlows | 236 GB | - |
-| [Comprehensive, Multi-Source Cyber-Security Events](../datasets/comp_multi_source_cybersec_events) | Both | Various events from production network with red team activity, but extremely limited information per event | 2015 | Enterprise IT | Windows, Linux | 🟩 | Custom event logs (auth, proc, network flows, dns, redteam) | 12 GB | - |
-| [Kyoto Honeypot](../datasets/kyoto_honeypot) | Network | Collection of features derived from attack traffic targeting honeypots over the span of 9 years | 2006-2015 | Miscellaneous | Windows, Unix, MacOS | 🟩 | Custom network features | 20 GB | - |
-| [UNSW-NB15](../datasets/unsw_nb15) | Network | Custom network undergoing a variety of attacks using IXIA PerfectStorm hardware.
Mostly geared towards anomaly-based NIDS | 2015 | Undisclosed | Undisclosed | 🟩 | pcaps, custom network features | >100 GB | - |
-| [ADFA-WD](../datasets/adfa_wd) | Host | Mostly intended for anomaly-based stuff leveraging library calls, explores interesting concept of stealthy shellcode | 2014 | Single OS | Windows | 🟨 | Sequences of dll calls, Windows events (dll calls only) | 403 MB | 13,6 GB |
-| [Skopik 2014](../datasets/skopik_et_al) | Host | Focus on realistically emulating user behavior, does not include attacks | 2014 | Enterprise IT | Linux | 🟥 | misc. logs (Apache, database, mail server, bug tracker app) | - | - |
-| [Twente 2014](../datasets/twente_2014) | Both | Anonymized network flows and host logs from real network, but only those related to ssh authentication, focusing on detecting related brute force attacks | 2014 | Enterprise IT | Undisclosed | 🟩 | NetFlows | 2,42 GB | 5,8 GB |
-| [User-Computer Associations in Time](../datasets/user_computer_associations) | Host | Large number of authentication events over a period of 9 months, but with very little detail and without any attacks | 2014 | Enterprise IT | Undisclosed | 🟥 | Custom auth event logs | 2,3 GB | - |
-| [ADFA-LD](../datasets/adfa_ld) | Host | Purely intended for anomaly-based approaches, provides only syscall numbers | 2013 | Single OS | Linux | 🟩 | Sequences of syscall numbers | 2 MB | 17 MB |
-| [CIDD](../datasets/cidd) | Network | Spin on the DARPA'98 dataset, correlating user behavior over different systems/environments for behavior-based IDSs | 2012 | Military IT | Unix | 🟩 | Sequences of user "audits" | - | 22 GB |
-| [ISCX IDS 2012](../datasets/iscx_ids_2012) | Network | Focus on realistic traffic generation in a company network, combined with some basic attacks | 2012 | Enterprise IT | Windows, Linux | 🟩 | pcaps | 84 GB | 87 GB |
-| [TUIDS](../datasets/tuids) | Network | Dataset focusing on DoS attacks, but very poorly documented | 2012 | Enterprise IT |
Undisclosed | 🟩 | pcaps, NetFlows | - | - |
-| [VAST Challenge 2012](../datasets/vast_2012) | Network | Originated from a challenge about data analytics, focus an a large network being the victim of a botnet | 2012 | Enterprise IT | Undisclosed | 🟨 | Snort alerts, firewall logs | 186 MB | 2,9 GB |
-| [CTU 13](../datasets/ctu_13) | Network | Collection of various botnet behavior combined with loads of background traffic, but very limited feature space | 2011 | Enterprise IT | Windows, Undisclosed | 🟩 | pcaps, NetFlows, Bro logs | - | 697 GB |
-| [VAST Challenge 2011](../datasets/vast_2011) | Both | Originated from a challenge about data analytics, focus on network but also contains host logs. Labeling is a bit lacking | 2011 | Enterprise IT | Windows | 🟨 | pcaps, Windows events, misc. logs (firewall, Snort, Nessus) | 940 MB | 9,3 GB |
-| [CDX CTF 2009](../datasets/cdx_2009) | Both | Dataset captured from a CTF event, generally intended to provide methods for reliable generating labeled datasets from such events | 2009 | Enterprise IT | Windows, Linux | 🟨 | pcaps, Snort IDS alerts, Apache logs, Splunk logs | 12 GB | 15,3 GB |
-| [NSL-KDD](../datasets/nsl_kdd_dataset) | Network | An improvement of the original KDD'99 dataset, but still outdated at its core | 2009 | Military IT | Unix | 🟩 | Connection records | 6 MB | 19 MB |
-| [Twente 2009](../datasets/twente_2009) | Network | Intricately labeled network flows + alerts collected from a single honeypot over the span of 6 days | 2009 | Single OS | Linux | 🟩 | NetFlows | 303 MB | 1,9 GB |
-| [gureKDDCup](../datasets/gure_kddcup) | Network | An extension of the KDDCup 1999 dataset, adding additional information about payloads to each connection record | 2008 | Military IT | Unix | 🟩 | Connection records with payload information | 10 GB | - |
-| [KDD Cup 1999](../datasets/kdd_cup_1999) | Network | Network connection events derived from simulated U.S. Air Force network under attack.
No longer appropriate to use for multiple reasons | 1999 | Military IT | Unix | 🟩 | Connection records | 18 MB | 743 MB |
-| [DARPA'98 Intrusion Detection Program](../datasets/darpa98) | Both | Simulation of a small U.S. Air Force network under attack. No longer appropriate to use for a multiple reasons | 1998 | Military IT | Unix | 🟨 | tcpdumps, host audit logs, file system dumps | 5 GB | - |
+| Name | Network/Host Data | TL;DR | Year | Setting | OS Type | Labeled?¹ | Data Type/Source | Packed Size | Unpacked Size |
+|----------------------------------------------------------------------------------------------------|:-----------------:|----------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------:|---------------|-----------------------|:---------:|--------------------------------------------------------------------------------------|------------:|--------------:|
+| [AIT Alert Dataset](../datasets/ait_alert_dataset) | Both | Alerts generated from the AIT log dataset, including labels. Only caveat is the lack of Windows machines | 2023 | Enterprise IT | Linux | 🟩 | Wazuh, Suricata and AMiner alerts | 96 MB | 2,9 GB |
+| [OTFR Security Datasets - LSASS Campaign](../datasets/otfr_lsass_campaign) | Both | Very small simulation focusing on exploiting Windows' LSASS.exe. Lacking documentation, no labels and no user behavior | 2023 | Single OS | Windows | 🟥 | pcaps, Windows events, Zeek logs | 423 MB | 1 GB |
+| [AIT Log Dataset](../datasets/ait_log_dataset) | Both | Huge variety of labeled logs collected from multiple simulation runs of an enterprise network under attack. With user emulation. but only Linux machines | 2022 | Enterprise IT | Linux | 🟩 | pcaps, Suricata alerts, misc.
logs (Apache, auth, dns, vpn, audit, suricata, syslog) | 130 GB | 206 GB |
+| [CLUE-LDS](../datasets/clue_lds) | Host | Database of real user behavior without known attacks, for evaluation of methods detecting shifts in user behavior | 2022 | Subsystem | Undisclosed | 🟥 | Custom event logs | 640 MB | 14,9 GB |
+| [EVTX to MITRE ATT&CK](../datasets/evtx_to_mitre_attck) | Host | Small dataset providing various events corresponding to certain MITRE tactics/techniques | 2022 | Single OS | Windows | 🟩 | Windows events | <1 GB | <1 GB |
+| [OTFR Security Datasets - Atomic](../datasets/otfr_atomic) | Both | Various small datasets, each corresponding to a specific MITRE tactic/technique. Lacks user simulation / underlying scenario and does not provide explicit labels | 2019-2022 | Single OS | Windows, Linux, Cloud | 🟨 | pcaps, Windows events, auditd logs, AWS CloudTrail logs | 125 MB | - |
+| [PWNJUTSU](../datasets/pwnjutsu) | Both | Rich collection of complex attacks executed by various red team participants each acting in a small network, but not labeled | 2022 | Miscellaneous | Windows, Linux | 🟥 | pcaps, Windows events, Sysmon, auditd, various logs (Apache, auth, dns, ssh, etc.) | 82 GB | - |
+| [UWF-ZeekData22](../datasets/uwf_zeekdata22) | Network | Traffic collected from a university's wargaming course. Covers all MITRE tactics, though the overwhelming majority is simple recon and attacks are poorly documented | 2022 | Enterprise IT | Windows, Linux | 🟩 | pcaps, Zeek logs | - | 209 GB |
+| [NF-UQ-NIDS](../datasets/nf_uq_nids) | Network | Combination of four distinct network datasets using a newly proposed set of standardized features | 2021 | Miscellaneous | Windows, Linux, MacOS | 🟩 | Custom NetFlows | 2 GB | 14,8 GB |
+| [OTFR Security Datasets - Log4Shell](../datasets/otfr_log4shell) | Both | Very small simulation focusing on the Log4j vulnerability.
Lacking documentation, no explicit labels and no user behavior | 2021 | Single OS | Linux | 🟨 | pcaps, Ubuntu events | <1 MB | 1 MB |
+| [OTFR Security Datasets - SimuLand Golden SAML](../datasets/otfr_golden_saml) | Host | Barely a dataset, only contains very few traces for some specific events. At most usable to test specific Windows detection rules. | 2021 | Enterprise IT | Windows | 🟩 | Windows Events | - | <1 MB |
+| [SOCBED Example Dataset](../datasets/socbed_dataset) | Both | Generated using the SOCBED framework, demonstrating reproducible dataset creation, though current attacks are on the basic side | 2021 | Enterprise IT | Windows, Linux | 🟥 | Windows events, Linux events, packetbeat | 78 MB | 1,3 GB |
+| [Unraveled](../datasets/unraveled) | Both | Large dataset with intricate labeling, though the focus seems to be on network flows. Mapping will be annoying. | 2021 | Enterprise IT | Windows, Linux | 🟩 | pcaps, misc. logs (syslog, audit, auth, Snort) | - | 22 GB |
+| [DAPT 2020](../datasets/dapt2020) | Both | Focuses on attacks mimicking those of an APT group, executed in a rather small environment | 2020 | Enterprise IT | Undisclosed | 🟩 | NetFlows, misc. logs (DNS, syslog, auditd, apache, auth, various services) | 460 MB | - |
+| [OpTC](../datasets/optc) | Both | Huge amount of data and interesting attacks, but possibly hard to use due to uncommon event format and requiring semi-manual labeling | 2020 | Enterprise IT | Windows | 🟨 | Custom event logs, Zeek events | - | 1 TB |
+| [OTFR Security Datasets - APT 29](../datasets/otfr_apt_29) | Both | Replication of APT29 evaluation developed by MITRE. Well made and documented, but without labels or user behavior | 2020 | Enterprise IT | Windows, Linux | 🟥 | pcaps, Windows events, Zeek events | 126 MB | 2 GB |
+| [CICDDoS2019](../datasets/cic_ddos) | Network | Dataset focusing on various DDoS attacks, covering a broad range of categories.
Includes benign behavior, but only for Pcaps, not NetFlows | 2019 | Enterprise IT | Windows, Linux | 🟩 | Pcaps, NetFlows, Windows events, Ubuntu events | 24,4 GB | - |
+| [DARPA TC5](../datasets/darpa_tc5) | Host | Custom event logs from network under attack from APT groups, designed to facilitate provenance tracking | 2019 | Undisclosed | Undisclosed | 🟨 | Custom event logs | - | - |
+| [LID-DS 2019](../datasets/lids_ds_2019) | Host | Contains system calls + associated data/metadata for a variety of Linux exploits, includes normal behavior | 2019 | Single OS | Linux | 🟨 | Sequences of syscalls with extended information | 13 GB | - |
+| [OTFR Security Datasets - APT 3](../datasets/otfr_apt_3) | Host | Replication of APT3 evaluation developed by MITRE. Lacking documentation, no labels and no user behavior | 2019 | Enterprise IT | Windows, Linux | 🟥 | Windows events | 30 MB | 855 MB |
+| [ASNM Datasets](../datasets/asnm_datasets) | Network | Specialized features extracted from instances of remote buffer overflow attacks for the purpose of anomaly-based detection | 2009-2018 | Miscellaneous | Windows, Linux | 🟩 | Custom NetFlows | 21 MB | 95 GB |
+| [AWSCTD](../datasets/awsctd) | Host | Syscalls collected from ~10k malware samples running on Windows 7, no user emulation | 2018 | Single OS | Windows | 🟩 | Sequences of syscall numbers | 10 MB | 558 MB |
+| [CSE-CIC-IDS2018](../datasets/cse_cic_ids2018) | Both | Simulation of large enterprise IT (450 machines) with user emulation and various attacks, includes host and network logs, but only the latter are labeled | 2018 | Enterprise IT | Windows, Linux, MacOS | 🟩 | pcaps, NetFlows, Windows events, Ubuntu events | 220 GB | - |
+| [DARPA TC3](../datasets/darpa_tc3) | Host | Custom event logs from network under attack, designed to facilitate provenance tracking | 2018 | Undisclosed | Undisclosed | 🟨 | Custom event logs | 115 GB | - |
+| [NGIDS-DS](../datasets/nigds_dataset) | Both | Enterprise
network undergoing variety of attacks using IXIA PerfectStorm hardware. Seems to lack host user behavior, does not provide raw host logs | 2018 | Enterprise IT | Linux | 🟩 | pcaps, custom host features | 941 MB | 13,4 GB |
+| [CIC DoS](../datasets/cic_dos) | Network | Dataset focusing on different DoS attacks targeting the application layer (instead of network layer), but no longer available | 2017 | Enterprise IT | Linux | 🟩 | Network traffic (unknown format) | - | 4,6 GB |
+| [CIC-IDS2017](../datasets/cic_ids2017) | Network | Simulation of medium-sized company network under attack, focuses solely on network traffic | 2017 | Enterprise IT | Windows, Linux | 🟩 | pcaps, NetFlows, custom network features | 48,4 GB | 50 GB |
+| [Unified Host and Network Data Set](../datasets/unified_host_and_network_dataset) | Both | Selection of network and host events collected from operational environment, but without any attacks | 2017 | Enterprise IT | Windows, Linux | 🟥 | NetFlows, Windows events | - | - |
+| [UGR'16](../datasets/ugr16) | Network | Network flows collected from real network over a long period of time, with some attack traffic injected | 2016 | Enterprise IT | Undisclosed | 🟩 | NetFlows | 236 GB | - |
+| [Comprehensive, Multi-Source Cyber-Security Events](../datasets/comp_multi_source_cybersec_events) | Both | Various events from production network with red team activity, but extremely limited information per event | 2015 | Enterprise IT | Windows, Linux | 🟩 | Custom event logs (auth, proc, network flows, dns, redteam) | 12 GB | - |
+| [Kyoto Honeypot](../datasets/kyoto_honeypot) | Network | Collection of features derived from attack traffic targeting honeypots over the span of 9 years | 2006-2015 | Miscellaneous | Windows, Unix, MacOS | 🟩 | Custom network features | 20 GB | - |
+| [UNSW-NB15](../datasets/unsw_nb15) | Network | Custom network undergoing a variety of attacks using IXIA PerfectStorm hardware.
Mostly geared towards anomaly-based NIDS | 2015 | Undisclosed | Undisclosed | 🟩 | pcaps, custom network features | >100 GB | - |
+| [ADFA-WD](../datasets/adfa_wd) | Host | Mostly intended for anomaly-based stuff leveraging library calls, explores interesting concept of stealthy shellcode | 2014 | Single OS | Windows | 🟨 | Sequences of dll calls, Windows events (dll calls only) | 403 MB | 13,6 GB |
+| [Skopik 2014](../datasets/skopik_et_al) | Host | Focus on realistically emulating user behavior, does not include attacks | 2014 | Enterprise IT | Linux | 🟥 | misc. logs (Apache, database, mail server, bug tracker app) | - | - |
+| [Twente 2014](../datasets/twente_2014) | Both | Anonymized network flows and host logs from real network, but only those related to ssh authentication, focusing on detecting related brute force attacks | 2014 | Enterprise IT | Undisclosed | 🟩 | NetFlows | 2,42 GB | 5,8 GB |
+| [User-Computer Associations in Time](../datasets/user_computer_associations) | Host | Large number of authentication events over a period of 9 months, but with very little detail and without any attacks | 2014 | Enterprise IT | Undisclosed | 🟥 | Custom auth event logs | 2,3 GB | - |
+| [ADFA-LD](../datasets/adfa_ld) | Host | Purely intended for anomaly-based approaches, provides only syscall numbers | 2013 | Single OS | Linux | 🟩 | Sequences of syscall numbers | 2 MB | 17 MB |
+| [CIDD](../datasets/cidd) | Network | Spin on the DARPA'98 dataset, correlating user behavior over different systems/environments for behavior-based IDSs | 2012 | Military IT | Unix | 🟩 | Sequences of user "audits" | - | 22 GB |
+| [ISCX IDS 2012](../datasets/iscx_ids_2012) | Network | Focus on realistic traffic generation in a company network, combined with some basic attacks | 2012 | Enterprise IT | Windows, Linux | 🟩 | pcaps | 84 GB | 87 GB |
+| [TUIDS](../datasets/tuids) | Network | Dataset focusing on DoS attacks, but very poorly documented | 2012 | Enterprise IT |
Undisclosed | 🟩 | pcaps, NetFlows | - | - |
+| [VAST Challenge 2012](../datasets/vast_2012) | Network | Originated from a challenge about data analytics, focus an a large network being the victim of a botnet | 2012 | Enterprise IT | Undisclosed | 🟨 | Snort alerts, firewall logs | 186 MB | 2,9 GB |
+| [CTU 13](../datasets/ctu_13) | Network | Collection of various botnet behavior combined with loads of background traffic, but very limited feature space | 2011 | Enterprise IT | Windows, Undisclosed | 🟩 | pcaps, NetFlows, Bro logs | - | 697 GB |
+| [VAST Challenge 2011](../datasets/vast_2011) | Both | Originated from a challenge about data analytics, focus on network but also contains host logs. Labeling is a bit lacking | 2011 | Enterprise IT | Windows | 🟨 | pcaps, Windows events, misc. logs (firewall, Snort, Nessus) | 940 MB | 9,3 GB |
+| [CDX CTF 2009](../datasets/cdx_2009) | Both | Dataset captured from a CTF event, generally intended to provide methods for reliable generating labeled datasets from such events | 2009 | Enterprise IT | Windows, Linux | 🟨 | pcaps, Snort IDS alerts, Apache logs, Splunk logs | 12 GB | 15,3 GB |
+| [NSL-KDD](../datasets/nsl_kdd_dataset) | Network | An improvement of the original KDD'99 dataset, but still outdated at its core | 2009 | Military IT | Unix | 🟩 | Connection records | 6 MB | 19 MB |
+| [Twente 2009](../datasets/twente_2009) | Network | Intricately labeled network flows + alerts collected from a single honeypot over the span of 6 days | 2009 | Single OS | Linux | 🟩 | NetFlows | 303 MB | 1,9 GB |
+| [gureKDDCup](../datasets/gure_kddcup) | Network | An extension of the KDDCup 1999 dataset, adding additional information about payloads to each connection record | 2008 | Military IT | Unix | 🟩 | Connection records with payload information | 10 GB | - |
+| [KDD Cup 1999](../datasets/kdd_cup_1999) | Network | Network connection events derived from simulated U.S. Air Force network under attack.
No longer appropriate to use for multiple reasons | 1999 | Military IT | Unix | 🟩 | Connection records | 18 MB | 743 MB |
+| [DARPA'98 Intrusion Detection Program](../datasets/darpa98) | Both | Simulation of a small U.S. Air Force network under attack. No longer appropriate to use for a multiple reasons | 1998 | Military IT | Unix | 🟨 | tcpdumps, host audit logs, file system dumps | 5 GB | - |
 ### Legend
diff --git a/content/datasets/uwf_zeekdata22.md b/content/datasets/uwf_zeekdata22.md
new file mode 100644
index 0000000..e815bfa
--- /dev/null
+++ b/content/datasets/uwf_zeekdata22.md
@@ -0,0 +1,116 @@
+---
+title: UWF-ZeekData22
+---
+
+- [Overview](#overview)
+- [Environment](#environment)
+- [Activity](#activity)
+- [Contained Data](#contained-data)
+- [Papers](#papers)
+- [Links](#links)
+- [Data Examples](#data-examples)
+
+| | |
+|--------------------------|--------------------------------------------------------------------------------|
+| **Network Data Source** | |
+| **Network Data Labeled** | |
+| **Host Data Source** | |
+| **Host Data Labeled** | |
+| | |
+| **Overall Setting** | Enterprise IT |
+| **OS Types** | Windows 10/2008 Metasploitable3<br>Debian 11<br>Ubuntu 14.04 Metasploitable3 |
+| **Number of Machines** | 6 |
+| **Total Runtime** | 64 days |
+| **Year of Collection** | 2022 |
+| **Attack Categories** | All MITRE tactics |
+| **User Emulation** | n/a |
+| | |
+| **Packed Size** | - |
+| **Unpacked Size** | 209 GB |
+| **Download Link** | [goto](https://datasets.uwf.edu/data/UWF-ZeekData22/) |
+
+***
+
+### Overview
+The University of West Florida Zeek Dataset (UWF-ZeekData22) is consists of 64 days network traffic and related Zeek logs, collected from a "cyber wargaming course" held at the same university.
+This course leveraged the UWF's cyber range, a virtualized and relatively diverse environment of different systems which participants were instructed to attack and defend.
+The datasets defining feature is the inclusion of MITRE tactic labels assigned to each packet or log, potentially allowing for attack chain detection or similar use cases.
+However, the vast majority (>99.9%) of malicious traffic consists of simple reconnaissance, and, apart from statistics, there is very little information about individual attacks.
+The authors also detail the process of collecting these large amounts of data with a dedicated solution (Apache Hadoop), though this is considered out of scope for this survey.
+
+### Environment
+As mentioned, course participants leveraged the universities cyber range.
+Although the authors state that their dataset contains thousands of distinct IP addresses, this is most likely caused by the fact that each group of students (81 in total) was assigned their own environment (as opposed to one actually large network).
+Each individual network hosts, presumably, six machines with different versions of Windows and Linux operating systems, running various, partially vulnerable, services - presumably, because Section 4 of the underlying paper [1] is pretty unclear in this regard.
+
+Traffic is captured on one of these VMs and sent to a Hadoop instance, a framework for the distributed storage and processing of large datasets.
+The same VM also generates various Zeek logs, which are forwarded in the same manner.
+
+### Activity
+The collection period lasted from 2021/12/12 to 2022/02/20, with a break of six days, for a total of 64 days.
+While attacks cover the entire range of MITRE tactics (14 at the time of writing), no detail at all is provided regarding the way in which these attacks were executed;
+only the number of instances per attack tactic is available:
+- Reconnaissance: 9.278.722
+- Discovery: 2.086
+- Credential Access: 31
+- Privilege Escalation: 13
+- Exfiltration: 7
+- Lateral Movement: 4
+- Resource Development: 3
+- Initial Access: 1
+- Persistence: 1
+- Defense Evasion: 1
+
+In other words, the vast majority of malicious traffic most likely consists of port scans and similar trivial operations.
+Additionally, while there seems to be some form of benign activity, it is in no way documented.
+
+### Contained Data
+Data is generally available in three different formats, all of which are labelled with the associated MITRE tactic:
+- pcaps: Contains captured traffic.
+Note that these are in a [custom binary format](https://docs.securityonion.net/en/latest/stenographer.html) generated by Security Onion.
+These files are divided into thousands of smaller files, each covering roughly one minute of traffic.
+- parquet: A binary column-oriented data storage format (basically a faster version of CSV when working with large files).
+Contained information is very similar to that of network flows (see example of CSV files below).
+There are eight files in total, each covering eight days of traffic.
+- CSV: A subset of the aforementioned parquet files, which, according to the authors, were mainly made available for people who do not have access to "Big Data" technologies.
+These files contain data from 2022/02/10, 0300-0600, 0900-1000, and 1400-1500, with one file per hour, thus five files in total. +Each file contains one million entries with a benign/attack ratio of about 80/20. +For attacks, only the tactics "Reconnaissance" and "Discovery" are included. + +It is unclear where exactly Zeek logs can be found, though I am assuming they are part of the PCAP files. +The authors leverage what they call "mission logs" to perform labeling, though the nature of these logs is not further detailed. +Section 6.1 in [1] seems to suggest that these are manually created by participants, who document their current activity in the form of timestamps, ports, IPs, tactics, etc. + +### Papers +- [[1] Introducing UWF-ZeekData22: A Comprehensive Network Traffic Dataset Based on the MITRE ATT&CK Framework (2022)](https://doi.org/10.3390/data8010018) + +### Links +- [Homepage](https://datasets.uwf.edu/) + - [Types of collected Zeek logs](https://datasets.uwf.edu/tables/table1.html) + - [Attributes per Zeek log type](https://datasets.uwf.edu/tables/table2.html) + +### Data Examples +Traffic information in CSV format taken from `csv/part-00000-d32a9d5e-45b7-4e51-807e-1af297aba2df-c000.csv` + + +``` +resp_pkts,service,orig_ip_bytes,local_resp,missed_bytes,protocol,duration,conn_state,dest_ip,orig_pkts,community_id,resp_ip_bytes,dest_port,orig_bytes,local_orig,datetime,history,resp_bytes,uid,src_port,ts,src_ip,mitre_attack_tactics +2,dns,186,false,0,udp,0.002279996871948242,SF,143.88.5.1,2,1:Z2qpnUv+rxq4N1rn7Go962U/gi8=,186,53,130,false,2022-02-10T03:58:29.979Z,Dd,130,CwO2bA321vyBxBjtxb,36073,1.644465509979958E9,143.88.5.12,Reconnaissance +2,dns,186,false,0,udp,0.002279996871948242,SF,143.88.5.1,2,1:Z2qpnUv+rxq4N1rn7Go962U/gi8=,186,53,130,false,2022-02-10T03:58:29.979Z,Dd,130,CwO2bA321vyBxBjtxb,36073,1.644465509979958E9,143.88.5.12,Reconnaissance 
+2,dns,186,false,0,udp,0.002279996871948242,SF,143.88.5.1,2,1:Z2qpnUv+rxq4N1rn7Go962U/gi8=,186,53,130,false,2022-02-10T03:58:29.979Z,Dd,130,CwO2bA321vyBxBjtxb,36073,1.644465509979958E9,143.88.5.12,Reconnaissance +2,dns,186,false,0,udp,0.002279996871948242,SF,143.88.5.1,2,1:Z2qpnUv+rxq4N1rn7Go962U/gi8=,186,53,130,false,2022-02-10T03:58:29.979Z,Dd,130,CwO2bA321vyBxBjtxb,36073,1.644465509979958E9,143.88.5.12,Reconnaissance +2,dns,186,false,0,udp,0.002279996871948242,SF,143.88.5.1,2,1:Z2qpnUv+rxq4N1rn7Go962U/gi8=,186,53,130,false,2022-02-10T03:58:29.979Z,Dd,130,CwO2bA321vyBxBjtxb,36073,1.644465509979958E9,143.88.5.12,Reconnaissance +2,dns,186,false,0,udp,0.002279996871948242,SF,143.88.5.1,2,1:Z2qpnUv+rxq4N1rn7Go962U/gi8=,186,53,130,false,2022-02-10T03:58:29.979Z,Dd,130,CwO2bA321vyBxBjtxb,36073,1.644465509979958E9,143.88.5.12,Reconnaissance +2,dns,186,false,0,udp,0.002279996871948242,SF,143.88.5.1,2,1:Z2qpnUv+rxq4N1rn7Go962U/gi8=,186,53,130,false,2022-02-10T03:58:29.979Z,Dd,130,CwO2bA321vyBxBjtxb,36073,1.644465509979958E9,143.88.5.12,Reconnaissance +2,dns,186,false,0,udp,0.002279996871948242,SF,143.88.5.1,2,1:Z2qpnUv+rxq4N1rn7Go962U/gi8=,186,53,130,false,2022-02-10T03:58:29.979Z,Dd,130,CwO2bA321vyBxBjtxb,36073,1.644465509979958E9,143.88.5.12,Reconnaissance +2,dns,186,false,0,udp,0.002279996871948242,SF,143.88.5.1,2,1:Z2qpnUv+rxq4N1rn7Go962U/gi8=,186,53,130,false,2022-02-10T03:58:29.979Z,Dd,130,CwO2bA321vyBxBjtxb,36073,1.644465509979958E9,143.88.5.12,Reconnaissance +2,dns,186,false,0,udp,0.002279996871948242,SF,143.88.5.1,2,1:Z2qpnUv+rxq4N1rn7Go962U/gi8=,186,53,130,false,2022-02-10T03:58:29.979Z,Dd,130,CwO2bA321vyBxBjtxb,36073,1.644465509979958E9,143.88.5.12,Reconnaissance +2,dns,186,false,0,udp,0.002279996871948242,SF,143.88.5.1,2,1:Z2qpnUv+rxq4N1rn7Go962U/gi8=,186,53,130,false,2022-02-10T03:58:29.979Z,Dd,130,CwO2bA321vyBxBjtxb,36073,1.644465509979958E9,143.88.5.12,Reconnaissance 
+2,dns,186,false,0,udp,0.002279996871948242,SF,143.88.5.1,2,1:Z2qpnUv+rxq4N1rn7Go962U/gi8=,186,53,130,false,2022-02-10T03:58:29.979Z,Dd,130,CwO2bA321vyBxBjtxb,36073,1.644465509979958E9,143.88.5.12,Reconnaissance +2,dns,186,false,0,udp,0.002279996871948242,SF,143.88.5.1,2,1:Z2qpnUv+rxq4N1rn7Go962U/gi8=,186,53,130,false,2022-02-10T03:58:29.979Z,Dd,130,CwO2bA321vyBxBjtxb,36073,1.644465509979958E9,143.88.5.12,Reconnaissance +2,dns,186,false,0,udp,0.002279996871948242,SF,143.88.5.1,2,1:Z2qpnUv+rxq4N1rn7Go962U/gi8=,186,53,130,false,2022-02-10T03:58:29.979Z,Dd,130,CwO2bA321vyBxBjtxb,36073,1.644465509979958E9,143.88.5.12,Reconnaissance +2,dns,186,false,0,udp,0.002279996871948242,SF,143.88.5.1,2,1:Z2qpnUv+rxq4N1rn7Go962U/gi8=,186,53,130,false,2022-02-10T03:58:29.979Z,Dd,130,CwO2bA321vyBxBjtxb,36073,1.644465509979958E9,143.88.5.12,Reconnaissance +2,dns,186,false,0,udp,0.002279996871948242,SF,143.88.5.1,2,1:Z2qpnUv+rxq4N1rn7Go962U/gi8=,186,53,130,false,2022-02-10T03:58:29.979Z,Dd,130,CwO2bA321vyBxBjtxb,36073,1.644465509979958E9,143.88.5.12,Reconnaissance +[...] 
+```
+
\ No newline at end of file
From 32ae3938991261d544d2753207d67a998b2d5d89 Mon Sep 17 00:00:00 2001
From: Philipp Boenninghausen
Date: Mon, 3 Jun 2024 16:29:18 +0200
Subject: [PATCH 13/16] Add missing data fields

---
 content/datasets/uwf_zeekdata22.md | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/content/datasets/uwf_zeekdata22.md b/content/datasets/uwf_zeekdata22.md
index e815bfa..39e6052 100644
--- a/content/datasets/uwf_zeekdata22.md
+++ b/content/datasets/uwf_zeekdata22.md
@@ -12,10 +12,10 @@ title: UWF-ZeekData22
 
 | | |
 |--------------------------|--------------------------------------------------------------------------------|
-| **Network Data Source** | |
-| **Network Data Labeled** | |
-| **Host Data Source** | |
-| **Host Data Labeled** | |
+| **Network Data Source** | pcaps, Zeek logs |
+| **Network Data Labeled** | Yes |
+| **Host Data Source** | - |
+| **Host Data Labeled** | - |
 | | |
 | **Overall Setting** | Enterprise IT |
 | **OS Types** | Windows 10/2008 Metasploitable3<br>Debian 11<br>Ubuntu 14.04 Metasploitable3 |
From 572aa8f784ffade8b530cc69d148bacbd1f0124a Mon Sep 17 00:00:00 2001
From: Philipp Boenninghausen
Date: Mon, 3 Jun 2024 16:30:01 +0200
Subject: [PATCH 14/16] Fix typos

---
 content/datasets/uwf_zeekdata22.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/content/datasets/uwf_zeekdata22.md b/content/datasets/uwf_zeekdata22.md
index 39e6052..b35c41a 100644
--- a/content/datasets/uwf_zeekdata22.md
+++ b/content/datasets/uwf_zeekdata22.md
@@ -32,7 +32,7 @@ title: UWF-ZeekData22
 ***
 
 ### Overview
-The University of West Florida Zeek Dataset (UWF-ZeekData22) is consists of 64 days network traffic and related Zeek logs, collected from a "cyber wargaming course" held at the same university.
+The University of West Florida Zeek Dataset (UWF-ZeekData22) consists of 64 days network traffic and related Zeek logs, collected from a "cyber wargaming course" held at the same university.
 This course leveraged the UWF's cyber range, a virtualized and relatively diverse environment of different systems which participants were instructed to attack and defend.
 The datasets defining feature is the inclusion of MITRE tactic labels assigned to each packet or log, potentially allowing for attack chain detection or similar use cases.
 However, the vast majority (>99.9%) of malicious traffic consists of simple reconnaissance, and, apart from statistics, there is very little information about individual attacks.
From d89e269c766073b9678de7598c9364c6084c72ad Mon Sep 17 00:00:00 2001 From: Maspital Date: Tue, 4 Jun 2024 17:45:36 +0200 Subject: [PATCH 15/16] Incorporate requested changes --- content/all_datasets.md | 102 ++++++++++++++--------------- content/datasets/uwf_zeekdata22.md | 10 +-- 2 files changed, 56 insertions(+), 56 deletions(-) diff --git a/content/all_datasets.md b/content/all_datasets.md index a2ae445..061ef23 100644 --- a/content/all_datasets.md +++ b/content/all_datasets.md @@ -6,57 +6,57 @@ full-width: true before-content: gh_buttons.html --- -| Name | Network/Host Data | TL;DR | Year | Setting | OS Type | Labeled?ยน | Data Type/Source | Packed Size | Unpacked Size | -|----------------------------------------------------------------------------------------------------|:-----------------:|----------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------:|---------------|-----------------------|:---------:|--------------------------------------------------------------------------------------|------------:|--------------:| -| [AIT Alert Dataset](../datasets/ait_alert_dataset) | Both | Alerts generated from the AIT log dataset, including labels. Only caveat is the lack of Windows machines | 2023 | Enterprise IT | Linux | ๐ŸŸฉ | Wazuh, Suricata and AMiner alerts | 96 MB | 2,9 GB | -| [OTFR Security Datasets - LSASS Campaign](../datasets/otfr_lsass_campaign) | Both | Very small simulation focusing on exploiting Windows' LSASS.exe. Lacking documentation, no labels and no user behavior | 2023 | Single OS | Windows | ๐ŸŸฅ | pcaps, Windows events, Zeek logs | 423 MB | 1 GB | -| [AIT Log Dataset](../datasets/ait_log_dataset) | Both | Huge variety of labeled logs collected from multiple simulation runs of an enterprise network under attack. With user emulation. but only Linux machines | 2022 | Enterprise IT | Linux | ๐ŸŸฉ | pcaps, Suricata alerts, misc. 
logs (Apache, auth, dns, vpn, audit, suricata, syslog) | 130 GB | 206 GB | -| [CLUE-LDS](../datasets/clue_lds) | Host | Database of real user behavior without known attacks, for evaluation of methods detecting shifts in user behavior | 2022 | Subsystem | Undisclosed | ๐ŸŸฅ | Custom event logs | 640 MB | 14,9 GB | -| [EVTX to MITRE ATT&CK](../datasets/evtx_to_mitre_attck) | Host | Small dataset providing various events corresponding to certain MITRE tactics/techniques | 2022 | Single OS | Windows | ๐ŸŸฉ | Windows events | <1 GB | <1 GB | -| [OTFR Security Datasets - Atomic](../datasets/otfr_atomic) | Both | Various small datasets, each corresponding to a specific MITRE tactic/technique. Lacks user simulation / underlying scenario and does not provide explicit labels | 2019-2022 | Single OS | Windows, Linux, Cloud | ๐ŸŸจ | pcaps, Windows events, auditd logs, AWS CloudTrail logs | 125 MB | - | -| [PWNJUTSU](../datasets/pwnjutsu) | Both | Rich collection of complex attacks executed by various red team participants each acting in a small network, but not labeled | 2022 | Miscellaneous | Windows, Linux | ๐ŸŸฅ | pcaps, Windows events, Sysmon, auditd, various logs (Apache, auth, dns, ssh, etc.) | 82 GB | - | -| [UWF-ZeekData22](../datasets/uwf_zeekdata22) | Network | Traffic collected from a universities wargaming course. Covers all MITRE tactics, though the overwhelming majority is simple recon and attacks are poorly documented | 2022 | Enterprise IT | Windows, Linux | ๐ŸŸฉ | pcaps, Zeek logs | - | 209 GB | -| [NF-UQ-NIDS](../datasets/nf_uq_nids) | Network | Combination of four distinct network datasets using a newly proposed set of standardized features | 2021 | Miscellaneous | Windows, Linux, MacOS | ๐ŸŸฉ | Custom NetFlows | 2 GB | 14,8 GB | -| [OTFR Security Datasets - Log4Shell](../datasets/otfr_log4shell) | Both | Very small simulation focusing on the Log4j vulnerability. 
Lacking documentation, no explicit labels and no user behavior | 2021 | Single OS | Linux | ๐ŸŸจ | pcaps, Ubuntu events | <1 MB | 1 MB | -| [OTFR Security Datasets - SimuLand Golden SAML](../datasets/otfr_golden_saml) | Host | Barely a dataset, only contains very few traces for some specific events. At most usable to test specific Windows detection rules. | 2021 | Enterprise IT | Windows | ๐ŸŸฉ | Windows Events | - | <1 MB | -| [SOCBED Example Dataset](../datasets/socbed_dataset) | Both | Generated using the SOCBED framework, demonstrating reproducible dataset creation, though current attacks are on the basic side | 2021 | Enterprise IT | Windows, Linux | ๐ŸŸฅ | Windows events, Linux events, packetbeat | 78 MB | 1,3 GB | -| [Unraveled](../datasets/unraveled) | Both | Large dataset with intricate labeling, though the focus seems to be on network flows. Mapping will be annoying. | 2021 | Enterprise IT | Windows, Linux | ๐ŸŸฉ | pcaps, misc. logs (syslog, audit, auth, Snort) | - | 22 GB | -| [DAPT 2020](../datasets/dapt2020) | Both | Focuses on attacks mimicking those of an APT group, executed in a rather small environment | 2020 | Enterprise IT | Undisclosed | ๐ŸŸฉ | NetFlows, misc. logs (DNS, syslog, auditd, apache, auth, various services) | 460 MB | - | -| [OpTC](../datasets/optc) | Both | Huge amount of data and interesting attacks, but possibly hard to use due to uncommon event format and requiring semi-manual labeling | 2020 | Enterprise IT | Windows | ๐ŸŸจ | Custom event logs, Zeek events | - | 1 TB | -| [OTFR Security Datasets - APT 29](../datasets/otfr_apt_29) | Both | Replication of APT29 evaluation developed by MITRE. Well made and documented, but without labels or user behavior | 2020 | Enterprise IT | Windows, Linux | ๐ŸŸฅ | pcaps, Windows events, Zeek events | 126 MB | 2 GB | -| [CICDDoS2019](../datasets/cic_ddos) | Network | Dataset focusing on various DDoS attacks, covering a broad range of categories. 
Includes benign behavior, but only for Pcaps, not NetFlows | 2019 | Enterprise IT | Windows, Linux | ๐ŸŸฉ | Pcaps, NetFlows, Windows events, Ubuntu events | 24,4 GB | - | -| [DARPA TC5](../datasets/darpa_tc5) | Host | Custom event logs from network under attack from APT groups, designed to facilitate provenance tracking | 2019 | Undisclosed | Undisclosed | ๐ŸŸจ | Custom event logs | - | - | -| [LID-DS 2019](../datasets/lids_ds_2019) | Host | Contains system calls + associated data/metadata for a variety of Linux exploits, includes normal behavior | 2019 | Single OS | Linux | ๐ŸŸจ | Sequences of syscalls with extended information | 13 GB | - | -| [OTFR Security Datasets - APT 3](../datasets/otfr_apt_3) | Host | Replication of APT3 evaluation developed by MITRE. Lacking documentation, no labels and no user behavior | 2019 | Enterprise IT | Windows, Linux | ๐ŸŸฅ | Windows events | 30 MB | 855 MB | -| [ASNM Datasets](../datasets/asnm_datasets) | Network | Specialized features extracted from instances of remote buffer overflow attacks for the purpose of anomaly-based detection | 2009-2018 | Miscellaneous | Windows, Linux | ๐ŸŸฉ | Custom NetFlows | 21 MB | 95 GB | -| [AWSCTD](../datasets/awsctd) | Host | Syscalls collected from ~10k malware samples running on Windows 7, no user emulation | 2018 | Single OS | Windows | ๐ŸŸฉ | Sequences of syscall numbers | 10 MB | 558 MB | -| [CSE-CIC-IDS2018](../datasets/cse_cic_ids2018) | Both | Simulation of large enterprise IT (450 machines) with user emulation and various attacks, includes host and network logs, but only the latter are labeled | 2018 | Enterprise IT | Windows, Linux, MacOS | ๐ŸŸฉ | pcaps, NetFlows, Windows events, Ubuntu events | 220 GB | - | -| [DARPA TC3](../datasets/darpa_tc3) | Host | Custom event logs from network under attack, designed to facilitate provenance tracking | 2018 | Undisclosed | Undisclosed | ๐ŸŸจ | Custom event logs | 115 GB | - | -| [NGIDS-DS](../datasets/nigds_dataset) | Both | Enterprise 
network undergoing variety of attacks using IXIA PerfectStorm hardware. Seems to lack host user behavior, does not provide raw host logs | 2018 | Enterprise IT | Linux | ๐ŸŸฉ | pcaps, custom host features | 941 MB | 13,4 GB | -| [CIC DoS](../datasets/cic_dos) | Network | Dataset focusing on different DoS attacks targeting the application layer (instead of network layer), but no longer available | 2017 | Enterprise IT | Linux | ๐ŸŸฉ | Network traffic (unknown format) | - | 4,6 GB | -| [CIC-IDS2017](../datasets/cic_ids2017) | Network | Simulation of medium-sized company network under attack, focuses solely on network traffic | 2017 | Enterprise IT | Windows, Linux | ๐ŸŸฉ | pcaps, NetFlows, custom network features | 48,4 GB | 50 GB | -| [Unified Host and Network Data Set](../datasets/unified_host_and_network_dataset) | Both | Selection of network and host events collected from operational environment, but without any attacks | 2017 | Enterprise IT | Windows, Linux | ๐ŸŸฅ | NetFlows, Windows events | - | - | -| [UGR'16](../datasets/ugr16) | Network | Network flows collected from real network over a long period of time, with some attack traffic injected | 2016 | Enterprise IT | Undisclosed | ๐ŸŸฉ | NetFlows | 236 GB | - | -| [Comprehensive, Multi-Source Cyber-Security Events](../datasets/comp_multi_source_cybersec_events) | Both | Various events from production network with red team activity, but extremely limited information per event | 2015 | Enterprise IT | Windows, Linux | ๐ŸŸฉ | Custom event logs (auth, proc, network flows, dns, redteam) | 12 GB | - | -| [Kyoto Honeypot](../datasets/kyoto_honeypot) | Network | Collection of features derived from attack traffic targeting honeypots over the span of 9 years | 2006-2015 | Miscellaneous | Windows, Unix, MacOS | ๐ŸŸฉ | Custom network features | 20 GB | - | -| [UNSW-NB15](../datasets/unsw_nb15) | Network | Custom network undergoing a variety of attacks using IXIA PerfectStorm hardware. 
Mostly geared towards anomaly-based NIDS | 2015 | Undisclosed | Undisclosed | ๐ŸŸฉ | pcaps, custom network features | >100 GB | - | -| [ADFA-WD](../datasets/adfa_wd) | Host | Mostly intended for anomaly-based stuff leveraging library calls, explores interesting concept of stealthy shellcode | 2014 | Single OS | Windows | ๐ŸŸจ | Sequences of dll calls, Windows events (dll calls only) | 403 MB | 13,6 GB | -| [Skopik 2014](../datasets/skopik_et_al) | Host | Focus on realistically emulating user behavior, does not include attacks | 2014 | Enterprise IT | Linux | ๐ŸŸฅ | misc. logs (Apache, database, mail server, bug tracker app) | - | - | -| [Twente 2014](../datasets/twente_2014) | Both | Anonymized network flows and host logs from real network, but only those related to ssh authentication, focusing on detecting related brute force attacks | 2014 | Enterprise IT | Undisclosed | ๐ŸŸฉ | NetFlows | 2,42 GB | 5,8 GB | -| [User-Computer Associations in Time](../datasets/user_computer_associations) | Host | Large number of authentication events over a period of 9 months, but with very little detail and without any attacks | 2014 | Enterprise IT | Undisclosed | ๐ŸŸฅ | Custom auth event logs | 2,3 GB | - | -| [ADFA-LD](../datasets/adfa_ld) | Host | Purely intended for anomaly-based approaches, provides only syscall numbers | 2013 | Single OS | Linux | ๐ŸŸฉ | Sequences of syscall numbers | 2 MB | 17 MB | -| [CIDD](../datasets/cidd) | Network | Spin on the DARPA'98 dataset, correlating user behavior over different systems/environments for behavior-based IDSs | 2012 | Military IT | Unix | ๐ŸŸฉ | Sequences of user "audits" | - | 22 GB | -| [ISCX IDS 2012](../datasets/iscx_ids_2012) | Network | Focus on realistic traffic generation in a company network, combined with some basic attacks | 2012 | Enterprise IT | Windows, Linux | ๐ŸŸฉ | pcaps | 84 GB | 87 GB | -| [TUIDS](../datasets/tuids) | Network | Dataset focusing on DoS attacks, but very poorly documented | 2012 | Enterprise IT | 
Undisclosed | ๐ŸŸฉ | pcaps, NetFlows | - | - | -| [VAST Challenge 2012](../datasets/vast_2012) | Network | Originated from a challenge about data analytics, focus an a large network being the victim of a botnet | 2012 | Enterprise IT | Undisclosed | ๐ŸŸจ | Snort alerts, firewall logs | 186 MB | 2,9 GB | -| [CTU 13](../datasets/ctu_13) | Network | Collection of various botnet behavior combined with loads of background traffic, but very limited feature space | 2011 | Enterprise IT | Windows, Undisclosed | ๐ŸŸฉ | pcaps, NetFlows, Bro logs | - | 697 GB | -| [VAST Challenge 2011](../datasets/vast_2011) | Both | Originated from a challenge about data analytics, focus on network but also contains host logs. Labeling is a bit lacking | 2011 | Enterprise IT | Windows | ๐ŸŸจ | pcaps, Windows events, misc. logs (firewall, Snort, Nessus) | 940 MB | 9,3 GB | -| [CDX CTF 2009](../datasets/cdx_2009) | Both | Dataset captured from a CTF event, generally intended to provide methods for reliable generating labeled datasets from such events | 2009 | Enterprise IT | Windows, Linux | ๐ŸŸจ | pcaps, Snort IDS alerts, Apache logs, Splunk logs | 12 GB | 15,3 GB | -| [NSL-KDD](../datasets/nsl_kdd_dataset) | Network | An improvement of the original KDD'99 dataset, but still outdated at its core | 2009 | Military IT | Unix | ๐ŸŸฉ | Connection records | 6 MB | 19 MB | -| [Twente 2009](../datasets/twente_2009) | Network | Intricately labeled network flows + alerts collected from a single honeypot over the span of 6 days | 2009 | Single OS | Linux | ๐ŸŸฉ | NetFlows | 303 MB | 1,9 GB | -| [gureKDDCup](../datasets/gure_kddcup) | Network | An extension of the KDDCup 1999 dataset, adding additional information about payloads to each connection record | 2008 | Military IT | Unix | ๐ŸŸฉ | Connection records with payload information | 10 GB | - | -| [KDD Cup 1999](../datasets/kdd_cup_1999) | Network | Network connection events derived from simulated U.S. Air Force network under attack. 
No longer appropriate to use for multiple reasons | 1999 | Military IT | Unix | ๐ŸŸฉ | Connection records | 18 MB | 743 MB | -| [DARPA'98 Intrusion Detection Program](../datasets/darpa98) | Both | Simulation of a small U.S. Air Force network under attack. No longer appropriate to use for a multiple reasons | 1998 | Military IT | Unix | ๐ŸŸจ | tcpdumps, host audit logs, file system dumps | 5 GB | - | +| Name | Network/Host Data | TL;DR | Year | Setting | OS Type | Labeled?ยน | Data Type/Source | Packed Size | Unpacked Size | +|----------------------------------------------------------------------------------------------------|:-----------------:|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------:|---------------|-----------------------|:---------:|--------------------------------------------------------------------------------------|------------:|--------------:| +| [AIT Alert Dataset](../datasets/ait_alert_dataset) | Both | Alerts generated from the AIT log dataset, including labels. Only caveat is the lack of Windows machines | 2023 | Enterprise IT | Linux | ๐ŸŸฉ | Wazuh, Suricata and AMiner alerts | 96 MB | 2,9 GB | +| [OTFR Security Datasets - LSASS Campaign](../datasets/otfr_lsass_campaign) | Both | Very small simulation focusing on exploiting Windows' LSASS.exe. Lacking documentation, no labels and no user behavior | 2023 | Single OS | Windows | ๐ŸŸฅ | pcaps, Windows events, Zeek logs | 423 MB | 1 GB | +| [AIT Log Dataset](../datasets/ait_log_dataset) | Both | Huge variety of labeled logs collected from multiple simulation runs of an enterprise network under attack. With user emulation. but only Linux machines | 2022 | Enterprise IT | Linux | ๐ŸŸฉ | pcaps, Suricata alerts, misc. 
logs (Apache, auth, dns, vpn, audit, suricata, syslog) | 130 GB | 206 GB | +| [CLUE-LDS](../datasets/clue_lds) | Host | Database of real user behavior without known attacks, for evaluation of methods detecting shifts in user behavior | 2022 | Subsystem | Undisclosed | ๐ŸŸฅ | Custom event logs | 640 MB | 14,9 GB | +| [EVTX to MITRE ATT&CK](../datasets/evtx_to_mitre_attck) | Host | Small dataset providing various events corresponding to certain MITRE ATT&CK tactics/techniques | 2022 | Single OS | Windows | ๐ŸŸฉ | Windows events | <1 GB | <1 GB | +| [OTFR Security Datasets - Atomic](../datasets/otfr_atomic) | Both | Various small datasets, each corresponding to a specific MITRE ATT&CK tactic/technique. Lacks user simulation / underlying scenario and does not provide explicit labels | 2019-2022 | Single OS | Windows, Linux, Cloud | ๐ŸŸจ | pcaps, Windows events, auditd logs, AWS CloudTrail logs | 125 MB | - | +| [PWNJUTSU](../datasets/pwnjutsu) | Both | Rich collection of complex attacks executed by various red team participants each acting in a small network, but not labeled | 2022 | Miscellaneous | Windows, Linux | ๐ŸŸฅ | pcaps, Windows events, Sysmon, auditd, various logs (Apache, auth, dns, ssh, etc.) | 82 GB | - | +| [UWF-ZeekData22](../datasets/uwf_zeekdata22) | Network | Traffic collected from a university's wargaming course. Covers all MITRE ATT&CK tactics, though the overwhelming majority is simple recon and attacks are poorly documented | 2022 | Enterprise IT | Windows, Linux | ๐ŸŸฉ | pcaps, Zeek logs | - | 209 GB | +| [NF-UQ-NIDS](../datasets/nf_uq_nids) | Network | Combination of four distinct network datasets using a newly proposed set of standardized features | 2021 | Miscellaneous | Windows, Linux, MacOS | ๐ŸŸฉ | Custom NetFlows | 2 GB | 14,8 GB | +| [OTFR Security Datasets - Log4Shell](../datasets/otfr_log4shell) | Both | Very small simulation focusing on the Log4j vulnerability. 
Lacking documentation, no explicit labels and no user behavior | 2021 | Single OS | Linux | 🟨 | pcaps, Ubuntu events | <1 MB | 1 MB |
+| [OTFR Security Datasets - SimuLand Golden SAML](../datasets/otfr_golden_saml) | Host | Barely a dataset, only contains very few traces for some specific events. At most usable to test specific Windows detection rules. | 2021 | Enterprise IT | Windows | 🟩 | Windows events | - | <1 MB |
+| [SOCBED Example Dataset](../datasets/socbed_dataset) | Both | Generated using the SOCBED framework, demonstrating reproducible dataset creation, though current attacks are on the basic side | 2021 | Enterprise IT | Windows, Linux | 🟥 | Windows events, Linux events, packetbeat | 78 MB | 1,3 GB |
+| [Unraveled](../datasets/unraveled) | Both | Large dataset with intricate labeling, though the focus seems to be on network flows. Mapping will be annoying. | 2021 | Enterprise IT | Windows, Linux | 🟩 | pcaps, misc. logs (syslog, audit, auth, Snort) | - | 22 GB |
+| [DAPT 2020](../datasets/dapt2020) | Both | Focuses on attacks mimicking those of an APT group, executed in a rather small environment | 2020 | Enterprise IT | Undisclosed | 🟩 | NetFlows, misc. logs (DNS, syslog, auditd, apache, auth, various services) | 460 MB | - |
+| [OpTC](../datasets/optc) | Both | Huge amount of data and interesting attacks, but possibly hard to use due to uncommon event format and requiring semi-manual labeling | 2020 | Enterprise IT | Windows | 🟨 | Custom event logs, Zeek events | - | 1 TB |
+| [OTFR Security Datasets - APT 29](../datasets/otfr_apt_29) | Both | Replication of APT29 evaluation developed by MITRE. Well made and documented, but without labels or user behavior | 2020 | Enterprise IT | Windows, Linux | 🟥 | pcaps, Windows events, Zeek events | 126 MB | 2 GB |
+| [CICDDoS2019](../datasets/cic_ddos) | Network | Dataset focusing on various DDoS attacks, covering a broad range of categories. Includes benign behavior, but only for pcaps, not NetFlows | 2019 | Enterprise IT | Windows, Linux | 🟩 | pcaps, NetFlows, Windows events, Ubuntu events | 24,4 GB | - |
+| [DARPA TC5](../datasets/darpa_tc5) | Host | Custom event logs from network under attack from APT groups, designed to facilitate provenance tracking | 2019 | Undisclosed | Undisclosed | 🟨 | Custom event logs | - | - |
+| [LID-DS 2019](../datasets/lids_ds_2019) | Host | Contains system calls + associated data/metadata for a variety of Linux exploits, includes normal behavior | 2019 | Single OS | Linux | 🟨 | Sequences of syscalls with extended information | 13 GB | - |
+| [OTFR Security Datasets - APT 3](../datasets/otfr_apt_3) | Host | Replication of APT3 evaluation developed by MITRE. Lacking documentation, no labels and no user behavior | 2019 | Enterprise IT | Windows, Linux | 🟥 | Windows events | 30 MB | 855 MB |
+| [ASNM Datasets](../datasets/asnm_datasets) | Network | Specialized features extracted from instances of remote buffer overflow attacks for the purpose of anomaly-based detection | 2009-2018 | Miscellaneous | Windows, Linux | 🟩 | Custom NetFlows | 21 MB | 95 GB |
+| [AWSCTD](../datasets/awsctd) | Host | Syscalls collected from ~10k malware samples running on Windows 7, no user emulation | 2018 | Single OS | Windows | 🟩 | Sequences of syscall numbers | 10 MB | 558 MB |
+| [CSE-CIC-IDS2018](../datasets/cse_cic_ids2018) | Both | Simulation of large enterprise IT (450 machines) with user emulation and various attacks, includes host and network logs, but only the latter are labeled | 2018 | Enterprise IT | Windows, Linux, MacOS | 🟩 | pcaps, NetFlows, Windows events, Ubuntu events | 220 GB | - |
+| [DARPA TC3](../datasets/darpa_tc3) | Host | Custom event logs from network under attack, designed to facilitate provenance tracking | 2018 | Undisclosed | Undisclosed | 🟨 | Custom event logs | 115 GB | - |
+| [NGIDS-DS](../datasets/nigds_dataset) | Both | Enterprise network undergoing a variety of attacks using IXIA PerfectStorm hardware. Seems to lack host user behavior, does not provide raw host logs | 2018 | Enterprise IT | Linux | 🟩 | pcaps, custom host features | 941 MB | 13,4 GB |
+| [CIC DoS](../datasets/cic_dos) | Network | Dataset focusing on different DoS attacks targeting the application layer (instead of network layer), but no longer available | 2017 | Enterprise IT | Linux | 🟩 | Network traffic (unknown format) | - | 4,6 GB |
+| [CIC-IDS2017](../datasets/cic_ids2017) | Network | Simulation of medium-sized company network under attack, focuses solely on network traffic | 2017 | Enterprise IT | Windows, Linux | 🟩 | pcaps, NetFlows, custom network features | 48,4 GB | 50 GB |
+| [Unified Host and Network Data Set](../datasets/unified_host_and_network_dataset) | Both | Selection of network and host events collected from operational environment, but without any attacks | 2017 | Enterprise IT | Windows, Linux | 🟥 | NetFlows, Windows events | - | - |
+| [UGR'16](../datasets/ugr16) | Network | Network flows collected from real network over a long period of time, with some attack traffic injected | 2016 | Enterprise IT | Undisclosed | 🟩 | NetFlows | 236 GB | - |
+| [Comprehensive, Multi-Source Cyber-Security Events](../datasets/comp_multi_source_cybersec_events) | Both | Various events from production network with red team activity, but extremely limited information per event | 2015 | Enterprise IT | Windows, Linux | 🟩 | Custom event logs (auth, proc, network flows, dns, redteam) | 12 GB | - |
+| [Kyoto Honeypot](../datasets/kyoto_honeypot) | Network | Collection of features derived from attack traffic targeting honeypots over the span of 9 years | 2006-2015 | Miscellaneous | Windows, Unix, MacOS | 🟩 | Custom network features | 20 GB | - |
+| [UNSW-NB15](../datasets/unsw_nb15) | Network | Custom network undergoing a variety of attacks using IXIA PerfectStorm hardware. Mostly geared towards anomaly-based NIDS | 2015 | Undisclosed | Undisclosed | 🟩 | pcaps, custom network features | >100 GB | - |
+| [ADFA-WD](../datasets/adfa_wd) | Host | Mostly intended for anomaly-based stuff leveraging library calls, explores interesting concept of stealthy shellcode | 2014 | Single OS | Windows | 🟨 | Sequences of dll calls, Windows events (dll calls only) | 403 MB | 13,6 GB |
+| [Skopik 2014](../datasets/skopik_et_al) | Host | Focus on realistically emulating user behavior, does not include attacks | 2014 | Enterprise IT | Linux | 🟥 | misc. logs (Apache, database, mail server, bug tracker app) | - | - |
+| [Twente 2014](../datasets/twente_2014) | Both | Anonymized network flows and host logs from real network, but only those related to ssh authentication, focusing on detecting related brute force attacks | 2014 | Enterprise IT | Undisclosed | 🟩 | NetFlows | 2,42 GB | 5,8 GB |
+| [User-Computer Associations in Time](../datasets/user_computer_associations) | Host | Large number of authentication events over a period of 9 months, but with very little detail and without any attacks | 2014 | Enterprise IT | Undisclosed | 🟥 | Custom auth event logs | 2,3 GB | - |
+| [ADFA-LD](../datasets/adfa_ld) | Host | Purely intended for anomaly-based approaches, provides only syscall numbers | 2013 | Single OS | Linux | 🟩 | Sequences of syscall numbers | 2 MB | 17 MB |
+| [CIDD](../datasets/cidd) | Network | Spin on the DARPA'98 dataset, correlating user behavior over different systems/environments for behavior-based IDSs | 2012 | Military IT | Unix | 🟩 | Sequences of user "audits" | - | 22 GB |
+| [ISCX IDS 2012](../datasets/iscx_ids_2012) | Network | Focus on realistic traffic generation in a company network, combined with some basic attacks | 2012 | Enterprise IT | Windows, Linux | 🟩 | pcaps | 84 GB | 87 GB |
+| [TUIDS](../datasets/tuids) | Network | Dataset focusing on DoS attacks, but very poorly documented | 2012 | Enterprise IT | Undisclosed | 🟩 | pcaps, NetFlows | - | - |
+| [VAST Challenge 2012](../datasets/vast_2012) | Network | Originated from a challenge about data analytics, focus on a large network being the victim of a botnet | 2012 | Enterprise IT | Undisclosed | 🟨 | Snort alerts, firewall logs | 186 MB | 2,9 GB |
+| [CTU 13](../datasets/ctu_13) | Network | Collection of various botnet behavior combined with loads of background traffic, but very limited feature space | 2011 | Enterprise IT | Windows, Undisclosed | 🟩 | pcaps, NetFlows, Bro logs | - | 697 GB |
+| [VAST Challenge 2011](../datasets/vast_2011) | Both | Originated from a challenge about data analytics, focus on network but also contains host logs. Labeling is a bit lacking | 2011 | Enterprise IT | Windows | 🟨 | pcaps, Windows events, misc. logs (firewall, Snort, Nessus) | 940 MB | 9,3 GB |
+| [CDX CTF 2009](../datasets/cdx_2009) | Both | Dataset captured from a CTF event, generally intended to provide methods for reliably generating labeled datasets from such events | 2009 | Enterprise IT | Windows, Linux | 🟨 | pcaps, Snort IDS alerts, Apache logs, Splunk logs | 12 GB | 15,3 GB |
+| [NSL-KDD](../datasets/nsl_kdd_dataset) | Network | An improvement of the original KDD'99 dataset, but still outdated at its core | 2009 | Military IT | Unix | 🟩 | Connection records | 6 MB | 19 MB |
+| [Twente 2009](../datasets/twente_2009) | Network | Intricately labeled network flows + alerts collected from a single honeypot over the span of 6 days | 2009 | Single OS | Linux | 🟩 | NetFlows | 303 MB | 1,9 GB |
+| [gureKDDCup](../datasets/gure_kddcup) | Network | An extension of the KDDCup 1999 dataset, adding additional information about payloads to each connection record | 2008 | Military IT | Unix | 🟩 | Connection records with payload information | 10 GB | - |
+| [KDD Cup 1999](../datasets/kdd_cup_1999) | Network | Network connection events derived from simulated U.S. Air Force network under attack. No longer appropriate to use for multiple reasons | 1999 | Military IT | Unix | 🟩 | Connection records | 18 MB | 743 MB |
+| [DARPA'98 Intrusion Detection Program](../datasets/darpa98) | Both | Simulation of a small U.S. Air Force network under attack. No longer appropriate to use for multiple reasons | 1998 | Military IT | Unix | 🟨 | tcpdumps, host audit logs, file system dumps | 5 GB | - |
 ### Legend
diff --git a/content/datasets/uwf_zeekdata22.md b/content/datasets/uwf_zeekdata22.md
index b35c41a..895f14f 100644
--- a/content/datasets/uwf_zeekdata22.md
+++ b/content/datasets/uwf_zeekdata22.md
@@ -22,7 +22,7 @@ title: UWF-ZeekData22
 | **Number of Machines** | 6 |
 | **Total Runtime** | 64 days |
 | **Year of Collection** | 2022 |
-| **Attack Categories** | All MITRE tactics |
+| **Attack Categories** | All MITRE ATT&CK tactics |
 | **User Emulation** | n/a |
 | | |
 | **Packed Size** | - |
@@ -34,12 +34,12 @@ title: UWF-ZeekData22
 ### Overview
 The University of West Florida Zeek Dataset (UWF-ZeekData22) consists of 64 days network traffic and related Zeek logs, collected from a "cyber wargaming course" held at the same university.
 This course leveraged the UWF's cyber range, a virtualized and relatively diverse environment of different systems which participants were instructed to attack and defend.
-The datasets defining feature is the inclusion of MITRE tactic labels assigned to each packet or log, potentially allowing for attack chain detection or similar use cases.
+The dataset's defining feature is the inclusion of MITRE ATT&CK tactic labels assigned to each packet or log, potentially allowing for attack chain detection or similar use cases.
 However, the vast majority (>99.9%) of malicious traffic consists of simple reconnaissance, and, apart from statistics, there is very little information about individual attacks.
-The authors also detail the process of collecting these large amounts of data with a dedicated solution (Apache Hadoop), though this is considered out of scope for this survey.
+The authors also detail the process of collecting these large amounts of data with a dedicated solution (Apache Hadoop).
 ### Environment
-As mentioned, course participants leveraged the universities cyber range.
+As mentioned, course participants leveraged the university's cyber range.
 Although the authors state that their dataset contains thousands of distinct IP addresses, this is most likely caused by the fact that each group of students (81 in total) was assigned their own environment (as opposed to one actually large network).
 Each individual network hosts, presumably, six machines with different versions of Windows and Linux operating systems, running various, partially vulnerable, services - presumably, because Section 4 of the underlying paper [1] is pretty unclear in this regard.
@@ -65,7 +65,7 @@ In other words, the vast majority of malicious traffic consists most likely of p
 Additionally, while there seems to be some form of benign activity, it is in no way documented.
 ### Contained Data
-Data is generally available in three different formats, all of which are labelled with the associated MITRE tactic:
+Data is generally available in three different formats, all of which are labeled with the associated MITRE ATT&CK tactic:
 - pcaps: Contains captured traffic.
 Note that these are in a [custom binary format](https://docs.securityonion.net/en/latest/stenographer.html) generated by Security Onion.
 These files are divided into thousands of smaller files, each covering roughly one minute of traffic.
From 897b77296079cbdb04c945ac48587a7bf5f9cdd4 Mon Sep 17 00:00:00 2001
From: Maspital
Date: Tue, 4 Jun 2024 18:13:13 +0200
Subject: [PATCH 16/16] Clarify location of Zeek logs
---
 content/datasets/uwf_zeekdata22.md | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)
diff --git a/content/datasets/uwf_zeekdata22.md b/content/datasets/uwf_zeekdata22.md
index 895f14f..6d822c0 100644
--- a/content/datasets/uwf_zeekdata22.md
+++ b/content/datasets/uwf_zeekdata22.md
@@ -70,15 +70,19 @@ Data is generally available in three different formats, all of which are labeled
 Note that these are in a [custom binary format](https://docs.securityonion.net/en/latest/stenographer.html) generated by Security Onion.
 These files are divided into thousands of smaller files, each covering roughly one minute of traffic.
 - parquet: A binary column-oriented data storage format (basically a faster version of CSV when working with large files).
-Contained information is very similar to that of network flows (see example of CSV files below).
+These contain the Zeek logs generated during the collection period and match the CSV data in feature count and names (see example of CSV files below).
 There are eight files in total, each covering eight days of traffic.
 - CSV: A subset of aforementioned parquet files, which, according to the authors, were mainly made available for people who do not have access to "Big Data" technologies.
 These files contain data from 2022/02/10, 0300-0600, 0900-1000, and 1400-1500, with one file per hour, thus five files in total.
 Each file contains one million entries with a benign/attack ratio of about 80/20.
 For attacks, only the tactics "Reconnaissance" and "Discovery" are included.
-It is unclear where exactly Zeek logs can be found, though I am assuming they are part of the PCAP files.
-The authors leverage what they call "mission logs" to perform labeling, though the nature of these logs is not further detailed.
+It should be noted that some of the field names commonly used in Zeek logs seem to differ from what can be found in the present data.
+For example, `conn` Zeek logs use the `id.orig_h` field for storing the originating host's IP address;
+here, this information is stored in `src_ip`.
+This also does not match the authors' own information about collected [Attributes per Zeek log type](https://datasets.uwf.edu/tables/table2.html), as, again, the features found in the example CSV data below are also the ones used in the parquet files.
+The authors leverage what they call "mission logs" to perform labeling, though the nature of these mission logs is not further detailed.
 Section 6.1 in [1] seems to suggest that these are manually created by participants, who document their current activity in the form of timestamps, ports, IPs, tactics, etc.
 ### Papers