diff --git a/content/all_datasets.md b/content/all_datasets.md index 502b402..badcedc 100644 --- a/content/all_datasets.md +++ b/content/all_datasets.md @@ -14,19 +14,21 @@ before-content: gh_buttons.html | [AIT Log Dataset](../datasets/ait_log_dataset) | Host & Network | Huge variety of labeled logs collected from multiple simulation runs of an enterprise network under attack. With user emulation. but only Linux machines | 2023 | Enterprise IT | Linux | 🟩 | pcaps, Suricata alerts, misc. logs (Apache, auth, dns, vpn, audit, suricata, syslog) | 130 GB | 206 GB | | [ASNM Datasets](../datasets/asnm_datasets) | Network | Specialized features extracted from instances of remote buffer overflow attacks for the purpose of anomaly-based detection | 2009-2018 | Mixed | Windows, Linux | 🟩 | Custom NetFlows | 21 MB | 95 GB | | [AWSCTD](../datasets/awsctd) | Host | Syscalls collected from ~10k malware samples running on Windows 7, no user emulation | 2018 | Single OS | Windows | 🟩 | Sequences of syscall numbers | 10 MB | 558 MB | -| [BotsV3](../datasets/botsv3) [_ON HOLD_] | | _Requires usage of Splunk + a bunch of extensions, postponed_ | 2020 | | | | | 17 GB | - | | [CDX CTF 2009](../datasets/cdx_2009) | Network | Dataset captured from a CTF event, generally intended to provide methods for reliable generating labeled datasets from such events | 2009 | Enterprise IT | Windows, Linux | 🟨 | pcaps, Snort IDS alerts, Apache logs, Splunk logs | 12 GB | 15,3 GB | +| [CIC DoS](../datasets/cic_dos) | Network | Dataset focusing on different DoS attacks targeting the application layer (instead of network layer), but no longer available | 2017 | Enterprise IT | Linux | 🟩 | Network traffic (unknown format) | - | 4,6 GB | +| [CIC-DDoS2019](../datasets/cic_ddos) | Network | Dataset focusing on various DDoS attacks, covering a broad range of categories. 
Includes benign behavior, but only for Pcaps, not NetFlows | 2019 | Enterprise IT | Windows, Linux | 🟩 | Pcaps, NetFlows, Windows events, Ubuntu events | 24,4 GB | - | | [CIC-IDS2017](../datasets/cic_ids2017) | Network | Simulation of medium-sized company network under attack, focuses solely on network traffic | 2017 | Enterprise IT | Windows, Linux | 🟩 | pcaps, NetFlows, custom network features | 48,4 GB | 50 GB | | [CIDD](../datasets/cidd) | - | Spin on the DARPA'98 dataset, correlating user behavior over different systems/environments for behavior-based IDSs | 2012 | Military IT | Unix | 🟩 | Sequences of user "audits" | - | 22 GB | | [CLUE-LDS](../datasets/clue_lds) | - | Database of real user behavior without known attacks, for evaluation of methods detecting shifts in user behavior | 2022 | Enterprise Subsystem | - (hBox) | 🟥 | Custom event logs | 640 MB | 14,9 GB | | [Comprehensive, Multi-Source Cyber-Security Events](../datasets/comp_multi_source_cybersec_events) | Host & Network | Various events from production network with red team activity, but extremely limited information per event | 2015 | Enterprise IT | Windows, Linux | 🟩 | Custom event logs (auth, proc, network flows, dns, redteam) | 12 GB | - | -| [CSE-CIC-IDS2018](../datasets/cse_cic_ids2018) | Network | Simulation of large enterprise IT (450 machines) with user emulation and various attacks, includes host and network logs, but only the latter are labeled | 2018 | Enterprise IT | Windows, Linux, MacOS | 🟨 | pcaps, NetFlows, custom network features, Windows events, Ubuntu events | 220 GB | - | +| [CSE-CIC-IDS2018](../datasets/cse_cic_ids2018) | Network | Simulation of large enterprise IT (450 machines) with user emulation and various attacks, includes host and network logs, but only the latter are labeled | 2018 | Enterprise IT | Windows, Linux, MacOS | 🟩 | pcaps, NetFlows, Windows events, Ubuntu events | 220 GB | - | | [CTU 13](../datasets/ctu_13) | Network | Collection of various botnet behavior 
combined with loads of background traffic, but very limited feature space | 2011 | Enterprise IT | Windows, Undisclosed | 🟩 | pcaps, NetFlows, Bro logs | - | 697 GB | | [DAPT 2020](../datasets/dapt2020) | Network | Focuses on attacks mimicking those of an APT group, executed in a rather small environment | 2020 | Enterprise IT | Undisclosed | 🟩 | NetFlows, misc. logs (DNS, syslog, auditd, apache, auth, various services) | 460 MB | - | | [DARPA'98 Intrusion Detection Program](../datasets/darpa98) | Network | Simulation of a small U.S. Air Force network under attack. No longer appropriate to use for a multiple reasons | 1998 | Military IT | Unix | 🟩 | tcpdumps, host audit logs, file system dumps | 5 GB | - | | [DARPA TC3](../datasets/darpa_tc3) | Host | Custom event logs from network under attack, designed to facilitate provenance tracking | 2018 | Undisclosed | Undisclosed | 🟨 | Custom event logs | 115 GB | - | | [DARPA TC5](../datasets/darpa_tc5) | Host | Custom event logs from network under attack from APT groups, designed to facilitate provenance tracking | 2019 | Undisclosed | Undisclosed | 🟨 | Custom event logs | - | - | | [EVTX to MITRE ATT&CK](../datasets/evtx_to_mitre_attck) | Host | Small dataset providing various events corresponding to certain MITRE tactics/techniques | 2022 | Single OS | Windows | 🟩 | Windows events | <1 GB | <1 GB | +| [gureKDDCup](../datasets/gure_kddcup) | Network | An extension of the KDDCup 1999 dataset, adding additional information about payloads to each connection record | 2008 | Military IT | Unix | 🟩 | Connection records with payload information | 10 GB | - | | [ISCX Intrusion Detection Evaluation](../datasets/iscx_ids_2012) | Network | Focus on realistic traffic generation in a company network, combined with some basic attacks | 2012 | Enterprise IT | Windows, Ubuntu | 🟩 | pcaps | 84 GB | 87 GB | | [KDD Cup 1999](../datasets/kdd_cup_1999) | Network | Network connection events derived from simulated U.S. 
Air Force network under attack. No longer appropriate to use for multiple reasons | 1999 | Military IT | Unix | 🟩 | Connection records | 18 MB | 743 MB | | [Kyoto Honeypot](../datasets/kyoto_honeypot) | Network | Collection of features derived from attack traffic targeting honeypots over the span of 9 years | 2006-2015 | Diverse | Windows, Unix, MacOS | 🟩 | Custom network features | 20 GB | - | @@ -50,6 +52,7 @@ before-content: gh_buttons.html | [Unified Host and Network Data Set](../datasets/unified_host_and_network_dataset) | - | Selection of network and host events collected from operational environment, but without any attacks | 2017 | Enterprise IT | Windows, Linux | 🟥 | NetFlows, Windows events | - | - | | [Unraveled](../datasets/unraveled) | Host & Network | Large dataset with intricate labeling, though the focus seems to be on network flows. Mapping will be annoying. | 2021 | Enterprise IT | Windows, Ubuntu, Kali | 🟩 | pcaps, misc. logs (syslog, audit, auth, Snort) | - | 22 GB | | [UNSW-NB15](../datasets/unsw_nb15) | Network | Custom network undergoing a variety of attacks using IXIA PerfectStorm hardware. Mostly geared towards anomaly-based NIDS | 2015 | Undisclosed | Undisclosed | 🟩 | pcaps, custom network features | >100 GB | - | +| [User-Computer Associations in Time](../datasets/user_computer_associations) | - | Large number of authentication events over a period of 9 months, but with very little detail and without any attacks | 2014 | Enterprise IT | Undisclosed | 🟥 | Custom auth event logs | 2,3 GB | - | | [VAST Challenge 2011](../datasets/vast_2011) | Network | Originated from a challenge about data analytics, focus on network but also contains host logs. Labeling is a bit lacking | 2011 | Enterprise IT | Windows | 🟨 | pcaps, Windows events, misc. 
logs (firewall, Snort, Nessus) | 940 MB | 9,3 GB | | [VAST Challenge 2012](../datasets/vast_2012) | Network | Originated from a challenge about data analytics, focus an a large network being the victim of a botnet | 2012 | Enterprise IT | Undisclosed | 🟨 | Snort alerts, firewall logs | 186 MB | 2,9 GB | diff --git a/content/datasets/botsv3.md b/content/datasets/botsv3.md index ca80ea4..21811f5 100644 --- a/content/datasets/botsv3.md +++ b/content/datasets/botsv3.md @@ -1,5 +1,5 @@ --- -title: BOTSv3 +title: BOTSv3 [UNLISTED ENTRY] --- - [Overview](#overview) diff --git a/content/datasets/cic_ddos.md b/content/datasets/cic_ddos.md new file mode 100644 index 0000000..665cfca --- /dev/null +++ b/content/datasets/cic_ddos.md @@ -0,0 +1,81 @@ +--- +title: CIC-DDoS2019 +--- + +- [Overview](#overview) +- [Environment](#environment) +- [Activity](#activity) +- [Contained Data](#contained-data) +- [Papers](#papers) +- [Links](#links) +- [Data Examples](#data-examples) + +| | | +|--------------------------|---------------------------------------------------------------| +| **Network Log Source** | Pcaps, NetFlows | +| **Network Logs Labeled** | Flows are labeled | +| **Host Log Source** | Windows event logs, Ubuntu event logs | +| **Host Logs Labeled** | No | +| | | +| **Overall Setting** | Enterprise IT | +| **OS Types** | Windows Vista/7/8.1/10
Ubuntu 16.04
Fortinet | +| **Number of Machines** | 6 | +| **Total Runtime** | ~16 hours | +| **Year of Collection** | 2019 | +| **Attack Categories** | Various DDoS attacks | +| **User Emulation** | Yes, models complex behavior | +| | | +| **Packed Size** | 24,4 GB | +| **Unpacked Size** | n/a | +| **Download Link** | [goto](http://205.174.165.80/CICDataset/CICDDoS2019/Dataset/) | + +*** + +### Overview +The CIC-DDoS2019 dataset, developed by the Canadian Institute for Cybersecurity (CIC), was created to enable evaluation of new DDoS detection methods, which, according to the authors, was not possible with previously existing datasets containing DDoS attacks. +The dataset is accompanied by a newly proposed taxonomy for DDoS attacks, dividing them into several subclasses. +These attacks were then executed within a small testbed, consisting of a victim network performing benign behavior and a separate attacker network. +This simulation was run on two separate days, a training day and a testing day; +data was collected in the form of pcaps, which were then processed into labeled NetFlows. + +### Environment +The victim network consists of four Windows machines (Vista/7/8.1/10), an Ubuntu 16.04 web server and a firewall. +Information regarding software is not available; IPs of individual machines can be found on the homepage. +Attacks originate from a separate attacker network, which is also not further detailed. + +### Activity +So-called B(enign)-Profiles are leveraged to define the normal behavior performed during the collection period; +this simulates 25 distinct users interacting via HTTP, HTTPS, FTP, SSH, and email protocols. +Statistics for these interactions have been derived from observing real human behavior. + +Executed attacks are based on the newly proposed taxonomy of DDoS attacks; for details, refer to Chapter 3 of the cited paper. +On the first day (training day), 12 different DDoS attacks were executed at different points in time. 
+On the second day (testing day), a subset of 5 of these attacks were executed, plus a sixth one that was not performed previously. + +### Contained Data +Attacks were exclusively executed within the collection period, i.e., no attack is running when data collection starts. +Data is organized per day and consists of pcaps, which were then processed into NetFlows using CICFlowMeter and subsequently labeled. +These flows are grouped by attack in separate `csv` files, but there are no flows available for benign behavior. +While these probably could be extracted manually from the available pcaps, I'm honestly not quite sure why they weren't included in the first place. + +A detailed analysis of these flows, especially with respect to the effects of individual attacks on certain features, is available in Chapter 5 of the paper. + +### Papers +- [Developing Realistic Distributed Denial of Service (DDoS) Attack Dataset and Taxonomy (2019)](https://doi.org/10.1109/CCST.2019.8888419) + +### Links +- [Homepage](https://www.unb.ca/cic/datasets/ddos-2019.html) +- [Download](http://205.174.165.80/CICDataset/CICDDoS2019/Dataset/) + +### Data Examples +Labeled flows taken from `CSVs/CSV-03-11/Portmap.csv` +``` +Unnamed: 0,Flow ID, Source IP, Source Port, Destination IP, Destination Port, Protocol, Timestamp, Flow Duration, Total Fwd Packets, Total Backward Packets,Total Length of Fwd Packets, Total Length of Bwd Packets, Fwd Packet Length Max, Fwd Packet Length Min, Fwd Packet Length Mean, Fwd Packet Length Std,Bwd Packet Length Max, Bwd Packet Length Min, Bwd Packet Length Mean, Bwd Packet Length Std,Flow Bytes/s, Flow Packets/s, Flow IAT Mean, Flow IAT Std, Flow IAT Max, Flow IAT Min,Fwd IAT Total, Fwd IAT Mean, Fwd IAT Std, Fwd IAT Max, Fwd IAT Min,Bwd IAT Total, Bwd IAT Mean, Bwd IAT Std, Bwd IAT Max, Bwd IAT Min,Fwd PSH Flags, Bwd PSH Flags, Fwd URG Flags, Bwd URG Flags, Fwd Header Length, Bwd Header Length,Fwd Packets/s, Bwd Packets/s, Min Packet Length, Max Packet Length, 
Packet Length Mean, Packet Length Std, Packet Length Variance,FIN Flag Count, SYN Flag Count, RST Flag Count, PSH Flag Count, ACK Flag Count, URG Flag Count, CWE Flag Count, ECE Flag Count, Down/Up Ratio, Average Packet Size, Avg Fwd Segment Size, Avg Bwd Segment Size, Fwd Header Length.1,Fwd Avg Bytes/Bulk, Fwd Avg Packets/Bulk, Fwd Avg Bulk Rate, Bwd Avg Bytes/Bulk, Bwd Avg Packets/Bulk,Bwd Avg Bulk Rate,Subflow Fwd Packets, Subflow Fwd Bytes, Subflow Bwd Packets, Subflow Bwd Bytes,Init_Win_bytes_forward, Init_Win_bytes_backward, act_data_pkt_fwd, min_seg_size_forward,Active Mean, Active Std, Active Max, Active Min,Idle Mean, Idle Std, Idle Max, Idle Min,SimillarHTTP, Inbound, Label +[...] +162471,172.16.0.5-192.168.50.4-932-44723-17,172.16.0.5,932,192.168.50.4,44723,17,2018-11-03 10:01:35.983831,1,2,0,458.0,0.0,229.0,229.0,229.0,0.0,0.0,0.0,0.0,0.0,4.58E8,2000000.0,1.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,40,0,2000000.0,0.0,229.0,229.0,229.0,0.0,0.0,0,0,0,0,0,0,0,0,0.0,343.5,229.0,0.0,40,0,0,0,0,0,0,2,458,0,0,-1,-1,1,20,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,Portmap +61268,172.16.0.5-192.168.50.4-933-39983-17,172.16.0.5,933,192.168.50.4,39983,17,2018-11-03 10:01:35.984211,1,2,0,458.0,0.0,229.0,229.0,229.0,0.0,0.0,0.0,0.0,0.0,4.58E8,2000000.0,1.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,40,0,2000000.0,0.0,229.0,229.0,229.0,0.0,0.0,0,0,0,0,0,0,0,0,0.0,343.5,229.0,0.0,40,0,0,0,0,0,0,2,458,0,0,-1,-1,1,20,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,Portmap +27258,172.16.0.5-192.168.50.4-934-26737-17,172.16.0.5,934,192.168.50.4,26737,17,2018-11-03 10:01:35.984213,1,2,0,458.0,0.0,229.0,229.0,229.0,0.0,0.0,0.0,0.0,0.0,4.58E8,2000000.0,1.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,40,0,2000000.0,0.0,229.0,229.0,229.0,0.0,0.0,0,0,0,0,0,0,0,0,0.0,343.5,229.0,0.0,40,0,0,0,0,0,0,2,458,0,0,-1,-1,1,20,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,Portmap 
+85566,172.16.0.5-192.168.50.4-648-21313-17,172.16.0.5,648,192.168.50.4,21313,17,2018-11-03 10:01:35.984783,2,2,0,458.0,0.0,229.0,229.0,229.0,0.0,0.0,0.0,0.0,0.0,2.29E8,1000000.0,2.0,0.0,2.0,2.0,2.0,2.0,0.0,2.0,2.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,40,0,1000000.0,0.0,229.0,229.0,229.0,0.0,0.0,0,0,0,0,0,0,0,0,0.0,343.5,229.0,0.0,40,0,0,0,0,0,0,2,458,0,0,-1,-1,1,20,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,Portmap +108025,172.16.0.5-192.168.50.4-935-15051-17,172.16.0.5,935,192.168.50.4,15051,17,2018-11-03 10:01:35.984786,0,2,0,530.0,0.0,265.0,265.0,265.0,0.0,0.0,0.0,0.0,0.0,Infinity,Infinity,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,40,0,0.0,0.0,265.0,265.0,265.0,0.0,0.0,0,0,0,0,0,0,0,0,0.0,397.5,265.0,0.0,40,0,0,0,0,0,0,2,530,0,0,-1,-1,1,20,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,Portmap +87041,172.16.0.5-192.168.50.4-936-49469-17,172.16.0.5,936,192.168.50.4,49469,17,2018-11-03 10:01:35.985305,2,2,0,458.0,0.0,229.0,229.0,229.0,0.0,0.0,0.0,0.0,0.0,2.29E8,1000000.0,2.0,0.0,2.0,2.0,2.0,2.0,0.0,2.0,2.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,40,0,1000000.0,0.0,229.0,229.0,229.0,0.0,0.0,0,0,0,0,0,0,0,0,0.0,343.5,229.0,0.0,40,0,0,0,0,0,0,2,458,0,0,-1,-1,1,20,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,Portmap +``` \ No newline at end of file diff --git a/content/datasets/cic_dos.md b/content/datasets/cic_dos.md new file mode 100644 index 0000000..ae1823b --- /dev/null +++ b/content/datasets/cic_dos.md @@ -0,0 +1,73 @@ +--- +title: CIC DoS +--- + +- [Overview](#overview) +- [Environment](#environment) +- [Activity](#activity) +- [Contained Data](#contained-data) +- [Papers](#papers) +- [Links](#links) +- [Related Entries](#related-entries) + +| | | +|--------------------------|-----------------------| +| **Network Log Source** | Unknown | +| **Network Logs Labeled** | Presumably | +| **Host Log Source** | - | +| **Host Logs Labeled** | - | +| | | +| **Overall Setting** | Single OS | +| **OS Types** | Apache Linux | +| **Number of Machines** | 1 | +| **Total Runtime** | 24 hours | +| 
**Year of Collection** | 2017 | +| **Attack Categories** | Application-layer DoS | +| **User Emulation** | n/a | +| | | +| **Packed Size** | n/a | +| **Unpacked Size** | 4,6 GB | +| **Download Link** | Currently unavailable | + +*** + +### Overview +The Canadian Institute for Cybersecurity (CIC) DoS dataset focuses on Denial-of-Service attacks targeting the application layer (as opposed to the network layer). +The authors argue that these types of DoS attacks commonly evade traditional network-layer-based detection mechanisms, requiring a novel approach. +Specifically, they focus mostly on low-volume DoS attacks, which are characterized by "small amounts of attack traffic transmitted strategically to a victim", whereas high-volume attacks are more similar to traditional DoS attacks, relying on flooding the application layer with requests. +As part of this research, and due to the lack of usable datasets of this kind, the authors introduce the CIC DoS dataset, which consists of 24 hours of traffic collected from a webserver being the victim of such attacks. +However, the dataset is no longer available for unknown reasons, making it both difficult and somewhat pointless to provide a lot of detailed information here. + +### Environment +The victim setup consists of a webserver running Apache Linux v2.2.22, PHP5 and Drupal v7 as a content management system. +Further details are not available. + +### Activity +The declared goal of executed attacks was to render services on the server side unresponsive while being as stealthy and resource-efficient as possible, including stopping attacks as soon as servers became unresponsive. +The authors state that attacks were selected to match the most common types of application-layer DoS, resulting in a mix of high- and low-volume attacks. 
+These attacks were executed leveraging several publicly available tools such as [Goldeneye](https://github.com/jseidl/GoldenEye) or [Slowloris](https://github.com/gkbrk/slowloris), for a total of eight attacks: +- High-volume HTTP attacks: + - DoS improved GET + - DDoS GET + - DoS GET +- Low-volume HTTP attacks + - slow-send body (twice with different tools) + - slow-send headers (twice with different tools) + - slow-read + +Additional details can be found in chapter 6 of the cited paper. + +### Contained Data +Traffic from executed attacks was intermixed with benign traces from the [ISCX Intrusion Detection Evaluation Dataset](iscx_ids_2012.md). +Attack traffic was presumably modified to target servers from the ISCX environment, for a total of 24 hours of attack traffic. +In which format (pcaps, NetFlows, custom features, etc.) this data is available is unknown and also not detailed in the paper. +I would assume data is labeled, but obviously have no way to confirm this. + +### Papers +- [Detecting HTTP-based Application Layer DoS Attacks on Web Servers in the Presence of Sampling (2017)](https://doi.org/10.1016/j.comnet.2017.03.018) + +### Links +- [Homepage](https://www.unb.ca/cic/datasets/dos-dataset.html) + +### Related Entries +- [ISCX Intrusion Detection Evaluation Dataset](iscx_ids_2012.md) \ No newline at end of file diff --git a/content/datasets/comp_multi_source_cybersec_events.md b/content/datasets/comp_multi_source_cybersec_events.md index 2a8a4b5..8fa4f1f 100644 --- a/content/datasets/comp_multi_source_cybersec_events.md +++ b/content/datasets/comp_multi_source_cybersec_events.md @@ -82,6 +82,11 @@ or correlated. 
- [Homepage](https://csr.lanl.gov/data/cyber1/) - Data access must be requested here +### Related Entries +- Other LANL datasets: + - [Unified Host and Network Dataset](unified_host_and_network_dataset.md) + - [User-Computer Authentication Associations in Time](user_computer_associations.md) + ### Example Data Authentication events in `auth.txt` diff --git a/content/datasets/cse_cic_ids2018.md b/content/datasets/cse_cic_ids2018.md index 9b205a4..04d44af 100644 --- a/content/datasets/cse_cic_ids2018.md +++ b/content/datasets/cse_cic_ids2018.md @@ -13,8 +13,8 @@ title: CSE-CIC-IDS2018 | | | |--------------------------|----------------------------------------------------------------------------------------------------------| -| **Network Log Source** | pcaps, network features | -| **Network Logs Labeled** | Only features are labeled | +| **Network Log Source** | pcaps, NetFlows | +| **Network Logs Labeled** | NetFlows are labeled | | **Host Log Source** | Ubuntu event logs, Windows event logs | | **Host Logs Labeled** | No | | | | diff --git a/content/datasets/gure_kddcup.md b/content/datasets/gure_kddcup.md new file mode 100644 index 0000000..9a695a2 --- /dev/null +++ b/content/datasets/gure_kddcup.md @@ -0,0 +1,101 @@ +--- +title: gureKDDCup +--- + +- [Overview](#overview) +- [Environment](#environment) +- [Activity](#activity) +- [Contained Data](#contained-data) +- [Papers](#papers) +- [Links](#links) +- [Related Entries](#related-entries) +- [Data Examples](#data-examples) + +| | | +|--------------------------|------------------------------------------------------------------------------------------------------------| +| **Network Log Source** | Connection records with payload | +| **Network Logs Labeled** | Yes | +| **Host Log Source** | - | +| **Host Logs Labeled** | - | +| | | +| **Overall Setting** | Military IT | +| **OS Types** | Linux 2.0.27
SunOS 4.1.4
Sun Solaris 2.5.1
Windows NT | +| **Number of Machines** | 1000's | +| **Total Runtime** | Nine weeks | +| **Year of Collection** | 1998 | +| **Attack Categories** | DoS
Remote to Local
User to Root
Surveillance/Probing | +| **User Emulation** | Scripts for traffic generation, actual humans for performing complex tasks | +| | | +| **Packed Size** | 10 GB | +| **Unpacked Size** | n/a | +| **Download Link** | [goto](http://www.sc.ehu.es/acwaldap/gureKddcup/gureKDDCup/gureKddcup/complete_database/gureKddcup.tar.gz) | + +*** + +### Overview +The gureKDDCup dataset is an extension of the well-known KDDCup 1999 dataset (which consists of connection records), adding information about the payloads of each connection. +Consequently, it is also based on the DARPA'98 Intrusion Detection Program; +information about both of these datasets can be found in the [Related Entries](#related-entries) section. +Note that the authors did not directly copy the KDDCup 1999 dataset, but instead recreated it using the same methodology, including additional information in the process. + +### Environment +Refer to the underlying [DARPA'98 Intrusion Detection Program](darpa98.md). + +### Activity +Refer to the underlying [DARPA'98 Intrusion Detection Program](darpa98.md). + +### Contained Data +The raw DARPA data, which comes in the form of binary TCP dumps, is transformed into connection records, mimicking the methodology of the KDDCup 1999 dataset. +This entire process is documented extensively in a separate document, which is linked below. +A connection record is defined as "a sequence of TCP packets starting and ending at some well-defined times, between +which data flows to and from a source IP address to a target IP address under some well-defined protocol". +Just as with the KDDCup dataset, each record contains 41 features (described in section C.2 of the documentation), plus a label as the 42nd attribute indicating whether the event is normal or malicious; in the latter case, it also names the specific attack the event belongs to. + +As mentioned, the distinguishing factor here is the inclusion of additional payload information. 
+That is, for each connection record, three additional files are generated: +- `*.a`: sent packets' payloads, sorted by time +- `*.b`: received packets' payloads, sorted by time +- `*.c`: all packet payloads of the connection, sorted by time + +The filename before the extension is equal to the number of the associated connection record. +Data is divided into seven weeks, each of which contains five folders, one for every workday (Monday to Friday). +Each of those contains the following data: +- `gureKddcup.list`: Connection records for that day. +The first 6 attributes are: connection_number, start_time, orig_port, resp_port, orig_ip, resp_ip (information to identify the connection), followed by the cited 41 attributes plus class (see data example below) +- `a-matched`: All sent packets' payloads of that day's connections, one file per connection record. +Each filename matches a connection_number in the list of connection records. +- `b-matched`: All received packets' payloads of that day's connections, one file per connection record. +Each filename matches a connection_number in the list of connection records. +- `c-matched`: All packets' payloads of that day's connections, one file per connection record. +Each filename matches a connection_number in the list of connection records. + +The authors also supply a subset of this data called gureKddcup6percent. +It provides the same information in the same way, but, as the name suggests, only contains 6% of the original connection records plus associated payloads. +This sample contains all non-flooding attacks and a random selection of normal connections. 
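To make the record layout described above a bit more concrete, here is a minimal Python sketch that splits one line of `gureKddcup.list` into its parts. It assumes whitespace-separated fields in the order given above (6 identifying attributes, then 41 features, then the class), and it treats a line without a trailing class token as unlabeled; the latter is only my reading of the data example, not something the documentation confirms. The names `ConnectionRecord` and `parse_record` are made up for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional

N_ID_FIELDS = 6   # connection_number, start_time, orig_port, resp_port, orig_ip, resp_ip
N_FEATURES = 41   # KDD-style features, kept as raw strings in this sketch


@dataclass
class ConnectionRecord:
    connection_number: int
    start_time: float
    orig_port: int
    resp_port: int
    orig_ip: str
    resp_ip: str
    features: List[str]
    label: Optional[str]  # None if the line has no class token (assumption)


def parse_record(line: str) -> ConnectionRecord:
    """Split one whitespace-separated connection record into its parts."""
    tokens = line.split()
    minimum = N_ID_FIELDS + N_FEATURES
    if len(tokens) < minimum:
        raise ValueError(f"expected at least {minimum} fields, got {len(tokens)}")
    ids = tokens[:N_ID_FIELDS]
    features = tokens[N_ID_FIELDS:minimum]
    rest = tokens[minimum:]
    label = rest[0] if rest else None
    return ConnectionRecord(
        connection_number=int(ids[0]),
        start_time=float(ids[1]),
        orig_port=int(ids[2]),
        resp_port=int(ids[3]),
        orig_ip=ids[4],
        resp_ip=ids[5],
        features=features,
        label=label,
    )


if __name__ == "__main__":
    # Synthetic record: real identifying fields, placeholder feature values.
    ids = "64558768 899989341.327858 8 0 197.218.177.69 172.16.114.115"
    line = ids + " " + " ".join(["0"] * N_FEATURES) + " ipsweep"
    record = parse_record(line)
    print(record.connection_number, record.orig_ip, record.label)
```

The parsed `connection_number` is also what ties a record to its payload files, since the payload filenames are equal to that number.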
+ +### Papers +- [Service-independent payload analysis to improve intrusion detection in network traffic (2008)](https://dl.acm.org/doi/10.5555/2449288.2449315) + +### Links +- [Homepage](http://www.sc.ehu.es/acwaldap/gureKddcup/galdetegia_jaso.php) (form does not have to be filled out) +- [Documentation](https://addi.ehu.es/bitstream/handle/10810/20608/20160601_Txostena_gurekddcup_InigoPeronaBalda.pdf?sequence=1) +- [Link Hub](http://www.sc.ehu.es/acwaldap/) (in case the homepage link breaks) + +### Related Entries +- [DARPA'98 Intrusion Detection Program](darpa98.md) +- [KDD Cup 1999](kdd_cup_1999.md) + +### Data Examples +Connection records taken from `gureKddcup/Week6/Thursday/gureKddcup.list/gureKddcup-matched.list` +``` +64558768 899989341.327858 8 0 197.218.177.69 172.16.114.115 0.000000 icmp 8 SH 10 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0 12 0.000000 0.000000 0.120000 1.000000 0.000000 0.000000 0.000000 0.000000 ipsweep +64558769 899989341.638201 4136 80 172.16.113.84 192.43.70.122 0.039594 tcp 80 SF 160 479 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0 9 0.000000 0.000000 0.000000 1.000000 0.000000 0.111111 0.000000 0.000000 +64558771 899989342.617289 1904 161 194.27.251.21 192.168.1.1 0.000000 udp 161 S0 105 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 38 28 0.736842 0.263158 0.000000 0.000000 0.368421 0.500000 0.000000 0.000000 +64558772 899989342.617289 161 1904 192.168.1.1 194.27.251.21 0.045382 udp 161 SF 0 146 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1.000000 1.000000 0.000000 0.000000 1.000000 0.000000 0.000000 39 29 0.743590 0.256410 0.010000 0.000000 0.384615 0.517241 0.000000 0.000000 +64558773 899989343.121947 49724 928 206.48.44.18 172.16.112.50 0.000449 tcp 928 REJ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 25 0 
0.000000 1.000000 0.250000 0.000000 0.000000 0.000000 1.000000 0.000000 portsweep +64558774 899989343.345483 4141 25 172.16.113.84 194.7.248.153 2.057617 tcp 25 SF 3044 325 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0 12 0.000000 0.000000 0.000000 1.000000 0.000000 0.166667 0.000000 0.000000 +64558776 899989345.407192 4144 25 172.16.113.84 196.37.75.158 3.208491 tcp 25 SF 3047 331 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 2 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0 14 0.000000 0.000000 0.000000 1.000000 0.000000 0.142857 0.000000 0.000000 +64558777 899989346.151906 49724 91 206.48.44.18 172.16.112.50 0.000430 tcp 91 REJ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 26 0 0.000000 1.000000 0.260000 0.000000 0.000000 0.000000 1.000000 0.000000 portsweep +64558778 899989346.203066 26326 25 197.182.91.233 172.16.112.207 0.905250 tcp 25 SF 4536 329 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 2 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0 13 0.000000 0.000000 0.000000 1.000000 0.000000 0.153846 0.000000 0.000000 +64558779 899989346.716433 8 0 197.218.177.69 172.16.114.116 0.000000 icmp 8 SH 10 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0 13 0.000000 0.000000 0.130000 1.000000 0.000000 0.000000 0.000000 0.000000 ipsweep +``` \ No newline at end of file diff --git a/content/datasets/kdd_cup_1999.md b/content/datasets/kdd_cup_1999.md index 484b10e..43a6cd2 100644 --- a/content/datasets/kdd_cup_1999.md +++ b/content/datasets/kdd_cup_1999.md @@ -43,24 +43,11 @@ Like the dataset it is based on, due to its age and a number of flaws, it should ### Environment -The simulated Air Force base consists of a small number of hosts, leveraging "custom software" to appear as if they were -1000s of hosts with different IP addresses. 
+Refer to the underlying [DARPA'98 Intrusion Detection Program](darpa98.md). ### Activity -Within the network, automated users perform an array of tasks such as sending mails, browsing, or using services like -FTP, telnet or SNMP. -The total duration of this simulation was nine weeks. -Any protective devices such as firewalls are omitted, as "the focus was on detecting attacks, and not preventing -attacks". -All attacks are performed from the outside of this network, and a sniffer is located at the entry point of the network -to capture this traffic. -Attacks belong to one of four categories: - -- DoS -- Remote to Local -- User to Root -- Surveillance/Probing +Refer to the underlying [DARPA'98 Intrusion Detection Program](darpa98.md). ### Contained Data @@ -68,7 +55,7 @@ The raw DARPA data, which comes in the form of binary TCP dumps, is divided and million connection records) of training data, and two weeks (~two million connection records) of test data. A connection record is defined as "a sequence of TCP packets starting and ending at some well-defined times, between which data flows to and from a source IP address to a target IP address under some well-defined protocol". -Each of these connection records contains 41 features (description linked below), including a label indicating whether +Each of these connection records contains 41 features (description linked below), with a 42nd label indicating whether this event is normal or malicious, which in the latter case also references the specific attack that event belongs to. 
The KDD'99 dataset fixes some issues present in its DARPA foundation, which was severely affected by simulation diff --git a/content/datasets/unified_host_and_network_dataset.md b/content/datasets/unified_host_and_network_dataset.md index 8a7f504..11f99d7 100644 --- a/content/datasets/unified_host_and_network_dataset.md +++ b/content/datasets/unified_host_and_network_dataset.md @@ -78,6 +78,11 @@ Note: Their service is currently unavailable, preventing downloads :( - [Homepage](https://csr.lanl.gov/data/2017/) +### Related Entries +- Other LANL datasets: + - [Comprehensive, Multi-Source Cybersecurity Events](comp_multi_source_cybersec_events.md) + - [User-Computer Authentication Associations in Time](user_computer_associations.md) + ### Data Examples Example network event data diff --git a/content/datasets/user_computer_associations.md b/content/datasets/user_computer_associations.md new file mode 100644 index 0000000..ec6d390 --- /dev/null +++ b/content/datasets/user_computer_associations.md @@ -0,0 +1,82 @@ +--- +title: User-Computer Authentication Associations in Time +--- + +- [Overview](#overview) +- [Environment](#environment) +- [Activity](#activity) +- [Contained Data](#contained-data) +- [Papers](#papers) +- [Links](#links) +- [Related Entries](#related-entries) +- [Data Examples](#data-examples) + +| | | +|--------------------------|-----------------------| +| **Network Log Source** | - | +| **Network Logs Labeled** | - | +| **Host Log Source** | Authentication events | +| **Host Logs Labeled** | No | +| | | +| **Overall Setting** | Enterprise IT | +| **OS Types** | Undisclosed | +| **Number of Machines** | 22,284 | +| **Total Runtime** | 9 months | +| **Year of Collection** | 2014 | +| **Attack Categories** | none | +| **User Emulation** | Real users | +| | | +| **Packed Size** | 2,3 GB | +| **Unpacked Size** | n/a | +| **Download Link** | see below | + +*** + +### Overview +The "User-Computer Authentication in Time" dataset contains 708.304.516 
anonymized authentication events collected over a period of 9 continuous months from the Los Alamos National Laboratory (LANL) enterprise network. +This data was used to study the concept of "Credential Hopping" - where an attacker uses stored/cached credentials on a machine to authenticate to another machine in the network - and which strategies can mitigate the associated risks. +For this purpose, an "authentication graph" was built from these events and analyzed and/or modified using various assumptions and methodologies. +As it does not contain any known malicious events, it is most likely of little interest for intrusion detection research. + +### Environment +Details other than "LANL enterprise network" are not provided, except for some statistical values. +There are a total of 11,362 distinct users and 22,284 distinct computers. + +### Activity +There is no mention of any malicious activity during the collection period. + +### Contained Data +The dataset consists of 708,304,516 authentication events, one event per line, each described with three values: +- an epoch timestamp (epoch 1 is the start of the collection period; an exact time is not provided) +- the user who successfully authenticated, in the form of "U" plus a unique number for that user (e.g., U1337) +- the computer the given user authenticated to, in the form of "C" plus a unique number for that computer (e.g., C42) + +The authors note that authentication events for some centralized computers, specifically the Active Directory Servers, have been removed. +Access to the dataset can easily be requested on the homepage linked below (the email address you need to supply does not have to be valid). +It can either be downloaded as one text file containing the full nine months (2,3 GB), or as nine text files with 30 days of events each.
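The three-value event format described above can be read with a few lines of Python. This is only a minimal sketch, assuming the fields are comma-separated as in the data examples of this entry. The names `AuthEvent`, `parse_event`, and `build_auth_graph` are hypothetical, and the user-to-computer mapping is a simplified stand-in for the "authentication graph" mentioned in the overview, not the construction used in the paper.

```python
from collections import defaultdict
from typing import Iterable, NamedTuple


class AuthEvent(NamedTuple):
    time: int       # epoch offset since the start of the collection period
    user: str       # anonymized user ID, e.g. "U1"
    computer: str   # anonymized computer ID, e.g. "C1"


def parse_event(line: str) -> AuthEvent:
    """Parse one 'time,user,computer' line (comma separator is an assumption)."""
    time, user, computer = line.strip().split(",")
    if not (user.startswith("U") and computer.startswith("C")):
        raise ValueError(f"unexpected event format: {line!r}")
    return AuthEvent(int(time), user, computer)


def build_auth_graph(lines: Iterable[str]) -> dict[str, set[str]]:
    """Map each user to the set of computers that user authenticated to."""
    graph: dict[str, set[str]] = defaultdict(set)
    for line in lines:
        if line.strip():  # skip blank lines
            event = parse_event(line)
            graph[event.user].add(event.computer)
    return dict(graph)


# A few lines taken from the data examples of this entry
sample = ["1,U1,C1", "1,U1,C2", "2,U2,C3", "6,U4,C5", "7,U4,C5"]
graph = build_auth_graph(sample)
```

For the full 2,3 GB file, passing an open file handle to `build_auth_graph` instead of a list would process the events line by line without loading the whole file first.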
+ +### Papers +- [Connected Components and Credential Hopping in Authentication Graphs (2014)](https://doi.org/10.1109/SITIS.2014.95) + +### Links +- [Homepage](https://csr.lanl.gov/data/auth/) + +### Related Entries +- Other LANL datasets: + - [Comprehensive, Multi-Source Cybersecurity Events](comp_multi_source_cybersec_events.md) + - [Unified Host and Network Dataset](unified_host_and_network_dataset.md) + +### Data Examples +The first 10 lines of the dataset +``` +1,U1,C1 +1,U1,C2 +2,U2,C3 +3,U3,C4 +6,U4,C5 +7,U4,C5 +7,U5,C6 +8,U6,C7 +11,U7,C8 +12,U8,C9 +``` \ No newline at end of file diff --git a/content/related_work.md b/content/related_work.md index cce1a59..7a70a75 100644 --- a/content/related_work.md +++ b/content/related_work.md @@ -3,10 +3,10 @@ title: Related Work --- This page lists publications and collections covering IDS datasets. -Related publications, sorted by year or release, are any academic work that at least partially covers the topic of available IDS datasets. +Related publications, sorted by year of release, are any academic work that at least partially covers the topic of available IDS datasets. Collections, sorted alphabetically, simply features agglomerations of IDS-related datasets not backed by a scientific publication. -Each entry consists of citation and a brief description of the survey's scope of selected datasets. +Each entry consists of a citation and a brief description of the survey's scope of selected datasets. Additionally, for publications, all datasets discussed in the survey are also listed, linking to their respective entries on this website, if available. 
## Contents @@ -68,14 +68,13 @@ Referenced datasets: - [CIC-IDS 2017](/intrusion-detection-datasets/content/datasets/cic_ids2017) - [CSE-CIC-IDS 2018](/intrusion-detection-datasets/content/datasets/cse_cic_ids2018) - [CTU 13](/intrusion-detection-datasets/content/datasets/ctu_13) -- CIC DoS +- [CIC DoS](/intrusion-detection-datasets/content/datasets/cic_dos) - [DARPA'98 Intrusion Detection Program](/intrusion-detection-datasets/content/datasets/darpa98) -- Gure-KDD-Cup +- [gureKDDCup](/intrusion-detection-datasets/content/datasets/gure_kddcup) - [ISCX IDS 2012](/intrusion-detection-datasets/content/datasets/iscx_ids_2012) - ISOT - [KDD Cup 1999](/intrusion-detection-datasets/content/datasets/kdd_cup_1999) - [Kyoto Honeypot](/intrusion-detection-datasets/content/datasets/kyoto_honeypot) -- Lawrence Berkeley National Laboratory Traces - [NSL-KDD](/intrusion-detection-datasets/content/datasets/nsl_kdd_dataset) - [Twente 2009](/intrusion-detection-datasets/content/datasets/twente_2009) - [UNSW NB15](/intrusion-detection-datasets/content/datasets/unsw_nb15) @@ -84,6 +83,7 @@ Referenced datasets: Referenced collections: - CAIDA - DEFCON CTF Archive +- [Lawrence Berkeley National Laboratory Traces](#the-internet-traffic-archive) (alias for: The Internet Traffic Archive) - MAWILab - UMass Trace Repository @@ -106,12 +106,12 @@ Referenced datasets: - [ISCX IDS 2012](/intrusion-detection-datasets/content/datasets/iscx_ids_2012) - [KDD Cup 1999](/intrusion-detection-datasets/content/datasets/kdd_cup_1999) - [Kyoto Honeypot](/intrusion-detection-datasets/content/datasets/kyoto_honeypot) -- Lawrence Berkeley National Laboratory Traces - [NSL-KDD](/intrusion-detection-datasets/content/datasets/nsl_kdd_dataset) - [Twente 2009](/intrusion-detection-datasets/content/datasets/twente_2009) Referenced Collections: - CAIDA +- [Lawrence Berkeley National Laboratory Traces](#the-internet-traffic-archive) (alias for: The Internet Traffic Archive) - DEFCON CTF Archive ### A Survey of 
Intrusion Detection Systems leveraging Host Data (2019) @@ -130,14 +130,14 @@ Referenced datasets: - [Comprehensive Multi-Source Cybersecurity Events](/intrusion-detection-datasets/content/datasets/comp_multi_source_cybersec_events) - [CTU 13](/intrusion-detection-datasets/content/datasets/ctu_13) - [DARPA'98 Intrusion Detection Program](/intrusion-detection-datasets/content/datasets/darpa98) -- GURE-KDD +- [gureKDDCup](/intrusion-detection-datasets/content/datasets/gure_kddcup) - [KDD Cup 1999](/intrusion-detection-datasets/content/datasets/kdd_cup_1999) - Malware Capture Facility Project - [NSL-KDD](/intrusion-detection-datasets/content/datasets/nsl_kdd_dataset) - UNM system call dataset - [Unified Host and Network dataset](/intrusion-detection-datasets/content/datasets/unified_host_and_network_dataset) - [UNSW-NB15](/intrusion-detection-datasets/content/datasets/unsw_nb15) -- User-Computer Authentication Associations in Time +- [User-Computer Authentication Associations in Time](/intrusion-detection-datasets/content/datasets/user_computer_associations) - [Vast Challenge 2012]((/intrusion-detection-datasets/content/datasets/vast_2012)) - Vast Challenge 2013 @@ -165,7 +165,7 @@ Referenced datasets: - Booters Dataset - ISCX Botnet 2014 - [CDX CTF 2009](/intrusion-detection-datasets/content/datasets/cdx_2009) -- CIC DoS +- [CIC DoS](/intrusion-detection-datasets/content/datasets/cic_dos) - [CIC-IDS 2017](/intrusion-detection-datasets/content/datasets/cic_ids2017) - CIDDS-001 & 002 - [CSE-CIC-IDS 2018](/intrusion-detection-datasets/content/datasets/cse_cic_ids2018) @@ -175,9 +175,8 @@ Referenced datasets: - [ISCX IDS 2012](/intrusion-detection-datasets/content/datasets/iscx_ids_2012) - ISOT - [KDD Cup 1999](/intrusion-detection-datasets/content/datasets/kdd_cup_1999) -- Kent 2016 +- [Kent 2016](/intrusion-detection-datasets/content/datasets/comp_multi_source_cybersec_events) (alias for: Comprehensive, Multi-Source Cybersecurity Events) - [Kyoto 
Honeypot](/intrusion-detection-datasets/content/datasets/kyoto_honeypot) -- Lawrence Berkeley National Laboratory Traces - NDSec-1 - [NGIDS-DS](/intrusion-detection-datasets/content/datasets/ngids_dataset) - [NSL-KDD](/intrusion-detection-datasets/content/datasets/nsl_kdd_dataset) @@ -203,6 +202,7 @@ Referenced collections: - [IMPACT](#impact) - [Internet Traffic Archive](#the-internet-traffic-archive) - Kaggle +- [Lawrence Berkeley National Laboratory Traces](#the-internet-traffic-archive) (alias for: The Internet Traffic Archive) - [Malware Traffic Analysis](#malware-traffic-analysis) - Mid-Atlantic CCDC - MAWILab @@ -282,7 +282,7 @@ Sahu, S. K., Sarangi, S., & Jena, S. K. (2014, February). A detail analysis on i This paper shortly analyzed three papers the authors deem suitable to test their novel preprocessing techniques, which are supposed to improve the performance of various data mining algorithms. Referenced datasets: -- GURE-KDD +- [gureKDDCup](/intrusion-detection-datasets/content/datasets/gure_kddcup) - [KDD Cup 1999](/intrusion-detection-datasets/content/datasets/kdd_cup_1999) - [NSL-KDD](/intrusion-detection-datasets/content/datasets/nsl_kdd_dataset) @@ -298,7 +298,6 @@ Datasets that are suitable for this purpose are mentioned as a secondary talking Referenced datasets: - [DARPA'98 Intrusion Detection Program](/intrusion-detection-datasets/content/datasets/darpa98) - [ISCX IDS 2012](/intrusion-detection-datasets/content/datasets/iscx_ids_2012) -- Lawrence Berkeley National Laboratory Traces - [KDD Cup 1999](/intrusion-detection-datasets/content/datasets/kdd_cup_1999) - [NSL-KDD](/intrusion-detection-datasets/content/datasets/nsl_kdd_dataset) - TUIDS @@ -306,6 +305,7 @@ Referenced datasets: Referenced collections: - CAIDA - DEFCON CTF archive +- [Lawrence Berkeley National Laboratory Traces](#the-internet-traffic-archive) (alias for: The Internet Traffic Archive) ## Other collections