Commit

Merge pull request #20 from fkie-cad/add-gure-kddcup-dataset
Add "gureKDDCup" dataset
ru37z authored Apr 2, 2024
2 parents d9f40f0 + 20af38e commit 59af3e0
Showing 4 changed files with 108 additions and 19 deletions.
1 change: 1 addition & 0 deletions content/all_datasets.md
@@ -26,6 +26,7 @@ before-content: gh_buttons.html
| [DARPA TC3](../datasets/darpa_tc3) | Host | Custom event logs from network under attack, designed to facilitate provenance tracking | 2018 | Undisclosed | Undisclosed | 🟨 | Custom event logs | 115 GB | - |
| [DARPA TC5](../datasets/darpa_tc5) | Host | Custom event logs from network under attack from APT groups, designed to facilitate provenance tracking | 2019 | Undisclosed | Undisclosed | 🟨 | Custom event logs | - | - |
| [EVTX to MITRE ATT&CK](../datasets/evtx_to_mitre_attck) | Host | Small dataset providing various events corresponding to certain MITRE tactics/techniques | 2022 | Single OS | Windows | 🟩 | Windows events | <1 GB | <1 GB |
+| [gureKDDCup](../datasets/gure_kddcup) | Network | An extension of the KDDCup 1999 dataset, adding additional information about payloads to each connection record | 2008 | Military IT | Unix | 🟩 | Connection records with payload information | 10 GB | - |
| [ISCX Intrusion Detection Evaluation](../datasets/iscx_ids_2012) | Network | Focus on realistic traffic generation in a company network, combined with some basic attacks | 2012 | Enterprise IT | Windows, Ubuntu | 🟩 | pcaps | 84 GB | 87 GB |
| [KDD Cup 1999](../datasets/kdd_cup_1999) | Network | Network connection events derived from simulated U.S. Air Force network under attack. No longer appropriate to use for multiple reasons | 1999 | Military IT | Unix | 🟩 | Connection records | 18 MB | 743 MB |
| [Kyoto Honeypot](../datasets/kyoto_honeypot) | Network | Collection of features derived from attack traffic targeting honeypots over the span of 9 years | 2006-2015 | Diverse | Windows, Unix, MacOS | 🟩 | Custom network features | 20 GB | - |
101 changes: 101 additions & 0 deletions content/datasets/gure_kddcup.md
@@ -0,0 +1,101 @@
---
title: gureKDDCup
---

- [Overview](#overview)
- [Environment](#environment)
- [Activity](#activity)
- [Contained Data](#contained-data)
- [Papers](#papers)
- [Links](#links)
- [Related Entries](#related-entries)
- [Data Examples](#data-examples)

| <!-- --> | <!-- --> |
|--------------------------|------------------------------------------------------------------------------------------------------------|
| **Network Log Source** | Connection records with payload |
| **Network Logs Labeled** | Yes |
| **Host Log Source** | - |
| **Host Logs Labeled** | - |
| | |
| **Overall Setting** | Military IT |
| **OS Types** | Linux 2.0.27<br/>SunOS 4.1.4<br/>Sun Solaris 2.5.1<br/>Windows NT |
| **Number of Machines** | 1000s |
| **Total Runtime** | Nine weeks |
| **Year of Collection** | 1998 |
| **Attack Categories** | DoS<br/>Remote to Local<br/>User to Root<br/>Surveillance/Probing |
| **User Emulation** | Scripts for traffic generation, actual humans for performing complex tasks |
| | |
| **Packed Size** | 10 GB |
| **Unpacked Size** | n/a |
| **Download Link** | [goto](http://www.sc.ehu.es/acwaldap/gureKddcup/gureKDDCup/gureKddcup/complete_database/gureKddcup.tar.gz) |

***

### Overview
The gureKDDCup dataset is an extension of the well-known KDDCup 1999 dataset, which consists of connection records, and adds payload information to each record.
Consequently, it is also based on the DARPA'98 Intrusion Detection Program;
information about both of these datasets can be found in the [Related Entries](#related-entries) section.
Note that the authors did not directly copy the KDDCup 1999 dataset, but instead recreated it using the same methodology, including additional information in the process.

### Environment
Refer to the underlying [DARPA'98 Intrusion Detection Program](darpa98.md).

### Activity
Refer to the underlying [DARPA'98 Intrusion Detection Program](darpa98.md).

### Contained Data
The raw DARPA data, which comes in the form of binary TCP dumps, is transformed into connection records, mimicking the methodology of the KDDCup 1999 dataset.
This entire process is documented extensively in a separate document, which is linked below.
A connection record is defined as "a sequence of TCP packets starting and ending at some well-defined times, between
which data flows to and from a source IP address to a target IP address under some well-defined protocol".
As in the KDDCup dataset, each record contains 41 features (described in section C.2 of the documentation), followed by a 42nd field, the label, which indicates whether the event is normal or malicious and, in the latter case, names the specific attack the event belongs to.

As mentioned, the distinguishing factor here is the inclusion of additional payload information.
That is, for each connection record, three additional files are generated:
- `*.a`: sent packets' payloads, sorted by time
- `*.b`: received packets' payloads, sorted by time
- `*.c`: all packet payloads of the connection, sorted by time

The filename before the extension is equal to the number of the associated connection record.
The data is divided into seven weeks, each of which contains five folders, one for every workday (Monday to Friday).
Each of those contains the following data:
- `gureKddcup.list`: Connection records for that day.
The first 6 attributes identify the connection: connection_number, start_time, orig_port, resp_port, orig_ip, resp_ip. They are followed by the cited 41 attributes plus the class (see the data example below).
- `a-matched`: All sent packets' payloads of that day's connections, one file per connection record.
Each filename matches a connection_number in the list of connection records.
- `b-matched`: All received packets' payloads of that day's connections, one file per connection record.
Each filename matches a connection_number in the list of connection records.
- `c-matched`: All packets' payloads of that day's connections, one file per connection record.
Each filename matches a connection_number in the list of connection records.
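
As a rough illustration of this layout, the following Python sketch builds the three payload paths for one connection and reads whichever of them exist. The helper names `payload_paths` and `read_payloads` are hypothetical, and the layout `<day folder>/<x>-matched/<connection_number>.<x>` is an assumption (in particular the `c-matched` folder name, chosen by analogy with the `*.c` extension); check both against the extracted archive before relying on them.

```python
from pathlib import Path
from typing import Dict, Optional, Union

# Payload kinds as described above: a = sent, b = received, c = all packets.
PAYLOAD_KINDS = {"a": "sent", "b": "received", "c": "all"}


def payload_paths(day_dir: Union[str, Path], connection_number: int) -> Dict[str, Path]:
    """Build the assumed payload paths for one connection record.

    Assumed (not verified) layout:
    <day_dir>/<kind>-matched/<connection_number>.<kind>
    """
    day = Path(day_dir)
    return {
        kind: day / f"{kind}-matched" / f"{connection_number}.{kind}"
        for kind in PAYLOAD_KINDS
    }


def read_payloads(day_dir: Union[str, Path], connection_number: int) -> Dict[str, Optional[bytes]]:
    """Read the raw payload bytes per kind; None where no file exists."""
    payloads: Dict[str, Optional[bytes]] = {}
    for kind, path in payload_paths(day_dir, connection_number).items():
        payloads[kind] = path.read_bytes() if path.is_file() else None
    return payloads
```

Payloads are read as bytes rather than text because packet contents are not guaranteed to be valid in any particular character encoding.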

The authors also provide a subset of this data called gureKddcup6percent.
It has the same structure but, as the name suggests, contains only 6% of the original connection records and their associated payloads.
This sample contains all non-flood attacks and a random selection of normal connections.

### Papers
- [Service-independent payload analysis to improve intrusion detection in network traffic (2008)](https://dl.acm.org/doi/10.5555/2449288.2449315)

### Links
- [Homepage](http://www.sc.ehu.es/acwaldap/gureKddcup/galdetegia_jaso.php) (form does not have to be filled out)
- [Documentation](https://addi.ehu.es/bitstream/handle/10810/20608/20160601_Txostena_gurekddcup_InigoPeronaBalda.pdf?sequence=1)
- [Link Hub](http://www.sc.ehu.es/acwaldap/) (in case the homepage link stops working)

### Related Entries
- [DARPA'98 Intrusion Detection Program](darpa98.md)
- [KDD Cup 1999](kdd_cup_1999.md)

### Data Examples
Connection records taken from `gureKddcup/Week6/Thursday/gureKddcup.list/gureKddcup-matched.list`
```
64558768 899989341.327858 8 0 197.218.177.69 172.16.114.115 0.000000 icmp 8 SH 10 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0 12 0.000000 0.000000 0.120000 1.000000 0.000000 0.000000 0.000000 0.000000 ipsweep
64558769 899989341.638201 4136 80 172.16.113.84 192.43.70.122 0.039594 tcp 80 SF 160 479 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0 9 0.000000 0.000000 0.000000 1.000000 0.000000 0.111111 0.000000 0.000000
64558771 899989342.617289 1904 161 194.27.251.21 192.168.1.1 0.000000 udp 161 S0 105 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 38 28 0.736842 0.263158 0.000000 0.000000 0.368421 0.500000 0.000000 0.000000
64558772 899989342.617289 161 1904 192.168.1.1 194.27.251.21 0.045382 udp 161 SF 0 146 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1.000000 1.000000 0.000000 0.000000 1.000000 0.000000 0.000000 39 29 0.743590 0.256410 0.010000 0.000000 0.384615 0.517241 0.000000 0.000000
64558773 899989343.121947 49724 928 206.48.44.18 172.16.112.50 0.000449 tcp 928 REJ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 25 0 0.000000 1.000000 0.250000 0.000000 0.000000 0.000000 1.000000 0.000000 portsweep
64558774 899989343.345483 4141 25 172.16.113.84 194.7.248.153 2.057617 tcp 25 SF 3044 325 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0 12 0.000000 0.000000 0.000000 1.000000 0.000000 0.166667 0.000000 0.000000
64558776 899989345.407192 4144 25 172.16.113.84 196.37.75.158 3.208491 tcp 25 SF 3047 331 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 2 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0 14 0.000000 0.000000 0.000000 1.000000 0.000000 0.142857 0.000000 0.000000
64558777 899989346.151906 49724 91 206.48.44.18 172.16.112.50 0.000430 tcp 91 REJ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 26 0 0.000000 1.000000 0.260000 0.000000 0.000000 0.000000 1.000000 0.000000 portsweep
64558778 899989346.203066 26326 25 197.182.91.233 172.16.112.207 0.905250 tcp 25 SF 4536 329 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 2 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0 13 0.000000 0.000000 0.000000 1.000000 0.000000 0.153846 0.000000 0.000000
64558779 899989346.716433 8 0 197.218.177.69 172.16.114.116 0.000000 icmp 8 SH 10 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0 13 0.000000 0.000000 0.130000 1.000000 0.000000 0.000000 0.000000 0.000000 ipsweep
```
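
A minimal Python sketch of how one such line could be split into the six identifying attributes, the 41 features, and the class. The `ConnectionRecord` type and `parse_record` function are hypothetical names, not part of the dataset's tooling. Several lines in the excerpt above end without a class token; treating those as records with no label (`None`) is an assumption, not something the documentation states. The 41 features are kept as raw strings because their individual names and types are not listed here.

```python
from dataclasses import dataclass
from typing import List, Optional

ID_FIELDS = 6        # connection_number, start_time, orig_port, resp_port, orig_ip, resp_ip
FEATURE_FIELDS = 41  # KDDCup-style attributes


@dataclass
class ConnectionRecord:
    connection_number: int
    start_time: float
    orig_port: int
    resp_port: int
    orig_ip: str
    resp_ip: str
    features: List[str]    # the 41 attributes, kept as raw strings
    label: Optional[str]   # class token, or None if the line has none (assumption)


def parse_record(line: str) -> ConnectionRecord:
    """Split one whitespace-separated line of a connection record list."""
    tokens = line.split()
    extra = len(tokens) - ID_FIELDS - FEATURE_FIELDS
    if extra not in (0, 1):
        raise ValueError(f"expected 47 or 48 fields, got {len(tokens)}")
    number, start, orig_port, resp_port, orig_ip, resp_ip = tokens[:ID_FIELDS]
    features = tokens[ID_FIELDS:ID_FIELDS + FEATURE_FIELDS]
    label = tokens[-1] if extra == 1 else None
    return ConnectionRecord(
        connection_number=int(number),
        start_time=float(start),
        orig_port=int(orig_port),
        resp_port=int(resp_port),
        orig_ip=orig_ip,
        resp_ip=resp_ip,
        features=features,
        label=label,
    )
```

The `connection_number` of a parsed record is the value that the payload filenames are keyed on, so it can be used to look up the matching payload files for that day.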
19 changes: 3 additions & 16 deletions content/datasets/kdd_cup_1999.md
@@ -43,32 +43,19 @@ Like the dataset it is based on, due to its age and a number of flaws, it should

### Environment

-The simulated Air Force base consists of a small number of hosts, leveraging "custom software" to appear as if they were
-1000s of hosts with different IP addresses.
+Refer to the underlying [DARPA'98 Intrusion Detection Program](darpa98.md).

### Activity

-Within the network, automated users perform an array of tasks such as sending mails, browsing, or using services like
-FTP, telnet or SNMP.
-The total duration of this simulation was nine weeks.
-Any protective devices such as firewalls are omitted, as "the focus was on detecting attacks, and not preventing
-attacks".
-All attacks are performed from the outside of this network, and a sniffer is located at the entry point of the network
-to capture this traffic.
-Attacks belong to one of four categories:
-
-- DoS
-- Remote to Local
-- User to Root
-- Surveillance/Probing
+Refer to the underlying [DARPA'98 Intrusion Detection Program](darpa98.md).

### Contained Data

The raw DARPA data, which comes in the form of binary TCP dumps, is divided and processed into seven weeks (~five
million connection records) of training data, and two weeks (~two million connection records) of test data.
A connection record is defined as "a sequence of TCP packets starting and ending at some well-defined times, between
which data flows to and from a source IP address to a target IP address under some well-defined protocol".
-Each of these connection records contains 41 features (description linked below), including a label indicating whether
+Each of these connection records contains 41 features (description linked below), with a 42nd label indicating whether
this event is normal or malicious, which in the latter case also references the specific attack that event belongs to.

The KDD'99 dataset fixes some issues present in its DARPA foundation, which was severely affected by simulation
6 changes: 3 additions & 3 deletions content/related_work.md
@@ -70,7 +70,7 @@ Referenced datasets:
- [CTU 13](/intrusion-detection-datasets/content/datasets/ctu_13)
- CIC DoS
- [DARPA'98 Intrusion Detection Program](/intrusion-detection-datasets/content/datasets/darpa98)
-- Gure-KDD-Cup
+- [gureKDDCup](/intrusion-detection-datasets/content/datasets/gure_kddcup)
- [ISCX IDS 2012](/intrusion-detection-datasets/content/datasets/iscx_ids_2012)
- ISOT
- [KDD Cup 1999](/intrusion-detection-datasets/content/datasets/kdd_cup_1999)
@@ -130,7 +130,7 @@ Referenced datasets:
- [Comprehensive Multi-Source Cybersecurity Events](/intrusion-detection-datasets/content/datasets/comp_multi_source_cybersec_events)
- [CTU 13](/intrusion-detection-datasets/content/datasets/ctu_13)
- [DARPA'98 Intrusion Detection Program](/intrusion-detection-datasets/content/datasets/darpa98)
-- GURE-KDD
+- [gureKDDCup](/intrusion-detection-datasets/content/datasets/gure_kddcup)
- [KDD Cup 1999](/intrusion-detection-datasets/content/datasets/kdd_cup_1999)
- Malware Capture Facility Project
- [NSL-KDD](/intrusion-detection-datasets/content/datasets/nsl_kdd_dataset)
@@ -282,7 +282,7 @@ Sahu, S. K., Sarangi, S., & Jena, S. K. (2014, February). A detail analysis on i
This paper briefly analyzes three papers the authors deem suitable to test their novel preprocessing techniques, which are supposed to improve the performance of various data mining algorithms.

Referenced datasets:
-- GURE-KDD
+- [gureKDDCup](/intrusion-detection-datasets/content/datasets/gure_kddcup)
- [KDD Cup 1999](/intrusion-detection-datasets/content/datasets/kdd_cup_1999)
- [NSL-KDD](/intrusion-detection-datasets/content/datasets/nsl_kdd_dataset)
