Commit

Merge pull request #65 from fkie-cad/issue-13-add-datasets
Merge new dataset entries into main
ru37z authored Jun 5, 2024
2 parents ad740fa + 7313647 commit be72871
Showing 6 changed files with 301 additions and 55 deletions.
101 changes: 51 additions & 50 deletions content/all_datasets.md

Large diffs are not rendered by default.

2 changes: 2 additions & 0 deletions content/contributing.md
@@ -16,5 +16,7 @@ If you want to contribute a new dataset entry, please use this [template](https:
A new entry should consist of said template filled out and named appropriately, placed in `/content/datasets/`.
Additionally, a new row should be added to the list of all datasets in `/content/all_datasets.md`, adding information to each cell as needed.

You can find a list of datasets that we are aware of, but which do not have an entry yet, in [this issue](https://github.com/fkie-cad/intrusion-detection-datasets/issues/13).

On every page you will also find an "Edit Page" button at the bottom leading you to GitHub, where you will be prompted to fork this repository - saving you a few clicks when you want to edit an existing entry.
While contributions should generally be aimed towards datasets, suggestions regarding the underlying structure (like the website itself) are of course also welcome.
62 changes: 62 additions & 0 deletions content/datasets/isot_botnet.md
@@ -0,0 +1,62 @@
---
title: ISOT BOTNET
---

- [Overview](#overview)
- [Environment](#environment)
- [Activity](#activity)
- [Contained Data](#contained-data)
- [Papers](#papers)
- [Links](#links)

| <!-- --> | <!-- --> |
|--------------------------|--------------------------------------------------------------------------------|
| **Network Data Source** | pcaps |
| **Network Data Labeled** | Yes |
| **Host Data Source** | - |
| **Host Data Labeled** | - |
| | |
| **Overall Setting** | Enterprise IT |
| **OS Types** | Undisclosed |
| **Number of Machines** | 2000+ |
| **Total Runtime** | n/a |
| **Year of Collection** | 2004-2010 |
| **Attack Categories** | Botnets (Storm, Waledac) |
| **Benign Activity** | Real users |
| | |
| **Packed Size** | 3 GB |
| **Unpacked Size** | 10.6 GB |
| **Download Link** | [goto](https://drive.google.com/file/d/1X1zPBJFPHU1ToQbpyd1Is1tJJuz2BeRd/view) |

***

### Overview
The ISOT Botnet dataset is an amalgamation of several individual datasets: two containing malicious botnet traffic and five consisting of benign traffic.
Malicious data was taken from the "French Chapter" of the Honeynet project, while (anonymized) benign traces come from the LBNL Enterprise Trace Repository.
The combination of these traces, after some preprocessing to make them appear as if they stemmed from the same network, is then used to test several botnet detection methods leveraging network behavior analysis and machine learning.
However, we were unable to find any information regarding the source of malicious traces, as linked pages no longer exist and further search remained fruitless.

### Environment
The merged dataset contains traces from 23 individual subnets, 22 with only benign traffic (stemming from the LBNL traces) and one with both malicious and benign traffic (merged traffic from both sources).
The IPs of the latter subnet can be obtained from Table 2 of the linked documentation.
Information regarding services, operating systems and so on is not available.

### Activity
Details regarding activity are not available;
there might be some additional information hidden in LBNL publications, but we consider this to be out of scope.

### Contained Data
As a first step to merge benign and malicious traces, the IP addresses of infected machines were mapped to two of the machines providing benign background traffic.
Then, the authors used the `TcpReplay` tool to replay all traces on the same network interface in order to homogenize the network behavior shown by the individual datasets.
The resulting traces are available as a single large pcap file with 1,675,424 unique flows, of which 3.33% are malicious.
Labels can be derived from MAC addresses: malicious traffic carries specific MAC addresses, as per Table 2 of the linked documentation.
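
The MAC-based labeling can be sketched as follows. This is only an illustration, not the authors' tooling: the MAC addresses are placeholders (the real values have to be taken from Table 2 of the linked documentation), the helper names are hypothetical, and we assume that a frame counts as malicious if either its source or its destination MAC matches.

```python
import struct

# Placeholder values (assumption): replace with the MAC addresses
# listed in Table 2 of the linked documentation.
MALICIOUS_MACS = {"aa:bb:cc:dd:ee:01", "aa:bb:cc:dd:ee:02"}


def format_mac(raw: bytes) -> str:
    """Render six raw bytes as a lowercase, colon-separated MAC address."""
    return ":".join(f"{byte:02x}" for byte in raw)


def label_frame(frame: bytes, malicious_macs=MALICIOUS_MACS) -> str:
    """Label a raw Ethernet frame as 'malicious' or 'benign'.

    Assumption: a match on either the source or the destination MAC
    is enough to mark the frame as malicious.
    """
    if len(frame) < 14:
        raise ValueError("frame is shorter than an Ethernet header")
    # Ethernet header: 6 bytes destination, 6 bytes source, 2 bytes EtherType.
    dst, src, _ethertype = struct.unpack("!6s6sH", frame[:14])
    if format_mac(src) in malicious_macs or format_mac(dst) in malicious_macs:
        return "malicious"
    return "benign"
```

The function only inspects the first 14 bytes of each frame, so it can be applied to the raw packet bytes returned by any pcap reader.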

It should be noted that the application of methods based on machine learning on merged datasets bears some additional risks;
researchers must ensure that results are not a byproduct of anomalies that remained after the merging process, which might not actually be caused by the malicious behavior, but rather by the simple fact that these traces stem from separate environments.

### Papers
- [Detecting P2P botnets through network behavior analysis and machine learning (2011)](https://doi.org/10.1109/PST.2011.5971980)

### Links
- [Documentation](https://onlineacademiccommunity.uvic.ca/isot/wp-content/uploads/sites/7295/2023/03/ISOT-Dataset-Overview-v0.5.pdf)
- [LBNL/ICSI Enterprise Tracing Project](https://www.icir.org/enterprise-tracing/download.html)
61 changes: 61 additions & 0 deletions content/datasets/unibs.md
@@ -0,0 +1,61 @@
---
title: UNIBS
---

- [Overview](#overview)
- [Environment](#environment)
- [Activity](#activity)
- [Contained Data](#contained-data)
- [Papers](#papers)
- [Links](#links)

| <!-- --> | <!-- --> |
|--------------------------|--------------------------------------------------------------------|
| **Network Data Source** | NetFlows |
| **Network Data Labeled** | No |
| **Host Data Source** | - |
| **Host Data Labeled** | - |
| | |
| **Overall Setting** | Enterprise IT |
| **OS Types** | Undisclosed |
| **Number of Machines** | 20 |
| **Total Runtime** | 3 days |
| **Year of Collection** | 2009 |
| **Attack Categories** | None |
| **Benign Activity** | Real users |
| | |
| **Packed Size** | - |
| **Unpacked Size** | 2.7 GB |
| **Download Link** | [must be requested](http://netweb.ing.unibs.it/~ntw/tools/traces/) |

***

### Overview
The University of Brescia (UNIBS) dataset was created to showcase the capabilities of the "GT" software, an open source toolset facilitating the association of application-level ground truth with network traffic traces.
This is done by probing a monitored host's kernel to gather ground truth at the application level, which can then later be assigned to any collected traces with minimal CPU overhead.
Beyond this, the dataset does not seem to serve a greater purpose, as it does not contain any malicious activity (that the authors are aware of) and is also anonymized.

### Environment
Traffic was collected from 20 workstations located in the campus network of the University of Brescia over the course of three consecutive days (2009-09-30 to 2009-10-02).
Each workstation is running a "GT client daemon"; information regarding network configuration or specific operating systems is not available.

### Activity
(Presumably) real users used a variety of traffic-generating applications and protocols, namely:
- Web (HTTP, HTTPS)
- Mail (POP3, IMAP4, SMTP)
- Skype
- P2P (BitTorrent, eDonkey)
- Other (FTP, SSH, MSN)

Further details are not available, most likely because the focus of this dataset was simply on correctly assigning flows to these services or protocols.
Intentional malicious activity is not present.

### Contained Data
Traffic was collected from the central faculty router via `tcpdump` and enriched with ground truth from the GT tool (in the form of related protocol and application).
It is available in an anonymized and payload-stripped form, presumably as NetFlows, but has to be requested via mail.
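
The idea of enriching flows with application-level ground truth can be sketched as a lookup on the flow 5-tuple. This is a minimal illustration under stated assumptions and not the GT implementation: the record layout, `FlowKey`, and `annotate_flows` are hypothetical, since neither the GT log format nor the format of the released data is described here.

```python
from typing import Dict, Iterable, List, NamedTuple, Tuple


class FlowKey(NamedTuple):
    """Hypothetical 5-tuple identifying one direction of a flow."""
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    protocol: str


def annotate_flows(
    flows: Iterable[FlowKey],
    ground_truth: Dict[FlowKey, str],
) -> List[Tuple[FlowKey, str]]:
    """Attach an application name to each flow, or 'unknown' if none matches.

    Assumption: ground truth is recorded for one direction only, so the
    reversed 5-tuple is tried as well.
    """
    annotated = []
    for key in flows:
        reverse = FlowKey(key.dst_ip, key.dst_port,
                          key.src_ip, key.src_port, key.protocol)
        app = ground_truth.get(key) or ground_truth.get(reverse) or "unknown"
        annotated.append((key, app))
    return annotated
```

Keying the lookup on the full 5-tuple, rather than on ports alone, reflects the motivation stated above: the application is taken from what the host reports, not inferred from well-known port numbers.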

### Papers
- [GT: picking up the truth from the ground for internet traffic (2009)](https://doi.org/10.1145/1629607.1629610)

### Links
- [Homepage](http://netweb.ing.unibs.it/~ntw/tools/traces/)