Merge branch 'main' into issue-13-add-datasets
Maspital committed Jun 5, 2024
2 parents e6126c2 + 7454611 commit 7313647
Showing 11 changed files with 261 additions and 115 deletions.
74 changes: 74 additions & 0 deletions .github/workflows/jekyll.yml
@@ -0,0 +1,74 @@
# This workflow uses actions that are not certified by GitHub.
# They are provided by a third-party and are governed by
# separate terms of service, privacy policy, and support
# documentation.

# Sample workflow for building and deploying a Jekyll site to GitHub Pages
name: Deploy Jekyll site to Pages

on:
  # Runs on pushes targeting the default branch
  push:
    branches: ["main"]

  # Allows you to run this workflow manually from the Actions tab
  workflow_dispatch:

# Sets permissions of the GITHUB_TOKEN to allow deployment to GitHub Pages
permissions:
  contents: read
  pages: write
  id-token: write

# Allow only one concurrent deployment, skipping runs queued between the run in-progress and latest queued.
# However, do NOT cancel in-progress runs as we want to allow these production deployments to complete.
concurrency:
  group: "pages"
  cancel-in-progress: false

jobs:
  # Build job
  build:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v4
      - name: Setup Ruby
        uses: ruby/setup-ruby@8575951200e472d5f2d95c625da0c7bec8217c42 # v1.161.0
        with:
          ruby-version: '3.1' # Not needed with a .ruby-version file
          bundler-cache: true # runs 'bundle install' and caches installed gems automatically
          cache-version: 0 # Increment this number if you need to re-download cached gems
      - name: Setup Pages
        id: pages
        uses: actions/configure-pages@v5
      - name: Setup Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.10"
      - name: Install Python dependencies
        run: pip install pandas matplotlib seaborn
      - name: Generate CSV assets
        run: python tools/generate_csv.py
      - name: Generate plot assets
        run: python tools/generate_plots.py
      - name: Build with Jekyll
        # Outputs to the './_site' directory by default
        run: bundle exec jekyll build --baseurl "${{ steps.pages.outputs.base_path }}"
        env:
          JEKYLL_ENV: production
      - name: Upload artifact
        # Automatically uploads an artifact from the './_site' directory by default
        uses: actions/upload-pages-artifact@v3

  # Deployment job
  deploy:
    environment:
      name: github-pages
      url: ${{ steps.deployment.outputs.page_url }}
    runs-on: ubuntu-latest
    needs: build
    steps:
      - name: Deploy to GitHub Pages
        id: deployment
        uses: actions/deploy-pages@v4
3 changes: 3 additions & 0 deletions _config.yml
@@ -24,6 +24,7 @@ navbar-links:
  CSV Download: "content/csv_download"
  Related Work: "content/related_work"
  More:
    - Statistics: "content/statistics"
    - Contributing: "content/contributing"
    - About: "content/about"
  # Author's home: "https://deanattali.com"
@@ -290,6 +291,8 @@ exclude:
  - README.md
  - screenshot.png
  - docs/
  - vendor/
  - tools/

plugins:
  - jekyll-paginate
49 changes: 0 additions & 49 deletions assets/data/datasets.csv

This file was deleted.

Empty file added assets/data/plots/.keep.txt
Empty file.
Binary file removed assets/data/plots/datasets_over_years.pdf
Binary file not shown.
Binary file removed assets/data/plots/datasets_over_years.png
Binary file not shown.
Binary file removed assets/data/plots/datatypes_percentages.pdf
Binary file not shown.
Binary file removed assets/data/plots/datatypes_percentages.png
Binary file not shown.
39 changes: 39 additions & 0 deletions content/statistics.md
@@ -0,0 +1,39 @@
---
title: Dataset Statistics
---

The following plots are generated from the CSV file provided in [CSV Download](/intrusion-detection-datasets/content/csv_download).

### Distribution of datasets over time

This figure shows how the currently surveyed datasets are distributed over time.
Here, "time" refers to the year in which the underlying data was generated, which may differ from the year of publication; if the generation year is not available, the publication year is used instead.
Datasets containing data from more than one year are represented accordingly.
Additionally, data sources and label availability are shown.
Data sources are grouped into "Network Data" (e.g., packet captures or network flows), "Host Data" (e.g., system logs or syscalls), and "Both" (any combination of the previous two);
label availability for each dataset is classified as "Direct" (explicit labels for at least a subset of data), "Indirect" (meta-information allowing for manual labeling), or "No Labels".
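
The year-assignment rule described above can be sketched in Python as follows. This is only an illustration: the field names (`data_years`, `published`, `source`) and the sample records are hypothetical and are not taken from the survey's CSV or from `tools/generate_plots.py`. The sketch also assumes that a multi-year dataset is counted once in every year it covers, which is one possible reading of "represented accordingly".

```python
from collections import Counter

# Hypothetical records; field names and values are made up for illustration.
datasets = [
    {"name": "A", "data_years": [1998], "published": 1999, "source": "Both"},
    {"name": "B", "data_years": [2017, 2018], "published": 2018, "source": "Network Data"},
    {"name": "C", "data_years": [], "published": 2020, "source": "Host Data"},
]


def years_for(dataset):
    """Return the generation years, or the publication year if they are unknown."""
    return dataset["data_years"] or [dataset["published"]]


def count_per_year(records):
    """Count datasets per (year, source); multi-year datasets count in each year."""
    counts = Counter()
    for record in records:
        for year in years_for(record):
            counts[(year, record["source"])] += 1
    return counts


counts = count_per_year(datasets)
for (year, source), n in sorted(counts.items()):
    print(year, source, n)
```

Under this assumption, the number of plotted entries can exceed the number of datasets, because dataset "B" contributes to both 2017 and 2018.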

Even though this grouping simplifies certain aspects, the figure provides a reasonably broad overview of the current landscape of IDS-related datasets.
For example, the [DARPA '98](/intrusion-detection-datasets/content/datasets/darpa98) and [CSE-CIC-IDS2018](/intrusion-detection-datasets/content/datasets/cse_cic_ids2018) datasets contain both network and host data and are visualized as such, although only their network data is labeled and thus typically used by other publications.
Still, declaring these datasets to contain only network data would go beyond the purpose of a survey: it is up to other researchers to decide whether the remaining (in this case host) data is useful for their purposes.

<p style="text-align: center;">
<img src="{{ "/assets/data/plots/datasets_over_years.png" | relative_url }}" alt="Figure 1: Distribution of datasets over time" />
</p>

<p style="text-align: center;font-size:0.8em;">
<a href="{{ site.baseurl }}/assets/data/plots/datasets_over_years.pdf" download>Download PDF</a>
</p>

### Dataset characteristics

This figure lists various characteristics of the surveyed datasets, grouped into five categories: source of network data, source of host data, how benign activity was generated, which operating systems were included, and how many systems in total were part of the scenario.
Except for the final category, the classifications within a category are not mutually exclusive, so the counts in a category may not add up to the total number of datasets surveyed.
For example, some datasets do not include network data, which lowers the sum for that category, while others include multiple operating systems, which raises it.
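
The effect described in the paragraph above can be reproduced with a small Python sketch. The records and the field names (`os`, `network`) are hypothetical and serve only to illustrate why per-category sums need not match the number of datasets; they do not reflect the survey's actual data or plotting code.

```python
from collections import Counter

# Hypothetical datasets; each field may hold zero, one, or several values.
datasets = [
    {"name": "A", "os": ["Linux"], "network": ["pcap"]},
    {"name": "B", "os": ["Linux", "Windows"], "network": []},
    {"name": "C", "os": ["Windows"], "network": ["pcap"]},
]


def category_counts(records, field):
    """Count every value listed under `field` across all records."""
    return Counter(value for record in records for value in record[field])


os_counts = category_counts(datasets, "os")
network_counts = category_counts(datasets, "network")

print(len(datasets), "datasets")
print("operating systems:", dict(os_counts), "sum:", sum(os_counts.values()))
print("network sources:", dict(network_counts), "sum:", sum(network_counts.values()))
```

In this toy example, the operating-system sum is larger than the number of datasets because "B" lists two systems, and the network sum is smaller because "B" has no network data.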

<p style="text-align: center;">
<img src="{{ "/assets/data/plots/datatypes_count.png" | relative_url }}" alt="Figure 2: Characteristics of surveyed datasets, grouped into categories." />
</p>

<p style="text-align: center;font-size:0.8em;">
<a href="{{ site.baseurl }}/assets/data/plots/datatypes_count.pdf" download>Download PDF</a>
</p>
2 changes: 1 addition & 1 deletion tools/generate_plots.py
@@ -5,7 +5,7 @@

 def main():
     with open("assets/data/datasets.csv") as file:
-        data = pd.read_csv(file, delimiter=";")
+        data = pd.read_csv(file, delimiter=";", keep_default_na=False)

     gc.datasets_over_years(data)
     gc.datatypes_count(data)
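
A minimal sketch of what `keep_default_na=False` changes, assuming a semicolon-separated file like the one read above. The two-row sample and its column names are made up for illustration, and the commit itself does not state why the flag was added. By default, pandas converts empty fields and marker strings such as `NA` to `NaN`; with the flag set (and no `na_values` given), they stay as the literal strings found in the file.

```python
import io

import pandas as pd

# Hypothetical two-row sample in a semicolon-separated layout.
sample = "name;labels\nExample A;NA\nExample B;\n"

# Default behaviour: "NA" and the empty field are both parsed as missing values.
default = pd.read_csv(io.StringIO(sample), delimiter=";")

# With keep_default_na=False the raw strings are preserved.
literal = pd.read_csv(io.StringIO(sample), delimiter=";", keep_default_na=False)

print(default["labels"].isna().tolist())
print(literal["labels"].tolist())
```

Keeping the raw strings matters whenever a cell value that is meaningful in the CSV happens to match one of pandas' default missing-value markers.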