Merge branch 'main' into issue-13-add-datasets

fkie-cad · Jun 5, 2024 · 7313647
2 parents e6126c2 + 7454611
Showing 11 changed files with 261 additions and 115 deletions.
diff --git a/.github/workflows/jekyll.yml b/.github/workflows/jekyll.yml
@@ -0,0 +1,74 @@
+# This workflow uses actions that are not certified by GitHub.
+# They are provided by a third-party and are governed by
+# separate terms of service, privacy policy, and support
+# documentation.
+
+# Sample workflow for building and deploying a Jekyll site to GitHub Pages
+name: Deploy Jekyll site to Pages
+
+on:
+  # Runs on pushes targeting the default branch
+  push:
+    branches: ["main"]
+
+  # Allows you to run this workflow manually from the Actions tab
+  workflow_dispatch:
+
+# Sets permissions of the GITHUB_TOKEN to allow deployment to GitHub Pages
+permissions:
+  contents: read
+  pages: write
+  id-token: write
+
+# Allow only one concurrent deployment, skipping runs queued between the run in-progress and latest queued.
+# However, do NOT cancel in-progress runs as we want to allow these production deployments to complete.
+concurrency:
+  group: "pages"
+  cancel-in-progress: false
+
+jobs:
+  # Build job
+  build:
+    runs-on: ubuntu-latest
+    steps:
+      - name: Checkout
+        uses: actions/checkout@v4
+      - name: Setup Ruby
+        uses: ruby/setup-ruby@8575951200e472d5f2d95c625da0c7bec8217c42 # v1.161.0
+        with:
+          ruby-version: '3.1' # Not needed with a .ruby-version file
+          bundler-cache: true # runs 'bundle install' and caches installed gems automatically
+          cache-version: 0 # Increment this number if you need to re-download cached gems
+      - name: Setup Pages
+        id: pages
+        uses: actions/configure-pages@v5
+      - name: Setup Python
+        uses: actions/setup-python@v5
+        with:
+          python-version: "3.10"
+      - name: Install Python dependencies
+        run: pip install pandas matplotlib seaborn
+      - name: Generate CSV assets
+        run: python tools/generate_csv.py
+      - name: Generate plot assets
+        run: python tools/generate_plots.py
+      - name: Build with Jekyll
+        # Outputs to the './_site' directory by default
+        run: bundle exec jekyll build --baseurl "${{ steps.pages.outputs.base_path }}"
+        env:
+          JEKYLL_ENV: production
+      - name: Upload artifact
+        # Automatically uploads an artifact from the './_site' directory by default
+        uses: actions/upload-pages-artifact@v3
+
+  # Deployment job
+  deploy:
+    environment:
+      name: github-pages
+      url: ${{ steps.deployment.outputs.page_url }}
+    runs-on: ubuntu-latest
+    needs: build
+    steps:
+      - name: Deploy to GitHub Pages
+        id: deployment
+        uses: actions/deploy-pages@v4
diff --git a/_config.yml b/_config.yml
@@ -24,6 +24,7 @@ navbar-links:
   CSV Download: "content/csv_download"
   Related Work: "content/related_work"
   More:
+    - Statistics: "content/statistics"
     - Contributing: "content/contributing"
     - About: "content/about"
 #  Author's home: "https://deanattali.com"
@@ -290,6 +291,8 @@ exclude:
   - README.md
   - screenshot.png
   - docs/
+  - vendor/
+  - tools/
 
 plugins:
   - jekyll-paginate

diff --git a/assets/data/datasets.csv b/assets/data/datasets.csv
diff --git a/assets/data/plots/.keep.txt b/assets/data/plots/.keep.txt
diff --git a/assets/data/plots/datasets_over_years.pdf b/assets/data/plots/datasets_over_years.pdf
diff --git a/assets/data/plots/datasets_over_years.png b/assets/data/plots/datasets_over_years.png
diff --git a/assets/data/plots/datatypes_percentages.pdf b/assets/data/plots/datatypes_percentages.pdf
diff --git a/assets/data/plots/datatypes_percentages.png b/assets/data/plots/datatypes_percentages.png
diff --git a/content/statistics.md b/content/statistics.md
@@ -0,0 +1,39 @@
+---
+title: Dataset Statistics
+---
+
+The following plots are generated from the CSV file provided in [CSV Download](/intrusion-detection-datasets/content/csv_download).
+
+### Distribution of datasets over time
+
+This figure shows how the currently surveyed datasets are distributed over time, where "time" refers to the year in which the underlying data was generated, which may differ from the year of publication -- if the generation year is not available, the publication year is used instead.
+Datasets containing data from more than one year are represented accordingly.
+Additionally, data sources and label availability are shown:
+Data sources are grouped into "Network Data" (e.g., packet captures or network flows), "Host Data" (e.g., system logs or syscalls), and "Both" (any combination of the previous two);
+label availability for each dataset has been classified into either "Direct" (explicit labels for at least a subset of data), "Indirect" (meta-information allowing for manual labeling), or "No Labels".
+
+Even though this grouping simplifies certain aspects, the figure provides a reasonably broad overview of the current landscape of IDS-related datasets.
+As an example, while the [DARPA '98](/intrusion-detection-datasets/content/datasets/darpa98) and [CSE-CIC-IDS2018](/intrusion-detection-datasets/content/datasets/cse_cic_ids2018) datasets contain both network and host data and are visualized as such, only their network data is labeled and thus typically used by other publications.
+Still, classifying these datasets as containing only network data would go beyond the purpose of a survey, as it is up to other researchers to decide whether the host data can be utilized for their purposes.
+
+<p style="text-align: center;">
+    <img src="{{ "/assets/data/plots/datasets_over_years.png" | relative_url }}" alt="Figure 1: Distribution of datasets over time" />
+</p>
+
+<p style="text-align: center;font-size:0.8em;">
+    <a href="{{ site.baseurl }}/assets/data/plots/datasets_over_years.pdf" download>Download PDF</a>
+</p>
+
+### Dataset characteristics
+
+This figure lists various characteristics of the surveyed datasets, grouped into five categories: source of network data, source of host data, how benign activity was generated, which operating systems were included, and how many systems in total were part of the scenario.
+Except for the final category, these classifications are not mutually exclusive -- consequently, the sum of a specific category might not align with the total number of datasets surveyed.
+This discrepancy occurs because some datasets, for example, do not include network data, while others include multiple operating systems, which lowers or raises the respective sums.
+
+<p style="text-align: center;">
+    <img src="{{ "/assets/data/plots/datatypes_count.png" | relative_url }}" alt="Figure 2: Characteristics of surveyed datasets, grouped into categories." />
+</p>
+
+<p style="text-align: center;font-size:0.8em;">
+    <a href="{{ site.baseurl }}/assets/data/plots/datatypes_count.pdf" download>Download PDF</a>
+</p>
diff --git a/tools/generate_plots.py b/tools/generate_plots.py
@@ -5,7 +5,7 @@
 
 def main():
     with open("assets/data/datasets.csv") as file:
-        data = pd.read_csv(file, delimiter=";")
+        data = pd.read_csv(file, delimiter=";", keep_default_na=False)
 
     gc.datasets_over_years(data)
     gc.datatypes_count(data)
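
Note on the `keep_default_na=False` change in `tools/generate_plots.py`: by default, `pd.read_csv` converts a fixed set of strings (including `NA`, `N/A`, and empty fields) to `NaN`, whereas `keep_default_na=False` without `na_values` leaves every cell as the string written in the file. The sketch below illustrates the difference on a small made-up sample; the column names and values are hypothetical and are not taken from `assets/data/datasets.csv`, and it is not meant to reflect how the plotting helpers actually consume the data.

```python
import io

import pandas as pd

# Hypothetical sample in the same ';'-delimited layout; NOT the real datasets.csv.
SAMPLE_CSV = "name;labels\nDataset A;NA\nDataset B;\nDataset C;Direct\n"


def read_sample(keep_default_na):
    """Parse the sample with or without pandas' default NaN conversion."""
    return pd.read_csv(
        io.StringIO(SAMPLE_CSV),
        delimiter=";",
        keep_default_na=keep_default_na,
    )


with_default = read_sample(True)   # "NA" and the empty field become NaN
as_written = read_sample(False)    # cells stay exactly as written in the file

print(with_default["labels"].isna().tolist())  # [True, True, False]
print(as_written["labels"].tolist())           # ['NA', '', 'Direct']
```

With the default behaviour, the literal text `NA` and a genuinely empty cell both become `NaN` and can no longer be told apart or counted as their own category, which is presumably why the parsing was switched off here.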