Merge pull request #7 from online-ml/ahnentafel
Ahnentafel
MaxHalford authored Dec 14, 2023
2 parents 729fbbc + 566d803 commit e55df97
Showing 16 changed files with 589 additions and 16 deletions.
7 changes: 4 additions & 3 deletions .gitignore
@@ -1,4 +1,5 @@
# Mini river
Cargo.lock

target/
.DS_Store
target/
*.csv
*.zip
31 changes: 31 additions & 0 deletions CONTRIBUTING.md
@@ -0,0 +1,31 @@
## Benchmarking

```sh
cargo bench --bench hst
```

## Changelog

### 2023-10-04

We test with:

- 50 trees
- Tree height of 6
- Window size of 1000

The Python baseline runs in **~60 seconds** using Python 3.11 on macOS. It uses the classic class-based implementation, where each node holds left/right child pointers.

We coded a first array-based implementation in Rust. It runs in **~6 seconds**. Each tree is a struct, and each struct contains one array per node attribute. We wonder if we can do better by storing all attributes in a single matrix.
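To illustrate what we mean by array-based (a minimal sketch with made-up names, not the actual `HST` layout), node attributes live in flat vectors laid out as an ahnentafel, so a node's children are found by index arithmetic rather than by chasing pointers:

```rust
// Sketch only: illustrative names, not the real light-river structs.
// An ahnentafel lays a complete binary tree out in an array:
// the children of node i sit at 2i + 1 and 2i + 2.
struct Tree {
    // One flat vector per node attribute, indexed by node position.
    l_mass: Vec<f32>,
    r_mass: Vec<f32>,
}

fn n_nodes(height: u32) -> usize {
    (1usize << height) - 1 // a tree of height h has 2^h - 1 nodes
}

fn left_child(node: usize) -> usize {
    2 * node + 1
}

fn right_child(node: usize) -> usize {
    2 * node + 2
}

fn main() {
    let tree = Tree {
        l_mass: vec![0.0; n_nodes(6)],
        r_mass: vec![0.0; n_nodes(6)],
    };
    // Walking from the root to a leaf is just index arithmetic.
    let node = right_child(left_child(0));
    println!(
        "node {} of {}, mass {}",
        node,
        tree.l_mass.len(),
        tree.l_mass[node] + tree.r_mass[node]
    );
}
```

This keeps each attribute contiguous in memory, which is friendlier to the cache than one heap allocation per node.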

The ROC AUC appears roughly similar between the Python and Rust implementations. Note that we didn't activate min-max scaling in either case.

### 2023-10-05

- Using `with_capacity` on each `Vec` in `HST`, as well as on the list of HSTs, we gain 1 second. We are now at **~5 seconds**.
- We can't find a nice profiler, so for now we comment out code and measure time.
- Storing all attributes in a single array, instead of one array per tree, makes us reach **~3 seconds**.
- We removed the CSV logic from the benchmark, which brings us under **~2.5 seconds**.
- Fixing some algorithmic issues actually brings us back to **~5 seconds** :(
- We tried using rayon to parallelize over trees, but it didn't bring any improvement whatsoever. Maybe we used it wrong, but we believe it's because our loop is too cheap to be worth the overhead of spawning threads -- or whatever it is rayon does.
- There is an opportunity to do the scoring and update logic in one fell swoop. This is because of the nature of online anomaly detection. This would bring us to **~2.5 seconds**. We are not sure if this is a good design choice though, so we may revisit this later.
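To make the single-array idea concrete, here is a rough sketch (illustrative names and sizes, not the actual code) of all trees sharing one pre-allocated buffer, with a tree/node offset computation replacing per-tree vectors:

```rust
// Sketch only: one flat buffer for all trees, instead of a Vec per tree.
fn flat_index(tree: usize, node: usize, nodes_per_tree: usize) -> usize {
    tree * nodes_per_tree + node
}

fn main() {
    let n_trees = 50;
    let nodes_per_tree = (1usize << 6) - 1; // height 6 -> 63 nodes
    // Pre-allocating the whole buffer up front avoids reallocation as
    // the structure grows, which is where the with_capacity gain came from.
    let mut l_mass: Vec<f32> = Vec::with_capacity(n_trees * nodes_per_tree);
    l_mass.resize(n_trees * nodes_per_tree, 0.0);
    // Touch node 10 of tree 3.
    l_mass[flat_index(3, 10, nodes_per_tree)] += 1.0;
    println!("{}", l_mass[flat_index(3, 10, nodes_per_tree)]);
}
```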
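The "one fell swoop" idea works because online anomaly detection scores a point before learning from it, so both steps can share a single tree traversal. A toy sketch of the pattern (hypothetical, not the library's API):

```rust
// Toy sketch: a stateful model that scores with the pre-update state,
// then learns from the observation in the same call.
struct RunningMass {
    total: f32,
}

impl RunningMass {
    fn score_and_update(&mut self, x: f32) -> f32 {
        let score = self.total; // score reflects the state before x...
        self.total += x;        // ...then x is folded in immediately
        score
    }
}

fn main() {
    let mut model = RunningMass { total: 0.0 };
    println!("{}", model.score_and_update(2.0)); // scores against empty state
    println!("{}", model.score_and_update(3.0)); // scores against the first point
}
```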
20 changes: 19 additions & 1 deletion Cargo.toml
Expand Up @@ -6,11 +6,29 @@ edition = "2021"
# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html

[dependencies]

csv = "1.2.0"
num = "0.4.0"
tempfile = "3.4.0"
maplit = "1.0.2"
reqwest = { version = "0.11.4", features = ["blocking"] }
zip = "0.6.4"
rand = "0.8.5"
time = "0.3.29"
half = "2.3.1"

[dev-dependencies]
criterion = { version = "0.5", features = ["html_reports"] }

[profile.dev]
opt-level = 0

[profile.release]
opt-level = 3

[[example]]
name = "credit_card"
path = "examples/anomaly_detection/credit_card.rs"

[[bench]]
name = "hst"
harness = false
59 changes: 58 additions & 1 deletion README.md
@@ -1 +1,58 @@
# Mini river
<h1>🦀 LightRiver • fast and simple online machine learning</h1>

<p>

<!-- Tests -->
<!-- <a href="https://github.com/online-ml/beaver/actions/workflows/unit-tests.yml">
<img src="https://github.com/online-ml/beaver/actions/workflows/unit-tests.yml/badge.svg" alt="tests">
</a> -->

<!-- Code quality -->
<!-- <a href="https://github.com/online-ml/beaver/actions/workflows/code-quality.yml">
<img src="https://github.com/online-ml/beaver/actions/workflows/code-quality.yml/badge.svg" alt="code_quality">
</a> -->

<!-- License -->
<a href="https://opensource.org/licenses/BSD-3-Clause">
<img src="https://img.shields.io/badge/License-BSD%203--Clause-blue.svg?style=flat-square" alt="bsd_3_license">
</a>

</p>

[![Discord](https://dcbadge.vercel.app/api/server/qNmrKEZMAn)](https://discord.gg/qNmrKEZMAn)

<div align="center">
<img src="https://github.com/online-ml/light-river/assets/8095957/fc8ea218-62f9-4643-b25d-f9265ef962f8" width="25%" align="right" />
</div>

LightRiver is an online machine learning library written in Rust. It is meant to be used in high-throughput environments, as well as TinyML systems.

This library is complementary to [River](https://github.com/online-ml/river/). The latter provides a wide array of online methods, but is not ideal when it comes to performance. The idea is to take the algorithms that work best in River and implement them in a more performant way. As such, LightRiver is not meant to be a general-purpose library. It is meant to be a fast online machine learning library that provides a few algorithms known to work well in online settings. This is akin to the way [scikit-learn](https://scikit-learn.org/) and [LightGBM](https://lightgbm.readthedocs.io/en/stable/) complement each other.

## 🧑‍💻 Usage

### 🚨 Anomaly detection

```sh
cargo run --release --example credit_card
```

### 📈 Regression

🏗️ We plan to implement Aggregated Mondrian Forests.

### 📊 Classification

🏗️ We plan to implement Aggregated Mondrian Forests.

### 🛒 Recsys

🏗️ [Vowpal Wabbit](https://vowpalwabbit.org/) is very good at recsys via contextual bandits. We don't plan to compete with it. Eventually we want to research a tree-based contextual bandit.

## 🚀 Performance

TODO: add a `benches` directory

## 📝 License

LightRiver is free and open-source software licensed under the [3-clause BSD license](LICENSE).
59 changes: 59 additions & 0 deletions benches/hst.rs
@@ -0,0 +1,59 @@
use criterion::{criterion_group, criterion_main, Criterion, Throughput};
use light_river::anomaly::half_space_tree::HalfSpaceTree;

fn creation(c: &mut Criterion) {
let mut group = c.benchmark_group("creation");

// Generate the 30 feature names V1..V30 instead of spelling them out.
let features: Vec<String> = (1..=30).map(|i| format!("V{}", i)).collect();

for height in [2, 6, 10, 14].iter() {
for n_trees in [3, 30, 300].iter() {
let input = (*height, *n_trees);
// Calculate the throughput based on the provided formula
let throughput = ((2u32.pow(*height) - 1) * *n_trees) as u64;
group.throughput(Throughput::Elements(throughput));
group.bench_with_input(
format!("height={}-n_trees={}", height, n_trees),
&input,
|b, &input| {
b.iter(|| HalfSpaceTree::new(0, input.1, input.0, Some(features.clone())));
},
);
}
}
group.finish();
}

criterion_group!(benches, creation);
criterion_main!(benches);
41 changes: 41 additions & 0 deletions examples/anomaly_detection/credit_card.rs
@@ -0,0 +1,41 @@
use light_river::anomaly::half_space_tree::HalfSpaceTree;
use light_river::common::ClassifierOutput;
use light_river::common::ClassifierTarget;
use light_river::datasets::credit_card::CreditCard;
use light_river::metrics::rocauc::ROCAUC;
use light_river::metrics::traits::ClassificationMetric;
use light_river::stream::data_stream::DataStream;
use light_river::stream::iter_csv::IterCsv;
use std::fs::File;
use std::time::Instant;

fn main() {
let now = Instant::now();

// PARAMETERS
let window_size: u32 = 1000;
let n_trees: u32 = 50;
let height: u32 = 6;
let pos_val_metric = ClassifierTarget::from("1".to_string());
let pos_val_tree = pos_val_metric.clone();
let mut roc_auc: ROCAUC<f32> = ROCAUC::new(Some(10), pos_val_metric.clone());
// INITIALIZATION
let mut hst: HalfSpaceTree<f32> =
HalfSpaceTree::new(window_size, n_trees, height, None, Some(pos_val_tree));

// LOOP
let transactions: IterCsv<f32, File> = CreditCard::load_credit_card_transactions().unwrap();
for transaction in transactions {
let data = transaction.unwrap();
let observation = data.get_observation();
let label = data.to_classifier_target("Class").unwrap();
let score = hst.update(&observation, true, true).unwrap();
// println!("Label: {:?}", label);
// println!("Score: {:?}", score);
roc_auc.update(&score, &label, Some(1.));
}

let elapsed_time = now.elapsed();
println!("Took {}ms", elapsed_time.as_millis());
println!("ROC AUC: {:.2}%", roc_auc.get() * 100.0);
}
6 changes: 6 additions & 0 deletions measure_auc.py
@@ -0,0 +1,6 @@
import pandas as pd
from sklearn import metrics

scores = pd.read_csv('scores.csv', names=['score'])['score']
labels = pd.read_csv('creditcard.csv')['Class']
print(f"{metrics.roc_auc_score(labels, -scores):.2%}")
21 changes: 21 additions & 0 deletions python_baseline.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,21 @@
from river import anomaly
from river import datasets
from time import time

scores = []
hst = anomaly.HalfSpaceTrees(
n_trees=50,
height=6,
window_size=1000,
)
dataset = [x for x, _ in datasets.CreditCard()]
start = time()
for x in dataset:
score = hst.score_one(x)
scores.append(score)
hst.learn_one(x)
print(f"Time: {time() - start:.2f}s")

with open('scores_py.csv', 'w') as f:
for score in scores:
f.write(f"{score}\n")