Merge pull request #7 from online-ml/ahnentafel
Ahnentafel
MaxHalford authored Dec 14, 2023
2 parents 729fbbc + 566d803 commit e55df97
Showing 16 changed files with 589 additions and 16 deletions.
7 changes: 4 additions & 3 deletions .gitignore
@@ -1,4 +1,5 @@
# Mini river
Cargo.lock

target/
.DS_Store
target/
*.csv
*.zip
31 changes: 31 additions & 0 deletions CONTRIBUTING.md
@@ -0,0 +1,31 @@
## Benchmarking

```sh
cargo bench --bench hst
```

## Changelog

### 2023-10-04

We test with:

- 50 trees
- Tree height of 6
- Window size of 1000

The Python baseline runs in **~60 seconds** using Python 3.11 on macOS. It uses the classic class-based implementation, where each node holds left/right child pointers.

We coded a first array-based implementation in Rust. It runs in **~6 seconds**. Each tree is a struct, and each struct contains one array per node attribute. We wonder if we can do better by storing all attributes in a single matrix.
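To illustrate what we mean by array-based (a minimal sketch with made-up names, not the actual `HST` layout), node attributes live in flat vectors laid out as an ahnentafel, so a node's children are found by index arithmetic rather than by chasing pointers:

```rust
// Sketch only: illustrative names, not the real light-river structs.
// An ahnentafel lays a complete binary tree out in an array:
// the children of node i sit at 2i + 1 and 2i + 2.
struct Tree {
    // One flat vector per node attribute, indexed by node position.
    l_mass: Vec<f32>,
    r_mass: Vec<f32>,
}

fn n_nodes(height: u32) -> usize {
    (1usize << height) - 1 // a tree of height h has 2^h - 1 nodes
}

fn left_child(node: usize) -> usize {
    2 * node + 1
}

fn right_child(node: usize) -> usize {
    2 * node + 2
}

fn main() {
    let tree = Tree {
        l_mass: vec![0.0; n_nodes(6)],
        r_mass: vec![0.0; n_nodes(6)],
    };
    // Walking from the root to a leaf is just index arithmetic.
    let node = right_child(left_child(0));
    println!(
        "node {} of {}, mass {}",
        node,
        tree.l_mass.len(),
        tree.l_mass[node] + tree.r_mass[node]
    );
}
```

This keeps each attribute contiguous in memory, which is friendlier to the cache than one heap allocation per node.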

The ROC AUC appears roughly similar between the Python and Rust implementations. Note that we didn't activate min-max scaling in either case.

### 2023-10-05

- Using `with_capacity` on each `Vec` in `HST`, as well as on the list of HSTs, we gain 1 second. We are now at **~5 seconds**.
- We can't find a nice profiler, so for now we comment out code and measure time.
- Storing all attributes in a single array, instead of one array per tree, makes us reach **~3 seconds**.
- We removed the CSV logic from the benchmark, which brings us under **~2.5 seconds**.
- Fixing some algorithmic issues actually brings us back to **~5 seconds** :(
- We tried using rayon to parallelize over trees, but it didn't bring any improvement whatsoever. Maybe we used it wrong, but we believe it's because our loop is too cheap to be worth the overhead of spawning threads -- or whatever it is rayon does.
- There is an opportunity to do the scoring and update logic in one fell swoop. This is because of the nature of online anomaly detection. This would bring us to **~2.5 seconds**. We are not sure if this is a good design choice though, so we may revisit this later.
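To make the single-array idea concrete, here is a rough sketch (illustrative names and sizes, not the actual code) of all trees sharing one pre-allocated buffer, with a tree/node offset computation replacing per-tree vectors:

```rust
// Sketch only: one flat buffer for all trees, instead of a Vec per tree.
fn flat_index(tree: usize, node: usize, nodes_per_tree: usize) -> usize {
    tree * nodes_per_tree + node
}

fn main() {
    let n_trees = 50;
    let nodes_per_tree = (1usize << 6) - 1; // height 6 -> 63 nodes
    // Pre-allocating the whole buffer up front avoids reallocation as
    // the structure grows, which is where the with_capacity gain came from.
    let mut l_mass: Vec<f32> = Vec::with_capacity(n_trees * nodes_per_tree);
    l_mass.resize(n_trees * nodes_per_tree, 0.0);
    // Touch node 10 of tree 3.
    l_mass[flat_index(3, 10, nodes_per_tree)] += 1.0;
    println!("{}", l_mass[flat_index(3, 10, nodes_per_tree)]);
}
```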
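The "one fell swoop" idea works because online anomaly detection scores a point before learning from it, so both steps can share a single tree traversal. A toy sketch of the pattern (hypothetical, not the library's API):

```rust
// Toy sketch: a stateful model that scores with the pre-update state,
// then learns from the observation in the same call.
struct RunningMass {
    total: f32,
}

impl RunningMass {
    fn score_and_update(&mut self, x: f32) -> f32 {
        let score = self.total; // score reflects the state before x...
        self.total += x;        // ...then x is folded in immediately
        score
    }
}

fn main() {
    let mut model = RunningMass { total: 0.0 };
    println!("{}", model.score_and_update(2.0)); // scores against empty state
    println!("{}", model.score_and_update(3.0)); // scores against the first point
}
```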
20 changes: 19 additions & 1 deletion Cargo.toml
Expand Up @@ -6,11 +6,29 @@ edition = "2021"
# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html

[dependencies]

csv = "1.2.0"
num = "0.4.0"
tempfile = "3.4.0"
maplit = "1.0.2"
reqwest = { version = "0.11.4", features = ["blocking"] }
zip = "0.6.4"
rand = "0.8.5"
time = "0.3.29"
half = "2.3.1"

[dev-dependencies]
criterion = { version = "0.5", features = ["html_reports"] }

[profile.dev]
opt-level = 0

[profile.release]
opt-level = 3

[[example]]
name = "credit_card"
path = "examples/anomaly_detection/credit_card.rs"

[[bench]]
name = "hst"
harness = false
59 changes: 58 additions & 1 deletion README.md
@@ -1 +1,58 @@
# Mini river
<h1>🦀 LightRiver • fast and simple online machine learning</h1>

<p>

<!-- Tests -->
<!-- <a href="https://github.com/online-ml/beaver/actions/workflows/unit-tests.yml">
<img src="https://github.com/online-ml/beaver/actions/workflows/unit-tests.yml/badge.svg" alt="tests">
</a> -->

<!-- Code quality -->
<!-- <a href="https://github.com/online-ml/beaver/actions/workflows/code-quality.yml">
<img src="https://github.com/online-ml/beaver/actions/workflows/code-quality.yml/badge.svg" alt="code_quality">
</a> -->

<!-- License -->
<a href="https://opensource.org/licenses/BSD-3-Clause">
<img src="https://img.shields.io/badge/License-BSD%203--Clause-blue.svg?style=flat-square" alt="bsd_3_license">
</a>

</p>

[![Discord](https://dcbadge.vercel.app/api/server/qNmrKEZMAn)](https://discord.gg/qNmrKEZMAn)

<div align="center">
<img src="https://github.com/online-ml/light-river/assets/8095957/fc8ea218-62f9-4643-b25d-f9265ef962f8" width="25%" align="right" />
</div>

LightRiver is an online machine learning library written in Rust. It is meant to be used in high-throughput environments, as well as TinyML systems.

This library is complementary to [River](https://github.com/online-ml/river/). The latter provides a wide array of online methods, but is not ideal when it comes to performance. The idea is to take the algorithms that work best in River and implement them in a more performant way. As such, LightRiver is not meant to be a general-purpose library. It is meant to be a fast online machine learning library that provides a few algorithms known to work well in online settings. This is akin to the way [scikit-learn](https://scikit-learn.org/) and [LightGBM](https://lightgbm.readthedocs.io/en/stable/) complement each other.

## 🧑‍💻 Usage

### 🚨 Anomaly detection

```sh
cargo run --release --example credit_card
```

### 📈 Regression

🏗️ We plan to implement Aggregated Mondrian Forests.

### 📊 Classification

🏗️ We plan to implement Aggregated Mondrian Forests.

### 🛒 Recsys

🏗️ [Vowpal Wabbit](https://vowpalwabbit.org/) is very good at recsys via contextual bandits. We don't plan to compete with it. Eventually we want to research a tree-based contextual bandit.

## 🚀 Performance

TODO: add a `benches` directory

## 📝 License

LightRiver is free and open-source software licensed under the [3-clause BSD license](LICENSE).
59 changes: 59 additions & 0 deletions benches/hst.rs
@@ -0,0 +1,59 @@
use criterion::{criterion_group, criterion_main, Criterion, Throughput};
use light_river::anomaly::half_space_tree::HalfSpaceTree;

fn creation(c: &mut Criterion) {
let mut group = c.benchmark_group("creation");

// Generate the 30 feature names V1..V30 instead of spelling them out.
let features: Vec<String> = (1..=30).map(|i| format!("V{}", i)).collect();

for height in [2, 6, 10, 14].iter() {
for n_trees in [3, 30, 300].iter() {
let input = (*height, *n_trees);
// Calculate the throughput based on the provided formula
let throughput = ((2u32.pow(*height) - 1) * *n_trees) as u64;
group.throughput(Throughput::Elements(throughput));
group.bench_with_input(
format!("height={}-n_trees={}", height, n_trees),
&input,
|b, &input| {
b.iter(|| HalfSpaceTree::new(0, input.1, input.0, Some(features.clone())));
},
);
}
}
group.finish();
}

criterion_group!(benches, creation);
criterion_main!(benches);
41 changes: 41 additions & 0 deletions examples/anomaly_detection/credit_card.rs
@@ -0,0 +1,41 @@
use light_river::anomaly::half_space_tree::HalfSpaceTree;
use light_river::common::ClassifierOutput;
use light_river::common::ClassifierTarget;
use light_river::datasets::credit_card::CreditCard;
use light_river::metrics::rocauc::ROCAUC;
use light_river::metrics::traits::ClassificationMetric;
use light_river::stream::data_stream::DataStream;
use light_river::stream::iter_csv::IterCsv;
use std::fs::File;
use std::time::Instant;

fn main() {
let now = Instant::now();

// PARAMETERS
let window_size: u32 = 1000;
let n_trees: u32 = 50;
let height: u32 = 6;
let pos_val_metric = ClassifierTarget::from("1".to_string());
let pos_val_tree = pos_val_metric.clone();
let mut roc_auc: ROCAUC<f32> = ROCAUC::new(Some(10), pos_val_metric.clone());
// INITIALIZATION
let mut hst: HalfSpaceTree<f32> =
HalfSpaceTree::new(window_size, n_trees, height, None, Some(pos_val_tree));

// LOOP
let transactions: IterCsv<f32, File> = CreditCard::load_credit_card_transactions().unwrap();
for transaction in transactions {
let data = transaction.unwrap();
let observation = data.get_observation();
let label = data.to_classifier_target("Class").unwrap();
let score = hst.update(&observation, true, true).unwrap();
// println!("Label: {:?}", label);
// println!("Score: {:?}", score);
roc_auc.update(&score, &label, Some(1.));
}

let elapsed_time = now.elapsed();
println!("Took {}ms", elapsed_time.as_millis());
println!("ROC AUC: {:.2}%", roc_auc.get() * 100.0);
}
6 changes: 6 additions & 0 deletions measure_auc.py
@@ -0,0 +1,6 @@
import pandas as pd
from sklearn import metrics

scores = pd.read_csv('scores.csv', names=['score'])['score']
labels = pd.read_csv('creditcard.csv')['Class']
print(f"{metrics.roc_auc_score(labels, -scores):.2%}")
21 changes: 21 additions & 0 deletions python_baseline.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,21 @@
from river import anomaly
from river import datasets
from time import time

scores = []
hst = anomaly.HalfSpaceTrees(
n_trees=50,
height=6,
window_size=1000,
)
dataset = [x for x, _ in datasets.CreditCard()]
start = time()
for x in dataset:
score = hst.score_one(x)
scores.append(score)
hst.learn_one(x)
print(f"Time: {time() - start:.2f}s")

with open('scores_py.csv', 'w') as f:
for score in scores:
f.write(f"{score}\n")