Mondrian Forests #10

Open
wants to merge 60 commits into
base: main
60 commits
6023dd1 Update ClassifierOutput docstring (MarcoDiFrancesco, Apr 11, 2024)
feba8a0 Add RegressionOutput to common (MarcoDiFrancesco, Apr 11, 2024)
c13d3c6 Merge branch 'online-ml:main' into main (MarcoDiFrancesco, Apr 11, 2024)
308a082 Add boilerplate code for mondrian forest (MarcoDiFrancesco, Apr 12, 2024)
3ba0e3a Add keystroke dataset (MarcoDiFrancesco, Apr 12, 2024)
2f9e03d Add all functions calls with unimplemented errors (MarcoDiFrancesco, Apr 15, 2024)
7b63db5 Add predict steps to be refactored (MarcoDiFrancesco, Apr 15, 2024)
d5bb6db Add get features function (MarcoDiFrancesco, Apr 16, 2024)
b5b7ec4 Add Array library (MarcoDiFrancesco, Apr 16, 2024)
d613df2 Add randomization for cache tests (MarcoDiFrancesco, Apr 16, 2024)
2174472 Disable test github actions and enable only check (MarcoDiFrancesco, Apr 17, 2024)
1c91530 Remove verbose from build and test (MarcoDiFrancesco, Apr 17, 2024)
44cfba4 Add Stats struct and impl (MarcoDiFrancesco, Apr 18, 2024)
4c6ebe4 Add rust caching in actions (MarcoDiFrancesco, Apr 18, 2024)
1ccabc4 Split MondrianTree and MondrianForest (MarcoDiFrancesco, Apr 22, 2024)
ac71b06 Refactor to use Tree Vector indicies instead of pointers (MarcoDiFrancesco, Apr 23, 2024)
8aad4ed Change actions cargo.lock to cargo.toml (MarcoDiFrancesco, Apr 23, 2024)
8c91dd8 Add print function for MondrianTree (MarcoDiFrancesco, Apr 23, 2024)
6b38849 Adding print functions to mondriantree and node (MarcoDiFrancesco, Apr 23, 2024)
107354a Implement and test predict_proba (MarcoDiFrancesco, Apr 24, 2024)
4385fe8 Add unit test for predict_proba (MarcoDiFrancesco, Apr 24, 2024)
49d4e3e Add final implementation of inference (predict_proba) (MarcoDiFrancesco, Apr 24, 2024)
a16d3e7 Add random distribution to extend mondrian block (MarcoDiFrancesco, Apr 25, 2024)
de5d67a Add full extend_mondrian_block implementation (MarcoDiFrancesco, Apr 25, 2024)
667d35e Add synthetic dataset and tree integrity tests (MarcoDiFrancesco, Apr 25, 2024)
f79864d Fix pointer of grandpa on extend_mondrian_block (MarcoDiFrancesco, Apr 26, 2024)
989c176 Add recursive repr mondrian forest (MarcoDiFrancesco, Apr 26, 2024)
75e5feb Add score function (MarcoDiFrancesco, Apr 29, 2024)
da4a00a Remove debug statements (MarcoDiFrancesco, Apr 30, 2024)
717161f Adjust code to River behaviour (MarcoDiFrancesco, Apr 30, 2024)
a9ca4bc Adapt _go_downwards from River (MarcoDiFrancesco, May 3, 2024)
ccc9b1d Update function names from nel215 to River (MarcoDiFrancesco, May 3, 2024)
30fb86b Comment debug prints (MarcoDiFrancesco, May 3, 2024)
a619415 Remove unused imports (MarcoDiFrancesco, May 3, 2024)
da23d14 Add synthetic dataset download (MarcoDiFrancesco, May 3, 2024)
85030ad Rename MondrianForest to MondrianForestClassifier (MarcoDiFrancesco, May 6, 2024)
c4753f1 Update readme with classification run instructions (MarcoDiFrancesco, May 6, 2024)
a08f922 Add update_leaf flag to create_leaf (MarcoDiFrancesco, May 13, 2024)
a00cfe5 Fix mondrian forest classifier test (MarcoDiFrancesco, May 13, 2024)
4d9ef48 Remove create_leaf flag (MarcoDiFrancesco, May 20, 2024)
0217db2 Add create leafs when reaching a leaf (MarcoDiFrancesco, May 24, 2024)
1e5a874 Add assert to check for NaN probability (MarcoDiFrancesco, May 24, 2024)
6971c21 Revert removal of split_time (MarcoDiFrancesco, May 24, 2024)
782d1f2 Add test cases (MarcoDiFrancesco, May 29, 2024)
a5bd895 Remove unused `child_is_on_edge_parent` test case (MarcoDiFrancesco, May 29, 2024)
3544c28 Add debug statement for overwriting variance aware estimation (MarcoDiFrancesco, May 29, 2024)
9083d8e Add synthetic regression target boilerplate (MarcoDiFrancesco, Jun 4, 2024)
43cce28 Add Classification and Regression division of MF (MarcoDiFrancesco, Jun 7, 2024)
e58638b Add regression task and parent_has_finite_values test (MarcoDiFrancesco, Jun 11, 2024)
fed6daf Fix child_inside_parent test (MarcoDiFrancesco, Jun 11, 2024)
760de79 Remove prints in excess (MarcoDiFrancesco, Jun 11, 2024)
54bb202 Add regression metrics (MarcoDiFrancesco, Jun 12, 2024)
0d74d3f Fix test keystroke dataset (MarcoDiFrancesco, Jun 12, 2024)
c60b381 Change description of synthetic dataset (MarcoDiFrancesco, Jun 12, 2024)
ec2109a Add baseline comparison for regression (MarcoDiFrancesco, Jun 24, 2024)
b77ba69 Add machine degradation dataset (MarcoDiFrancesco, Jul 9, 2024)
a6c1b8b Add genesis demostrator dataset (MarcoDiFrancesco, Jul 10, 2024)
4a4b9f5 Update machine degradation with redirect (MarcoDiFrancesco, Jul 10, 2024)
23c109e Update src/datasets/synthetic_regression.rs (smastelini, Jul 29, 2024)
38e64ee Update src/datasets/synthetic.rs (smastelini, Jul 29, 2024)
12 changes: 6 additions & 6 deletions .github/workflows/clippy_check.yml
@@ -4,9 +4,9 @@ jobs:
   clippy_check:
     runs-on: ubuntu-latest
     steps:
-      - uses: actions/checkout@v1
-      - run: rustup component add clippy
-      - uses: actions-rs/clippy-check@v1
-        with:
-          token: ${{ secrets.GITHUB_TOKEN }}
-          args: --all-features
+      - uses: actions/checkout@v3
+      # - run: rustup component add clippy
+      # - uses: actions-rs/clippy-check@v1
+        # with:
+          # token: ${{ secrets.GITHUB_TOKEN }}
+          # args: --all-features
25 changes: 24 additions & 1 deletion .github/workflows/rust.yml
@@ -11,12 +11,35 @@ env:

jobs:
  build:

    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v3

      # Setup Rust toolchain
      - name: Set up Rust
        uses: actions-rs/toolchain@v1
        with:
          profile: minimal
          toolchain: stable
          override: true

      # Cache Cargo registry, index, and build output
      - name: Cache Cargo dependencies
        uses: actions/cache@v3
        with:
          path: |
            ~/.cargo/registry
            ~/.cargo/git
            target
          key: ${{ runner.os }}-cargo-${{ hashFiles('**/Cargo.toml') }}
          restore-keys: |
            ${{ runner.os }}-cargo-

      # Build project
      - name: Build
        run: cargo build --verbose

      # Run tests
      - name: Run tests
        run: cargo test --verbose
5 changes: 4 additions & 1 deletion .gitignore
@@ -3,6 +3,9 @@ Cargo.lock
target/
*.csv
*.zip

/.vscode/
# Local configuration
.cargo/config.toml
/.venv*/
generate_data_synthetic.py
/run_synthetic_output*.txt
22 changes: 22 additions & 0 deletions Cargo.toml
@@ -15,6 +15,8 @@ zip = "0.6.4"
 rand = "0.8.5"
 time = "0.3.29"
 half = "2.3.1"
+ndarray = "0.15.6"
+rand_distr = "0.4.3"

 [dev-dependencies]
 criterion = { version = "0.5", features = ["html_reports"] }
@@ -29,6 +31,26 @@ opt-level = 3
 name = "credit_card"
 path = "examples/anomaly_detection/credit_card.rs"

+[[example]]
+name = "genesis_demonstrator"
+path = "examples/classification/genesis_demonstrator.rs"
+
+[[example]]
+name = "keystroke"
+path = "examples/classification/keystroke.rs"
+
+[[example]]
+name = "machine_degradations"
+path = "examples/regression/machine_degradations.rs"
+
+[[example]]
+name = "synthetic"
+path = "examples/classification/synthetic.rs"
+
+[[example]]
+name = "synthetic-regression"
+path = "examples/regression/synthetic_regression.rs"
+
 [[bench]]
 name = "hst"
 harness = false
4 changes: 3 additions & 1 deletion README.md
@@ -43,7 +43,9 @@ cargo run --release --example credit_card

 ### 📊 Classification

-🏗️ We plan to implement Aggregated Mondrian Forests.
+```sh
+RUSTFLAGS=-Awarnings cargo run --example synthetic
+```

 ### 🛒 Recsys

6 changes: 3 additions & 3 deletions examples/anomaly_detection/credit_card.rs
@@ -1,6 +1,6 @@
 use light_river::anomaly::half_space_tree::HalfSpaceTree;
 use light_river::common::ClassifierOutput;
-use light_river::common::ClassifierTarget;
+use light_river::common::ClfTarget;
 use light_river::datasets::credit_card::CreditCard;
 use light_river::metrics::rocauc::ROCAUC;
 use light_river::metrics::traits::ClassificationMetric;
@@ -16,7 +16,7 @@ fn main() {
     let window_size: u32 = 1000;
     let n_trees: u32 = 50;
     let height: u32 = 6;
-    let pos_val_metric = ClassifierTarget::from("1".to_string());
+    let pos_val_metric = ClfTarget::from("1".to_string());
     let pos_val_tree = pos_val_metric.clone();
     let mut roc_auc: ROCAUC<f32> = ROCAUC::new(Some(10), pos_val_metric.clone());
     // INITIALIZATION
@@ -32,7 +32,7 @@ fn main() {
         let score = hst.update(&observation, true, true).unwrap();
         // println!("Label: {:?}", label);
         // println!("Score: {:?}", score);
-        roc_auc.update(&score, &label, Some(1.));
+        // roc_auc.update(&score, &label, Some(1.));
     }

     let elapsed_time = now.elapsed();
110 changes: 110 additions & 0 deletions examples/classification/genesis_demonstrator.rs
@@ -0,0 +1,110 @@
use light_river::datasets::genesis_demonstrator::GenesisDemostrator;
use light_river::mondrian_forest::mondrian_forest::MondrianForestClassifier;

use light_river::common::{Classifier, ClfTarget};
use light_river::datasets::synthetic::Synthetic;
use light_river::stream::iter_csv::IterCsv;
use ndarray::Array1;
use num::ToPrimitive;

use std::fs::File;
use std::time::Instant;

/// Get list of features of the dataset.
///
/// e.g. features: ["H.e", "UD.t.i", "H.i", ...]
fn get_features(transactions: IterCsv<f32, File>) -> Vec<String> {
    let sample = transactions.into_iter().next();
    let observation = sample.unwrap().unwrap().get_observation();
    let mut out: Vec<String> = observation.iter().map(|(k, _)| k.clone()).collect();
    out.sort();
    out
}

fn get_labels(transactions: IterCsv<f32, File>, label_name: &str) -> Vec<String> {
    let mut labels = vec![];
    for t in transactions {
        let data = t.unwrap();
        // TODO: use instead 'to_classifier_target' and a vector of 'ClfTarget'
        let target = data.get_y().unwrap()[label_name].to_string();
        if !labels.contains(&target) {
            labels.push(target);
        }
    }
    labels
}

fn get_dataset_size(transactions: IterCsv<f32, File>) -> usize {
    let mut length = 0;
    for _ in transactions {
        length += 1;
    }
    length
}

fn main() {
    let n_trees: usize = 10;

    let transactions_f = GenesisDemostrator::load_data();
    let features = get_features(transactions_f);

    let transactions_c = GenesisDemostrator::load_data();
    let labels = get_labels(transactions_c, "Label");
    println!("labels: {labels:?}, features: {features:?}");
    let mut mf: MondrianForestClassifier<f32> =
        MondrianForestClassifier::new(n_trees, features.len(), labels.len());
    let mut score_total = 0.0;

    let transactions_l = GenesisDemostrator::load_data();
    let dataset_size = get_dataset_size(transactions_l);

    let now = Instant::now();

    let transactions = GenesisDemostrator::load_data();
    for (idx, transaction) in transactions.enumerate() {
        let data = transaction.unwrap();

        let x = data.get_observation();
        let x = Array1::<f32>::from_vec(features.iter().map(|k| x[k]).collect());

        let y = data.to_classifier_target("Label").unwrap();
        let y = match y {
            ClfTarget::String(y) => y,
            _ => unimplemented!(),
        };
        let y = labels.clone().iter().position(|l| l == &y).unwrap();
        let y = ClfTarget::from(y);
        // println!("=M=1 x:{}, idx: {}", x, idx);

        // Skip first sample since tree has still no node
        if idx != 0 {
            let score = mf.predict_one(&x, &y);
            score_total += score;
            // println!(
            //     "Accuracy: {} / {} = {}",
            //     score_total,
            //     dataset_size - 1,
            //     score_total / idx.to_f32().unwrap()
            // );
        }

        // if idx == 4 {
        //     break;
        // }

        mf.learn_one(&x, &y);
    }

    let elapsed_time = now.elapsed();
    println!("Took {}ms", elapsed_time.as_millis());

    println!(
        "Accuracy: {} / {} = {}",
        score_total,
        dataset_size - 1,
        score_total / (dataset_size - 1).to_f32().unwrap()
    );

    let forest_size = mf.get_forest_size();
    println!("Forest tree sizes: {:?}", forest_size);
}
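
Note on the label lookup in the loop above: `labels.clone().iter().position(...)` clones and linearly scans the label vector for every sample. A minimal sketch of one possible alternative follows, assuming labels are plain `String`s and that first-seen order should be preserved. `build_label_index` is a hypothetical helper written for this illustration; it is not part of light_river or of this PR.

```rust
use std::collections::HashMap;

/// Hypothetical helper (not part of light_river): collect the distinct labels
/// in first-seen order and build a label -> index lookup table in one pass.
fn build_label_index<I>(raw_labels: I) -> (Vec<String>, HashMap<String, usize>)
where
    I: IntoIterator<Item = String>,
{
    let mut labels: Vec<String> = Vec::new();
    let mut index: HashMap<String, usize> = HashMap::new();
    for label in raw_labels {
        // Only the first occurrence of a label is assigned a new index.
        if !index.contains_key(&label) {
            index.insert(label.clone(), labels.len());
            labels.push(label);
        }
    }
    (labels, index)
}
```

With such a table, the per-sample lookup becomes `index[&y]` instead of a clone plus scan; the resulting indices are the same as those produced by `position` on a vector built in first-seen order.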
109 changes: 109 additions & 0 deletions examples/classification/keystroke.rs
@@ -0,0 +1,109 @@
use light_river::common::{Classifier, ClfTarget};
use light_river::datasets::keystroke::Keystroke;
use light_river::mondrian_forest::mondrian_forest::MondrianForestClassifier;

use light_river::stream::iter_csv::IterCsv;
use ndarray::Array1;

use num::ToPrimitive;
use std::fs::File;
use std::time::Instant;

/// Get list of features of the dataset.
///
/// e.g. features: ["H.e", "UD.t.i", "H.i", ...]
fn get_features(transactions: IterCsv<f32, File>) -> Vec<String> {
    let sample = transactions.into_iter().next();
    let observation = sample.unwrap().unwrap().get_observation();
    let mut out: Vec<String> = observation.iter().map(|(k, _)| k.clone()).collect();
    out.sort();
    out
}

fn get_labels(transactions: IterCsv<f32, File>, label_name: &str) -> Vec<String> {
    let mut labels = vec![];
    for t in transactions {
        let data = t.unwrap();
        // TODO: use instead 'to_classifier_target' and a vector of 'ClfTarget'
        let target = data.get_y().unwrap()[label_name].to_string();
        if !labels.contains(&target) {
            labels.push(target);
        }
    }
    labels
}

fn get_dataset_size(transactions: IterCsv<f32, File>) -> usize {
    let mut length = 0;
    for _ in transactions {
        length += 1;
    }
    length
}

fn main() {
    let n_trees: usize = 1;

    let transactions_f = Keystroke::load_data();
    let features = get_features(transactions_f);

    let transactions_c = Keystroke::load_data();
    let labels = get_labels(transactions_c, "subject");
    println!("labels: {labels:?}, features: {features:?}");
    let mut mf: MondrianForestClassifier<f32> =
        MondrianForestClassifier::new(n_trees, features.len(), labels.len());
    let mut score_total = 0.0;

    let transactions_l = Keystroke::load_data();
    let dataset_size = get_dataset_size(transactions_l);

    let now = Instant::now();

    let transactions = Keystroke::load_data();
    for (idx, transaction) in transactions.enumerate() {
        let data = transaction.unwrap();

        let x = data.get_observation();
        let x = Array1::<f32>::from_vec(features.iter().map(|k| x[k]).collect());

        let y = data.to_classifier_target("subject").unwrap();
        let y = match y {
            ClfTarget::String(y) => y,
            _ => unimplemented!(),
        };
        let y = labels.clone().iter().position(|l| l == &y).unwrap();
        let y = ClfTarget::from(y);
        // println!("=M=1 x:{}, idx: {}", x, idx);

        // Skip first sample since tree has still no node
        if idx != 0 {
            let score = mf.predict_one(&x, &y);
            score_total += score;
            // println!(
            //     "Accuracy: {} / {} = {}",
            //     score_total,
            //     dataset_size - 1,
            //     score_total / idx.to_f32().unwrap()
            // );
        }

        // if idx == 4 {
        //     break;
        // }

        mf.learn_one(&x, &y);
    }

    let elapsed_time = now.elapsed();
    println!("Took {}ms", elapsed_time.as_millis());

    println!(
        "Accuracy: {} / {} = {}",
        score_total,
        dataset_size - 1,
        score_total / (dataset_size - 1).to_f32().unwrap()
    );

    let forest_size = mf.get_forest_size();
    println!("Forest tree sizes: {:?}", forest_size);
}
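
Both classification examples evaluate the forest in a test-then-train (prequential) fashion: each sample is scored before the model learns from it, the first sample is skipped because the model is still empty, and the total is divided by `dataset_size - 1`. The sketch below isolates that protocol in self-contained form. The `OnlineClassifier` trait and the `MajorityClass` toy model are hypothetical stand-ins introduced only for this illustration; they are not light_river types, and the real `predict_one` in this PR returns a score rather than a label.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for an online classifier (not a light_river trait).
trait OnlineClassifier {
    /// Returns `None` while the model has nothing to predict from.
    fn predict(&self, x: &[f32]) -> Option<usize>;
    fn learn(&mut self, x: &[f32], y: usize);
}

/// Toy model: always predicts the most frequent label seen so far.
#[derive(Default)]
struct MajorityClass {
    counts: HashMap<usize, usize>,
}

impl OnlineClassifier for MajorityClass {
    fn predict(&self, _x: &[f32]) -> Option<usize> {
        // Highest count wins; ties go to the smaller label for determinism.
        self.counts
            .iter()
            .max_by(|a, b| a.1.cmp(b.1).then(b.0.cmp(a.0)))
            .map(|(label, _)| *label)
    }

    fn learn(&mut self, _x: &[f32], y: usize) {
        *self.counts.entry(y).or_insert(0) += 1;
    }
}

/// Test-then-train accuracy: score each sample before learning from it.
/// Samples seen while the model cannot predict yet are not counted.
fn prequential_accuracy<M: OnlineClassifier>(
    model: &mut M,
    data: &[(Vec<f32>, usize)],
) -> Option<f32> {
    let mut correct = 0usize;
    let mut scored = 0usize;
    for (x, y) in data {
        if let Some(pred) = model.predict(x) {
            scored += 1;
            if pred == *y {
                correct += 1;
            }
        }
        model.learn(x, *y);
    }
    if scored == 0 {
        None
    } else {
        Some(correct as f32 / scored as f32)
    }
}
```

Returning `None` when nothing was scored avoids the division by zero that `dataset_size - 1` would cause on a one-sample dataset.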