specialcouscous is a Python package for large-scale distributed training and inference of random forests on
supercomputers. It uses the Message Passing Interface (MPI) for communication between parallel processes.
- You can install specialcouscous directly from the main branch. Pull and run: pip install .
- If you want to contribute to specialcouscous as a developer, you need to install the required dependencies with the package: pip install -e ."[dev]"
specialcouscous depends on mpi4py and requires an MPI implementation under the hood. Currently, it is only tested with OpenMPI.
We generate synthetic datasets using sklearn's make_classification.
All additional options, such as the number of features, clusters per class, and class separation, can be passed through.
The class imbalance follows a Skellam distribution, defined by a peak (the class with the highest probability) and μ = μ₁ = μ₂ which adjusts the spread of the distribution. The larger μ, the larger the spread. Edge cases: For μ=0, the peak class has weight 1, while all other classes have weight 0. For μ=inf, the generated dataset is balanced, i.e., all classes have equal weights. Beware that with very strong imbalances, the train-test split may fail if a class has only a single sample. This is especially common when using local test sets.
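To make the role of the peak and μ concrete, the class weights described above can be sketched as follows. This is a minimal illustration, not the package's implementation: the function name skellam_class_weights is hypothetical, and it assumes that the weights are the Skellam probabilities at the offset from the peak class, renormalized over the available classes. The two edge cases are handled explicitly.

```python
import numpy as np
from scipy.stats import skellam


def skellam_class_weights(num_classes, peak, mu):
    """Hypothetical helper: class weights from a symmetric Skellam distribution."""
    if mu == 0:
        # Edge case: all weight on the peak class.
        weights = np.zeros(num_classes)
        weights[peak] = 1.0
        return weights
    if np.isinf(mu):
        # Edge case: perfectly balanced classes.
        return np.full(num_classes, 1.0 / num_classes)
    # Offset of each class from the peak; the symmetric Skellam pmf peaks at 0.
    offsets = np.arange(num_classes) - peak
    weights = skellam.pmf(offsets, mu, mu)
    # Renormalize because the distribution is truncated to the class range.
    return weights / weights.sum()


# Class weights for 11 classes with the peak at class 5.
narrow = skellam_class_weights(num_classes=11, peak=5, mu=0.5)
wide = skellam_class_weights(num_classes=11, peak=5, mu=5.0)
```

A larger μ flattens the weights, so the peak class receives a smaller share in wide than in narrow.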
The following figure illustrates the class weight for 11 classes using different μ and a peak at class 5.
The dataset is generated using sklearn's make_classification.
The classes can be either balanced or imbalanced using a Skellam distribution.
This is referred to as global (im)balance in contrast to the local imbalance caused by the partition (see below).
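As a rough illustration of global imbalance, the following sketch calls make_classification directly with explicit class weights. The weights and all other parameter values are made up for the example; the package derives its weights from the Skellam distribution instead.

```python
from collections import Counter

from sklearn.datasets import make_classification

num_classes = 4
# Hypothetical class weights with the peak at class 2 (must sum to 1).
class_weights = [0.1, 0.2, 0.6, 0.1]

x, y = make_classification(
    n_samples=2000,
    n_features=10,
    n_informative=5,  # must satisfy n_classes * n_clusters_per_class <= 2**n_informative
    n_redundant=0,
    n_classes=num_classes,
    n_clusters_per_class=1,
    class_sep=1.0,
    weights=class_weights,
    flip_y=0.0,  # no label noise, so class frequencies follow the weights
    random_state=0,
)

# Number of samples per class in the generated dataset.
class_counts = Counter(y.tolist())
```

Passing weights=None instead yields a globally balanced dataset.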
To distribute the dataset across ranks, it is first generated locally on each rank using the same random seed. Then, the samples are assigned to ranks and each rank selects its assigned subset. This partition can again be either balanced or imbalanced (referred to as local (im)balance). In the balanced case, the samples of a class are distributed equally across all ranks, i.e., each class has (roughly) the same frequency in the global and all local datasets.
In the imbalanced case, the samples of a class are distributed based on rank-wise class weights, which again follow a Skellam distribution. Each rank has its own distribution; the peaks are spaced linearly on [0, num_classes - 1], considered as a ring. The spread of all distributions depends on the μ = μ₁ = μ₂ parameter of the Skellam distribution.
The samples of a class are partitioned according to the relative weights of all ranks for that class. When combining global and local imbalance, the ranks receive different amounts of data, with ranks whose local peak aligns more closely with the global peak receiving more samples.
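The locally imbalanced partition can be sketched roughly as below. Both function names are hypothetical, and several details are assumptions rather than the package's actual behavior: the peaks are spaced evenly on a ring of length num_classes, distances are rounded to integers before evaluating the Skellam pmf, and samples are assigned to ranks by random sampling with the relative weights as probabilities.

```python
import numpy as np
from scipy.stats import skellam


def rank_class_weights(num_ranks, num_classes, mu):
    """Hypothetical helper: one row of class weights per rank."""
    # Assumption: peaks evenly spaced on a ring of length num_classes.
    peaks = np.arange(num_ranks) * num_classes / num_ranks
    classes = np.arange(num_classes)
    diff = np.abs(classes[None, :] - peaks[:, None])
    # Distance on the ring: go around whichever way is shorter.
    ring_distance = np.minimum(diff, num_classes - diff)
    # The Skellam pmf is only defined on integers, hence the rounding.
    weights = skellam.pmf(np.round(ring_distance), mu, mu)
    return weights / weights.sum(axis=1, keepdims=True)


def assign_samples_to_ranks(labels, weights, seed):
    """Hypothetical helper: draw a rank for every sample, class by class."""
    rng = np.random.default_rng(seed)
    num_ranks, num_classes = weights.shape
    assignment = np.empty(len(labels), dtype=int)
    for class_id in range(num_classes):
        sample_ids = np.flatnonzero(labels == class_id)
        # Relative weight of each rank for this class.
        rank_probabilities = weights[:, class_id] / weights[:, class_id].sum()
        assignment[sample_ids] = rng.choice(
            num_ranks, size=len(sample_ids), p=rank_probabilities
        )
    return assignment


# Stand-in labels for 8000 samples and 11 classes, partitioned across 4 ranks.
labels = np.random.default_rng(0).integers(0, 11, size=8000)
weights = rank_class_weights(num_ranks=4, num_classes=11, mu=5)
assignment = assign_samples_to_ranks(labels, weights, seed=0)
local_labels_rank_0 = labels[assignment == 0]
```

Because every rank uses the same seed, each rank can compute the full assignment on its own and simply keep the samples assigned to it, without any communication.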
Combining the dataset generation with the partitioning results in four different combinations of global and local (im)balance.
The following image illustrates the class distribution of the training set for these four combinations for 11 classes, 8k samples, and four ranks.
The peak of the global distribution is at the center class (id five) and μ is set to five for both dataset generation and partition.
The test dataset can be either private or shared between all ranks.
This also affects the class distributions and size of the local test set.
With a shared test set, the train-test split is performed before partitioning the data among the ranks.
With a private test set, each rank performs its own train-test split on the local data.
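The difference between the two modes comes down to the order of splitting and partitioning. The sketch below illustrates this with sklearn's train_test_split; the function names, the random stand-in partition, and the use of a stratified split are assumptions made for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
num_samples, num_ranks = 1000, 4
y = rng.integers(0, 5, size=num_samples)  # stand-in class labels
rank_of_sample = rng.integers(0, num_ranks, size=num_samples)  # stand-in partition


def shared_split(y, rank_of_sample, rank, test_size, seed):
    """Split globally first, then keep only the local part of the training set."""
    sample_ids = np.arange(len(y))
    train_ids, test_ids = train_test_split(
        sample_ids, test_size=test_size, random_state=seed, stratify=y
    )
    local_train_ids = train_ids[rank_of_sample[train_ids] == rank]
    return local_train_ids, test_ids  # every rank gets the same test set


def private_split(y, rank_of_sample, rank, test_size, seed):
    """Partition first, then split only the samples of this rank."""
    local_ids = np.flatnonzero(rank_of_sample == rank)
    return train_test_split(
        local_ids, test_size=test_size, random_state=seed, stratify=y[local_ids]
    )


shared_train_0, shared_test_0 = shared_split(y, rank_of_sample, 0, 0.2, seed=0)
shared_train_1, shared_test_1 = shared_split(y, rank_of_sample, 1, 0.2, seed=0)
private_train_0, private_test_0 = private_split(y, rank_of_sample, 0, 0.2, seed=0)
private_train_1, private_test_1 = private_split(y, rank_of_sample, 1, 0.2, seed=0)
```

With the shared mode, all ranks evaluate on identical test samples that follow the global class distribution. With the private mode, the test sets are disjoint and inherit the local class distribution of each rank.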
To generate a dataset in Python, synthetic_classification_data.generate_and_distribute_synthetic_dataset combines both data generation and partition into a single function.
Alternatively, rf_training_breaking_iid can be used directly, which combines the above function with random forest training and evaluation.
Minimum working example:
from mpi4py import MPI
from specialcouscous.synthetic_classification_data import generate_and_distribute_synthetic_dataset
from specialcouscous.rf_parallel import DistributedRandomForest

# Communicator spanning all MPI processes
comm = MPI.COMM_WORLD

# General settings
num_samples = 10000
num_classes = 10
seed = 0
test_size = 0.2
n_trees = 100
shared_test_set = True
global_model = True

# Global (im)balance of the generated dataset
globally_balanced = False
mu_data = 5
peak = num_classes // 2

# Local (im)balance of the partition across ranks
locally_balanced = False
mu_partition = 5

# Generate the dataset and select the subset assigned to this rank
global_dataset, local_train, local_test = generate_and_distribute_synthetic_dataset(
    globally_balanced, locally_balanced,
    num_samples, num_classes, comm.rank,
    comm.size, seed, test_size,
    mu_partition, mu_data, peak,
    shared_test_set=shared_test_set,  # assumed keyword name, check the function signature
)

# Train and evaluate the distributed random forest
distributed_random_forest = DistributedRandomForest(
    n_trees_global=n_trees, comm=comm, random_state=seed
)
distributed_random_forest.train(local_train.x, local_train.y, global_model)
distributed_random_forest.evaluate(local_test.x, local_test.y, num_classes, global_model)
To ease the evaluation of large-scale datasets, we implement multi-class evaluation metrics operating directly on the confusion matrix (instead of the true vs predicted values for all samples).
We support the following metrics, with the interfaces based on the corresponding sklearn.metrics functions:
- Accuracy: The global accuracy
- Balanced Accuracy: The accuracy as average over class-wise recalls
- Precision, Recall, and Fβ-Score: With the following averaging options (average parameter):
  - None: No averaging, return class-wise
  - "micro": Compute metrics globally → equal importance on each sample
  - "macro": Compute metrics class-wise, then average over classes → equal importance on each class, minority classes can outweigh majority classes
  - "weighted": Compute metrics class-wise, then average over classes weighted by their support (#true samples)
  - As we focus on multi-class classification, sklearn's remaining average options "binary" and "samples" are not included.
- Cohen's Kappa: Compare classification to random guessing, values from -1 to +1, the higher the better, robust to class imbalance but not originally intended for classification problems
- Matthews Correlation Coefficient (MCC): See https://en.wikipedia.org/wiki/Phi_coefficient, values from -1 to +1, the higher the better, robust to class imbalance
import numpy as np
from specialcouscous import evaluation_metrics
path_to_confusion_matrix_csv = "example.csv"
confusion_matrix = np.loadtxt(path_to_confusion_matrix_csv)
accuracy = evaluation_metrics.accuracy_score(confusion_matrix)
balanced_accuracy = evaluation_metrics.balanced_accuracy_score(confusion_matrix)
# `precision_recall_fscore` computes all three metrics at once.
for average in [None, "micro", "macro", "weighted"]:
    precision, recall, f_score = evaluation_metrics.precision_recall_fscore(confusion_matrix, beta=1.0, average=average)
# Additionally, you can also call each specific metric on their own, but underneath, they just call `precision_recall_fscore`.
precision = evaluation_metrics.precision_score(confusion_matrix, average=None)
recall = evaluation_metrics.recall_score(confusion_matrix, average=None)
fbeta = evaluation_metrics.fbeta_score(confusion_matrix, beta=2.0, average=None)
f1 = evaluation_metrics.f1_score(confusion_matrix, average=None)
cohen_kappa = evaluation_metrics.cohen_kappa_score(confusion_matrix)
matthews_corrcoef = evaluation_metrics.matthews_corrcoef(confusion_matrix)
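To illustrate what operating directly on the confusion matrix means, here is a small NumPy sketch of a few of the metrics and averaging options listed above. It is independent of the package's own implementation; the function names are hypothetical, and it assumes the sklearn convention that rows are true classes and columns are predicted classes.

```python
import numpy as np


def accuracy(cm):
    # Correct predictions are on the diagonal.
    return np.trace(cm) / cm.sum()


def precision_recall(cm, average=None):
    cm = np.asarray(cm, dtype=float)
    true_positives = np.diag(cm)
    predicted = cm.sum(axis=0)  # column sums: samples predicted as each class
    support = cm.sum(axis=1)  # row sums: true samples of each class
    if average == "micro":
        # Pool all samples before computing the ratios.
        return true_positives.sum() / predicted.sum(), true_positives.sum() / support.sum()
    # Class-wise values; classes without predictions or samples are set to 0.
    precision = np.divide(true_positives, predicted, out=np.zeros_like(true_positives), where=predicted > 0)
    recall = np.divide(true_positives, support, out=np.zeros_like(true_positives), where=support > 0)
    if average is None:
        return precision, recall
    weights = support if average == "weighted" else None  # None gives the plain "macro" mean
    return np.average(precision, weights=weights), np.average(recall, weights=weights)


def balanced_accuracy(cm):
    # Average over the class-wise recalls.
    return precision_recall(cm, average="macro")[1]


def cohen_kappa(cm):
    cm = np.asarray(cm, dtype=float)
    total = cm.sum()
    observed = np.trace(cm) / total
    # Agreement expected by chance from the row and column marginals.
    expected = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / total**2
    return (observed - expected) / (1 - expected)


# Small three-class example.
example_cm = np.array([[4, 1, 0], [1, 3, 1], [0, 0, 5]])
example_accuracy = accuracy(example_cm)
example_balanced_accuracy = balanced_accuracy(example_cm)
example_kappa = cohen_kappa(example_cm)
```

Since only the num_classes × num_classes matrix is needed, the per-sample predictions never have to be gathered on a single rank.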