
feat: add library benchmarks #256

Merged
merged 18 commits on Dec 24, 2024
1 change: 1 addition & 0 deletions .gitattributes
@@ -0,0 +1 @@
*.ipynb linguist-documentation
3 changes: 3 additions & 0 deletions .gitignore
@@ -94,3 +94,6 @@ venv.bak/

# Changelog entry
ENTRY.md

# Jupyter Notebook checkpoints
*.ipynb_checkpoints/
66 changes: 65 additions & 1 deletion README.md
@@ -34,6 +34,7 @@
<a href="#about">About</a> ·
<a href="#build-status">Build Status</a> ·
<a href="#features">Features</a> ·
<a href="#installation">Installation</a> ·
<a href="#documentation">Documentation</a> ·
<a href="#examples">Examples</a> ·
<a href="#acknowledgments">Acknowledgments</a> ·
@@ -96,6 +97,63 @@ Parameter estimation with the Baum-Welch algorithm and prediction with the forward algorithm

In most cases, the only necessary change is to add a `lengths` keyword argument to provide sequence length information, e.g. `fit(X, y, lengths=lengths)` instead of `fit(X, y)`.
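
For example, a minimal sketch of this pattern (the import path follows the snippet further down this README, and the array shapes here are purely illustrative):

```python
import numpy as np
from sequentia.models import KNNClassifier

# Three sequences (lengths 3, 5 and 2) concatenated row-wise into a single array
X = np.random.randn(10, 2)     # 10 total frames, 2 features per frame
lengths = np.array([3, 5, 2])  # length of each sequence within X
y = np.array([0, 1, 1])        # one label per sequence

clf = KNNClassifier(k=1)
clf.fit(X, y, lengths=lengths)            # instead of fit(X, y)
y_pred = clf.predict(X, lengths=lengths)  # instead of predict(X)
```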

### Similar libraries

As dynamic time warping (DTW) k-nearest neighbors is the core algorithm offered by Sequentia, the table below compares the DTW k-nearest neighbors features supported by Sequentia and similar libraries.

||**`sequentia`**|[`aeon`](https://github.com/aeon-toolkit/aeon)|[`tslearn`](https://github.com/tslearn-team/tslearn)|[`sktime`](https://github.com/sktime/sktime)|[`pyts`](https://github.com/johannfaouzi/pyts)|
|-|:-:|:-:|:-:|:-:|:-:|
|Scikit-Learn compatible|✅|✅|✅|✅|✅|
|Multivariate sequences|✅|✅|✅|✅|❌|
|Variable length sequences|✅|✅|➖<sup>1</sup>|❌<sup>2</sup>|❌<sup>3</sup>|
|No padding required|✅|❌|➖<sup>1</sup>|❌<sup>2</sup>|❌<sup>3</sup>|
|Classification|✅|✅|✅|✅|✅|
|Regression|✅|✅|✅|✅|❌|
|Preprocessing|✅|✅|✅|✅|✅|
|Multiprocessing|✅|✅|✅|✅|✅|
|Custom weighting|✅|✅|✅|✅|✅|
|Sakoe-Chiba band constraint|✅|✅|✅|✅|✅|
|Itakura parallelogram constraint|❌|✅|✅|✅|✅|
|Dependent DTW (DTWD)|✅|✅|✅|✅|❌|
|Independent DTW (DTWI)|✅|❌|❌|❌|✅|
|Custom DTW measures|❌<sup>4</sup>|✅|❌|✅|✅|

- <sup>1</sup>`tslearn` supports variable length sequences with padding, but doesn't seem to mask the padding.
- <sup>2</sup>`sktime` does not support variable length sequences, so they are padded (and padding is not masked).
- <sup>3</sup>`pyts` does not support variable length sequences, so they are padded (and padding is not masked).
- <sup>4</sup>`sequentia` only supports [`dtaidistance`](https://github.com/wannesm/dtaidistance), which is one of the fastest DTW libraries as it is written in C.
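
As a rough illustration of how several of the rows above surface in Sequentia's API, here is a hedged sketch; the parameter names (`weighting`, `window`, `independent`, `n_jobs`) are assumptions based on the `KNNClassifier` interface and may differ between versions, so check the API reference:

```python
import numpy as np
from sequentia.models import KNNClassifier

# Hedged sketch: parameter names are assumptions; see the KNNClassifier API reference.
clf = KNNClassifier(
    k=5,
    weighting=lambda d: np.exp(-d),  # custom weighting of neighbor votes by DTW distance
    window=0.1,                      # Sakoe-Chiba band width (fraction of sequence length)
    independent=True,                # independent DTW (DTWI) rather than dependent (DTWD)
    n_jobs=-1,                       # parallelize DTW computations across CPU cores
)
```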

### Benchmarks

To compare the runtime performance of the above libraries on dynamic time warping k-nearest neighbors classification tasks, a simple benchmark was performed on a univariate sequence dataset.

The [Free Spoken Digit Dataset](https://sequentia.readthedocs.io/en/latest/sections/datasets/digits.html) was used for benchmarking and consists of:

- 3000 recordings of 10 spoken digits (0-9)
- 50 recordings of each digit for each of 6 speakers
- 1500 used for training, 1500 used for testing (split via label stratification)
- 13 features ([MFCCs](https://en.wikipedia.org/wiki/Mel-frequency_cepstrum))
- Only the first feature was used, as not all of the above libraries support multivariate sequences
- Sequence length statistics:
- Minimum: 6
- Median: 17
- Maximum: 92

Each result measures the total time taken to complete training and prediction, repeated 10 times.

All of the above libraries support multiprocessing, and prediction was performed using 16 workers.
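
The benchmark scripts themselves live under `benchmarks/`; as a rough sketch of what each measurement involves (the estimator settings and variable names here are illustrative, not the actual benchmark code):

```python
import time

from sequentia.models import KNNClassifier

def time_fit_predict(X_train, y_train, lengths_train, X_test, lengths_test, repeats=10):
    """Return the total time (seconds) to train and predict, repeated `repeats` times."""
    total = 0.0
    for _ in range(repeats):
        clf = KNNClassifier(k=1, n_jobs=16)  # 16 workers for prediction
        start = time.perf_counter()
        clf.fit(X_train, y_train, lengths=lengths_train)
        clf.predict(X_test, lengths=lengths_test)
        total += time.perf_counter() - start
    return total
```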

<sup>*</sup>: `sktime`, `tslearn` and `pyts` do not appear to mask padding, which may result in incorrect predictions.

<img src="benchmarks/benchmark.svg" width="100%"/>

> **Device information**:
> - Product: ThinkPad T14s (Gen 6)
> - Processor: AMD Ryzen™ AI 7 PRO 360 (8 cores, 16 threads, 2-5GHz)
> - Memory: 64 GB LPDDR5X-7500MHz
> - Solid State Drive: 1 TB SSD M.2 2280 PCIe Gen4 Performance TLC Opal
> - Operating system: Fedora Linux 41 (Workstation Edition)

## Installation

The latest stable version of Sequentia can be installed with the following command:
@@ -169,7 +227,13 @@ lengths = np.array([3, 5, 2])
# Sequence classes
y = np.array([0, 1, 1])

# Create a transformation pipeline that feeds into a KNNClassifier
# Train and predict (without preprocessing)
clf = KNNClassifier(k=1)
clf.fit(X, y, lengths=lengths)
y_pred = clf.predict(X, lengths=lengths)
acc = clf.score(X, y, lengths=lengths)

# Create a preprocessing pipeline that feeds into a KNNClassifier
# 1. Individually denoise each sequence by applying a median filter for each feature
# 2. Individually standardize each sequence by subtracting the mean and dividing by the standard deviation for each feature
# 3. Reduce the dimensionality of the data to a single feature by using PCA
8 changes: 8 additions & 0 deletions benchmarks/__init__.py
@@ -0,0 +1,8 @@
# Copyright (c) 2019 Sequentia Developers.
# Distributed under the terms of the MIT License (see the LICENSE file).
# SPDX-License-Identifier: MIT
# This source code is part of the Sequentia project (https://github.com/eonu/sequentia).

"""Collection of runtime benchmarks for Python packages
providing dynamic time warping k-nearest neighbors algorithms.
"""