Skip to content

Commit

Permalink
Add README
Browse files Browse the repository at this point in the history
  • Loading branch information
fcharras committed Sep 1, 2023
1 parent 869c486 commit 6405e2e
Show file tree
Hide file tree
Showing 3 changed files with 167 additions and 5 deletions.
6 changes: 3 additions & 3 deletions .github/workflows/run_tests.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -31,10 +31,10 @@ jobs:
pip install cython numpy scipy joblib threadpoolctl &&
# Build and install
pip install torch --index-url https://download.pytorch.org/whl/cpu &&
pip install pytest git+https://github.com/fcharras/scikit-learn.git@80f58bf10d2f8b8cb43f6253bbe13413985a1413#egg=scikit-learn -e .
pip install pytest git+https://github.com/fcharras/scikit-learn.git@2ccfc8c4bdf66db005d7681757b4145842944fb9#egg=scikit-learn -e .
- name: Run sklearn_numba_dpex tests
- name: Run sklearn_pytorch_engine tests
run: pytest -v sklearn_pytorch_engine/

# TODO: run those tests in a separate pipeline
Expand All @@ -47,5 +47,5 @@ jobs:
# that fitted attributes and outputs are numpy arrays. Currently this behavior is
# activated when the environment variable SKLEARN_PYTORCH_ENGINE_TESTING_MODE is set
# to 1.
- name: Run sklearn test suites with sklearn_numba_dpex engines
- name: Run sklearn test suites with sklearn_pytorch_engine
run: SKLEARN_RUN_FLOAT32_TESTS=1 SKLEARN_PYTORCH_ENGINE_TESTING_MODE=1 pytest -v --sklearn-engine-provider sklearn_pytorch_engine --pyargs sklearn.cluster.tests.test_k_means
162 changes: 162 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,162 @@
# sklearn-pytorch0engine

Experimental plugin for scikit-learn that implements a backend for (some) scikit-learn
estimators, written in `pytorch`, so that it benefits from `pytorch` ability to
dispatch data and compute to many devices, providing the appropriate pytorch extensions
are installed.

This package requires working with the following experimental branch of scikit-learn:

- `feature/engine-api` branch on https://github.com/scikit-learn/scikit-learn

## List of Included Engines

- `sklearn.cluster.KMeans` for the standard LLoyd's algorithm on dense data arrays,
including `kmeans++` support.

## Getting started:

### Pre-requisites

#### Step 1: Install PyTorch

Getting started requires a working python environment for using `pytorch`. Depending on
the device you target, install PyTorch extensions accordingly, including (but not
limited to):

- using one of the [native distributions](https://pytorch.org/get-started/locally/) for
cuda (for nvidia gpus), rocm (amd gpus) or mps (apple gpus) support

- using [Intel distributions](https://intel.github.io/intel-extension-for-pytorch/xpu/2.0.110+xpu/tutorials/installations/linux.html#install-via-prebuilt-wheel-files)
for xpu (for intel gpus), has experimental (unofficial) support for igpus if compiling
from source
[with appropriate flags](https://intel.github.io/intel-extension-for-pytorch/xpu/2.0.110+xpu/tutorials/installations/linux.html#configure-the-aot-optional)

#### Step 2: Install scikit-learn from source

Using the plugin requires the experimental development branch `feature/engine-api` of
scikit-learn that implements the compatible plugin system. The `sklearn_pytorch_engine`
plugin is compatible with the commit 2ccfc8c4bdf66db005d7681757b4145842944fb9 available
in the fork [fcharras/scikit-learn](https://github.com/fcharras/scikit-learn/) .

Please refer to the relevant [scikit-learn documentation page](https://scikit-learn.org/stable/developers/advanced_installation.html#install-bleeding-edge)
for a comprehensive guide regarding installing from source. For instance, using `pip`
and `apt` (assuming `apt`-based environment):

```
apt-get update --quiet
# Install prerequisites
apt-get install -y build-essential python3-dev git
pip install cython numpy scipy joblib threadpoolctl
# Build and install
pip install git+https://github.com/fcharras/scikit-learn.git@2ccfc8c4bdf66db005d7681757b4145842944fb9#egg=scikit-learn
```

#### Step 3: Install this plugin

When loaded into your PyTorch + scikit-learn environment, run:

```
git clone https://github.com/soda-inria/sklearn-pytorch-engine
cd sklearn-pytorch-engine
pip install -e .
```

## Using the plugin

See the `sklearn_pytorch_engine/kmeans/tests` folder for example usage.

🚧 TODO: write some examples here instead.

### Running the tests

To run the tests run the following from the root of the `sklearn_pytorch_engine`
repository:

```bash
pytest sklearn_pytorch_engine
```

To run the `scikit-learn` tests with the `sklearn_pytorch_engine` engine you can run the
following:

```bash
SKLEARN_PYTORCH_ENGINE_TESTING_MODE=1 pytest --sklearn-engine-provider sklearn_pytorch_engine --pyargs sklearn.cluster.tests.test_k_means
```

(change the `--pyargs` option accordingly to select other test suites).

The `--sklearn-engine-provider sklearn_pytorch_engine` option offered by the sklearn
pytest plugin will automatically activate the `sklearn_pytorch_engine` engine for all
tests.

Tests covering unsupported features (that trigger
`sklearn.exceptions.FeatureNotCoveredByPluginError`) will be automatically marked as
_xfailed_.

### Additional environment variables for device selection behavior

By default, the engine will use the _compute follow data_ principle, meaning that it
will run the compute on the device that manages the data. For instance `kmeans.fit(X)`
will run compute on corresponding xpu device if `X` is a `torch.tensor` array such that
`X.device.type` is `"xpu"`, and will run on cpu if `X.device.type` is `"cpu"`, etc.

It's possible to alter this behavior and have the engine force offload the compute to
a specific device, using the environment variable
`SKLEARN_PYTORCH_ENGINE_DEFAULT_DEVICE`. For instance, on a compatible computer,
`SKLEARN_PYTORCH_ENGINE_DEFAULT_DEVICE=mps` will force the compute to the
`mps`-compatible device, even if it requires copying the input data under the hood to
do so.

Both internal and scikit-learn test suites can run with any value of
`SKLEARN_PYTORCH_ENGINE_DEFAULT_DEVICE` as long as the compatible pytorch extension
is available and that the host hardware is compatible, for instance:

```bash
export SKLEARN_PYTORCH_ENGINE_DEFAULT_DEVICE=xpu
pytest sklearn_pytorch_engine
SKLEARN_PYTORCH_ENGINE_TESTING_MODE=1 pytest --sklearn-engine-provider sklearn_pytorch_engine --pyargs sklearn.cluster.tests.test_k_means
```

will run all compute on the relevant `xpu` device.

At the moment, both tests suite will create test data that is hosted on the CPU by
default. For internal tests, this behavior can be changed with the environment variable
`SKLEARN_PYTORCH_ENGINE_TEST_INPUTS_DEVICE`, for instance the command

```bash
SKLEARN_PYTORCH_ENGINE_TEST_INPUTS_DEVICE=cuda SKLEARN_PYTORCH_ENGINE_DEFAULT_DEVICE=cpu pytest sklearn_pytorch_engine
```

will run the tests while enforcing that the test data is generated on the cuda device
but the compute is done on cpu (since `SKLEARN_PYTORCH_ENGINE_DEFAULT_DEVICE` is set
to `cpu`).

All combinations of those two environment variables makes for a reasonably exhaustive
test matrix regarding internal data conversions.

### Notes about the preferred floating point precision (float32)

In many machine learning applications, operations using single-precision (float32)
floating point data require twice as less memory that double-precision (float64), are
regarded as faster, accurate enough and more suitable for GPU compute. Besides, most
GPUs used in machine learning projects are significantly faster with float32 than with
double-precision (float64) floating point data.

To leverage the full potential of GPU execution, it's strongly advised to use a float32
data type.

By default, unless specified otherwise numpy array are created with type float64, so be
especially careful to the type whenever the loader does not explicitly document the
type nor expose a type option.

Transforming NumPy arrays from float64 to float32 is also possible using
[`numpy.ndarray.astype`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.astype.html),
although it is less recommended to prevent avoidable data copies. `numpy.ndarray.astype`
can be used as follows:

```python
X = my_data_loader()
X_float32 = X.astype(float32)
my_gpu_compute(X_float32)
```
4 changes: 2 additions & 2 deletions sklearn_pytorch_engine/kmeans/engine.py
Original file line number Diff line number Diff line change
Expand Up @@ -62,8 +62,8 @@ class KMeansEngine(KMeansCythonEngine):

# This class attribute can alter globally the attributes `device` and `order` of
# future instances. It is only used for testing purposes, using
# `sklearn_numba_dpex.testing.config.override_attr_context` context, for instance
# in the benchmark script.
# `sklearn_pytorch_engine.testing.config.override_attr_context` context, for
# instance in the benchmark script.
# For normal usage, the compute will follow the *compute follows data* principle.
_CONFIG: Dict[str, Any] = dict()

Expand Down

0 comments on commit 6405e2e

Please sign in to comment.