Add README

soda-inria · Sep 1, 2023 · 6405e2e
1 parent 869c486
commit 6405e2e

Showing 3 changed files with 167 additions and 5 deletions.
diff --git a/.github/workflows/run_tests.yaml b/.github/workflows/run_tests.yaml
@@ -31,10 +31,10 @@ jobs:
         pip install cython numpy scipy joblib threadpoolctl &&
         # Build and install
         pip install torch --index-url https://download.pytorch.org/whl/cpu &&
-        pip install pytest git+https://github.com/fcharras/scikit-learn.git@80f58bf10d2f8b8cb43f6253bbe13413985a1413#egg=scikit-learn -e .
+        pip install pytest git+https://github.com/fcharras/scikit-learn.git@2ccfc8c4bdf66db005d7681757b4145842944fb9#egg=scikit-learn -e .
 
 
-    - name: Run sklearn_numba_dpex tests
+    - name: Run sklearn_pytorch_engine tests
       run: pytest -v sklearn_pytorch_engine/
 
       # TODO: run those tests in a separate pipeline
@@ -47,5 +47,5 @@ jobs:
       # that fitted attributes and outputs are numpy arrays. Currently this behavior is
       # activated when the environment variable SKLEARN_PYTORCH_ENGINE_TESTING_MODE is set
       # to 1.
-    - name: Run sklearn test suites with sklearn_numba_dpex engines
+    - name: Run sklearn test suites with sklearn_pytorch_engine
       run: SKLEARN_RUN_FLOAT32_TESTS=1 SKLEARN_PYTORCH_ENGINE_TESTING_MODE=1 pytest -v --sklearn-engine-provider sklearn_pytorch_engine --pyargs sklearn.cluster.tests.test_k_means
diff --git a/README.md b/README.md
@@ -0,0 +1,162 @@
+# sklearn-pytorch-engine
+
+Experimental plugin for scikit-learn that implements a backend for (some) scikit-learn
+estimators, written in PyTorch, so that it benefits from PyTorch's ability to
+dispatch data and compute to many devices, provided the appropriate PyTorch extensions
+are installed.
+
+This package requires working with the following experimental branch of scikit-learn:
+
+- `feature/engine-api` branch on https://github.com/scikit-learn/scikit-learn
+
+## List of Included Engines
+
+- `sklearn.cluster.KMeans` for the standard Lloyd's algorithm on dense data arrays,
+  including `kmeans++` support.
+
+## Getting started
+
+### Pre-requisites
+
+#### Step 1: Install PyTorch
+
+Getting started requires a working Python environment with PyTorch. Depending on
+the device you target, install PyTorch extensions accordingly, including (but not
+limited to):
+
+- using one of the [native distributions](https://pytorch.org/get-started/locally/) for
+  CUDA (NVIDIA GPUs), ROCm (AMD GPUs) or MPS (Apple GPUs) support
+
+- using [Intel distributions](https://intel.github.io/intel-extension-for-pytorch/xpu/2.0.110+xpu/tutorials/installations/linux.html#install-via-prebuilt-wheel-files)
+  for XPU (Intel GPUs); integrated GPUs have experimental (unofficial) support when
+  compiling from source
+  [with appropriate flags](https://intel.github.io/intel-extension-for-pytorch/xpu/2.0.110+xpu/tutorials/installations/linux.html#configure-the-aot-optional)
+
+#### Step 2: Install scikit-learn from source
+
+Using the plugin requires the experimental development branch `feature/engine-api` of
+scikit-learn, which implements the compatible plugin system. The `sklearn_pytorch_engine`
+plugin is compatible with commit 2ccfc8c4bdf66db005d7681757b4145842944fb9, available
+in the fork [fcharras/scikit-learn](https://github.com/fcharras/scikit-learn/).
+
+Please refer to the relevant [scikit-learn documentation page](https://scikit-learn.org/stable/developers/advanced_installation.html#install-bleeding-edge)
+for a comprehensive guide to installing from source. For instance, using `pip`
+and `apt` (assuming an `apt`-based environment):
+
+```bash
+apt-get update --quiet
+# Install prerequisites
+apt-get install -y build-essential python3-dev git
+pip install cython numpy scipy joblib threadpoolctl
+# Build and install
+pip install git+https://github.com/fcharras/scikit-learn.git@2ccfc8c4bdf66db005d7681757b4145842944fb9#egg=scikit-learn
+```
+
+#### Step 3: Install this plugin
+
+Once your PyTorch + scikit-learn environment is set up and activated, run:
+
+```bash
+git clone https://github.com/soda-inria/sklearn-pytorch-engine
+cd sklearn-pytorch-engine
+pip install -e .
+```
+
+## Using the plugin
+
+See the `sklearn_pytorch_engine/kmeans/tests` folder for example usage.
+
+🚧 TODO: write some examples here instead.
+
+### Running the tests
+
+To run the tests, run the following from the root of the `sklearn-pytorch-engine`
+repository:
+
+```bash
+pytest sklearn_pytorch_engine
+```
+
+To run the `scikit-learn` tests with the `sklearn_pytorch_engine` engine you can run the
+following:
+
+```bash
+SKLEARN_PYTORCH_ENGINE_TESTING_MODE=1 pytest --sklearn-engine-provider sklearn_pytorch_engine --pyargs sklearn.cluster.tests.test_k_means
+```
+
+(change the `--pyargs` option accordingly to select other test suites).
+
+The `--sklearn-engine-provider sklearn_pytorch_engine` option offered by the sklearn
+pytest plugin will automatically activate the `sklearn_pytorch_engine` engine for all
+tests.
+
+Tests covering unsupported features (that trigger
+`sklearn.exceptions.FeatureNotCoveredByPluginError`) will be automatically marked as
+_xfailed_.
+
+### Additional environment variables for device selection behavior
+
+By default, the engine follows the _compute follows data_ principle, meaning that it
+runs the compute on the device that holds the data. For instance, `kmeans.fit(X)`
+runs on the corresponding xpu device if `X` is a `torch.Tensor` such that
+`X.device.type` is `"xpu"`, and runs on cpu if `X.device.type` is `"cpu"`, etc.
+
+It's possible to alter this behavior and force the engine to offload the compute to
+a specific device, using the environment variable
+`SKLEARN_PYTORCH_ENGINE_DEFAULT_DEVICE`. For instance, on a compatible computer,
+`SKLEARN_PYTORCH_ENGINE_DEFAULT_DEVICE=mps` will force the compute to the
+`mps`-compatible device, even if it requires copying the input data under the hood to
+do so.
+
+Both internal and scikit-learn test suites can run with any value of
+`SKLEARN_PYTORCH_ENGINE_DEFAULT_DEVICE`, as long as the matching PyTorch extension
+is available and the host hardware is compatible, for instance:
+
+```bash
+export SKLEARN_PYTORCH_ENGINE_DEFAULT_DEVICE=xpu
+pytest sklearn_pytorch_engine
+SKLEARN_PYTORCH_ENGINE_TESTING_MODE=1 pytest --sklearn-engine-provider sklearn_pytorch_engine --pyargs sklearn.cluster.tests.test_k_means
+```
+
+will run all compute on the relevant `xpu` device.
+
+At the moment, both test suites create test data that is hosted on the CPU by
+default. For internal tests, this behavior can be changed with the environment variable
+`SKLEARN_PYTORCH_ENGINE_TEST_INPUTS_DEVICE`, for instance the command
+
+```bash
+SKLEARN_PYTORCH_ENGINE_TEST_INPUTS_DEVICE=cuda SKLEARN_PYTORCH_ENGINE_DEFAULT_DEVICE=cpu pytest sklearn_pytorch_engine
+```
+
+will run the tests while enforcing that the test data is generated on the cuda device
+but the compute is done on cpu (since `SKLEARN_PYTORCH_ENGINE_DEFAULT_DEVICE` is set
+to `cpu`).
+
+All combinations of those two environment variables make for a reasonably exhaustive
+test matrix regarding internal data conversions.
+
+### Notes about the preferred floating point precision (float32)
+
+In many machine learning applications, operations using single-precision (float32)
+floating point data require half as much memory as double-precision (float64), and are
+regarded as faster, accurate enough and more suitable for GPU compute. Besides, most
+GPUs used in machine learning projects are significantly faster with float32 than with
+double-precision (float64) floating point data.
+
+To leverage the full potential of GPU execution, it's strongly advised to use a float32
+data type.
+
+By default, unless specified otherwise, NumPy arrays are created with type float64, so be
+especially careful about the type whenever the loader does not explicitly document the
+type or expose a type option.
+
+Transforming NumPy arrays from float64 to float32 is also possible using
+[`numpy.ndarray.astype`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.astype.html),
+although loading the data as float32 directly is preferable, since the conversion
+makes an otherwise avoidable data copy. `numpy.ndarray.astype` can be used as follows:
+
+```python
+X = my_data_loader()
+X_float32 = X.astype("float32")  # dtype given as a string, so no extra import is needed
+my_gpu_compute(X_float32)
+```
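
Illustration (not part of the committed README): the device-selection rule described in the README's environment-variable section can be sketched in plain Python. This is only a sketch under stated assumptions. The helper name `resolve_compute_device` and its `environ` parameter are hypothetical and are not the plugin's API; only the variable name `SKLEARN_PYTORCH_ENGINE_DEFAULT_DEVICE` comes from the README.

```python
import os


def resolve_compute_device(data_device, environ=None):
    """Return the device type the compute would run on.

    Hypothetical helper: it mirrors the rule stated in the README,
    not the plugin's actual implementation.
    """
    environ = os.environ if environ is None else environ
    forced = environ.get("SKLEARN_PYTORCH_ENGINE_DEFAULT_DEVICE")
    # An explicit override wins, even if the data must be copied to that device.
    if forced:
        return forced
    # Otherwise compute follows data: run where the input already lives.
    return data_device


# No override: the compute stays on the device holding the data.
print(resolve_compute_device("xpu", environ={}))
# Override set: the compute is forced to cpu although the data is on cuda.
print(
    resolve_compute_device(
        "cuda", environ={"SKLEARN_PYTORCH_ENGINE_DEFAULT_DEVICE": "cpu"}
    )
)
```

Passing `environ` explicitly keeps the sketch independent of the caller's real environment, which is convenient when enumerating the test matrix the README mentions.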
diff --git a/sklearn_pytorch_engine/kmeans/engine.py b/sklearn_pytorch_engine/kmeans/engine.py
@@ -62,8 +62,8 @@ class KMeansEngine(KMeansCythonEngine):
 
     # This class attribute can alter globally the attributes `device` and `order` of
     # future instances. It is only used for testing purposes, using
-    # `sklearn_numba_dpex.testing.config.override_attr_context` context, for instance
-    # in the benchmark script.
+    # `sklearn_pytorch_engine.testing.config.override_attr_context` context, for
+    # instance in the benchmark script.
     # For normal usage, the compute will follow the *compute follows data* principle.
     _CONFIG: Dict[str, Any] = dict()