GPU implementation of hamming distance (#541)

* take static methods out of tcrdist
* made _tcrdist_mat a normal class method
* parent method NumbaDistanceCalculator extracted
* numba version of hamming distance implemented
* hamming numba tests passed and reference test added
* hamming numba distance calculator implemented and tested
* n_jobs parameter handling done in NumbaDistanceCalculator superclass
* documentation adapted
* removed unnecessary import
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* hamming distance with numba parallelization implemented
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* imports fixed
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* implemented parallelization with n_jobs and n_blocks for hamming and tcrdist distance metrics
* performance optimization for hamming and tcrdist
* more documentation added
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* documentation adapted
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* documentation adapted
* signature of _calc_dist_mat_block changed
* the alphabet for the hamming distance is now the unique characters occurring in all sequences
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* normalized hamming distance added
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* renaming test
* histogram creation for hamming distance added
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* refactored
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* hamming histogram adjustments
* reference test cases added for normalized hamming and hamming histogram
* Update src/scirpy/ir_dist/metrics.py Co-authored-by: Gregor Sturm <[email protected]>
* test cases for normalized hamming and hamming histogram adapted
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* docstring for normalized hamming distance and tcrdist distance added
* adapted default parameters and tests for n_jobs and n_blocks
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* test_sequence_dist_all_metrics adaptions
* n_jobs default value set to -1
* docstring of ir_dist for n_jobs adapted
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* docstring change to test cicd pipeline
* docstring for n_jobs of _ir_dist changed
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* docstring for n_jobs of _ir_dist changed
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* moved histogram creation to parent class of hamming distance calculator
* histogram computation adaptions
* test case test_tcrdist_histogram_not_implemented added
* documentation for histogram adapted
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* reformatted doc string
* handling of symmetric matrices with respect to histogram variable changed
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* retrieval of usable cpus for numba adapted
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* more documentation for histogram and (hamming) normalize added
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* added GPUHammingDistanceCalculator
* added test case for gpu hamming distance
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* documentation for GPUHammingDistanceCalculator adapted
* adapted documentation of _tcrdist_mat
* cuda numba experiments
* cupy experiments
* cupy experiments
* scaled cupy to 1 million cells
* sorted sequences by length
* textures used for seqs_mat1 and seqs_mat2
* texture with up to 100k cells
* sorted seqs with multiple blocks
* scaled textures to 1 million cells
* use char for sequences
* shared memory used
* experiments, run 1 million cells with global memory
* run 1 million cells with only global memory
* refactoring and time measurements
* optimized seqs2mat
* increased result matrix stacking speed
* changed data dtype to int8
* scaled to 1 million cells
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* sort indices of result csr matrix
* refactoring
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* remove test from ci
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* move cupy import to func
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* add GPU test
* rename
* Rename .cirun.yaml to .cirun.yml
* print library versions for debugging
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* data types updated
* refactored
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* Update meta.yaml
* Update meta.yaml
* Update meta.yaml
* removed time measurement prints and added progress bar
* removed cuda synchronize statements used for performance testing
* added parameters gpu_n_blocks and gpu_block_width
* added parameter documentation
* cleaned up hamming kernel
* catch sequence conversion error by retrying with non ascii implementation
* use parameters gpu_n_blocks and gpu_block_width for testing
* adapted documentation about n_blocks
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* adapted error handling for ascii UnicodeError
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* removed reference to :class:`~scirpy.ir_dist.metrics.GPUHammingDistanceCalculator` in documentation
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* WIP: update docs
* Update changelog
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* added an explanation for choosing parameters to GPU hamming class documentation.
* logging.info used instead of print
* Add tutorial for large datasets
* Bump version
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* Added checks for int32 overflow
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* adapted error text of int32 overflow check.

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Gregor Sturm <[email protected]>
Co-authored-by: Intron7 <[email protected]>
Co-authored-by: Severin Dicks <[email protected]>
scverse · Feb 25, 2025 · 622be74
1 parent 4157ae4
Showing 12 changed files with 568 additions and 9 deletions.
diff --git a/.cirun.yml b/.cirun.yml
@@ -0,0 +1,11 @@
+runners:
+  - name: aws-gpu-runner
+    cloud: aws
+    instance_type: g4dn.xlarge
+    machine_image: ami-067a4ba2816407ee9
+    region: eu-north-1
+    preemptible:
+      - true
+      - false
+    labels:
+      - cirun-aws-gpu
diff --git a/.conda/meta.yaml b/.conda/meta.yaml
@@ -40,7 +40,7 @@ requirements:
     - numba >=0.41.0
     - pooch >=1.7.0
     - joblib >=1.3.1
-    - logomaker
+    - logomaker !=0.8.5
 
 test:
   source_files:
@@ -57,7 +57,7 @@ test:
   imports:
     - scirpy
   commands:
-    - pytest --pyargs scirpy -m "not extra"
+    - pytest --pyargs scirpy -m "not extra and not gpu"
 
 about:
   home: https://scirpy.scverse.org

diff --git a/.github/workflows/test-gpu.yaml b/.github/workflows/test-gpu.yaml
@@ -0,0 +1,65 @@
+name: GPU-CI
+
+on:
+  push:
+    branches: [main]
+  pull_request:
+    types:
+      - labeled
+      - opened
+      - synchronize
+
+# Cancel the job if new commits are pushed
+# https://stackoverflow.com/questions/66335225/how-to-cancel-previous-runs-in-the-pr-when-you-push-new-commitsupdate-the-curre
+concurrency:
+  group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}
+  cancel-in-progress: true
+
+jobs:
+  check:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: flying-sheep/check@v1
+        with:
+          success: ${{ github.event_name == 'push' || contains(github.event.pull_request.labels.*.name, 'run-gpu-ci') }}
+  test:
+    name: GPU Tests
+    needs: check
+    runs-on: "cirun-aws-gpu--${{ github.run_id }}"
+    timeout-minutes: 30
+
+    defaults:
+      run:
+        shell: bash -el {0}
+    steps:
+      - uses: actions/checkout@v4
+        with:
+          fetch-depth: 0
+
+      - name: Nvidia SMI sanity check
+        run: nvidia-smi
+
+      - name: Install Python
+        uses: actions/setup-python@v5
+        with:
+          python-version: "3.11"
+
+      - name: Install UV
+        uses: hynek/setup-cached-uv@v2
+        with:
+          cache-dependency-path: pyproject.toml
+
+      - name: Install scirpy
+        run: uv pip install --system -e ".[dev,test,rpack,dandelion,diversity,parasail,cupy]"
+      - name: Pip list
+        run: pip list
+
+      - name: Run test
+        run: pytest -m gpu
+
+      - name: Remove 'run-gpu-ci' Label
+        if: always()
+        uses: actions-ecosystem/action-remove-labels@v1
+        with:
+          labels: "run-gpu-ci"
+          github_token: ${{ secrets.GITHUB_TOKEN }}
diff --git a/.github/workflows/test.yaml b/.github/workflows/test.yaml
@@ -59,7 +59,7 @@ jobs:
           PLATFORM: ${{ matrix.os }}
           DISPLAY: :42
         run: |
-          coverage run -m pytest -v --color=yes
+          coverage run -m pytest -v --color=yes -m "not gpu"
       - name: Report coverage
         run: |
           coverage report

diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -8,6 +8,16 @@ and this project adheres to [Semantic Versioning][].
 [keep a changelog]: https://keepachangelog.com/en/1.0.0/
 [semantic versioning]: https://semver.org/spec/v2.0.0.html
 
+## v0.21.0
+
+### Additions
+
+- Add GPU implementation of Hamming distance ([#541](https://github.com/scverse/scirpy/pull/541))
+
+### Documentation
+
+- Add tutorial with tips for large datasets ([#541](https://github.com/scverse/scirpy/pull/541))
+
 ## v0.20.1
 
 ### Fixes

diff --git a/docs/api.rst b/docs/api.rst
@@ -301,6 +301,7 @@ distance metrics
    ir_dist.metrics.IdentityDistanceCalculator
    ir_dist.metrics.LevenshteinDistanceCalculator
    ir_dist.metrics.HammingDistanceCalculator
+   ir_dist.metrics.GPUHammingDistanceCalculator
    ir_dist.metrics.AlignmentDistanceCalculator
    ir_dist.metrics.FastAlignmentDistanceCalculator
    ir_dist.metrics.TCRdistDistanceCalculator
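
Note on the two Hamming calculators listed above: both compute the same metric, which is only defined for sequences of equal length. The block below is a minimal pure-Python sketch of that metric, not scirpy's implementation. The function names are hypothetical, and it assumes that "normalized" means mismatches expressed as a percentage of the sequence length and that pairs of unequal length are simply not compared.

```python
from typing import Optional


def hamming(seq1: str, seq2: str) -> Optional[int]:
    """Number of mismatching positions, or None if the lengths differ."""
    if len(seq1) != len(seq2):
        return None  # assumption: sequences of unequal length are not compared
    return sum(a != b for a, b in zip(seq1, seq2))


def normalized_hamming(seq1: str, seq2: str) -> Optional[float]:
    """Mismatches as a percentage of the sequence length (assumed definition)."""
    dist = hamming(seq1, seq2)
    if dist is None or len(seq1) == 0:
        return None
    return 100.0 * dist / len(seq1)


print(hamming("CASSLF", "CASSAF"))         # 1
print(normalized_hamming("CASS", "CATS"))  # 25.0
```

The GPU and CPU variants differ in how the pairwise comparisons are scheduled, not in what a single comparison returns.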
diff --git a/docs/tutorials.rst b/docs/tutorials.rst
@@ -8,3 +8,4 @@ Tutorials
    tutorials/tutorial_io.ipynb
    tutorials/tutorial_3k_tcr.ipynb
    tutorials/tutorial_5k_bcr.ipynb
+   tutorials/large-datasets.md
diff --git a/docs/tutorials/large-datasets.md b/docs/tutorials/large-datasets.md
@@ -0,0 +1,68 @@
+# Working with >1M cells
+
+Scirpy scales to millions of cells on a single workstation. This page is a work-in-progress collection of advice on how to
+work with large datasets.
+
+:::{admonition} Use an up-to-date version!
+:class: tip
+
+Scalability has been a major focus of recent developments in Scirpy. Make sure you use the latest version
+when working with large datasets to take advantage of all speedups.
+:::
+
+## Distance metrics
+
+Computing pairwise sequence distances is the major bottleneck for large datasets in the scirpy workflow.
+Here is some advice on how to maximize the speed of this step:
+
+## Choose an appropriate distance metric for `pp.ir_dist`
+
+Some distance metrics are significantly faster than others. Here are the distance metrics, roughly ordered by speed:
+
+`identity` > `gpu_hamming` > `hamming` = `normalized_hamming` > `tcrdist` > `levenshtein` > `fastalignment` > `alignment`
+
+`tcrdist`, `fastalignment` and `alignment` produce very similar distance matrices, but `tcrdist` is by far the fastest. For this
+reason, we recommend `tcrdist` whenever you need a metric that takes a substitution matrix into account.
+
+## Multi-machine parallelization with dask
+
+The `hamming`, `normalized_hamming`, `tcrdist`, `levenshtein`, `fastalignment`, and `alignment` metrics are parallelized
+using [joblib](https://joblib.readthedocs.io/en/stable/). This makes it very easy to switch the backend to
+[dask](https://www.dask.org/) to distribute jobs across a multi-machine cluster. Note that this comes with a
+considerable overhead for communication between the workers. It's only worthwhile when processing on a single
+machine becomes infeasible.
+
+```python
+from dask.distributed import Client, LocalCluster
+import joblib
+import scirpy as ir
+
+cluster = LocalCluster(n_workers=16)  # substitute this with a multi-machine cluster...
+client = Client(cluster)
+
+with joblib.parallel_config(backend="dask", n_jobs=200, verbose=10):
+    ir.pp.ir_dist(
+        mdata,  # dataset with receptor data, loaded beforehand
+        metric="tcrdist",
+        n_jobs=1,  # jobs per worker
+        n_blocks=20,  # number of blocks sent to dask
+    )
+```
+
+## Using GPU acceleration for hamming distance
+
+The Hamming distance metric supports GPU acceleration via [cupy](https://cupy.dev/).
+
+First, install the optional `cupy` dependency:
+
+```
+!pip install scirpy[cupy]
+```
+
+Then simply run
+
+```python
+ir.pp.ir_dist(mdata, metric="gpu_hamming")
+```
+
+to take advantage of GPU acceleration.
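
Note on the GPU section above: because `cupy` is an optional dependency, a script that should run on machines with and without it needs a fallback. The block below is a minimal sketch of one way to choose the metric name. `pick_hamming_metric` is a hypothetical helper, not part of scirpy, and it only checks that `cupy` is importable, which does not guarantee that a usable CUDA device is present.

```python
import importlib.util


def pick_hamming_metric() -> str:
    """Return "gpu_hamming" if cupy can be imported, otherwise "hamming"."""
    # find_spec returns None when the top-level package is not installed.
    if importlib.util.find_spec("cupy") is not None:
        return "gpu_hamming"
    return "hamming"


metric = pick_hamming_metric()
print(metric)
# Assumed usage, with `ir` and `mdata` set up as in the tutorial text:
# ir.pp.ir_dist(mdata, metric=metric)
```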
diff --git a/pyproject.toml b/pyproject.toml
@@ -81,6 +81,9 @@ test = [
     'rectangle-packer',
     'parasail != 1.2.1',
 ]
+cupy = [
+    'cupy-cuda12x',
+]
 dandelion = [
     'sc-dandelion>=0.3.5',
 ]
@@ -112,7 +115,8 @@ xfail_strict = true
 # ]
 markers = [
     "conda: marks a subset of tests to be ran on the Bioconda CI.",
-    "extra: marks tests that require extra dependencies."
+    "extra: marks tests that require extra dependencies.",
+    "gpu: mark test to run on GPU",
 ]
 minversion = 6.0
 norecursedirs = [ '.*', 'build', 'dist', '*.egg', 'data', '__pycache__']

diff --git a/src/scirpy/ir_dist/__init__.py b/src/scirpy/ir_dist/__init__.py
@@ -35,7 +35,16 @@ def IrNeighbors(*args, **kwargs):
 
 
 MetricType = (
-    Literal["alignment", "fastalignment", "identity", "levenshtein", "hamming", "normalized_hamming", "tcrdist"]
+    Literal[
+        "alignment",
+        "fastalignment",
+        "identity",
+        "levenshtein",
+        "hamming",
+        "gpu_hamming",
+        "normalized_hamming",
+        "tcrdist",
+    ]
     | metrics.DistanceCalculator
 )
 
@@ -51,6 +60,8 @@ def IrNeighbors(*args, **kwargs):
         See :class:`~scirpy.ir_dist.metrics.TCRdistDistanceCalculator`.
       * `hamming` -- Hamming distance for CDR3 sequences of equal length.
         See :class:`~scirpy.ir_dist.metrics.HammingDistanceCalculator`.
+      * `gpu_hamming` -- Hamming distance for CDR3 sequences of equal length calculated with a GPU.
+        See :class:`~scirpy.ir_dist.metrics.GPUHammingDistanceCalculator`.
       * `normalized_hamming` -- Normalized Hamming distance (in percent) for CDR3 sequences of equal length.
         See :class:`~scirpy.ir_dist.metrics.HammingDistanceCalculator`.
       * `alignment` -- Distance based on pairwise sequence alignments using the
@@ -105,6 +116,8 @@ def _get_distance_calculator(metric: MetricType, cutoff: int | None, *, n_jobs=-
         dist_calc = metrics.HammingDistanceCalculator(n_jobs=n_jobs, **kwargs)
     elif metric == "normalized_hamming":
         dist_calc = metrics.HammingDistanceCalculator(n_jobs=n_jobs, normalize=True, **kwargs)
+    elif metric == "gpu_hamming":
+        dist_calc = metrics.GPUHammingDistanceCalculator(**kwargs)
     elif metric == "tcrdist":
         dist_calc = metrics.TCRdistDistanceCalculator(n_jobs=n_jobs, **kwargs)
     else:
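
Note on the dispatch in this hunk: `n_jobs` is forwarded to the CPU calculators but not to `GPUHammingDistanceCalculator`, which receives only `**kwargs`. The block below is a self-contained sketch of that behaviour using stand-in classes rather than the real calculators. `CPUHammingStub`, `GPUHammingStub` and `get_calculator` are hypothetical names; `gpu_n_blocks` is a parameter named in the commit message, and the value 10 is arbitrary.

```python
class CPUHammingStub:
    """Stand-in for a CPU calculator that accepts n_jobs."""

    def __init__(self, *, n_jobs=-1, normalize=False, **kwargs):
        self.n_jobs = n_jobs
        self.normalize = normalize
        self.kwargs = kwargs


class GPUHammingStub:
    """Stand-in for the GPU calculator: it receives only the extra kwargs."""

    def __init__(self, **kwargs):
        self.kwargs = kwargs


def get_calculator(metric, *, n_jobs=-1, **kwargs):
    # Mirrors the branch structure of the hunk, with stubs in place of
    # the real scirpy classes.
    if metric == "hamming":
        return CPUHammingStub(n_jobs=n_jobs, **kwargs)
    elif metric == "normalized_hamming":
        return CPUHammingStub(n_jobs=n_jobs, normalize=True, **kwargs)
    elif metric == "gpu_hamming":
        return GPUHammingStub(**kwargs)  # n_jobs is not forwarded here
    raise ValueError(f"Invalid distance metric: {metric}")


gpu_calc = get_calculator("gpu_hamming", n_jobs=4, gpu_n_blocks=10)
print(gpu_calc.kwargs)  # {'gpu_n_blocks': 10}
```

Under this structure, GPU-specific tuning parameters reach the calculator only through the keyword arguments passed along by the caller.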
@@ -184,6 +197,8 @@ def _ir_dist(
         Like `chain_idx_key`, but for `reference`.
     **kwargs
         Arguments are passed to the respective :class:`~scirpy.ir_dist.metrics.DistanceCalculator` class.
+        Check out the distance calculator for the respective metric to see parameters specific to
+        individual distance calculators that can be passed via kwargs.
 
     Returns
     -------