GPU implementation of hamming distance (#541)
* take static methods out of tcrdist

* made _tcrdist_mat a normal class method

* parent method NumbaDistanceCalculator extracted

* numba version of hamming distance implemented

* hamming numba tests passed and reference test added

* hamming numba distance calculator implemented and tested

* n_jobs parameter handling done in NumbaDistanceCalculator superclass

* documentation adapted

* removed unnecessary import

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* hamming distance with numba parallelization implemented

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* imports fixed

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* implemented parallelization with n_jobs and n_blocks for hamming and tcrdist distance metrics

* performance optimization for hamming and tcrdist

* more documentation added

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* documentation adapted

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* documentation adapted

* signature of _calc_dist_mat_block changed

* the alphabet for the hamming distance is now the unique characters occurring in all sequences

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* normalized hamming distance added

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* renaming test

* histogram creation for hamming distance added

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* refactored

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* hamming histogram adjustments

* reference test cases added for normalized hamming and hamming histogram

* Update src/scirpy/ir_dist/metrics.py

Co-authored-by: Gregor Sturm <[email protected]>

* test cases for normalized hamming and hamming histogram adapted

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* docstring for normalized hamming distance and tcrdist distance added

* adapted default parameters and tests for n_jobs and n_blocks

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* test_sequence_dist_all_metrics adaptions

* n_jobs default value set to -1

* docstring of ir_dist for n_jobs adapted

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* docstring change to test cicd pipeline

* docstring for n_jobs of _ir_dist changed

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* docstring for n_jobs of _ir_dist changed

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* moved histogram creation to parent class of hamming distance calculator

* histogram computation adaptions

* test case test_tcrdist_histogram_not_implemented added

* documentation for histogram adapted

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* reformatted doc string

* handling of symmetric matrices with respect to histogram variable changed

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* retrieval of usable cpus for numba adapted

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* more documentation for histogram and (hamming) normalize added

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* added GPUHammingDistanceCalculator

* added test case for gpu hamming distance

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* documentation for GPUHammingDistanceCalculator adapted

* adapted documentation of _tcrdist_mat

* cuda numba experiments

* cupy experiments

* cupy experiments

* scaled cupy to 1 million cells

* sorted sequences by length

* textures used for seqs_mat1 and seqs_mat2

* texture with up to 100k cells

* sorted seqs with multiple blocks

* scaled textures to 1 million cells

* use char for sequences

* shared memory used

* experiments, run 1 million cells with global memory

* run 1 million cells with only global memory

* refactoring and time measurements

* optimized seqs2mat

* increased result matrix stacking speed

* changed data dtype to int8

* scaled to 1 million cells

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* sort indices of result csr matrix

* refactoring

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* remove test from ci

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* move cupy import to func

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add GPU test

* rename

* Rename .cirun.yaml to .cirun.yml

* print library versions for debugging

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* data types updated

* refactored

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update meta.yaml

* Update meta.yaml

* Update meta.yaml

* removed time measurement prints and added progress bar

* removed cuda synchronize statements used for performance testing

* added parameters gpu_n_blocks and gpu_block_width

* added parameter documentation

* cleaned up hamming kernel

* catch sequence conversion error by retrying with non ascii implementation

* use parameters gpu_n_blocks and gpu_block_width for testing

* adapted documentation about n_blocks

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* adapted error handling for ascii UnicodeError

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* removed reference to :class:`~scirpy.ir_dist.metrics.GPUHammingDistanceCalculator` in documentation

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* WIP: update docs

* Update changelog

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* added an explanation for choosing parameters to GPU hamming class documentation.

* logging.info used instead of print

* Add tutorial for large datasets

* Bump version

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Added checks for int32 overflow

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* adapted error text of int32 overflow check.

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Gregor Sturm <[email protected]>
Co-authored-by: Intron7 <[email protected]>
Co-authored-by: Severin Dicks <[email protected]>
5 people authored Feb 25, 2025
1 parent 4157ae4 commit 622be74
Showing 12 changed files with 568 additions and 9 deletions.
11 changes: 11 additions & 0 deletions .cirun.yml
@@ -0,0 +1,11 @@
runners:
  - name: aws-gpu-runner
    cloud: aws
    instance_type: g4dn.xlarge
    machine_image: ami-067a4ba2816407ee9
    region: eu-north-1
    preemptible:
      - true
      - false
    labels:
      - cirun-aws-gpu
4 changes: 2 additions & 2 deletions .conda/meta.yaml
@@ -40,7 +40,7 @@ requirements:
     - numba >=0.41.0
     - pooch >=1.7.0
     - joblib >=1.3.1
-    - logomaker
+    - logomaker !=0.8.5

 test:
   source_files:
@@ -57,7 +57,7 @@ test:
   imports:
     - scirpy
   commands:
-    - pytest --pyargs scirpy -m "not extra"
+    - pytest --pyargs scirpy -m "not extra and not gpu"

 about:
   home: https://scirpy.scverse.org
65 changes: 65 additions & 0 deletions .github/workflows/test-gpu.yaml
@@ -0,0 +1,65 @@
name: GPU-CI

on:
  push:
    branches: [main]
  pull_request:
    types:
      - labeled
      - opened
      - synchronize

# Cancel the job if new commits are pushed
# https://stackoverflow.com/questions/66335225/how-to-cancel-previous-runs-in-the-pr-when-you-push-new-commitsupdate-the-curre
concurrency:
  group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}
  cancel-in-progress: true

jobs:
  check:
    runs-on: ubuntu-latest
    steps:
      - uses: flying-sheep/check@v1
        with:
          success: ${{ github.event_name == 'push' || contains(github.event.pull_request.labels.*.name, 'run-gpu-ci') }}
  test:
    name: GPU Tests
    needs: check
    runs-on: "cirun-aws-gpu--${{ github.run_id }}"
    timeout-minutes: 30

    defaults:
      run:
        shell: bash -el {0}
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0

      - name: Nvidia SMI sanity check
        run: nvidia-smi

      - name: Install Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.11"

      - name: Install UV
        uses: hynek/setup-cached-uv@v2
        with:
          cache-dependency-path: pyproject.toml

      - name: Install scirpy
        run: uv pip install --system -e ".[dev,test,rpack,dandelion,diversity,parasail,cupy]"
      - name: Pip list
        run: pip list

      - name: Run test
        run: pytest -m gpu

      - name: Remove 'run-gpu-ci' Label
        if: always()
        uses: actions-ecosystem/action-remove-labels@v1
        with:
          labels: "run-gpu-ci"
          github_token: ${{ secrets.GITHUB_TOKEN }}
2 changes: 1 addition & 1 deletion .github/workflows/test.yaml
@@ -59,7 +59,7 @@ jobs:
           PLATFORM: ${{ matrix.os }}
           DISPLAY: :42
         run: |
-          coverage run -m pytest -v --color=yes
+          coverage run -m pytest -v --color=yes -m "not gpu"
       - name: Report coverage
         run: |
           coverage report
10 changes: 10 additions & 0 deletions CHANGELOG.md
@@ -8,6 +8,16 @@ and this project adheres to [Semantic Versioning][].
 [keep a changelog]: https://keepachangelog.com/en/1.0.0/
 [semantic versioning]: https://semver.org/spec/v2.0.0.html

+## v0.21.0
+
+### Additions
+
+- Add GPU implementation of Hamming distance ([#541](https://github.com/scverse/scirpy/pull/541))
+
+### Documentation
+
+- Add tutorial with tips for large datasets ([#541](https://github.com/scverse/scirpy/pull/541))
+
 ## v0.20.1

 ### Fixes
1 change: 1 addition & 0 deletions docs/api.rst
@@ -301,6 +301,7 @@ distance metrics
    ir_dist.metrics.IdentityDistanceCalculator
    ir_dist.metrics.LevenshteinDistanceCalculator
    ir_dist.metrics.HammingDistanceCalculator
+   ir_dist.metrics.GPUHammingDistanceCalculator
    ir_dist.metrics.AlignmentDistanceCalculator
    ir_dist.metrics.FastAlignmentDistanceCalculator
    ir_dist.metrics.TCRdistDistanceCalculator
1 change: 1 addition & 0 deletions docs/tutorials.rst
@@ -8,3 +8,4 @@ Tutorials
    tutorials/tutorial_io.ipynb
    tutorials/tutorial_3k_tcr.ipynb
    tutorials/tutorial_5k_bcr.ipynb
+   tutorials/large-datasets.md
68 changes: 68 additions & 0 deletions docs/tutorials/large-datasets.md
@@ -0,0 +1,68 @@
# Working with >1M cells

Scirpy scales to millions of cells on a single workstation. This page is a work-in-progress collection of advice on how to
work with large datasets.

:::{admonition} Use an up-to-date version!
:class: tip

Scalability has been a major focus of recent developments in Scirpy. Make sure you use the latest version
when working with large datasets to take advantage of all speedups.
:::

## Distance metrics

Computing pairwise sequence distances is the major bottleneck for large datasets in the scirpy workflow.
Here is some advice on how to maximize the speed of this step:

## Choose an appropriate distance metric for `pp.ir_dist`

Some distance metrics are significantly faster than others. Here are the distance metrics, roughly ordered by speed:

`identity` > `gpu_hamming` > `hamming` = `normalized_hamming` > `tcrdist` > `levenshtein` > `fastalignment` > `alignment`

`tcrdist`, `fastalignment` and `alignment` produce very similar distance matrices, but `tcrdist` is by far the fastest. For this
reason, we recommend `tcrdist` whenever you need a metric that takes a substitution matrix into account.
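
For orientation, the two Hamming variants in the ranking above can be sketched in a few lines of plain Python. This only illustrates the metric itself and is not scirpy's implementation (which runs on numba or the GPU). The function names, and the choice to return `None` for sequences of unequal length, are assumptions made for this example.

```python
from typing import Optional


def hamming(seq1: str, seq2: str) -> Optional[int]:
    """Count mismatching positions; undefined (None) for unequal lengths."""
    if len(seq1) != len(seq2):
        return None  # assumption: no distance is reported for such pairs
    return sum(a != b for a, b in zip(seq1, seq2))


def normalized_hamming(seq1: str, seq2: str) -> Optional[float]:
    """Hamming distance as a percentage of the sequence length."""
    dist = hamming(seq1, seq2)
    if dist is None:
        return None
    if len(seq1) == 0:
        return 0.0  # two empty sequences are identical
    return 100 * dist / len(seq1)


same_length = hamming("CASSL", "CASRL")              # one mismatch
percent = normalized_hamming("CASSL", "CASRL")       # one of five positions
different_length = hamming("CASSL", "CASSLG")        # lengths differ
print(same_length, percent, different_length)
```

Because both variants need only one position-wise comparison per pair, they sit near the fast end of the ranking, whereas the alignment-based metrics have to search over possible alignments.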

## Multi-machine parallelization with dask

The `hamming`, `normalized_hamming`, `tcrdist`, `levenshtein`, `fastalignment`, and `alignment` metrics are parallelized
using [joblib](https://joblib.readthedocs.io/en/stable/). This makes it very easy to switch the backend to
[dask](https://www.dask.org/) to distribute jobs across a multi-machine cluster. Note that this comes with
considerable communication overhead between the workers, so it is only worthwhile when processing on a single
machine becomes infeasible.

```python
from dask.distributed import Client, LocalCluster
import joblib
import scirpy as ir

# substitute this with a multi-machine cluster...
cluster = LocalCluster(n_workers=16)
client = Client(cluster)

# `mdata` is assumed to be loaded beforehand and to hold the immune receptor data
with joblib.parallel_config(backend="dask", n_jobs=200, verbose=10):
    ir.pp.ir_dist(
        mdata,
        metric="tcrdist",
        n_jobs=1,  # jobs per worker
        n_blocks=20,  # number of blocks sent to dask
    )
```

## Using GPU acceleration for hamming distance

The Hamming distance metric supports GPU acceleration via [cupy](https://cupy.dev/).

First, install the optional `cupy` dependency (the `cupy` extra pulls in `cupy-cuda12x`, the CUDA 12.x build of CuPy):

```
!pip install scirpy[cupy]
```

Then simply run

```
ir.pp.ir_dist(mdata, metric="gpu_hamming")
```

to take advantage of GPU acceleration.
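
If the same script has to run on machines with and without a GPU, one option is to pick the metric at runtime. The helper below is hypothetical and not part of scirpy. It only checks whether `cupy` is importable and does not verify that a CUDA device is actually present, so treat it as a minimal sketch.

```python
import importlib.util


def choose_hamming_metric() -> str:
    """Return "gpu_hamming" if cupy is installed, otherwise "hamming"."""
    # find_spec only looks the package up; it does not import it or touch the GPU
    if importlib.util.find_spec("cupy") is not None:
        return "gpu_hamming"
    return "hamming"


metric = choose_hamming_metric()
print(f"Using metric: {metric}")
# Then, with scirpy imported as `ir` and `mdata` loaded:
# ir.pp.ir_dist(mdata, metric=metric)
```

Both metric names are accepted by `pp.ir_dist`, so the rest of the workflow stays unchanged regardless of which one is selected.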
6 changes: 5 additions & 1 deletion pyproject.toml
@@ -81,6 +81,9 @@ test = [
     'rectangle-packer',
     'parasail != 1.2.1',
 ]
+cupy = [
+    'cupy-cuda12x',
+]
 dandelion = [
     'sc-dandelion>=0.3.5',
 ]
@@ -112,7 +115,8 @@ xfail_strict = true
 # ]
 markers = [
     "conda: marks a subset of tests to be ran on the Bioconda CI.",
-    "extra: marks tests that require extra dependencies."
+    "extra: marks tests that require extra dependencies.",
+    "gpu: mark test to run on GPU",
 ]
 minversion = 6.0
 norecursedirs = [ '.*', 'build', 'dist', '*.egg', 'data', '__pycache__']
17 changes: 16 additions & 1 deletion src/scirpy/ir_dist/__init__.py
@@ -35,7 +35,16 @@ def IrNeighbors(*args, **kwargs):


 MetricType = (
-    Literal["alignment", "fastalignment", "identity", "levenshtein", "hamming", "normalized_hamming", "tcrdist"]
+    Literal[
+        "alignment",
+        "fastalignment",
+        "identity",
+        "levenshtein",
+        "hamming",
+        "gpu_hamming",
+        "normalized_hamming",
+        "tcrdist",
+    ]
     | metrics.DistanceCalculator
 )

@@ -51,6 +60,8 @@ def IrNeighbors(*args, **kwargs):
   See :class:`~scirpy.ir_dist.metrics.TCRdistDistanceCalculator`.
 * `hamming` -- Hamming distance for CDR3 sequences of equal length.
   See :class:`~scirpy.ir_dist.metrics.HammingDistanceCalculator`.
+* `gpu_hamming` -- Hamming distance for CDR3 sequences of equal length calculated with a GPU.
+  See :class:`~scirpy.ir_dist.metrics.GPUHammingDistanceCalculator`.
 * `normalized_hamming` -- Normalized Hamming distance (in percent) for CDR3 sequences of equal length.
   See :class:`~scirpy.ir_dist.metrics.HammingDistanceCalculator`.
 * `alignment` -- Distance based on pairwise sequence alignments using the
@@ -105,6 +116,8 @@ def _get_distance_calculator(metric: MetricType, cutoff: int | None, *, n_jobs=-
         dist_calc = metrics.HammingDistanceCalculator(n_jobs=n_jobs, **kwargs)
     elif metric == "normalized_hamming":
         dist_calc = metrics.HammingDistanceCalculator(n_jobs=n_jobs, normalize=True, **kwargs)
+    elif metric == "gpu_hamming":
+        dist_calc = metrics.GPUHammingDistanceCalculator(**kwargs)
     elif metric == "tcrdist":
         dist_calc = metrics.TCRdistDistanceCalculator(n_jobs=n_jobs, **kwargs)
     else:
@@ -184,6 +197,8 @@ def _ir_dist(
         Like `chain_idx_key`, but for `reference`.
     **kwargs
         Arguments are passed to the respective :class:`~scirpy.ir_dist.metrics.DistanceCalculator` class.
+        Check out the distance calculator for the respective metric to see parameters specific to
+        individual distance calculators that can be passed via kwargs.

     Returns
     -------