Prepare for first stable release #24

Merged 34 commits on Jun 10, 2024
185af45
Lint with black
yutanagano Jun 7, 2024
799fb33
Add license
yutanagano Jun 7, 2024
3de04e0
Add some badges
yutanagano Jun 7, 2024
0e90e13
Remove unnecessary badges
yutanagano Jun 7, 2024
0b03411
Further polish the documentation
yutanagano Jun 8, 2024
045f41c
Merge pull request #21 from yutanagano/improve_docs
yutanagano Jun 8, 2024
9273f16
Move functional API into root module
yutanagano Jun 8, 2024
7e2667d
Defer default model loading in the root module until necessary
yutanagano Jun 8, 2024
e61f15f
Rename VERSION attribute to __version__
yutanagano Jun 8, 2024
59a50e6
Update documentation to reflect refactor
yutanagano Jun 8, 2024
8037ab4
Merge pull request #22 from yutanagano/move_functional_api_to_top
yutanagano Jun 8, 2024
3c07879
Move Sceptr model class documentation to its own page
yutanagano Jun 8, 2024
b0fbd8d
Make separate page for model submodule
yutanagano Jun 8, 2024
5250991
Further polish documentation
yutanagano Jun 8, 2024
e946f2d
Flesh out usage page of docs
yutanagano Jun 9, 2024
8c7a5de
Update README
yutanagano Jun 9, 2024
4816fa7
Center badges and documentation link
yutanagano Jun 9, 2024
ce6fde7
Improve readme heading
yutanagano Jun 9, 2024
8a53491
Add pypi version badge to README
yutanagano Jun 9, 2024
7e27a7a
Add $ to beginning of pip install line on README
yutanagano Jun 9, 2024
abedb48
Remove back $ from pip install line to make it directly copyable
yutanagano Jun 9, 2024
5da0065
Pre-emptively link to documentation in README
yutanagano Jun 9, 2024
a644959
Add links to repository in sphinx documentation
yutanagano Jun 9, 2024
b445303
Add schematic figure from preprint onto home page
yutanagano Jun 9, 2024
d6d3455
Fix typos in documentation
yutanagano Jun 9, 2024
49a4f7a
Merge pull request #23 from yutanagano/add_publication_figures
yutanagano Jun 9, 2024
f5de61d
Readme changes
andim Jun 10, 2024
7f4aa4c
Update README.md
yutanagano Jun 10, 2024
93854f5
Add readthedocs configuration
yutanagano Jun 10, 2024
1582dae
Simplify rtd yaml
yutanagano Jun 10, 2024
46b91da
Switch to optional dependencies for rtd import
yutanagano Jun 10, 2024
a990d6f
Add docs badge to README
yutanagano Jun 10, 2024
7c44f85
Bump version to 1.0.0
yutanagano Jun 10, 2024
663db23
Bump dev status to stable on pyproject toml
yutanagano Jun 10, 2024
16 changes: 16 additions & 0 deletions .readthedocs.yaml
@@ -0,0 +1,16 @@
version: 2

build:
  os: ubuntu-22.04
  tools:
    python: "3.12"

sphinx:
  configuration: docs/conf.py

python:
  install:
    - method: pip
      path: .
      extra_requirements:
        - docs
21 changes: 21 additions & 0 deletions LICENSE.txt
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2024 Yuta Nagano

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
185 changes: 17 additions & 168 deletions README.md
@@ -1,179 +1,28 @@
# SCEPTR

> [!NOTE]
> The latest version of SCEPTR no longer supports Python versions earlier than 3.9.

**S**imple **C**ontrastive **E**mbedding of the **P**rimary sequence of **T** cell **R**eceptors (**SCEPTR**) is a BERT-like attention model trained on T cell receptor (TCR) data.
It maps TCRs to vector representations, which can be used for downstream TCR and TCR repertoire analysis such as TCR clustering or classification.

## Installation

### From [PyPI](https://pypi.org/project/sceptr/) (Recommended)

```bash
pip install sceptr
```

### From Source

> [!IMPORTANT]
> To install `sceptr` from source, you must have [`git-lfs`](https://git-lfs.com/) installed and set up on your system.
> This is because you must be able to download the trained model weights directly from the Git LFS servers during your install.
<div align="center">

#### Using `pip`

From your Python environment, run the following replacing `<VERSION_TAG>` with the appropriate version specifier (e.g. `v1.0.0-alpha.1`).
The latest release tags can be found by checking the 'releases' section on the github repository page.

```bash
pip install git+https://github.com/yutanagano/sceptr.git@<VERSION_TAG>
```

#### Manual install
# SCEPTR

You can also clone the repository, and from within your Python environment, navigate to the project root directory and run:
[![Latest release](https://img.shields.io/pypi/v/sceptr)](https://pypi.org/p/sceptr)
![Tests](https://github.com/yutanagano/sceptr/actions/workflows/tests.yaml/badge.svg)
[![Documentation Status](https://readthedocs.org/projects/sceptr/badge/?version=latest)](https://sceptr.readthedocs.io)
[![License](https://img.shields.io/badge/license-MIT-blue)](https://github.com/yutanagano/tidytcells?tab=MIT-1-ov-file#readme)

```bash
pip install .
```
### Check out the [documentation page](https://sceptr.readthedocs.io).

Note that even for manual installation, you still need `git-lfs` to properly de-reference the stub files at `git-clone`-ing time.
</div>

#### Troubleshooting
**SCEPTR** (**S**imple **C**ontrastive **E**mbedding of the **P**rimary sequence of **T** cell **R**eceptors) is a small, fast, and accurate TCR representation model that can be used for alignment-free TCR analysis, including for TCR-pMHC interaction prediction and TCR clustering (metaclonotype discovery).
Our [manuscript (coming soon)](about:blank) demonstrates that SCEPTR can be used for few-shot TCR specificity prediction with improved accuracy over previous methods.

A recent security update to `git` has resulted in some difficulties cloning repositories that rely on `git-lfs`.
This can result in an error message along the lines of:
SCEPTR is a BERT-like transformer-based neural network implemented in [PyTorch](https://pytorch.org).
The default model provides best-in-class performance with only 153,108 parameters (typical protein language models have tens or hundreds of millions), so SCEPTR runs fast, even on a CPU!
And if your computer does have a [CUDA-enabled GPU](https://en.wikipedia.org/wiki/CUDA), the sceptr package will automatically detect and use it, giving you blazingly fast performance without the hassle.

```
fatal: active `post-checkout` hook found during `git clone`
```
sceptr's API exposes three intuitive functions: `calc_vector_representations`, `calc_cdist_matrix`, and `calc_pdist_vector`. These are all you need to make full use of the SCEPTR models.
What's even better is that they are fully compliant with [pyrepseq](https://pyrepseq.readthedocs.io)'s [tcr_metric](https://pyrepseq.readthedocs.io/en/latest/api.html#pyrepseq.metric.tcr_metric.TcrMetric) API, so sceptr will fit snugly into the rest of your repertoire analysis workflow.

If this happens, you can temporarily set the `GIT_CLONE_PROTECTION_ACTIVE` environment variable to `false` by prepending `GIT_CLONE_PROTECTION_ACTIVE=false` before the install command like below:
## Installation

```bash
GIT_CLONE_PROTECTION_ACTIVE=false pip install git+https://github.com/yutanagano/sceptr.git@<VERSION_TAG>
pip install sceptr
```

This is [a known issue](https://github.com/git-lfs/git-lfs/issues/5749) for `git` version `2.45.1` and [is fixed](https://lore.kernel.org/git/[email protected]/T/#u) from version `2.45.2`.
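
To decide whether the workaround is needed, the installed `git` version can be checked first. The sketch below is illustrative only: the function names are hypothetical, and it flags exactly `2.45.1`, the one version named in the note above.

```python
import re
import subprocess


def is_affected_git_version(version_output):
    """Return True if `git --version` output reports version 2.45.1.

    Only the version named in the troubleshooting note is flagged; this is
    a hypothetical helper, not part of sceptr.
    """
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", version_output)
    if match is None:
        raise ValueError(f"could not find a version number in {version_output!r}")
    return tuple(int(part) for part in match.groups()) == (2, 45, 1)


def installed_git_version_output():
    """Run `git --version` and return its output, e.g. 'git version 2.45.2'."""
    result = subprocess.run(
        ["git", "--version"], capture_output=True, text=True, check=True
    )
    return result.stdout.strip()
```

Calling `is_affected_git_version(installed_git_version_output())` before installing from source indicates whether prepending `GIT_CLONE_PROTECTION_ACTIVE=false` is likely to be necessary.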

## Prescribed data format

> [!IMPORTANT]
> SCEPTR only recognises TCR V/J gene symbols that are IMGT-compliant, and also known to be functional (i.e. known pseudogenes or ORFs are not allowed).
> For easy standardisation of TCR gene nomenclature in your data, as well as filtering your data for functional V/J genes, check out [tidytcells](https://pypi.org/project/tidytcells/).

SCEPTR expects to receive TCR data in the form of [pandas](https://pandas.pydata.org/) [`DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html?highlight=dataframe#pandas.DataFrame) instances.
Therefore, all TCR data should be represented as a `DataFrame` with the following structure and data types.
The column order is irrelevant.
Each row should represent one TCR.
Incomplete rows are allowed (e.g. only beta chain data available) as long as the SCEPTR variant that is being used has at least some partial information to go on.

| Column name | Column datatype | Column contents |
|---|---|---|
|TRAV|`str`|IMGT symbol for the alpha chain V gene|
|CDR3A|`str`|Amino acid sequence of the alpha chain CDR3, including the first C and last W/F residues, in all caps|
|TRAJ|`str`|IMGT symbol for the alpha chain J gene|
|TRBV|`str`|IMGT symbol for the beta chain V gene|
|CDR3B|`str`|Amino acid sequence of the beta chain CDR3, including the first C and last W/F residues, in all caps|
|TRBJ|`str`|IMGT symbol for the beta chain J gene|

## Usage

### Functional API (`sceptr.sceptr`)

The eponymous `sceptr` submodule is the easiest way to use SCEPTR.
It loads the default SCEPTR variant (currently `ab_sceptr`) and exposes its methods directly as module-level functions.

> [!TIP]
> To use the functional API, import the `sceptr` submodule like so:
> ```
> from sceptr import sceptr
> ```
> Attempting to access the submodule as an attribute of the top level module
> ```
> import sceptr
>
> sceptr.sceptr.calc_vector_representations() #...do something...
> ```
> will result in an error.

---

#### `sceptr.sceptr.calc_vector_representations(instances: DataFrame) -> ndarray`

Map a table of TCRs provided as a pandas `DataFrame` in the above format to a set of vector representations.

Parameters:

- tcrs (`DataFrame`): DataFrame in the prescribed format.

Returns:

A 2D [numpy](https://numpy.org/) [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html) object where every row vector corresponds to a row in the original TCR `DataFrame`.
The returned array will have shape (N, D) where N is the number of TCRs in the input data and D is the dimensionality of the SCEPTR model.

---

#### `sceptr.sceptr.calc_cdist_matrix(anchors: DataFrame, comparisons: DataFrame) -> ndarray`

Generate a cdist matrix between two collections of TCRs.

Parameters:

- anchor_tcrs (`DataFrame`): DataFrame in the prescribed format, representing TCRs from collection A.
- comparison_tcrs (`DataFrame`): DataFrame in the prescribed format, representing TCRs from collection B.

Returns:

A 2D numpy `ndarray` representing a cdist matrix between TCRs from collection A and B.
The returned array will have shape (X, Y) where X is the number of TCRs in collection A and Y is the number of TCRs in collection B.
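
The (X, Y) shape can be pictured with a small pure-Python sketch that works on toy vectors rather than real SCEPTR representations. Euclidean distance is an assumption made here for illustration, and `cdist_matrix` is a hypothetical stand-in, not the sceptr function itself.

```python
import math


def cdist_matrix(anchor_vectors, comparison_vectors):
    """Return nested lists with one row per anchor and one column per comparison.

    Entry [i][j] is the (assumed Euclidean) distance between anchor i and
    comparison j.
    """
    return [
        [math.dist(anchor, comparison) for comparison in comparison_vectors]
        for anchor in anchor_vectors
    ]


# Two toy "collection A" vectors and three toy "collection B" vectors.
anchors = [(0.0, 0.0), (3.0, 4.0)]
comparisons = [(0.0, 0.0), (0.0, 4.0), (3.0, 0.0)]
matrix = cdist_matrix(anchors, comparisons)
```

Here `matrix` has 2 rows and 3 columns, matching X = 2 anchors and Y = 3 comparisons.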

---

#### `sceptr.sceptr.calc_pdist_vector(instances: DataFrame) -> ndarray`

Generate a pdist set of distances between each pair of TCRs in the input data.

Parameters:

- tcrs (`DataFrame`): DataFrame in the prescribed format.

Returns:

A 1D numpy `ndarray` representing a pdist vector of distances between each pair of TCRs in the input data.
The returned array will have shape (1/2 * N * (N-1),), where N is the number of TCRs in the input data.
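
The length N(N-1)/2 follows from listing each unordered pair of TCRs once. The sketch below shows this on toy vectors. It assumes Euclidean distance and the pair ordering used by `scipy.spatial.distance.pdist` (all pairs with the first index 0, then 1, and so on); neither is stated in this README, and both function names are hypothetical.

```python
import itertools
import math


def pdist_vector(vectors):
    """Return distances for every pair (i, j) with i < j, in row-major order."""
    return [math.dist(a, b) for a, b in itertools.combinations(vectors, 2)]


def condensed_index(i, j, n):
    """Return the position of pair (i, j), with i < j, in a pdist vector over n items."""
    if not 0 <= i < j < n:
        raise ValueError("expected 0 <= i < j < n")
    return n * i - i * (i + 1) // 2 + (j - i - 1)


toy_vectors = [(0.0, 0.0), (3.0, 4.0), (0.0, 4.0), (3.0, 0.0)]
distances = pdist_vector(toy_vectors)
```

With N = 4 toy vectors, `distances` has 4 * 3 / 2 = 6 entries, and `distances[condensed_index(1, 3, 4)]` is the distance between the second and fourth vectors.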

---

### Loading specific SCEPTR variants (`sceptr.variant`)

For more curious users, model variants are available to load and use through the `sceptr.variant` submodule.

The module exposes functions, each named after a particular model variant, which when called, will return a `Sceptr` object corresponding to the selected model variant.
This `Sceptr` object will then have the methods: `calc_pdist_vector`, `calc_cdist_matrix`, and `calc_vector_representations` available to use, with function signatures exactly as defined above for the functional API in the `sceptr.sceptr` submodule.

#### Paired-chain variants

|Name|Description|
|---|---|
|`sceptr.variant.default`|default model used by the functional API|
|`sceptr.variant.mlm_only`|default model trained without autocontrastive learning|
|`sceptr.variant.left_aligned`|similar to default model but with learnable token embeddings and a sinusoidal position information embedding method more similar to the original NLP BERT/transformer models|
|`sceptr.variant.cdr3_only`|only uses the CDR3 loops as input|
|`sceptr.variant.cdr3_only_mlm_only`|only uses CDR3 loops as input, and did not receive autocontrastive learning|
|`sceptr.variant.large`|larger variant with model dimensionality 128|
|`sceptr.variant.small`|smaller variant with model dimensionality 32|
|`sceptr.variant.tiny`|smaller variant with model dimensionality 16|
|`sceptr.variant.blosum`|variant using BLOSUM62 embeddings instead of one-hot|
|`sceptr.variant.average_pooling`|variant using the average-pooling method to generate the TCR representation vector|
|`sceptr.variant.shuffled_data`|variant trained on the Tanno et al. dataset with randomised alpha/beta pairing|
|`sceptr.variant.synthetic_data`|variant trained using synthetic TCR sequences generated by OLGA|
|`sceptr.variant.dropout_noise_only`|variant trained without residue/chain dropping during autocontrastive learning|
|`sceptr.variant.finetuned`|variant fine-tuned using supervised contrastive learning for six pMHCs with peptides GILGFVFTL, NLVPMVATV, SPRWYFYYL, TFEYVSQPFLMDLE, TTDPSFLGRY and YLQPRTFLL (from [VDJdb](https://vdjdb.cdr3.net/))|

#### Single-chain variants

|Name|Description|
|---|---|
|`sceptr.variant.a_sceptr`|alpha-chain only variant|
|`sceptr.variant.b_sceptr`|beta-chain only variant|
Binary file added docs/about_sceptr.png
5 changes: 3 additions & 2 deletions docs/api.rst
@@ -2,7 +2,8 @@ API reference
=============

.. toctree::
   :maxdepth: 2
   :maxdepth: 1

   sceptr_sceptr
   sceptr
   sceptr_variant
   sceptr_model
29 changes: 15 additions & 14 deletions docs/conf.py
@@ -8,27 +8,28 @@
# -- Project information -----------------------------------------------------
# https://www.sphinx-doc.org/en/master/usage/configuration.html#project-information

project = 'sceptr'
copyright = '2024, Yuta Nagano'
author = 'Yuta Nagano'
version = sceptr.VERSION
project = "sceptr"
copyright = "2024, Yuta Nagano"
author = "Yuta Nagano"
release = sceptr.__version__

# -- General configuration ---------------------------------------------------
# https://www.sphinx-doc.org/en/master/usage/configuration.html#general-configuration

extensions = [
"sphinx.ext.autodoc",
"sphinx.ext.autosummary",
"sphinx.ext.napoleon"
]

templates_path = ['_templates']
exclude_patterns = ['_build', 'Thumbs.db', '.DS_Store']
extensions = ["sphinx.ext.autodoc", "sphinx.ext.autosummary", "sphinx.ext.napoleon"]

templates_path = ["_templates"]
exclude_patterns = ["_build", "Thumbs.db", ".DS_Store"]


# -- Options for HTML output -------------------------------------------------
# https://www.sphinx-doc.org/en/master/usage/configuration.html#options-for-html-output

html_theme = 'sphinx_book_theme'
html_static_path = ['_static']
html_theme = "sphinx_book_theme"
html_theme_options = {
    "repository_url": "https://github.com/yutanagano/sceptr",
    "path_to_docs": "docs",
    "use_repository_button": True,
    "use_issues_button": True,
}
html_static_path = ["_static"]
27 changes: 19 additions & 8 deletions docs/index.rst
@@ -1,13 +1,24 @@
.. sceptr documentation master file, created by
sphinx-quickstart on Fri Jun 7 10:32:22 2024.
You can adapt this file completely to your liking, but it should at least
contain the root `toctree` directive.
SCEPTR
======

SCEPTR: A fast and performant TCR representation model
======================================================
**SCEPTR** (\ **S**\ imple **C**\ ontrastive **E**\ mbedding of the **P**\ rimary sequence of **T** cell **R**\ eceptors) is a small, fast, and performant TCR representation model that can be used for alignment-free downstream TCR and TCR repertoire analysis such as TCR clustering or classification.
Our `manuscript (coming soon) <about:blank>`_ demonstrates SCEPTR's state-of-the-art performance (as of 2024) on downstream TCR specificity prediction.

SCEPTR (Simple Contrastive Embedding of the Primary sequence of T cell Receptors) is a BERT-like attention model trained on T cell receptor (TCR) data.
It maps TCRs to vector representations, which enables alignment-free downstream TCR and TCR repertoire analysis such as TCR clustering or classification.
SCEPTR is a BERT-like transformer-based neural network implemented in `PyTorch <https://pytorch.org>`_.
The default model provides best-in-class performance with only 153,108 parameters (typical protein language models have tens or hundreds of millions), so SCEPTR runs fast, even on a CPU!
And if your computer does have a `CUDA-enabled GPU <https://en.wikipedia.org/wiki/CUDA>`_, the sceptr package will automatically detect and use it, giving you blazingly fast performance without the hassle.

sceptr's :ref:`API <api>` exposes three intuitive functions: :py:func:`~sceptr.calc_vector_representations`, :py:func:`~sceptr.calc_cdist_matrix`, and :py:func:`~sceptr.calc_pdist_vector`. These are all you need to make full use of the SCEPTR models.
What's even better is that they are fully compliant with `pyrepseq <https://pyrepseq.readthedocs.io>`_'s `tcr_metric <https://pyrepseq.readthedocs.io/en/latest/api.html#pyrepseq.metric.tcr_metric.TcrMetric>`_ API, so sceptr will fit snugly into the rest of your repertoire analysis toolkit.

.. figure:: about_sceptr.png
   :width: 700px
   :alt: Schematic diagrams showing a visual introduction to the architecture of SCEPTR, as well as how it was trained-- namely, autocontrastive learning and masked-language modelling.

   A visual introduction to how SCEPTR works, taken from our SCEPTR preprint.
   SCEPTR is a TCR language model (a,b) pre-trained using masked-language modelling and autocontrastive learning (c,d).
   (a) The default model uses the ``<cls>`` pooling method, but there is also a variant that is trained to use average-pooling (see :py:func:`sceptr.variant.average_pooling`).
   Please see the manuscript for more details.

.. toctree::
   :maxdepth: 2
18 changes: 6 additions & 12 deletions docs/installation.rst
@@ -12,44 +12,38 @@ From `Source <https://github.com/yutanagano/sceptr>`_
-----------------------------------------------------

.. important::
   To install `sceptr` from source, you must have `git-lfs <https://git-lfs.com/>`_ installed and set up on your system.
   To install ``sceptr`` from source, you must have `git-lfs <https://git-lfs.com/>`_ installed and set up on your system.
   This is because you must be able to download the trained model weights directly from the Git LFS servers during your install.

Using `pip`
...........

From your Python environment, run the following replacing `<VERSION_TAG>` with the appropriate version specifier (e.g. `v1.0.0-beta.1`).
From your Python environment, run the following replacing ``<VERSION_TAG>`` with the appropriate version specifier (e.g. ``v1.0.0-beta.1``).
The latest release tags can be found by checking the 'releases' section on the github repository page.

.. code-block:: bash

    $ pip install git+https://github.com/yutanagano/sceptr.git@<VERSION_TAG>

Manual install
..............

You can also clone the repository, and from within your Python environment, navigate to the project root directory and run:

.. code-block:: bash

    $ pip install .

Note that even for manual installation, you still need `git-lfs` to properly de-reference the stub files at `git-clone`-ing time.
Note that even for manual installation, you still need ``git-lfs`` to properly de-reference the stub files at ``git-clone``-ing time.

Troubleshooting
...............

A recent security update to `git` has resulted in some difficulties cloning repositories that rely on `git-lfs`.
A recent security update to ``git`` has resulted in some difficulties cloning repositories that rely on ``git-lfs``.
This can result in an error message along the lines of:

.. code-block:: bash

    $ fatal: active `post-checkout` hook found during `git clone`

If this happens, you can temporarily set the `GIT_CLONE_PROTECTION_ACTIVE` environment variable to `false` by prepending `GIT_CLONE_PROTECTION_ACTIVE=false` before the install command like below:
If this happens, you can temporarily set the ``GIT_CLONE_PROTECTION_ACTIVE`` environment variable to ``false`` by prepending ``GIT_CLONE_PROTECTION_ACTIVE=false`` before the install command like below:

.. code-block:: bash

    $ GIT_CLONE_PROTECTION_ACTIVE=false pip install git+https://github.com/yutanagano/sceptr.git@<VERSION_TAG>

This is `a known issue <https://github.com/git-lfs/git-lfs/issues/5749>`_ for `git` version `2.45.1` and is fixed from version `2.45.2`.
This is `a known issue <https://github.com/git-lfs/git-lfs/issues/5749>`_ for ``git`` version ``2.45.1`` and is fixed from version ``2.45.2``.
7 changes: 7 additions & 0 deletions docs/sceptr.rst
@@ -0,0 +1,7 @@
.. _api:

``sceptr``
==========

.. automodule:: sceptr
   :members:
5 changes: 5 additions & 0 deletions docs/sceptr_model.rst
@@ -0,0 +1,5 @@
``sceptr.model``
================

.. autoclass:: sceptr.model.Sceptr()
   :members:
5 changes: 0 additions & 5 deletions docs/sceptr_sceptr.rst

This file was deleted.
