Prepro refactor #1115

Merged
merged 9 commits into from
Feb 26, 2024
4 changes: 1 addition & 3 deletions kubernetes/loculus/values.yaml
@@ -108,9 +108,7 @@ instances:
 preprocessing:
   image: ghcr.io/loculus-project/preprocessing-nextclade
   args:
-    - "python"
-    - "main.py"
-    - "--log-level=DEBUG"
+    - "prepro"
   configFile:
     log_level: DEBUG
     nextclade_dataset_name: nextstrain/mpox/all-clades
4 changes: 4 additions & 0 deletions preprocessing/nextclade/.gitignore
@@ -1,7 +1,11 @@
# Python
venv/
__pycache__/
.mypy_cache/
.ruff_cache/

# Editors
.idea/
.vscode/

config.yaml
14 changes: 14 additions & 0 deletions preprocessing/nextclade/.justfile
@@ -0,0 +1,14 @@
all: ruff_format ruff_check run_mypy

ruff_format:
    ruff format

ruff_check:
    ruff check --fix

run_mypy:
    mypy -p src --python-version 3.12 --pretty

install_dev_deps:
    micromamba install -f dev_dependencies.txt --rc-file .mambarc --platform osx-64

6 changes: 6 additions & 0 deletions preprocessing/nextclade/.mambarc
@@ -0,0 +1,6 @@
channels:
- conda-forge
- bioconda
repodata_use_zst: true
channel_priority: strict
download_threads: 20
4 changes: 4 additions & 0 deletions preprocessing/nextclade/.mypy.ini
@@ -0,0 +1,4 @@
[mypy]

[mypy-Bio.*]
ignore_missing_imports = True
18 changes: 15 additions & 3 deletions preprocessing/nextclade/Dockerfile
@@ -1,6 +1,18 @@
FROM mambaorg/micromamba:1.5.6

COPY --chown=$MAMBA_USER:$MAMBA_USER environment.yml /tmp/env.yaml
RUN micromamba install -y -n base -f /tmp/env.yaml --platform=linux-64 && \
micromamba clean --all --yes
COPY --chown=$MAMBA_USER:$MAMBA_USER .mambarc /tmp/.mambarc

RUN micromamba config set extract_threads 1 \
&& micromamba install -y -n base -f /tmp/env.yaml --platform=linux-64 --rc-file /tmp/.mambarc \
&& micromamba clean --all --yes

COPY --chown=$MAMBA_USER:$MAMBA_USER . /package

# Set the environment variable to activate the conda environment
ARG MAMBA_DOCKERFILE_ACTIVATE=1

RUN ls -alht /package

COPY . .
# Install the package
RUN pip install /package
33 changes: 19 additions & 14 deletions preprocessing/nextclade/README.md
@@ -19,27 +19,27 @@ This SARS-CoV-2 preprocessing pipeline is only for demonstration purposes. It re
1. Install `conda`/`mamba`/`micromamba`: see e.g. [micromamba installation docs](https://mamba.readthedocs.io/en/latest/micromamba-installation.html#umamba-install)
2. Install environment:

```bash
mamba env create -n loculus-nextclade -f environment.yml
```
```bash
mamba env create -n loculus-nextclade -f environment.yml
```

3. Start backend (see [backend README](../backend/README.md))
4. Submit sequences to backend

```bash
curl -X 'POST' 'http://localhost:8079/submit?username=testuser' \
-H 'accept: application/json' \
-H 'Content-Type: multipart/form-data' \
-F 'metadataFile=@testdata/metadata.tsv;type=text/tab-separated-values' \
-F 'sequenceFile=@testdata/sequences.fasta'
```
```bash
curl -X 'POST' 'http://localhost:8079/submit?username=testuser' \
-H 'accept: application/json' \
-H 'Content-Type: multipart/form-data' \
-F 'metadataFile=@testdata/metadata.tsv;type=text/tab-separated-values' \
-F 'sequenceFile=@testdata/sequences.fasta'
```

5. Run pipeline

```bash
mamba activate loculus-nextclade
python main.py
```
```bash
mamba activate loculus-nextclade
python main.py
```

### Docker

@@ -54,3 +54,8 @@ Run (TODO: port-forwarding):
```bash
docker run -it --platform=linux/amd64 --network host --rm nextclade_processing python main.py
```

## Development

- Install Ruff to lint/format
- Use `mypy` to check types: `mypy -p src --python-version 3.12`
4 changes: 4 additions & 0 deletions preprocessing/nextclade/dev_dependencies.txt
@@ -0,0 +1,4 @@
mypy
ruff
types-PyYAML
types-requests
8 changes: 4 additions & 4 deletions preprocessing/nextclade/environment.yml
@@ -5,8 +5,8 @@ channels:
dependencies:
- python=3.12
- biopython=1.83
- nextclade=3.0.1
- click=8.1
- nextclade=3.2.1
- PyYAML
- requests=2.31
- snakemake=8.2.3
- PyYAML
- pyjwt=2.8
- pip
21 changes: 21 additions & 0 deletions preprocessing/nextclade/pyproject.toml
@@ -0,0 +1,21 @@
[tool.ruff]
line-length = 100

[tool.ruff.lint]
# ignore-init-module-imports = true
select = ["E", "F", "B"]

[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

[tool.hatch.build.targets.wheel]
packages = ["src/loculus_preprocessing"]

# Basic package config to make it installable
[project]
name = "loculus_preprocessing"
version = "0.1.0"

[project.scripts]
prepro = "loculus_preprocessing.__main__:cli_entry"
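
The `[project.scripts]` entry uses the `module:attribute` form: once the package is installed, the `prepro` command imports the named module and calls the named attribute. A minimal sketch of that lookup follows. `resolve_entry_point` is a hypothetical helper written for illustration, not part of this PR or of any packaging tool, and it resolves a standard-library target so that it runs without the package installed.

```python
import importlib


def resolve_entry_point(spec: str):
    """Resolve a 'package.module:attribute' string to the object it names."""
    module_name, _, attr_path = spec.partition(":")
    obj = importlib.import_module(module_name)
    # The attribute part may itself be dotted, e.g. "module:Class.method".
    for part in attr_path.split("."):
        obj = getattr(obj, part)
    return obj


# Stand-in target from the standard library; the real entry would be
# "loculus_preprocessing.__main__:cli_entry".
dumps = resolve_entry_point("json:dumps")
print(dumps({"ok": True}))  # prints {"ok": true}
```

In this PR the target is `cli_entry`, which is why the Kubernetes `args` can shrink to the single word `prepro`.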
4 changes: 4 additions & 0 deletions preprocessing/nextclade/src/loculus_preprocessing/__init__.py
@@ -0,0 +1,4 @@
from . import backend as backend
from . import config as config
from . import datatypes as datatypes
from . import prepro as prepro
20 changes: 20 additions & 0 deletions preprocessing/nextclade/src/loculus_preprocessing/__main__.py
@@ -0,0 +1,20 @@
import logging

from .config import get_config
from .prepro import run


def cli_entry() -> None:
    logging.basicConfig(level=logging.INFO)

    config = get_config()

    logging.getLogger().setLevel(config.log_level)

    logging.info("Using config: {}".format(config))

    run(config)


if __name__ == "__main__":
    cli_entry()
61 changes: 61 additions & 0 deletions preprocessing/nextclade/src/loculus_preprocessing/backend.py
@@ -0,0 +1,61 @@
"""
Functions to interface with the backend
"""

import datetime as dt
import logging

import jwt
import requests

from .config import Config


class JwtCache:
    def __init__(self) -> None:
        self.token: str = ""
        self.expiration: dt.datetime = dt.datetime.min

    def get_token(self) -> str | None:
        # Only use token if it's got more than 5 minutes left
        if self.token and self.expiration > dt.datetime.now() + dt.timedelta(minutes=5):
            return self.token
        return None

    def set_token(self, token: str, expiration: dt.datetime) -> None:
        self.token = token
        self.expiration = expiration


jwt_cache = JwtCache()


def get_jwt(config: Config) -> str:
    if cached_token := jwt_cache.get_token():
        logging.debug("Using cached JWT")
        return cached_token

    url = config.keycloak_host.rstrip("/") + "/" + config.keycloak_token_path.lstrip("/")
    data = {
        "client_id": "test-cli",
        "username": config.keycloak_user,
        "password": config.keycloak_password,
        "grant_type": "password",
    }

    logging.debug(f"Requesting JWT from {url}")

    with requests.post(url, data=data) as response:
        if response.ok:
            logging.debug("JWT fetched successfully.")
            token = response.json()["access_token"]
            decoded = jwt.decode(token, options={"verify_signature": False})
            expiration = dt.datetime.fromtimestamp(decoded.get("exp", 0))
            jwt_cache.set_token(token, expiration)
            return token
        else:
            error_msg = (
                f"Fetching JWT failed with status code {response.status_code}: {response.text}"
            )
            logging.error(error_msg)
            raise Exception(error_msg)
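
The caching rule in `JwtCache` (reuse a token only while it has more than five minutes left) can be tried without a Keycloak server. The sketch below is illustrative only. It repeats the cache class, and it swaps the `jwt.decode` call for a standard-library payload decode so that it runs without PyJWT. `make_unsigned_token` and `unverified_expiration` are hypothetical helpers that are not part of this PR, and the tokens are made up.

```python
import base64
import datetime as dt
import json
from typing import Optional


class JwtCache:
    """Same shape as the cache in the diff: a token plus its expiry time."""

    def __init__(self) -> None:
        self.token: str = ""
        self.expiration: dt.datetime = dt.datetime.min

    def get_token(self) -> Optional[str]:
        # A token counts as usable only with more than 5 minutes left.
        if self.token and self.expiration > dt.datetime.now() + dt.timedelta(minutes=5):
            return self.token
        return None

    def set_token(self, token: str, expiration: dt.datetime) -> None:
        self.token = token
        self.expiration = expiration


def make_unsigned_token(claims: dict) -> str:
    """Hypothetical helper: build an unsigned, JWT-shaped string for the demo."""

    def b64url(obj: dict) -> str:
        raw = json.dumps(obj).encode()
        return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()

    return b64url({"alg": "none"}) + "." + b64url(claims) + "."


def unverified_expiration(token: str) -> dt.datetime:
    """Read the `exp` claim from the payload without checking the signature."""
    payload = token.split(".")[1]
    payload += "=" * (-len(payload) % 4)  # restore stripped base64 padding
    claims = json.loads(base64.urlsafe_b64decode(payload))
    return dt.datetime.fromtimestamp(claims.get("exp", 0))


cache = JwtCache()
now = dt.datetime.now()

fresh = make_unsigned_token({"exp": int((now + dt.timedelta(minutes=30)).timestamp())})
cache.set_token(fresh, unverified_expiration(fresh))
print(cache.get_token() == fresh)  # True: well outside the 5-minute margin

stale = make_unsigned_token({"exp": int((now + dt.timedelta(minutes=2)).timestamp())})
cache.set_token(stale, unverified_expiration(stale))
print(cache.get_token())  # None: under 5 minutes left, so a new token is due
```

Reading `exp` without verifying the signature is reasonable here only because the value decides when to refresh the cache, not whether to trust the token.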
103 changes: 103 additions & 0 deletions preprocessing/nextclade/src/loculus_preprocessing/config.py
@@ -0,0 +1,103 @@
import argparse
import copy
import dataclasses
import logging
import os
from dataclasses import dataclass
from types import UnionType
from typing import Type, get_args

import yaml

logger = logging.getLogger(__name__)

# Dataclass types for which we can generate CLI arguments
CLI_TYPES = [str, int, float, bool]


@dataclass
class Config:
    backend_host: str = "http://127.0.0.1:8079"
    keycloak_host: str = "http://127.0.0.1:8083"
    keycloak_user: str = "dummy_preprocessing_pipeline"
    keycloak_password: str = "dummy_preprocessing_pipeline"
    keycloak_token_path: str = "realms/loculusRealm/protocol/openid-connect/token"
    nextclade_dataset_name: str = "nextstrain/mpox/all-clades"
    nextclade_dataset_version: str = "2024-01-16--20-31-02Z"
    config_file: str | None = None
    log_level: str = "DEBUG"
    genes: dict[str, int] = dataclasses.field(default_factory=dict)
    keep_tmp_dir: bool = False
    reference_length: int = 197209
    batch_size: int = 5


def load_config_from_yaml(config_file: str, config: Config) -> Config:
    config = copy.deepcopy(config)
    with open(config_file, "r") as file:
        yaml_config = yaml.safe_load(file)
        logger.debug(f"Loaded config from {config_file}: {yaml_config}")
    for key, value in yaml_config.items():
        if value is not None and hasattr(config, key):
            setattr(config, key, value)
    return config


def base_type(field_type: type) -> type:
    """Pull the non-None type from a Union, e.g. `str | None` -> `str`"""
    if type(field_type) is UnionType:
        return next(t for t in get_args(field_type) if t is not type(None))
    return field_type


def kebab(s: str) -> str:
    """Convert snake_case to kebab-case"""
    return s.replace("_", "-")


def generate_argparse_from_dataclass(config_cls: Type[Config]) -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Command-line arguments for Config class")
    for field in dataclasses.fields(config_cls):
        field_name = kebab(field.name)
        field_type = base_type(field.type)
        if field_type not in CLI_TYPES:
            continue
        if field_type is bool:  # Paired flags; default=None so an unset flag overrides nothing
            parser.add_argument(f"--{field_name}", action="store_true", default=None)
            parser.add_argument(
                f"--no-{field_name}", dest=field.name, action="store_false", default=None
            )
        else:
            parser.add_argument(f"--{field_name}", type=field_type)
    return parser


def get_config() -> Config:
    # Config precedence: CLI args > ENV variables > config file > default

    env_log_level = os.environ.get("PREPROCESSING_LOG_LEVEL")
    if env_log_level:
        logging.basicConfig(level=env_log_level)

    parser = generate_argparse_from_dataclass(Config)
    args = parser.parse_args()

    # Load default config
    config = Config()

    # Overwrite config with config in config_file
    if args.config_file:
        config = load_config_from_yaml(args.config_file, config)

    # Use environment variables if available (values are assigned as raw strings)
    for key in config.__dict__.keys():
        env_var = f"PREPROCESSING_{key.upper()}"
        if env_var in os.environ:
            setattr(config, key, os.environ[env_var])

    # Overwrite config with CLI args
    for key, value in args.__dict__.items():
        if value is not None:
            setattr(config, key, value)

    return config
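
The precedence stated in `get_config` (CLI args > ENV variables > config file > default) can be sketched with plain dictionaries standing in for the parsed YAML file, the environment, and the parsed CLI arguments. `DemoConfig`, `coerce`, and `resolve` are hypothetical names used only for illustration. The string-to-type conversion of environment values is an addition of this sketch: the code in the diff assigns the raw environment string.

```python
from dataclasses import dataclass, fields


@dataclass
class DemoConfig:  # hypothetical stand-in with one field per basic type
    log_level: str = "DEBUG"
    batch_size: int = 5
    keep_tmp_dir: bool = False


def coerce(raw: str, target: type):
    """Convert an environment string to the field's declared type."""
    if target is bool:
        return raw.strip().lower() in {"1", "true", "yes", "on"}
    return target(raw)


def resolve(file_values: dict, environ: dict, cli_values: dict) -> DemoConfig:
    config = DemoConfig()  # lowest layer: dataclass defaults
    for field in fields(DemoConfig):
        name = field.name
        # Layer 2: config file, ignoring missing or null entries.
        if file_values.get(name) is not None:
            setattr(config, name, file_values[name])
        # Layer 3: PREPROCESSING_<NAME> environment variables.
        env_var = f"PREPROCESSING_{name.upper()}"
        if env_var in environ:
            setattr(config, name, coerce(environ[env_var], field.type))
        # Layer 4: CLI arguments, where None means "not given".
        if cli_values.get(name) is not None:
            setattr(config, name, cli_values[name])
    return config


result = resolve(
    file_values={"log_level": "INFO", "batch_size": 10},
    environ={"PREPROCESSING_BATCH_SIZE": "50", "PREPROCESSING_KEEP_TMP_DIR": "true"},
    cli_values={"log_level": "WARNING", "batch_size": None},
)
print(result)  # DemoConfig(log_level='WARNING', batch_size=50, keep_tmp_dir=True)
```

Here `log_level` comes from the CLI, `batch_size` from the environment (which beats the file value of 10), and `keep_tmp_dir` from the environment (which beats the default).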