CONTRIBUTING.qmd

---
title: Contributing to OpenProblems
format: gfm
toc: true
toc-depth: 2
engine: knitr
---

[OpenProblems](https://openproblems.bio) is a community effort, and everyone is welcome to contribute. This project is hosted on [github.com/openproblems-bio/openproblems-v2](https://github.com/openproblems-bio/openproblems-v2).  You can find a full in depth guide on how to contribute to this project on the [OpenProblems website](https://openproblems.bio/documentation/).

## Code of conduct {#code-of-conduct}

We as members, contributors, and leaders pledge to make participation in our community a harassment-free experience for everyone, regardless of age, body size, visible or invisible disability, ethnicity, sex characteristics, gender identity and expression, level of experience, education, socio-economic status, nationality, personal appearance, race, caste, color, religion, or sexual identity and orientation.

We pledge to act and interact in ways that contribute to an open, welcoming, diverse, inclusive, and healthy community.

Our full [Code of Conduct](CODE_OF_CONDUCT.md) is adapted from the [Contributor Covenant](https://www.contributor-covenant.org), version 2.1.


## Requirements

To use this repository, please install the following dependencies:

* Bash
* Java (Java 11 or higher)
* Docker (Instructions [here](https://docs.docker.com/get-docker/))
* Nextflow (Optional, though [very easy to install](https://www.nextflow.io/index.html#GetStarted))

## Quick start

The `src/` folder contains modular software components for running a modality alignment benchmark. Running the full pipeline is quite easy.

**Step 0, fetch Viash and Nextflow**

```bash
mkdir $HOME/bin
curl -fsSL get.viash.io | bash -s -- --bin $HOME/bin --tools false
curl -s https://get.nextflow.io | bash; mv nextflow $HOME/bin
```

Make sure that Viash and Nextflow are on the $PATH by checking whether the following commands work:

```{bash}
viash -v
nextflow -v
```

**Step 1, download test resources:** by running the following command.

```bash
viash run src/common/sync_test_resources/config.vsh.yaml
```

    Completed 256.0 KiB/7.2 MiB (302.6 KiB/s) with 6 file(s) remaining
    Completed 512.0 KiB/7.2 MiB (595.8 KiB/s) with 6 file(s) remaining
    Completed 768.0 KiB/7.2 MiB (880.3 KiB/s) with 6 file(s) remaining
    Completed 1.0 MiB/7.2 MiB (1.1 MiB/s) with 6 file(s) remaining    
    Completed 1.2 MiB/7.2 MiB (1.3 MiB/s) with 6 file(s) remaining
    ...

**Step 2, build all the components:** in the `src/` folder as standalone executables in the `target/` folder. Use the `-q 'xxx'` parameter to build a subset of components in the repository.

```bash
viash ns build --query 'label_projection|common' --parallel --setup cachedbuild
```

    In development mode with 'dev'.
    Exporting process_dataset (label_projection) =docker=> target/docker/label_projection/process_dataset
    Exporting accuracy (label_projection/metrics) =docker=> target/docker/label_projection/metrics/accuracy
    Exporting random_labels (label_projection/control_methods) =docker=> target/docker/label_projection/control_methods/random_labels
    [notice] Building container 'label_projection/control_methods_random_labels:dev' with Dockerfile
    [notice] Building container 'common/data_processing_dataset_concatenate:dev' with Dockerfile
    [notice] Building container 'label_projection/metrics_accuracy:dev' with Dockerfile
    ...

Viash will build a whole namespace (`ns`) into executables and Nextflow pipelines into the `target/docker` and `target/nextflow` folders respectively.
By adding the `-q/--query` flag, you can filter which components to build using a regex.
By adding the `--parallel` flag, these components are built in parallel (otherwise it will take a really long time).
The flag `--setup cachedbuild` will automatically start building Docker containers for each of these methods.

The command might take a while to run, since it is building a docker container for each of the components. 

**Step 3, run the pipeline with nextflow.** To do so, run the bash script located at `src/tasks/label_projection/workflows/run_nextflow.sh`:

```bash
src/tasks/label_projection/workflows/run/run_test.sh
```

    N E X T F L O W  ~  version 22.04.5
    Launching `src/tasks/label_projection/workflows/run/main.nf` [pensive_turing] DSL2 - revision: 16b7b0c332
    executor >  local (28)
    [f6/f89435] process > run_wf:run_methods:true_labels:true_labels_process (pancreas.true_labels)                         [100%] 1 of 1 ✔
    [ed/d674a2] process > run_wf:run_methods:majority_vote:majority_vote_process (pancreas.majority_vote)                   [100%] 1 of 1 ✔
    [15/f0a427] process > run_wf:run_methods:random_labels:random_labels_process (pancreas.random_labels)                   [100%] 1 of 1 ✔
    [02/969d05] process > run_wf:run_methods:knn:knn_process (pancreas.knn)                                                 [100%] 1 of 1 ✔
    [90/5fdf9a] process > run_wf:run_methods:mlp:mlp_process (pancreas.mlp)                                                 [100%] 1 of 1 ✔
    [c7/dee2e5] process > run_wf:run_methods:logistic_regression:logistic_regression_process (pancreas.logistic_regression) [100%] 1 of 1 ✔
    [83/3ba0c9] process > run_wf:run_methods:scanvi:scanvi_process (pancreas.scanvi)                                        [100%] 1 of 1 ✔
    [e3/2c298e] process > run_wf:run_methods:seurat_transferdata:seurat_transferdata_process (pancreas.seurat_transferdata) [100%] 1 of 1 ✔
    [d6/7212ab] process > run_wf:run_methods:xgboost:xgboost_process (pancreas.xgboost)                                     [100%] 1 of 1 ✔
    [b6/7dc1a7] process > run_wf:run_metrics:accuracy:accuracy_process (pancreas.scanvi)                                    [100%] 9 of 9 ✔
    [be/7d4da4] process > run_wf:run_metrics:f1:f1_process (pancreas.scanvi)                                                [100%] 9 of 9 ✔
    [89/dcd77a] process > run_wf:aggregate_results:extract_scores:extract_scores_process (combined)                         [100%] 1 of 1 ✔

## Project structure

High level overview:
    .
    ├── bin                    Helper scripts for building the project and developing a new component.
    ├── resources_test         Datasets for testing components. If you don't have this folder, run **Step 1** above.
    ├── src                    Source files for each component in the pipeline.
    │   ├── common             Common processing components.
    │   ├── datasets           Components and pipelines for building the 'Common datasets'
    │   ├── label_projection   Source files related to the 'Label projection' task.
    │   └── ...                Other tasks.
    └── target                 Executables generated by viash based on the components listed under `src/`.
        ├── docker             Bash executables which can be used from a terminal.
        └── nextflow           Nextflow modules which can be used as a standalone pipeline or as part of a bigger pipeline.

Detailed overview of a task folder (e.g. `src/tasks/label_projection`):

    src/tasks/label_projection/
    ├── api                    Specs for the components in this task.
    ├── control_methods        Control methods which serve as quality control checks for the benchmark.
    ├── docs                   Task documentation
    ├── methods                Label projection method components.
    ├── metrics                Label projection metric components.
    ├── resources_scripts      The scripts needed to run the benchmark.
    ├── resources_test_scripts The scripts needed to generate the test resources (which are needed for unit testing).
    ├── process_dataset        A component that masks a common dataset for use in the benchmark
    └── workflows              The benchmarking workflow.


Detailed overview of the `src/datasets` folder:

    src/datasets/
    ├── api                    Specs for the data loaders and normalisation methods.
    ├── loaders                Components for ingesting datasets from a source.
    ├── normalization          Normalization method components.
    ├── processors             Other preprocessing components (e.g. HVG and PCA).
    ├── resource_scripts       The scripts needed to generate the common datasets.
    ├── resource_test_scripts  The scripts needed to generate the test resources (which are needed for unit testing).
    └── workflows              The workflow which generates the common datasets.

## Adding a Viash component

[Viash](https://viash.io) allows you to create pipelines
in Bash or Nextflow by wrapping Python, R, or Bash scripts into reusable components.


You can start creating a new component by [creating a Viash component](https://viash.io/guide/component/creation/docker.html). 


```{bash, include=FALSE}

mkdir -p src/tasks/label_projection/methods/foo

cat > src/tasks/label_projection/methods/foo/config.vsh.yaml << HERE
__merge__: ../../api/comp_method.yaml
functionality:
  name: "foo"
  namespace: "label_projection/methods"
  # A multiline description of your method.
  description: "Todo: fill in"
  info:
    type: method

    # a short label of your method
    label: Foo

    # A multiline description of your method.
    description: "Todo: fill in"

    # A short summary of the method description.
    summary: "Todo: fill in"

    # Add the bibtex reference to the "src/common/library.bib" if it is not already there.
    reference: "cover1967nearest"

    repository_url: "https://github.com/openproblems-bio/openproblems-v2"
    documentation_url: "https://openproblems.bio/documentation/"
    preferred_normalization: log_cp10k
  resources:
    - type: python_script
      path: script.py
platforms:
  - type: docker
    image: openproblems/base_python:1.0.0
    setup:
      - type: python
        packages: [scikit-learn]
  - type: nextflow
    directives:
      label: [midtime, midmem, lowcpu]
HERE

cat > src/tasks/label_projection/methods/foo/script.py << HERE
import anndata as ad
import numpy as np

## VIASH START
# This code-block will automatically be replaced by Viash at runtime.
par = {
    'input_train': 'resources_test/label_projection/pancreas/train.h5ad',
    'input_test': 'resources_test/label_projection/pancreas/test.h5ad',
    'output': 'output.h5ad'
}
meta = {
    'functionality_name': 'foo'
}
## VIASH END

print("Load data", flush=True)
input_train = ad.read_h5ad(par['input_train'])
input_test = ad.read_h5ad(par['input_test'])

print("Create predictions", flush=True)
input_test.obs["label_pred"] = "foo"

print("Add method name to uns", flush=True)
input_test.uns["method_id"] = meta["functionality_name"]

print("Write output to file", flush=True)
input_test.write_h5ad(par["output"], compression="gzip")
HERE
```

For example, to create a new Python-based method named `foo`, create a Viash config at `src/tasks/label_projection/methods/foo/config.vsh.yaml`:

```{embed lang="yaml"}
src/tasks/label_projection/methods/foo/config.vsh.yaml
```

And create a script at `src/tasks/label_projection/methods/foo/script.py`:

```{embed lang="python"}
src/tasks/label_projection/methods/foo/script.py
```


## Running a component from CLI

You can view the interface of the executable by running the executable with the `-h` or `--help` parameter.

```{bash}
viash run src/tasks/label_projection/methods/foo/config.vsh.yaml -- --help
```

Before running a new component, youy will need to create the docker container:

```{bash}
viash run src/tasks/label_projection/methods/foo/config.vsh.yaml -- ---setup cachedbuild

```

You can **run the component** as follows:

```{bash}
viash run src/tasks/label_projection/methods/foo/config.vsh.yaml -- \
  --input_train resources_test/label_projection/pancreas/train.h5ad \
  --input_test resources_test/label_projection/pancreas/test.h5ad \
  --output resources_test/label_projection/pancreas/prediction.h5ad
```

## Building a component

`viash` has several helper functions to help you quickly develop a component.

With **`viash build`**, you can turn the component into a standalone executable. 
This standalone executable you can give to somebody else, and they will be able to 
run it, provided that they have Bash and Docker installed.

```{bash}
viash build src/tasks/label_projection/methods/foo/config.vsh.yaml \
  -o target/docker/label_projection/methods/foo
```

:::{.callout-note}
The `viash ns build` component does a much better job of setting up 
a collection of components.
:::

You can now view the same interface of the executable by running the executable with the `-h` parameter.

```{bash}
target/docker/label_projection/methods/foo/foo -h
```

Or **run the component** as follows:

```{bash}
target/docker/label_projection/methods/foo/foo \
  --input_train resources_test/label_projection/pancreas/train.h5ad \
  --input_test resources_test/label_projection/pancreas/test.h5ad \
  --output resources_test/label_projection/pancreas/prediction.h5ad
```


## Unit testing a component

The [method API specifications](src/tasks/label_projection/api/comp_method.yaml) comes with a generic unit test for free. 
This means you can unit test your component using the **`viash test`** command.

```{bash}
viash test src/tasks/label_projection/methods/foo/config.vsh.yaml
```

```{bash include=FALSE}
cat > src/tasks/label_projection/methods/foo/script.py << HERE
import anndata as ad
import numpy as np

## VIASH START
# This code-block will automatically be replaced by Viash at runtime.
par = {
    'input_train': 'resources_test/label_projection/pancreas/train.h5ad',
    'input_test': 'resources_test/label_projection/pancreas/test.h5ad',
    'output': 'output.h5ad'
}
meta = {
    'functionality_name': 'foo'
}
## VIASH END

print("Load data", flush=True)
input_train = ad.read_h5ad(par['input_train'])
input_test = ad.read_h5ad(par['input_test'])

print("Not creating any predictions!!!", flush=True)
# input_test.obs["label_pred"] = "foo"

print("Not adding method name to uns!!!", flush=True)
# input_test.uns["method_id"] = meta["functionality_name"]

print("Write output to file", flush=True)
input_test.write_h5ad(par["output"], compression="gzip")
HERE
```

Let's introduce a bug in the script and try running the test again. For instance:

```{embed lang="python"}
src/tasks/label_projection/methods/foo/script.py
```

If we now run the test, we should get an error since we didn't create all of the required output slots.

```{bash error=TRUE}
viash test src/tasks/label_projection/methods/foo/config.vsh.yaml
```


## More information

The [Viash reference docs](https://viash.io/reference/config/) page provides information on all of the available fields in a Viash config, and the [Guide](https://viash.io/guide/) will help you get started with creating components from scratch.


<!-- cleaning up temporary files -->

```{bash, echo=FALSE}
rm -r src/tasks/label_projection/methods/foo target/docker/label_projection/methods/foo
```

## Branch Naming Conventions

### Category

A git branch should start with a category. Pick one of these: feature, bugfix, hotfix, or test.

* `feature` is for adding, refactoring or removing a feature
* `bugfix` is for fixing a bug
* `hotfix` is for changing code with a temporary solution and/or without following the usual process (usually because of an emergency)
* `test` is for experimenting outside of an issue/ticket
* `doc` is for adding, changing or removing documentation

### Reference

After the category, there should be a "`/`" followed by the reference of the issue/ticket/task you are working on. If there's no reference, just add no-ref. With task it is meant as benchmarking task e.g. batch_integration

### Description

After the reference, there should be another "`/`" followed by a description which sums up the purpose of this specific branch. This description should be short and "kebab-cased".

By default, you can use the title of the issue/ticket you are working on. Just replace any special character by "`-`".

### To sum up, follow this pattern when branching:

```bash
git branch <category/reference/description-in-kebab-case>
```

### Examples

* You need to add, refactor or remove a feature: `git branch feature/issue-42/create-new-button-component`
* You need to fix a bug: `git branch bugfix/issue-342/button-overlap-form-on-mobile`
* You need to fix a bug really fast (possibly with a temporary solution): `git branch hotfix/no-ref/registration-form-not-working`
* You need to experiment outside of an issue/ticket: `git branch test/no-ref/refactor-components-with-atomic-design`

### References

* [a-simplified-convention-for-naming-branches-and-commits-in-git](https://dev.to/varbsan/a-simplified-convention-for-naming-branches-and-commits-in-git-il4)