diff --git a/.github/workflows/build.yml b/.github/workflows/build.yml index a96077b..1543435 100644 --- a/.github/workflows/build.yml +++ b/.github/workflows/build.yml @@ -66,7 +66,7 @@ jobs: - name: Cache postgres use if: steps.cache-postgres.outputs.cache-hit == 'true' run: docker image load --input ~/postgres.tar - + - run: sudo apt-get -y install graphviz - name: Install Python dependencies run: | python -m pip install -U pip diff --git a/docs/guide/bulk_rna_seq.ipynb b/docs/guide/bulk_rna_seq.ipynb index 8b43567..63db031 100644 --- a/docs/guide/bulk_rna_seq.ipynb +++ b/docs/guide/bulk_rna_seq.ipynb @@ -5,15 +5,7 @@ "id": "a1c40541-e05e-48a3-8b50-33baf3d6d0d4", "metadata": {}, "source": [ - "# Track bulk RNA-seq Nextflow runs" - ] - }, - { - "cell_type": "markdown", - "id": "56db6387-dab7-4ee6-84d5-f4750fc43439", - "metadata": {}, - "source": [ - "## Background" + "# Track Nextflow workflows" ] }, { @@ -21,11 +13,9 @@ "id": "cd12ac1a-73e6-44e2-a854-6fa5e52cfd41", "metadata": {}, "source": [ - "[Nextflow](https://www.nextflow.io/) is a workflow management system used for orchestrating and executing scientific workflows across different computational environments. Fundamental features include ease of scalability, portability, and reproducibility, as it allows researchers to define complex workflows in a platform-agnostic manner and run them efficiently on various computing infrastructures.\n", + "[Nextflow](https://www.nextflow.io/) is a workflow management system used for executing scientific workflows across platforms scalably, portably, and reproducibly.\n", "\n", - "While Nextflow together with nf-tower focuses on executing reproducible and trackable bioinformatics pipelines, LaminDB offers a provenance-aware data lake.\n", - "\n", - "Here, we will demonstrate how to track Nextflow workflow execution and generated biological entities with [lamin](https://lamin.ai/)." + "The workflow [nf-core rnaseq](https://nf-co.re/rnaseq/3.12.0) is arguably one of the most popular pipelines for bulk RNA sequencing using STAR, RSEM, HISAT2 or Salmon with gene/isoform counts and extensive quality control.\n" ] }, { @@ -50,27 +40,14 @@ "cell_type": "code", "execution_count": null, "id": "756e24d7-b2a0-4a10-bf6c-f532e0cc323b", - "metadata": {}, - "outputs": [], - "source": [ - "!lamin init --storage ./nextflow_rna_seq --schema bionty" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "ca2926f4", "metadata": { "tags": [ - "hide-cell" + "hide-output" ] }, "outputs": [], "source": [ - "# avoids download bars\n", - "import bionty as bt\n", - "\n", - "bt.Gene(species=\"saccharomyces cerevisiae\")" + "!lamin init --storage . 
--name nextflow-bulkrna" ] }, { @@ -81,85 +58,58 @@ "outputs": [], "source": [ "import lamindb as ln\n", - "import lnschema_bionty as lb\n", - "import pandas as pd\n", - "import os\n", - "import anndata as ad\n", - "from pathlib import Path\n", - "\n", - "ln.settings.verbosity = 3 # show hints" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "ad2ba7f8", - "metadata": {}, - "outputs": [], - "source": [ - "lb.settings.species = \"saccharomyces cerevisiae\"" + "from pathlib import Path" ] }, { "cell_type": "markdown", - "id": "ecb68cf2-1188-4f8b-a2ab-01c60d5779b8", + "id": "cff2c742", "metadata": {}, "source": [ - "## Tracking nf-core rnaseq" + "## Download test data" ] }, { - "cell_type": "markdown", - "id": "3e1224fd", - "metadata": {}, + "cell_type": "code", + "execution_count": null, + "id": "3f03dc98-c23d-439e-838d-147134470cf7", + "metadata": { + "tags": [ + "hide-output" + ] + }, + "outputs": [], "source": [ - "The Nextflow pipeline [nf-core rnaseq](https://nf-co.re/rnaseq/3.12.0) is arguably one of the most popular pipelines for bulk RNA sequencing using STAR, RSEM, HISAT2 or Salmon with gene/isoform counts and extensive quality control.\n", - "\n", - "First, we create a new Transform object for our pipeline run." + "!git clone https://github.com/nf-core/test-datasets --single-branch --branch rnaseq3 --depth 1" ] }, { - "cell_type": "code", - "execution_count": null, - "id": "05a03bf4-81e5-45b4-a126-002def01f81a", + "cell_type": "markdown", + "id": "be7f913a", "metadata": {}, - "outputs": [], "source": [ - "rna_seq_transform = ln.Transform(\n", - " name=\"nf-core rnaseq\",\n", - " version=\"3.11.2\",\n", - " type=\"pipeline\",\n", - " reference=\"https://github.com/laminlabs/nextflow-lamin-usecases/\",\n", - ").save()" + "To keep track of the download, let's create a \"Download\" transform and a track a run pointing to the reference url:" ] }, { "cell_type": "code", "execution_count": null, - "id": "e2344c38-d2ea-4545-914d-4f27e79a1599", + "id": "7b410782", "metadata": {}, "outputs": [], "source": [ - "ln.track(rna_seq_transform)" + "download = ln.Transform(name=\"Download\")\n", + "ln.track(\n", + " download, reference=\"https://github.com/nf-core/test-datasets\", reference_type=\"url\"\n", + ")" ] }, { "cell_type": "markdown", - "id": "b20dbc7d-0e75-4b06-8f7a-d540bffbdb44", + "id": "26d980c5", "metadata": {}, "source": [ - "We download the [test data](https://github.com/nf-core/test-datasets/tree/rnaseq3) for the pipeline which we track with Lamin." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "3f03dc98-c23d-439e-838d-147134470cf7", - "metadata": {}, - "outputs": [], - "source": [ - "%%capture command\n", - "!git clone https://github.com/nf-core/test-datasets --single-branch --branch rnaseq3" + "Let's register the files we need from the download; they'll automatically be linked against the download run:" ] }, { @@ -174,58 +124,83 @@ "outputs": [], "source": [ "input_fastqs_file = ln.File.from_dir(\"test-datasets/testdata/GSE110004/\")\n", - "sample_sheet_file = ln.File(\"test-datasets/samplesheet/v3.10/samplesheet_test.csv\")\n", "ln.save(input_fastqs_file)\n", + "sample_sheet_file = ln.File(\"test-datasets/samplesheet/v3.10/samplesheet_test.csv\")\n", "ln.save(sample_sheet_file)" ] }, { "cell_type": "markdown", - "id": "9085a716-cbc9-44dc-8d45-86f83dd7f80a", + "id": "f915ff7a", "metadata": {}, "source": [ - "Let’s set the input files for our run" + "Let's visualize data lineage for one of the files:" ] }, { "cell_type": "code", "execution_count": null, - "id": "82e03a05-8654-428e-b8a1-ce50f8708cfd", + "id": "15c73170", "metadata": {}, "outputs": [], "source": [ - "run = ln.Run.filter(created_by_id=\"DzTjkKse\").one()\n", - "run" + "sample_sheet_file.view_lineage()" ] }, { "cell_type": "markdown", "id": "ecb68cf2-1188-4f8b-a2ab-01c60d5779b8", "metadata": {}, "source": [ "## Track the nf-core rnaseq run" ] }, { "cell_type": "markdown", "id": "3e1224fd", "metadata": {}, "source": [ "Let's now track the Nextflow workflow:" ] }, { "cell_type": "code", "execution_count": null, - "id": "fac3ebd4-4ab3-400a-ae2a-40a31a4b7e5c", + "id": "05a03bf4-81e5-45b4-a126-002def01f81a", "metadata": {}, "outputs": [], "source": [ - "run.input_files.set(input_fastqs_file)\n", - "run.reference = \"lamin_rnaseq\"\n", - "run.reference_type = \"nextflow_name\"" + "nextflow_bulkrna = ln.Transform(\n", " name=\"nf-core rnaseq\",\n", " version=\"3.11.2\",\n", " type=\"pipeline\",\n", " reference=\"https://github.com/laminlabs/nextflow-lamin-usecases\",\n", ")\n", "\n", "ln.track(nextflow_bulkrna)" ] }, { "cell_type": "markdown", - "id": "7c00edf5-fa14-4e21-b1ab-cc4ac5c24a98", + "id": "670533a7", "metadata": {}, "source": [ - "To sync the workflow execution name with Lamin, we export it as an environment variable." + "If we now stage input files, they'll be tracked as run inputs (if input data is tracked in the cloud and registered in LaminDB, this is where we'd typically start):" ] }, { "cell_type": "code", "execution_count": null, - "id": "a923a34c-f3a0-40cc-af95-2c7e958d8e4a", - "metadata": {}, + "id": "27294132", "metadata": { "tags": [ "hide-output" ] }, "outputs": [], "source": [ - "os.environ[\"LAMINDB_RUN_ID\"] = run.reference" + "sample_sheet_file.stage()\n", "[input_fastq.stage() for input_fastq in input_fastqs_file]" ] }, { @@ -233,26 +208,37 @@ "id": "17f9905e-0a34-4335-b0c4-eb9b598c8eaf", "metadata": {}, "source": [ - "Next, we run the pipeline with its test dataset and track output files and features with Lamin."
+ "We'll pass the LaminDB run id to the nextflow run, so that we can easily find it from within Nextflow:" ] }, { "cell_type": "code", "execution_count": null, "id": "2219c55e", - "metadata": {}, + "metadata": { + "tags": [ + "hide-output" + ] + }, "outputs": [], "source": [ - "%%capture command\n", - "!nextflow run nf-core/rnaseq -r 3.11.2 -profile test,docker --outdir rna-seq-results -name $LAMINDB_RUN_ID -resume" + "!nextflow run nf-core/rnaseq -r 3.11.2 -profile test,docker --outdir rna-seq-results -name {ln.dev.run_context.run.id} -resume" + ] + }, + { + "cell_type": "markdown", + "id": "fb81c953", + "metadata": {}, + "source": [ + "## Register outputs" ] }, { "cell_type": "markdown", - "id": "a56e8a22-94dd-413b-989d-f13f59addbe6", + "id": "e02898d4", "metadata": {}, "source": [ - "As a first step, we ingest all multiqc plots from the pipeline run." + "### QC" ] }, { @@ -260,14 +246,13 @@ "execution_count": null, "id": "6e7b5f1d-b00b-43d3-bc46-83b14144a8ba", "metadata": { - "tags": [ - "hide-output" - ] + "tags": [] }, "outputs": [], "source": [ - "multiqc_results = ln.File.from_dir(\"rna-seq-results/multiqc/\", run=run)\n", - "ln.save(multiqc_results)" + "# this would register 240 files, we don't need them here\n", + "# multiqc_results = ln.File.from_dir(\"rna-seq-results/multiqc/\")\n", + "# ln.save(multiqc_results)" ] }, { @@ -277,16 +262,16 @@ "metadata": {}, "outputs": [], "source": [ - "multiqc_file = ln.File.select(key__icontains=\"multiqc_report.html\").one()\n", - "multiqc_file" + "multiqc_file = ln.File(\"rna-seq-results/multiqc/star_salmon/multiqc_report.html\")\n", + "multiqc_file.save()" ] }, { "cell_type": "markdown", - "id": "29bae36c-dac6-4314-b85b-f3afd7e47fbd", + "id": "6f107ee9", "metadata": {}, "source": [ - "We further ingest the merged Salmon gene counts since we plan on working further with the count table:" + "### Count matrix" ] }, { @@ -296,9 +281,8 @@ "metadata": {}, "outputs": [], "source": [ - "salmon_gene_counts_table_df = pd.read_csv(\n", - " \"rna-seq-results/salmon/salmon.merged.gene_counts.tsv\", sep=\"\\t\"\n", - ")" + "count_matrix = ln.File(\"rna-seq-results/salmon/salmon.merged.gene_counts.tsv\")\n", + "count_matrix.save()" ] }, { @@ -306,35 +290,23 @@ "id": "22c88eed-61e0-4d12-96bb-ea4e10f476c0", "metadata": {}, "source": [ - "We curate the count table analogously to {doc}`docs:bulkrna`." 
+ "To make it queryable by biological entities (genes, etc.), we can now proceed with: {doc}`docs:bulkrna`" ] }, { - "cell_type": "code", - "execution_count": null, - "id": "5b0ca2da-8bff-4750-972d-3f1c0cdb28e8", + "cell_type": "markdown", + "id": "9f607150", "metadata": {}, - "outputs": [], "source": [ - "salmon_gene_counts_table_df = salmon_gene_counts_table_df.T\n", - "var = pd.DataFrame(\n", - " {\"gene_name\": salmon_gene_counts_table_df.loc[\"gene_name\"].values},\n", - " index=salmon_gene_counts_table_df.loc[\"gene_id\"],\n", - ")\n", - "adata = ad.AnnData(salmon_gene_counts_table_df.iloc[2:].astype(\"float32\"), var=var)" + "## Visualize" ] }, { - "cell_type": "code", - "execution_count": null, - "id": "c1a58047-0c25-4632-b355-69610c6176f3", + "cell_type": "markdown", + "id": "54595d81", "metadata": {}, - "outputs": [], "source": [ - "curated_salmon_gene_counts_file = ln.File.from_anndata(\n", - " adata, description=\"Curated bulk RNA counts\", var_ref=lb.Gene.stable_id, run=run\n", - ")\n", - "ln.save(curated_salmon_gene_counts_file)" + "View data lineage:" ] }, { @@ -344,23 +316,33 @@ "metadata": {}, "outputs": [], "source": [ - "curated_salmon_gene_counts_file.describe()" + "count_matrix.view_lineage()" ] }, { "cell_type": "markdown", - "id": "58c6a407-1e87-4241-98e5-b069be057ea7", + "id": "ee3db779", "metadata": {}, "source": [ - "## Conclusion" + "View the database content:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e45c2584", + "metadata": {}, + "outputs": [], + "source": [ + "ln.view()" ] }, { "cell_type": "markdown", - "id": "8bba6911-70b6-4a99-a95e-6c9659435af6", + "id": "ad06a7e1", "metadata": {}, "source": [ - "Lamin makes it easy to track pipeline executions and to ingest input and output files that can subsequently be used for advanced downstream analyses. This is complementary to nf-tower." + "Clean up the test instance:" ] }, { @@ -369,12 +351,12 @@ "id": "5f3f95a8", "metadata": { "tags": [ - "hide-cell" + "hide-output" ] }, "outputs": [], "source": [ - "!lamin delete --force nextflow_rna_seq" + "!lamin delete --force nextflow-bulkrna" ] } ], @@ -394,7 +376,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.10.12" + "version": "3.9.16" }, "nbproject": { "id": "8124Vtle6ZrO", diff --git a/noxfile.py b/noxfile.py index cd81be0..5c1518c 100644 --- a/noxfile.py +++ b/noxfile.py @@ -16,6 +16,11 @@ def lint(session: nox.Session) -> None: @nox.session() def build(session): + session.run( + "pip", + "install", + "lamindb @ git+https://github.com/laminlabs/lamindb", + ) session.run(*"pip install -e .[dev]".split()) login_testuser1(session) run_pytest(session) diff --git a/pyproject.toml b/pyproject.toml index 4ee2c37..5f859e2 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -8,8 +8,7 @@ authors = [{name = "Lamin Labs", email = "laminlabs@gmail.com"}] readme = "README.md" dynamic = ["version", "description"] dependencies = [ - "nbproject", - "lamindb[bionty,aws]", + "graphviz" ] [project.urls]