From c0bb8987dce42422bc9f4b3edc215b300e48f53e Mon Sep 17 00:00:00 2001
From: Alex Wolf <f.alexander.wolf@gmail.com>
Date: Thu, 17 Aug 2023 23:03:18 +0200
Subject: [PATCH 1/3] =?UTF-8?q?=E2=99=BB=EF=B8=8F=20Refactor=20logic=20in?=
 =?UTF-8?q?=20bulk=20RNA-seq=20use=20case=20(#8)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

* ♻️ Refactor notebook

* 💚 Fix

* 👷 Add graphviz

* 👷 Try to get nextflow to work again

* 💄 Prettier data lineage plot

* 📝 Fix heading level
---
 .github/workflows/build.yml   |   2 +-
 docs/guide/bulk_rna_seq.ipynb | 258 ++++++++++++++++------------------
 noxfile.py                    |   5 +
 pyproject.toml                |   3 +-
 4 files changed, 127 insertions(+), 141 deletions(-)

diff --git a/.github/workflows/build.yml b/.github/workflows/build.yml
index a96077b..1543435 100644
--- a/.github/workflows/build.yml
+++ b/.github/workflows/build.yml
@@ -66,7 +66,7 @@ jobs:
       - name: Cache postgres use
         if: steps.cache-postgres.outputs.cache-hit == 'true'
         run: docker image load --input ~/postgres.tar
-
+      - run: sudo apt-get -y install graphviz
       - name: Install Python dependencies
         run: |
           python -m pip install -U pip
diff --git a/docs/guide/bulk_rna_seq.ipynb b/docs/guide/bulk_rna_seq.ipynb
index 8b43567..63db031 100644
--- a/docs/guide/bulk_rna_seq.ipynb
+++ b/docs/guide/bulk_rna_seq.ipynb
@@ -5,15 +5,7 @@
    "id": "a1c40541-e05e-48a3-8b50-33baf3d6d0d4",
    "metadata": {},
    "source": [
-    "# Track bulk RNA-seq Nextflow runs"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "56db6387-dab7-4ee6-84d5-f4750fc43439",
-   "metadata": {},
-   "source": [
-    "## Background"
+    "# Track Nextflow workflows"
    ]
   },
   {
@@ -21,11 +13,9 @@
    "id": "cd12ac1a-73e6-44e2-a854-6fa5e52cfd41",
    "metadata": {},
    "source": [
-    "[Nextflow](https://www.nextflow.io/) is a workflow management system used for orchestrating and executing scientific workflows across different computational environments. Fundamental features include ease of scalability, portability, and reproducibility, as it allows researchers to define complex workflows in a platform-agnostic manner and run them efficiently on various computing infrastructures.\n",
+    "[Nextflow](https://www.nextflow.io/) is a workflow management system used for executing scientific workflows across platforms scalably, portably, and reproducibly.\n",
     "\n",
-    "While Nextflow together with nf-tower focuses on executing reproducible and trackable bioinformatics pipelines, LaminDB offers a provenance-aware data lake.\n",
-    "\n",
-    "Here, we will demonstrate how to track Nextflow workflow execution and generated biological entities with [lamin](https://lamin.ai/)."
+    "The workflow [nf-core rnaseq](https://nf-co.re/rnaseq/3.12.0) is arguably one of the most popular pipelines for bulk RNA sequencing using STAR, RSEM, HISAT2 or Salmon with gene/isoform counts and extensive quality control.\n"
    ]
   },
   {
@@ -50,27 +40,14 @@
    "cell_type": "code",
    "execution_count": null,
    "id": "756e24d7-b2a0-4a10-bf6c-f532e0cc323b",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "!lamin init --storage ./nextflow_rna_seq --schema bionty"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "ca2926f4",
    "metadata": {
     "tags": [
-     "hide-cell"
+     "hide-output"
     ]
    },
    "outputs": [],
    "source": [
-    "# avoids download bars\n",
-    "import bionty as bt\n",
-    "\n",
-    "bt.Gene(species=\"saccharomyces cerevisiae\")"
+    "!lamin init --storage . --name nextflow-bulkrna"
    ]
   },
   {
@@ -81,85 +58,58 @@
    "outputs": [],
    "source": [
     "import lamindb as ln\n",
-    "import lnschema_bionty as lb\n",
-    "import pandas as pd\n",
-    "import os\n",
-    "import anndata as ad\n",
-    "from pathlib import Path\n",
-    "\n",
-    "ln.settings.verbosity = 3  # show hints"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "ad2ba7f8",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "lb.settings.species = \"saccharomyces cerevisiae\""
+    "from pathlib import Path"
    ]
   },
   {
    "cell_type": "markdown",
-   "id": "ecb68cf2-1188-4f8b-a2ab-01c60d5779b8",
+   "id": "cff2c742",
    "metadata": {},
    "source": [
-    "## Tracking nf-core rnaseq"
+    "## Download test data"
    ]
   },
   {
-   "cell_type": "markdown",
-   "id": "3e1224fd",
-   "metadata": {},
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "3f03dc98-c23d-439e-838d-147134470cf7",
+   "metadata": {
+    "tags": [
+     "hide-output"
+    ]
+   },
+   "outputs": [],
    "source": [
-    "The Nextflow pipeline [nf-core rnaseq](https://nf-co.re/rnaseq/3.12.0) is arguably one of the most popular pipelines for bulk RNA sequencing using STAR, RSEM, HISAT2 or Salmon with gene/isoform counts and extensive quality control.\n",
-    "\n",
-    "First, we create a new Transform object for our pipeline run."
+    "!git clone https://github.com/nf-core/test-datasets --single-branch --branch rnaseq3 --depth 1"
    ]
   },
   {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "05a03bf4-81e5-45b4-a126-002def01f81a",
+   "cell_type": "markdown",
+   "id": "be7f913a",
    "metadata": {},
-   "outputs": [],
    "source": [
-    "rna_seq_transform = ln.Transform(\n",
-    "    name=\"nf-core rnaseq\",\n",
-    "    version=\"3.11.2\",\n",
-    "    type=\"pipeline\",\n",
-    "    reference=\"https://github.com/laminlabs/nextflow-lamin-usecases/\",\n",
-    ").save()"
+    "To keep track of the download, let's create a \"Download\" transform and a track a run pointing to the reference url:"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "e2344c38-d2ea-4545-914d-4f27e79a1599",
+   "id": "7b410782",
    "metadata": {},
    "outputs": [],
    "source": [
-    "ln.track(rna_seq_transform)"
+    "download = ln.Transform(name=\"Download\")\n",
+    "ln.track(\n",
+    "    download, reference=\"https://github.com/nf-core/test-datasets\", reference_type=\"url\"\n",
+    ")"
    ]
   },
   {
    "cell_type": "markdown",
-   "id": "b20dbc7d-0e75-4b06-8f7a-d540bffbdb44",
+   "id": "26d980c5",
    "metadata": {},
    "source": [
-    "We download the [test data](https://github.com/nf-core/test-datasets/tree/rnaseq3) for the pipeline which we track with Lamin."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "3f03dc98-c23d-439e-838d-147134470cf7",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "%%capture command\n",
-    "!git clone https://github.com/nf-core/test-datasets --single-branch --branch rnaseq3"
+    "Let's register the files we need from the download, they'll automatically be linked against the download run:"
    ]
   },
   {
@@ -174,58 +124,83 @@
    "outputs": [],
    "source": [
     "input_fastqs_file = ln.File.from_dir(\"test-datasets/testdata/GSE110004/\")\n",
-    "sample_sheet_file = ln.File(\"test-datasets/samplesheet/v3.10/samplesheet_test.csv\")\n",
     "ln.save(input_fastqs_file)\n",
+    "sample_sheet_file = ln.File(\"test-datasets/samplesheet/v3.10/samplesheet_test.csv\")\n",
     "ln.save(sample_sheet_file)"
    ]
   },
   {
    "cell_type": "markdown",
-   "id": "9085a716-cbc9-44dc-8d45-86f83dd7f80a",
+   "id": "f915ff7a",
    "metadata": {},
    "source": [
-    "Let’s set the input files for our run"
+    "Let's visualize data lineage for one of the files:"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "82e03a05-8654-428e-b8a1-ce50f8708cfd",
+   "id": "15c73170",
    "metadata": {},
    "outputs": [],
    "source": [
-    "run = ln.Run.filter(created_by_id=\"DzTjkKse\").one()\n",
-    "run"
+    "sample_sheet_file.view_lineage()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "ecb68cf2-1188-4f8b-a2ab-01c60d5779b8",
+   "metadata": {},
+   "source": [
+    "## Track the nf-core rnaseq run"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "3e1224fd",
+   "metadata": {},
+   "source": [
+    "Let's now track the Nextflow workflow:"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "fac3ebd4-4ab3-400a-ae2a-40a31a4b7e5c",
+   "id": "05a03bf4-81e5-45b4-a126-002def01f81a",
    "metadata": {},
    "outputs": [],
    "source": [
-    "run.input_files.set(input_fastqs_file)\n",
-    "run.reference = \"lamin_rnaseq\"\n",
-    "run.reference_type = \"nextflow_name\""
+    "nextflow_bulkrna = ln.Transform(\n",
+    "    name=\"nf-core rnaseq\",\n",
+    "    version=\"3.11.2\",\n",
+    "    type=\"pipeline\",\n",
+    "    reference=\"https://github.com/laminlabs/nextflow-lamin-usecases\",\n",
+    ")\n",
+    "\n",
+    "ln.track(nextflow_bulkrna)"
    ]
   },
   {
    "cell_type": "markdown",
-   "id": "7c00edf5-fa14-4e21-b1ab-cc4ac5c24a98",
+   "id": "670533a7",
    "metadata": {},
    "source": [
-    "To sync the workflow execution name with Lamin, we export it as an environment variable."
+    "If we now stage input files, they'll be tracked as run inputs (if input data is tracked in the cloud and registered in LaminDB, this is where we'd typcically start):"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "a923a34c-f3a0-40cc-af95-2c7e958d8e4a",
-   "metadata": {},
+   "id": "27294132",
+   "metadata": {
+    "tags": [
+     "hide-output"
+    ]
+   },
    "outputs": [],
    "source": [
-    "os.environ[\"LAMINDB_RUN_ID\"] = run.reference"
+    "sample_sheet_file.stage()\n",
+    "[input_fastq.stage() for input_fastq in input_fastqs_file]"
    ]
   },
   {
@@ -233,26 +208,37 @@
    "id": "17f9905e-0a34-4335-b0c4-eb9b598c8eaf",
    "metadata": {},
    "source": [
-    "Next, we run the pipeline with its test dataset and track output files and features with Lamin."
+    "We'll pass the LaminDB run id to the nextflow run, so that we can easily find it from within Nextflow:"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "id": "2219c55e",
-   "metadata": {},
+   "metadata": {
+    "tags": [
+     "hide-output"
+    ]
+   },
    "outputs": [],
    "source": [
-    "%%capture command\n",
-    "!nextflow run nf-core/rnaseq -r 3.11.2 -profile test,docker --outdir rna-seq-results -name $LAMINDB_RUN_ID -resume"
+    "!nextflow run nf-core/rnaseq -r 3.11.2 -profile test,docker --outdir rna-seq-results -name {ln.dev.run_context.run.id} -resume"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "fb81c953",
+   "metadata": {},
+   "source": [
+    "## Register outputs"
    ]
   },
   {
    "cell_type": "markdown",
-   "id": "a56e8a22-94dd-413b-989d-f13f59addbe6",
+   "id": "e02898d4",
    "metadata": {},
    "source": [
-    "As a first step, we ingest all multiqc plots from the pipeline run."
+    "### QC"
    ]
   },
   {
@@ -260,14 +246,13 @@
    "execution_count": null,
    "id": "6e7b5f1d-b00b-43d3-bc46-83b14144a8ba",
    "metadata": {
-    "tags": [
-     "hide-output"
-    ]
+    "tags": []
    },
    "outputs": [],
    "source": [
-    "multiqc_results = ln.File.from_dir(\"rna-seq-results/multiqc/\", run=run)\n",
-    "ln.save(multiqc_results)"
+    "# this would register 240 files, we don't need them here\n",
+    "# multiqc_results = ln.File.from_dir(\"rna-seq-results/multiqc/\")\n",
+    "# ln.save(multiqc_results)"
    ]
   },
   {
@@ -277,16 +262,16 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "multiqc_file = ln.File.select(key__icontains=\"multiqc_report.html\").one()\n",
-    "multiqc_file"
+    "multiqc_file = ln.File(\"rna-seq-results/multiqc/star_salmon/multiqc_report.html\")\n",
+    "multiqc_file.save()"
    ]
   },
   {
    "cell_type": "markdown",
-   "id": "29bae36c-dac6-4314-b85b-f3afd7e47fbd",
+   "id": "6f107ee9",
    "metadata": {},
    "source": [
-    "We further ingest the merged Salmon gene counts since we plan on working further with the count table:"
+    "### Count matrix"
    ]
   },
   {
@@ -296,9 +281,8 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "salmon_gene_counts_table_df = pd.read_csv(\n",
-    "    \"rna-seq-results/salmon/salmon.merged.gene_counts.tsv\", sep=\"\\t\"\n",
-    ")"
+    "count_matrix = ln.File(\"rna-seq-results/salmon/salmon.merged.gene_counts.tsv\")\n",
+    "count_matrix.save()"
    ]
   },
   {
@@ -306,35 +290,23 @@
    "id": "22c88eed-61e0-4d12-96bb-ea4e10f476c0",
    "metadata": {},
    "source": [
-    "We curate the count table analogously to {doc}`docs:bulkrna`."
+    "To make it queryable by biological entities (genes, etc.), we can now proceed with: {doc}`docs:bulkrna`"
    ]
   },
   {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "5b0ca2da-8bff-4750-972d-3f1c0cdb28e8",
+   "cell_type": "markdown",
+   "id": "9f607150",
    "metadata": {},
-   "outputs": [],
    "source": [
-    "salmon_gene_counts_table_df = salmon_gene_counts_table_df.T\n",
-    "var = pd.DataFrame(\n",
-    "    {\"gene_name\": salmon_gene_counts_table_df.loc[\"gene_name\"].values},\n",
-    "    index=salmon_gene_counts_table_df.loc[\"gene_id\"],\n",
-    ")\n",
-    "adata = ad.AnnData(salmon_gene_counts_table_df.iloc[2:].astype(\"float32\"), var=var)"
+    "## Visualize"
    ]
   },
   {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "c1a58047-0c25-4632-b355-69610c6176f3",
+   "cell_type": "markdown",
+   "id": "54595d81",
    "metadata": {},
-   "outputs": [],
    "source": [
-    "curated_salmon_gene_counts_file = ln.File.from_anndata(\n",
-    "    adata, description=\"Curated bulk RNA counts\", var_ref=lb.Gene.stable_id, run=run\n",
-    ")\n",
-    "ln.save(curated_salmon_gene_counts_file)"
+    "View data lineage:"
    ]
   },
   {
@@ -344,23 +316,33 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "curated_salmon_gene_counts_file.describe()"
+    "count_matrix.view_lineage()"
    ]
   },
   {
    "cell_type": "markdown",
-   "id": "58c6a407-1e87-4241-98e5-b069be057ea7",
+   "id": "ee3db779",
    "metadata": {},
    "source": [
-    "## Conclusion"
+    "View the database content:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "e45c2584",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "ln.view()"
    ]
   },
   {
    "cell_type": "markdown",
-   "id": "8bba6911-70b6-4a99-a95e-6c9659435af6",
+   "id": "ad06a7e1",
    "metadata": {},
    "source": [
-    "Lamin makes it easy to track pipeline executions and to ingest input and output files that can subsequently be used for advanced downstream analyses. This is complementary to nf-tower."
+    "Clean up the test instance:"
    ]
   },
   {
@@ -369,12 +351,12 @@
    "id": "5f3f95a8",
    "metadata": {
     "tags": [
-     "hide-cell"
+     "hide-output"
     ]
    },
    "outputs": [],
    "source": [
-    "!lamin delete --force nextflow_rna_seq"
+    "!lamin delete --force nextflow-bulkrna"
    ]
   }
  ],
@@ -394,7 +376,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.10.12"
+   "version": "3.9.16"
   },
   "nbproject": {
    "id": "8124Vtle6ZrO",
diff --git a/noxfile.py b/noxfile.py
index cd81be0..5c1518c 100644
--- a/noxfile.py
+++ b/noxfile.py
@@ -16,6 +16,11 @@ def lint(session: nox.Session) -> None:
 
 @nox.session()
 def build(session):
+    session.run(
+        "pip",
+        "install",
+        "lamindb @ git+https://github.com/laminlabs/lamindb",
+    )
     session.run(*"pip install -e .[dev]".split())
     login_testuser1(session)
     run_pytest(session)
diff --git a/pyproject.toml b/pyproject.toml
index 4ee2c37..5f859e2 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -8,8 +8,7 @@ authors = [{name = "Lamin Labs", email = "laminlabs@gmail.com"}]
 readme = "README.md"
 dynamic = ["version", "description"]
 dependencies = [
-    "nbproject",
-    "lamindb[bionty,aws]",
+    "graphviz"
 ]
 
 [project.urls]

From 1728a30915300fdc40d21c5766b8b4210ea38a97 Mon Sep 17 00:00:00 2001
From: github-actions <github-actions@github.com>
Date: Thu, 17 Aug 2023 21:03:49 +0000
Subject: [PATCH 2/3] =?UTF-8?q?=F0=9F=93=9D=20Update=20release=20notes?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 docs/changelog.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/docs/changelog.md b/docs/changelog.md
index b18e278..2749186 100644
--- a/docs/changelog.md
+++ b/docs/changelog.md
@@ -3,6 +3,7 @@
 <!-- prettier-ignore -->
 Name | PR | Developer | Date | Version
 --- | --- | --- | --- | ---
+♻️ Refactor logic in bulk RNA-seq use case | [8](https://github.com/laminlabs/nextflow-lamin-usecases/pull/8) | [falexwolf](https://github.com/falexwolf) | 2023-08-17 |
 :sparkles: Enable nb building & new curation | [7](https://github.com/laminlabs/nextflow-lamin-usecases/pull/7) | [Zethson](https://github.com/Zethson) | 2023-08-15 |
 New API | [6](https://github.com/laminlabs/nextflow-lamin-usecases/pull/6) | [Zethson](https://github.com/Zethson) | 2023-08-09 |
 :sparkles: Cache pipeline execution | [5](https://github.com/laminlabs/nextflow-lamin-usecases/pull/5) | [Zethson](https://github.com/Zethson) | 2023-08-01 |

From 2b67628a359d87f8e07998b6bbfebb69d0d36385 Mon Sep 17 00:00:00 2001
From: Alex Wolf <f.alexander.wolf@gmail.com>
Date: Fri, 18 Aug 2023 09:02:47 +0200
Subject: [PATCH 3/3] =?UTF-8?q?=F0=9F=92=84=20Fine=20tune=20wording?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 docs/guide/bulk_rna_seq.ipynb | 118 ++++++++++++++++++++--------------
 1 file changed, 71 insertions(+), 47 deletions(-)

diff --git a/docs/guide/bulk_rna_seq.ipynb b/docs/guide/bulk_rna_seq.ipynb
index 63db031..1cb95cc 100644
--- a/docs/guide/bulk_rna_seq.ipynb
+++ b/docs/guide/bulk_rna_seq.ipynb
@@ -15,25 +15,17 @@
    "source": [
     "[Nextflow](https://www.nextflow.io/) is a workflow management system used for executing scientific workflows across platforms scalably, portably, and reproducibly.\n",
     "\n",
-    "The workflow [nf-core rnaseq](https://nf-co.re/rnaseq/3.12.0) is arguably one of the most popular pipelines for bulk RNA sequencing using STAR, RSEM, HISAT2 or Salmon with gene/isoform counts and extensive quality control.\n"
+    "Here, we'll run `nf-core/rnaseq` to process `.fastq` files from bulk RNA sequencing using STAR, RSEM, HISAT2, or Salmon, with gene/isoform counts and extensive quality control ([reference](https://nf-co.re/rnaseq/3.12.0)).\n",
+    "\n",
+    "![](https://raw.githubusercontent.com/nf-core/rnaseq/3.12.0/docs/images/nf-core-rnaseq_metro_map_grey.png)\n"
    ]
   },
   {
    "cell_type": "markdown",
-   "id": "531093fd-67af-40fd-b481-fbf68828bcfd",
+   "id": "b7c8e52d",
    "metadata": {},
    "source": [
-    "## Setup"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "f51e29c1",
-   "metadata": {},
-   "source": [
-    "To run this notebook, you need to load a LaminDB instance that has the `bionty` schema mounted.\n",
-    "\n",
-    "Here, we’ll create a test instance (skip if you’d like to run it using your instance):"
+    "Let's create a test instance:"
    ]
   },
   {
@@ -57,8 +49,7 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "import lamindb as ln\n",
-    "from pathlib import Path"
+    "import lamindb as ln"
    ]
   },
   {
@@ -69,6 +60,14 @@
     "## Download test data"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "4f32ae96",
+   "metadata": {},
+   "source": [
+    "Download test data using git:"
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": null,
@@ -88,7 +87,7 @@
    "id": "be7f913a",
    "metadata": {},
    "source": [
-    "To keep track of the download, let's create a \"Download\" transform and a track a run pointing to the reference url:"
+    "Track the download:"
    ]
   },
   {
@@ -99,9 +98,8 @@
    "outputs": [],
    "source": [
     "download = ln.Transform(name=\"Download\")\n",
-    "ln.track(\n",
-    "    download, reference=\"https://github.com/nf-core/test-datasets\", reference_type=\"url\"\n",
-    ")"
+    "download_url = \"https://github.com/nf-core/test-datasets\"\n",
+    "ln.track(download, reference=download_url, reference_type=\"url\")"
    ]
   },
   {
@@ -109,7 +107,7 @@
    "id": "26d980c5",
    "metadata": {},
    "source": [
-    "Let's register the files we need from the download, they'll automatically be linked against the download run:"
+    "Register the input files; they'll automatically be linked to the download run:"
    ]
   },
   {
@@ -123,10 +121,10 @@
    },
    "outputs": [],
    "source": [
-    "input_fastqs_file = ln.File.from_dir(\"test-datasets/testdata/GSE110004/\")\n",
-    "ln.save(input_fastqs_file)\n",
-    "sample_sheet_file = ln.File(\"test-datasets/samplesheet/v3.10/samplesheet_test.csv\")\n",
-    "ln.save(sample_sheet_file)"
+    "sample_sheet = ln.File(\"test-datasets/samplesheet/v3.10/samplesheet_test.csv\")\n",
+    "ln.save(sample_sheet)\n",
+    "input_fastqs = ln.File.from_dir(\"test-datasets/testdata/GSE110004/\")\n",
+    "ln.save(input_fastqs)"
    ]
   },
   {
@@ -134,7 +132,7 @@
    "id": "f915ff7a",
    "metadata": {},
    "source": [
-    "Let's visualize data lineage for one of the files:"
+    "Visualize data lineage for one of the files:"
    ]
   },
   {
@@ -144,7 +142,7 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "sample_sheet_file.view_lineage()"
+    "sample_sheet.view_lineage()"
    ]
   },
   {
@@ -152,7 +150,15 @@
    "id": "ecb68cf2-1188-4f8b-a2ab-01c60d5779b8",
    "metadata": {},
    "source": [
-    "## Track the nf-core rnaseq run"
+    "## Track the Nextflow run"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "3b698d87",
+   "metadata": {},
+   "source": [
+    "(We'd start here if input files were tracked in the cloud with LaminDB rather than downloaded through git.)"
    ]
   },
   {
@@ -160,7 +166,7 @@
    "id": "3e1224fd",
    "metadata": {},
    "source": [
-    "Let's now track the Nextflow workflow:"
+    "Track the Nextflow pipeline & run:"
    ]
   },
   {
@@ -176,7 +182,6 @@
     "    type=\"pipeline\",\n",
     "    reference=\"https://github.com/laminlabs/nextflow-lamin-usecases\",\n",
     ")\n",
-    "\n",
     "ln.track(nextflow_bulkrna)"
    ]
   },
@@ -185,7 +190,9 @@
    "id": "670533a7",
    "metadata": {},
    "source": [
-    "If we now stage input files, they'll be tracked as run inputs (if input data is tracked in the cloud and registered in LaminDB, this is where we'd typcically start):"
+    "If we now stage input files, they'll be tracked as run inputs.\n",
+    "\n",
+    "(As data is already locally available in this test case, staging won't download anything.)"
    ]
   },
   {
@@ -199,8 +206,8 @@
    },
    "outputs": [],
    "source": [
-    "sample_sheet_file.stage()\n",
-    "[input_fastq.stage() for input_fastq in input_fastqs_file]"
+    "sample_sheet.stage()\n",
+    "[input_fastq.stage() for input_fastq in input_fastqs]"
    ]
   },
   {
@@ -208,7 +215,7 @@
    "id": "17f9905e-0a34-4335-b0c4-eb9b598c8eaf",
    "metadata": {},
    "source": [
-    "We'll pass the LaminDB run id to the nextflow run, so that we can easily find it from within Nextflow:"
+    "All data is now in place and we can run the Nextflow pipeline:"
    ]
   },
   {
@@ -225,6 +232,14 @@
     "!nextflow run nf-core/rnaseq -r 3.11.2 -profile test,docker --outdir rna-seq-results -name {ln.dev.run_context.run.id} -resume"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "58eea7fc",
+   "metadata": {},
+   "source": [
+    "Here, we passed the LaminDB run id to Nextflow so that we can query it from within Nextflow."
+   ]
+  },
   {
    "cell_type": "markdown",
    "id": "fb81c953",
@@ -244,26 +259,27 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "6e7b5f1d-b00b-43d3-bc46-83b14144a8ba",
-   "metadata": {
-    "tags": []
-   },
+   "id": "7140018a-9ef7-4136-a595-37b514c66a81",
+   "metadata": {},
    "outputs": [],
    "source": [
-    "# this would register 240 files, we don't need them here\n",
-    "# multiqc_results = ln.File.from_dir(\"rna-seq-results/multiqc/\")\n",
-    "# ln.save(multiqc_results)"
+    "multiqc_file = ln.File(\"rna-seq-results/multiqc/star_salmon/multiqc_report.html\")\n",
+    "multiqc_file.save()"
    ]
   },
   {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "7140018a-9ef7-4136-a595-37b514c66a81",
+   "cell_type": "markdown",
+   "id": "a588717f",
    "metadata": {},
-   "outputs": [],
    "source": [
-    "multiqc_file = ln.File(\"rna-seq-results/multiqc/star_salmon/multiqc_report.html\")\n",
-    "multiqc_file.save()"
+    ":::{dropdown} How would I register all QC files?\n",
+    "\n",
+    "```python\n",
+    "multiqc_results = ln.File.from_dir(\"rna-seq-results/multiqc/\")\n",
+    "ln.save(multiqc_results)\n",
+    "```\n",
+    "\n",
+    ":::"
    ]
   },
   {
@@ -285,12 +301,20 @@
     "count_matrix.save()"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "dd98074b",
+   "metadata": {},
+   "source": [
+    "## Link biological entities"
+   ]
+  },
   {
    "cell_type": "markdown",
    "id": "22c88eed-61e0-4d12-96bb-ea4e10f476c0",
    "metadata": {},
    "source": [
-    "To make it queryable by biological entities (genes, etc.), we can now proceed with: {doc}`docs:bulkrna`"
+    "To make the count matrix queryable by biological entities (genes, experimental metadata, etc.), we can now proceed with: {doc}`docs:bulkrna`"
    ]
   },
   {