docs: updated workflow and outputs

nf-core · Dec 2, 2024 · cddbe3b
1 parent 4f6feee
commit cddbe3b
Showing 4 changed files with 37 additions and 69 deletions.
diff --git a/README.md b/README.md
@@ -36,23 +36,13 @@ To run the pipeline you have create experiment metadata files:
 
 and samplesheet (`samplesheet.csv`). We provide test example [here](assets/samplesheet.csv).
 
-Next, you have to generate genome references to incorporate ERCC spike-ins. References are downloaded from [GENCODE](https://www.gencodegenes.org) database.
-
-```bash
-nextflow run nf-core/marsseq \
-  -profile <docker/singularity/.../institute> \
-  --genome <mm10,mm9,GRCh38_v43> \
-  --build_references \
-  --input samplsheet.csv \
-  --outdir <OUTDIR>
-```
-
 Now, you can run the pipeline using:
 
 ```bash
 nextflow run nf-core/marsseq \
   -profile <docker/singularity/.../institute> \
-  --genome <mm10,mm9,GRCh38_v43> \
+  --fasta https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M32/GRCm39.primary_assembly.genome.fa.gz \
+  --gtf https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M32/gencode.vM32.annotation.gtf.gz \
   --input samplesheet.csv \
   --outdir <OUTDIR>
 ```

diff --git a/docs/images/workflow.png b/docs/images/workflow.png
diff --git a/docs/output.md b/docs/output.md
@@ -10,7 +10,7 @@ The directories listed below will be created in the results directory after the
 
 The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes data using the following steps:
 
-- [Download and build references](#download-and-build-references) - Build references needer to run the pipeline
+- [Prepare genome](#prepare-genome) - Build references needed to run the pipeline
 - [Prepare pipeline](#prepare-pipeline)
 - [Label reads](#label-reads)
 - [Align reads](#align-preads)
@@ -26,55 +26,35 @@ The pipeline is executed per `Batch` and therefore the folder structure looks li
 
 ```console
 results/
-|-- multiqc
-|-- pipeline_info
-|-- references
-`-- <batch>
+├── multiqc
+├── pipeline_info
+├── references
+└── SB26
+    ├── data
+    ├── fastqc
+    ├── output
+    ├── QC
+    ├── SB26.sam
+    └── velocity
 ```
 
-## Download and build references
+## Prepare genome
 
-<details markdown="1">
-<summary>Output files</summary>
+The pipeline requires ERCC spike-ins to be included in the reference genome. To
+accommodate this, the pipeline requires `fasta` and `gtf` reference files. We recommend
+using files from [GENCODE](https://www.gencodegenes.org). Reference indexes are built
+according to the `--aligner` parameter.
 
 ```console
-.
-└── <genome>
-    ├── bowtie2
-    │   ├── <genome>.1.bt2
-    │   ├── <genome>.2.bt2
-    │   ├── <genome>.3.bt2
-    │   ├── <genome>.4.bt2
-    │   ├── <genome>.rev.1.bt2
-    │   └── <genome>.rev.2.bt2
-    ├── <genome>.fa
-    ├── <genome>.gtf
-    ├── star
-    │   ├── chrLength.txt
-    │   ├── chrNameLength.txt
-    │   ├── chrName.txt
-    │   ├── chrStart.txt
-    │   ├── exonGeTrInfo.tab
-    │   ├── exonInfo.tab
-    │   ├── geneInfo.tab
-    │   ├── Genome
-    │   ├── genomeParameters.txt
-    │   ├── Log.out
-    │   ├── SA
-    │   ├── SAindex
-    │   ├── sjdbInfo.txt
-    │   ├── sjdbList.fromGTF.out.tab
-    │   ├── sjdbList.out.tab
-    │   └── transcriptInfo.tab
-    └── versions.yml
+results/references
+├── bowtie2
+├── gencode.vM32.annotation.gtf
+├── GRCm39.primary_assembly.genome_ercc.fa
+├── GRCm39.primary_assembly.genome.fa
+├── star
+└── versions.yml
 ```
 
-</details>
-
-The pipeline downloads references from GENCODE database. This is required, because
-the MARS-seq is using ERCC spike-ins, which have to be appended. Next it builds
-bowtie2 index. If `--velocity` flag is set, star index is also built.
-
 ## Prepare pipeline
 
 <details markdown="1">
@@ -85,7 +65,6 @@ bowtie2 index. If `--velocity` flag is set, star index is also built.
   - `gene_intervals.txt`: Information about gene (chromosome, start, end, strand and symbol)
   - `seq_batches.txt`: Sequencing batches
   - `wells_cells.txt`: Well cells
-  - `*fastq.gz`: Raw reads
 
 </details>
 
@@ -121,25 +100,23 @@ folder.
 Split reads are aligned using `bowtie2`. Next, all the aligned reads are merged
 into one `SAM` file which is used as an input for demultiplexing.
 
-If `--velocity` flag is set, the reads are also aligned using `StarSolo` to estimated
-both spliced and unspliced reads which can be used for RNA velocity estimation.
-This is an additional plugin which we developed. In short MARS-seq2.0 reads are
-converted to `10X v2` format. Additionally, a whitelist is generated for aligned
-to perform demultiplexing.
+If the `--aligner` parameter is set to `bowtie2_star` or `star`, the reads are also aligned
+using `StarSolo` to estimate both spliced and unspliced reads, which can be used
+for RNA velocity estimation. This is an additional plugin which we developed.
+In short, MARS-seq2.0 reads are converted to `10X v2` format. Additionally, a
+whitelist is generated so that the aligner can perform demultiplexing.
 
 <details markdown="1">
 <summary>Output files</summary>
 
 - `<batch>`
   - `<batch>.sam`: Merged aligned reads into one SAM file with `bowtie2`
   - `velocity/`
-    - `Solo.out/*`: Output from StarSolo (Gene, GeneFull, SJ, Velocyto and Barcode.stats)
-    - `Aligned.sortedByCoord.out.bam`: Aligned reads
-    - `Log.final.out`: STAR alignment report containing the mapping results summary
-    - `Log.out` and `Log.progress.out`: STAR log files containing detailed information about the run. Typically only useful for debugging purposes
-    - `<batch>.cutadapt.log`: Log file from running `cutadapt`
     - `<batch>_{1,2}.trim.fastq.gz`: Trimmed pair-end converted `10X v2` reads
-    - `SJ.out.tab`: File containing filtered splice junctions detected after mapping the reads
+    - `<batch>.cutadapt.log`: Log file from running `cutadapt`
+    - `<batch>.Log.final.out`: STAR alignment report containing the mapping results summary
+    - `<batch>.Log.out` and `<batch>.Log.progress.out`: STAR log files containing detailed information about the run. Typically only useful for debugging purposes
+    - `<batch>.Solo.out/*`: Output from StarSolo (Gene, GeneFull, SJ, Velocyto and Barcode.stats)
     - `whitelist.txt`: File containing cell barcodes (combination of pool and cell barcode)
 
 </details>
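
Note on the `Prepare genome` change: the new text says ERCC spike-ins must be part of the reference, and the `results/references` listing shows a `*_ercc.fa` file next to the original FASTA. The sketch below illustrates one way such a combined FASTA could be produced, by plain concatenation. This is an assumption for illustration only, not the pipeline's actual implementation; the file names `genome.fa` and `ercc.fa` and the toy sequences are hypothetical.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical toy inputs. A real run would use a GENCODE genome FASTA
# and the ERCC spike-in sequences instead of these made-up records.
workdir="$(mktemp -d)"
printf '>chr1\nACGTACGT\n' > "$workdir/genome.fa"
printf '>ERCC-00002\nTTGGCCAA\n>ERCC-00003\nGGCCTTAA\n' > "$workdir/ercc.fa"

# Append the spike-in records to the genome to obtain a combined reference.
cat "$workdir/genome.fa" "$workdir/ercc.fa" > "$workdir/genome_ercc.fa"

# Count the FASTA records (header lines) in the combined file.
n_seqs="$(grep -c '^>' "$workdir/genome_ercc.fa")"
echo "sequences in combined reference: $n_seqs"
```

Any aligner index built from the combined file would then cover both genomic and spike-in sequences.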

diff --git a/docs/usage.md b/docs/usage.md
@@ -59,7 +59,7 @@ An [example samplesheet](../assets/samplesheet.csv) has been provided with the p
 The typical command for running the pipeline is as follows:
 
 ```bash
-nextflow run nf-core/marsseq --input ./samplesheet.csv --outdir ./results --genome GRCh37 -profile docker
+nextflow run nf-core/marsseq --input ./samplesheet.csv --outdir ./results --fasta genome.fasta --gtf annotation.gtf -profile docker
 ```
 
 This will launch the pipeline with the `docker` configuration profile. See below for more information about profiles.
@@ -92,7 +92,8 @@ with:
 ```yaml title="params.yaml"
 input: './samplesheet.csv'
 outdir: './results/'
-genome: 'GRCh37'
+fasta: 'genome.fasta'
+gtf: 'annotation.gtf'
 <...>
 ```
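
Note on the `params.yaml` change: a minimal sketch of writing the updated parameters file and passing it through Nextflow's `-params-file` option might look as follows. The temporary directory and the placeholder file names (`genome.fasta`, `annotation.gtf`) are assumptions, and the `nextflow` command is only printed, not executed.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Write the parameters file to a scratch directory (assumption: any
# writable location works; nothing here depends on this path).
params_dir="$(mktemp -d)"
params_file="$params_dir/params.yaml"

cat > "$params_file" <<'EOF'
input: './samplesheet.csv'
outdir: './results/'
fasta: 'genome.fasta'
gtf: 'annotation.gtf'
EOF

# Print the command that would launch the pipeline with this file.
run_cmd="nextflow run nf-core/marsseq -profile docker -params-file $params_file"
echo "$run_cmd"
```

The file replaces the single `genome` key with explicit `fasta` and `gtf` entries, mirroring the command-line change in this commit.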