docs: updated workflow and outputs
matq007 committed Dec 2, 2024
1 parent 4f6feee commit cddbe3b
Showing 4 changed files with 37 additions and 69 deletions.
14 changes: 2 additions & 12 deletions README.md
@@ -36,23 +36,13 @@ To run the pipeline you have to create experiment metadata files:

and a samplesheet (`samplesheet.csv`). We provide a test example [here](assets/samplesheet.csv).

Next, you have to generate genome references to incorporate ERCC spike-ins. References are downloaded from the [GENCODE](https://www.gencodegenes.org) database.

```bash
nextflow run nf-core/marsseq \
-profile <docker/singularity/.../institute> \
--genome <mm10,mm9,GRCh38_v43> \
--build_references \
--input samplesheet.csv \
--outdir <OUTDIR>
```

Now, you can run the pipeline using:

```bash
nextflow run nf-core/marsseq \
-profile <docker/singularity/.../institute> \
--genome <mm10,mm9,GRCh38_v43> \
--fasta https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M32/GRCm39.primary_assembly.genome.fa.gz \
--gtf https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M32/gencode.vM32.annotation.gtf.gz \
--input samplesheet.csv \
--outdir <OUTDIR>
```
Binary file modified docs/images/workflow.png
87 changes: 32 additions & 55 deletions docs/output.md
@@ -10,7 +10,7 @@ The directories listed below will be created in the results directory after the

The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes data using the following steps:

- [Download and build references](#download-and-build-references) - Build references needed to run the pipeline
- [Prepare genome](#prepare-genome) - Build references needed to run the pipeline
- [Prepare pipeline](#prepare-pipeline)
- [Label reads](#label-reads)
- [Align reads](#align-reads)
@@ -26,55 +26,35 @@ The pipeline is executed per `Batch` and therefore the folder structure looks li

```console
results/
|-- multiqc
|-- pipeline_info
|-- references
`-- <batch>
├── multiqc
├── pipeline_info
├── references
└── SB26
    ├── data
    ├── fastqc
    ├── output
    ├── QC
    ├── SB26.sam
    └── velocity
```

## Download and build references
## Prepare genome

<details markdown="1">
<summary>Output files</summary>
The pipeline requires ERCC (spike-ins) to be included in the reference genome. To
accommodate this, the pipeline requires `fasta` and `gtf` reference files. We recommend
using files from [GENCODE](https://www.gencodegenes.org). Reference indexes are built
according to the value of the `--aligner` parameter.

```console
.
└── <genome>
    ├── bowtie2
    │   ├── <genome>.1.bt2
    │   ├── <genome>.2.bt2
    │   ├── <genome>.3.bt2
    │   ├── <genome>.4.bt2
    │   ├── <genome>.rev.1.bt2
    │   └── <genome>.rev.2.bt2
    ├── <genome>.fa
    ├── <genome>.gtf
    ├── star
    │   ├── chrLength.txt
    │   ├── chrNameLength.txt
    │   ├── chrName.txt
    │   ├── chrStart.txt
    │   ├── exonGeTrInfo.tab
    │   ├── exonInfo.tab
    │   ├── geneInfo.tab
    │   ├── Genome
    │   ├── genomeParameters.txt
    │   ├── Log.out
    │   ├── SA
    │   ├── SAindex
    │   ├── sjdbInfo.txt
    │   ├── sjdbList.fromGTF.out.tab
    │   ├── sjdbList.out.tab
    │   └── transcriptInfo.tab
    └── versions.yml
results/references
├── bowtie2
├── gencode.vM32.annotation.gtf
├── GRCm39.primary_assembly.genome_ercc.fa
├── GRCm39.primary_assembly.genome.fa
├── star
└── versions.yml
```
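
The idea of appending ERCC spike-ins to the genome can be illustrated with a minimal sketch. It is not the pipeline's actual implementation: the file names `genome.fa`, `ercc.fa` and `genome_ercc.fa` are placeholders, and the tiny sequences are stand-ins so that the example runs anywhere.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Scratch directory for the illustration (assumption: any writable location works).
workdir="$(mktemp -d)"

# Tiny stand-in files; real inputs would be the GENCODE genome FASTA
# and a FASTA file with the ERCC spike-in sequences.
printf '>chr1\nACGTACGT\n' > "$workdir/genome.fa"
printf '>ERCC-00002\nTTGGCCAA\n' > "$workdir/ercc.fa"

# Append the spike-in sequences to the genome to obtain a combined reference.
cat "$workdir/genome.fa" "$workdir/ercc.fa" > "$workdir/genome_ercc.fa"

# Count sequence headers as a quick sanity check.
n_seqs="$(grep -c '^>' "$workdir/genome_ercc.fa")"
echo "sequences in combined reference: $n_seqs"
```

The combined file corresponds in spirit to `GRCm39.primary_assembly.genome_ercc.fa` in the listing above; the aligner indexes are then built from the combined reference.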

</details>

The pipeline downloads references from the GENCODE database. This is required because
MARS-seq uses ERCC spike-ins, which have to be appended. Next it builds
bowtie2 index. If `--velocity` flag is set, star index is also built.

## Prepare pipeline

<details markdown="1">
@@ -85,7 +65,6 @@ bowtie2 index. If `--velocity` flag is set, star index is also built.
- `gene_intervals.txt`: Information about gene (chromosome, start, end, strand and symbol)
- `seq_batches.txt`: Sequencing batches
- `wells_cells.txt`: Well cells
- `*fastq.gz`: Raw reads

</details>

@@ -121,25 +100,23 @@ folder.
Split reads are aligned using `bowtie2`. Next, all the aligned reads are merged
into one `SAM` file which is used as an input for demultiplexing.

If the `--velocity` flag is set, the reads are also aligned using `StarSolo` to estimate
both spliced and unspliced reads, which can be used for RNA velocity estimation.
This is an additional plugin which we developed. In short, MARS-seq2.0 reads are
converted to `10X v2` format. Additionally, a whitelist is generated for the aligner
to perform demultiplexing.
If the `--aligner` flag is set to `bowtie2_star` or `star`, the reads are also aligned
using `StarSolo` to estimate both spliced and unspliced reads, which can be used
for RNA velocity estimation. This is an additional plugin which we developed.
In short, MARS-seq2.0 reads are converted to `10X v2` format. Additionally, a
whitelist is generated for the aligner to perform demultiplexing.
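
As a rough sketch, a run that also produces the `StarSolo` output could be assembled as follows. The `--aligner` values come from the paragraph above; the input file names, the `docker` profile and the `results` output directory are placeholders, and the command is only printed here so that the snippet runs without Nextflow installed.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Aligner choice; values named in this document: bowtie2_star, star.
aligner="bowtie2_star"

# Assemble the command as an array so every argument stays intact.
# File names below are placeholders, not files shipped with the pipeline.
cmd=(
  nextflow run nf-core/marsseq
  -profile docker
  --aligner "$aligner"
  --fasta genome.fasta
  --gtf annotation.gtf
  --input samplesheet.csv
  --outdir results
)

# Print the command for review; run it with "${cmd[@]}" once the inputs exist.
printf '%s ' "${cmd[@]}"
echo
```

With `bowtie2_star`, both the merged `bowtie2` SAM file and the `StarSolo` results would be expected in the batch folder, based on the description above.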

<details markdown="1">
<summary>Output files</summary>

- `<batch>`
- `<batch>.sam`: Merged aligned reads into one SAM file with `bowtie2`
- `velocity/`
- `Solo.out/*`: Output from StarSolo (Gene, GeneFull, SJ, Velocyto and Barcode.stats)
- `Aligned.sortedByCoord.out.bam`: Aligned reads
- `Log.final.out`: STAR alignment report containing the mapping results summary
- `Log.out` and `Log.progress.out`: STAR log files containing detailed information about the run. Typically only useful for debugging purposes
- `<batch>.cutadapt.log`: Log file from running `cutadapt`
- `<batch>_{1,2}.trim.fastq.gz`: Trimmed paired-end reads converted to `10X v2` format
- `SJ.out.tab`: File containing filtered splice junctions detected after mapping the reads
- `<batch>.cutadapt.log`: Log file from running `cutadapt`
- `<batch>.Log.final.out`: STAR alignment report containing the mapping results summary
- `<batch>.Log.out` and `<batch>.Log.progress.out`: STAR log files containing detailed information about the run. Typically only useful for debugging purposes
- `<batch>.Solo.out/*`: Output from StarSolo (Gene, GeneFull, SJ, Velocyto and Barcode.stats)
- `whitelist.txt`: File containing cell barcodes (combination of pool and cell barcode)

</details>
5 changes: 3 additions & 2 deletions docs/usage.md
@@ -59,7 +59,7 @@ An [example samplesheet](../assets/samplesheet.csv) has been provided with the p
The typical command for running the pipeline is as follows:

```bash
nextflow run nf-core/marsseq --input ./samplesheet.csv --outdir ./results --genome GRCh37 -profile docker
nextflow run nf-core/marsseq --input ./samplesheet.csv --outdir ./results --fasta genome.fasta --gtf annotation.gtf -profile docker
```

This will launch the pipeline with the `docker` configuration profile. See below for more information about profiles.
@@ -92,7 +92,8 @@ with:
```yaml title="params.yaml"
input: './samplesheet.csv'
outdir: './results/'
genome: 'GRCh37'
fasta: 'genome.fasta'
gtf: 'annotation.gtf'
<...>
```
