
Merge pull request #19 from maxulysse/dev
MERGE TEMPLATE
edmundmiller authored May 21, 2024
2 parents eb18e02 + 883e5f7 commit 7a3b939
Showing 103 changed files with 2,611 additions and 737 deletions.
1 change: 1 addition & 0 deletions .github/CODEOWNERS
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
workflows/sarek/** @nf-core/sarek
49 changes: 49 additions & 0 deletions .github/workflows/build_reference.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,49 @@
name: Build reference genomes that changed
on:
push:
branches:
- main
paths:
- "assets/genomes/*.yml"

jobs:
run-tower:
name: Run AWS full tests
if: github.repository == 'nf-core/nascent'
runs-on: ubuntu-latest
steps:
- name: Find changed genomes
id: changed-genome-files
uses: tj-actions/changed-files@v42
with:
files: |
assets/genomes/*.yml
- name: Concatenate all the YAMLs together
if: steps.changed-genome-files.outputs.any_changed == 'true'
env:
CHANGED_FILES: ${{ steps.changed-genome-files.outputs.all_changed_files }}
run: cat ${CHANGED_FILES} > samplesheet.yml
# - name: Upload samplesheet.yml to s3 or Tower Datasets
# run: TODO
- name: Launch workflow via tower
uses: seqeralabs/action-tower-launch@v2
with:
workspace_id: ${{ secrets.TOWER_WORKSPACE_ID }}
access_token: ${{ secrets.TOWER_ACCESS_TOKEN }}
compute_env: ${{ secrets.TOWER_COMPUTE_ENV }}
revision: ${{ github.sha }}
workdir: s3://${{ secrets.AWS_S3_SCRATCH_BUCKET }}/work
parameters: |
{
"input": "samplesheet.yml",
"hook_url": "${{ secrets.MEGATESTS_ALERTS_SLACK_HOOK_URL }}",
"outdir": "s3://${{ secrets.AWS_S3_BUCKET }}/nascent/results-${{ github.sha }}"
}
profiles: cloud

- uses: actions/upload-artifact@v4
with:
name: Tower debug log file
path: |
tower_action_*.log
tower_action_*.json
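The concatenation step above works because each per-genome file is a top-level YAML list, so `cat`-ing several of them together still yields one valid list. A minimal sketch of that assembly (file names and contents are illustrative, not the real assets):

```python
# Sketch: concatenating changed per-genome YAML files into samplesheet.yml,
# as the `cat ${CHANGED_FILES}` workflow step does.
from pathlib import Path
import tempfile

def build_samplesheet(changed_files, out_path):
    """Concatenate the changed genome YAML files into one samplesheet."""
    with open(out_path, "w") as out:
        for f in changed_files:
            out.write(Path(f).read_text())

with tempfile.TemporaryDirectory() as tmp:
    a = Path(tmp, "GRCh38.yml")
    a.write_text("- genome: GRCh38.p14\n  mito_name: MT\n")
    b = Path(tmp, "GRCm39.yml")
    b.write_text("- genome: GRCm39\n  mito_name: MT\n")
    sheet = Path(tmp, "samplesheet.yml")
    build_samplesheet([a, b], sheet)
    # Each source file contributed one top-level list entry
    entries = [l for l in sheet.read_text().splitlines() if l.startswith("- genome:")]
    print(len(entries))  # 2
```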
3 changes: 3 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -6,3 +6,6 @@ results/
testing/
testing*
*.pyc
.idea
*.log
tmp/
69 changes: 20 additions & 49 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -19,64 +19,35 @@

## Introduction

**nf-core/references** is a bioinformatics pipeline that ...
**nf-core/references** is a bioinformatics pipeline that builds references.

<!-- TODO nf-core:
Complete this sentence with a 2-3 sentence summary of what types of data the pipeline ingests, a brief overview of the
major pipeline sections and the types of output it produces. You're giving an overview to someone new
to nf-core here, in 15-20 seconds. For an example, see https://github.com/nf-core/rnaseq/blob/master/README.md#introduction
-->
## How to hack on it

<!-- TODO nf-core: Include a figure that guides the user through the major workflow steps. Many nf-core
workflows use the "tube map" design for that. See https://nf-co.re/docs/contributing/design_guidelines#examples for examples. -->
<!-- TODO nf-core: Fill in short bullet-pointed list of the default steps in the pipeline -->
0. Have Docker and Nextflow installed
1. `nextflow run main.nf`

1. Read QC ([`FastQC`](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/))
2. Present QC for raw reads ([`MultiQC`](http://multiqc.info/))
## Some thoughts on reference building

## Usage
- We could use a glob: if you just drop a fasta in the s3 bucket, it'll get picked up and new resources built
- Could take this a step further and make it a little config file that has the fasta, gtf, genome_size, etc.
- How do we avoid rebuilding? Ideally we should build once per new minor release of an aligner/reference. IMO kinda low priority, because the main cost is going to be egress, not compute.
- How much effort is too much effort?
- Should it be as easy as adding a file on s3?
- No, that shouldn't be a requirement; you should be able to link to a reference externally (a "source of truth", i.e. an FTP link), and the workflow will build the references
- So, like mulled biocontainers, just make a PR to the samplesheet and boom, new reference in the s3 bucket if it's approved?

> [!NOTE]
> If you are new to Nextflow and nf-core, please refer to [this page](https://nf-co.re/docs/usage/installation) on how to set-up Nextflow. Make sure to [test your setup](https://nf-co.re/docs/usage/introduction#how-to-run-a-pipeline) with `-profile test` before running the workflow on actual data.
# Roadmap

<!-- TODO nf-core: Describe the minimum required steps to execute the pipeline, e.g. how to prepare samplesheets.
Explain what rows and columns represent. For instance (please edit as appropriate):
PoC:

First, prepare a samplesheet with your input data that looks as follows:
- Replace aws-igenomes
- bwa, bowtie2, star, bismark need to be built
- fasta, gtf, bed12, mito_name, macs_gsize blacklist, copied over

`samplesheet.csv`:
Other nice things to have:

```csv
sample,fastq_1,fastq_2
CONTROL_REP1,AEG588A1_S1_L002_R1_001.fastq.gz,AEG588A1_S1_L002_R2_001.fastq.gz
```
Each row represents a fastq file (single-end) or a pair of fastq files (paired end).
-->

Now, you can run the pipeline using:

<!-- TODO nf-core: update the following command to include all required parameters for a minimal example -->

```bash
nextflow run nf-core/references \
-profile <docker/singularity/.../institute> \
--input samplesheet.csv \
--outdir <OUTDIR>
```

> [!WARNING]
> Please provide pipeline parameters via the CLI or Nextflow `-params-file` option. Custom config files including those provided by the `-c` Nextflow option can be used to provide any configuration _**except for parameters**_;
> see [docs](https://nf-co.re/usage/configuration#custom-configuration-files).
For more details and further functionality, please refer to the [usage documentation](https://nf-co.re/references/usage) and the [parameter documentation](https://nf-co.re/references/parameters).

## Pipeline output

To see the results of an example test run with a full size dataset refer to the [results](https://nf-co.re/references/results) tab on the nf-core website pipeline page.
For more details about the output files and reports, please refer to the
[output documentation](https://nf-co.re/references/output).
- Building our test-datasets
- Downsampling for unified genomics test dataset creation (thinking about viralintegration/rnaseq/wgs) and spiking in test cases of interest (specific variants, for example)

## Credits

Expand Down
7 changes: 7 additions & 0 deletions assets/genomes/GRCh38.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,7 @@
# FIXME Someone check this
- genome: GRCh38.p14
fasta: https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_45/GRCh38.primary_assembly.genome.fa.gz
gtf: https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_45/gencode.v45.chr_patch_hapl_scaff.annotation.gtf.gz
mito_name: MT
site: https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000001405.40
reference_version: GCF_000001405.40
6 changes: 6 additions & 0 deletions assets/genomes/GRCm39.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,6 @@
- genome: GRCm39
fasta: https://hgdownload.soe.ucsc.edu/goldenPath/mm39/bigZips/mm39.fa.gz
gtf: https://hgdownload.soe.ucsc.edu/goldenPath/mm39/bigZips/genes/mm39.ncbiRefSeq.gtf.gz
mito_name: MT
site: https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000001635.27/
reference_version: GCF_000001635.27
5 changes: 5 additions & 0 deletions assets/genomes/homo_sapiens/ucsc/chm13.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
- genome: "CHM13"
species: homo_sapiens
source: ucsc
fasta: "https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/009/914/755/GCF_009914755.1_T2T-CHM13v2.0/GCF_009914755.1_T2T-CHM13v2.0_genomic.fna.gz"
gtf: "https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/009/914/755/GCF_009914755.1_T2T-CHM13v2.0/GCF_009914755.1_T2T-CHM13v2.0_genomic.gtf.gz"
8 changes: 8 additions & 0 deletions assets/genomes/homo_sapiens/ucsc/hg19.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,8 @@
- genome: hg19
species: homo_sapiens
source: ucsc
fasta: "s3://ngi-igenomes/igenomes/Homo_sapiens/UCSC/hg19/Sequence/WholeGenomeFasta/genome.fa"
gtf: "s3://ngi-igenomes/igenomes/Homo_sapiens/UCSC/hg19/Annotation/Genes/genes.gtf"
bed12: "s3://ngi-igenomes/igenomes/Homo_sapiens/UCSC/hg19/Annotation/Genes/genes.bed"
mito_name: chrM
macs_gsize: 2.7e9
8 changes: 8 additions & 0 deletions assets/genomes/homo_sapiens/ucsc/hg38.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,8 @@
- genome: hg38
species: homo_sapiens
source: ucsc
fasta: "s3://ngi-igenomes/igenomes/Homo_sapiens/UCSC/hg38/Sequence/WholeGenomeFasta/genome.fa"
gtf: "s3://ngi-igenomes/igenomes/Homo_sapiens/UCSC/hg38/Annotation/Genes/genes.gtf"
bed12: "s3://ngi-igenomes/igenomes/Homo_sapiens/UCSC/hg38/Annotation/Genes/genes.bed"
mito_name: chrM
macs_gsize: 2.7e9
53 changes: 53 additions & 0 deletions assets/genomes/test/pipelines/R64-1-1.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,53 @@
- genome: R64-1-1
fasta: s3://ngi-igenomes/igenomes/Saccharomyces_cerevisiae/Ensembl/R64-1-1/Sequence/WholeGenomeFasta/genome.fa
gtf: s3://ngi-igenomes/igenomes/Saccharomyces_cerevisiae/Ensembl/R64-1-1/Annotation/Genes/genes.gtf
bed12: s3://ngi-igenomes/igenomes/Saccharomyces_cerevisiae/Ensembl/R64-1-1/Annotation/Genes/genes.bed
mito_name: MT
macs_gsize: 1.2e7
readme: s3://ngi-igenomes/igenomes/Saccharomyces_cerevisiae/Ensembl/R64-1-1/Annotation/README.txt
# TODO
# Required
# reference_id:
# type: string
# default: R64-1-1
# reference_version:
# type: string
# default: '111'
# created_at:
# type: string
# format: date
# default: 2024-02-07

# # Source specific
# source_type:
# type: string
# enum:
# - ensembl
# - ucsc
# - ncbi
# - gencode
# - refseq
# - encode
# - custom

# # OR Manually submitted
# # Each optional, build what we can based on what is provided
# fasta:
# type: string
# default:
# gtf:
# type: string
# default: s3://ngi-igenomes/igenomes/Saccharomyces_cerevisiae/Ensembl/R64-1-1/Annotation/Genes/genes.gtf
# bed12:
# type: string
# default: s3://ngi-igenomes/igenomes/Saccharomyces_cerevisiae/Ensembl/R64-1-1/Annotation/Genes/genes.bed
# mito_name:
# type: string
# default: MT
# macs_gsize:
# type: string
# default: 1.2e7

# # Markdown block?
# description:
# type: string
5 changes: 5 additions & 0 deletions assets/genomes/test/pipelines/nascent.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
- genome: "GRCh38_chr21"
fasta: "https://raw.githubusercontent.com/nf-core/test-datasets/nascent/reference/GRCh38_chr21.fa"
gtf: "https://raw.githubusercontent.com/nf-core/test-datasets/nascent/reference/genes_chr21.gtf"
mito_name: "MT"
readme: "https://raw.githubusercontent.com/nf-core/test-datasets/nascent/README.md"
48 changes: 35 additions & 13 deletions assets/schema_input.json
Original file line number Diff line number Diff line change
Expand Up @@ -7,27 +7,49 @@
"items": {
"type": "object",
"properties": {
"sample": {
"genome": {
"type": "string",
"pattern": "^\\S+$",
"errorMessage": "Sample name must be provided and cannot contain spaces",
"errorMessage": "Genome name must be provided and cannot contain spaces",
"meta": ["id"]
},
"fastq_1": {
"source": {
"type": "string",
"format": "file-path",
"exists": true,
"pattern": "^\\S+\\.f(ast)?q\\.gz$",
"errorMessage": "FastQ file for reads 1 must be provided, cannot contain spaces and must have extension '.fq.gz' or '.fastq.gz'"
"errorMessage": "Where the references came from",
"meta": ["source"]
},
"fastq_2": {
"species": {
"type": "string",
"format": "file-path",
"exists": true,
"pattern": "^\\S+\\.f(ast)?q\\.gz$",
"errorMessage": "FastQ file for reads 2 cannot contain spaces and must have extension '.fq.gz' or '.fastq.gz'"
"errorMessage": "Species of the reference",
"meta": ["species"]
},
"fasta": {
"type": "string",
"pattern": "^\\S+\\.f(ast|n)?a(\\.gz)?$",
"errorMessage": "TODO"
},
"gtf": {
"type": "string",
"pattern": "^\\S+\\.gtf(\\.gz)?$",
"errorMessage": "TODO"
},
"bed12": {
"type": "string",
"errorMessage": "TODO"
},
"readme": {
"type": "string",
"errorMessage": "TODO"
},
"mito_name": {
"type": "string",
"errorMessage": "TODO"
},
"macs_gsize": {
"type": "integer",
"errorMessage": "TODO"
}
},
"required": ["sample", "fastq_1"]
"required": ["genome", "fasta", "gtf"]
}
}
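The new schema validates reference paths with regex patterns rather than `file-path` checks. A quick sketch of what the `fasta` and `gtf` patterns accept, with the regexes copied verbatim from `assets/schema_input.json` (the test strings are illustrative):

```python
# Sketch: exercising the fasta/gtf path patterns from assets/schema_input.json.
import re

FASTA_RE = re.compile(r"^\S+\.f(ast|n)?a(\.gz)?$")  # .fa, .fasta, .fna, optionally .gz
GTF_RE = re.compile(r"^\S+\.gtf(\.gz)?$")           # .gtf, optionally .gz

print(bool(FASTA_RE.match("GRCh38.primary_assembly.genome.fa.gz")))  # True
print(bool(FASTA_RE.match("genome.fna")))                            # True
print(bool(FASTA_RE.match("genome fa")))                             # False: no spaces
print(bool(GTF_RE.match("genes.gtf")))                               # True
```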
8 changes: 6 additions & 2 deletions conf/base.config
Original file line number Diff line number Diff line change
Expand Up @@ -10,7 +10,6 @@

process {

// TODO nf-core: Check the defaults for all processes
cpus = { check_max( 1 * task.attempt, 'cpus' ) }
memory = { check_max( 6.GB * task.attempt, 'memory' ) }
time = { check_max( 4.h * task.attempt, 'time' ) }
Expand All @@ -24,7 +23,6 @@ process {
// These labels are used and recognised by default in DSL2 files hosted on nf-core/modules.
// If possible, it would be nice to keep the same label naming convention when
// adding in your local modules too.
// TODO nf-core: Customise requirements for specific processes.
// See https://www.nextflow.io/docs/latest/config.html#config-process-selectors
withLabel:process_single {
cpus = { check_max( 1 , 'cpus' ) }
Expand Down Expand Up @@ -59,4 +57,10 @@ process {
errorStrategy = 'retry'
maxRetries = 2
}
errorStrategy = 'ignore'
publishDir = [
path: { "${params.outdir}/${workflow.sessionId}/${meta.species}/${meta.source}/${meta.id}/${task.process.tokenize(':')[-1].tokenize('_')[0].toLowerCase()}" },
mode: params.publish_dir_mode,
saveAs: { filename -> filename.equals('versions.yml') ? null : filename }
]
}
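The `publishDir` path added above derives its last segment from the process name via `task.process.tokenize(':')[-1].tokenize('_')[0].toLowerCase()`. A Python sketch of that same transformation (the fully qualified process names are hypothetical examples, not taken from this repo):

```python
# Sketch: the publishDir subdirectory naming logic from conf/base.config,
# ported from Groovy to Python for illustration.
def publish_subdir(process_name: str) -> str:
    """Take the last colon-separated segment of the fully qualified process
    name, then the first underscore-separated token, lower-cased."""
    return process_name.split(":")[-1].split("_")[0].lower()

print(publish_subdir("NFCORE_REFERENCES:REFERENCES:BWAMEM1_INDEX"))  # bwamem1
print(publish_subdir("STAR_GENOMEGENERATE"))                         # star
```

So a tool's index lands in a per-tool folder named after the leading token of its process name.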
4 changes: 0 additions & 4 deletions conf/modules.config
Original file line number Diff line number Diff line change
Expand Up @@ -18,10 +18,6 @@ process {
saveAs: { filename -> filename.equals('versions.yml') ? null : filename }
]

withName: FASTQC {
ext.args = '--quiet'
}

withName: 'MULTIQC' {
ext.args = { params.multiqc_title ? "--title \"$params.multiqc_title\"" : '' }
publishDir = [
Expand Down
7 changes: 1 addition & 6 deletions conf/test.config
Original file line number Diff line number Diff line change
Expand Up @@ -20,10 +20,5 @@ params {
max_time = '6.h'

// Input data
// TODO nf-core: Specify the paths to your test data on nf-core/test-datasets
// TODO nf-core: Give any required params for the test so that command line flags are not needed
input = params.pipelines_testdata_base_path + 'viralrecon/samplesheet/samplesheet_test_illumina_amplicon.csv'

// Genome references
genome = 'R64-1-1'
input = "${projectDir}/assets/genomes/test/pipelines/nascent.yml"
}
16 changes: 16 additions & 0 deletions conf/test_data.config
Original file line number Diff line number Diff line change
@@ -0,0 +1,16 @@
// README:
// https://github.com/nf-core/test-datasets/blob/modules/README.md

params {
// Base directory for test data
test_data_base = "https://raw.githubusercontent.com/nf-core/test-datasets/modules"

test_data {
'sarscov2' {
'genome' {
genome_fasta = "${params.test_data_base}/data/genomics/sarscov2/genome/genome.fasta"
genome_gtf = "${params.test_data_base}/data/genomics/sarscov2/genome/genome.gtf"
}
}
}
}
18 changes: 18 additions & 0 deletions docs/retreat-brainstrorming.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,18 @@
# Brainstorming

## Generate

- md5 checksums (validate downloads if possible)

## Track within the pipeline

- software_versions
- copy of command.sh (or just save Nextflow report?)
- Asset input paths
- Show skipped reference types if already existed
- Allow appending to the readme (treat like changelog), in case new asset types added

## Strategy

When adding a new asset, build for the latest reference versions only. Do all genomes.
Optionally backfill old releases on demand if specifically triggered.
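For the "md5 checksums" idea above, a minimal sketch of hashing a reference file in chunks so large genomes don't need to fit in memory (file name and contents are illustrative):

```python
# Sketch: compute an md5 checksum for a downloaded reference, chunk by chunk,
# to validate it against a published digest.
import hashlib
from pathlib import Path
import tempfile

def md5sum(path, chunk_size=1 << 20):
    """Stream the file through md5 in 1 MiB chunks."""
    h = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

with tempfile.TemporaryDirectory() as tmp:
    ref = Path(tmp, "genome.fa")
    ref.write_bytes(b">chr1\nACGT\n")
    print(md5sum(ref))
```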
