
Merge pull request #19 from maxulysse/dev
MERGE TEMPLATE
edmundmiller authored May 21, 2024
2 parents eb18e02 + 883e5f7 commit 7a3b939
Showing 103 changed files with 2,611 additions and 737 deletions.
1 change: 1 addition & 0 deletions .github/CODEOWNERS
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
workflows/sarek/** @nf-core/sarek
49 changes: 49 additions & 0 deletions .github/workflows/build_reference.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,49 @@
name: Build reference genomes that changed
on:
push:
branches:
- main
paths:
- "assets/genomes/*.yml"

jobs:
run-tower:
name: Run AWS full tests
if: github.repository == 'nf-core/nascent'
runs-on: ubuntu-latest
steps:
- name: Find changed genomes
id: changed-genome-files
uses: tj-actions/changed-files@v42
with:
files: |
assets/genomes/*.yml
- name: Concatenate all the YAMLs together
if: steps.changed-genome-files.outputs.any_changed == 'true'
env:
CHANGED_FILES: ${{ steps.changed-genome-files.outputs.all_changed_files }}
run: cat ${CHANGED_FILES} > samplesheet.yml
# - name: Upload samplesheet.yml to s3 or Tower Datasets
# run: TODO
- name: Launch workflow via tower
uses: seqeralabs/action-tower-launch@v2
with:
workspace_id: ${{ secrets.TOWER_WORKSPACE_ID }}
access_token: ${{ secrets.TOWER_ACCESS_TOKEN }}
compute_env: ${{ secrets.TOWER_COMPUTE_ENV }}
revision: ${{ github.sha }}
workdir: s3://${{ secrets.AWS_S3_SCRATCH_BUCKET }}/work
parameters: |
{
"input": "samplesheet.yml",
"hook_url": "${{ secrets.MEGATESTS_ALERTS_SLACK_HOOK_URL }}",
"outdir": "s3://${{ secrets.AWS_S3_BUCKET }}/nascent/results-${{ github.sha }}"
}
profiles: cloud

- uses: actions/upload-artifact@v4
with:
name: Tower debug log file
path: |
tower_action_*.log
tower_action_*.json
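The concatenation step above works because each per-genome file is a top-level YAML list, so `cat`-ing several of them together still yields one valid list. A minimal sketch of that assembly (file names and contents are illustrative, not the real assets):

```python
# Sketch: concatenating changed per-genome YAML files into samplesheet.yml,
# as the `cat ${CHANGED_FILES}` workflow step does.
from pathlib import Path
import tempfile

def build_samplesheet(changed_files, out_path):
    """Concatenate the changed genome YAML files into one samplesheet."""
    with open(out_path, "w") as out:
        for f in changed_files:
            out.write(Path(f).read_text())

with tempfile.TemporaryDirectory() as tmp:
    a = Path(tmp, "GRCh38.yml")
    a.write_text("- genome: GRCh38.p14\n  mito_name: MT\n")
    b = Path(tmp, "GRCm39.yml")
    b.write_text("- genome: GRCm39\n  mito_name: MT\n")
    sheet = Path(tmp, "samplesheet.yml")
    build_samplesheet([a, b], sheet)
    # Each source file contributed one top-level list entry
    entries = [l for l in sheet.read_text().splitlines() if l.startswith("- genome:")]
    print(len(entries))  # 2
```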
3 changes: 3 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -6,3 +6,6 @@ results/
testing/
testing*
*.pyc
.idea
*.log
tmp/
69 changes: 20 additions & 49 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -19,64 +19,35 @@

## Introduction

**nf-core/references** is a bioinformatics pipeline that ...
**nf-core/references** is a bioinformatics pipeline that builds references.

<!-- TODO nf-core:
Complete this sentence with a 2-3 sentence summary of what types of data the pipeline ingests, a brief overview of the
major pipeline sections and the types of output it produces. You're giving an overview to someone new
to nf-core here, in 15-20 seconds. For an example, see https://github.com/nf-core/rnaseq/blob/master/README.md#introduction
-->
## How to hack on it

<!-- TODO nf-core: Include a figure that guides the user through the major workflow steps. Many nf-core
workflows use the "tube map" design for that. See https://nf-co.re/docs/contributing/design_guidelines#examples for examples. -->
<!-- TODO nf-core: Fill in short bullet-pointed list of the default steps in the pipeline -->
0. Have Docker and Nextflow installed
1. `nextflow run main.nf`

1. Read QC ([`FastQC`](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/))
2. Present QC for raw reads ([`MultiQC`](http://multiqc.info/))
## Some thoughts on reference building

## Usage
- We could use a glob: if you just drop a fasta in the s3 bucket, it'll get picked up and new resources built
- Could take this a step further and make it a little config file that has the fasta, gtf, genome_size, etc.
- How do we avoid rebuilding? Ideally we should build once per new minor release of an aligner/reference. IMO kinda low priority, because the main cost is going to be egress, not compute.
- How much effort is too much effort?
- Should it be as easy as adding a file on s3?
- No, that shouldn't be a requirement; you should be able to link to a reference externally (a "source of truth", i.e. an FTP link), and the workflow will build the references
- So, like mulled biocontainers, just make a PR to the samplesheet and boom, new reference in the s3 bucket if it's approved?

> [!NOTE]
> If you are new to Nextflow and nf-core, please refer to [this page](https://nf-co.re/docs/usage/installation) on how to set-up Nextflow. Make sure to [test your setup](https://nf-co.re/docs/usage/introduction#how-to-run-a-pipeline) with `-profile test` before running the workflow on actual data.
# Roadmap

<!-- TODO nf-core: Describe the minimum required steps to execute the pipeline, e.g. how to prepare samplesheets.
Explain what rows and columns represent. For instance (please edit as appropriate):
PoC:

First, prepare a samplesheet with your input data that looks as follows:
- Replace aws-igenomes
- bwa, bowtie2, star, bismark need to be built
- fasta, gtf, bed12, mito_name, macs_gsize blacklist, copied over

`samplesheet.csv`:
Other nice things to have:

```csv
sample,fastq_1,fastq_2
CONTROL_REP1,AEG588A1_S1_L002_R1_001.fastq.gz,AEG588A1_S1_L002_R2_001.fastq.gz
```
Each row represents a fastq file (single-end) or a pair of fastq files (paired end).
-->

Now, you can run the pipeline using:

<!-- TODO nf-core: update the following command to include all required parameters for a minimal example -->

```bash
nextflow run nf-core/references \
-profile <docker/singularity/.../institute> \
--input samplesheet.csv \
--outdir <OUTDIR>
```

> [!WARNING]
> Please provide pipeline parameters via the CLI or Nextflow `-params-file` option. Custom config files including those provided by the `-c` Nextflow option can be used to provide any configuration _**except for parameters**_;
> see [docs](https://nf-co.re/usage/configuration#custom-configuration-files).
For more details and further functionality, please refer to the [usage documentation](https://nf-co.re/references/usage) and the [parameter documentation](https://nf-co.re/references/parameters).

## Pipeline output

To see the results of an example test run with a full size dataset refer to the [results](https://nf-co.re/references/results) tab on the nf-core website pipeline page.
For more details about the output files and reports, please refer to the
[output documentation](https://nf-co.re/references/output).
- Building our test-datasets
- Downsampling for unified genomics test dataset creation (thinking about viralintegration/rnaseq/wgs) and spiking in test cases of interest (specific variants, for example)

## Credits

Expand Down
7 changes: 7 additions & 0 deletions assets/genomes/GRCh38.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,7 @@
# FIXME Someone check this
- genome: GRCh38.p14
fasta: https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_45/GRCh38.primary_assembly.genome.fa.gz
gtf: https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_45/gencode.v45.chr_patch_hapl_scaff.annotation.gtf.gz
mito_name: MT
site: https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000001405.40
reference_version: GCF_000001405.40
6 changes: 6 additions & 0 deletions assets/genomes/GRCm39.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,6 @@
- genome: GRCm39
fasta: https://hgdownload.soe.ucsc.edu/goldenPath/mm39/bigZips/mm39.fa.gz
gtf: https://hgdownload.soe.ucsc.edu/goldenPath/mm39/bigZips/genes/mm39.ncbiRefSeq.gtf.gz
mito_name: MT
site: https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000001635.27/
reference_version: GCF_000001635.27
5 changes: 5 additions & 0 deletions assets/genomes/homo_sapiens/ucsc/chm13.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
- genome: "CHM13"
species: homo_sapiens
source: ucsc
fasta: "https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/009/914/755/GCF_009914755.1_T2T-CHM13v2.0/GCF_009914755.1_T2T-CHM13v2.0_genomic.fna.gz"
gtf: "https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/009/914/755/GCF_009914755.1_T2T-CHM13v2.0/GCF_009914755.1_T2T-CHM13v2.0_genomic.gtf.gz"
8 changes: 8 additions & 0 deletions assets/genomes/homo_sapiens/ucsc/hg19.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,8 @@
- genome: hg19
species: homo_sapiens
source: ucsc
fasta: "s3://ngi-igenomes/igenomes/Homo_sapiens/UCSC/hg19/Sequence/WholeGenomeFasta/genome.fa"
gtf: "s3://ngi-igenomes/igenomes/Homo_sapiens/UCSC/hg19/Annotation/Genes/genes.gtf"
bed12: "s3://ngi-igenomes/igenomes/Homo_sapiens/UCSC/hg19/Annotation/Genes/genes.bed"
mito_name: chrM
macs_gsize: 2.7e9
8 changes: 8 additions & 0 deletions assets/genomes/homo_sapiens/ucsc/hg38.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,8 @@
- genome: hg38
species: homo_sapiens
source: ucsc
fasta: "s3://ngi-igenomes/igenomes/Homo_sapiens/UCSC/hg38/Sequence/WholeGenomeFasta/genome.fa"
gtf: "s3://ngi-igenomes/igenomes/Homo_sapiens/UCSC/hg38/Annotation/Genes/genes.gtf"
bed12: "s3://ngi-igenomes/igenomes/Homo_sapiens/UCSC/hg38/Annotation/Genes/genes.bed"
mito_name: chrM
macs_gsize: 2.7e9
53 changes: 53 additions & 0 deletions assets/genomes/test/pipelines/R64-1-1.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,53 @@
- genome: R64-1-1
fasta: s3://ngi-igenomes/igenomes/Saccharomyces_cerevisiae/Ensembl/R64-1-1/Sequence/WholeGenomeFasta/genome.fa
gtf: s3://ngi-igenomes/igenomes/Saccharomyces_cerevisiae/Ensembl/R64-1-1/Annotation/Genes/genes.gtf
bed12: s3://ngi-igenomes/igenomes/Saccharomyces_cerevisiae/Ensembl/R64-1-1/Annotation/Genes/genes.bed
mito_name: MT
macs_gsize: 1.2e7
readme: s3://ngi-igenomes/igenomes/Saccharomyces_cerevisiae/Ensembl/R64-1-1/Annotation/README.txt
# TODO
# Required
# reference_id:
# type: string
# default: R64-1-1
# reference_version:
# type: string
# default: '111'
# created_at:
# type: string
# format: date
# default: 2024-02-07

# # Source specific
# source_type:
# type: string
# enum:
# - ensembl
# - ucsc
# - ncbi
# - gencode
# - refseq
# - encode
# - custom

# # OR Manually submitted
# # Each optional, build what we can based on what is provided
# fasta:
# type: string
# default:
# gtf:
# type: string
# default: s3://ngi-igenomes/igenomes/Saccharomyces_cerevisiae/Ensembl/R64-1-1/Annotation/Genes/genes.gtf
# bed12:
# type: string
# default: s3://ngi-igenomes/igenomes/Saccharomyces_cerevisiae/Ensembl/R64-1-1/Annotation/Genes/genes.bed
# mito_name:
# type: string
# default: MT
# macs_gsize:
# type: string
# default: 1.2e7

# # Markdown block?
# description:
# type: string
5 changes: 5 additions & 0 deletions assets/genomes/test/pipelines/nascent.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
- genome: "GRCh38_chr21"
fasta: "https://raw.githubusercontent.com/nf-core/test-datasets/nascent/reference/GRCh38_chr21.fa"
gtf: "https://raw.githubusercontent.com/nf-core/test-datasets/nascent/reference/genes_chr21.gtf"
mito_name: "MT"
readme: "https://raw.githubusercontent.com/nf-core/test-datasets/nascent/README.md"
48 changes: 35 additions & 13 deletions assets/schema_input.json
Original file line number Diff line number Diff line change
Expand Up @@ -7,27 +7,49 @@
"items": {
"type": "object",
"properties": {
"sample": {
"genome": {
"type": "string",
"pattern": "^\\S+$",
"errorMessage": "Sample name must be provided and cannot contain spaces",
"errorMessage": "Genome name must be provided and cannot contain spaces",
"meta": ["id"]
},
"fastq_1": {
"source": {
"type": "string",
"format": "file-path",
"exists": true,
"pattern": "^\\S+\\.f(ast)?q\\.gz$",
"errorMessage": "FastQ file for reads 1 must be provided, cannot contain spaces and must have extension '.fq.gz' or '.fastq.gz'"
"errorMessage": "Where the references came from",
"meta": ["source"]
},
"fastq_2": {
"species": {
"type": "string",
"format": "file-path",
"exists": true,
"pattern": "^\\S+\\.f(ast)?q\\.gz$",
"errorMessage": "FastQ file for reads 2 cannot contain spaces and must have extension '.fq.gz' or '.fastq.gz'"
"errorMessage": "Species of the reference",
"meta": ["species"]
},
"fasta": {
"type": "string",
"pattern": "^\\S+\\.f(ast|n)?a(\\.gz)?$",
"errorMessage": "TODO"
},
"gtf": {
"type": "string",
"pattern": "^\\S+\\.gtf(\\.gz)?$",
"errorMessage": "TODO"
},
"bed12": {
"type": "string",
"errorMessage": "TODO"
},
"readme": {
"type": "string",
"errorMessage": "TODO"
},
"mito_name": {
"type": "string",
"errorMessage": "TODO"
},
"macs_gsize": {
"type": "integer",
"errorMessage": "TODO"
}
},
"required": ["sample", "fastq_1"]
"required": ["genome", "fasta", "gtf"]
}
}
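The new schema validates reference paths with regex patterns rather than `file-path` checks. A quick sketch of what the `fasta` and `gtf` patterns accept, with the regexes copied verbatim from `assets/schema_input.json` (the test strings are illustrative):

```python
# Sketch: exercising the fasta/gtf path patterns from assets/schema_input.json.
import re

FASTA_RE = re.compile(r"^\S+\.f(ast|n)?a(\.gz)?$")  # .fa, .fasta, .fna, optionally .gz
GTF_RE = re.compile(r"^\S+\.gtf(\.gz)?$")           # .gtf, optionally .gz

print(bool(FASTA_RE.match("GRCh38.primary_assembly.genome.fa.gz")))  # True
print(bool(FASTA_RE.match("genome.fna")))                            # True
print(bool(FASTA_RE.match("genome fa")))                             # False: no spaces
print(bool(GTF_RE.match("genes.gtf")))                               # True
```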
8 changes: 6 additions & 2 deletions conf/base.config
Original file line number Diff line number Diff line change
Expand Up @@ -10,7 +10,6 @@

process {

// TODO nf-core: Check the defaults for all processes
cpus = { check_max( 1 * task.attempt, 'cpus' ) }
memory = { check_max( 6.GB * task.attempt, 'memory' ) }
time = { check_max( 4.h * task.attempt, 'time' ) }
Expand All @@ -24,7 +23,6 @@ process {
// These labels are used and recognised by default in DSL2 files hosted on nf-core/modules.
// If possible, it would be nice to keep the same label naming convention when
// adding in your local modules too.
// TODO nf-core: Customise requirements for specific processes.
// See https://www.nextflow.io/docs/latest/config.html#config-process-selectors
withLabel:process_single {
cpus = { check_max( 1 , 'cpus' ) }
Expand Down Expand Up @@ -59,4 +57,10 @@ process {
errorStrategy = 'retry'
maxRetries = 2
}
errorStrategy = 'ignore'
publishDir = [
path: { "${params.outdir}/${workflow.sessionId}/${meta.species}/${meta.source}/${meta.id}/${task.process.tokenize(':')[-1].tokenize('_')[0].toLowerCase()}" },
mode: params.publish_dir_mode,
saveAs: { filename -> filename.equals('versions.yml') ? null : filename }
]
}
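The `publishDir` path added above derives its last segment from the process name via `task.process.tokenize(':')[-1].tokenize('_')[0].toLowerCase()`. A Python sketch of that same transformation (the fully qualified process names are hypothetical examples, not taken from this repo):

```python
# Sketch: the publishDir subdirectory naming logic from conf/base.config,
# ported from Groovy to Python for illustration.
def publish_subdir(process_name: str) -> str:
    """Take the last colon-separated segment of the fully qualified process
    name, then the first underscore-separated token, lower-cased."""
    return process_name.split(":")[-1].split("_")[0].lower()

print(publish_subdir("NFCORE_REFERENCES:REFERENCES:BWAMEM1_INDEX"))  # bwamem1
print(publish_subdir("STAR_GENOMEGENERATE"))                         # star
```

So a tool's index lands in a per-tool folder named after the leading token of its process name.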
4 changes: 0 additions & 4 deletions conf/modules.config
Original file line number Diff line number Diff line change
Expand Up @@ -18,10 +18,6 @@ process {
saveAs: { filename -> filename.equals('versions.yml') ? null : filename }
]

withName: FASTQC {
ext.args = '--quiet'
}

withName: 'MULTIQC' {
ext.args = { params.multiqc_title ? "--title \"$params.multiqc_title\"" : '' }
publishDir = [
Expand Down
7 changes: 1 addition & 6 deletions conf/test.config
Original file line number Diff line number Diff line change
Expand Up @@ -20,10 +20,5 @@ params {
max_time = '6.h'

// Input data
// TODO nf-core: Specify the paths to your test data on nf-core/test-datasets
// TODO nf-core: Give any required params for the test so that command line flags are not needed
input = params.pipelines_testdata_base_path + 'viralrecon/samplesheet/samplesheet_test_illumina_amplicon.csv'

// Genome references
genome = 'R64-1-1'
input = "${projectDir}/assets/genomes/test/pipelines/nascent.yml"
}
16 changes: 16 additions & 0 deletions conf/test_data.config
Original file line number Diff line number Diff line change
@@ -0,0 +1,16 @@
// README:
// https://github.com/nf-core/test-datasets/blob/modules/README.md

params {
// Base directory for test data
test_data_base = "https://raw.githubusercontent.com/nf-core/test-datasets/modules"

test_data {
'sarscov2' {
'genome' {
genome_fasta = "${params.test_data_base}/data/genomics/sarscov2/genome/genome.fasta"
genome_gtf = "${params.test_data_base}/data/genomics/sarscov2/genome/genome.gtf"
}
}
}
}
18 changes: 18 additions & 0 deletions docs/retreat-brainstrorming.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,18 @@
# Brainstorming

## Generate

- md5 checksums (validate downloads if possible)

## Track within the pipeline

- software_versions
- copy of command.sh (or just save Nextflow report?)
- Asset input paths
- Show skipped reference types if already existed
- Allow appending to the readme (treat like changelog), in case new asset types added

## Strategy

When adding a new asset, build for the latest reference versions only. Do all genomes.
Optionally backfill old releases on demand if specifically triggered.
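For the "md5 checksums" idea above, a minimal sketch of hashing a reference file in chunks so large genomes don't need to fit in memory (file name and contents are illustrative):

```python
# Sketch: compute an md5 checksum for a downloaded reference, chunk by chunk,
# to validate it against a published digest.
import hashlib
from pathlib import Path
import tempfile

def md5sum(path, chunk_size=1 << 20):
    """Stream the file through md5 in 1 MiB chunks."""
    h = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

with tempfile.TemporaryDirectory() as tmp:
    ref = Path(tmp, "genome.fa")
    ref.write_bytes(b">chr1\nACGT\n")
    print(md5sum(ref))
```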
