First we will try NVIDIA Clara-Parabricks. The container is distributed as a Docker image, but Polaris does not allow Docker. To use Parabricks on Polaris we must therefore first build a Singularity container from that image:
module use /soft/spack/gcc/0.6.1/install/modulefiles/Core
module load singularityce
mkdir -p /grand/projects/GeomicVar/rodriguez/1kg_proj/data/tools/singularity
cd /grand/projects/GeomicVar/rodriguez/1kg_proj/data/tools/singularity
singularity build --tmpdir ~/. parabricks-4.3 docker://
Now we can run a simple interactive job to check that the Singularity container build succeeded.
arodriguez@x3004c0s13b1n0:~> qsub -A geomicVar -I -l select=1 -l walltime=1:00:00 -l filesystems=home:eagle -q debug
qsub: waiting for job to start
qsub: job ready
module use /soft/spack/gcc/0.6.1/install/modulefiles/Core
module load singularityce
cd /grand/projects/GeomicVar/rodriguez/1kg_proj/data/tools/singularity
arodriguez@x3004c0s13b1n0:~> singularity exec --bind /grand/projects/GeomicVar/rodriguez:/grand/projects/GeomicVar/rodriguez ./parabricks-4.3 pbrun -h
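The check above can be scripted so that success or failure is reported from the exit status rather than read off the help text. This is a minimal sketch; `smoke_test` is a hypothetical helper name, not part of Singularity or Parabricks.

```shell
# Hypothetical helper: run any command quietly and report whether it
# exited with status 0.
smoke_test() {
    if "$@" > /dev/null 2>&1; then
        echo "OK"
    else
        echo "FAILED"
    fi
}

# Intended use on a compute node, with the paths from the steps above:
#   smoke_test singularity exec \
#       --bind /grand/projects/GeomicVar/rodriguez:/grand/projects/GeomicVar/rodriguez \
#       ./parabricks-4.3 pbrun -h
# Assumption: adding --nv and running nvidia-smi in place of "pbrun -h"
# would additionally check that the GPUs are visible inside the container.
```

A missing image or a failed build makes the wrapped command exit non-zero, so the helper prints FAILED in both cases.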
NVIDIA Clara-Parabricks provides multiple tools and workflows. We will be testing the tools specified in the link above as well as the workflows for germline variant detection using both low-coverage and 30X whole-genome samples.
The reference data used is GRCh38 and the directions on how to download this were taken from here.
The deepvariant-germline workflow runs through multiple tools such as BWA-MEM, duplicate marking, and DeepVariant. It takes FASTQ files as input and produces an aligned BAM file and a VCF file as outputs. The Parabricks versions of these tools are GPU-accelerated, so the analysis takes advantage of the GPUs on Polaris.
The following was used to execute the analysis interactively on a Polaris node:
# Ask for an interactive node on Polaris and wait until this is provided
qsub -A covid-ct -I -l select=1 -l walltime=1:00:00 -l filesystems=home:eagle -q debug
# load the required modules
module use /soft/spack/gcc/0.6.1/install/modulefiles/Core
module load singularityce
# run the deepvariant-germline workflow using the low-coverage sequences
singularity run --bind /lus/grand/projects/covid-ct/arodriguez:/lus/grand/projects/covid-ct/arodriguez --nv /lus/grand/projects/covid-ct/arodriguez/tools/singularity/parabricks-4.0 pbrun deepvariant_germline --ref /lus/grand/projects/covid-ct/arodriguez/wgs_test/reference/hg38/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna --in-fq /lus/grand/projects/covid-ct/arodriguez/wgs_test/HG00138/low_cov/ERR016162_1.fastq.gz /lus/grand/projects/covid-ct/arodriguez/wgs_test/HG00138/low_cov/ERR016162_2.fastq.gz --out-bam /lus/grand/projects/covid-ct/arodriguez/wgs_test/HG00138/output/low_cov/HG00138.bam --out-variants /lus/grand/projects/covid-ct/arodriguez/wgs_test/HG00138/output/low_cov/HG00138.vcf
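Because the same project path appears many times in the command above, it can be easier to read and check when the repeated parts are held in variables. The sketch below rebuilds the low-coverage invocation that way, passing `--out-variants` once with the VCF path. The variable names, the `run_low_cov` function, and the `DRY_RUN` switch are illustrative and not part of Parabricks or Singularity.

```shell
# Paths taken from the command above; variable names are illustrative.
BASE=/lus/grand/projects/covid-ct/arodriguez
IMAGE=$BASE/tools/singularity/parabricks-4.0
SAMPLE_DIR=$BASE/wgs_test/HG00138
REF=$BASE/wgs_test/reference/hg38/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna

run_low_cov() {
    # Assemble the command as the function's positional parameters.
    set -- singularity run --bind "$BASE:$BASE" --nv "$IMAGE" \
        pbrun deepvariant_germline \
        --ref "$REF" \
        --in-fq "$SAMPLE_DIR/low_cov/ERR016162_1.fastq.gz" \
                "$SAMPLE_DIR/low_cov/ERR016162_2.fastq.gz" \
        --out-bam "$SAMPLE_DIR/output/low_cov/HG00138.bam" \
        --out-variants "$SAMPLE_DIR/output/low_cov/HG00138.vcf"

    # Hypothetical switch: print the command by default, execute it
    # only when DRY_RUN=0 is set in the environment.
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

run_low_cov
```

Printing the command first makes it possible to compare it against the one-line form before spending time on a compute node.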
The log file of the run can be reached here.
The execution times can be seen in the table below:
Tools | Description | Version | Execution Time (seconds) | GPU Usage |
Parabricks accelerated Genomics Pipeline | BWA-mem Sorting Phase-I | 4.0.0-1 | 105 | X |
Parabricks accelerated Genomics Pipeline | Sorting Phase-II | 4.0.0-1 | 10 | X |
Parabricks accelerated Genomics Pipeline | Marking Duplicates, BQSR | 4.0.0-1 | 20 | X |
Parabricks accelerated Genomics Pipeline | deepvariant | 4.0.0-1 | 337 | X |
Total | | | 472 (7.9 minutes) | |
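The total in the table is the sum of the four stage times. A small sketch of that arithmetic, with hypothetical helper names `total_seconds` and `to_minutes`:

```shell
# Sum any number of per-stage times given in whole seconds.
total_seconds() {
    total=0
    for t in "$@"; do
        total=$((total + t))
    done
    echo "$total"
}

# Convert seconds to minutes, rounded to one decimal place.
to_minutes() {
    awk -v s="$1" 'BEGIN { printf "%.1f\n", s / 60 }'
}

# Stage times from the low-coverage table above.
sum=$(total_seconds 105 10 20 337)
echo "$sum seconds = $(to_minutes "$sum") minutes"
# prints: 472 seconds = 7.9 minutes
```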
The input data for the 30X sequences are in CRAM format. Since the Parabricks pipeline requires fastq as input, we will need to run the Parabricks tool bam2fq before running the pipeline.
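The conversion step and the pipeline are linked through file names. Judging from the paths used in the commands below, bam2fq is given an `--out-prefix` and the pipeline then reads `<prefix>_1.fastq.gz` and `<prefix>_2.fastq.gz`. The sketch below only illustrates that inferred naming pattern; `fastq_pair` is a hypothetical helper, and the pattern itself should be confirmed against the files bam2fq actually writes.

```shell
# Given a bam2fq output prefix, print the two paired-end FASTQ names
# that the pipeline command expects (pattern inferred from the paths
# used in this document, not from the Parabricks documentation).
fastq_pair() {
    prefix=$1
    echo "${prefix}_1.fastq.gz ${prefix}_2.fastq.gz"
}

fastq_pair /lus/grand/projects/covid-ct/arodriguez/wgs_test/HG00138/30x_cov/HG00138_30x_2.fastq
```

Under this pattern, a prefix that already ends in `.fastq` yields names such as `HG00138_30x_2.fastq_1.fastq.gz`, so a prefix without the extension would give shorter names.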
The following was used to execute the analysis interactively on a Polaris node:
# Ask for an interactive node on Polaris and wait until this is provided
qsub -A covid-ct -I -l select=1 -l walltime=1:00:00 -l filesystems=home:eagle -q debug
# load the required modules
module use /soft/spack/gcc/0.6.1/install/modulefiles/Core
module load singularityce
# convert from CRAM to fq
# the singularity command needs the -H and -W parameters so that temporary files are written under the specified project paths instead of the home directory,
# which can fill up and kill the job
singularity run -H /lus/grand/projects/covid-ct/arodriguez -W /lus/grand/projects/covid-ct/arodriguez/tmp --bind /lus/grand/projects/covid-ct/arodriguez:/lus/grand/projects/covid-ct/arodriguez --nv /lus/grand/projects/covid-ct/arodriguez/tools/singularity/parabricks-4.0 pbrun bam2fq --in-bam /lus/grand/projects/covid-ct/arodriguez/wgs_test/HG00138/30x_cov/ --out-prefix /lus/grand/projects/covid-ct/arodriguez/wgs_test/HG00138/30x_cov/HG00138_30x_2.fastq --ref /lus/grand/projects/covid-ct/arodriguez/wgs_test/reference/GRCh38_CRAM/GRCh38_full_analysis_set_plus_decoy_hla.fa
# run the deepvariant-germline workflow using the 30X sequences
singularity run -H /lus/grand/projects/covid-ct/arodriguez -W /lus/grand/projects/covid-ct/arodriguez/tmp --bind /lus/grand/projects/covid-ct/arodriguez:/lus/grand/projects/covid-ct/arodriguez --nv /lus/grand/projects/covid-ct/arodriguez/tools/singularity/parabricks-4.0 pbrun deepvariant_germline --ref /lus/grand/projects/covid-ct/arodriguez/wgs_test/reference/GRCh38_CRAM/GRCh38_full_analysis_set_plus_decoy_hla.fa --in-fq /lus/grand/projects/covid-ct/arodriguez/wgs_test/HG00138/30x_cov/HG00138_30x_2.fastq_1.fastq.gz /lus/grand/projects/covid-ct/arodriguez/wgs_test/HG00138/30x_cov/HG00138_30x_2.fastq_2.fastq.gz --out-bam /lus/grand/projects/covid-ct/arodriguez/wgs_test/HG00138/output/30x/HG00138.bam --out-variants /lus/grand/projects/covid-ct/arodriguez/wgs_test/HG00138/output/30x/HG00138.vcf
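Both commands above point `-W` at a `tmp` directory under the project path. A minimal sketch of preparing that directory before the run follows, assuming it should exist beforehand; `prepare_workdir` is a hypothetical helper name.

```shell
# Create the working directory if needed, report the free space on its
# filesystem to stderr, and print the directory path on stdout.
prepare_workdir() {
    workdir=$1
    mkdir -p "$workdir" || return 1
    df -h "$workdir" >&2
    echo "$workdir"
}

# Intended use with the path from the commands above:
#   WORKDIR=$(prepare_workdir /lus/grand/projects/covid-ct/arodriguez/tmp)
#   singularity run -H /lus/grand/projects/covid-ct/arodriguez -W "$WORKDIR" ...
```

Checking the free space up front is a cheap way to catch the full-filesystem failure described in the comments before the job starts.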
The log file of the run can be reached here.
Below are the execution times:
Tools | Description | Version | Execution Time (seconds) | GPU Usage |
Parabricks accelerated Genomics Pipeline | bam2fq | 4.0.0-1 | 1015 | X |
Parabricks accelerated Genomics Pipeline | BWA-mem Sorting Phase-I | 4.0.0-1 | 1689 | X |
Parabricks accelerated Genomics Pipeline | Sorting Phase-II | 4.0.0-1 | 50 | X |
Parabricks accelerated Genomics Pipeline | Marking Duplicates, BQSR | 4.0.0-1 | 100 | X |
Parabricks accelerated Genomics Pipeline | deepvariant | 4.0.0-1 | 1094 | X |
Total | | | 3948 (66 minutes) | |
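To see where the 30X run spends its time, each stage can be expressed as a share of the total, and the two runs can be compared as a ratio. This is a sketch of that arithmetic using the figures from the tables; `stage_share` and `ratio` are hypothetical helper names.

```shell
# Percentage of the total taken by one stage, rounded to a whole percent.
stage_share() {
    awk -v s="$1" -v t="$2" 'BEGIN { printf "%.0f%%\n", 100 * s / t }'
}

# Ratio of two run times, rounded to one decimal place.
ratio() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", a / b }'
}

echo "bam2fq:      $(stage_share 1015 3948)"   # 26%
echo "BWA-mem:     $(stage_share 1689 3948)"   # 43%
echo "deepvariant: $(stage_share 1094 3948)"   # 28%
echo "30X vs low coverage: $(ratio 3948 472)x" # 8.4x
```

On these numbers, alignment is the largest single stage of the 30X run, and the CRAM-to-FASTQ conversion alone accounts for about a quarter of the total.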