The tool quandico
applies statistical methods to detect copy number
variations (CNV) in a sample by comparing the
read counts from next generation sequencing (NGS) performed after PCR-enrichment of regions of interest, typically a
set of genes with known or expected relevance for the sample, e.g. genes that play a role in cancer. Counts from a normal
control (ideally matched normal from the same individual, e.g. healthy tissue) are required.
Data are expected to be from gene-panels with primer-counts in the range of hundreds, and roughly double-digit number genes. It will not work well on whole-exome-enrichment data.
A detailed description can be found in the publication of this tool: Quantitative analysis of differences in copy numbers using read depth obtained from PCR-enriched samples and controls: BMC Bioinformatics 2015, 16:17.
The main script is written in R
and there are two helper scripts and one main driver script written in Perl.
- PCR-enrichment of target regions, sequencing and read-mapping (SAM/BAM file)
- Extract counts for every PCR amplicon (
qgetcounts
) - Cluster read-counts into regions (
qcluster
) - Compare counts per cluster and perform CNV calling (
quandico
)
Steps 2, 3 and 4 can be performed with one single command line call of the Perl
driver script quandico
. Step 4 alone can be performed inside R
.
Running quandico
requires R
with some commonly available packages from CRAN.
Please visit The R
-project homepage for advice how to install R
on your system.
Dependencies for quandico
should be installed automatically when installing the package.
Accessory applications such as a command-line driver script (quandico
), the script to extract counts of
mapped reads (qgetcounts
) and the region clustering script (qcluster
) require Perl. Perl is already installed on
most Linux systems, please check Perl.org for details. If you are using Windows we
recommend Strawberry Perl. All dependencies should automatically be installed
from CPAN.
Download the install.sh
shell script for Unix/Linux.
Check the configuration in the first section of that script, adjust
parameters and then start it in a clean directory, e.g.:
$ mkdir quandico
$ cd quandico
$ wget https://github.com/reineckef/quandico/raw/master/install.sh
$ sh install.sh
This is not well tested and may fail on various systems for various reasons.
If you are running Ubuntu linux, it is recommended to install Perl dependencies from repositories like so:
sudo apt-get install libenv-path-perl libgetopt-long-descriptive-perl libdatetime-perl libtest-script-perl
Thanks to @biocyberman #3 for this fix.
Install these R
packages, either by using R> install.packages('name')
inside an R
session
or by running the install command from a command line: $ R CMD INSTALL 'name'
:
- MASS
- ggplot2
- grid
- gridExtra
- naturalsort
- scales
The string n.m will be used for the version number (major.minor) of quandico
. This was 1.14 by the time of writing.
- Download the latest release and install the packaged
R
code ofquandico
(filequandico_n.m.tar.gz
):
$ R CMD INSTALL quandico_n.m.tar.gz
optional but recommended
- Download the latest release and install the Perl module
QUANDICO
(fileQUANDICO-vn.m.tar.gz
) usingmake
on Linux, anddmake
on Windows:
$ tar xvfz QUANDICO-vn.m.tar.gz
$ cd QUANDICO-vn.m
$ perl Makefile.PL
$ [d]make
$ [d]make test
$ [d]make install # run this as sudo/root for system install
Alternatively, you can use cpanm
from App::cpanminus
to install to module directly from the downloaded archive.
We recommend to use --verbose
mode:
$ cpanm --verbose QUANDICO-vn.m.tar.gz
# run this as sudo/root for system install
To start from mapped reads in SAM/BAM format, the Perl script qgetcounts
will call samtools
(version 1.1 or later) to
extract the counts. Therefore, samtools
needs to be installed, please visit Samtools for advice.
Example data files to test the tools are available on Google drive. To start from mapped reads (in SAM or BAM files, please note that this step additionally requires samtools version >= 1.1).
- CNA902Y.bed amplicon coordinates
- M62_NA13019.bam example case (sample)
⚠️ 252 MB ! - M62_NA12878.bam example control (reference)
⚠️ 222 MB !
Please note, these files can also be downloaded using wget (below). Warning Google fails to provide the file if no virus check can be done, which seems to be the case for the BAM files. You need to accept that on a per-click basis, so only via the web-browser, not wget
:
$ wget --no-check-certificate "https://drive.google.com/uc?export=download&id=0BzLnl09R3GITQjBUZUFVcy1BNFk" -O CNA902Y.bed
$ # fails: wget --no-check-certificate "https://drive.google.com/uc?export=download&id=0BzLnl09R3GITMzNyakhveTh3UVE" -O M62_NA13019.bam
$ # fails: wget --no-check-certificate "https://drive.google.com/uc?export=download&id=0BzLnl09R3GITSnU1TlVRSjRXRHM" -O M62_NA12878.bam
To skip count extraction and start with clustering, the extracted counts of these samples are also available:
- M62_NA13019.tsv control counts (sample)
- M62_NA12878.tsv control counts (reference)
Please note, these files can also be downloaded using wget (sometimes these links do not work, for unclear reason):
$ wget --no-check-certificate "https://drive.google.com/uc?export=download&id=0BzLnl09R3GITUDZ0aXFBd2pDR0k" -O M62_NA13019.tsv
$ wget --no-check-certificate "https://drive.google.com/uc?export=download&id=0BzLnl09R3GITWU9xTndtZE5iOEE" -O M62_NA12878.tsv
A file with gene names and coordinates is required if clusters should be named using gene names. For human assemblies GRCh37 (hg19) and GRCh38 (hg38), these file can be downloaded from UCSC:
- refGene.txt.gz assembly hg19 (5 MB)
- refGene.txt.gz assembly hg38 (5 MB)
Clustered count files ready for processing with the final step are these:
Please note, these files can also be downloaded using wget (sometimes these links do not work, for unclear reason):
$ wget --no-check-certificate "https://drive.google.com/uc?export=download&id=0BzLnl09R3GITYTduNDY4azJNZXM" -O M62_NA13019.clustered
$ wget --no-check-certificate "https://drive.google.com/uc?export=download&id=0BzLnl09R3GITWm1FS0duczVlejQ" -O M62_NA12878.clustered
You can run all steps separately:
# extract counts
$ qgetcounts -i M62_NA13019.bam -a CNA902Y.bed > M62_NA13019.tsv
$ qgetcounts -i M62_NA12878.bam -a CNA902Y.bed > M62_NA12878.tsv
# cluster the counts
$ qcluster -i M62_NA13019.tsv [--names refGene.txt] > M62_NA13019.clustered
$ qcluster -i M62_NA12878.tsv [--names refGene.txt] > M62_NA12878.clustered
# call copy numbers
$ quandico --no-cluster \
-s data=M62_NA13019.clustered \ # file with clustered counts
-r data=M62_NA12878.clustered \ # file with clustered counts
-s x=2 -s y=0 -r x=2 -r y=0 \ # sample (-s) and reference (-r) are female
[--cp names=refGene.txt] # optional for naming of clusters
Alternatively, all this can be done using one single command:
$ quandico -s map=M62_NA13019.bam -s x=2 -s y=0 \ # sample
-r map=M62_NA12878.bam -r x=2 -r y=0 \ # reference
-A CNA902Y.bed \ # amplicons
-d results -b 13019_vs_12878 \ # output location and name
[--cp names=refGene.txt] # optional cluster names
In the folder 'ExampleOutput' the result files obtained by running the above mentioned steps with sample NA13019 and reference NA12878 from sequencing run M62 are presented. These are:
- NA13019_overview.pdf an overview for all clusters on one page
- NA13019.pdf a detailed plot for every cluster
- NA13019.vcf VCF of called variants
- NA13019.csv the called variants in tabular format
An image from one page (number 19) from the detailed PDF shows a loss (top) and a gain (bottom) event in that sample on chromsome X:
The Perl command-line tools present a help and usage information when run without arguments (or by explicitly using -h
or --help
).
The R
function provides help (after installation) by doing:
R> library(quandico)
R> ?quandico
Please report bugs by opening a new issued here on GitHub.
Fot other inquiries (no bug reports, please) you can email me.