Please checkout our manuscript, now published Peer Community Journal.
Salehi, Sohrab; Dorri, Fatemeh; Chern, Kevin; Kabeer, Farhia; Rusk, Nicole; Funnell, Tyler; Williams, Marc J.; Lai, Daniel; Andronescu, Mirela; Campbell, Kieran R.; McPherson, Andrew; Aparicio, Samuel; Roth, Andrew; Shah, Sohrab P.; Bouchard-Côté, Alexandre. Cancer phylogenetic tree inference at scale from 1000s of single cell genomes. Peer Community Journal, Volume 3 (2023), article no. e63. doi : 10.24072/pcjournal.292. https://peercommunityjournal.org/articles/10.24072/pcjournal.292/
In this section, we demonstrate how Sitka can be used to perform inference on a dummy dataset, i.e., example/data/cnv.csv
. The additional tree-growing feature is described in continuation, in the next section. Note Tidyverse is required.
To install the inference software, change directory into sitka/
and run ./gradlew installDist
. Compiled binaries can be found in sitka/build/install/nowellpack/bin
. Add the above path to the PATH
environment variable. Alternatively, prepend binary invocations COMMAND-NAME
with the path, i.e., sitka/build/install/nowellpack/bin/COMMAND-NAME
.
Make ssure that you have the following Oracle Java 8 installed:
java version "1.8.0_202"
Java(TM) SE Runtime Environment (build 1.8.0_202-b08)
Formally, Sitka takes as input a cell by locus matrix whose entries take on binary values as described in the paper. Technically, the input CSV file should contain columns cells
, loci
, and tipInclusionProbabilities
(a cell by locus matrix in tidy format). For example, the first 4 rows might look like the following:
"cells","loci","tipInclusionProbabilities"
"cell_940","1_0_117005",1
"foo","X_5399878_10765752",0
"bar","11_11699622_13502537",1
...
Typically, analysis pipelines will output CSVs with CNV data as opposed to binary data, e.g., examples/data/cnv.csv
. To transform CNV data to binary data, use, for example, Rscript cntob.R -i data/cnv.csv -o data/binary.csv
(-i
for input, -o
for output) from the example/
directory.
The first step of the process is to correct for jitter noise. This can be performed via the corrupt-straighten
binary:
$ corrupt-straighten --input data/binary.csv --neighborhoodSize 2
where the argument neighborhoodSize
controls for the jitter neighborhood size.
The resulting CSV file, output.csv
can be found in results/latest
(or as stated in the standard output).
output.csv
will be of the same format as binary.csv
with, potentially, prefixes added to values. Note the input file will not be modified.
For convenience, copy the output file to the working directory.
$ cp results/latest/output.csv ./
Note the results/latest
directory is soft linked to a time-stamped (and hashed) directory until another binary is invoked. In other words, it is merely a shortcut to access the latest outputs, and should not be used in input paths. See the standard output for absolute paths.
The second step filters loci. This can be performed via the corrupt-filter
binary:
$ corrupt-filter --input output.csv --lowerFraction 0.05
where the argument lower-fraction
controls for the proportion threshold used in the filtering step.
Again, the output can be found in results/latest/filtered.csv
, and we copy it to the working directory.
$ cp results/latest/filtered.csv ./
Steps 1 and 2 are preprocessing steps. This step is where the phylogenetic tree inference takes place and is the bulk of the computation. To perform inference, we use the binary corrupt-infer-with-noisy-params
.
$ corrupt-infer-with-noisy-params \
--model.globalParameterization true \
--model.binaryMatrix filtered.csv \
--model.fprBound 0.1 \
--model.fnrBound 0.5 \
--engine PT \
--engine.initialization FORWARD \
--engine.nScans 10000 \
--engine.nPassesPerScan 1 \
--engine.nChains 10
Briefly, the model-specific arguments globalParameterization
,binaryMatrix
,fprBound
, fnrBound
control for, respectively, the global parameter model, the input file path, and upper bounds for false-positive and false-negative rates.
Set the nScans
, nChains
, nPassesPerScan
according to a balance of computational budget and dataset complexity.
Additional details for command line arguments can found on the Blang webpage, or via corrupt-infer-with-noisy-params --help
.
The output of the inference is a sequence of phylogenetic trees, phylo.csv
, which can be post-processed to obtain a point-estimate tree. We copy it into the working directory.
$ cp results/latest/samples/phylo.csv ./
The point estimate is computed in two steps: a) average tip inclusion probabilities, and b) greedy search. The average can be computed using the binary corrupt-average
, and the greedy estimate can be computed using corrupt-greedy
.
$ corrupt-average --csvFile phylo.csv --logisticTransform false
$ cp results/latest/average.csv ./
Reminder: results/latest/
is a softlink, and will be changed by the following command for greedy search.
$ corrupt-greedy --tipInclusionProbabilities ReadOnlyCLMatrix average.csv
Finally, the resulting tree in Newick format can be found at results/latest/tree.newick
.
$ cp results/latest/tree.newick ./
A tree-growing feature is used to place additional
- cells that were measured with the same set of loci as the original data, or
- loci that were measured with the same set of cells as the original data.
Placements are determined via maximum a posteriori estimates. To access this feature, invoke the corrupt-grow
binary.
For example, given tree.newick
from Step 4.
corrupt-grow --matrix ReadOnlyCLMatrix data/extra.csv --phylo file tree.newick