Skip to content

Commit

Permalink
Merge pull request #192 from andersen-lab/adaptive
Browse files Browse the repository at this point in the history
Adaptive regularization method
  • Loading branch information
joshuailevy authored Dec 12, 2023
2 parents 5824bc5 + c52e681 commit 1b14ad0
Show file tree
Hide file tree
Showing 7 changed files with 65 additions and 23 deletions.
1 change: 1 addition & 0 deletions .gitignore
Original file line number Diff line number Diff line change
@@ -1,3 +1,4 @@
.DS_Store
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
Expand Down
22 changes: 11 additions & 11 deletions freyja.egg-info/PKG-INFO
Original file line number Diff line number Diff line change
Expand Up @@ -52,7 +52,7 @@ We can then run Freyja on the output files using the commmand:
```
freyja demix [variants-file] [depth-file] --output [output-file]
```
This outputs to a tsv file that includes the lineages present, their corresponding abundances, and summarization by constellation. This method also includes a `--eps` option, which enables the user to define the minimum lineage abundance returned to the user (e.g. `--eps 0.0001`). A custom barcode file can be provided using the `--barcodes [path-to-barcode-file]` option. As the UShER tree now included proposed lineages, we now offer the ```--confirmedonly``` flag which removes unconfirmed lineages from the analysis. For additional flexibility and reproducibility of analyses, a custom lineage-to-contellation mapping metadata file can be provided using the `--meta` option. An example output should have the format
This outputs to a tsv file that includes the lineages present, their corresponding abundances, and summarization by constellation. This method also includes a `--eps` option, which enables the user to define the minimum lineage abundance returned to the user (e.g. `--eps 0.0001`). A custom barcode file can be provided using the `--barcodes [path-to-barcode-file]` option. As the UShER tree now included proposed lineages, we now offer the ```--confirmedonly``` flag which removes unconfirmed lineages from the analysis. For additional flexibility and reproducibility of analyses, a custom lineage-to-contellation mapping metadata file can be provided using the `--meta` option. A coverage depth minimum can be specified using the `--depthcutoff` option, which excludes sites with coverage less than the specified value. An example output should have the format


| | filename |
Expand All @@ -63,7 +63,7 @@ This outputs to a tsv file that includes the lineages present, their correspondi
| resid | 3.14159 |
| coverage | 95.8 |

Where ```summarized``` denotes a sum of all lineage abundances in a particular WHO designation (i.e. B.1.617.2 and AY.6 abundances are summed in the above example), otherwise they are grouped into "Other". The ```lineage``` array lists the identified lineages in descending order, and ```abundances``` contains the corresponding abundances estimates. The value of ```resid``` corresponds to the residual of the weighted least absolute devation problem used to estimate lineage abundances. The ```coverage``` value provides the 10x coverage estimate (percent of sites with 10 or greater reads- 10 is the default but can be modfied using the ```--covcut``` option in ```demix```).
Where ```summarized``` denotes a sum of all lineage abundances in a particular WHO designation (i.e. B.1.617.2 and AY.6 abundances are summed in the above example), otherwise they are grouped into "Other". The ```lineage``` array lists the identified lineages in descending order, and ```abundances``` contains the corresponding abundances estimates. Using the `--depthcutoff` option may result in some distinct lineages now having identical barcodes, which are grouped into the format `[lineage]-like(num)` (based on their shared phylogeny) in the output. A summary of this lineage grouping is outputted to `[output-file]_collapsed_lineages.yml`. The value of ```resid``` corresponds to the residual of the weighted least absolute devation problem used to estimate lineage abundances. The ```coverage``` value provides the 10x coverage estimate (percent of sites with 10 or greater reads- 10 is the default but can be modfied using the ```--covcut``` option in ```demix```) .

NOTE: The ```freyja variants``` output is stable in time, and does not need to be re-run to incorporate updated lineage designations/corresponding mutational barcodes, whereas the outputs of ```freyja demix``` will change as barcodes are updated (and thus ```demix``` should be re-run as new information is made available).

Expand All @@ -81,7 +81,7 @@ We now provide a fast bootstrapping method for freyja, which can be run using th
```
freyja boot [variants-file] [depth-file] --nt [number-of-cpus] --nb [number-of-bootstraps] --output_basename [base-name]
```
which results in two output files `base-name_lineages.csv` and `base-name_summarized.csv`, which contain the 0.025, 0.05,0.25,0.5 (median),0.75, 0.95, and 0.975 percentiles for each lineage and WHO designated VOI/VOC, respectively, as obtained via the bootstrap. We also provide the `--eps`, `--barcodes`, and `--meta` options as in `freyja demix`. We now also provide a `--boxplot` option, which should be specified in the form `--boxplot pdf` if you want the boxplot in pdf format.
which results in two output files: `base-name_lineages.csv` and `base-name_summarized.csv`, which contain the 0.025, 0.05, 0.25, 0.5 (median),0.75, 0.95, and 0.975 percentiles for each lineage and WHO designated VOI/VOC, respectively, as obtained via the bootstrap. If the `--rawboots` option is used, it will return two additional output files `base-name_lineages_boot.csv` and `base-name_summarized_boot.csv`, which contain the bootstrap estimates (rather than summary statistics). We also provide the `--eps`, `--barcodes`, and `--meta` options as in `freyja demix`. We now also provide a `--boxplot` option, which should be specified in the form `--boxplot pdf` if you want the boxplot in pdf format.

For rapid visualization of results, we also offer two utility methods for manipulating the "demixed" output files. The first is an aggregation method

Expand Down Expand Up @@ -142,7 +142,7 @@ We now provide tools for the analysis of indexed bam files given a set of mutati
```
freyja extract [query-mutations.csv] [input-bam] --output [outputname.bam] --same_read
```
The above command will extract reads containing one or more mutations of interest and save them to `[input-bam]_extracted.bam`. Additionally, the `--same_read` flag can be included to specify that all query mutations must occur on the same read. In order to exclude reads containing one or more mutations, use the following:
The above command will extract reads containing one or more mutations of interest and save them to `[input-bam]_extracted.bam`. Additionally, the `--same_read` flag can be included to specify that all query mutations must occur on the same set of paired reads (same UMI). In order to **exclude** reads containing one or more mutations, use the following:
```
freyja filter [query-mutations.csv] [input-bam] [min-site] [max-site] --output [outputname.bam]
```
Expand All @@ -152,15 +152,15 @@ C75T,G230A,A543C
(732:'TT'),(1349:'A'),(12333:'A')
(1443:32),(1599:2),(2036:3)
```
In many cases, it can be useful to study covariant mutations (i.e. mutations co-occurring on the same read). This can be performed with the command:
In many cases, it can be useful to study covariant mutations (i.e. mutations co-occurring on the same read pair). This can be performed with the command:
```
freyja covariants [input_bam] [min_site] [max_site] --output [outputname.tsv]
freyja covariants [input_bam] [min_site] [max_site] --ref-genome [ref-genome.fasta (defaults to NC_045512.2 Hu-1)] --gff-file [gene-annotations.gff] --output [outputname.tsv] --sort_by [count|freq|site]
```
This outputs to a tsv file that includes the mutations present in each set of covariants, their frequency (with respect to the total number of reads between `min_site` and `max_site`, inclusively), as well as the coverage ranges. Should the user wish to only consider reads that span the entire genomic region defined by (min_site, max_site), they may include the `--spans_region` flag. Inclusion thresholds for read-mapping quality and the number of observed instances of a set of covariants can be set using `--min_quality` and `--min_count` respectively. The output can then be passed into:
This outputs to a tsv file that includes the mutations present in each set of covariants, their absolute counts (the number of read pairs with the mutations), their coverage ranges (the minimum and maximum position for read-pairs with the mutations), their "maximum" counts (the number of read pairs that span the positions in the mutations), and their frequencies (the absolute count divided by the maximum count). Should the user wish to only consider read pairs that span the entire genomic region defined by (min_site, max_site), they may include the `--spans_region` flag. By default, the covariant patterns are sorted in descending order by count, however they can also be sorted in descending order by frequency by setting the `--sort_by` option to "freq", or sorted sequentially by mutation site by setting the `--sort_by` option to "site". The `--ref-genome` argument defaults to `freyja/data/NC_045512_Hu-1.fasta`. If you are using a different build to perfrom alignment, it is important to pass that file in to `--ref-genome` instead. Optionally, a gff file (e.g. `freyja/data/NC_045512_Hu-1.gff`) may be included via the `--gff-file` option to output amino acid mutations alongside nucleotide mutations. Inclusion thresholds for read-mapping quality and the number of observed instances of a set of covariants can be set using `--min_quality` and `--min_count` respectively. The output can then be passed into:
```
freyja plot-covariants [covariant_file.tsv] --output [plot-filename(.pdf,.png,etc.)]
```
This generates an informative heatmap visualization, showing variants as well as coverage across different reads. The threshold, `--min_mutations`, sets a minimum number of mutations per set of covariants for inclusion in the plot. Example output:
| **Covariants Plot**|
| :---: |
|![](freyja/data/example_covariants_plot.png)|
This generates an informative heatmap visualization showing the frequency and coverage ranges of each observed covariant pattern. Each pattern is listed in the format CP(n). `--num_clusters` sets the number of patterns (i.e. rows in the plot) to display, while the threshold `--min_mutations` sets a minimum number of mutations per set of covariants for inclusion in the plot. The flag `--nt_muts` can be included to provide nucleotide-specific information in the plot (e.g. C22995A(S:T478K) as opposed to S:T478K), making it possible to view multi-nucleotide variants.

Example output:
![](freyja/data/example_covariants_plot.png)
3 changes: 3 additions & 0 deletions freyja.egg-info/SOURCES.txt
Original file line number Diff line number Diff line change
Expand Up @@ -5,6 +5,7 @@ freyja/__init__.py
freyja/_cli.py
freyja/convert_paths2barcodes.py
freyja/read_analysis_tools.py
freyja/read_analysis_utils.py
freyja/sample_deconv.py
freyja/updates.py
freyja/utils.py
Expand Down Expand Up @@ -37,8 +38,10 @@ freyja/data/sweep_metadata.csv
freyja/data/sweep_metadata_missing.csv
freyja/data/test.bam
freyja/data/test.bam.bai
freyja/data/test.depth
freyja/data/test.pdf
freyja/data/test.png
freyja/data/test.tsv
freyja/data/test.vcf
freyja/data/test0.html
freyja/data/test0.png
Expand Down
12 changes: 9 additions & 3 deletions freyja/_cli.py
Original file line number Diff line number Diff line change
Expand Up @@ -25,7 +25,7 @@


@click.group()
@click.version_option('1.4.7')
@click.version_option('1.4.8')
def cli():
pass

Expand Down Expand Up @@ -57,8 +57,13 @@ def print_barcode_version(ctx, param, value):
@click.option('--depthcutoff', default=0,
help='exclude sites with coverage depth below this value and'
'group identical barcodes')
@click.option('--adapt', default=0.,
help='adaptive lasso penalty parameter')
@click.option('--a_eps', default=1E-8,
help='adaptive lasso parameter, hard threshold')
def demix(variants, depths, output, eps, barcodes, meta,
covcut, confirmedonly, depthcutoff):
covcut, confirmedonly, depthcutoff,
adapt, a_eps):
locDir = os.path.abspath(os.path.join(os.path.realpath(__file__),
os.pardir))
# option for custom barcodes
Expand Down Expand Up @@ -92,7 +97,8 @@ def demix(variants, depths, output, eps, barcodes, meta,
sample_strains, abundances, error = solve_demixing_problem(df_barcodes,
mix,
depths_,
eps)
eps, adapt,
a_eps)
# merge intra-lineage diversity if multiple hits.
if len(set(sample_strains)) < len(sample_strains):
localDict = {}
Expand Down
37 changes: 32 additions & 5 deletions freyja/sample_deconv.py
Original file line number Diff line number Diff line change
Expand Up @@ -149,7 +149,7 @@ def map_to_constellation(sample_strains, vals, mapDict):
return localDict


def solve_demixing_problem(df_barcodes, mix, depths, eps):
def solve_demixing_problem(df_barcodes, mix, depths, eps, adapt, a_eps):
# single file problem setup, solving

dep = np.log2(depths+1)
Expand All @@ -169,6 +169,28 @@ def solve_demixing_problem(df_barcodes, mix, depths, eps):
'likely due to insufficient sequencing depth.')
sys.exit(1)
sol = x.value
rnorm0 = cp.norm(A @ x - b, 1).value
if adapt > 0.:
# if specified, run the adaptive lasso based method
# thresholding step
solNz = [j for j, s in enumerate(sol) if s > a_eps]
solFlip = [1./sol[j] for j in solNz]
A = A[:, solNz]
x = cp.Variable(A.shape[1])
cost = cp.norm(A @ x - b, 1) +\
(rnorm0*adapt)*cp.norm(cp.multiply(solFlip, x), 1)
constraints = [sum(x) == 1, x >= 0]
prob = cp.Problem(cp.Minimize(cost), constraints)
try:
prob.solve(verbose=False)
except cp.error.SolverError:
print('demix: Solver error encountered, most '
'likely due to insufficient sequencing depth. '
'Try increasing the --depthcutoff parameter.')
sys.exit(1)
sol = np.zeros(len(sol))
sol[solNz] = x.value

rnorm = cp.norm(A @ x - b, 1).value
# extract lineages with non-negligible abundance
sol[sol < eps] = 0
Expand All @@ -181,7 +203,8 @@ def solve_demixing_problem(df_barcodes, mix, depths, eps):


def bootstrap_parallel(jj, samplesDefining, fracDepths_adj, mix_grp,
mix, df_barcodes, eps0, muts, mapDict):
mix, df_barcodes, eps0, muts, mapDict,
adapt, a_eps):
# helper function for fast bootstrap and solve
# get sequencing depth at the position of all defining mutations
mix_boot = mix.copy()
Expand Down Expand Up @@ -220,14 +243,16 @@ def bootstrap_parallel(jj, samplesDefining, fracDepths_adj, mix_grp,
mix_boot, dps_)
sample_strains, abundances, error = solve_demixing_problem(df_barcodes,
mix_boot_,
dps_, eps0)
dps_, eps0,
adapt, a_eps)
localDict = map_to_constellation(sample_strains, abundances, mapDict)
return sample_strains, abundances, localDict


def perform_bootstrap(df_barcodes, mix, depths_,
numBootstraps, eps0, n_jobs,
mapDict, muts, boxplot, basename):
mapDict, muts, boxplot, basename,
adapt=0., a_eps=1E-8):
depths_.index = depths_.index.to_series().apply(lambda x:
int(x[1:len(x)-1]))
depths_ = depths_[~depths_.index.duplicated(keep='first')]
Expand Down Expand Up @@ -257,7 +282,9 @@ def perform_bootstrap(df_barcodes, mix, depths_,
df_barcodes,
eps0,
muts,
mapDict)
mapDict,
adapt,
a_eps)
for jj0 in tqdm(range(numBootstraps)))
for i in range(len(out)):
sample_lins, abundances, localDict = out[i]
Expand Down
9 changes: 5 additions & 4 deletions freyja/tests/test_deconv.py
Original file line number Diff line number Diff line change
@@ -1,8 +1,8 @@
import unittest
import pandas as pd
from freyja.sample_deconv import buildLineageMap, build_mix_and_depth_arrays,\
reindex_dfs, map_to_constellation, solve_demixing_problem,\
perform_bootstrap
from freyja.sample_deconv import buildLineageMap, build_mix_and_depth_arrays, \
reindex_dfs, map_to_constellation, solve_demixing_problem, \
perform_bootstrap
import pandas.testing as pdt
import pandas.api.types as ptypes
from numpy.random import negative_binomial
Expand Down Expand Up @@ -56,7 +56,8 @@ def test_demixing(self):
eps = 0.001
sample_strains, abundances, error = solve_demixing_problem(df_barcodes,
mix, depths,
eps)
eps, 0.,
1.E-8)
self.assertAlmostEqual(
abundances[sample_strains.tolist().index(strain1)], mixFracs[0])
self.assertAlmostEqual(
Expand Down
4 changes: 4 additions & 0 deletions freyja/utils.py
Original file line number Diff line number Diff line change
Expand Up @@ -183,6 +183,10 @@ def get_color_scheme(df, default_color_scheme, config=None):


def prepLineageDict(agg_d0, thresh=0.001, config=None, lineage_info=None):

if len(agg_d0.index[agg_d0.index.duplicated(keep=False)]) > 0:
print('WARNING: multiple samples have the same ID/filename.')
print(agg_d0.index[agg_d0.index.duplicated(keep=False)])
agg_d0.loc[:, 'lineages'] = agg_d0['lineages']\
.apply(lambda x:
x.replace("'", "")
Expand Down

0 comments on commit 1b14ad0

Please sign in to comment.