Merge pull request #49 from sbslee/0.28.0-dev
0.28.0 dev
sbslee authored Dec 5, 2021
2 parents 27ac9f6 + c0f59f8 commit 54c07e2
Showing 14 changed files with 774 additions and 91 deletions.
18 changes: 18 additions & 0 deletions CHANGELOG.rst
@@ -1,6 +1,24 @@
Changelog
*********

0.28.0 (2021-12-05)
-------------------

* Update :meth:`pyvcf.VcfFrame.filter_empty` method so that users can choose a varying number of missing genotypes as threshold.
* Add new method :meth:`pyvcf.plot_af_correlation`.
* Update :command:`bam-slice` command to support BED file as input for specifying regions. Additionally, from now on, the command will automatically handle the annoying 'chr' prefix.
* Add new method :meth:`pycov.CovFrame.matrix_uniformity`.
* Fix bug in :meth:`pyvcf.slice` method when the input region is missing start or end.
* Add new command :command:`ngs-bam2fq`.
* Add new command :command:`fa-filter`.
* Update :meth:`pycov.CovFrame.plot_region` and :meth:`pyvcf.VcfFrame.plot_region` methods to raise an error if the CovFrame/VcfFrame is empty.
* Update :meth:`pyvcf.VcfFrame.filter_*` methods so that they don't raise an error when the VcfFrame is empty (i.e. will return the empty VcfFrame).
* Update :meth:`common.plot_exons` method to not italicize text by default (use ``name='$text$'`` to italicize).
* Add new argument ``--posix`` to :command:`ngs-hc` command.
* Add new method :meth:`common.AnnFrame.subset`.
* Update :meth:`common.AnnFrame.plot_annot` method to raise an error if user provides an invalid group in ``group_order``.
* Add new method :meth:`pymaf.MafFrame.get_gene_concordance`.

0.27.0 (2021-11-20)
-------------------

2 changes: 2 additions & 0 deletions README.rst
@@ -121,6 +121,7 @@ For getting help on the fuc CLI:
bed-sum Summarize a BED file.
cov-concat Concatenate depth of coverage files.
cov-rename Rename the samples in a depth of coverage file.
fa-filter Filter sequence records in a FASTA file.
fq-count Count sequence reads in FASTQ files.
fq-sum Summarize a FASTQ file.
fuc-bgzip Write a BGZF compressed file.
@@ -133,6 +134,7 @@ For getting help on the fuc CLI:
maf-oncoplt Create an oncoplot with a MAF file.
maf-sumplt Create a summary plot with a MAF file.
maf-vcf2maf Convert a VCF file to a MAF file.
ngs-bam2fq Pipeline for converting BAM files to FASTQ files.
ngs-fq2bam Pipeline for converting FASTQ files to analysis-ready BAM files.
ngs-hc Pipeline for germline short variant discovery.
ngs-m2 Pipeline for somatic short variant discovery.
123 changes: 109 additions & 14 deletions docs/cli.rst
@@ -28,6 +28,7 @@ For getting help on the fuc CLI:
bed-sum Summarize a BED file.
cov-concat Concatenate depth of coverage files.
cov-rename Rename the samples in a depth of coverage file.
fa-filter Filter sequence records in a FASTA file.
fq-count Count sequence reads in FASTQ files.
fq-sum Summarize a FASTQ file.
fuc-bgzip Write a BGZF compressed file.
@@ -40,6 +41,7 @@ For getting help on the fuc CLI:
maf-oncoplt Create an oncoplot with a MAF file.
maf-sumplt Create a summary plot with a MAF file.
maf-vcf2maf Convert a VCF file to a MAF file.
ngs-bam2fq Pipeline for converting BAM files to FASTQ files.
ngs-fq2bam Pipeline for converting FASTQ files to analysis-ready BAM files.
ngs-hc Pipeline for germline short variant discovery.
ngs-m2 Pipeline for somatic short variant discovery.
@@ -195,23 +197,35 @@ bam-slice
usage: fuc bam-slice [-h] [--format TEXT] [--fasta PATH]
bam regions [regions ...]
Slice a SAM/BAM/CRAM file.
Slice an alignment file (SAM/BAM/CRAM).
Positional arguments:
bam Alignment file.
regions List of regions to be sliced ('chrom:start-end').
bam Input alignment file must be already indexed (.bai) to allow
random access. You can index an alignment file with the
bam-index command.
regions One or more regions to be sliced. Each region must have the
format chrom:start-end and be a half-open interval with
(start, end]. This means, for example, chr1:100-103 will
extract positions 101, 102, and 103. Alternatively, you can
provide a BED file (compressed or uncompressed) to specify
regions. Note that the 'chr' prefix in contig names (e.g.
'chr1' vs. '1') will be automatically added or removed as
necessary to match the input alignment file's contig names.
Optional arguments:
-h, --help Show this help message and exit.
--format TEXT Output format (default: 'BAM') (choices: 'SAM', 'BAM',
'CRAM').
--fasta PATH FASTA file. Required when --format is 'CRAM'.
[Example] Slice a BAM file:
$ fuc bam-slice in.bam chr1:100-200 chr2:100-200 > out.bam
[Example] Specify regions manually:
$ fuc bam-slice in.bam 1:100-300 2:400-700 > out.bam
[Example] Specify regions with a BED file:
$ fuc bam-slice in.bam regions.bed > out.bam
[Example] Slice a CRAM file:
$ fuc bam-slice in.bam chr1:100-200 --format CRAM --fasta ref.fa > out.cram
$ fuc bam-slice in.bam regions.bed --format CRAM --fasta ref.fa > out.cram
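The half-open (start, end] convention and the 'chr' prefix handling described in the help text above can be illustrated with a small Python sketch. The helper names `parse_region`, `positions_in_region`, and `match_chr_prefix` are hypothetical and are not part of fuc; the sketch assumes each region is given in full as chrom:start-end.

```python
def parse_region(region):
    """Split 'chrom:start-end' into (chrom, start, end)."""
    # rpartition tolerates contig names that themselves contain ':'.
    chrom, _, coords = region.rpartition(':')
    start, _, end = coords.partition('-')
    return chrom, int(start), int(end)


def positions_in_region(region):
    """List the 1-based positions covered by a half-open (start, end] region."""
    _, start, end = parse_region(region)
    # 'start' itself is excluded and 'end' is included.
    return list(range(start + 1, end + 1))


def match_chr_prefix(chrom, contigs):
    """Add or remove the 'chr' prefix so chrom matches the file's contig names."""
    if chrom in contigs:
        return chrom
    alt = chrom[3:] if chrom.startswith('chr') else 'chr' + chrom
    if alt in contigs:
        return alt
    raise ValueError(f"Contig '{chrom}' not found in the input file")
```

With these helpers, `positions_in_region('chr1:100-103')` gives positions 101, 102, and 103, matching the example in the help text.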
bed-intxn
=========
@@ -322,6 +336,32 @@ cov-rename
[Example] Using the 'RANGE' mode:
$ fuc cov-rename in.tsv new_only.tsv --mode RANGE --range 2 5 > out.tsv
fa-filter
=========

.. code-block:: text
$ fuc fa-filter -h
usage: fuc fa-filter [-h] [--contigs TEXT [TEXT ...]] [--exclude] fasta
Filter sequence records in a FASTA file.
Positional arguments:
fasta FASTA file (compressed or uncompressed).
Optional arguments:
-h, --help Show this help message and exit.
--contigs TEXT [TEXT ...]
One or more contigs to be selected. Alternatively, you can
provide a file containing one contig per line.
--exclude Exclude specified contigs.
[Example] Select certain contigs:
$ fuc fa-filter in.fasta --contigs chr1 chr2 > out.fasta
[Example] Exclude certain contigs:
$ fuc fa-filter in.fasta --contigs contigs.list --exclude > out.fasta
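The selection logic behind --contigs and --exclude can be sketched in plain Python. This is an illustration only, not fuc's implementation; `filter_fasta` is a hypothetical name, and the sketch assumes the contig name is the first whitespace-delimited token of each header line.

```python
def filter_fasta(lines, contigs, exclude=False):
    """Yield FASTA lines belonging to selected (or non-excluded) records."""
    contigs = set(contigs)
    keep = False
    for line in lines:
        if line.startswith('>'):
            # Assumption: the contig name is the first token after '>'.
            fields = line[1:].split()
            name = fields[0] if fields else ''
            # Keep listed contigs by default; invert the choice when excluding.
            keep = (name in contigs) != exclude
        if keep:
            yield line
```

Because the function works on any iterable of lines, it can be fed an open file handle, whether the file was opened with `open` or with `gzip.open` in text mode.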
fq-count
========

@@ -677,6 +717,55 @@ maf-vcf2maf
[Example] Convert VCF to MAF:
$ fuc maf-vcf2maf in.vcf > out.maf
ngs-bam2fq
==========

.. code-block:: text
$ fuc ngs-bam2fq -h
usage: fuc ngs-bam2fq [-h] [--thread INT] [--force] manifest output qsub
Pipeline for converting BAM files to FASTQ files.
This pipeline will assume input BAM files consist of paired-end reads
and output two zipped FASTQ files for each sample (forward and reverse
reads). That is, SAMPLE.bam will produce SAMPLE_R1.fastq.gz and
SAMPLE_R2.fastq.gz.
External dependencies:
- SGE: Required for job submission (i.e. qsub).
- SAMtools: Required for BAM to FASTQ conversion.
Manifest columns:
- BAM: BAM file.
Positional arguments:
manifest Sample manifest CSV file.
output Output directory.
qsub SGE resource to request with qsub for BAM to FASTQ
conversion. Since this operation supports multithreading,
it is recommended to specify a parallel environment (PE)
to speed up the process (also see --thread).
Optional arguments:
-h, --help Show this help message and exit.
--thread INT Number of threads to use (default: 1).
--force Overwrite the output directory if it already exists.
[Example] Specify queue:
$ fuc ngs-bam2fq \
manifest.csv \
output_dir \
"-q queue_name -pe pe_name 10" \
--thread 10
[Example] Specify nodes:
$ fuc ngs-bam2fq \
manifest.csv \
output_dir \
"-l h='node_A|node_B' -pe pe_name 10" \
--thread 10
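The manifest format and the output naming rule described above (SAMPLE.bam produces SAMPLE_R1.fastq.gz and SAMPLE_R2.fastq.gz) can be sketched as follows. The functions `read_manifest` and `fastq_names` are hypothetical, and the sketch assumes the sample name is simply the BAM file name without its extension.

```python
import csv
import io
from pathlib import Path


def read_manifest(handle):
    """Return the BAM paths listed in the 'BAM' column of a manifest CSV."""
    return [row['BAM'] for row in csv.DictReader(handle)]


def fastq_names(bam):
    """Derive the forward and reverse FASTQ file names for one BAM file."""
    # Assumption: the sample name is the file name minus its extension.
    sample = Path(bam).stem
    return f'{sample}_R1.fastq.gz', f'{sample}_R2.fastq.gz'


# An in-memory manifest stands in for manifest.csv.
manifest = io.StringIO('BAM\n/data/A.bam\n/data/B.bam\n')
for bam in read_manifest(manifest):
    print(bam, *fastq_names(bam))
```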
ngs-fq2bam
==========

@@ -756,7 +845,7 @@ ngs-hc
$ fuc ngs-hc -h
usage: fuc ngs-hc [-h] [--bed PATH] [--dbsnp PATH] [--job TEXT] [--force]
[--keep]
[--keep] [--posix]
manifest fasta output qsub java1 java2
Pipeline for germline short variant discovery.
@@ -783,6 +872,7 @@ ngs-hc
--job TEXT Job submission ID for SGE.
--force Overwrite the output directory if it already exists.
--keep Keep temporary files.
--posix Optimize for a POSIX filesystem.
[Example] Specify queue:
$ fuc ngs-hc \
@@ -1104,10 +1194,10 @@ vcf-index
-h, --help Show this help message and exit.
--force Force to overwrite the index file if it is already present.
[Example] Index a compressed VCF file.
[Example] Index a compressed VCF file:
$ fuc vcf-index in.vcf.gz
[Example] Index an uncompressed VCF file. Will create a compressed file first.
[Example] Index an uncompressed VCF file (will create a compressed VCF first):
$ fuc vcf-index in.vcf
vcf-merge
@@ -1194,7 +1284,9 @@ vcf-slice
Positional arguments:
vcf Input VCF file must be already BGZF compressed (.gz) and
indexed (.tbi) to allow random access.
indexed (.tbi) to allow random access. A VCF file can be
compressed with the fuc-bgzip command and indexed with the
vcf-index command.
regions One or more regions to be sliced. Each region must have the
format chrom:start-end and be a half-open interval with
(start, end]. This means, for example, chr1:100-103 will
@@ -1207,11 +1299,14 @@ vcf-slice
Optional arguments:
-h, --help Show this help message and exit.
[Example] Specify regions manually.
$ fuc vcf-slice in.vcf.gz 1:100-300 2:400-700 > out.vcf
[Example] Specify regions manually:
$ fuc vcf-slice in.vcf.gz 1:100-300 2:400-700 > out.vcf
[Example] Specify regions with a BED file:
$ fuc vcf-slice in.vcf.gz regions.bed > out.vcf
[Example] Specify regions with a BED file.
$ fuc vcf-slice in.vcf.gz regions.bed > out.vcf
[Example] Output a compressed file:
$ fuc vcf-slice in.vcf.gz regions.bed | fuc fuc-bgzip > out.vcf.gz
vcf-vcf2bed
===========
90 changes: 84 additions & 6 deletions fuc/api/common.py
@@ -233,9 +233,12 @@ def plot_annot(
Parameters
----------
group_col : str
AnnFrame column containing sample group information.
AnnFrame column containing sample group information. If the
column has NaN values, they will be converted to 'N/A' string.
group_order : list, optional
List of sample group names.
List of sample group names (in that order too). You can use this
to subset samples belonging to specified groups only. You must
include all relevant groups when also using ``samples``.
samples : list, optional
Display only specified samples (in that order too).
colors : str or list, default: 'tab10'
@@ -298,6 +301,7 @@ def plot_annot(
"""
# Get the selected column.
s = self.df[group_col]
s = s.fillna('N/A')

# Subset the samples, if necessary.
if samples is not None:
@@ -307,9 +311,26 @@ def plot_annot(
if group_order is None:
group_order = sorted([x for x in s.unique() if x == x])
else:
s = s[s.isin(group_order)]
# Make sure all specified groups are valid.
for group in group_order:
groups = ', '.join([f"'{x}'" for x in sorted(s.unique())])
if group not in s.unique():
raise ValueError(f"The group '{group}' does not exist. "
f"The following groups are available: {groups}.")

if len(group_order) < len(s.unique()):
if samples is None:
s = s[s.isin(group_order)]
else:
missing = ', '.join([f"'{x}'" for x in s.unique()
if x not in group_order])
raise ValueError("The 'group_order' argument must "
"include all groups when used with the 'samples' "
"argument. The following groups are currently missing: "
f"{missing}.")

d = {k: v for v, k in enumerate(group_order)}
df = s.to_frame().applymap(lambda x: x if pd.isna(x) else d[x])
df = s.to_frame().applymap(lambda x: d[x])

# Determine the colors to use.
if isinstance(colors, str):
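The validation added to plot_annot in this hunk can be exercised on its own with a simplified, standalone sketch. The function `check_group_order` is hypothetical and is not part of fuc; it treats ``samples`` only as a flag, whereas the method itself also subsets by ``samples`` before this check.

```python
import pandas as pd


def check_group_order(s, group_order, samples=None):
    """Validate group_order against a Series of group labels and subset it."""
    # Missing labels are shown as the string 'N/A', as in the hunk above.
    s = s.fillna('N/A')
    available = sorted(s.unique())

    # Every requested group must exist in the data.
    for group in group_order:
        if group not in available:
            raise ValueError(
                f"The group '{group}' does not exist. "
                f"The following groups are available: {available}.")

    # When 'samples' is also given, group_order must cover all groups.
    missing = [x for x in available if x not in group_order]
    if missing and samples is not None:
        raise ValueError(
            "The 'group_order' argument must include all groups when "
            f"used with 'samples'. Missing groups: {missing}.")

    return s[s.isin(group_order)]
```

For example, with labels Normal, Tumor, and one missing value, passing ``['Tumor']`` keeps only the tumor samples, while passing an unknown group or combining a partial ``group_order`` with ``samples`` raises ValueError.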
@@ -445,6 +466,63 @@ def sorted_samples(self, by, mf=None, keep_empty=False, nonsyn=False):

return df.index.to_list()

def subset(self, samples, exclude=False):
"""
Subset AnnFrame for specified samples.
Parameters
----------
samples : str or list
Sample name or list of names (the order matters).
exclude : bool, default: False
If True, exclude specified samples.
Returns
-------
AnnFrame
Subsetted AnnFrame.
Examples
--------
>>> from fuc import common
>>> data = {
... 'SampleID': ['A', 'B', 'C', 'D'],
... 'PatientID': ['P1', 'P1', 'P2', 'P2'],
... 'Tissue': ['Normal', 'Tumor', 'Normal', 'Tumor'],
... 'Age': [30, 30, 57, 57]
... }
>>> af = common.AnnFrame.from_dict(data, sample_col='SampleID') # or sample_col=0
>>> af.df
PatientID Tissue Age
SampleID
A P1 Normal 30
B P1 Tumor 30
C P2 Normal 57
D P2 Tumor 57
We can subset the AnnFrame for the normal samples A and C:
>>> af.subset(['A', 'C']).df
PatientID Tissue Age
SampleID
A P1 Normal 30
C P2 Normal 57
Alternatively, we can exclude those samples:
>>> af.subset(['A', 'C'], exclude=True).df
PatientID Tissue Age
SampleID
B P1 Tumor 30
D P2 Tumor 57
"""
if isinstance(samples, str):
samples = [samples]
if exclude:
samples = [x for x in self.samples if x not in samples]
return self.__class__(self.df.loc[samples])

def _script_name():
"""Return the current script's filename."""
fn = inspect.stack()[1].filename
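The new subset method relies on label-based indexing, which is why the order of ``samples`` matters. A minimal sketch on a bare pandas DataFrame, assuming the sample names form the index as in AnnFrame.df, shows the same behavior; `subset_samples` is a hypothetical stand-in for the method.

```python
import pandas as pd


def subset_samples(df, samples, exclude=False):
    """Select rows by sample name, or drop them when exclude is True."""
    if isinstance(samples, str):
        samples = [samples]
    if exclude:
        # Keep the original order of the remaining samples.
        samples = [x for x in df.index if x not in samples]
    # .loc returns rows in the order the labels are given.
    return df.loc[samples]


df = pd.DataFrame(
    {'Tissue': ['Normal', 'Tumor', 'Normal', 'Tumor']},
    index=pd.Index(['A', 'B', 'C', 'D'], name='SampleID'),
)
print(subset_samples(df, ['C', 'A']))
print(subset_samples(df, 'A', exclude=True))
```

Requesting ``['C', 'A']`` returns the rows in that order, while ``exclude=True`` preserves the original order of whatever samples remain.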
@@ -1034,7 +1112,7 @@ def plot_exons(
ends : list
List of exon end positions.
name : str, optional
Gene name.
Gene name. Use ``name='$text$'`` to italicize the text.
offset : float, default: 1
How far gene name should be plotted from the gene model.
color : str, default: 'black'
@@ -1086,7 +1164,7 @@ def plot_exons(
ax.text(
x=(starts[0]+ends[-1]) / 2,
y=y-offset,
s=f'${name}$',
s=name,
horizontalalignment='center',
fontsize=fontsize,
)