Merge pull request #49 from sbslee/0.28.0-dev

0.28.0 dev
sbslee · Dec 5, 2021 · 54c07e2
2 parents 27ac9f6 + c0f59f8
commit 54c07e2

Showing 14 changed files with 774 additions and 91 deletions.
diff --git a/CHANGELOG.rst b/CHANGELOG.rst
@@ -1,6 +1,24 @@
 Changelog
 *********
 
+0.28.0 (2021-12-05)
+-------------------
+
+* Update :meth:`pyvcf.VcfFrame.filter_empty` method so that users can choose a varying number of missing genotypes as threshold.
+* Add new method :meth:`pyvcf.plot_af_correlation`.
+* Update :command:`bam-slice` command to support BED file as input for specifying regions. Additionally, from now on, the command will automatically handle the annoying 'chr' prefix.
+* Add new method :meth:`pycov.CovFrame.matrix_uniformity`.
+* Fix bug in :meth:`pyvcf.slice` method when the input region is missing start or end.
+* Add new command :command:`ngs-bam2fq`.
+* Add new command :command:`fa-filter`.
+* Update :meth:`pycov.CovFrame.plot_region` and :meth:`pyvcf.VcfFrame.plot_region` methods to raise an error if the CovFrame/VcfFrame is empty.
+* Update :meth:`pyvcf.VcfFrame.filter_*` methods so that they don't raise an error when the VcfFrame is empty (i.e. will return the empty VcfFrame).
+* Update :meth:`common.plot_exons` method to not italicize text by default (use ``name='$text$'`` to italicize).
+* Add new argument ``--posix`` to :command:`ngs-hc` command.
+* Add new method :meth:`common.AnnFrame.subset`.
+* Update :meth:`common.AnnFrame.plot_annot` method to raise an error if user provides an invalid group in ``group_order``.
+* Add new method :meth:`pymaf.MafFrame.get_gene_concordance`.
+
 0.27.0 (2021-11-20)
 -------------------
 

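The changelog entry for :command:`bam-slice` says the command now handles the 'chr' prefix automatically. A minimal sketch of that kind of harmonization, assuming the set of contig names in the input file is already known; the helper name `match_chr_prefix` is hypothetical and is not fuc's actual implementation:

```python
def match_chr_prefix(region, contigs):
    """Return `region` with its 'chr' prefix added or removed to match `contigs`.

    Hypothetical helper for illustration only; not part of the fuc API.
    """
    chrom, sep, rest = region.partition(':')
    if chrom in contigs:
        # The contig name already matches; nothing to change.
        return region
    # Try the alternative spelling ('chr1' <-> '1').
    alt = chrom[3:] if chrom.startswith('chr') else 'chr' + chrom
    if alt in contigs:
        return alt + sep + rest
    raise ValueError(f"Contig '{chrom}' not found in the input file")


# Contig names as they might appear in a BAM header without the prefix.
bam_contigs = {'1', '2', 'X'}
print(match_chr_prefix('chr1:100-300', bam_contigs))
print(match_chr_prefix('2:400-700', bam_contigs))
```

The region string is only rewritten when the alternative spelling is present, so a genuinely unknown contig still raises an error instead of being silently dropped.
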
diff --git a/README.rst b/README.rst
@@ -121,6 +121,7 @@ For getting help on the fuc CLI:
        bed-sum      Summarize a BED file.
        cov-concat   Concatenate depth of coverage files.
        cov-rename   Rename the samples in a depth of coverage file.
+       fa-filter    Filter sequence records in a FASTA file.
        fq-count     Count sequence reads in FASTQ files.
        fq-sum       Summarize a FASTQ file.
        fuc-bgzip    Write a BGZF compressed file.
@@ -133,6 +134,7 @@ For getting help on the fuc CLI:
        maf-oncoplt  Create an oncoplot with a MAF file.
        maf-sumplt   Create a summary plot with a MAF file.
        maf-vcf2maf  Convert a VCF file to a MAF file.
+       ngs-bam2fq   Pipeline for converting BAM files to FASTQ files.
        ngs-fq2bam   Pipeline for converting FASTQ files to analysis-ready BAM files.
        ngs-hc       Pipeline for germline short variant discovery.
        ngs-m2       Pipeline for somatic short variant discovery.

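The :command:`fa-filter` command added to the listing above selects or excludes FASTA records by contig name. A rough pure-Python sketch of that behavior, assuming the record name is the first whitespace-delimited token after '>'; the function `filter_fasta` is hypothetical and does not reproduce how fuc implements the command:

```python
def filter_fasta(lines, contigs, exclude=False):
    """Yield the lines of FASTA records selected by `contigs`.

    Hypothetical helper for illustration only; not part of the fuc API.
    """
    wanted = set(contigs)
    keep = False
    for line in lines:
        if line.startswith('>'):
            # Assumption: the record name is the first token of the header.
            parts = line[1:].split()
            name = parts[0] if parts else ''
            # Keep listed records, or unlisted ones when exclude=True.
            keep = (name in wanted) != exclude
        if keep:
            yield line


records = ['>chr1 description', 'ACGT', '>chr2', 'GGCC', '>chr3', 'TTAA']
print(list(filter_fasta(records, ['chr1', 'chr2'])))
print(list(filter_fasta(records, ['chr1', 'chr2'], exclude=True)))
```
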
diff --git a/docs/cli.rst b/docs/cli.rst
@@ -28,6 +28,7 @@ For getting help on the fuc CLI:
        bed-sum      Summarize a BED file.
        cov-concat   Concatenate depth of coverage files.
        cov-rename   Rename the samples in a depth of coverage file.
+       fa-filter    Filter sequence records in a FASTA file.
        fq-count     Count sequence reads in FASTQ files.
        fq-sum       Summarize a FASTQ file.
        fuc-bgzip    Write a BGZF compressed file.
@@ -40,6 +41,7 @@ For getting help on the fuc CLI:
        maf-oncoplt  Create an oncoplot with a MAF file.
        maf-sumplt   Create a summary plot with a MAF file.
        maf-vcf2maf  Convert a VCF file to a MAF file.
+       ngs-bam2fq   Pipeline for converting BAM files to FASTQ files.
        ngs-fq2bam   Pipeline for converting FASTQ files to analysis-ready BAM files.
        ngs-hc       Pipeline for germline short variant discovery.
        ngs-m2       Pipeline for somatic short variant discovery.
@@ -195,23 +197,35 @@ bam-slice
    usage: fuc bam-slice [-h] [--format TEXT] [--fasta PATH]
                         bam regions [regions ...]
    
-   Slice a SAM/BAM/CRAM file.
+   Slice an alignment file (SAM/BAM/CRAM).
    
    Positional arguments:
-     bam            Alignment file.
-     regions        List of regions to be sliced ('chrom:start-end').
+     bam            Input alignment file must be already indexed (.bai) to allow 
+                    random access. You can index an alignment file with the 
+                    bam-index command.
+     regions        One or more regions to be sliced. Each region must have the 
+                    format chrom:start-end and be a half-open interval with 
+                    (start, end]. This means, for example, chr1:100-103 will 
+                    extract positions 101, 102, and 103. Alternatively, you can 
+                    provide a BED file (compressed or uncompressed) to specify 
+                    regions. Note that the 'chr' prefix in contig names (e.g. 
+                    'chr1' vs. '1') will be automatically added or removed as 
+                    necessary to match the input alignment file's contig names.
    
    Optional arguments:
      -h, --help     Show this help message and exit.
      --format TEXT  Output format (default: 'BAM') (choices: 'SAM', 'BAM', 
                     'CRAM').
      --fasta PATH   FASTA file. Required when --format is 'CRAM'.
    
-   [Example] Slice a BAM file:
-     $ fuc bam-slice in.bam chr1:100-200 chr2:100-200 > out.bam
+   [Example] Specify regions manually:
+     $ fuc bam-slice in.bam 1:100-300 2:400-700 > out.bam
+   
+   [Example] Specify regions with a BED file:
+     $ fuc bam-slice in.bam regions.bed > out.bam
    
    [Example] Slice a CRAM file:
-     $ fuc bam-slice in.bam chr1:100-200 --format CRAM --fasta ref.fa > out.cram
+     $ fuc bam-slice in.bam regions.bed --format CRAM --fasta ref.fa > out.cram
 
 bed-intxn
 =========
@@ -322,6 +336,32 @@ cov-rename
    [Example] Using the 'RANGE' mode:
      $ fuc cov-rename in.tsv new_only.tsv --mode RANGE --range 2 5 > out.tsv
 
+fa-filter
+=========
+
+.. code-block:: text
+
+   $ fuc fa-filter -h
+   usage: fuc fa-filter [-h] [--contigs TEXT [TEXT ...]] [--exclude] fasta
+   
+   Filter sequence records in a FASTA file.
+   
+   Positional arguments:
+     fasta                 FASTA file (compressed or uncompressed).
+   
+   Optional arguments:
+     -h, --help            Show this help message and exit.
+     --contigs TEXT [TEXT ...]
+                           One or more contigs to be selected. Alternatively, you can 
+                           provide a file containing one contig per line. 
+     --exclude             Exclude specified contigs.
+   
+   [Example] Select certain contigs:
+     $ fuc fa-filter in.fasta --contigs chr1 chr2 > out.fasta
+   
+   [Example] Exclude certain contigs:
+     $ fuc fa-filter in.fasta --contigs contigs.list --exclude > out.fasta
+
 fq-count
 ========
 
@@ -677,6 +717,55 @@ maf-vcf2maf
    [Example] Convert VCF to MAF:
      $ fuc maf-vcf2maf in.vcf > out.maf
 
+ngs-bam2fq
+==========
+
+.. code-block:: text
+
+   $ fuc ngs-bam2fq -h
+   usage: fuc ngs-bam2fq [-h] [--thread INT] [--force] manifest output qsub
+   
+   Pipeline for converting BAM files to FASTQ files.
+   
+   This pipeline will assume input BAM files consist of paired-end reads
+   and output two zipped FASTQ files for each sample (forward and reverse
+   reads). That is, SAMPLE.bam will produce SAMPLE_R1.fastq.gz and
+   SAMPLE_R2.fastq.gz.
+   
+   External dependencies:
+     - SGE: Required for job submission (i.e. qsub).
+     - SAMtools: Required for BAM to FASTQ conversion.
+   
+   Manifest columns:
+     - BAM: BAM file.
+   
+   Positional arguments:
+     manifest      Sample manifest CSV file.
+     output        Output directory.
+     qsub          SGE resource to request with qsub for BAM to FASTQ 
+                   conversion. Since this operation supports multithreading, 
+                   it is recommended to specify a parallel environment (PE) 
+                   to speed up the process (also see --thread).
+   
+   Optional arguments:
+     -h, --help    Show this help message and exit.
+     --thread INT  Number of threads to use (default: 1).
+     --force       Overwrite the output directory if it already exists.
+   
+   [Example] Specify queue:
+     $ fuc ngs-bam2fq \
+     manifest.csv \
+     output_dir \
+     "-q queue_name -pe pe_name 10" \
+     --thread 10
+   
+   [Example] Specify nodes:
+     $ fuc ngs-bam2fq \
+     manifest.csv \
+     output_dir \
+     "-l h='node_A|node_B' -pe pe_name 10" \
+     --thread 10
+
 ngs-fq2bam
 ==========
 
@@ -756,7 +845,7 @@ ngs-hc
 
    $ fuc ngs-hc -h
    usage: fuc ngs-hc [-h] [--bed PATH] [--dbsnp PATH] [--job TEXT] [--force]
-                     [--keep]
+                     [--keep] [--posix]
                      manifest fasta output qsub java1 java2
    
    Pipeline for germline short variant discovery.
@@ -783,6 +872,7 @@ ngs-hc
      --job TEXT    Job submission ID for SGE.
      --force       Overwrite the output directory if it already exists.
      --keep        Keep temporary files.
+     --posix       Optimize for a POSIX filesystem.
    
    [Example] Specify queue:
      $ fuc ngs-hc \
@@ -1104,10 +1194,10 @@ vcf-index
      -h, --help  Show this help message and exit.
      --force     Force to overwrite the index file if it is already present.
    
-   [Example] Index a compressed VCF file.
+   [Example] Index a compressed VCF file:
      $ fuc vcf-index in.vcf.gz
    
-   [Example] Index an uncompressed VCF file. Will create a compressed file first.
+   [Example] Index an uncompressed VCF file (will create a compressed VCF first):
      $ fuc vcf-index in.vcf
 
 vcf-merge
@@ -1194,7 +1284,9 @@ vcf-slice
    
    Positional arguments:
      vcf         Input VCF file must be already BGZF compressed (.gz) and 
-                 indexed (.tbi) to allow random access.
+                 indexed (.tbi) to allow random access. A VCF file can be 
+                 compressed with the fuc-bgzip command and indexed with the 
+                 vcf-index command.
      regions     One or more regions to be sliced. Each region must have the 
                  format chrom:start-end and be a half-open interval with 
                  (start, end]. This means, for example, chr1:100-103 will 
@@ -1207,11 +1299,14 @@ vcf-slice
    Optional arguments:
      -h, --help  Show this help message and exit.
    
-   [Example] Specify regions manually.
-   $ fuc vcf-slice in.vcf.gz 1:100-300 2:400-700 > out.vcf
+   [Example] Specify regions manually:
+     $ fuc vcf-slice in.vcf.gz 1:100-300 2:400-700 > out.vcf
+   
+   [Example] Specify regions with a BED file:
+     $ fuc vcf-slice in.vcf.gz regions.bed > out.vcf
    
-   [Example] Speicfy regions with a BED file.
-   $ fuc vcf-slice in.vcf.gz regions.bed > out.vcf
+   [Example] Output a compressed file:
+     $ fuc vcf-slice in.vcf.gz regions.bed | fuc fuc-bgzip > out.vcf.gz
 
 vcf-vcf2bed
 ===========

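The help text for :command:`bam-slice` and :command:`vcf-slice` describes regions as half-open intervals of the form (start, end], so chr1:100-103 covers positions 101, 102, and 103. A small sketch of that convention, assuming a region always has both a start and an end; `positions_in_region` is a hypothetical helper, not a fuc function:

```python
def positions_in_region(region):
    """Return the contig and the positions covered by a (start, end] region.

    Hypothetical helper for illustration only; not part of the fuc API.
    """
    chrom, _, span = region.partition(':')
    start, end = (int(x) for x in span.split('-'))
    # Half-open (start, end]: start is excluded, end is included.
    return chrom, list(range(start + 1, end + 1))


print(positions_in_region('chr1:100-103'))  # ('chr1', [101, 102, 103])
```
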
diff --git a/fuc/api/common.py b/fuc/api/common.py
@@ -233,9 +233,12 @@ def plot_annot(
         Parameters
         ----------
         group_col : str
-            AnnFrame column containing sample group information.
+            AnnFrame column containing sample group information. If the
+            column has NaN values, they will be converted to 'N/A' string.
         group_order : list, optional
-            List of sample group names.
+            List of sample group names (in that order too). You can use this
+            to subset samples belonging to specified groups only. You must
+            include all relevant groups when also using ``samples``.
         samples : list, optional
             Display only specified samples (in that order too).
         colors : str or list, default: 'tab10'
@@ -298,6 +301,7 @@ def plot_annot(
         """
         # Get the selected column.
         s = self.df[group_col]
+        s = s.fillna('N/A')
 
         # Subset the samples, if necessary.
         if samples is not None:
@@ -307,9 +311,26 @@ def plot_annot(
         if group_order is None:
             group_order = sorted([x for x in s.unique() if x == x])
         else:
-            s = s[s.isin(group_order)]
+            # Make sure all specified groups are valid.
+            for group in group_order:
+                groups = ', '.join([f"'{x}'" for x in sorted(s.unique())])
+                if group not in s.unique():
+                    raise ValueError(f"The group '{group}' does not exist. "
+                        f"The following groups are available: {groups}.")
+
+            if len(group_order) < len(s.unique()):
+                if samples is None:
+                    s = s[s.isin(group_order)]
+                else:
+                    missing = ', '.join([f"'{x}'" for x in s.unique()
+                        if x not in group_order])
+                    raise ValueError("The 'group_order' argument must "
+                        "include all groups when used with the 'samples' "
+                        "argument. The following groups are currently missing: "
+                        f"{missing}.")
+
         d = {k: v for v, k in enumerate(group_order)}
-        df = s.to_frame().applymap(lambda x: x if pd.isna(x) else d[x])
+        df = s.to_frame().applymap(lambda x: d[x])
 
         # Determine the colors to use.
         if isinstance(colors, str):
@@ -445,6 +466,63 @@ def sorted_samples(self, by, mf=None, keep_empty=False, nonsyn=False):
 
         return df.index.to_list()
 
+    def subset(self, samples, exclude=False):
+        """
+        Subset AnnFrame for specified samples.
+
+        Parameters
+        ----------
+        samples : str or list
+            Sample name or list of names (the order matters).
+        exclude : bool, default: False
+            If True, exclude specified samples.
+
+        Returns
+        -------
+        AnnFrame
+            Subsetted AnnFrame.
+
+        Examples
+        --------
+
+        >>> from fuc import common
+        >>> data = {
+        ...     'SampleID': ['A', 'B', 'C', 'D'],
+        ...     'PatientID': ['P1', 'P1', 'P2', 'P2'],
+        ...     'Tissue': ['Normal', 'Tumor', 'Normal', 'Tumor'],
+        ...     'Age': [30, 30, 57, 57]
+        ... }
+        >>> af = common.AnnFrame.from_dict(data, sample_col='SampleID') # or sample_col=0
+        >>> af.df
+                 PatientID  Tissue  Age
+        SampleID
+        A               P1  Normal   30
+        B               P1   Tumor   30
+        C               P2  Normal   57
+        D               P2   Tumor   57
+
+        We can subset the AnnFrame for the normal samples A and C:
+
+        >>> af.subset(['A', 'C']).df
+                 PatientID  Tissue  Age
+        SampleID
+        A               P1  Normal   30
+        C               P2  Normal   57
+
+        Alternatively, we can exclude those samples:
+
+        >>> af.subset(['A', 'C'], exclude=True).df
+                 PatientID Tissue  Age
+        SampleID
+        B               P1  Tumor   30
+        D               P2  Tumor   57
+        """
+        if isinstance(samples, str):
+            samples = [samples]
+        if exclude:
+            samples = [x for x in self.samples if x not in samples]
+        return self.__class__(self.df.loc[samples])
+
 def _script_name():
     """Return the current script's filename."""
     fn = inspect.stack()[1].filename
@@ -1034,7 +1112,7 @@ def plot_exons(
     ends : list
         List of exon end positions.
     name : str, optional
-        Gene name.
+        Gene name. Use ``name='$text$'`` to italicize the text.
     offset : float, default: 1
         How far gene name should be plotted from the gene model.
     color : str, default: 'black'
@@ -1086,7 +1164,7 @@ def plot_exons(
         ax.text(
             x=(starts[0]+ends[-1]) / 2,
             y=y-offset,
-            s=f'${name}$',
+            s=name,
             horizontalalignment='center',
             fontsize=fontsize,
         )
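
The sample-selection step in the new ``AnnFrame.subset`` method can be illustrated without pandas. The sketch below mirrors only the list handling shown in the diff (a single name is wrapped in a list, and ``exclude=True`` inverts the selection while preserving the original sample order); `select_samples` is a hypothetical stand-in and omits the final ``self.df.loc[samples]`` lookup:

```python
def select_samples(all_samples, samples, exclude=False):
    """Return the sample names that a subset operation would keep.

    Hypothetical stand-in for illustration; not part of the fuc API.
    """
    if isinstance(samples, str):
        # Accept a single sample name as well as a list of names.
        samples = [samples]
    if exclude:
        # Keep everything else, in the original order.
        samples = [x for x in all_samples if x not in samples]
    return samples


all_samples = ['A', 'B', 'C', 'D']
print(select_samples(all_samples, ['A', 'C']))                # ['A', 'C']
print(select_samples(all_samples, ['A', 'C'], exclude=True))  # ['B', 'D']
```

When ``exclude`` is False, the requested order is kept as given, which matches the docstring's note that the order of ``samples`` matters.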