Merge pull request #31 from sbslee/0.17.0-dev

0.17.0 dev
sbslee · Jul 8, 2021 · eb45ac5
2 parents d551451 + b14bbd0
commit eb45ac5
Showing 15 changed files with 929 additions and 110 deletions.
diff --git a/CHANGELOG.rst b/CHANGELOG.rst
@@ -1,6 +1,19 @@
 Changelog
 *********
 
+0.17.0 (2021-07-08)
+-------------------
+
+* Add :meth:`pymaf.MafFrame.plot_lollipop` method.
+* :issue:`30`: Add :meth:`pymaf.MafFrame.plot_rainfall` method.
+* :issue:`30`: Add :meth:`pyvcf.VcfFrame.plot_rainfall` method.
+* Update :meth:`pymaf.MafFrame.to_vcf` method to output sorted VCF.
+* Add :meth:`pymaf.MafFrame.matrix_prevalence` method.
+* Add :meth:`pymaf.MafFrame.plot_regplot` method.
+* Add ``samples`` argument to :meth:`pymaf.MafFrame.plot_snvclss` method.
+* Add :meth:`pymaf.MafFrame.plot_evolution` method.
+* Add new submodule ``pygff``.
+
 0.16.0 (2021-07-02)
 -------------------
 
@@ -22,7 +35,7 @@ Changelog
 * Add :meth:`pyvcf.VcfFrame.plot_snvclsp` method (simply wraps :meth:`pymaf.MafFrame.plot_snvclsp` method).
 * Add :meth:`pyvcf.VcfFrame.plot_snvclss` method (simply wraps :meth:`pymaf.MafFrame.plot_snvclss` method).
 * Add :meth:`pyvcf.VcfFrame.plot_titv` method (simply wraps :meth:`pymaf.MafFrame.plot_titv` method).
-* Update :meth:`pymaf.MafFrame.from_vcf` method to handle unannotated VCF data.
+* :issue:`28`: Update :meth:`pymaf.MafFrame.from_vcf` method to handle unannotated VCF data.
 
 0.15.0 (2021-06-24)
 -------------------

diff --git a/README.rst b/README.rst
@@ -40,6 +40,8 @@ Currently, fuc can be used to analyze, summarize, visualize, and manipulate the
 - Browser Extensible Data (BED)
 - FASTQ
 - FASTA
+- General Feature Format (GFF)
+- Gene Transfer Format (GTF)
 - delimiter-separated values format (e.g. comma-separated values or CSV format)
 
 Additionally, fuc can be used to parse output data from the following programs:
@@ -150,6 +152,7 @@ Below is the list of submodules available in the fuc API:
 - **pybed** : The pybed submodule is designed for working with BED files. It implements ``pybed.BedFrame`` which stores BED data as ``pandas.DataFrame`` via the `pyranges <https://github.com/biocore-ntnu/pyranges>`_ package to allow fast computation and easy manipulation. The submodule strictly adheres to the standard `BED specification <https://genome.ucsc.edu/FAQ/FAQformat.html>`_.
 - **pycov** : The pycov submodule is designed for working with depth of coverage data from sequence alingment files (SAM/BAM/CRAM). It implements ``pycov.CovFrame`` which stores read depth data as ``pandas.DataFrame`` via the `pysam <https://pysam.readthedocs.io/en/latest/api.html>`_ package to allow fast computation and easy manipulation.
 - **pyfq** : The pyfq submodule is designed for working with FASTQ files. It implements ``pyfq.FqFrame`` which stores FASTQ data as ``pandas.DataFrame`` to allow fast computation and easy manipulation.
+- **pygff** : The pygff submodule is designed for working with GFF/GTF files. It implements ``pygff.GffFrame`` which stores GFF/GTF data as ``pandas.DataFrame`` to allow fast computation and easy manipulation. The submodule strictly adheres to the standard `GFF specification <https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md>`_.
 - **pymaf** : The pymaf submodule is designed for working with MAF files. It implements ``pymaf.MafFrame`` which stores MAF data as ``pandas.DataFrame`` to allow fast computation and easy manipulation. The ``pymaf.MafFrame`` class also contains many useful plotting methods such as ``MafFrame.plot_oncoplot`` and ``MafFrame.plot_summary``. The submodule strictly adheres to the standard `MAF specification <https://docs.gdc.cancer.gov/Data/File_Formats/MAF_Format/>`_.
 - **pysnpeff** : The pysnpeff submodule is designed for parsing VCF annotation data from the `SnpEff <https://pcingola.github.io/SnpEff/>`_ program. It should be used with ``pyvcf.VcfFrame``.
 - **pyvcf** : The pyvcf submodule is designed for working with VCF files. It implements ``pyvcf.VcfFrame`` which stores VCF data as ``pandas.DataFrame`` to allow fast computation and easy manipulation. The ``pyvcf.VcfFrame`` class also contains many useful plotting methods such as ``VcfFrame.plot_comparison`` and ``VcfFrame.plot_tmb``. The submodule strictly adheres to the standard `VCF specification <https://samtools.github.io/hts-specs/VCFv4.3.pdf>`_.
@@ -322,8 +325,8 @@ To create an oncoplot with a MAF file:
 
     >>> from fuc import common, pymaf
     >>> common.load_dataset('tcga-laml')
-    >>> f = '~/fuc-data/tcga-laml/tcga_laml.maf.gz'
-    >>> mf = pymaf.MafFrame.from_file(f)
+    >>> maf_file = '~/fuc-data/tcga-laml/tcga_laml.maf.gz'
+    >>> mf = pymaf.MafFrame.from_file(maf_file)
     >>> mf.plot_oncoplot()
 
 .. image:: https://raw.githubusercontent.com/sbslee/fuc-data/main/images/oncoplot.png
@@ -338,8 +341,8 @@ To create a summary figure for a MAF file:
 
     >>> from fuc import common, pymaf
     >>> common.load_dataset('tcga-laml')
-    >>> f = '~/fuc-data/tcga-laml/tcga_laml.maf.gz'
-    >>> mf = pymaf.MafFrame.from_file(f)
+    >>> maf_file = '~/fuc-data/tcga-laml/tcga_laml.maf.gz'
+    >>> mf = pymaf.MafFrame.from_file(maf_file)
     >>> mf.plot_summary()
 
 .. image:: https://raw.githubusercontent.com/sbslee/fuc-data/main/images/maf_summary-2.png

diff --git a/data/gff/fasta.gff b/data/gff/fasta.gff
@@ -0,0 +1,33 @@
+##gff-version 3.1.26
+##sequence-region ctg123 1 1497228
+ctg123	.	gene	1000	9000	.	+	.	ID=gene00001;Name=EDEN
+ctg123	.	TF_binding_site	1000	1012	.	+	.	ID=tfbs00001;Parent=gene00001
+ctg123	.	mRNA	1050	9000	.	+	.	ID=mRNA00001;Parent=gene00001;Name=EDEN.1
+ctg123	.	five_prime_UTR	1050	1200	.	+	.	Parent=mRNA00001
+ctg123	.	CDS	1201	1500	.	+	0	ID=cds00001;Parent=mRNA00001
+ctg123	.	CDS	3000	3902	.	+	0	ID=cds00001;Parent=mRNA00001
+ctg123	.	CDS	5000	5500	.	+	0	ID=cds00001;Parent=mRNA00001
+ctg123	.	CDS	7000	7600	.	+	0	ID=cds00001;Parent=mRNA00001
+ctg123	.	three_prime_UTR	7601	9000	.	+	.	Parent=mRNA00001
+ctg123	.	cDNA_match	1050	1500	5.80E-42	+	.	ID=match00001;Target=cdna0123+12+462
+ctg123	.	cDNA_match	5000	5500	8.10E-43	+	.	ID=match00001;Target=cdna0123+463+963
+ctg123	.	cDNA_match	7000	9000	1.40E-40	+	.	ID=match00001;Target=cdna0123+964+2964
+##FASTA
+>ctg123
+cttctgggcgtacccgattctcggagaacttgccgcaccattccgccttg
+tgttcattgctgcctgcatgttcattgtctacctcggctacgtgtggcta
+tctttcctcggtgccctcgtgcacggagtcgagaaaccaaagaacaaaaa
+aagaaattaaaatatttattttgctgtggtttttgatgtgtgttttttat
+aatgatttttgatgtgaccaattgtacttttcctttaaatgaaatgtaat
+cttaaatgtatttccgacgaattcgaggcctgaaaagtgtgacgccattc
+gtatttgatttgggtttactatcgaataatgagaattttcaggcttaggc
+ttaggcttaggcttaggcttaggcttaggcttaggcttaggcttaggctt
+aggcttaggcttaggcttaggcttaggcttaggcttaggcttaggcttag
+aatctagctagctatccgaaattcgaggcctgaaaagtgtgacgccattc
+>cnda0123
+ttcaagtgctcagtcaatgtgattcacagtatgtcaccaaatattttggc
+agctttctcaagggatcaaaattatggatcattatggaatacctcggtgg
+aggctcagcgctcgatttaactaaaagtggaaagctggacgaaagtcata
+tcgctgtgattcttcgcgaaattttgaaaggtctcgagtatctgcatagt
+gaaagaaaaatccacagagatattaaaggagccaacgttttgttggaccg
+tcaaacagcggctgtaaaaatttgtgattatggttaaagg
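
Note on the test file above: it combines the three parts of a GFF3 file, namely `##` directives, nine tab-separated feature columns, and a trailing `##FASTA` section. The following is a minimal standard-library sketch of how such a file could be read. It is not the implementation of `pygff.GffFrame`; the names `GFF_COLUMNS`, `parse_attributes`, and `parse_gff` are hypothetical and exist only for this illustration.

```python
import io

# The nine GFF3 columns, in order (column names follow the GFF3 specification).
GFF_COLUMNS = [
    'seqid', 'source', 'type', 'start', 'end',
    'score', 'strand', 'phase', 'attributes',
]

def parse_attributes(text):
    """Split a column-9 string such as 'ID=x;Parent=y' into a dict."""
    pairs = (item.split('=', 1) for item in text.split(';') if '=' in item)
    return {key: value for key, value in pairs}

def parse_gff(handle):
    """Return feature records as dicts, ignoring directives and FASTA data."""
    records = []
    for line in handle:
        line = line.rstrip('\n')
        if line.startswith('##FASTA'):
            break  # everything after this directive is sequence data
        if not line or line.startswith('#'):
            continue  # skip blank lines, directives, and comments
        record = dict(zip(GFF_COLUMNS, line.split('\t')))
        record['start'] = int(record['start'])
        record['end'] = int(record['end'])
        record['attributes'] = parse_attributes(record['attributes'])
        records.append(record)
    return records

# Two feature lines copied from the test file, plus a shortened FASTA section.
example = (
    "##gff-version 3.1.26\n"
    "ctg123\t.\tgene\t1000\t9000\t.\t+\t.\tID=gene00001;Name=EDEN\n"
    "ctg123\t.\tmRNA\t1050\t9000\t.\t+\t.\tID=mRNA00001;Parent=gene00001;Name=EDEN.1\n"
    "##FASTA\n"
    ">ctg123\n"
    "cttctgggcg\n"
)
records = parse_gff(io.StringIO(example))
```

Stopping at `##FASTA` keeps the sequence lines from being mistaken for features. The score and phase columns are left as strings here because `.` is a valid placeholder in both.
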
diff --git a/docs/api.rst b/docs/api.rst
@@ -16,6 +16,7 @@ Below is the list of submodules available in the fuc API:
 - **pybed** : The pybed submodule is designed for working with BED files. It implements ``pybed.BedFrame`` which stores BED data as ``pandas.DataFrame`` via the `pyranges <https://github.com/biocore-ntnu/pyranges>`_ package to allow fast computation and easy manipulation. The submodule strictly adheres to the standard `BED specification <https://genome.ucsc.edu/FAQ/FAQformat.html>`_.
 - **pycov** : The pycov submodule is designed for working with depth of coverage data from sequence alingment files (SAM/BAM/CRAM). It implements ``pycov.CovFrame`` which stores read depth data as ``pandas.DataFrame`` via the `pysam <https://pysam.readthedocs.io/en/latest/api.html>`_ package to allow fast computation and easy manipulation.
 - **pyfq** : The pyfq submodule is designed for working with FASTQ files. It implements ``pyfq.FqFrame`` which stores FASTQ data as ``pandas.DataFrame`` to allow fast computation and easy manipulation.
+- **pygff** : The pygff submodule is designed for working with GFF/GTF files. It implements ``pygff.GffFrame`` which stores GFF/GTF data as ``pandas.DataFrame`` to allow fast computation and easy manipulation. The submodule strictly adheres to the standard `GFF specification <https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md>`_.
 - **pymaf** : The pymaf submodule is designed for working with MAF files. It implements ``pymaf.MafFrame`` which stores MAF data as ``pandas.DataFrame`` to allow fast computation and easy manipulation. The ``pymaf.MafFrame`` class also contains many useful plotting methods such as ``MafFrame.plot_oncoplot`` and ``MafFrame.plot_summary``. The submodule strictly adheres to the standard `MAF specification <https://docs.gdc.cancer.gov/Data/File_Formats/MAF_Format/>`_.
 - **pysnpeff** : The pysnpeff submodule is designed for parsing VCF annotation data from the `SnpEff <https://pcingola.github.io/SnpEff/>`_ program. It should be used with ``pyvcf.VcfFrame``.
 - **pyvcf** : The pyvcf submodule is designed for working with VCF files. It implements ``pyvcf.VcfFrame`` which stores VCF data as ``pandas.DataFrame`` to allow fast computation and easy manipulation. The ``pyvcf.VcfFrame`` class also contains many useful plotting methods such as ``VcfFrame.plot_comparison`` and ``VcfFrame.plot_tmb``. The submodule strictly adheres to the standard `VCF specification <https://samtools.github.io/hts-specs/VCFv4.3.pdf>`_.
@@ -58,6 +59,12 @@ fuc.api.pyfq
 .. automodule:: fuc.api.pyfq
    :members:
 
+fuc.api.pygff
+=============
+
+.. automodule:: fuc.api.pygff
+   :members:
+
 fuc.api.pymaf
 =============
 

diff --git a/docs/cli.rst b/docs/cli.rst
@@ -331,7 +331,7 @@ maf-maf2vcf
                           [--cols TEXT [TEXT ...]] [--names TEXT [TEXT ...]]
                           maf
    
-   This command will convert a MAF file to a VCF file.
+   This command will convert a MAF file to a sorted VCF file.
    
    In order to handle INDELs the command makes use of a reference assembly (i.e. FASTA file). If SNVs are your only concern, then you do not need a FASTA file and can just use the '--ignore_indels' flag.
    
@@ -347,7 +347,7 @@ maf-maf2vcf
      $ fuc maf-maf2vcf in.maf --fasta hs37d5.fa --cols i_TumorVAF_WU --names AF > out.vcf
    
    Positional arguments:
-     maf                   MAF file.
+     maf                   MAF file (zipped or unzipped).
    
    Optional arguments:
      -h, --help            Show this help message and exit.

diff --git a/docs/create.py b/docs/create.py
@@ -68,6 +68,8 @@
 - Browser Extensible Data (BED)
 - FASTQ
 - FASTA
+- General Feature Format (GFF)
+- Gene Transfer Format (GTF)
 - delimiter-separated values format (e.g. comma-separated values or CSV format)
 
 Additionally, fuc can be used to parse output data from the following programs:
@@ -310,8 +312,8 @@
 
     >>> from fuc import common, pymaf
     >>> common.load_dataset('tcga-laml')
-    >>> f = '~/fuc-data/tcga-laml/tcga_laml.maf.gz'
-    >>> mf = pymaf.MafFrame.from_file(f)
+    >>> maf_file = '~/fuc-data/tcga-laml/tcga_laml.maf.gz'
+    >>> mf = pymaf.MafFrame.from_file(maf_file)
     >>> mf.plot_oncoplot()
 
 .. image:: https://raw.githubusercontent.com/sbslee/fuc-data/main/images/oncoplot.png
@@ -326,8 +328,8 @@
 
     >>> from fuc import common, pymaf
     >>> common.load_dataset('tcga-laml')
-    >>> f = '~/fuc-data/tcga-laml/tcga_laml.maf.gz'
-    >>> mf = pymaf.MafFrame.from_file(f)
+    >>> maf_file = '~/fuc-data/tcga-laml/tcga_laml.maf.gz'
+    >>> mf = pymaf.MafFrame.from_file(maf_file)
     >>> mf.plot_summary()
 
 .. image:: https://raw.githubusercontent.com/sbslee/fuc-data/main/images/maf_summary-2.png

diff --git a/docs/glossary.rst b/docs/glossary.rst
@@ -4,7 +4,7 @@ Glossary
 SNV classes
 ===========
 
-Considering the pyrimidines of the Watson-Crick base pairs, there are only six different possible substitutions: C>A, C>G, C>T, T>A, T>C, and T>G.
+Considering the pyrimidines of the Watson-Crick base pairs, there are only six different possible substitutions: C>A, C>G, C>T, T>A, T>C, T>G.
 
 References:
 
@@ -15,6 +15,14 @@ Transitions (Ti) and transversions (Tv)
 
 DNA substitution mutations are of two types. Transitions are interchanges of two-ring purines (A↔G) or of one-ring pyrimidines (C↔T): they therefore involve bases of similar shape. Transversions are interchanges of purine for pyrimidine bases, which therefore involve exchange of one-ring and two-ring structures.
 
++------+--------------------+
+| Type | SNV classes        |
++======+====================+
+| Ti   | C>T, T>C           |
++------+--------------------+
+| Tv   | C>A, C>G, T>A, T>G |
++------+--------------------+
+
 References:
 
 - `Transitions vs. Transversions <https://www.mun.ca/biology/scarr/Transitions_vs_Transversions.html>`__
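
Note on the glossary table above: the six SNV classes and the Ti/Tv split can be expressed in a few lines. The following is a minimal sketch, assuming single-base REF and ALT alleles. The helpers `snv_class` and `titv` are hypothetical names for this illustration and are not part of the fuc API.

```python
# SNV classes from the glossary table, expressed with a pyrimidine reference.
TRANSITIONS = {'C>T', 'T>C'}
TRANSVERSIONS = {'C>A', 'C>G', 'T>A', 'T>G'}
COMPLEMENT = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}

def snv_class(ref, alt):
    """Collapse a substitution to one of the six pyrimidine-based classes."""
    ref, alt = ref.upper(), alt.upper()
    if ref not in COMPLEMENT or alt not in COMPLEMENT or ref == alt:
        raise ValueError(f'Not a single-nucleotide substitution: {ref}>{alt}')
    if ref in 'AG':
        # Purine reference: report the change on the complementary strand.
        ref, alt = COMPLEMENT[ref], COMPLEMENT[alt]
    return f'{ref}>{alt}'

def titv(ref, alt):
    """Return 'Ti' for a transition and 'Tv' for a transversion."""
    return 'Ti' if snv_class(ref, alt) in TRANSITIONS else 'Tv'
```

For example, `snv_class('G', 'A')` gives `'C>T'`, which is a transition, while `snv_class('A', 'C')` gives `'T>G'`, which is a transversion.
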

diff --git a/fuc/api/common.py b/fuc/api/common.py
@@ -95,6 +95,10 @@ def load_dataset(name, force=False):
             'tcga_laml.vcf',
             'tcga_laml_vep.vcf',
         ],
+        'brca': [
+            'brca.maf.gz',
+            'brca.vcf',
+        ],
         'pyvcf': [
             'plot_comparison.vcf',
             'normal-tumor.vcf',

diff --git a/fuc/api/pycov.py b/fuc/api/pycov.py
@@ -246,10 +246,15 @@ def plot_region(
         df = df.set_index('Position')
         if kwargs is None:
             kwargs = {}
+
+        # Determine which matplotlib axes to plot on.
         if ax is None:
             fig, ax = plt.subplots(figsize=figsize)
+
         sns.lineplot(data=df, ax=ax, **kwargs)
+
         ax.set_ylabel('Depth')
+
         return ax
 
     def slice(self, region):