Merge pull request #18 from sbslee/0.11.0-dev

0.11.0 dev
sbslee · Jun 10, 2021 · 014ba33
2 parents 86798f1 + 740ce86
commit 014ba33

Showing 14 changed files with 1,461 additions and 486 deletions.
diff --git a/CHANGELOG.rst b/CHANGELOG.rst
@@ -1,6 +1,16 @@
 Changelog
 *********
 
+0.11.0 (2021-06-10)
+-------------------
+
+* :issue:`16`: Add new method :meth:`pyvcf.VcfFrame.cfilter_empty`.
+* Add new methods :meth:`pyvep.filter_af/lof`.
+* Add ``matplotlib-venn`` package as dependency for plotting Venn diagrams.
+* Add new methods :meth:`pyvcf.plot_comparison/regplot/histplot`.
+* :issue:`17`: Add new method :meth:`pyvep.filter_biotype`.
+* Add new class :class:`pyvcf.AnnFrame`.
+
 0.10.0 (2021-06-03)
 -------------------
 

diff --git a/README.rst b/README.rst
@@ -56,24 +56,36 @@ Your contributions (e.g. feature ideas, pull requests) are most welcome.
 CLI Examples
 ============
 
+SAM/BAM/CRAM
+------------
+
 To print the header of a BAM file:
 
 .. code-block:: console
 
    $ fuc bam_head example.bam
 
+BED
+---
+
 To find intersection between BED files:
 
 .. code-block:: console
 
    $ fuc bed_intxn 1.bed 2.bed 3.bed > intersect.bed
 
+FASTQ
+-----
+
 To count sequence reads in a FASTQ file:
 
 .. code-block:: console
 
    $ fuc fq_count example.fastq
 
+FUC
+---
+
 To check whether a file exists in the operating system:
 
 .. code-block:: console
@@ -86,12 +98,18 @@ To find all VCF files within the current directory recursively:
 
    $ fuc fuc_find . vcf
 
+TABLE
+-----
+
 To merge two tab-delimited files:
 
 .. code-block:: console
 
    $ fuc tbl_merge left.txt right.txt > merged.txt
 
+VCF
+---
+
 To merge VCF files:
 
 .. code-block:: console
@@ -101,6 +119,9 @@ To merge VCF files:
 API Examples
 ============
 
+VCF
+---
+
 To filter a VCF file based on a BED file:
 
 .. code:: python3
@@ -119,11 +140,63 @@ To remove indels from a VCF file:
    >>> filtered_vf = vf.filter_indel()
    >>> filtered_vf.to_file('no_indels.vcf')
 
+To create a Venn diagram showing genotype concordance between groups:
+
+.. code:: python3
+
+    >>> from fuc import pyvcf, common
+    >>> common.load_dataset('pyvcf')
+    >>> f = '~/fuc-data/pyvcf/plot_comparison.vcf'
+    >>> vf = pyvcf.VcfFrame.from_file(f)
+    >>> a = ['Steven_A', 'John_A', 'Sara_A']
+    >>> b = ['Steven_B', 'John_B', 'Sara_B']
+    >>> c = ['Steven_C', 'John_C', 'Sara_C']
+    >>> vf.plot_comparison(a, b, c)
+
+.. image:: https://raw.githubusercontent.com/sbslee/fuc-data/main/images/plot_comparison.png
+
+To create a histogram of tumor mutational burden (TMB) distribution:
+
+.. code:: python3
+
+    >>> from fuc import pyvcf
+    >>> vcf_data = {
+    ...     'CHROM': ['chr1', 'chr1', 'chr1', 'chr1', 'chr1'],
+    ...     'POS': [100, 101, 102, 103, 103],
+    ...     'ID': ['.', '.', '.', '.', '.'],
+    ...     'REF': ['T', 'T', 'T', 'T', 'T'],
+    ...     'ALT': ['C', 'C', 'C', 'C', 'C'],
+    ...     'QUAL': ['.', '.', '.', '.', '.'],
+    ...     'FILTER': ['.', '.', '.', '.', '.'],
+    ...     'INFO': ['.', '.', '.', '.', '.'],
+    ...     'FORMAT': ['GT', 'GT', 'GT', 'GT', 'GT'],
+    ...     'Steven_N': ['0/0', '0/0', '0/1', '0/0', '0/0'],
+    ...     'Steven_T': ['0/0', '0/1', '0/1', '0/1', '0/1'],
+    ...     'Sara_N': ['0/0', '0/1', '0/0', '0/0', '0/0'],
+    ...     'Sara_T': ['0/0', '0/0', '1/1', '1/1', '0/1'],
+    ...     'John_N': ['0/0', '0/0', '0/0', '0/0', '0/0'],
+    ...     'John_T': ['0/1', '0/0', '1/1', '1/1', '0/1'],
+    ...     'Rachel_N': ['0/0', '0/0', '0/0', '0/0', '0/0'],
+    ...     'Rachel_T': ['0/1', '0/1', '0/0', '0/1', '0/1'],
+    ... }
+    >>> annot_data = {
+    ...     'Sample': ['Steven_N', 'Steven_T', 'Sara_N', 'Sara_T', 'John_N', 'John_T', 'Rachel_N', 'Rachel_T'],
+    ...     'Subject': ['Steven', 'Steven', 'Sara', 'Sara', 'John', 'John', 'Rachel', 'Rachel'],
+    ...     'Type': ['Normal', 'Tumor', 'Normal', 'Tumor', 'Normal', 'Tumor', 'Normal', 'Tumor'],
+    ... }
+    >>> vf = pyvcf.VcfFrame.from_dict([], vcf_data)
+    >>> af = pyvcf.AnnFrame.from_dict(annot_data, 'Sample')
+    >>> vf.plot_histplot(hue='Type', af=af)
+
+.. image:: https://raw.githubusercontent.com/sbslee/fuc-data/main/images/plot_histplot.png
+
+MAF
+---
+
 To create an oncoplot with a MAF file:
 
 .. code:: python3
 
-    >>> import matplotlib.pyplot as plt
     >>> from fuc import common, pymaf
     >>> common.load_dataset('tcga-laml')
     >>> f = '~/fuc-data/tcga-laml/tcga_laml.maf.gz'
@@ -140,7 +213,6 @@ To create a summary figure for a MAF file:
 
 .. code:: python3
 
-    >>> import matplotlib.pyplot as plt
     >>> from fuc import common, pymaf
     >>> common.load_dataset('tcga-laml')
     >>> f = '~/fuc-data/tcga-laml/tcga_laml.maf.gz'
@@ -149,11 +221,13 @@ To create a summary figure for a MAF file:
 
 .. image:: https://raw.githubusercontent.com/sbslee/fuc-data/main/images/maf_summary.png
 
+SAM/BAM/CRAM
+------------
+
 To create read depth profile of a region from a CRAM file:
 
 .. code:: python3
 
-    >>> import matplotlib.pyplot as plt
     >>> from fuc import pycov
     >>> cf = pycov.CovFrame.from_file('HG00525.final.cram', zero=True,
     ...    region='chr12:21161194-21239796', names=['HG00525'])
@@ -197,7 +271,7 @@ Finally, you can clone the GitHub repository and then install fuc this way:
    $ cd fuc
    $ pip install .
 
-The nice thing about this approach is that you will have access to development versions that are not available in Anaconda or PyPI. For example, you can access a development branch with the ``git checkout`` command.
+The nice thing about this approach is that you will have access to development versions that are not available in Anaconda or PyPI. For example, you can access a development branch with the ``git checkout`` command. When you do this, please make sure your environment already has all the dependencies installed.
 
 Getting Help
 ============
@@ -251,10 +325,10 @@ Below is the list of submodules available in API:
 - **pybed** : The pybed submodule is designed for working with BED files. It implements ``pybed.BedFrame`` which stores BED data as ``pandas.DataFrame`` via the `pyranges <https://github.com/biocore-ntnu/pyranges>`_ package to allow fast computation and easy manipulation. The submodule strictly adheres to the standard `BED specification <https://genome.ucsc.edu/FAQ/FAQformat.html>`_.
 - **pycov** : The pycov submodule is designed for working with depth of coverage data from sequence alingment files (SAM/BAM/CRAM). It implements ``pycov.CovFrame`` which stores read depth data as ``pandas.DataFrame`` via the `pysam <https://pysam.readthedocs.io/en/latest/api.html>`_ package to allow fast computation and easy manipulation.
 - **pyfq** : The pyfq submodule is designed for working with FASTQ files. It implements ``pyfq.FqFrame`` which stores FASTQ data as ``pandas.DataFrame`` to allow fast computation and easy manipulation.
-- **pymaf** : The pymaf submodule is designed for working with MAF files. It implements ``pymaf.MafFrame`` which stores MAF data as ``pandas.DataFrame`` to allow fast computation and easy manipulation. The class also contains many useful plotting methods such as ``MafFrame.plot_varcls`` and ``MafFrame.plot_waterfall``. The submodule strictly adheres to the standard `MAF specification <https://docs.gdc.cancer.gov/Data/File_Formats/MAF_Format/>`_.
-- **pysnpeff** : The pysnpeff submodule is designed for parsing VCF annotation data from the `SnpEff <https://pcingola.github.io/SnpEff/>`_ program. It is designed to be used with ``pyvcf.VcfFrame``.
-- **pyvcf** : The pyvcf submodule is designed for working with VCF files. It implements ``pyvcf.VcfFrame`` class which stores VCF data as ``pandas.DataFrame`` to allow fast computation and easy manipulation. The submodule strictly adheres to the standard `VCF specification <https://samtools.github.io/hts-specs/VCFv4.3.pdf>`_.
-- **pyvep** : The pyvep submodule is designed for parsing VCF annotation data from the `Ensembl VEP <https://asia.ensembl.org/info/docs/tools/vep/index.html>`_. It is designed to be used with ``pyvcf.VcfFrame``.
+- **pymaf** : The pymaf submodule is designed for working with MAF files. It implements ``pymaf.MafFrame`` which stores MAF data as ``pandas.DataFrame`` to allow fast computation and easy manipulation. The ``pymaf.MafFrame`` class also contains many useful plotting methods such as ``MafFrame.plot_oncoplot`` and ``MafFrame.plot_summary``. The submodule strictly adheres to the standard `MAF specification <https://docs.gdc.cancer.gov/Data/File_Formats/MAF_Format/>`_.
+- **pysnpeff** : The pysnpeff submodule is designed for parsing VCF annotation data from the `SnpEff <https://pcingola.github.io/SnpEff/>`_ program. It should be used with ``pyvcf.VcfFrame``.
+- **pyvcf** : The pyvcf submodule is designed for working with VCF files. It implements ``pyvcf.VcfFrame`` which stores VCF data as ``pandas.DataFrame`` to allow fast computation and easy manipulation. The ``pyvcf.VcfFrame`` class also contains many useful plotting methods such as ``VcfFrame.plot_comparison`` and ``VcfFrame.plot_regplot``. The submodule strictly adheres to the standard `VCF specification <https://samtools.github.io/hts-specs/VCFv4.3.pdf>`_.
+- **pyvep** : The pyvep submodule is designed for parsing VCF annotation data from the `Ensembl VEP <https://asia.ensembl.org/info/docs/tools/vep/index.html>`_ program. It should be used with ``pyvcf.VcfFrame``.
 
 For getting help on a specific module (e.g. pyvcf):
 

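Note on the TMB histogram example added to README.rst above: the quantity being plotted can be approximated by hand from the same toy data. The sketch below is not fuc's implementation; it assumes that a sample's mutational burden is simply the number of genotype calls other than ``0/0``, and the helper ``count_variants`` is hypothetical.

```python
# Genotype columns copied from the README example in this diff.
genotypes = {
    'Steven_N': ['0/0', '0/0', '0/1', '0/0', '0/0'],
    'Steven_T': ['0/0', '0/1', '0/1', '0/1', '0/1'],
    'Sara_N': ['0/0', '0/1', '0/0', '0/0', '0/0'],
    'Sara_T': ['0/0', '0/0', '1/1', '1/1', '0/1'],
    'John_N': ['0/0', '0/0', '0/0', '0/0', '0/0'],
    'John_T': ['0/1', '0/0', '1/1', '1/1', '0/1'],
    'Rachel_N': ['0/0', '0/0', '0/0', '0/0', '0/0'],
    'Rachel_T': ['0/1', '0/1', '0/0', '0/1', '0/1'],
}

# Sample-to-type mapping taken from the 'Sample' and 'Type' annotation columns.
sample_type = {
    'Steven_N': 'Normal', 'Steven_T': 'Tumor',
    'Sara_N': 'Normal', 'Sara_T': 'Tumor',
    'John_N': 'Normal', 'John_T': 'Tumor',
    'Rachel_N': 'Normal', 'Rachel_T': 'Tumor',
}


def count_variants(calls):
    """Count calls that are not homozygous reference (assumed TMB proxy)."""
    return sum(1 for gt in calls if gt != '0/0')


# Per-sample counts, then grouped by annotation type (the 'hue' in the plot).
counts = {sample: count_variants(calls) for sample, calls in genotypes.items()}
by_type = {}
for sample, n in counts.items():
    by_type.setdefault(sample_type[sample], []).append(n)

print(by_type)  # {'Normal': [1, 1, 0, 0], 'Tumor': [4, 3, 4, 4]}
```

Grouping by ``Type`` is what ``hue='Type'`` suggests the histogram separates: under this assumption the tumor samples cluster at 3 to 4 variants and the normal samples at 0 to 1.
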
diff --git a/conda.yml b/conda.yml
@@ -8,6 +8,7 @@ dependencies:
   - cython
   - lxml
   - matplotlib
+  - matplotlib-venn
   - notebook
   - numpy
   - pandas

diff --git a/docs/api.rst b/docs/api.rst
@@ -16,10 +16,10 @@ Below is the list of submodules available in API:
 - **pybed** : The pybed submodule is designed for working with BED files. It implements ``pybed.BedFrame`` which stores BED data as ``pandas.DataFrame`` via the `pyranges <https://github.com/biocore-ntnu/pyranges>`_ package to allow fast computation and easy manipulation. The submodule strictly adheres to the standard `BED specification <https://genome.ucsc.edu/FAQ/FAQformat.html>`_.
 - **pycov** : The pycov submodule is designed for working with depth of coverage data from sequence alingment files (SAM/BAM/CRAM). It implements ``pycov.CovFrame`` which stores read depth data as ``pandas.DataFrame`` via the `pysam <https://pysam.readthedocs.io/en/latest/api.html>`_ package to allow fast computation and easy manipulation.
 - **pyfq** : The pyfq submodule is designed for working with FASTQ files. It implements ``pyfq.FqFrame`` which stores FASTQ data as ``pandas.DataFrame`` to allow fast computation and easy manipulation.
-- **pymaf** : The pymaf submodule is designed for working with MAF files. It implements ``pymaf.MafFrame`` which stores MAF data as ``pandas.DataFrame`` to allow fast computation and easy manipulation. The class also contains many useful plotting methods such as ``MafFrame.plot_varcls`` and ``MafFrame.plot_waterfall``. The submodule strictly adheres to the standard `MAF specification <https://docs.gdc.cancer.gov/Data/File_Formats/MAF_Format/>`_.
-- **pysnpeff** : The pysnpeff submodule is designed for parsing VCF annotation data from the `SnpEff <https://pcingola.github.io/SnpEff/>`_ program. It is designed to be used with ``pyvcf.VcfFrame``.
-- **pyvcf** : The pyvcf submodule is designed for working with VCF files. It implements ``pyvcf.VcfFrame`` class which stores VCF data as ``pandas.DataFrame`` to allow fast computation and easy manipulation. The submodule strictly adheres to the standard `VCF specification <https://samtools.github.io/hts-specs/VCFv4.3.pdf>`_.
-- **pyvep** : The pyvep submodule is designed for parsing VCF annotation data from the `Ensembl VEP <https://asia.ensembl.org/info/docs/tools/vep/index.html>`_. It is designed to be used with ``pyvcf.VcfFrame``.
+- **pymaf** : The pymaf submodule is designed for working with MAF files. It implements ``pymaf.MafFrame`` which stores MAF data as ``pandas.DataFrame`` to allow fast computation and easy manipulation. The ``pymaf.MafFrame`` class also contains many useful plotting methods such as ``MafFrame.plot_oncoplot`` and ``MafFrame.plot_summary``. The submodule strictly adheres to the standard `MAF specification <https://docs.gdc.cancer.gov/Data/File_Formats/MAF_Format/>`_.
+- **pysnpeff** : The pysnpeff submodule is designed for parsing VCF annotation data from the `SnpEff <https://pcingola.github.io/SnpEff/>`_ program. It should be used with ``pyvcf.VcfFrame``.
+- **pyvcf** : The pyvcf submodule is designed for working with VCF files. It implements ``pyvcf.VcfFrame`` which stores VCF data as ``pandas.DataFrame`` to allow fast computation and easy manipulation. The ``pyvcf.VcfFrame`` class also contains many useful plotting methods such as ``VcfFrame.plot_comparison`` and ``VcfFrame.plot_regplot``. The submodule strictly adheres to the standard `VCF specification <https://samtools.github.io/hts-specs/VCFv4.3.pdf>`_.
+- **pyvep** : The pyvep submodule is designed for parsing VCF annotation data from the `Ensembl VEP <https://asia.ensembl.org/info/docs/tools/vep/index.html>`_ program. It should be used with ``pyvcf.VcfFrame``.
 
 For getting help on a specific module (e.g. pyvcf):
 

diff --git a/docs/conf.py b/docs/conf.py
@@ -49,6 +49,8 @@
 
 issues_github_path = 'sbslee/fuc'
 
+napoleon_use_param = False
+
 # Include the example source for plots in API docs
 plot_include_source = True
 plot_formats = [('png', 90)]

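Note on ``napoleon_use_param = False`` above: according to the Sphinx napoleon documentation, this setting renders all parameters of a docstring under a single ``:parameters:`` field instead of emitting one ``:param:`` role per parameter. A minimal sketch of a NumPy-style docstring that the setting would affect; the function ``count_alt`` is hypothetical and is not part of fuc.

```python
def count_alt(genotypes):
    """Count genotype calls that carry at least one ALT allele.

    Parameters
    ----------
    genotypes : list of str
        Genotype strings such as '0/0' or '0/1'.

    Returns
    -------
    int
        Number of calls other than '0/0'.
    """
    # With napoleon_use_param = True, the Parameters section becomes
    # ':param genotypes:' plus ':type genotypes:' fields.
    # With napoleon_use_param = False, it becomes a single ':parameters:'
    # field listing every parameter.
    return sum(1 for gt in genotypes if gt != '0/0')


print(count_alt(['0/0', '0/1', '1/1']))
```
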
diff --git a/docs/create.py b/docs/create.py
@@ -84,24 +84,36 @@
 CLI Examples
 ============
 
+SAM/BAM/CRAM
+------------
+
 To print the header of a BAM file:
 
 .. code-block:: console
 
    $ fuc bam_head example.bam
 
+BED
+---
+
 To find intersection between BED files:
 
 .. code-block:: console
 
    $ fuc bed_intxn 1.bed 2.bed 3.bed > intersect.bed
 
+FASTQ
+-----
+
 To count sequence reads in a FASTQ file:
 
 .. code-block:: console
 
    $ fuc fq_count example.fastq
 
+FUC
+---
+
 To check whether a file exists in the operating system:
 
 .. code-block:: console
@@ -114,12 +126,18 @@
 
    $ fuc fuc_find . vcf
 
+TABLE
+-----
+
 To merge two tab-delimited files:
 
 .. code-block:: console
 
    $ fuc tbl_merge left.txt right.txt > merged.txt
 
+VCF
+---
+
 To merge VCF files:
 
 .. code-block:: console
@@ -129,6 +147,9 @@
 API Examples
 ============
 
+VCF
+---
+
 To filter a VCF file based on a BED file:
 
 .. code:: python3
@@ -147,11 +168,63 @@
    >>> filtered_vf = vf.filter_indel()
    >>> filtered_vf.to_file('no_indels.vcf')
 
+To create a Venn diagram showing genotype concordance between groups:
+
+.. code:: python3
+
+    >>> from fuc import pyvcf, common
+    >>> common.load_dataset('pyvcf')
+    >>> f = '~/fuc-data/pyvcf/plot_comparison.vcf'
+    >>> vf = pyvcf.VcfFrame.from_file(f)
+    >>> a = ['Steven_A', 'John_A', 'Sara_A']
+    >>> b = ['Steven_B', 'John_B', 'Sara_B']
+    >>> c = ['Steven_C', 'John_C', 'Sara_C']
+    >>> vf.plot_comparison(a, b, c)
+
+.. image:: https://raw.githubusercontent.com/sbslee/fuc-data/main/images/plot_comparison.png
+
+To create a histogram of tumor mutational burden (TMB) distribution:
+
+.. code:: python3
+
+    >>> from fuc import pyvcf
+    >>> vcf_data = {{
+    ...     'CHROM': ['chr1', 'chr1', 'chr1', 'chr1', 'chr1'],
+    ...     'POS': [100, 101, 102, 103, 103],
+    ...     'ID': ['.', '.', '.', '.', '.'],
+    ...     'REF': ['T', 'T', 'T', 'T', 'T'],
+    ...     'ALT': ['C', 'C', 'C', 'C', 'C'],
+    ...     'QUAL': ['.', '.', '.', '.', '.'],
+    ...     'FILTER': ['.', '.', '.', '.', '.'],
+    ...     'INFO': ['.', '.', '.', '.', '.'],
+    ...     'FORMAT': ['GT', 'GT', 'GT', 'GT', 'GT'],
+    ...     'Steven_N': ['0/0', '0/0', '0/1', '0/0', '0/0'],
+    ...     'Steven_T': ['0/0', '0/1', '0/1', '0/1', '0/1'],
+    ...     'Sara_N': ['0/0', '0/1', '0/0', '0/0', '0/0'],
+    ...     'Sara_T': ['0/0', '0/0', '1/1', '1/1', '0/1'],
+    ...     'John_N': ['0/0', '0/0', '0/0', '0/0', '0/0'],
+    ...     'John_T': ['0/1', '0/0', '1/1', '1/1', '0/1'],
+    ...     'Rachel_N': ['0/0', '0/0', '0/0', '0/0', '0/0'],
+    ...     'Rachel_T': ['0/1', '0/1', '0/0', '0/1', '0/1'],
+    ... }}
+    >>> annot_data = {{
+    ...     'Sample': ['Steven_N', 'Steven_T', 'Sara_N', 'Sara_T', 'John_N', 'John_T', 'Rachel_N', 'Rachel_T'],
+    ...     'Subject': ['Steven', 'Steven', 'Sara', 'Sara', 'John', 'John', 'Rachel', 'Rachel'],
+    ...     'Type': ['Normal', 'Tumor', 'Normal', 'Tumor', 'Normal', 'Tumor', 'Normal', 'Tumor'],
+    ... }}
+    >>> vf = pyvcf.VcfFrame.from_dict([], vcf_data)
+    >>> af = pyvcf.AnnFrame.from_dict(annot_data, 'Sample')
+    >>> vf.plot_histplot(hue='Type', af=af)
+
+.. image:: https://raw.githubusercontent.com/sbslee/fuc-data/main/images/plot_histplot.png
+
+MAF
+---
+
 To create an oncoplot with a MAF file:
 
 .. code:: python3
 
-    >>> import matplotlib.pyplot as plt
     >>> from fuc import common, pymaf
     >>> common.load_dataset('tcga-laml')
     >>> f = '~/fuc-data/tcga-laml/tcga_laml.maf.gz'
@@ -168,7 +241,6 @@
 
 .. code:: python3
 
-    >>> import matplotlib.pyplot as plt
     >>> from fuc import common, pymaf
     >>> common.load_dataset('tcga-laml')
     >>> f = '~/fuc-data/tcga-laml/tcga_laml.maf.gz'
@@ -177,11 +249,13 @@
 
 .. image:: https://raw.githubusercontent.com/sbslee/fuc-data/main/images/maf_summary.png
 
+SAM/BAM/CRAM
+------------
+
 To create read depth profile of a region from a CRAM file:
 
 .. code:: python3
 
-    >>> import matplotlib.pyplot as plt
     >>> from fuc import pycov
     >>> cf = pycov.CovFrame.from_file('HG00525.final.cram', zero=True,
     ...    region='chr12:21161194-21239796', names=['HG00525'])
@@ -225,7 +299,7 @@
    $ cd fuc
    $ pip install .
 
-The nice thing about this approach is that you will have access to development versions that are not available in Anaconda or PyPI. For example, you can access a development branch with the ``git checkout`` command.
+The nice thing about this approach is that you will have access to development versions that are not available in Anaconda or PyPI. For example, you can access a development branch with the ``git checkout`` command. When you do this, please make sure your environment already has all the dependencies installed.
 
 Getting Help
 ============

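Note on the doubled braces in docs/create.py above: the README text there uses ``{{`` and ``}}`` where README.rst has ``{`` and ``}``. That is consistent with the README being generated from a Python format string, where doubled braces are the escape for literal braces. The sketch below illustrates the escaping only; the template contents and the ``{version}`` placeholder are assumptions, not taken from docs/create.py.

```python
# A shortened, hypothetical template in the style of docs/create.py.
template = (
    ">>> vcf_data = {{\n"
    "...     'CHROM': ['chr1', 'chr1'],\n"
    "... }}\n"
    "Documentation for fuc {version}\n"
)

# str.format() substitutes {version} and collapses {{ and }} to { and }.
rendered = template.format(version='0.11.0')
print(rendered)
```

Without the doubling, ``str.format`` would try to interpret the dictionary literal as a replacement field and raise an error, which is why the two copies of the example differ only in their braces.
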
diff --git a/docs/requirements.txt b/docs/requirements.txt
@@ -2,6 +2,7 @@ sphinx_rtd_theme
 sphinx_issues
 autodocsumm
 matplotlib
+matplotlib-venn
 numpy
 pandas
 pysam