Merge pull request #20 from sbslee/0.12.0-dev

0.12.0 dev
sbslee · Jun 12, 2021 · d95736f
2 parents 014ba33 + b280e59

Showing 15 changed files with 611 additions and 355 deletions.
diff --git a/CHANGELOG.rst b/CHANGELOG.rst
@@ -1,6 +1,18 @@
 Changelog
 *********
 
+0.12.0 (2021-06-12)
+-------------------
+
+* Add new method :meth:`pyvcf.VcfFrame.add_af`.
+* Add new method :meth:`pyvcf.VcfFrame.extract`.
+* Deprecate methods :meth:`pyvep.filter_af/biotype/nothas/impact`.
+* Add new method :meth:`pyvep.filter_query`.
+* :issue:`19`: Add new command :command:`vcf_vep`.
+* Rename :meth:`pyvcf.VcfFrame.plot_histplot` to :meth:`pyvcf.VcfFrame.plot_tmb`.
+* Add ``scipy`` package as dependency for performing statistical analysis.
+* Add new method :meth:`pyvcf.VcfFrame.plot_hist`.
+
 0.11.0 (2021-06-10)
 -------------------
 

diff --git a/README.rst b/README.rst
@@ -116,6 +116,12 @@ To merge VCF files:
 
    $ fuc vcf_merge 1.vcf 2.vcf 3.vcf > merged.vcf
 
+To filter a VCF file annotated by Ensembl VEP:
+
+.. code-block:: console
+
+   $ fuc vcf_vep in.vcf 'SYMBOL == "TP53"' > out.vcf
+
 API Examples
 ============
 
@@ -155,40 +161,25 @@ To create a Venn diagram showing genotype concordance between groups:
 
 .. image:: https://raw.githubusercontent.com/sbslee/fuc-data/main/images/plot_comparison.png
 
-To create a histogram of tumor mutational burden (TMB) distribution:
+To create various figures for normal-tumor analysis:
 
 .. code:: python3
 
-    >>> from fuc import pyvcf
-    >>> vcf_data = {
-    ...     'CHROM': ['chr1', 'chr1', 'chr1', 'chr1', 'chr1'],
-    ...     'POS': [100, 101, 102, 103, 103],
-    ...     'ID': ['.', '.', '.', '.', '.'],
-    ...     'REF': ['T', 'T', 'T', 'T', 'T'],
-    ...     'ALT': ['C', 'C', 'C', 'C', 'C'],
-    ...     'QUAL': ['.', '.', '.', '.', '.'],
-    ...     'FILTER': ['.', '.', '.', '.', '.'],
-    ...     'INFO': ['.', '.', '.', '.', '.'],
-    ...     'FORMAT': ['GT', 'GT', 'GT', 'GT', 'GT'],
-    ...     'Steven_N': ['0/0', '0/0', '0/1', '0/0', '0/0'],
-    ...     'Steven_T': ['0/0', '0/1', '0/1', '0/1', '0/1'],
-    ...     'Sara_N': ['0/0', '0/1', '0/0', '0/0', '0/0'],
-    ...     'Sara_T': ['0/0', '0/0', '1/1', '1/1', '0/1'],
-    ...     'John_N': ['0/0', '0/0', '0/0', '0/0', '0/0'],
-    ...     'John_T': ['0/1', '0/0', '1/1', '1/1', '0/1'],
-    ...     'Rachel_N': ['0/0', '0/0', '0/0', '0/0', '0/0'],
-    ...     'Rachel_T': ['0/1', '0/1', '0/0', '0/1', '0/1'],
-    ... }
-    >>> annot_data = {
-    ...     'Sample': ['Steven_N', 'Steven_T', 'Sara_N', 'Sara_T', 'John_N', 'John_T', 'Rachel_N', 'Rachel_T'],
-    ...     'Subject': ['Steven', 'Steven', 'Sara', 'Sara', 'John', 'John', 'Rachel', 'Rachel'],
-    ...     'Type': ['Normal', 'Tumor', 'Normal', 'Tumor', 'Normal', 'Tumor', 'Normal', 'Tumor'],
-    ... }
-    >>> vf = pyvcf.VcfFrame.from_dict([], vcf_data)
-    >>> af = pyvcf.AnnFrame.from_dict(annot_data, 'Sample')
-    >>> vf.plot_histplot(hue='Type', af=af)
-
-.. image:: https://raw.githubusercontent.com/sbslee/fuc-data/main/images/plot_histplot.png
+    >>> import matplotlib.pyplot as plt
+    >>> from fuc import common, pyvcf
+    >>> common.load_dataset('pyvcf')
+    >>> vf = pyvcf.VcfFrame.from_file('~/fuc-data/pyvcf/normal-tumor.vcf')
+    >>> af = pyvcf.AnnFrame.from_file('~/fuc-data/pyvcf/normal-tumor-annot.tsv', 'Sample')
+    >>> normal = af.df[af.df.Tissue == 'Normal'].index
+    >>> tumor = af.df[af.df.Tissue == 'Tumor'].index
+    >>> fig, [[ax1, ax2], [ax3, ax4]] = plt.subplots(2, 2, figsize=(10, 10))
+    >>> vf.plot_tmb(ax=ax1)
+    >>> vf.plot_tmb(ax=ax2, af=af, hue='Tissue')
+    >>> vf.plot_hist('DP', ax=ax3, af=af, hue='Tissue')
+    >>> vf.plot_regplot(normal, tumor, ax=ax4)
+    >>> plt.tight_layout()
+
+.. image:: https://raw.githubusercontent.com/sbslee/fuc-data/main/images/normal-tumor.png
 
 MAF
 ---
@@ -245,10 +236,12 @@ The following packages are required to run fuc:
    biopython
    lxml
    matplotlib
+   matplotlib-venn
    numpy
    pandas
    pyranges
    pysam
+   scipy
    seaborn
 
 There are various ways you can install fuc. The recommended way is via conda:
@@ -307,6 +300,7 @@ For getting help on CLI:
        vcf_merge    [VCF] merge two or more VCF files
        vcf_slice    [VCF] slice a VCF file
        vcf_vcf2bed  [VCF] convert a VCF file to a BED file
+       vcf_vep      [VCF] filter a VCF file annotated by Ensembl VEP
    
    optional arguments:
      -h, --help     show this help message and exit
@@ -327,7 +321,7 @@ Below is the list of submodules available in API:
 - **pyfq** : The pyfq submodule is designed for working with FASTQ files. It implements ``pyfq.FqFrame`` which stores FASTQ data as ``pandas.DataFrame`` to allow fast computation and easy manipulation.
 - **pymaf** : The pymaf submodule is designed for working with MAF files. It implements ``pymaf.MafFrame`` which stores MAF data as ``pandas.DataFrame`` to allow fast computation and easy manipulation. The ``pymaf.MafFrame`` class also contains many useful plotting methods such as ``MafFrame.plot_oncoplot`` and ``MafFrame.plot_summary``. The submodule strictly adheres to the standard `MAF specification <https://docs.gdc.cancer.gov/Data/File_Formats/MAF_Format/>`_.
 - **pysnpeff** : The pysnpeff submodule is designed for parsing VCF annotation data from the `SnpEff <https://pcingola.github.io/SnpEff/>`_ program. It should be used with ``pyvcf.VcfFrame``.
-- **pyvcf** : The pyvcf submodule is designed for working with VCF files. It implements ``pyvcf.VcfFrame`` which stores VCF data as ``pandas.DataFrame`` to allow fast computation and easy manipulation. The ``pyvcf.VcfFrame`` class also contains many useful plotting methods such as ``VcfFrame.plot_comparison`` and ``VcfFrame.plot_regplot``. The submodule strictly adheres to the standard `VCF specification <https://samtools.github.io/hts-specs/VCFv4.3.pdf>`_.
+- **pyvcf** : The pyvcf submodule is designed for working with VCF files. It implements ``pyvcf.VcfFrame`` which stores VCF data as ``pandas.DataFrame`` to allow fast computation and easy manipulation. The ``pyvcf.VcfFrame`` class also contains many useful plotting methods such as ``VcfFrame.plot_comparison`` and ``VcfFrame.plot_tmb``. The submodule strictly adheres to the standard `VCF specification <https://samtools.github.io/hts-specs/VCFv4.3.pdf>`_.
 - **pyvep** : The pyvep submodule is designed for parsing VCF annotation data from the `Ensembl VEP <https://asia.ensembl.org/info/docs/tools/vep/index.html>`_ program. It should be used with ``pyvcf.VcfFrame``.
 
 For getting help on a specific module (e.g. pyvcf):
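
The new README example picks out normal and tumor samples with `af.df[af.df.Tissue == 'Normal'].index`. As a rough illustration of that selection step, here is a sketch in plain pandas. The small `annot` table and its sample names are made up for the example and only stand in for the annotation data that `AnnFrame` exposes as `af.df`; this is not fuc's own code.

```python
import pandas as pd

# Hypothetical annotation table, indexed by sample name, standing in for af.df.
annot = pd.DataFrame(
    {"Tissue": ["Normal", "Tumor", "Normal", "Tumor"]},
    index=pd.Index(["Steven_N", "Steven_T", "Sara_N", "Sara_T"], name="Sample"),
)

# A boolean mask keeps the matching rows; .index then yields the sample names.
normal = annot[annot.Tissue == "Normal"].index
tumor = annot[annot.Tissue == "Tumor"].index

print(list(normal))
print(list(tumor))
```

The two resulting indexes are what the README example passes to `vf.plot_regplot(normal, tumor, ax=ax4)`.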

diff --git a/conda.yml b/conda.yml
@@ -14,6 +14,7 @@ dependencies:
   - pandas
   - pyranges
   - pysam
+  - scipy
   - seaborn
   - sphinx-issues
   - sphinx_rtd_theme

diff --git a/docs/api.rst b/docs/api.rst
@@ -18,7 +18,7 @@ Below is the list of submodules available in API:
 - **pyfq** : The pyfq submodule is designed for working with FASTQ files. It implements ``pyfq.FqFrame`` which stores FASTQ data as ``pandas.DataFrame`` to allow fast computation and easy manipulation.
 - **pymaf** : The pymaf submodule is designed for working with MAF files. It implements ``pymaf.MafFrame`` which stores MAF data as ``pandas.DataFrame`` to allow fast computation and easy manipulation. The ``pymaf.MafFrame`` class also contains many useful plotting methods such as ``MafFrame.plot_oncoplot`` and ``MafFrame.plot_summary``. The submodule strictly adheres to the standard `MAF specification <https://docs.gdc.cancer.gov/Data/File_Formats/MAF_Format/>`_.
 - **pysnpeff** : The pysnpeff submodule is designed for parsing VCF annotation data from the `SnpEff <https://pcingola.github.io/SnpEff/>`_ program. It should be used with ``pyvcf.VcfFrame``.
-- **pyvcf** : The pyvcf submodule is designed for working with VCF files. It implements ``pyvcf.VcfFrame`` which stores VCF data as ``pandas.DataFrame`` to allow fast computation and easy manipulation. The ``pyvcf.VcfFrame`` class also contains many useful plotting methods such as ``VcfFrame.plot_comparison`` and ``VcfFrame.plot_regplot``. The submodule strictly adheres to the standard `VCF specification <https://samtools.github.io/hts-specs/VCFv4.3.pdf>`_.
+- **pyvcf** : The pyvcf submodule is designed for working with VCF files. It implements ``pyvcf.VcfFrame`` which stores VCF data as ``pandas.DataFrame`` to allow fast computation and easy manipulation. The ``pyvcf.VcfFrame`` class also contains many useful plotting methods such as ``VcfFrame.plot_comparison`` and ``VcfFrame.plot_tmb``. The submodule strictly adheres to the standard `VCF specification <https://samtools.github.io/hts-specs/VCFv4.3.pdf>`_.
 - **pyvep** : The pyvep submodule is designed for parsing VCF annotation data from the `Ensembl VEP <https://asia.ensembl.org/info/docs/tools/vep/index.html>`_ program. It should be used with ``pyvcf.VcfFrame``.
 
 For getting help on a specific module (e.g. pyvcf):

diff --git a/docs/cli.rst b/docs/cli.rst
@@ -38,6 +38,7 @@ For getting help on CLI:
        vcf_merge    [VCF] merge two or more VCF files
        vcf_slice    [VCF] slice a VCF file
        vcf_vcf2bed  [VCF] convert a VCF file to a BED file
+       vcf_vep      [VCF] filter a VCF file annotated by Ensembl VEP
    
    optional arguments:
      -h, --help     show this help message and exit
@@ -117,7 +118,7 @@ bam_slice
    
    optional arguments:
      -h, --help  show this help message and exit
-     --no_index  use to this flag to skip indexing
+     --no_index  use this flag to skip indexing
 
 bed_intxn
 =========
@@ -481,3 +482,31 @@ vcf_vcf2bed
    optional arguments:
      -h, --help  show this help message and exit
 
+vcf_vep
+=======
+
+.. code-block:: console
+
+   $ fuc vcf_vep -h
+   usage: fuc vcf_vep [-h] [--opposite] [--as_zero] vcf expr
+   
+   This command will filter a VCF file annotated by Ensembl VEP. It essentially wraps the `pandas.DataFrame.query` method. For details on query expression, please visit the method's documentation page (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.query.html#pandas-dataframe-query).
+   
+   examples:
+     $ fuc vcf_vep in.vcf 'SYMBOL == "TP53"' > out.vcf
+     $ fuc vcf_vep in.vcf 'SYMBOL != "TP53"' > out.vcf
+     $ fuc vcf_vep in.vcf 'SYMBOL == "TP53"' --opposite > out.vcf
+     $ fuc vcf_vep in.vcf 'Consequence in ["splice_donor_variant", "stop_gained"]' > out.vcf
+     $ fuc vcf_vep in.vcf '(SYMBOL == "TP53") and (Consequence.str.contains("stop_gained"))' > out.vcf
+     $ fuc vcf_vep in.vcf 'gnomAD_AF < 0.001' > out.vcf
+     $ fuc vcf_vep in.vcf 'gnomAD_AF < 0.001' --as_zero > out.vcf
+   
+   positional arguments:
+     vcf         Ensembl VEP-annotated VCF file
+     expr        query expression to evaluate
+   
+   optional arguments:
+     -h, --help  show this help message and exit
+     --opposite  use this flag to return records that don’t meet the said criteria
+     --as_zero   use this flag to treat missing values as zero instead of NaN
+
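
The help text above says `vcf_vep` essentially wraps `pandas.DataFrame.query`. To show how such expressions behave, the sketch below runs a few of the listed examples against a small hand-made table. The rows are invented, and the last two steps are only an assumption about what `--as_zero` and `--opposite` amount to in pandas terms; none of this is fuc's actual implementation.

```python
import pandas as pd

# Hypothetical stand-in for parsed VEP annotations, one row per variant.
df = pd.DataFrame({
    "SYMBOL": ["TP53", "BRCA1", "TP53", "KRAS"],
    "Consequence": ["stop_gained", "missense_variant",
                    "missense_variant", "splice_donor_variant"],
    "gnomAD_AF": [0.0005, 0.02, None, 0.0001],  # None becomes NaN
})

# Equality and membership tests, as in the CLI examples.
tp53 = df.query('SYMBOL == "TP53"')
severe = df.query('Consequence in ["splice_donor_variant", "stop_gained"]')

# A comparison against NaN is False, so the row with a missing value is dropped.
rare = df.query('gnomAD_AF < 0.001')

# Assumed analogue of --as_zero: fill missing values with 0 before querying.
rare_as_zero = df.fillna({"gnomAD_AF": 0}).query('gnomAD_AF < 0.001')

# Assumed analogue of --opposite: keep the rows the query did not select.
not_tp53 = df.drop(tp53.index)

print(list(rare.index), list(rare_as_zero.index))
```

Note how the variant with a missing `gnomAD_AF` passes the rarity filter only once missing values are treated as zero.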
diff --git a/docs/create.py b/docs/create.py
@@ -144,6 +144,12 @@
 
    $ fuc vcf_merge 1.vcf 2.vcf 3.vcf > merged.vcf
 
+To filter a VCF file annotated by Ensembl VEP:
+
+.. code-block:: console
+
+   $ fuc vcf_vep in.vcf 'SYMBOL == "TP53"' > out.vcf
+
 API Examples
 ============
 
@@ -183,40 +189,25 @@
 
 .. image:: https://raw.githubusercontent.com/sbslee/fuc-data/main/images/plot_comparison.png
 
-To create a histogram of tumor mutational burden (TMB) distribution:
+To create various figures for normal-tumor analysis:
 
 .. code:: python3
 
-    >>> from fuc import pyvcf
-    >>> vcf_data = {{
-    ...     'CHROM': ['chr1', 'chr1', 'chr1', 'chr1', 'chr1'],
-    ...     'POS': [100, 101, 102, 103, 103],
-    ...     'ID': ['.', '.', '.', '.', '.'],
-    ...     'REF': ['T', 'T', 'T', 'T', 'T'],
-    ...     'ALT': ['C', 'C', 'C', 'C', 'C'],
-    ...     'QUAL': ['.', '.', '.', '.', '.'],
-    ...     'FILTER': ['.', '.', '.', '.', '.'],
-    ...     'INFO': ['.', '.', '.', '.', '.'],
-    ...     'FORMAT': ['GT', 'GT', 'GT', 'GT', 'GT'],
-    ...     'Steven_N': ['0/0', '0/0', '0/1', '0/0', '0/0'],
-    ...     'Steven_T': ['0/0', '0/1', '0/1', '0/1', '0/1'],
-    ...     'Sara_N': ['0/0', '0/1', '0/0', '0/0', '0/0'],
-    ...     'Sara_T': ['0/0', '0/0', '1/1', '1/1', '0/1'],
-    ...     'John_N': ['0/0', '0/0', '0/0', '0/0', '0/0'],
-    ...     'John_T': ['0/1', '0/0', '1/1', '1/1', '0/1'],
-    ...     'Rachel_N': ['0/0', '0/0', '0/0', '0/0', '0/0'],
-    ...     'Rachel_T': ['0/1', '0/1', '0/0', '0/1', '0/1'],
-    ... }}
-    >>> annot_data = {{
-    ...     'Sample': ['Steven_N', 'Steven_T', 'Sara_N', 'Sara_T', 'John_N', 'John_T', 'Rachel_N', 'Rachel_T'],
-    ...     'Subject': ['Steven', 'Steven', 'Sara', 'Sara', 'John', 'John', 'Rachel', 'Rachel'],
-    ...     'Type': ['Normal', 'Tumor', 'Normal', 'Tumor', 'Normal', 'Tumor', 'Normal', 'Tumor'],
-    ... }}
-    >>> vf = pyvcf.VcfFrame.from_dict([], vcf_data)
-    >>> af = pyvcf.AnnFrame.from_dict(annot_data, 'Sample')
-    >>> vf.plot_histplot(hue='Type', af=af)
-
-.. image:: https://raw.githubusercontent.com/sbslee/fuc-data/main/images/plot_histplot.png
+    >>> import matplotlib.pyplot as plt
+    >>> from fuc import common, pyvcf
+    >>> common.load_dataset('pyvcf')
+    >>> vf = pyvcf.VcfFrame.from_file('~/fuc-data/pyvcf/normal-tumor.vcf')
+    >>> af = pyvcf.AnnFrame.from_file('~/fuc-data/pyvcf/normal-tumor-annot.tsv', 'Sample')
+    >>> normal = af.df[af.df.Tissue == 'Normal'].index
+    >>> tumor = af.df[af.df.Tissue == 'Tumor'].index
+    >>> fig, [[ax1, ax2], [ax3, ax4]] = plt.subplots(2, 2, figsize=(10, 10))
+    >>> vf.plot_tmb(ax=ax1)
+    >>> vf.plot_tmb(ax=ax2, af=af, hue='Tissue')
+    >>> vf.plot_hist('DP', ax=ax3, af=af, hue='Tissue')
+    >>> vf.plot_regplot(normal, tumor, ax=ax4)
+    >>> plt.tight_layout()
+
+.. image:: https://raw.githubusercontent.com/sbslee/fuc-data/main/images/normal-tumor.png
 
 MAF
 ---
@@ -273,10 +264,12 @@
    biopython
    lxml
    matplotlib
+   matplotlib-venn
    numpy
    pandas
    pyranges
    pysam
+   scipy
    seaborn
 
 There are various ways you can install fuc. The recommended way is via conda:

diff --git a/docs/requirements.txt b/docs/requirements.txt
@@ -6,4 +6,5 @@ matplotlib-venn
 numpy
 pandas
 pysam
+scipy
 seaborn
diff --git a/fuc/api/common.py b/fuc/api/common.py
@@ -73,6 +73,8 @@ def load_dataset(name, force=False):
         ],
         'pyvcf': [
             'plot_comparison.vcf',
+            'normal-tumor.vcf',
+            'normal-tumor-annot.tsv',
         ],
     }
     base_url = ('https://raw.githubusercontent.com/sbslee/fuc-data/main')

diff --git a/fuc/api/pymaf.py b/fuc/api/pymaf.py
@@ -11,39 +11,39 @@
 protein change. However, most of the analysis in pymaf uses the
 following fields:
 
-+------+------------------------+----------------------+-------------------------------+
-| No.  | Name                   | Description          | Examples                      |
-+======+========================+======================+===============================+
-| 1    | Hugo_Symbol            | HUGO gene symbol     | 'TP53', 'Unknown'             |
-+------+------------------------+----------------------+-------------------------------+
-| 2    | Entrez_Gene_Id         | Entrez or Ensembl ID | 0, 8714                       |
-+------+------------------------+----------------------+-------------------------------+
-| 3    | Center                 | Sequencing center    | '.', 'genome.wustl.edu'       |
-+------+------------------------+----------------------+-------------------------------+
-| 4    | NCBI_Build             | Genome assembly      | '37', 'GRCh38'                |
-+------+------------------------+----------------------+-------------------------------+
-| 5    | Chromosome             | Chromosome name      | 'chr1'                        |
-+------+------------------------+----------------------+-------------------------------+
-| 6    | Start_Position         | Start coordinate     | 119031351                     |
-+------+------------------------+----------------------+-------------------------------+
-| 7    | End_Position           | End coordinate       | 44079555                      |
-+------+------------------------+----------------------+-------------------------------+
-| 8    | Strand                 | Genomic strand       | '+', '-'                      |
-+------+------------------------+----------------------+-------------------------------+
-| 9    | Variant_Classification | Translational effect | 'Missense_Mutation', 'Silent' |
-+------+------------------------+----------------------+-------------------------------+
-| 10   | Variant_Type           | Mutation type        | 'SNP', 'INS', 'DEL'           |
-+------+------------------------+----------------------+-------------------------------+
-| 11   | Reference_Allele       | Reference allele     | 'T', '-', 'ACAA'              |
-+------+------------------------+----------------------+-------------------------------+
-| 12   | Tumor_Seq_Allele1      | First tumor allele   | 'A', '-', 'TCA'               |
-+------+------------------------+----------------------+-------------------------------+
-| 13   | Tumor_Seq_Allele2      | Second tumor allele  | 'A', '-', 'TCA'               |
-+------+------------------------+----------------------+-------------------------------+
-| 14   | Tumor_Sample_Barcode   | Sample ID            | 'TCGA-AB-3002'                |
-+------+------------------------+----------------------+-------------------------------+
-| 15   | Protein_Change         | Protein change       | 'p.L558Q'                     |
-+------+------------------------+----------------------+-------------------------------+
++-----+------------------------+----------------------+-------------------------------+
+| No. | Name                   | Description          | Examples                      |
++=====+========================+======================+===============================+
+| 1   | Hugo_Symbol            | HUGO gene symbol     | 'TP53', 'Unknown'             |
++-----+------------------------+----------------------+-------------------------------+
+| 2   | Entrez_Gene_Id         | Entrez or Ensembl ID | 0, 8714                       |
++-----+------------------------+----------------------+-------------------------------+
+| 3   | Center                 | Sequencing center    | '.', 'genome.wustl.edu'       |
++-----+------------------------+----------------------+-------------------------------+
+| 4   | NCBI_Build             | Genome assembly      | '37', 'GRCh38'                |
++-----+------------------------+----------------------+-------------------------------+
+| 5   | Chromosome             | Chromosome name      | 'chr1'                        |
++-----+------------------------+----------------------+-------------------------------+
+| 6   | Start_Position         | Start coordinate     | 119031351                     |
++-----+------------------------+----------------------+-------------------------------+
+| 7   | End_Position           | End coordinate       | 44079555                      |
++-----+------------------------+----------------------+-------------------------------+
+| 8   | Strand                 | Genomic strand       | '+', '-'                      |
++-----+------------------------+----------------------+-------------------------------+
+| 9   | Variant_Classification | Translational effect | 'Missense_Mutation', 'Silent' |
++-----+------------------------+----------------------+-------------------------------+
+| 10  | Variant_Type           | Mutation type        | 'SNP', 'INS', 'DEL'           |
++-----+------------------------+----------------------+-------------------------------+
+| 11  | Reference_Allele       | Reference allele     | 'T', '-', 'ACAA'              |
++-----+------------------------+----------------------+-------------------------------+
+| 12  | Tumor_Seq_Allele1      | First tumor allele   | 'A', '-', 'TCA'               |
++-----+------------------------+----------------------+-------------------------------+
+| 13  | Tumor_Seq_Allele2      | Second tumor allele  | 'A', '-', 'TCA'               |
++-----+------------------------+----------------------+-------------------------------+
+| 14  | Tumor_Sample_Barcode   | Sample ID            | 'TCGA-AB-3002'                |
++-----+------------------------+----------------------+-------------------------------+
+| 15  | Protein_Change         | Protein change       | 'p.L558Q'                     |
++-----+------------------------+----------------------+-------------------------------+
 """
 
 import pandas as pd