Merge pull request #20 from sbslee/0.12.0-dev
0.12.0 dev
sbslee authored Jun 12, 2021
2 parents 014ba33 + b280e59 commit d95736f
Showing 15 changed files with 611 additions and 355 deletions.
12 changes: 12 additions & 0 deletions CHANGELOG.rst
@@ -1,6 +1,18 @@
Changelog
*********

0.12.0 (2021-06-12)
-------------------

* Add new method :meth:`pyvcf.VcfFrame.add_af`.
* Add new method :meth:`pyvcf.VcfFrame.extract`.
* Deprecate methods :meth:`pyvep.filter_af/biotype/nothas/impact`.
* Add new method :meth:`pyvep.filter_query`.
* :issue:`19`: Add new command :command:`vcf_vep`.
* Rename :meth:`pyvcf.VcfFrame.plot_histplot` to :meth:`pyvcf.VcfFrame.plot_tmb`.
* Add ``scipy`` package as dependency for performing statistical analysis.
* Add new method :meth:`pyvcf.VcfFrame.plot_hist`.

0.11.0 (2021-06-10)
-------------------

58 changes: 26 additions & 32 deletions README.rst
@@ -116,6 +116,12 @@ To merge VCF files:
$ fuc vcf_merge 1.vcf 2.vcf 3.vcf > merged.vcf
To filter a VCF file annotated by Ensembl VEP:

.. code-block:: console
$ fuc vcf_vep in.vcf 'SYMBOL == "TP53"' > out.vcf
API Examples
============

@@ -155,40 +161,25 @@ To create a Venn diagram showing genotype concordance between groups:
.. image:: https://raw.githubusercontent.com/sbslee/fuc-data/main/images/plot_comparison.png

To create a histogram of tumor mutational burden (TMB) distribution:
To create various figures for normal-tumor analysis:

.. code:: python3
>>> from fuc import pyvcf
>>> vcf_data = {
... 'CHROM': ['chr1', 'chr1', 'chr1', 'chr1', 'chr1'],
... 'POS': [100, 101, 102, 103, 103],
... 'ID': ['.', '.', '.', '.', '.'],
... 'REF': ['T', 'T', 'T', 'T', 'T'],
... 'ALT': ['C', 'C', 'C', 'C', 'C'],
... 'QUAL': ['.', '.', '.', '.', '.'],
... 'FILTER': ['.', '.', '.', '.', '.'],
... 'INFO': ['.', '.', '.', '.', '.'],
... 'FORMAT': ['GT', 'GT', 'GT', 'GT', 'GT'],
... 'Steven_N': ['0/0', '0/0', '0/1', '0/0', '0/0'],
... 'Steven_T': ['0/0', '0/1', '0/1', '0/1', '0/1'],
... 'Sara_N': ['0/0', '0/1', '0/0', '0/0', '0/0'],
... 'Sara_T': ['0/0', '0/0', '1/1', '1/1', '0/1'],
... 'John_N': ['0/0', '0/0', '0/0', '0/0', '0/0'],
... 'John_T': ['0/1', '0/0', '1/1', '1/1', '0/1'],
... 'Rachel_N': ['0/0', '0/0', '0/0', '0/0', '0/0'],
... 'Rachel_T': ['0/1', '0/1', '0/0', '0/1', '0/1'],
... }
>>> annot_data = {
... 'Sample': ['Steven_N', 'Steven_T', 'Sara_N', 'Sara_T', 'John_N', 'John_T', 'Rachel_N', 'Rachel_T'],
... 'Subject': ['Steven', 'Steven', 'Sara', 'Sara', 'John', 'John', 'Rachel', 'Rachel'],
... 'Type': ['Normal', 'Tumor', 'Normal', 'Tumor', 'Normal', 'Tumor', 'Normal', 'Tumor'],
... }
>>> vf = pyvcf.VcfFrame.from_dict([], vcf_data)
>>> af = pyvcf.AnnFrame.from_dict(annot_data, 'Sample')
>>> vf.plot_histplot(hue='Type', af=af)
.. image:: https://raw.githubusercontent.com/sbslee/fuc-data/main/images/plot_histplot.png
>>> import matplotlib.pyplot as plt
>>> from fuc import common, pyvcf
>>> common.load_dataset('pyvcf')
>>> vf = pyvcf.VcfFrame.from_file('~/fuc-data/pyvcf/normal-tumor.vcf')
>>> af = pyvcf.AnnFrame.from_file('~/fuc-data/pyvcf/normal-tumor-annot.tsv', 'Sample')
>>> normal = af.df[af.df.Tissue == 'Normal'].index
>>> tumor = af.df[af.df.Tissue == 'Tumor'].index
>>> fig, [[ax1, ax2], [ax3, ax4]] = plt.subplots(2, 2, figsize=(10, 10))
>>> vf.plot_tmb(ax=ax1)
>>> vf.plot_tmb(ax=ax2, af=af, hue='Tissue')
>>> vf.plot_hist('DP', ax=ax3, af=af, hue='Tissue')
>>> vf.plot_regplot(normal, tumor, ax=ax4)
>>> plt.tight_layout()
.. image:: https://raw.githubusercontent.com/sbslee/fuc-data/main/images/normal-tumor.png
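
The ``normal`` and ``tumor`` variables in the example above are built with ordinary pandas boolean indexing on the annotation table. The sketch below reproduces that selection step on a small hand-made table, so it runs without downloading the dataset. The sample names are hypothetical, and it assumes that ``af.df`` is a ``pandas.DataFrame`` indexed by the ``Sample`` column, as the second argument to ``AnnFrame.from_file`` suggests.

```python
import pandas as pd

# Hypothetical annotation table indexed by sample name; it stands in for
# af.df in the example above (an assumption, not data from fuc-data).
annot = pd.DataFrame(
    {'Tissue': ['Normal', 'Tumor', 'Normal', 'Tumor']},
    index=pd.Index(['A_N', 'A_T', 'B_N', 'B_T'], name='Sample'),
)

# The comparison yields a boolean mask; indexing with it keeps matching
# rows, and .index returns the sample names of those rows.
normal = annot[annot.Tissue == 'Normal'].index
tumor = annot[annot.Tissue == 'Tumor'].index

print(list(normal))  # ['A_N', 'B_N']
print(list(tumor))   # ['A_T', 'B_T']
```

Both results keep the row order of the annotation table.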

MAF
---
@@ -245,10 +236,12 @@ The following packages are required to run fuc:
biopython
lxml
matplotlib
matplotlib-venn
numpy
pandas
pyranges
pysam
scipy
seaborn
There are various ways you can install fuc. The recommended way is via conda:
@@ -307,6 +300,7 @@ For getting help on CLI:
vcf_merge [VCF] merge two or more VCF files
vcf_slice [VCF] slice a VCF file
vcf_vcf2bed [VCF] convert a VCF file to a BED file
vcf_vep [VCF] filter a VCF file annotated by Ensembl VEP
optional arguments:
-h, --help show this help message and exit
@@ -327,7 +321,7 @@ Below is the list of submodules available in API:
- **pyfq** : The pyfq submodule is designed for working with FASTQ files. It implements ``pyfq.FqFrame`` which stores FASTQ data as ``pandas.DataFrame`` to allow fast computation and easy manipulation.
- **pymaf** : The pymaf submodule is designed for working with MAF files. It implements ``pymaf.MafFrame`` which stores MAF data as ``pandas.DataFrame`` to allow fast computation and easy manipulation. The ``pymaf.MafFrame`` class also contains many useful plotting methods such as ``MafFrame.plot_oncoplot`` and ``MafFrame.plot_summary``. The submodule strictly adheres to the standard `MAF specification <https://docs.gdc.cancer.gov/Data/File_Formats/MAF_Format/>`_.
- **pysnpeff** : The pysnpeff submodule is designed for parsing VCF annotation data from the `SnpEff <https://pcingola.github.io/SnpEff/>`_ program. It should be used with ``pyvcf.VcfFrame``.
- **pyvcf** : The pyvcf submodule is designed for working with VCF files. It implements ``pyvcf.VcfFrame`` which stores VCF data as ``pandas.DataFrame`` to allow fast computation and easy manipulation. The ``pyvcf.VcfFrame`` class also contains many useful plotting methods such as ``VcfFrame.plot_comparison`` and ``VcfFrame.plot_regplot``. The submodule strictly adheres to the standard `VCF specification <https://samtools.github.io/hts-specs/VCFv4.3.pdf>`_.
- **pyvcf** : The pyvcf submodule is designed for working with VCF files. It implements ``pyvcf.VcfFrame`` which stores VCF data as ``pandas.DataFrame`` to allow fast computation and easy manipulation. The ``pyvcf.VcfFrame`` class also contains many useful plotting methods such as ``VcfFrame.plot_comparison`` and ``VcfFrame.plot_tmb``. The submodule strictly adheres to the standard `VCF specification <https://samtools.github.io/hts-specs/VCFv4.3.pdf>`_.
- **pyvep** : The pyvep submodule is designed for parsing VCF annotation data from the `Ensembl VEP <https://asia.ensembl.org/info/docs/tools/vep/index.html>`_ program. It should be used with ``pyvcf.VcfFrame``.

For getting help on a specific module (e.g. pyvcf):
1 change: 1 addition & 0 deletions conda.yml
@@ -14,6 +14,7 @@ dependencies:
- pandas
- pyranges
- pysam
- scipy
- seaborn
- sphinx-issues
- sphinx_rtd_theme
2 changes: 1 addition & 1 deletion docs/api.rst
@@ -18,7 +18,7 @@ Below is the list of submodules available in API:
- **pyfq** : The pyfq submodule is designed for working with FASTQ files. It implements ``pyfq.FqFrame`` which stores FASTQ data as ``pandas.DataFrame`` to allow fast computation and easy manipulation.
- **pymaf** : The pymaf submodule is designed for working with MAF files. It implements ``pymaf.MafFrame`` which stores MAF data as ``pandas.DataFrame`` to allow fast computation and easy manipulation. The ``pymaf.MafFrame`` class also contains many useful plotting methods such as ``MafFrame.plot_oncoplot`` and ``MafFrame.plot_summary``. The submodule strictly adheres to the standard `MAF specification <https://docs.gdc.cancer.gov/Data/File_Formats/MAF_Format/>`_.
- **pysnpeff** : The pysnpeff submodule is designed for parsing VCF annotation data from the `SnpEff <https://pcingola.github.io/SnpEff/>`_ program. It should be used with ``pyvcf.VcfFrame``.
- **pyvcf** : The pyvcf submodule is designed for working with VCF files. It implements ``pyvcf.VcfFrame`` which stores VCF data as ``pandas.DataFrame`` to allow fast computation and easy manipulation. The ``pyvcf.VcfFrame`` class also contains many useful plotting methods such as ``VcfFrame.plot_comparison`` and ``VcfFrame.plot_regplot``. The submodule strictly adheres to the standard `VCF specification <https://samtools.github.io/hts-specs/VCFv4.3.pdf>`_.
- **pyvcf** : The pyvcf submodule is designed for working with VCF files. It implements ``pyvcf.VcfFrame`` which stores VCF data as ``pandas.DataFrame`` to allow fast computation and easy manipulation. The ``pyvcf.VcfFrame`` class also contains many useful plotting methods such as ``VcfFrame.plot_comparison`` and ``VcfFrame.plot_tmb``. The submodule strictly adheres to the standard `VCF specification <https://samtools.github.io/hts-specs/VCFv4.3.pdf>`_.
- **pyvep** : The pyvep submodule is designed for parsing VCF annotation data from the `Ensembl VEP <https://asia.ensembl.org/info/docs/tools/vep/index.html>`_ program. It should be used with ``pyvcf.VcfFrame``.

For getting help on a specific module (e.g. pyvcf):
31 changes: 30 additions & 1 deletion docs/cli.rst
@@ -38,6 +38,7 @@ For getting help on CLI:
vcf_merge [VCF] merge two or more VCF files
vcf_slice [VCF] slice a VCF file
vcf_vcf2bed [VCF] convert a VCF file to a BED file
vcf_vep [VCF] filter a VCF file annotated by Ensembl VEP
optional arguments:
-h, --help show this help message and exit
@@ -117,7 +118,7 @@ bam_slice
optional arguments:
-h, --help show this help message and exit
--no_index use to this flag to skip indexing
--no_index use this flag to skip indexing
bed_intxn
=========
@@ -481,3 +482,31 @@ vcf_vcf2bed
optional arguments:
-h, --help show this help message and exit
vcf_vep
=======

.. code-block:: console
$ fuc vcf_vep -h
usage: fuc vcf_vep [-h] [--opposite] [--as_zero] vcf expr
This command will filter a VCF file annotated by Ensembl VEP. It essentially wraps the `pandas.DataFrame.query` method. For details on query expressions, please visit the method's documentation page (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.query.html#pandas-dataframe-query).
examples:
$ fuc vcf_vep in.vcf 'SYMBOL == "TP53"' > out.vcf
$ fuc vcf_vep in.vcf 'SYMBOL != "TP53"' > out.vcf
$ fuc vcf_vep in.vcf 'SYMBOL == "TP53"' --opposite > out.vcf
$ fuc vcf_vep in.vcf 'Consequence in ["splice_donor_variant", "stop_gained"]' > out.vcf
$ fuc vcf_vep in.vcf '(SYMBOL == "TP53") and (Consequence.str.contains("stop_gained"))' > out.vcf
$ fuc vcf_vep in.vcf 'gnomAD_AF < 0.001' > out.vcf
$ fuc vcf_vep in.vcf 'gnomAD_AF < 0.001' --as_zero > out.vcf
positional arguments:
vcf Ensembl VEP-annotated VCF file
expr query expression to evaluate
optional arguments:
-h, --help show this help message and exit
--opposite use this flag to return records that don’t meet the said criteria
--as_zero use this flag to treat missing values as zero instead of NaN
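
Because the help text says the command essentially wraps ``pandas.DataFrame.query``, the example expressions can be tried directly on a DataFrame. The sketch below is not the command's implementation: the table is hypothetical (made-up values, with column names taken from the examples above), and the ``fillna`` and ``drop`` steps only approximate what ``--as_zero`` and ``--opposite`` are described as doing.

```python
import numpy as np
import pandas as pd

# Hypothetical VEP annotations, one row per variant. Column names follow the
# CLI examples; the values are invented for illustration.
df = pd.DataFrame({
    'SYMBOL': ['TP53', 'TP53', 'KRAS', 'EGFR'],
    'Consequence': ['stop_gained', 'missense_variant',
                    'splice_donor_variant',
                    'stop_gained&splice_region_variant'],
    'gnomAD_AF': [0.0002, np.nan, 0.05, np.nan],
})

# Equality test on a string column.
tp53 = df.query('SYMBOL == "TP53"')

# Membership test: the whole cell must equal one of the listed values.
in_list = df.query('Consequence in ["splice_donor_variant", "stop_gained"]')

# Substring match; engine='python' is passed so that the .str accessor
# works whether or not numexpr is installed.
stop = df.query(
    '(SYMBOL == "TP53") and (Consequence.str.contains("stop_gained"))',
    engine='python',
)

# Comparisons against NaN are False, so rows without a frequency drop out.
rare = df.query('gnomAD_AF < 0.001')

# Approximation of --as_zero: fill missing frequencies with 0 before querying.
rare_as_zero = df.fillna({'gnomAD_AF': 0}).query('gnomAD_AF < 0.001')

# Approximation of --opposite: keep the rows the query did not select.
not_tp53 = df.drop(tp53.index)

print(list(rare.index), list(rare_as_zero.index))
```

Note the difference between ``in`` and ``str.contains``: the fourth row carries ``stop_gained`` as part of a combined consequence string, so it matches a substring search but not the membership test.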
55 changes: 24 additions & 31 deletions docs/create.py
@@ -144,6 +144,12 @@
$ fuc vcf_merge 1.vcf 2.vcf 3.vcf > merged.vcf
To filter a VCF file annotated by Ensembl VEP:
.. code-block:: console
$ fuc vcf_vep in.vcf 'SYMBOL == "TP53"' > out.vcf
API Examples
============
@@ -183,40 +189,25 @@
.. image:: https://raw.githubusercontent.com/sbslee/fuc-data/main/images/plot_comparison.png
To create a histogram of tumor mutational burden (TMB) distribution:
To create various figures for normal-tumor analysis:
.. code:: python3
>>> from fuc import pyvcf
>>> vcf_data = {{
... 'CHROM': ['chr1', 'chr1', 'chr1', 'chr1', 'chr1'],
... 'POS': [100, 101, 102, 103, 103],
... 'ID': ['.', '.', '.', '.', '.'],
... 'REF': ['T', 'T', 'T', 'T', 'T'],
... 'ALT': ['C', 'C', 'C', 'C', 'C'],
... 'QUAL': ['.', '.', '.', '.', '.'],
... 'FILTER': ['.', '.', '.', '.', '.'],
... 'INFO': ['.', '.', '.', '.', '.'],
... 'FORMAT': ['GT', 'GT', 'GT', 'GT', 'GT'],
... 'Steven_N': ['0/0', '0/0', '0/1', '0/0', '0/0'],
... 'Steven_T': ['0/0', '0/1', '0/1', '0/1', '0/1'],
... 'Sara_N': ['0/0', '0/1', '0/0', '0/0', '0/0'],
... 'Sara_T': ['0/0', '0/0', '1/1', '1/1', '0/1'],
... 'John_N': ['0/0', '0/0', '0/0', '0/0', '0/0'],
... 'John_T': ['0/1', '0/0', '1/1', '1/1', '0/1'],
... 'Rachel_N': ['0/0', '0/0', '0/0', '0/0', '0/0'],
... 'Rachel_T': ['0/1', '0/1', '0/0', '0/1', '0/1'],
... }}
>>> annot_data = {{
... 'Sample': ['Steven_N', 'Steven_T', 'Sara_N', 'Sara_T', 'John_N', 'John_T', 'Rachel_N', 'Rachel_T'],
... 'Subject': ['Steven', 'Steven', 'Sara', 'Sara', 'John', 'John', 'Rachel', 'Rachel'],
... 'Type': ['Normal', 'Tumor', 'Normal', 'Tumor', 'Normal', 'Tumor', 'Normal', 'Tumor'],
... }}
>>> vf = pyvcf.VcfFrame.from_dict([], vcf_data)
>>> af = pyvcf.AnnFrame.from_dict(annot_data, 'Sample')
>>> vf.plot_histplot(hue='Type', af=af)
.. image:: https://raw.githubusercontent.com/sbslee/fuc-data/main/images/plot_histplot.png
>>> import matplotlib.pyplot as plt
>>> from fuc import common, pyvcf
>>> common.load_dataset('pyvcf')
>>> vf = pyvcf.VcfFrame.from_file('~/fuc-data/pyvcf/normal-tumor.vcf')
>>> af = pyvcf.AnnFrame.from_file('~/fuc-data/pyvcf/normal-tumor-annot.tsv', 'Sample')
>>> normal = af.df[af.df.Tissue == 'Normal'].index
>>> tumor = af.df[af.df.Tissue == 'Tumor'].index
>>> fig, [[ax1, ax2], [ax3, ax4]] = plt.subplots(2, 2, figsize=(10, 10))
>>> vf.plot_tmb(ax=ax1)
>>> vf.plot_tmb(ax=ax2, af=af, hue='Tissue')
>>> vf.plot_hist('DP', ax=ax3, af=af, hue='Tissue')
>>> vf.plot_regplot(normal, tumor, ax=ax4)
>>> plt.tight_layout()
.. image:: https://raw.githubusercontent.com/sbslee/fuc-data/main/images/normal-tumor.png
MAF
---
@@ -273,10 +264,12 @@
biopython
lxml
matplotlib
matplotlib-venn
numpy
pandas
pyranges
pysam
scipy
seaborn
There are various ways you can install fuc. The recommended way is via conda:
1 change: 1 addition & 0 deletions docs/requirements.txt
@@ -6,4 +6,5 @@ matplotlib-venn
numpy
pandas
pysam
scipy
seaborn
2 changes: 2 additions & 0 deletions fuc/api/common.py
@@ -73,6 +73,8 @@ def load_dataset(name, force=False):
],
'pyvcf': [
'plot_comparison.vcf',
'normal-tumor.vcf',
'normal-tumor-annot.tsv',
],
}
base_url = ('https://raw.githubusercontent.com/sbslee/fuc-data/main')
66 changes: 33 additions & 33 deletions fuc/api/pymaf.py
@@ -11,39 +11,39 @@
protein change. However, most of the analysis in pymaf uses the
following fields:
+------+------------------------+----------------------+-------------------------------+
| No.  | Name                   | Description          | Examples                      |
+======+========================+======================+===============================+
| 1    | Hugo_Symbol            | HUGO gene symbol     | 'TP53', 'Unknown'             |
+------+------------------------+----------------------+-------------------------------+
| 2    | Entrez_Gene_Id         | Entrez or Ensembl ID | 0, 8714                       |
+------+------------------------+----------------------+-------------------------------+
| 3    | Center                 | Sequencing center    | '.', 'genome.wustl.edu'       |
+------+------------------------+----------------------+-------------------------------+
| 4    | NCBI_Build             | Genome assembly      | '37', 'GRCh38'                |
+------+------------------------+----------------------+-------------------------------+
| 5    | Chromosome             | Chromosome name      | 'chr1'                        |
+------+------------------------+----------------------+-------------------------------+
| 6    | Start_Position         | Start coordinate     | 119031351                     |
+------+------------------------+----------------------+-------------------------------+
| 7    | End_Position           | End coordinate       | 44079555                      |
+------+------------------------+----------------------+-------------------------------+
| 8    | Strand                 | Genomic strand       | '+', '-'                      |
+------+------------------------+----------------------+-------------------------------+
| 9    | Variant_Classification | Translational effect | 'Missense_Mutation', 'Silent' |
+------+------------------------+----------------------+-------------------------------+
| 10   | Variant_Type           | Mutation type        | 'SNP', 'INS', 'DEL'           |
+------+------------------------+----------------------+-------------------------------+
| 11   | Reference_Allele       | Reference allele     | 'T', '-', 'ACAA'              |
+------+------------------------+----------------------+-------------------------------+
| 12   | Tumor_Seq_Allele1      | First tumor allele   | 'A', '-', 'TCA'               |
+------+------------------------+----------------------+-------------------------------+
| 13   | Tumor_Seq_Allele2      | Second tumor allele  | 'A', '-', 'TCA'               |
+------+------------------------+----------------------+-------------------------------+
| 14   | Tumor_Sample_Barcode   | Sample ID            | 'TCGA-AB-3002'                |
+------+------------------------+----------------------+-------------------------------+
| 15   | Protein_Change         | Protein change       | 'p.L558Q'                     |
+------+------------------------+----------------------+-------------------------------+
+-----+------------------------+----------------------+-------------------------------+
| No. | Name                   | Description          | Examples                      |
+=====+========================+======================+===============================+
| 1   | Hugo_Symbol            | HUGO gene symbol     | 'TP53', 'Unknown'             |
+-----+------------------------+----------------------+-------------------------------+
| 2   | Entrez_Gene_Id         | Entrez or Ensembl ID | 0, 8714                       |
+-----+------------------------+----------------------+-------------------------------+
| 3   | Center                 | Sequencing center    | '.', 'genome.wustl.edu'       |
+-----+------------------------+----------------------+-------------------------------+
| 4   | NCBI_Build             | Genome assembly      | '37', 'GRCh38'                |
+-----+------------------------+----------------------+-------------------------------+
| 5   | Chromosome             | Chromosome name      | 'chr1'                        |
+-----+------------------------+----------------------+-------------------------------+
| 6   | Start_Position         | Start coordinate     | 119031351                     |
+-----+------------------------+----------------------+-------------------------------+
| 7   | End_Position           | End coordinate       | 44079555                      |
+-----+------------------------+----------------------+-------------------------------+
| 8   | Strand                 | Genomic strand       | '+', '-'                      |
+-----+------------------------+----------------------+-------------------------------+
| 9   | Variant_Classification | Translational effect | 'Missense_Mutation', 'Silent' |
+-----+------------------------+----------------------+-------------------------------+
| 10  | Variant_Type           | Mutation type        | 'SNP', 'INS', 'DEL'           |
+-----+------------------------+----------------------+-------------------------------+
| 11  | Reference_Allele       | Reference allele     | 'T', '-', 'ACAA'              |
+-----+------------------------+----------------------+-------------------------------+
| 12  | Tumor_Seq_Allele1      | First tumor allele   | 'A', '-', 'TCA'               |
+-----+------------------------+----------------------+-------------------------------+
| 13  | Tumor_Seq_Allele2      | Second tumor allele  | 'A', '-', 'TCA'               |
+-----+------------------------+----------------------+-------------------------------+
| 14  | Tumor_Sample_Barcode   | Sample ID            | 'TCGA-AB-3002'                |
+-----+------------------------+----------------------+-------------------------------+
| 15  | Protein_Change         | Protein change       | 'p.L558Q'                     |
+-----+------------------------+----------------------+-------------------------------+
"""

import pandas as pd