Merge pull request #18 from sbslee/0.11.0-dev
0.11.0 dev
sbslee authored Jun 10, 2021
2 parents 86798f1 + 740ce86 commit 014ba33
Showing 14 changed files with 1,461 additions and 486 deletions.
10 changes: 10 additions & 0 deletions CHANGELOG.rst
@@ -1,6 +1,16 @@
Changelog
*********

0.11.0 (2021-06-10)
-------------------

* :issue:`16`: Add new method :meth:`pyvcf.VcfFrame.cfilter_empty`.
* Add new methods :meth:`pyvep.filter_af/lof`.
* Add ``matplotlib-venn`` package as dependency for plotting Venn diagrams.
* Add new methods :meth:`pyvcf.plot_comparison/regplot/histplot`.
* :issue:`17`: Add new method :meth:`pyvep.filter_biotype`.
* Add new class :class:`pyvcf.AnnFrame`.

0.10.0 (2021-06-03)
-------------------

90 changes: 82 additions & 8 deletions README.rst
@@ -56,24 +56,36 @@ Your contributions (e.g. feature ideas, pull requests) are most welcome.
CLI Examples
============

SAM/BAM/CRAM
------------

To print the header of a BAM file:

.. code-block:: console

   $ fuc bam_head example.bam

BED
---

To find intersection between BED files:

.. code-block:: console

   $ fuc bed_intxn 1.bed 2.bed 3.bed > intersect.bed
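
To illustrate what an intersection between BED files means, here is a minimal pure-Python sketch. It is not how ``fuc bed_intxn`` is implemented; the function name ``intersect_intervals`` and the example intervals are hypothetical, and intervals are assumed to be half-open ``(chrom, start, end)`` tuples as in the BED format.

```python
from functools import reduce


def intersect_intervals(a, b):
    """Intersect two lists of half-open (chrom, start, end) intervals."""
    result = []
    for chrom_a, start_a, end_a in a:
        for chrom_b, start_b, end_b in b:
            if chrom_a != chrom_b:
                continue
            start = max(start_a, start_b)
            end = min(end_a, end_b)
            # Keep only overlaps that span at least one base.
            if start < end:
                result.append((chrom_a, start, end))
    return sorted(result)


# Hypothetical contents of three BED files.
bed1 = [("chr1", 100, 200), ("chr1", 300, 400)]
bed2 = [("chr1", 150, 350), ("chr2", 100, 200)]
bed3 = [("chr1", 180, 320)]

# Fold the pairwise intersection over all inputs.
shared = reduce(intersect_intervals, [bed1, bed2, bed3])
print(shared)
```

The pairwise loop is quadratic and is meant only to show the idea; a real tool would typically rely on sorted or indexed intervals.
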
FASTQ
-----

To count sequence reads in a FASTQ file:

.. code-block:: console

   $ fuc fq_count example.fastq
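
As a rough illustration of what counting reads involves, the sketch below assumes the common four-line FASTQ layout (header, sequence, ``+``, qualities). The helper ``count_fastq_reads`` is hypothetical and is not part of fuc.

```python
import io


def count_fastq_reads(handle):
    """Count records in a FASTQ stream, assuming four lines per record."""
    n_lines = sum(1 for line in handle if line.strip())
    if n_lines % 4 != 0:
        raise ValueError("line count is not a multiple of four")
    return n_lines // 4


# Two hypothetical reads held in memory instead of a file on disk.
example = io.StringIO(
    "@read1\nACGT\n+\nIIII\n"
    "@read2\nTTGA\n+\nIIII\n"
)
n_reads = count_fastq_reads(example)
print(n_reads)
```

Multi-line FASTQ records, which the format technically allows, would need a real parser rather than this line count.
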
FUC
---

To check whether a file exists in the operating system:

.. code-block:: console
@@ -86,12 +98,18 @@ To find all VCF files within the current directory recursively:
$ fuc fuc_find . vcf
TABLE
-----

To merge two tab-delimited files:

.. code-block:: console

   $ fuc tbl_merge left.txt right.txt > merged.txt
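
The following sketch shows one way two tab-delimited tables could be merged, using only the standard library. It assumes an inner join on a single shared column; the function ``merge_tables``, the ``key`` parameter and the example data are all hypothetical, and the join type that ``fuc tbl_merge`` actually uses is not stated here.

```python
import csv
import io


def merge_tables(left, right, key):
    """Inner-join two tab-delimited streams on the column named `key`."""
    left_rows = list(csv.DictReader(left, delimiter="\t"))
    right_rows = list(csv.DictReader(right, delimiter="\t"))

    # Index the right table by key so each left row is matched quickly.
    index = {}
    for row in right_rows:
        index.setdefault(row[key], []).append(row)

    merged = []
    for row in left_rows:
        for match in index.get(row[key], []):
            merged.append({**row, **match})
    return merged


# Hypothetical contents of left.txt and right.txt.
left = io.StringIO("Name\tAge\nSara\t30\nJohn\t41\n")
right = io.StringIO("Name\tCity\nSara\tSeoul\nRachel\tBoston\n")
merged = merge_tables(left, right, key="Name")
print(merged)
```
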
VCF
---

To merge VCF files:

.. code-block:: console
@@ -101,6 +119,9 @@ To merge VCF files:
API Examples
============

VCF
---

To filter a VCF file based on a BED file:

.. code:: python3
@@ -119,11 +140,63 @@ To remove indels from a VCF file:
>>> filtered_vf = vf.filter_indel()
>>> filtered_vf.to_file('no_indels.vcf')
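
For readers unfamiliar with the term, the sketch below shows one simple way to decide whether a VCF record is an indel: an ALT allele whose length differs from REF. This is an illustration only, not the logic of ``VcfFrame.filter_indel``; the helper ``is_indel`` is hypothetical, symbolic alleles are ignored, and dropping a multiallelic record when any ALT allele is an indel is an assumption.

```python
def is_indel(ref, alt):
    """Return True if any comma-separated ALT allele differs in length from REF."""
    return any(len(allele) != len(ref) for allele in alt.split(","))


# Hypothetical records as (CHROM, POS, REF, ALT).
records = [
    ("chr1", 100, "A", "T"),      # single-nucleotide variant
    ("chr1", 101, "AT", "A"),     # deletion
    ("chr1", 102, "G", "GA,C"),   # multiallelic: insertion plus SNV
    ("chr1", 103, "CT", "GA"),    # multi-nucleotide variant, same length
]

no_indels = [r for r in records if not is_indel(r[2], r[3])]
print(no_indels)
```
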
To create a Venn diagram showing genotype concordance between groups:

.. code:: python3

    >>> from fuc import pyvcf, common
    >>> common.load_dataset('pyvcf')
    >>> f = '~/fuc-data/pyvcf/plot_comparison.vcf'
    >>> vf = pyvcf.VcfFrame.from_file(f)
    >>> a = ['Steven_A', 'John_A', 'Sara_A']
    >>> b = ['Steven_B', 'John_B', 'Sara_B']
    >>> c = ['Steven_C', 'John_C', 'Sara_C']
    >>> vf.plot_comparison(a, b, c)

.. image:: https://raw.githubusercontent.com/sbslee/fuc-data/main/images/plot_comparison.png
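
To give a sense of the numbers behind such a Venn diagram, the sketch below tallies, for one subject, which variant sites carry a non-reference genotype in each of three groups. It is a plain-Python illustration under the assumption that concordance means "called as variant in the same groups"; the helpers ``has_variant`` and ``venn_counts`` and the genotype lists are hypothetical and do not reflect how ``VcfFrame.plot_comparison`` computes its regions.

```python
from collections import Counter


def has_variant(gt):
    """True if a GT string such as '0/1' carries at least one ALT allele."""
    alleles = gt.replace("|", "/").split("/")
    return any(a not in ("0", ".") for a in alleles)


def venn_counts(calls_a, calls_b, calls_c):
    """Count sites per (in_A, in_B, in_C) membership pattern."""
    counts = Counter()
    for a, b, c in zip(calls_a, calls_b, calls_c):
        key = (has_variant(a), has_variant(b), has_variant(c))
        if any(key):  # skip sites that are reference in all three groups
            counts[key] += 1
    return counts


# Hypothetical genotypes for one subject at four sites.
steven_a = ["0/1", "0/0", "1/1", "0/1"]
steven_b = ["0/1", "0/0", "0/0", "0/1"]
steven_c = ["0/1", "0/1", "0/0", "0/0"]

counts = venn_counts(steven_a, steven_b, steven_c)
print(counts)
```

Each key corresponds to one region of a three-set Venn diagram, for example ``(True, True, False)`` for sites found in A and B but not in C.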

To create a histogram of tumor mutational burden (TMB) distribution:

.. code:: python3

    >>> from fuc import pyvcf
    >>> vcf_data = {
    ...     'CHROM': ['chr1', 'chr1', 'chr1', 'chr1', 'chr1'],
    ...     'POS': [100, 101, 102, 103, 103],
    ...     'ID': ['.', '.', '.', '.', '.'],
    ...     'REF': ['T', 'T', 'T', 'T', 'T'],
    ...     'ALT': ['C', 'C', 'C', 'C', 'C'],
    ...     'QUAL': ['.', '.', '.', '.', '.'],
    ...     'FILTER': ['.', '.', '.', '.', '.'],
    ...     'INFO': ['.', '.', '.', '.', '.'],
    ...     'FORMAT': ['GT', 'GT', 'GT', 'GT', 'GT'],
    ...     'Steven_N': ['0/0', '0/0', '0/1', '0/0', '0/0'],
    ...     'Steven_T': ['0/0', '0/1', '0/1', '0/1', '0/1'],
    ...     'Sara_N': ['0/0', '0/1', '0/0', '0/0', '0/0'],
    ...     'Sara_T': ['0/0', '0/0', '1/1', '1/1', '0/1'],
    ...     'John_N': ['0/0', '0/0', '0/0', '0/0', '0/0'],
    ...     'John_T': ['0/1', '0/0', '1/1', '1/1', '0/1'],
    ...     'Rachel_N': ['0/0', '0/0', '0/0', '0/0', '0/0'],
    ...     'Rachel_T': ['0/1', '0/1', '0/0', '0/1', '0/1'],
    ... }
    >>> annot_data = {
    ...     'Sample': ['Steven_N', 'Steven_T', 'Sara_N', 'Sara_T', 'John_N', 'John_T', 'Rachel_N', 'Rachel_T'],
    ...     'Subject': ['Steven', 'Steven', 'Sara', 'Sara', 'John', 'John', 'Rachel', 'Rachel'],
    ...     'Type': ['Normal', 'Tumor', 'Normal', 'Tumor', 'Normal', 'Tumor', 'Normal', 'Tumor'],
    ... }
    >>> vf = pyvcf.VcfFrame.from_dict([], vcf_data)
    >>> af = pyvcf.AnnFrame.from_dict(annot_data, 'Sample')
    >>> vf.plot_histplot(hue='Type', af=af)

.. image:: https://raw.githubusercontent.com/sbslee/fuc-data/main/images/plot_histplot.png
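
The per-sample counts that such a histogram summarizes can be reproduced by hand from the genotype columns above. The sketch below assumes that TMB is simply the number of records with at least one ALT allele in a sample (not normalized per megabase); whether ``VcfFrame.plot_histplot`` uses exactly this definition is an assumption, and the helper ``count_variants`` is hypothetical.

```python
# Genotype columns copied from the vcf_data example above.
genotypes = {
    "Steven_N": ["0/0", "0/0", "0/1", "0/0", "0/0"],
    "Steven_T": ["0/0", "0/1", "0/1", "0/1", "0/1"],
    "Sara_N": ["0/0", "0/1", "0/0", "0/0", "0/0"],
    "Sara_T": ["0/0", "0/0", "1/1", "1/1", "0/1"],
    "John_N": ["0/0", "0/0", "0/0", "0/0", "0/0"],
    "John_T": ["0/1", "0/0", "1/1", "1/1", "0/1"],
    "Rachel_N": ["0/0", "0/0", "0/0", "0/0", "0/0"],
    "Rachel_T": ["0/1", "0/1", "0/0", "0/1", "0/1"],
}

# Sample type taken from the annot_data example ('_N' = Normal, '_T' = Tumor).
sample_type = {
    name: "Tumor" if name.endswith("_T") else "Normal" for name in genotypes
}


def count_variants(calls):
    """Count genotypes that carry at least one non-reference allele."""
    return sum(
        1 for gt in calls if any(a not in ("0", ".") for a in gt.split("/"))
    )


tmb = {sample: count_variants(calls) for sample, calls in genotypes.items()}

# Group the per-sample counts by type, mirroring hue='Type'.
by_type = {}
for sample, n in tmb.items():
    by_type.setdefault(sample_type[sample], []).append(n)

print(by_type)
```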

MAF
---

To create an oncoplot with a MAF file:

.. code:: python3
>>> import matplotlib.pyplot as plt
>>> from fuc import common, pymaf
>>> common.load_dataset('tcga-laml')
>>> f = '~/fuc-data/tcga-laml/tcga_laml.maf.gz'
@@ -140,7 +213,6 @@ To create a summary figure for a MAF file:

.. code:: python3
>>> import matplotlib.pyplot as plt
>>> from fuc import common, pymaf
>>> common.load_dataset('tcga-laml')
>>> f = '~/fuc-data/tcga-laml/tcga_laml.maf.gz'
@@ -149,11 +221,13 @@ To create a summary figure for a MAF file:
.. image:: https://raw.githubusercontent.com/sbslee/fuc-data/main/images/maf_summary.png

SAM/BAM/CRAM
------------

To create read depth profile of a region from a CRAM file:

.. code:: python3
>>> import matplotlib.pyplot as plt
>>> from fuc import pycov
>>> cf = pycov.CovFrame.from_file('HG00525.final.cram', zero=True,
... region='chr12:21161194-21239796', names=['HG00525'])
@@ -197,7 +271,7 @@ Finally, you can clone the GitHub repository and then install fuc this way:
$ cd fuc
$ pip install .
The nice thing about this approach is that you will have access to development versions that are not available in Anaconda or PyPI. For example, you can access a development branch with the ``git checkout`` command.
The nice thing about this approach is that you will have access to development versions that are not available in Anaconda or PyPI. For example, you can access a development branch with the ``git checkout`` command. When you do this, please make sure your environment already has all the dependencies installed.

Getting Help
============
@@ -251,10 +325,10 @@ Below is the list of submodules available in API:
- **pybed** : The pybed submodule is designed for working with BED files. It implements ``pybed.BedFrame`` which stores BED data as ``pandas.DataFrame`` via the `pyranges <https://github.com/biocore-ntnu/pyranges>`_ package to allow fast computation and easy manipulation. The submodule strictly adheres to the standard `BED specification <https://genome.ucsc.edu/FAQ/FAQformat.html>`_.
- **pycov** : The pycov submodule is designed for working with depth of coverage data from sequence alignment files (SAM/BAM/CRAM). It implements ``pycov.CovFrame`` which stores read depth data as ``pandas.DataFrame`` via the `pysam <https://pysam.readthedocs.io/en/latest/api.html>`_ package to allow fast computation and easy manipulation.
- **pyfq** : The pyfq submodule is designed for working with FASTQ files. It implements ``pyfq.FqFrame`` which stores FASTQ data as ``pandas.DataFrame`` to allow fast computation and easy manipulation.
- **pymaf** : The pymaf submodule is designed for working with MAF files. It implements ``pymaf.MafFrame`` which stores MAF data as ``pandas.DataFrame`` to allow fast computation and easy manipulation. The class also contains many useful plotting methods such as ``MafFrame.plot_varcls`` and ``MafFrame.plot_waterfall``. The submodule strictly adheres to the standard `MAF specification <https://docs.gdc.cancer.gov/Data/File_Formats/MAF_Format/>`_.
- **pysnpeff** : The pysnpeff submodule is designed for parsing VCF annotation data from the `SnpEff <https://pcingola.github.io/SnpEff/>`_ program. It is designed to be used with ``pyvcf.VcfFrame``.
- **pyvcf** : The pyvcf submodule is designed for working with VCF files. It implements ``pyvcf.VcfFrame`` class which stores VCF data as ``pandas.DataFrame`` to allow fast computation and easy manipulation. The submodule strictly adheres to the standard `VCF specification <https://samtools.github.io/hts-specs/VCFv4.3.pdf>`_.
- **pyvep** : The pyvep submodule is designed for parsing VCF annotation data from the `Ensembl VEP <https://asia.ensembl.org/info/docs/tools/vep/index.html>`_. It is designed to be used with ``pyvcf.VcfFrame``.
- **pymaf** : The pymaf submodule is designed for working with MAF files. It implements ``pymaf.MafFrame`` which stores MAF data as ``pandas.DataFrame`` to allow fast computation and easy manipulation. The ``pymaf.MafFrame`` class also contains many useful plotting methods such as ``MafFrame.plot_oncoplot`` and ``MafFrame.plot_summary``. The submodule strictly adheres to the standard `MAF specification <https://docs.gdc.cancer.gov/Data/File_Formats/MAF_Format/>`_.
- **pysnpeff** : The pysnpeff submodule is designed for parsing VCF annotation data from the `SnpEff <https://pcingola.github.io/SnpEff/>`_ program. It should be used with ``pyvcf.VcfFrame``.
- **pyvcf** : The pyvcf submodule is designed for working with VCF files. It implements ``pyvcf.VcfFrame`` which stores VCF data as ``pandas.DataFrame`` to allow fast computation and easy manipulation. The ``pyvcf.VcfFrame`` class also contains many useful plotting methods such as ``VcfFrame.plot_comparison`` and ``VcfFrame.plot_regplot``. The submodule strictly adheres to the standard `VCF specification <https://samtools.github.io/hts-specs/VCFv4.3.pdf>`_.
- **pyvep** : The pyvep submodule is designed for parsing VCF annotation data from the `Ensembl VEP <https://asia.ensembl.org/info/docs/tools/vep/index.html>`_ program. It should be used with ``pyvcf.VcfFrame``.

For getting help on a specific module (e.g. pyvcf):

1 change: 1 addition & 0 deletions conda.yml
@@ -8,6 +8,7 @@ dependencies:
- cython
- lxml
- matplotlib
- matplotlib-venn
- notebook
- numpy
- pandas
8 changes: 4 additions & 4 deletions docs/api.rst
@@ -16,10 +16,10 @@ Below is the list of submodules available in API:
- **pybed** : The pybed submodule is designed for working with BED files. It implements ``pybed.BedFrame`` which stores BED data as ``pandas.DataFrame`` via the `pyranges <https://github.com/biocore-ntnu/pyranges>`_ package to allow fast computation and easy manipulation. The submodule strictly adheres to the standard `BED specification <https://genome.ucsc.edu/FAQ/FAQformat.html>`_.
- **pycov** : The pycov submodule is designed for working with depth of coverage data from sequence alignment files (SAM/BAM/CRAM). It implements ``pycov.CovFrame`` which stores read depth data as ``pandas.DataFrame`` via the `pysam <https://pysam.readthedocs.io/en/latest/api.html>`_ package to allow fast computation and easy manipulation.
- **pyfq** : The pyfq submodule is designed for working with FASTQ files. It implements ``pyfq.FqFrame`` which stores FASTQ data as ``pandas.DataFrame`` to allow fast computation and easy manipulation.
- **pymaf** : The pymaf submodule is designed for working with MAF files. It implements ``pymaf.MafFrame`` which stores MAF data as ``pandas.DataFrame`` to allow fast computation and easy manipulation. The class also contains many useful plotting methods such as ``MafFrame.plot_varcls`` and ``MafFrame.plot_waterfall``. The submodule strictly adheres to the standard `MAF specification <https://docs.gdc.cancer.gov/Data/File_Formats/MAF_Format/>`_.
- **pysnpeff** : The pysnpeff submodule is designed for parsing VCF annotation data from the `SnpEff <https://pcingola.github.io/SnpEff/>`_ program. It is designed to be used with ``pyvcf.VcfFrame``.
- **pyvcf** : The pyvcf submodule is designed for working with VCF files. It implements ``pyvcf.VcfFrame`` class which stores VCF data as ``pandas.DataFrame`` to allow fast computation and easy manipulation. The submodule strictly adheres to the standard `VCF specification <https://samtools.github.io/hts-specs/VCFv4.3.pdf>`_.
- **pyvep** : The pyvep submodule is designed for parsing VCF annotation data from the `Ensembl VEP <https://asia.ensembl.org/info/docs/tools/vep/index.html>`_. It is designed to be used with ``pyvcf.VcfFrame``.
- **pymaf** : The pymaf submodule is designed for working with MAF files. It implements ``pymaf.MafFrame`` which stores MAF data as ``pandas.DataFrame`` to allow fast computation and easy manipulation. The ``pymaf.MafFrame`` class also contains many useful plotting methods such as ``MafFrame.plot_oncoplot`` and ``MafFrame.plot_summary``. The submodule strictly adheres to the standard `MAF specification <https://docs.gdc.cancer.gov/Data/File_Formats/MAF_Format/>`_.
- **pysnpeff** : The pysnpeff submodule is designed for parsing VCF annotation data from the `SnpEff <https://pcingola.github.io/SnpEff/>`_ program. It should be used with ``pyvcf.VcfFrame``.
- **pyvcf** : The pyvcf submodule is designed for working with VCF files. It implements ``pyvcf.VcfFrame`` which stores VCF data as ``pandas.DataFrame`` to allow fast computation and easy manipulation. The ``pyvcf.VcfFrame`` class also contains many useful plotting methods such as ``VcfFrame.plot_comparison`` and ``VcfFrame.plot_regplot``. The submodule strictly adheres to the standard `VCF specification <https://samtools.github.io/hts-specs/VCFv4.3.pdf>`_.
- **pyvep** : The pyvep submodule is designed for parsing VCF annotation data from the `Ensembl VEP <https://asia.ensembl.org/info/docs/tools/vep/index.html>`_ program. It should be used with ``pyvcf.VcfFrame``.

For getting help on a specific module (e.g. pyvcf):

2 changes: 2 additions & 0 deletions docs/conf.py
@@ -49,6 +49,8 @@

issues_github_path = 'sbslee/fuc'

napoleon_use_param = False

# Include the example source for plots in API docs
plot_include_source = True
plot_formats = [('png', 90)]
82 changes: 78 additions & 4 deletions docs/create.py
@@ -84,24 +84,36 @@
CLI Examples
============
SAM/BAM/CRAM
------------
To print the header of a BAM file:
.. code-block:: console
$ fuc bam_head example.bam
BED
---
To find intersection between BED files:
.. code-block:: console
$ fuc bed_intxn 1.bed 2.bed 3.bed > intersect.bed
FASTQ
-----
To count sequence reads in a FASTQ file:
.. code-block:: console
$ fuc fq_count example.fastq
FUC
---
To check whether a file exists in the operating system:
.. code-block:: console
@@ -114,12 +126,18 @@
$ fuc fuc_find . vcf
TABLE
-----
To merge two tab-delimited files:
.. code-block:: console
$ fuc tbl_merge left.txt right.txt > merged.txt
VCF
---
To merge VCF files:
.. code-block:: console
@@ -129,6 +147,9 @@
API Examples
============
VCF
---
To filter a VCF file based on a BED file:
.. code:: python3
@@ -147,11 +168,63 @@
>>> filtered_vf = vf.filter_indel()
>>> filtered_vf.to_file('no_indels.vcf')
To create a Venn diagram showing genotype concordance between groups:
.. code:: python3
>>> from fuc import pyvcf, common
>>> common.load_dataset('pyvcf')
>>> f = '~/fuc-data/pyvcf/plot_comparison.vcf'
>>> vf = pyvcf.VcfFrame.from_file(f)
>>> a = ['Steven_A', 'John_A', 'Sara_A']
>>> b = ['Steven_B', 'John_B', 'Sara_B']
>>> c = ['Steven_C', 'John_C', 'Sara_C']
>>> vf.plot_comparison(a, b, c)
.. image:: https://raw.githubusercontent.com/sbslee/fuc-data/main/images/plot_comparison.png
To create a histogram of tumor mutational burden (TMB) distribution:
.. code:: python3
>>> from fuc import pyvcf
>>> vcf_data = {{
... 'CHROM': ['chr1', 'chr1', 'chr1', 'chr1', 'chr1'],
... 'POS': [100, 101, 102, 103, 103],
... 'ID': ['.', '.', '.', '.', '.'],
... 'REF': ['T', 'T', 'T', 'T', 'T'],
... 'ALT': ['C', 'C', 'C', 'C', 'C'],
... 'QUAL': ['.', '.', '.', '.', '.'],
... 'FILTER': ['.', '.', '.', '.', '.'],
... 'INFO': ['.', '.', '.', '.', '.'],
... 'FORMAT': ['GT', 'GT', 'GT', 'GT', 'GT'],
... 'Steven_N': ['0/0', '0/0', '0/1', '0/0', '0/0'],
... 'Steven_T': ['0/0', '0/1', '0/1', '0/1', '0/1'],
... 'Sara_N': ['0/0', '0/1', '0/0', '0/0', '0/0'],
... 'Sara_T': ['0/0', '0/0', '1/1', '1/1', '0/1'],
... 'John_N': ['0/0', '0/0', '0/0', '0/0', '0/0'],
... 'John_T': ['0/1', '0/0', '1/1', '1/1', '0/1'],
... 'Rachel_N': ['0/0', '0/0', '0/0', '0/0', '0/0'],
... 'Rachel_T': ['0/1', '0/1', '0/0', '0/1', '0/1'],
... }}
>>> annot_data = {{
... 'Sample': ['Steven_N', 'Steven_T', 'Sara_N', 'Sara_T', 'John_N', 'John_T', 'Rachel_N', 'Rachel_T'],
... 'Subject': ['Steven', 'Steven', 'Sara', 'Sara', 'John', 'John', 'Rachel', 'Rachel'],
... 'Type': ['Normal', 'Tumor', 'Normal', 'Tumor', 'Normal', 'Tumor', 'Normal', 'Tumor'],
... }}
>>> vf = pyvcf.VcfFrame.from_dict([], vcf_data)
>>> af = pyvcf.AnnFrame.from_dict(annot_data, 'Sample')
>>> vf.plot_histplot(hue='Type', af=af)
.. image:: https://raw.githubusercontent.com/sbslee/fuc-data/main/images/plot_histplot.png
MAF
---
To create an oncoplot with a MAF file:
.. code:: python3
>>> import matplotlib.pyplot as plt
>>> from fuc import common, pymaf
>>> common.load_dataset('tcga-laml')
>>> f = '~/fuc-data/tcga-laml/tcga_laml.maf.gz'
@@ -168,7 +241,6 @@
.. code:: python3
>>> import matplotlib.pyplot as plt
>>> from fuc import common, pymaf
>>> common.load_dataset('tcga-laml')
>>> f = '~/fuc-data/tcga-laml/tcga_laml.maf.gz'
@@ -177,11 +249,13 @@
.. image:: https://raw.githubusercontent.com/sbslee/fuc-data/main/images/maf_summary.png
SAM/BAM/CRAM
------------
To create read depth profile of a region from a CRAM file:
.. code:: python3
>>> import matplotlib.pyplot as plt
>>> from fuc import pycov
>>> cf = pycov.CovFrame.from_file('HG00525.final.cram', zero=True,
... region='chr12:21161194-21239796', names=['HG00525'])
@@ -225,7 +299,7 @@
$ cd fuc
$ pip install .
The nice thing about this approach is that you will have access to development versions that are not available in Anaconda or PyPI. For example, you can access a development branch with the ``git checkout`` command.
The nice thing about this approach is that you will have access to development versions that are not available in Anaconda or PyPI. For example, you can access a development branch with the ``git checkout`` command. When you do this, please make sure your environment already has all the dependencies installed.
Getting Help
============
1 change: 1 addition & 0 deletions docs/requirements.txt
@@ -2,6 +2,7 @@ sphinx_rtd_theme
sphinx_issues
autodocsumm
matplotlib
matplotlib-venn
numpy
pandas
pysam