Comparing changes

base repository: jdoughertyii/PyVCF
base: master
head repository: jamescasbon/PyVCF
compare: master
Can’t automatically merge. Don’t worry, you can still create the pull request.

Commits on Oct 30, 2011

  1. added parser

    James Casbon committed Oct 30, 2011
    7b6fefb
  2. minimal setup to get into pythonpath

    James Casbon committed Oct 30, 2011
    35cfa6d
  3. vcf_melt for turning wide tsv into long tsv

    James Casbon committed Oct 30, 2011
    e149aec

Commits on Nov 2, 2011

  1. convienience methods for changing a record

    James Casbon committed Nov 2, 2011
    4ad5983

Commits on Jan 12, 2012

  1. Merge remote-tracking branch 'upstream/master'

    James Casbon committed Jan 12, 2012
    c0894c7
  2. add meta to setup.py, note about fork to README

    James Casbon committed Jan 12, 2012
    60625ba

Commits on Jan 16, 2012

  1. Tolerate negative Number values in INFO and FORMAT fields

    Negative values for the Number entry in INFO and FORMAT fields are not allowed
    by the VCF spec, but we see them in practice (samtools uses -1 in the PL
    format). We silently convert them to None or '.'.
    martijnvermaat authored and James Casbon committed Jan 16, 2012
    ae27992
  2. include really basic tests

    James Casbon committed Jan 16, 2012
    dcbb72c
  3. use ordered dict for samples, fixes #2

    James Casbon committed Jan 16, 2012
    bc8c85e
  4. remove redundant code now using ordereddict

    James Casbon committed Jan 16, 2012
    3e559e7
  5. cde3e75
  6. run doctest, fixes #5

    James Casbon committed Jan 16, 2012
    b60d8fc
  7. update README.rst, add HISTORY, fix doctests

    James Casbon committed Jan 16, 2012
    4e24fd0
  8. add sample object

    James Casbon committed Jan 16, 2012
    4ebe037
  9. update doctest for sample objects

    James Casbon committed Jan 16, 2012
    544ba1e
  10. include VCFWriter class and test cases

    James Casbon committed Jan 16, 2012
    f94ed1e
  11. change vcf.VCFReader to vcf.Reader

    James Casbon committed Jan 16, 2012
    98e58f1
  12. change vcf.VCFReader to vcf.Reader (docs)

    James Casbon committed Jan 16, 2012
    9ae8aff

Commits on Jan 17, 2012

  1. 2d68a32
  2. support opening via filename and gzipped data

    James Casbon committed Jan 17, 2012
    c2b5c4b
  3. Support fetching from tabix files, fixes #7

    James Casbon committed Jan 17, 2012
    03faeab
  4. REST syntax fix

    James Casbon committed Jan 17, 2012
    181e237

Commits on Jan 18, 2012

  1. test for 1kg files, allow QUAL=.

    James Casbon committed Jan 18, 2012
    ba52427
  2. ebf9449
  3. speed optimisations, drop properties for direct access

    James Casbon committed Jan 18, 2012
    d6b95c8

Commits on Jan 19, 2012

  1. Remove ordereddict for performance, replace with genotype method for lookup of genotypes

    James Casbon committed Jan 19, 2012
    614b5ba

Commits on Jan 20, 2012

  1. git ignore

    James Casbon committed Jan 20, 2012
    f9150ed
  2. writer now correctly writes metadata

    James Casbon committed Jan 20, 2012
    65e2b66
  3. typo in HISTORY

    James Casbon committed Jan 20, 2012
    0ec7e3e
  4. initial work on an extensible VCF filter

    James Casbon committed Jan 20, 2012
    7f7b7bd
  5. add filter lines, but writer needs improvement

    James Casbon committed Jan 20, 2012
    fb49dd4
  6. add vcf_filter.py script

    James Casbon committed Jan 20, 2012
    11912cf
  7. argparse for python < 2.7

    James Casbon committed Jan 20, 2012
    b59b6f7
  8. vcf filter now inserts into meta, applies ok

    James Casbon committed Jan 20, 2012
    23611c2
  9. add TODOs

    James Casbon committed Jan 20, 2012
    7883ab4
  10. add FILTERS.md

    James Casbon committed Jan 20, 2012
    9fcf4f6
  11. version bump

    James Casbon committed Jan 20, 2012
    c0c7d16
  12. update README

    James Casbon committed Jan 20, 2012
    b129011
  13. 6e646f1
  14. a2416e1
  15. 7b20929
  16. 0b75b28
  17. c8753d1
  18. c971410

Commits on Jan 21, 2012

  1. HISTORY changes

    James Casbon committed Jan 21, 2012
    1ed5d65
  2. merge v0.1

    James Casbon committed Jan 21, 2012
    6615493
  3. 26d4d2d
  4. warning about aggro

    James Casbon committed Jan 21, 2012
    6e1b436
  5. backwards compatible call data lookup

    James Casbon committed Jan 21, 2012
    f95c106
  6. 0a67ab6
Showing with 7,380 additions and 485 deletions.
  1. +13 −0 .gitignore
  2. +18 −0 .travis.yml
  3. +28 −0 LICENSE
  4. +1 −0 MANIFEST.in
  5. +123 −30 README.rst
  6. +56 −0 docs/API.rst
  7. +158 −0 docs/FILTERS.rst
  8. +199 −0 docs/HISTORY.rst
  9. +4 −0 docs/INTRO.rst
  10. +130 −0 docs/Makefile
  11. +217 −0 docs/conf.py
  12. +22 −0 docs/index.rst
  13. +3 −0 requirements/common-requirements.txt
  14. +1 −0 requirements/pypy-requirements.txt
  15. +168 −0 scripts/vcf_filter.py
  16. +48 −0 scripts/vcf_melt
  17. +39 −0 scripts/vcf_sample_filter.py
  18. +80 −0 setup.py
  19. +21 −0 tox.ini
  20. +0 −455 vcf.py
  21. +15 −0 vcf/__init__.py
  22. +95 −0 vcf/cparse.pyx
  23. +209 −0 vcf/filters.py
  24. +701 −0 vcf/model.py
  25. +784 −0 vcf/parser.py
  26. +115 −0 vcf/sample_filter.py
  27. +200 −0 vcf/test/1kg.sites.vcf
  28. BIN vcf/test/1kg.vcf.gz
  29. +50 −0 vcf/test/FT.vcf
  30. +3 −0 vcf/test/README.md
  31. 0 vcf/test/__init__.py
  32. +14 −0 vcf/test/bad-info-character.vcf
  33. +779 −0 vcf/test/bcftools.vcf
  34. +5 −0 vcf/test/contig_idonly.vcf
  35. +24 −0 vcf/test/example-4.0.vcf
  36. +34 −0 vcf/test/example-4.1-bnd.vcf
  37. +7 −0 vcf/test/example-4.1-info-multiple-values.vcf
  38. +21 −0 vcf/test/example-4.1-ploidy.vcf
  39. +35 −0 vcf/test/example-4.1-sv.vcf
  40. +24 −0 vcf/test/example-4.1.vcf
  41. +56 −0 vcf/test/example-4.2.vcf
  42. +159 −0 vcf/test/freebayes.vcf
  43. +156 −0 vcf/test/gatk.vcf
  44. +4 −0 vcf/test/gatk_26_meta.vcf
  45. +120 −0 vcf/test/gonl.chr20.release4.gtc.vcf
  46. +8 −0 vcf/test/info-type-character.vcf
  47. +35 −0 vcf/test/issue-140-file1.vcf
  48. +34 −0 vcf/test/issue-140-file2.vcf
  49. +25 −0 vcf/test/issue-140-file3.vcf
  50. +21 −0 vcf/test/issue-16.vcf
  51. BIN vcf/test/issue-201.vcf.gz
  52. BIN vcf/test/issue-201.vcf.gz.tbi
  53. +32 −0 vcf/test/issue-214.vcf
  54. +9 −0 vcf/test/issue-254.vcf
  55. +34 −0 vcf/test/issue_49.vcf
  56. +56 −0 vcf/test/metadata-whitespace.vcf
  57. +24 −0 vcf/test/mixed-filtering.vcf
  58. +119 −0 vcf/test/null_genotype_mono.vcf
  59. +6 −0 vcf/test/parse-meta-line.vcf
  60. +33 −0 vcf/test/prof.py
  61. +10 −0 vcf/test/samples-space.vcf
  62. +34 −0 vcf/test/samtools.vcf
  63. +57 −0 vcf/test/strelka.vcf
  64. +8 −0 vcf/test/string_as_flag.vcf
  65. BIN vcf/test/tb.vcf.gz
  66. BIN vcf/test/tb.vcf.gz.tbi
  67. +1,755 −0 vcf/test/test_vcf.py
  68. +8 −0 vcf/test/uncalled_genotypes.vcf
  69. +25 −0 vcf/test/walk_left.vcf
  70. +22 −0 vcf/test/walk_refcall.vcf
  71. +86 −0 vcf/utils.py
13 changes: 13 additions & 0 deletions .gitignore
@@ -0,0 +1,13 @@
PyVCF.egg-info
build
dist
*.pyc
docs/_build
.ropeproject
1kg.prof
.noseids
.tox
.DS_Store
vcf/cparse.c
vcf/cparse.so
.coverage
18 changes: 18 additions & 0 deletions .travis.yml
@@ -0,0 +1,18 @@
# Validate this file using http://lint.travis-ci.org/
language: python
sudo: false
cache:
directories:
- $HOME/.cache/pip
python:
- "2.7"
- "3.4"
- "3.5"
- "3.6"
- "nightly"
- "pypy"
- "pypy3"
install:
- if [[ "$TRAVIS_PYTHON_VERSION" =~ ^pypy ]]; then pip install -r requirements/pypy-requirements.txt; else pip install -r requirements/common-requirements.txt; fi
- python setup.py install
script: python setup.py test
28 changes: 28 additions & 0 deletions LICENSE
@@ -1,3 +1,31 @@
Copyright (c) 2011-2012, Population Genetics Technologies Ltd, All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

1. Redistributions of source code must retain the above copyright notice, this
list of conditions and the following disclaimer.

2. Redistributions in binary form must reproduce the above copyright notice, this
list of conditions and the following disclaimer in the documentation and/or
other materials provided with the distribution.

3. Neither the name of the Population Genetics Technologies Ltd nor the names of
its contributors may be used to endorse or promote products derived from this
software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.


Copyright (c) 2011 John Dougherty

Permission is hereby granted, free of charge, to any person obtaining a copy of
1 change: 1 addition & 0 deletions MANIFEST.in
@@ -0,0 +1 @@
recursive-include vcf *.pyx
153 changes: 123 additions & 30 deletions README.rst
@@ -1,4 +1,6 @@
A VCFv4.0 parser for Python.
A VCFv4.0 and 4.1 parser for Python.

The online version of the PyVCF documentation is available at http://pyvcf.rtfd.org/

The intent of this module is to mimic the ``csv`` module in the Python stdlib,
as opposed to more flexible serialization formats like JSON or YAML. ``vcf``
@@ -8,22 +10,22 @@ specified in the meta-information lines -- specifically the ##INFO and
against the reserved types mentioned in the spec. Failing that, it will just
return strings.

There is currently one piece of interface: ``VCFReader``. It takes a file-like
The main interface is the ``Reader`` class. It takes a file-like
object and acts as a reader::

>>> import vcf
>>> vcf_reader = vcf.VCFReader(open('example.vcf', 'rb'))
>>> vcf_reader = vcf.Reader(open('vcf/test/example-4.0.vcf', 'r'))
>>> for record in vcf_reader:
... print record
Record(CHROM='20', POS=14370, ID='rs6054257', REF='G', ALT=['A'], QUAL=29,
FILTER='PASS', INFO={'H2': True, 'NS': 3, 'DB': True, 'DP': 14, 'AF': [0.5]
}, FORMAT='GT:GQ:DP:HQ', samples=[{'GT': '0', 'HQ': [58, 50], 'DP': 3, 'GQ'
: 49, 'name': 'NA00001'}, {'GT': '0', 'HQ': [65, 3], 'DP': 5, 'GQ': 3, 'nam
e' : 'NA00002'}, {'GT': '0', 'DP': 3, 'GQ': 41, 'name': 'NA00003'}])
Record(CHROM=20, POS=14370, REF=G, ALT=[A])
Record(CHROM=20, POS=17330, REF=T, ALT=[A])
Record(CHROM=20, POS=1110696, REF=A, ALT=[G, T])
Record(CHROM=20, POS=1230237, REF=T, ALT=[None])
Record(CHROM=20, POS=1234567, REF=GTCT, ALT=[G, GTACT])


This produces a great deal of information, but it is conveniently accessed.
The attributes of a Record are the 8 fixed fields from the VCF spec plus two
more. That is:
The attributes of a Record are the 8 fixed fields from the VCF spec:

* ``Record.CHROM``
* ``Record.POS``
@@ -34,55 +36,146 @@ more. That is:
* ``Record.FILTER``
* ``Record.INFO``

plus two more attributes to handle genotype information:
plus attributes to handle genotype information:

* ``Record.FORMAT``
* ``Record.samples``
* ``Record.genotype``

``samples``, not being the title of any column, is left lowercase. The format
``samples`` and ``genotype``, not being the title of any column, are left lowercase. The format
of the fixed fields is from the spec. Comma-separated lists in the VCF are
converted to lists. In particular, one-entry VCF lists are converted to
one-entry Python lists (see, e.g., ``Record.ALT``). Semicolon-delimited lists
of key=value pairs are converted to Python dictionaries, with flags being given
a ``True`` value. Integers and floats are handled exactly as you'd expect::

>>> record = vcf_reader.next()
>>> vcf_reader = vcf.Reader(open('vcf/test/example-4.0.vcf', 'r'))
>>> record = next(vcf_reader)
>>> print record.POS
17330
14370
>>> print record.ALT
['A']
[A]
>>> print record.INFO['AF']
[0.017]
[0.5]
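The conversion rules described above can be illustrated with a small, stdlib-only sketch. This is not PyVCF's parser: ``parse_info`` and ``_coerce`` are hypothetical helpers, and they guess each value's type from its text, whereas PyVCF relies on the types declared in the ``##INFO`` header lines. Because the sketch has no header to consult, every valued key is returned as a list.

```python
def _coerce(text):
    """Best-effort conversion of one value: try int, then float, else keep str."""
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            pass
    return text


def parse_info(info_field):
    """Parse a semicolon-delimited INFO string into a dict (illustration only).

    Flags (keys without '=') map to True; comma-separated values become lists,
    including one-entry lists for single values.
    """
    info = {}
    for entry in info_field.split(';'):
        if not entry:
            continue
        if '=' not in entry:
            info[entry] = True          # a flag such as DB or H2
            continue
        key, value = entry.split('=', 1)
        info[key] = [_coerce(v) for v in value.split(',')]
    return info


info = parse_info('NS=3;DP=14;AF=0.5;DB;H2')
print(info['AF'])
print(info['DB'])
```

With header information available, a real parser can return scalars for fields declared with ``Number=1`` instead of one-entry lists.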

There are a number of convenience methods and properties for each ``Record`` allowing you to
examine properties of interest::

>>> print record.num_called, record.call_rate, record.num_unknown
3 1.0 0
>>> print record.num_hom_ref, record.num_het, record.num_hom_alt
1 1 1
>>> print record.nucl_diversity, record.aaf, record.heterozygosity
0.6 [0.5] 0.5
>>> print record.get_hets()
[Call(sample=NA00002, CallData(GT=1|0, GQ=48, DP=8, HQ=[51, 51]))]
>>> print record.is_snp, record.is_indel, record.is_transition, record.is_deletion
True False True False
>>> print record.var_type, record.var_subtype
snp ts
>>> print record.is_monomorphic
False
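To make the numbers above less opaque, here is a hedged sketch of how such site-level statistics could be derived from genotype strings. ``summarize_genotypes`` is a hypothetical function, not part of PyVCF; it assumes a biallelic site and uses 2pq for heterozygosity with an n/(n-1) correction for nucleotide diversity, which reproduces the values shown for three genotypes with one homozygous reference, one heterozygous and one homozygous alternate call.

```python
def summarize_genotypes(genotypes):
    """Summarise GT strings such as '0|0', '1/1' or './.' for one biallelic site."""
    called = []
    for gt in genotypes:
        alleles = gt.replace('|', '/').split('/')
        if '.' not in alleles:              # skip uncalled genotypes
            called.append(alleles)

    num_chrom = float(sum(len(a) for a in called))
    if num_chrom < 2:
        raise ValueError('need at least two called alleles')

    num_hom_ref = sum(1 for a in called if set(a) == {'0'})
    num_het = sum(1 for a in called if len(set(a)) > 1)
    num_hom_alt = len(called) - num_hom_ref - num_het

    # Alternate allele frequency: non-reference alleles over all called alleles.
    num_alt = sum(1 for a in called for allele in a if allele != '0')
    aaf = num_alt / num_chrom
    heterozygosity = 2.0 * aaf * (1.0 - aaf)
    nucl_diversity = num_chrom / (num_chrom - 1.0) * heterozygosity

    return {
        'num_called': len(called),
        'call_rate': len(called) / float(len(genotypes)),
        'num_hom_ref': num_hom_ref,
        'num_het': num_het,
        'num_hom_alt': num_hom_alt,
        'aaf': aaf,
        'heterozygosity': heterozygosity,
        'nucl_diversity': nucl_diversity,
    }


# Genotypes chosen to match the counts printed above (an assumption).
stats = summarize_genotypes(['0|0', '1|0', '1/1'])
print(stats['num_called'], stats['call_rate'], stats['aaf'])
```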

``record.FORMAT`` will be a string specifying the format of the genotype
fields. In case the FORMAT column does not exist, ``record.FORMAT`` is
``None``. Finally, ``record.samples`` is a list of dictionaries containing the
parsed sample column::
parsed sample column and ``record.genotype`` is a way of looking up genotypes
by sample name::

>>> record = vcf_reader.next()
>>> record = next(vcf_reader)
>>> for sample in record.samples:
... print sample['GT']
'1|2'
'2|1'
'2/2'
0|0
0|1
0/0
>>> print record.genotype('NA00001')['GT']
0|0
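One plausible way to offer both positional access and lookup by sample name is to keep the calls in a plain list and maintain a name-to-index dictionary next to it. The ``SampleLookup`` class below is a hypothetical illustration of that idea, not PyVCF's implementation; the sample names and genotypes are taken from the example output above.

```python
class SampleLookup(object):
    """Hold per-sample call data in a list, with lookup by sample name."""

    def __init__(self, sample_names, calls):
        if len(sample_names) != len(calls):
            raise ValueError('one call is required per sample name')
        self.samples = list(calls)
        # Map each sample name to its position in the list.
        self._index = dict((name, i) for i, name in enumerate(sample_names))

    def genotype(self, name):
        """Return the call data for the named sample."""
        return self.samples[self._index[name]]


lookup = SampleLookup(
    ['NA00001', 'NA00002', 'NA00003'],
    [{'GT': '0|0'}, {'GT': '0|1'}, {'GT': '0/0'}],
)
print(lookup.genotype('NA00001')['GT'])
```

Iterating over ``lookup.samples`` preserves column order, while ``genotype`` avoids a linear scan when only one sample is needed.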

The genotypes are represented by ``Call`` objects, which have three attributes: the
corresponding Record ``site``, the sample name in ``sample`` and a dictionary of
call data in ``data``::

>>> call = record.genotype('NA00001')
>>> print call.site
Record(CHROM=20, POS=17330, REF=T, ALT=[A])
>>> print call.sample
NA00001
>>> print call.data
CallData(GT=0|0, GQ=49, DP=3, HQ=[58, 50])

Please note that as of release 0.4.0, attributes known to have single values (such as
``DP`` and ``GQ`` above) are returned as values. Other attributes are returned
as lists (such as ``HQ`` above).

``Call`` objects also expose a number of convenience attributes::

>>> print call.called, call.gt_type, call.gt_bases, call.phased
True 0 T|T True
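The values printed above can be read off the ``GT`` string directly. The sketch below uses a hypothetical ``describe_call`` function (not PyVCF's API) and assumes the coding suggested by the output: ``gt_type`` is 0 for homozygous reference, 1 for heterozygous, 2 for homozygous alternate, and ``None`` for an uncalled genotype.

```python
def describe_call(gt, ref, alts):
    """Interpret a GT string such as '0|0' against REF and the ALT alleles."""
    phased = '|' in gt
    alleles = gt.replace('|', '/').split('/')
    if '.' in alleles:
        # Uncalled genotype: no type and no bases can be reported.
        return {'called': False, 'gt_type': None, 'gt_bases': None,
                'phased': phased}

    indexes = [int(a) for a in alleles]
    if len(set(indexes)) > 1:
        gt_type = 1                      # heterozygous
    elif indexes[0] == 0:
        gt_type = 0                      # homozygous reference
    else:
        gt_type = 2                      # homozygous alternate

    bases = [ref] + list(alts)           # allele index 0 is REF
    separator = '|' if phased else '/'
    gt_bases = separator.join(bases[i] for i in indexes)
    return {'called': True, 'gt_type': gt_type, 'gt_bases': gt_bases,
            'phased': phased}


call = describe_call('0|0', 'T', ['A'])
print(call['called'], call['gt_type'], call['gt_bases'], call['phased'])
```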

Metadata regarding the VCF file itself can be investigated through the
following attributes:

* ``VCFReader.metadata``
* ``VCFReader.infos``
* ``VCFReader.filters``
* ``VCFReader.formats``
* ``VCFReader.samples``
* ``Reader.metadata``
* ``Reader.infos``
* ``Reader.filters``
* ``Reader.formats``
* ``Reader.samples``

For example::

>>> vcf_reader.metadata['fileDate']
20090805
'20090805'
>>> vcf_reader.samples
['NA00001', 'NA00002', 'NA00003']
>>> vcf_reader.filters
{'q10': Filter(id='q10', desc='Quality below 10'),
's50': Filter(id='s50', desc='Less than 50% of samples have data')}
OrderedDict([('q10', Filter(id='q10', desc='Quality below 10')), ('s50', Filter(id='s50', desc='Less than 50% of samples have data'))])
>>> vcf_reader.infos['AA'].desc
Ancestral Allele
'Ancestral Allele'

ALT records are actually classes, so that you can interrogate them::

>>> reader = vcf.Reader(open('vcf/test/example-4.1-bnd.vcf'))
>>> _ = next(reader); row = next(reader)
>>> print row
Record(CHROM=1, POS=2, REF=T, ALT=[T[2:3[])
>>> bnd = row.ALT[0]
>>> print bnd.withinMainAssembly, bnd.orientation, bnd.remoteOrientation, bnd.connectingSequence
True False True T

The Reader supports retrieval of records within designated regions for files
with tabix indexes via the fetch method. This requires the pysam module as a
dependency. Pass in a chromosome, and, optionally, start and end coordinates,
for the regions of interest::

>>> vcf_reader = vcf.Reader(filename='vcf/test/tb.vcf.gz')
>>> # fetch all records on chromosome 20 from base 1110696 through 1230237
>>> for record in vcf_reader.fetch('20', 1110695, 1230237): # doctest: +SKIP
... print record
Record(CHROM=20, POS=1110696, REF=A, ALT=[G, T])
Record(CHROM=20, POS=1230237, REF=T, ALT=[None])

Note that the start and end coordinates are in the zero-based, half-open
coordinate system, similar to ``_Record.start`` and ``_Record.end``. The very
first base of a chromosome is index 0, and the region includes bases up
to, but not including the base at the end coordinate. For example::

>>> # fetch all records on chromosome 4 from base 11 through 20
>>> vcf_reader.fetch('4', 10, 20) # doctest: +SKIP

would include all records overlapping a 10 base pair region from the 11th base
through the 20th base (which is at index 19) of chromosome 4. It would not
include the 21st base (at index 20). (See
http://genomewiki.ucsc.edu/index.php/Coordinate_Transforms for more
information on the zero-based, half-open coordinate system.)
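Since VCF positions are 1-based while the fetch coordinates are zero-based and half-open, a small conversion helper can prevent off-by-one mistakes. Both functions below are hypothetical illustrations rather than part of PyVCF; ``overlaps`` assumes a record spans ``len(REF)`` bases starting at ``POS``.

```python
def to_zero_based_half_open(first_base, last_base):
    """Convert a 1-based inclusive range to zero-based, half-open coordinates."""
    if first_base < 1 or last_base < first_base:
        raise ValueError('expected 1 <= first_base <= last_base')
    return first_base - 1, last_base


def overlaps(pos, ref, start, end):
    """Return True if a record at 1-based POS overlaps the region [start, end)."""
    record_start = pos - 1                  # zero-based start of the record
    record_end = record_start + len(ref)    # exclusive end
    return record_start < end and record_end > start


# Bases 11 through 20 correspond to fetch arguments (10, 20).
print(to_zero_based_half_open(11, 20))
# A record at the 21st base lies outside that region.
print(overlaps(21, 'A', 10, 20))
```

The same conversion explains the earlier example: bases 1110696 through 1230237 become the arguments ``1110695, 1230237``.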

The ``Writer`` class provides a way of writing a VCF file. Currently, you must specify a
template ``Reader`` which provides the metadata::

>>> vcf_reader = vcf.Reader(filename='vcf/test/tb.vcf.gz')
>>> vcf_writer = vcf.Writer(open('/dev/null', 'w'), vcf_reader)
>>> for record in vcf_reader:
... vcf_writer.write_record(record)

An extensible script for filtering VCF files is available as ``vcf_filter.py``. VCF filters
declared by other packages will be available for use in this script. Please
see :doc:`FILTERS` for full description.
56 changes: 56 additions & 0 deletions docs/API.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,56 @@
API
===

vcf.Reader
----------

.. autoclass:: vcf.Reader
:members:

vcf.Writer
----------

.. autoclass:: vcf.Writer
:members:

vcf.model._Record
-----------------

.. autoclass:: vcf.model._Record
:members:

vcf.model._Call
---------------

.. autoclass:: vcf.model._Call
:members:

vcf.model._AltRecord
--------------------

.. autoclass:: vcf.model._AltRecord
:members:

vcf.model._Substitution
-----------------------

.. autoclass:: vcf.model._Substitution
:members:

vcf.model._SV
-------------

.. autoclass:: vcf.model._SV
:members:

vcf.model._SingleBreakend
-------------------------

.. autoclass:: vcf.model._SingleBreakend
:members:

vcf.model._Breakend
-------------------

.. autoclass:: vcf.model._Breakend
:members: