vcf2pandas

vcf2pandas is a Python package that converts VCF files to pandas dataframes.

Install

pip install vcf2pandas

Dependencies

pandas (2.1.0)
pysam (0.22.1)

Usage

Selecting all columns (default behaviour)

from vcf2pandas import vcf2pandas
import pandas

df = vcf2pandas("path_to_vcf.vcf")

Remove all empty columns

Sometimes the header declares INFO or FORMAT fields that none of the variants or samples actually have. You can choose to remove all of these empty columns from the pandas dataframe.

df = vcf2pandas("path_to_vcf.vcf", remove_empty_columns=True)
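To make "empty column" concrete, here is a rough pandas-only sketch of the same idea applied to an existing dataframe. It assumes an empty cell holds the `.` placeholder described under Output. The helper name `drop_placeholder_columns` and the demo data are hypothetical, and this is not necessarily how vcf2pandas implements `remove_empty_columns`.

```python
import pandas as pd


def drop_placeholder_columns(df: pd.DataFrame, placeholder: str = ".") -> pd.DataFrame:
    """Return df without the columns whose every cell equals `placeholder`."""
    # True for each column that contains nothing but the placeholder.
    all_empty = (df == placeholder).all(axis=0)
    # Keep only the columns that have at least one real value.
    return df.loc[:, ~all_empty]


# Hypothetical data: INFO:GENE is empty for every variant.
demo = pd.DataFrame(
    {
        "CHROM": ["chr1", "chr1"],
        "INFO:DP": ["10", "."],
        "INFO:GENE": [".", "."],
    }
)
trimmed = drop_placeholder_columns(demo)
```

Columns with at least one real value (such as `INFO:DP` here) are kept even if some cells are `.`.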

Selecting custom columns and samples

info_fields = ["info_field_1", "info_field_2"]
sample_list = ["sample_name_1", "sample_name_2"]
format_fields = ["format_name_1", "format_name_2"]

df_selected = vcf2pandas(
    "path_to_vcf.vcf",
    info_fields=info_fields,
    sample_list=sample_list,
    format_fields=format_fields,
)

Renaming custom columns and samples

From v0.2.0, renaming column and sample names is supported. Simply input a dictionary instead of a list with your name mapping. See example below.

info_fields = {
    "info_field_1": "renamed_info_field_1",
    "info_field_2": "renamed_info_field_2"
}
sample_list = {
    "sample_name_1": "renamed_sample_name_1",
    "sample_name_2": "renamed_sample_name_2"
}
format_fields = {
    "format_name_1": "renamed_format_name_1",
    "format_name_2": "renamed_format_name_2"
}

df_renamed = vcf2pandas(
    "path_to_vcf.vcf",
    info_fields=info_fields,
    sample_list=sample_list,
    format_fields=format_fields,
)

Note

You do not need to pass everything as a list or everything as a dictionary. You can mix and match defaults, lists and dictionaries for info_fields, sample_list and format_fields.

Custom column ordering

vcf2pandas can select custom/specific:

INFO fields
samples
FORMAT fields

The selected columns are then ordered to match the input list.

For example, the following list:

info_fields = ["DP", "MQM", "QA"]

produces these columns, in that order:

INFO:DP    INFO:MQM    INFO:QA

Output

INFO and FORMAT headings

INFO:INFO_FIELD                     e.g. INFO:DP
FORMAT:SAMPLE_NAME:FORMAT_FIELD     e.g. FORMAT:HG002:GT

The info field, format field and sample names can also be mapped to custom values by using a dictionary. See Renaming custom columns and samples.
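The heading convention above makes it easy to pick columns apart again. The sketch below splits a heading into its parts. The helper `parse_heading` is hypothetical (it is not part of vcf2pandas), and it assumes that field and sample names contain no `:` characters.

```python
def parse_heading(column: str) -> dict:
    """Split a column heading into its kind, sample and field parts."""
    parts = column.split(":")
    if parts[0] == "INFO" and len(parts) == 2:
        # INFO:INFO_FIELD
        return {"kind": "INFO", "sample": None, "field": parts[1]}
    if parts[0] == "FORMAT" and len(parts) == 3:
        # FORMAT:SAMPLE_NAME:FORMAT_FIELD
        return {"kind": "FORMAT", "sample": parts[1], "field": parts[2]}
    # Anything else is treated as a plain column and left unsplit.
    return {"kind": "OTHER", "sample": None, "field": column}


# Hypothetical list of headings; "POS" stands in for any non-INFO/FORMAT column.
columns = ["POS", "INFO:DP", "FORMAT:HG002:GT", "FORMAT:HG002:AO"]
hg002_columns = [c for c in columns if parse_heading(c)["sample"] == "HG002"]
```

With a real dataframe, `df[hg002_columns]` would then select every FORMAT column for that sample.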

INFO or FORMAT fields not present for some variants

When an INFO or FORMAT field is not present for a variant, vcf2pandas puts a . in that cell instead. For example, in vcf3_all.txt the INFO:GENE column has . for the first 7 variants.
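Because missing values appear as `.`, a numeric column may need converting before analysis. The following is a minimal sketch using made-up data. It assumes the cells are stored as strings, which may not match the dtypes vcf2pandas actually produces.

```python
import pandas as pd

# Hypothetical dataframe using the "." placeholder for missing values.
demo = pd.DataFrame(
    {
        "INFO:DP": ["12", ".", "30"],
        "INFO:GENE": [".", ".", "BRCA1"],
    }
)

# Count how many cells in each column are the placeholder.
missing_per_column = (demo == ".").sum()

# Convert depth to numbers; "." cannot be parsed, so it becomes NaN.
depth = pd.to_numeric(demo["INFO:DP"], errors="coerce")
```

`errors="coerce"` turns every unparseable value into NaN, not only `.`, so it is worth checking the placeholder counts first.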

Examples

Example VCF files and their outputs (dataframes saved as .txt files) are available in examples/.

Example Usage

df1_all = vcf2pandas("examples/vcf1.vcf")
df2_all = vcf2pandas("examples/vcf2.vcf")

df3_all = vcf2pandas("examples/vcf3.vcf")

info_fields = ["DP"]
sample_list = ["HG002"]
format_fields = ["GT", "AO"]

df3_selected = vcf2pandas(
    "examples/vcf3.vcf",
    info_fields=info_fields,
    sample_list=sample_list,
    format_fields=format_fields
)

To print to a text file:

with open("path_to_txt_file.txt", "w", encoding='utf-8') as f:
    f.write(df.to_string())
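If a machine-readable file is more useful than the aligned text from to_string(), pandas can write tab-separated output instead. This sketch uses a small made-up dataframe in place of a vcf2pandas result. The column names and the output path are illustrative only.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Stand-in for a dataframe returned by vcf2pandas (hypothetical columns).
demo = pd.DataFrame({"CHROM": ["chr1"], "POS": [100], "INFO:DP": ["12"]})

# Write tab-separated values without the pandas row index.
out_path = Path(tempfile.mkdtemp()) / "variants.tsv"
demo.to_csv(out_path, sep="\t", index=False)

# Read it back as strings so placeholders such as "." are kept verbatim.
round_trip = pd.read_csv(out_path, sep="\t", dtype=str)
```

Reading with `dtype=str` avoids pandas guessing types for columns that mix numbers and placeholders.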

For more examples, see tests/run_examples.py.

To recreate the examples in the examples/ folder, run:

cd vcf2pandas
poetry run python tests/run_examples.py

Changelog

v0.1.0

Initial project.

v0.1.1

Fixed conversion of the variant filter to a string.

v0.1.2

Updated pysam version to 0.22.1.

v0.2.0

Fixed bug where some info/format fields would be overwritten with . if not all samples/variants had all the info/format values.
Changed how info/format fields are discovered: they are now taken from the VCF header.
Added functionality to rename columns using dictionaries. This is a non-breaking change: all existing uses of this package will still work.
Added functionality to remove columns that are completely empty. Also a non-breaking change.
Updated README with more examples.
Added more tests for renaming columns.
Added unit testing with pytest.

Issues

Please open an issue if you encounter any problems! Thanks!
