Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

VCF Record format/sample write order #1325

Open
awgymer opened this issue Jan 16, 2025 · 0 comments
Open

VCF Record format/sample write order #1325

awgymer opened this issue Jan 16, 2025 · 0 comments

Comments

@awgymer
Copy link

awgymer commented Jan 16, 2025

The SAMPLE information for a VariantRecord is written out in the insertion order of the sample dict of the first sample populated for the record. This can lead to two samples in the same VCF having different orders for their FORMAT column which is somewhat unintuitive, although I am not clear if it's an actual bug or against spec.

Here is an MRE:

from pysam import VariantFile, VariantHeader, VariantRecord


header = VariantHeader()
header.contigs.add('chr1')
header.formats.add("GT", "1", "String", "Genotype")
header.formats.add("DP", "1", "Integer", "Read depth in the tumor BAM")
header.formats.add("AF", "1", "Float", "Estimated allele frequency in the tumor BAM")
header.add_samples(['sample1', 'sample2'])


SAMPLE1 = {"GT": (1,0), "DP": 10, "AF": 0.5}
SAMPLE2 = {"GT": (1,0), "DP": 10, "AF": 0.5}

rec1 = header.new_record(
    contig='chr1',
    id='formats_in_order',
    start=0,
    stop=1,
    alleles=('T', 'A')
)

for k, v in SAMPLE1.items():
    rec1.samples['sample1'][k] = v

for k, v in SAMPLE2.items():
    rec1.samples['sample2'][k] = v

rec2 = header.new_record(
    contig='chr1',
    id='first_sample_not_in_order',
    start=0,
    stop=1,
    alleles=('T', 'A')
)

# Note how we insert the sample values in reverse order
for k, v in list(SAMPLE1.items())[::-1]:
    rec2.samples['sample1'][k] = v

for k, v in SAMPLE2.items():
    rec2.samples['sample2'][k] = v

rec3 = header.new_record(
    contig='chr1',
    id='second_sample_not_in_order',
    start=0,
    stop=1,
    alleles=('T', 'A')
)

for k, v in list(SAMPLE2.items())[::-1]:
    rec3.samples['sample2'][k] = v

for k, v in SAMPLE1.items():
    rec3.samples['sample1'][k] = v


testfile = VariantFile('-', 'w', header=header)
testfile.write(rec1)
testfile.write(rec2)
testfile.write(rec3)
testfile.close()

##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##contig=<ID=chr1>
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read depth in the tumor BAM">
##FORMAT=<ID=AF,Number=1,Type=Float,Description="Estimated allele frequency in the tumor BAM">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	sample1	sample2
chr1	1	formats_in_order	T	A	.	.	.	GT:DP:AF	1/0:10:0.5	1/0:10:0.5
chr1	1	first_sample_not_in_order	T	A	.	.	.	GT:AF:DP	1/0:0.5:10	1/0:0.5:10
chr1	1	second_sample_not_in_order	T	A	.	.	.	GT:AF:DP	1/0:0.5:10	1/0:0.5:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant