Other misc. VCF test data queries. #608
Comments
I'd also argue that we need BCF-compatible copies of these files. The lack of INFO and Contig lines means it's currently not possible to do round-trip tests. We should have both minimal and fully-specified variants for each file to permit implementers to validate their BCF code. |
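To make the round-trip concern concrete, here is a rough sketch of a check for contigs and INFO keys that records use but the header never declares. The function name `missing_declarations` is made up for illustration, and the sketch assumes ID is the first key inside the angle brackets; it is not how bcftools or any validator actually does it.

```python
import re

def missing_declarations(vcf_text):
    """Return (contigs, info_keys) used in records but not declared in the header."""
    declared_contigs, declared_info = set(), set()
    used_contigs, used_info = set(), set()
    for line in vcf_text.splitlines():
        if line.startswith("##"):
            # Assumption: ID is the first key inside <...>.
            m = re.match(r"##(contig|INFO)=<ID=([^,>]+)", line)
            if m:
                target = declared_contigs if m.group(1) == "contig" else declared_info
                target.add(m.group(2))
        elif line.startswith("#") or not line.strip():
            continue  # the #CHROM column header line, or a blank line
        else:
            fields = line.split("\t")
            used_contigs.add(fields[0])
            if len(fields) > 7 and fields[7] != ".":
                for item in fields[7].split(";"):
                    # Flag-type entries have no '='; the whole item is the key.
                    used_info.add(item.split("=", 1)[0])
    return used_contigs - declared_contigs, used_info - declared_info
```

A file for which both returned sets are empty is at least a candidate for a "fully-specified" variant; a file with non-empty sets would fall in the "minimal" category.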
Looking at genuine VCFs, I see 1000 genomes data containing:
So maybe we've been accepting for years VCF headers that aren't key=value. In that case the specification needs changing to permit them, instead of explicitly forbidding them as it does now. |
That last one's easy: “Unstructured header lines” are the ones that aren't structured; and “structured” ones are those that have That's how I interpret this verbiage anyway. It could certainly be more clearly written and comprehensive. The VCF spec has historically been fluffy enough that it needs a non-adversarial reader happy to constructively fill in the blanks and interpret the vagueness in a way consistent with extant VCF files… (It would be nice to improve that obvs, and you've listed a number of good header-related candidates for more clarity…) |
Oh, so when it says key = value it means a single Ie does it have some nested syntax where ##key=value is a single key, but value can be < (key=value,)* >, permitting key-value pairs themselves to be values? Frankly the entire introduction needs rewriting, preferably by someone who understands VCF (ie not me). Perhaps starting with what all the meta-characters actually mean. It's clear as mud right now. :( Edit: also before posting this issue, I'd already searched for unstructured and non-structured. It doesn't exist. I don't know where the distinction between structured and unstructured lines exists in the spec. |
So going back to the two entries I see that cause bcftools breakage, is it the case that any "<" means structured and absence of "<" means unstructured? If so I assume it's failing because it's attempting to apply structured logic to an unstructured entry, and the error here is that those VCF header lines should be rewritten to exclude the angle brackets. Edit: Yes! I see: "All structured lines that have their value enclosed within ``<>'' require an ID which must be unique within their type." I'm guessing it's implicit that lack of <> means unstructured. What's the syntax checking on unstructured? Anything? Ie value is I wonder though how we got them in the first place. I thought all of these test files were a product of the VCF syntax checker, in which case at least one person has interpreted the spec differently. :) |
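For what it's worth, the reading above (value enclosed in <> means structured, anything else is unstructured, and structured lines need an ID unique within their type) can be sketched in a few lines. This is only one interpretation of the wording, not a grammar from the spec; `classify` and `duplicate_ids` are names invented here, and the ID lookup is naive (it could be fooled by ",ID=" appearing inside a quoted value that precedes the real ID).

```python
import re
from collections import Counter

# A meta-information line: "##", a key with no "=", "=", then the value.
META_RE = re.compile(r"^##([^=]+)=(.*)$")

def classify(line):
    """Return (kind, key, value) with kind in {structured, unstructured, invalid}."""
    m = META_RE.match(line)
    if m is None:
        return ("invalid", None, None)  # no key=value shape at all
    key, value = m.group(1), m.group(2)
    if value.startswith("<") and value.endswith(">"):
        return ("structured", key, value)
    return ("unstructured", key, value)

def duplicate_ids(lines):
    """List (type, ID) pairs that occur more than once among structured lines."""
    seen = Counter()
    for line in lines:
        kind, key, value = classify(line)
        if kind != "structured":
            continue
        m = re.search(r"(?:^<|,)ID=([^,>]+)", value)
        if m:
            seen[(key, m.group(1))] += 1
    return sorted(k for k, n in seen.items() if n > 1)
```

Under this reading a fully quoted line with no "=" comes back as invalid, while a line that merely happens to contain "=" is accepted as unstructured, which is exactly the looseness being questioned in this thread.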
Yep. I assumed “pairs” (plural) was because it was talking about multiple meta-information lines, or specifically the set of all the meta-information lines in a VCF file. That sentence:—
has existed unchanged in the spec since the 2013 G1K wiki spec page, which is before the formalism about structured lines was introduced. It is agreed (by me) that this is fabulously muddily written. |
Following on from the conference call, with thanks to John for identifying the relevant text: In answer to the first point above, "<1>" appears to be deliberate. In the spec we see: https://github.com/samtools/hts-specs/blob/master/VCFv4.3.tex#L296-L297
The second sentence there however is broken by this example, as we go from "1" to "<1>" back to "1". All entries are sequential positions, with "<1>" being an alternate allele part of chr "1". I believe the example considers all CHROM being a contiguous block as it equates 1 and alternate-allele-of-1 to be the same thing, but in practice that's challenging to handle. Regarding the second problem above, I notice this file also contains:
The "other" pedigree syntax. ;-) The spec lists that syntax as well as |
On The original G1K wiki VCF ur-specs and the 4.1 and 4.2 documents have two instances of
The 4.3 document is similar, but f247cee changed one of the instances:
Reading the tea leaves, it seems that PR #88 (which contained that commit) was making the claim that these angle brackets were merely metasyntactic variable notation not literal characters like the My feeling is that that claim is surely correct and the other instance should be similarly updated (or, since we're a typeset document, we could I dunno use italics or something 😄). And hence the test files are indeed erroneous. However as usual it really comes down to how |
I've never used If I had to pick, I'd ditch the existing VCF/BCF header format and replace it with YAML front matter. Where VCF format would look something like:
|
|
PR #620 attempts to clarify this section, by rewording it as outlined in #608 (comment). |
§1.4.2 states that for §1.4.4 is not explicit, but states that IMHO §1.4.3 and §1.4.4 could stand to be a bit more explicit… |
With field ordering now clarified as insignificant in VCF 4.4, this is an outstanding question that needs to be resolved; otherwise, unordered fields increase the complexity of parsing.
I'd also like this confirmed. Unfortunately, the spec reuses the same symbols for unstructured and structured lines ( |
The format is parsimonious, it should be Similarly, |
The fact we're having this conversation though indicates the spec is ambiguous, or at the very least unclear. It should be (but is not) "obvious" to a naive reader. The obvious interpretation I'd take is that values may be quoted, and that presumably there is some quote escape mechanism because that's a logical corollary of adopting quoting. (This interpretation however would be wrong.) What's certainly not obvious from a naive not-reading-the-spec fashion is that Description must be quoted. Even reading the spec, I still don't know if it's true that eg Type must not be quoted. What happens if we do? The spec needs to be much more explicit on these things. All I can do is read between the lines, which gives me the gut feeling there are known keys and unknown (optional) keys. Optional keys have no known types, and are listed as requiring quoting, even if numeric. This implies the parser doesn't determine type by content but by decoration. Known keys have hard specified types, and as determined already the parser cannot intuit a type by content. From this I would assume the parsers would hard fail if a controlled vocabulary matched That's pure guesswork though. The spec needs to remove the need for such speculation. |
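To illustrate the "decoration, not content" guess, here is a small sketch that splits the body of a structured line on commas outside double quotes and records whether each value was quoted. The function name `split_fields` is hypothetical, and the sketch deliberately implements no escape mechanism inside quotes, since what (if anything) is permitted there is part of what is unclear.

```python
def split_fields(value):
    """Split '<k=v,k="v, w">' into {key: (text, was_quoted)}."""
    if not (value.startswith("<") and value.endswith(">")):
        raise ValueError("value is not enclosed in <>")
    body = value[1:-1]
    parts, current, in_quotes = [], [], False
    for ch in body:
        if ch == '"':
            in_quotes = not in_quotes  # no escape handling: every quote toggles
            current.append(ch)
        elif ch == "," and not in_quotes:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    parts.append("".join(current))
    fields = {}
    for part in parts:
        key, sep, val = part.partition("=")
        if not sep:
            raise ValueError(f"field without '=': {part!r}")
        quoted = len(val) >= 2 and val[0] == '"' and val[-1] == '"'
        fields[key] = (val[1:-1] if quoted else val, quoted)
    return fields
```

A parser built this way could then reject, say, a quoted Type or an unquoted Description purely from the `was_quoted` flag, without ever inspecting the text. Whether the spec intends that is the open question.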
21 June meeting proposed clarification:
|
Some of this is an issue for the data, but some also needs some major clarifications in the spec.
The VCF spec carefully lists the syntax of specific line types. These are what VCF calls "structured lines", but it's less clear on non-structured data. The spec mentions structured lines, but I could find no definition for unstructured. Is the opposite of structured "meta-information"? If so the spec doesn't say much about them other than that they are key/value and must be "well formed", but sadly doesn't explain what that means. That needs fixing. I assume they must still follow the same quoting rules.
Which leads me on to meta-characters. Some fields are quoted, and others not. Eg from the spec:
Here Description is quoted, presumably because it may contain spaces, or more importantly commas. Does it need to be quoted? Could Description=Genotype be used? Could we say Type="Integer"? There is no grammar or even a regexp for how to parse headers, which is a major problem.

So, on to the specifics which tripped up bcftools:
This one is entirely quoted, but it has no = in there at all. Hence it also fails the requirement of being key=value.
Now this line does have an equals sign, so technically matches the key=value requirement. It clearly wasn't intended to be one though, but why is it failing? I assume bcftools dislikes some characters in the key, eg : or ?. What characters are permitted in the keys? When would we need to quote it? Can we even quote it if we desired to? What characters must be quoted? Obviously comma, and maybe <>, but = too? I don't see why it would for a value, but clearly it would for the key and we'd probably not want to have differing quoting requirements for both.
Either way, I believe these two lines to be illegal, in addition to the spec needing much more clarification as to what is acceptable.
##CauseOfFailure= lines with no key=value formatting.

To be clear, the specification states that "It contains meta-information lines (prefixed with ##)" and "File meta-information is included after the ## string and must be key=value pairs."
I know they're failures, but the failure shouldn't be the thing describing the failure, as that makes validating that we fail rather hard (we fail for other reasons). They should all be of the form ##CauseOfFailure=<Reason="text">.
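As a sketch of what checking that proposed form could look like (the pattern below is my own reading of ##CauseOfFailure=<Reason="text">, and `failure_reason` is an invented helper name; it assumes the reason text contains no double quotes):

```python
import re

# Matches exactly: ##CauseOfFailure=<Reason="...">, with no quote inside the text.
CAUSE_RE = re.compile(r'^##CauseOfFailure=<Reason="([^"]*)">$')

def failure_reason(line):
    """Return the Reason text if the line has the proposed form, else None."""
    m = CAUSE_RE.match(line)
    return m.group(1) if m else None
```

With the failure description in a well-formed line like this, a test harness could read the expected reason first and then confirm the file fails for that reason rather than because of the description itself.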