tabix: how to retain header for arbitrary file types? #1434

Open
carbocation opened this issue May 11, 2022 · 2 comments

@carbocation

I'm creating tabix indexes for arbitrarily formatted GWAS summary stats. This works fine, but the header seems to be lost. If I pass --skip-lines, then when I query the header, I get a blank line. If I don't pass --skip-lines, then I get [E::get_intv] Failed to parse TBX_GENERIC, was wrong -p [type] used?, which I think makes sense because the first row is not numeric (it is the header).

Does retaining the header effectively only work if it is prefixed with #? If so, should that be mentioned in the documentation?

Apologies if this is documented but I somehow missed it.

$ tabix -h

Version: 1.10.2
Usage:   tabix [OPTIONS] [FILE] [REGION [...]]

Indexing Options:
   -0, --zero-based           coordinates are zero-based
   -b, --begin INT            column number for region start [4]
   -c, --comment CHAR         skip comment lines starting with CHAR [null]
   -C, --csi                  generate CSI index for VCF (default is TBI)
   -e, --end INT              column number for region end (if no end, set INT to -b) [5]
   -f, --force                overwrite existing index without asking
   -m, --min-shift INT        set minimal interval size for CSI indices to 2^INT [14]
   -p, --preset STR           gff, bed, sam, vcf
   -s, --sequence INT         column number for sequence names (suppressed by -p) [1]
   -S, --skip-lines INT       skip first INT lines [0]

Querying and other options:
   -h, --print-header         print also the header lines
   -H, --only-header          print only the header lines
   -l, --list-chroms          list chromosome names
   -r, --reheader FILE        replace the header with the content of FILE
   -R, --regions FILE         restrict to regions listed in the file
   -T, --targets FILE         similar to -R but streams rather than index-jumps
   -D                         do not download the index file
@daviesrob daviesrob self-assigned this May 17, 2022
@daviesrob
Member

You're right that it's not very well documented, but the header options only use the stored -c value when working out which lines to print. This is fine for most of the formats that tabix was originally designed for like SAM, VCF and BED, as they all have prefixes on header lines.
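
One way to act on this, sketched here as an assumption rather than a documented recipe: prefix the header row with # before compressing, so that it matches the comment character stored in the index. The helper name comment_header, the file name sumstats.tsv, and the column numbers are placeholders for illustration, not part of tabix.

```shell
# Hypothetical helper: prefix the first line of a tab-delimited file with '#'
# so that it matches the comment character given to tabix via -c.
# Reads the named file(s), or standard input when no file is given.
comment_header() {
    awk 'NR == 1 { print "#" $0; next } { print }' "$@"
}

# Intended usage (assumes bgzip and tabix are installed, that the data rows
# are already sorted by sequence name and position, and that columns 1 and 2
# hold the sequence name and position; adjust -s/-b/-e for your file):
#   comment_header sumstats.tsv | bgzip > sumstats.tsv.gz
#   tabix -s 1 -b 2 -e 2 -c '#' sumstats.tsv.gz
#   tabix -h sumstats.tsv.gz 1:1000-2000
```

The trade-off is that anything reading the file afterwards has to strip the leading # from the first column name.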

Unfortunately I don't think tabix stores the number of lines that were skipped in the index, although it might be possible to work it out by finding the first indexed line. So it may be possible to add a way to output only the unindexed data at the start of the file (I'm not sure it would be good to use --print-header for this as we don't know if data skipped using -S was due to it being header lines or for some other reason).
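
In the meantime, the skipped lines can be read straight from the compressed file, assuming you still know the value that was passed to -S at indexing time. BGZF output from bgzip is gzip-compatible, so plain gzip can decompress it. This is only a sketch; skipped_lines is a hypothetical helper name, not a tabix feature.

```shell
# Hypothetical helper: print the first N lines of a bgzip/gzip-compressed file.
#   $1: path to the compressed file
#   $2: number of leading lines, i.e. the value originally passed to -S
skipped_lines() {
    gzip -dc < "$1" | head -n "$2"
}

# Intended usage (sumstats.tsv.gz is a placeholder file name):
#   skipped_lines sumstats.tsv.gz 1
#   tabix sumstats.tsv.gz 1:1000-2000
```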

@dtaylo95

I'm seeing the same behavior as @carbocation. @daviesrob, it would be great if there were an additional option that let users designate one or more lines as header lines when indexing, so that those lines show up when querying with --print-header.
