Sorting of extract calls output and tabix indexing #313

richardheery · 2024-12-10T14:56:51Z

Hello,

I am wondering how the output of modkit extract calls is sorted. It does not seem to be sorted by genomic position, and the reads do not occur in the same order as in the input BAM file either. This also seems to be the case when using the --bgzf flag. Isn't the idea of compressing with bgzip that an index can then be created using tabix, which requires that the file be sorted by sequence and position? I have sorted the output files myself using sort -k4,4 -k3,3n, but this took several hours due to the size of the files produced by extract calls. Would it be possible to add a flag for pre-sorted output, to avoid having to perform this step?
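For reference, here is a minimal Python sketch of the ordering that sort -k4,4 -k3,3n produces: contig (column 4) first, then numeric position (column 3), with the header line kept on top. The column names and the tiny example table are assumptions for illustration only, and the function sorts in memory, so it would only suit a region-sized subset rather than a full output file.

```python
import csv
import io

def sort_calls_by_position(tsv_text, chrom_col=3, pos_col=2):
    """Sort rows by contig, then by numeric position; the header stays first.

    Column indices are 0-based and mirror `sort -k4,4 -k3,3n`
    (assumed layout: column 4 = contig, column 3 = reference position).
    """
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    header = next(reader)
    rows = sorted(reader, key=lambda r: (r[chrom_col], int(r[pos_col])))
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return out.getvalue()

# Hypothetical, heavily truncated table: grouped by read, as extract calls emits it.
example = (
    "read_id\tforward_read_position\tref_position\tchrom\n"
    "read_a\t3\t100\tchr1\n"
    "read_a\t53\t150\tchr1\n"
    "read_b\t10\t120\tchr1\n"
    "read_b\t20\t130\tchr1\n"
)
sorted_text = sort_calls_by_position(example)
```

After sorting, the calls from read_a and read_b are interleaved by position, which is the order tabix expects but no longer keeps each read's calls together.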

Cheers,

Richard

ArtRand · 2024-12-11T16:49:05Z

Hello @richardheery,

If you use the --ignore-index flag, the reads in the output should be in the same order as in the input modBAM, but this routine doesn't leverage as much parallelism.

> Isn't the idea of compressing with bgzip that an index can then be created using tabix, though this requires that the input file was originally sorted by sequence and position?

Actually, the idea is to make the output smaller. As you've likely found, the output is grouped by read, and within each group the records are sorted, but this isn't the sorting that tabix usually requires (contig/position). If you sort the table by contig and position, then joining the calls by read_id becomes more difficult.

If you want to look at methylation calls per genomic position, I'd recommend using pileup. On the other hand, if you want "rapid-access" to the read-level information, I recommend running modkit extract calls with the --region option or an --include-bed file. These options will use an indexed, sorted modBAM and quickly retrieve the reads in the region you're querying for.

One nice thing about the current grouping is that if you stream the output to another program, you can operate on each read's calls. I'll consider adding a flag that sorts the output the way you're asking, but I think using --region and piping to sort might be a good way to do it. Maybe you can tell me more about your use case?
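To illustrate the point about operating on each read's calls while streaming, here is a rough Python sketch. It assumes tab-separated rows with a header line, the read ID in the first column, and all rows for a given read arriving contiguously; the helper name iter_reads and the example rows are hypothetical.

```python
import itertools

def iter_reads(lines, read_id_col=0):
    """Yield (read_id, rows) for each run of consecutive rows sharing a read ID.

    Relies on the output being grouped by read; if a read's rows were
    scattered, groupby would yield that read more than once.
    """
    rows = (line.rstrip("\n").split("\t") for line in lines)
    next(rows, None)  # skip the header line (assumed to be present)
    for read_id, group in itertools.groupby(rows, key=lambda r: r[read_id_col]):
        yield read_id, list(group)

# Hypothetical, truncated rows standing in for streamed extract calls output.
example_lines = [
    "read_id\tforward_read_position\tref_position\tchrom\n",
    "read_a\t3\t100\tchr1\n",
    "read_a\t53\t150\tchr1\n",
    "read_b\t10\t120\tchr1\n",
    "read_b\t20\t130\tchr1\n",
    "read_b\t25\t135\tchr1\n",
]
per_read_counts = {rid: len(calls) for rid, calls in iter_reads(example_lines)}
```

When reading piped input, lines could be sys.stdin instead of a list. Only one read's rows are held in memory at a time, which is what a contig/position sort would give up.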

ArtRand added the "question" label (Looking for clarification on inputs and/or outputs) on Dec 11, 2024
