Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sorting of extract calls output and tabix indexing #313

Open
richardheery opened this issue Dec 10, 2024 · 1 comment
Open

Sorting of extract calls output and tabix indexing #313

richardheery opened this issue Dec 10, 2024 · 1 comment
Labels
question Looking for clarification on inputs and/or outputs

Comments

@richardheery
Copy link

Hello,

I am wondering how is the output of modkit extract calls sorted as I have noticed that it does not seem to be sorted by genomic position and neither do the reads occur in the same order as in the input BAM file? This also seems to be the case when using the --bgzf flag. Isn't the idea of compressing with bgzip that an index can then be created using tabix, though this requires that the input file was originally sorted by sequence and position? I have sorted the output files myself using sort -k4,4 -k3,3n, though this took several hours due to the size of the output files by extract calls. Would it be possible to request a flag for pre-sorted output to save having to perform this step?

Cheers,

Richard

@ArtRand ArtRand added the question Looking for clarification on inputs and/or outputs label Dec 11, 2024
@ArtRand
Copy link
Contributor

ArtRand commented Dec 11, 2024

Hello @richardheery,

If you use the --ignore-index flag the reads in the output should be in the same order as the input modBAM - but this routine doesn't leverage as much parallelism.

Isn't the idea of compressing with bgzip that an index can then be created using tabix, though this requires that the input file was originally sorted by sequence and position?

Actually, the idea is to make the output smaller. As you've likely found, the output is grouped by read and within each group the records are sorted but this isn't the sorting that tabix usually requires (contig/position). If you sort the table by contig and position, then joining the calls by read_id becomes more difficult. If you want to look at methylation calls per-genomic position, I'd recommend using pileup. On the other hand, if you want "rapid-access" to the read-level information I recommend running modkit extract calls with the --region option or an --include-bed file. These options will use an indexed, sorted modBAM and quickly retrieve the reads in the region you're querying for. One nice thing about the current grouping is that if you stream the output to another program you can operate on each read's calls. I'll consider adding a flag that sorts the output the way you're asking, but I think using --region and piping to sort might be a good way to do it. Maybe you can tell me more about your use case?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Looking for clarification on inputs and/or outputs
Projects
None yet
Development

No branches or pull requests

2 participants