Replace completeness multitasker with d4tools perc_cov #349
Codecov Report. Attention: Patch coverage is
@@           Coverage Diff            @@
##             main     #349      +/-  ##
==========================================
+ Coverage   90.18%   91.34%   +1.15%
==========================================
  Files          30       30
  Lines        1457     1467      +10
==========================================
+ Hits         1314     1340      +26
+ Misses        143      127      -16
|
I've compared the results locally with a real d4 file (but fewer genes): the main branch and this branch give exactly the same results. The code for computing the percentage of fully covered genes hasn't changed in this PR, so I assume it must be the changes in the stats over the intervals. The Python method and d4tools perc_cov might give slightly different results, and if for instance one method returns a coverage completeness of 99,99999 instead of 1, then the gene interval will count as not fully covered. I'll test with more gene panels |
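As a side note, one way to make the fully-covered check robust to such tiny floating-point differences is a tolerance comparison. This is a hypothetical helper to illustrate the point, not the actual chanjo2 code:

```python
import math

def is_fully_covered(completeness: float, tol: float = 1e-6) -> bool:
    """Treat values within `tol` of 1.0 as fully covered, so tiny
    floating-point differences between methods don't flip the result."""
    return math.isclose(completeness, 1.0, abs_tol=tol)

# A completeness of 0.9999999 computed by one method would otherwise
# compare unequal to the exact 1.0 returned by another method:
print(0.9999999 == 1.0)            # strict equality fails
print(is_fully_covered(0.9999999)) # tolerance comparison succeeds
```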
Cool to see this moving! Close to the goal now :D One thing I wanted to ask: are you handling that the output format sometimes comes in scientific notation and sometimes in plain numeric? I think in the end, the logic is like this:
|
The output from d4tools you mean? |
Yes. Here is how the test output looks now: https://github.com/38/d4-format/blob/master/d4tools/test/stat/perc_cov-multiple-ranges/output.txt
(No cases |
Hmm. I'll see if I can test this a bit as well. |
Mmm, in theory the scientific notation should be taken care of on line 32 of src/chanjo2/meta/handle_completeness_stats.py, where it gets converted to float.. |
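A quick check of that claim: Python's float() accepts both plain decimal and scientific notation, so a conversion to float should normalize either form d4tools emits:

```python
# Both notations seen in the d4tools output parse to the same kind of value:
values = ["1", "0.5", "9.99999e-01", "1e0"]
parsed = [float(v) for v in values]
print(parsed)  # [1.0, 0.5, 0.999999, 1.0]
```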
Hmm, has the expected format for the report endpoint changed? This command has worked for me before:
I am getting:
I'll continue digging, but suspecting this is something obvious :) |
Yes, I removed the support for JSON requests, so it accepts only x-www-form-urlencoded form data (the same requests that are sent from Scout). I did this with the idea that the "coverage" endpoints should accept JSON instead. It makes maintenance easier in the long run, I think. Could you try this request instead?
|
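The form-encoded vs. JSON distinction can be illustrated with the standard library. The field names below are hypothetical, not the actual chanjo2 request schema:

```python
import json
from urllib.parse import urlencode

# Hypothetical report payload; the keys are illustrative only.
payload = {"build": "GRCh38", "samples": "sample1", "genes": "ATM,BRCA1"}

# What an x-www-form-urlencoded body looks like (the format the
# report endpoint now expects):
form_body = urlencode(payload)
print(form_body)  # build=GRCh38&samples=sample1&genes=ATM%2CBRCA1

# versus the JSON body the report endpoint no longer accepts:
json_body = json.dumps(payload)
print(json_body)
```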
Thanks! That put me on the right track. This worked:
|
Ah, I just found the bug. d4tools changes the order of the provided intervals and writes a result file ordered by chrom, start and stop, so I have to take care of this. Will fix! |
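A minimal sketch of the fix described above, assuming each result can be keyed back to its input interval by its (chrom, start, stop) coordinates (the intervals and values are made up):

```python
# Intervals in the order the caller provided them:
input_intervals = [("2", 500, 600), ("1", 100, 200), ("1", 50, 80)]

# Simulated d4tools perc_cov output: same intervals, but sorted by
# chrom, start and stop, each with a computed coverage value.
sorted_output = [
    ("1", 50, 80, 0.98),
    ("1", 100, 200, 1.0),
    ("2", 500, 600, 0.75),
]

# Key the results by coordinates, then read them back out in the
# original input order:
by_coords = {(chrom, start, stop): value
             for chrom, start, stop, value in sorted_output}
results_in_input_order = [by_coords[interval] for interval in input_intervals]
print(results_in_input_order)  # [0.75, 1.0, 0.98]
```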
Woa, that's a lot of coverage! 😆 |
Hmmm. I will check, but I think I am in the right place. And also, I get some concerning numbers from d4tools itself:
|
OK. Are you using the latest main branch of d4tools? Perhaps you didn't update or something? |
The code above is executed inside the Chanjo2 container, so if the container is up to date, the command above would be too. I'll test around some more. |
Hmm, running the same ranges for other datasets, things seem to look fine. The one above is one of the first d4-files I created, so maybe something was weird back then. Looks fine for a whole bunch of others:
I'll continue testing, but slightly less concerned now. |
OK, I could reproduce it for that specific d4-file after compiling and running the latest version of d4tools locally, but I cannot reproduce it for any other files. I suspect I indexed the d4 files early on in our pipeline, and that this is related to that (38/d4-format#80). I later removed the indexing, which fits with all later samples looking OK. Conclusion: I don't think it is something of concern here. |
More testing. Here is the report for a real case: inside the Chanjo2 container, I can confirm that the numbers of fully covered genes check out:
|
That's also what I'm seeing with our demo case. Looks like the 100% covered genes are the same in chanjo2 and d4tools |
And it's faster right? |
I timed a case now with 680 IDs. It loaded the page in ~40 seconds. (This is with 5 thresholds though, which might be a bit greedy :) ) Running just the d4 part, it finished in 6 seconds, which is much faster than before. So it looks like the bottleneck has shifted somewhere else now. But I think ~40 seconds will be acceptable loading time for the geneticists ... |
I think it looks great, looking forward to finally putting this into action 👌 |
I think so too. And I know there is room for improvement. This PR was meant to be a non-invasive and non-problematic alternative to the code we have on the main branch. Shall I merge? |
Yes, let's merge 🎉 |
Worked nicely in testing. Didn't see anything to mention when reading through the code.
Nice 👌