
Replace completeness multitasker with d4tools perc_cov #349

Merged · 18 commits · Sep 27, 2024

Conversation

@northwestwitch (Member) commented Sep 20, 2024

Description

Added/Changed/Fixed

How to test

  • Deploy on stage and compare coverage completeness results with main branch

Expected outcome

  • Stats from using this branch should be quicker

Review

  • Tests executed by CR, Jakob37
  • "Merge and deploy" approved by Jakob37

This version is a

  • MAJOR - when you make incompatible API changes
  • MINOR - when you add functionality in a backwards compatible manner
  • PATCH - when you make backwards compatible bug fixes or documentation/instructions

@codecov-commenter commented Sep 26, 2024

Codecov Report

Attention: Patch coverage is 93.93939% with 2 lines in your changes missing coverage. Please review.

Project coverage is 91.34%. Comparing base (719f1b8) to head (4263b2e).
Report is 25 commits behind head on main.

Files with missing lines | Patch % | Lines
src/chanjo2/meta/handle_d4.py | 60.00% | 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #349      +/-   ##
==========================================
+ Coverage   90.18%   91.34%   +1.15%     
==========================================
  Files          30       30              
  Lines        1457     1467      +10     
==========================================
+ Hits         1314     1340      +26     
+ Misses        143      127      -16     
Flag | Coverage Δ
unittests | 91.34% <93.93%> (+1.15%) ⬆️

Flags with carried forward coverage won't be shown.


@northwestwitch (Member Author) commented Sep 26, 2024

Test with a scout case and OMIM-AUTO panel (4637 genes)

Main branch on stage -> report created in about 120 seconds:

[screenshot]

This test on stage -> report created in 13 seconds:

[screenshot]

So report creation is about 10 times faster when using perc_cov from d4tools. The coverage numbers are the same, but the percentage of fully covered genes differs (compare the 2 screenshots). I'll check which one is correct!

@northwestwitch (Member Author) commented Sep 26, 2024

> So report creation is about 10 times faster when using perc_cov from d4tools. The coverage numbers are the same, but the percentage of fully covered genes differs (compare the 2 screenshots). I'll check which one is correct!

I've compared the results locally with a real d4 file but fewer genes: the main branch and this branch give exactly the same results. The code for computing the percentage of fully covered genes hasn't changed in this PR, so I assume it must be the changes in the stats over the intervals. The Python method and d4tools perc_cov might give slightly different results: if, for instance, one method returns a coverage completeness of 0.9999999 instead of 1, then the gene interval will count as not fully covered. I'll test with more gene panels
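A float tolerance would be one way to make the two backends agree on "fully covered"; this is only an illustrative sketch (the helper name and tolerance are assumptions, not code from this PR):

```python
import math

def is_fully_covered(completeness: float, rel_tol: float = 1e-6) -> bool:
    """Count an interval as fully covered when its completeness is 1.0
    within a small tolerance, so 0.9999999 and 1.0 agree."""
    return completeness >= 1.0 or math.isclose(completeness, 1.0, rel_tol=rel_tol)
```

With a strict `== 1.0` comparison, a value like 0.9999999 from one backend would silently drop the gene from the fully covered count.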

@Jakob37 (Collaborator) commented Sep 26, 2024

Cool to see this moving! Close to goal now :D

One thing I wanted to ask: are you handling that the output format sometimes comes in scientific notation and sometimes in numeric?

I think in the end, the logic is like this:

  • If > 0.001, the value is printed in plain decimal format, e.g. 0.230
  • If < 0.001, the value is printed in scientific notation, e.g. 1.34e-5 (this will also apply to == 0 I think; not perfect, but I didn't have the heart to squeeze cardemich for any more updates)

@northwestwitch (Member Author):

> Are you handling that the output format sometimes comes in scientific notation and sometimes in numeric?

The output from d4tools you mean?

@Jakob37 (Collaborator) commented Sep 26, 2024

> Are you handling that the output format sometimes comes in scientific notation and sometimes in numeric?
>
> The output from d4tools you mean?

Yes. Here is how the test output looks now:

https://github.com/38/d4-format/blob/master/d4tools/test/stat/perc_cov-multiple-ranges/output.txt

chr	0	2	0.000e0	0.000e0
chr	0	8	0.625	0.375
chr	0	10	0.600	0.300
chr	1	6	0.600	0.400
chr	3	9	1.000	0.500
chr	4	5	1.000	1.000
chr	5	10	0.800	0.400

(No cases > 0 && < 0.001 here unfortunately, but those would use the e syntax as well)
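For reference, the sample output above is tab-separated, with chrom/start/end followed by one value per requested threshold, and Python's float() accepts both notations. A hypothetical parser (names are illustrative, not Chanjo2 code) could look like:

```python
def parse_perc_cov(lines, thresholds):
    """Parse d4tools `stat -s perc_cov=...` output lines: chrom, start,
    end, then one completeness value per coverage threshold."""
    results = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        # float() handles both "0.625" and "0.000e0"
        values = [float(v) for v in fields[3:]]
        results[(chrom, start, end)] = dict(zip(thresholds, values))
    return results

stats = parse_perc_cov(["chr\t0\t8\t0.625\t0.375"], [10, 20])
```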

@Jakob37 (Collaborator) commented Sep 26, 2024

> I've compared the results locally with a real d4 file but fewer genes: the main branch and this branch give exactly the same results. The code for computing the percentage of fully covered genes hasn't changed in this PR, so I assume it must be the changes in the stats over the intervals. The Python method and d4tools perc_cov might give slightly different results: if, for instance, one method returns a coverage completeness of 0.9999999 instead of 1, then the gene interval will count as not fully covered. I'll test with more gene panels

Hmm. I'll see if I can test this a bit as well.

@northwestwitch (Member Author) commented Sep 26, 2024

Mmm, in theory the scientific notation should already be taken care of on line 32 of src/chanjo2/meta/handle_completeness_stats.py, where the value gets converted to a float.
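As a quick sanity check, Python's float() parses both formats d4tools emits, so the conversion on that line should normalize the notation:

```python
# float() parses both the plain and the scientific notation d4tools emits
assert float("0.230") == 0.23
assert float("1.34e-5") == 1.34e-5
assert float("0.000e0") == 0.0
```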

@northwestwitch marked this pull request as ready for review September 26, 2024 07:32
@northwestwitch (Member Author):

Mmm, there is something weird with some genes on this branch. For instance, I have a panel with 44 genes:

Main branch report:

[screenshot]

Note that BRAFP1 is not fully covered.

Same report with this branch:

[screenshot]

BRAFP1 seems fully covered at 10X.

But it's obviously not, because when you check the coverage for just that gene you get this:

[screenshot]

@Jakob37 (Collaborator) commented Sep 26, 2024

Hmm, has the expected format for the report end-point changed?

This command has worked for me before:

curl -X 'POST' \
  'http://localhost:9000/report' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "build": "GRCh38",
  "hgnc_gene_ids": [2933,6842,10447,11947],
  "interval_type": "genes",
  "default_level": 10,
  "panel_name": "Custom panel",
  "case_display_name": "Test sample",
  "samples": [
    {
      "name": "string",
      "coverage_file_path": "/home/worker/app/data/hg002.d4",
      "case_name": "Test case",
      "analysis_date": "2023-10-04T08:22:06.980106"
    }
  ]
}'

I am getting:

{"detail":"[{\"type\":\"enum\",\"loc\":[\"build\"],\"msg\":\"Input should be 'GRCh37' or 'GRCh38'\",\"input\":null,\"ctx\":{\"expected\":\"'GRCh37' or 'GRCh38'\"},\"url\":\"https://errors.pydantic.dev/2.8/v/enum\"},{\"type\":\"enum\",\"loc\":[\"interval_type\"],\"msg\":\"Input should be 'genes', 'transcripts', 'exons' or 'custom_intervals'\",\"input\":null,\"ctx\":{\"expected\":\"'genes', 'transcripts', 'exons' or 'custom_intervals'\"},\"url\":\"https://errors.pydantic.dev/2.8/v/enum\"},{\"type\":\"list_type\",\"loc\":[\"samples\"],\"msg\":\"Input should be a valid list\",\"input\":null,\"url\":\"https://errors.pydantic.dev/2.8/v/list_type\"}]"}

I'll continue digging, but suspecting this is something obvious :)

@northwestwitch (Member Author) commented Sep 26, 2024

> Hmm, has the expected format for the report end-point changed?
>
> This command has worked for me before:

curl -X 'POST' \
  'http://localhost:9000/report' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "build": "GRCh38",
  "hgnc_gene_ids": [2933,6842,10447,11947],
  "interval_type": "genes",
  "default_level": 10,
  "panel_name": "Custom panel",
  "case_display_name": "Test sample",
  "samples": [
    {
      "name": "string",
      "coverage_file_path": "/home/worker/app/data/hg002.d4",
      "case_name": "Test case",
      "analysis_date": "2023-10-04T08:22:06.980106"
    }
  ]
}'

Yes, I removed support for JSON requests, so the endpoint accepts only x-www-form-urlencoded form data (the same requests that are sent from Scout). I did this with the idea that the "coverage" endpoints should accept JSON instead. It makes maintenance easier in the long run, I think.

Could you try this request instead?

curl -X 'POST' \
  'http://localhost:9000/report' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/x-www-form-urlencoded' \
  -d 'build=GRCh38' \
  -d 'hgnc_gene_ids=2933' \
  -d 'hgnc_gene_ids=6842' \
  -d 'hgnc_gene_ids=10447' \
  -d 'hgnc_gene_ids=11947' \
  -d 'interval_type=genes' \
  -d 'default_level=10' \
  -d 'panel_name=Custom panel' \
  -d 'case_display_name=Test sample' \
  -d 'samples[0][name]=string' \
  -d 'samples[0][coverage_file_path]=/home/worker/app/data/hg002.d4' \
  -d 'samples[0][case_name]=Test case' \
  -d 'samples[0][analysis_date]=2023-10-04T08:22:06.980106' 

@Jakob37 (Collaborator) commented Sep 26, 2024

Thanks! That put me on the right track. This worked:

curl -X POST http://localhost:9000/report \
  -H "Accept: application/json" \
  -H "Content-Type: application/x-www-form-urlencoded" \
  -d "build=GRCh38" \
  -d "hgnc_gene_ids=2933,6842,10447,11947" \
  -d "interval_type=genes" \
  -d "default_level=10" \
  -d "panel_name=Custom panel" \
  -d "case_display_name=Test sample" \
  -d 'samples=[{"name":"string","coverage_file_path":"/home/worker/app/data/hg002.d4","case_name":"Test case","analysis_date":"2023-10-04T08:22:06.980106"}]'
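For reference, the same form-encoded body can be built in Python with just the standard library. This is a sketch mirroring the working curl command above; the endpoint format and field values are copied from it, not verified elsewhere:

```python
import json
from urllib.parse import urlencode

# Mirror the working curl request: plain form fields plus a
# JSON-encoded `samples` list in a single form field.
samples = [{
    "name": "string",
    "coverage_file_path": "/home/worker/app/data/hg002.d4",
    "case_name": "Test case",
    "analysis_date": "2023-10-04T08:22:06.980106",
}]
payload = urlencode({
    "build": "GRCh38",
    "hgnc_gene_ids": "2933,6842,10447,11947",
    "interval_type": "genes",
    "default_level": 10,
    "panel_name": "Custom panel",
    "case_display_name": "Test sample",
    "samples": json.dumps(samples),
})
# POST `payload` with Content-Type: application/x-www-form-urlencoded
```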

@northwestwitch (Member Author):

Ah, I just found the bug. d4tools changes the order of the provided intervals and writes a result file sorted by chrom, start and stop, so I have to take care of this. Will fix!
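The actual fix is in the pushed commits; conceptually, one way to handle the reordering is to match results back to the requested intervals by key rather than by position (a sketch with assumed names, not the actual Chanjo2 code):

```python
def map_stats_to_query_order(query_intervals, d4_result_lines):
    """d4tools sorts its output by chrom/start/stop, so look up each
    originally requested interval by key instead of relying on order."""
    by_key = {}
    for line in d4_result_lines:
        chrom, start, end, *values = line.split("\t")
        by_key[(chrom, int(start), int(end))] = [float(v) for v in values]
    return [by_key[interval] for interval in query_intervals]

ordered = map_stats_to_query_order(
    [("chr", 3, 9), ("chr", 0, 2)],
    ["chr\t0\t2\t0.000e0", "chr\t3\t9\t1.000"],
)
```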


sonarcloud bot commented Sep 26, 2024

@northwestwitch (Member Author) commented Sep 26, 2024

Ok, the fix that I've just pushed covers the BRAFP1 problem described above. The gene now shows:

[screenshot]

But the number of fully covered genes is still different from the main branch. I'll debug some more 🤔

@Jakob37 (Collaborator) commented Sep 26, 2024

I have the new version up and running now on our dev server.

Looking at a previous case, this does not look right 🤔

[screenshot: chanjo2_test]

Edit: Numbers are low (and plausible) for a single gene, but go crazy when two genes are studied at the same time.

I'll see if I can tell whether it comes from d4tools or Chanjo2.

@northwestwitch (Member Author):

> Edit: Numbers are low (and plausible) for a single gene, but go crazy when two genes are studied at the same time.

Whoa, that's a lot of coverage! 😆
I get more or less the same values as before; the only number slightly changed is the fully covered genes. Are you sure you are on this branch and its latest version? I had a couple of other branches where I started to address this change that are now gone.

@Jakob37 (Collaborator) commented Sep 26, 2024

Hmmm, I will check, but I think I am in the right place. Also, I get some concerning numbers from d4tools itself:

worker@8f332e90467a:~/app$ d4tools stat -s perc_cov=10,15,20,50,100 --region /tmp/tmpy6598chg /fs1/results_dev/wgs/cov/<id>-d4_wgs_1_coverage.d4 -H
#Chr    Start       End         10x     15x     20x     50x     100x
12      21797389    21942543    0.123   0.113   0.102   0.010   0.000e0
17      7217125     7225266     339.552 339.512 339.474 0.179   0.000e0

@northwestwitch (Member Author):

> worker@8f332e90467a:~/app$ d4tools stat -s perc_cov=10,15,20,50,100 --region /tmp/tmpy6598chg /fs1/results_dev/wgs/cov/-d4_wgs_1_coverage.d4 -H
> #Chr Start End 10x 15x 20x 50x 100x
> 12 21797389 21942543 0.123 0.113 0.102 0.010 0.000e0
> 17 7217125 7225266 339.552 339.512 339.474 0.179 0.000e0

Ok. Are you using the latest main branch of d4tools? Perhaps you didn't update or something?

@Jakob37 (Collaborator) commented Sep 26, 2024

> worker@8f332e90467a:~/app$ d4tools stat -s perc_cov=10,15,20,50,100 --region /tmp/tmpy6598chg /fs1/results_dev/wgs/cov/-d4_wgs_1_coverage.d4 -H
> #Chr Start End 10x 15x 20x 50x 100x
> 12 21797389 21942543 0.123 0.113 0.102 0.010 0.000e0
> 17 7217125 7225266 339.552 339.512 339.474 0.179 0.000e0
>
> Ok. Are you using the latest main branch of d4tools? Perhaps you didn't update or something?

The command above is executed inside the Chanjo2 container, so if the container is up to date, the d4tools binary is too.

I'll test around some more.

@Jakob37 (Collaborator) commented Sep 26, 2024

Hmm, running the same ranges for other datasets, things look fine. The one above is one of the first d4 files I created, so maybe something was weird back then.

Looks fine for a whole bunch of others:

12      21797389        21942543        0.122   0.111   0.099   0.006   0.000e0
17      7217125 7225266 0.910   0.864   0.809   0.149   0.000e0
12      21797389        21942543        0.123   0.111   0.101   0.006   0.000e0
17      7217125 7225266 0.918   0.873   0.820   0.228   0.000e0
12      21797389        21942543        0.124   0.114   0.103   0.008   0.000e0
17      7217125 7225266 0.912   0.874   0.824   0.135   0.000e0
12      21797389        21942543        0.132   0.125   0.119   0.079   0.000e0
17      7217125 7225266 0.942   0.927   0.898   0.700   0.000e0
12      21797389        21942543        0.127   0.118   0.108   0.025   0.000e0
17      7217125 7225266 0.923   0.886   0.849   0.239   0.000e0
12      21797389        21942543        0.117   0.102   0.089   5.511e-5        0.000e0
17      7217125 7225266 0.884   0.816   0.736   0.000e0 0.000e0
12      21797389        21942543        0.124   0.113   0.102   0.007   0.000e0
17      7217125 7225266 0.903   0.849   0.794   0.096   0.000e0
12      21797389        21942543        0.994   0.993   0.990   0.212   0.000e0
17      7217125 7225266 1.000   0.983   0.959   0.284   0.000e0

I'll continue testing, but slightly less concerned now.

@Jakob37 (Collaborator) commented Sep 26, 2024

OK, I could reproduce it for that specific d4 file after compiling and running the latest version of d4tools locally.

But I cannot reproduce it for any other files.

I suspect I indexed the d4 files at the beginning in our pipeline, and that this is related to that (38/d4-format#80). Later I removed the indexing, which fits with all later samples looking OK.

Conclusion: I don't think it is something of concern here.

@Jakob37 (Collaborator) commented Sep 26, 2024

More testing. Here is the report for a real case:

[screenshot]

Inside the Chanjo2 container, I can confirm that the numbers of fully covered genes check out:

$ d4tools stat -s perc_cov=10,15,20,50,100 --region /tmp/tmpwsbck5aw /access/wgs/cov/<ID>_coverage.d4 > /tmp/out.txt
$ wc -l /tmp/out.txt
71
$ cat /tmp/out.txt | cut -f4 | grep -v "1.000" | wc -l
34
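The shell pipeline above counts intervals whose 10x column is not 1.000; the equivalent check in Python (illustrative only) would be:

```python
def count_fully_covered(stat_lines, column=3):
    """Count intervals whose completeness at the default threshold
    (column 4, index 3) equals 1.0 - the complement of
    `cut -f4 | grep -v "1.000" | wc -l`."""
    fully = sum(1 for line in stat_lines
                if float(line.split("\t")[column]) == 1.0)
    return fully, len(stat_lines)
```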

@northwestwitch (Member Author) commented Sep 26, 2024

> Inside the Chanjo2 container, I can confirm that the numbers of fully covered genes check out:

That's also what I'm seeing with our demo case. Looks like the 100% covered genes are the same in chanjo2 and d4tools.

@northwestwitch (Member Author):

And it's faster right?

@Jakob37 (Collaborator) commented Sep 26, 2024

> And it's faster right?

I timed a case now with 680 IDs. It loaded the page in ~40 seconds. (This is with 5 thresholds though, which might be a bit greedy :) )

Running just the d4 part, it finished in 6 seconds, which is much faster than before.

So it looks like the bottleneck has shifted somewhere else now.

But I think ~40 seconds will be an acceptable loading time for the geneticists ...

@Jakob37 (Collaborator) commented Sep 26, 2024

I think it looks great, looking forward to finally putting this into action 👌

@northwestwitch (Member Author) commented Sep 26, 2024

> But I think ~40 seconds will be an acceptable loading time for the geneticists ...

I think so too. And I know there is room for improvement. This PR was meant to be a non-invasive, non-problematic alternative to the code we have on the main branch. Shall I merge?

@Jakob37 (Collaborator) commented Sep 26, 2024

Yes, let's merge 🎉

@Jakob37 self-requested a review September 26, 2024 14:45
@Jakob37 (Collaborator) left a review comment:


Worked nicely in testing. Didn't see anything to mention when looking through the code.
Nice 👌

@northwestwitch merged commit 000f53e into main Sep 27, 2024
7 checks passed
@northwestwitch deleted the remove_completeness_multitasker branch October 8, 2024 08:48
Development

Successfully merging this pull request may close these issues.

Improve performance of get_coverage_completeness.py
3 participants