
Replace completeness multitasker with d4tools perc_cov #349

Merged · 18 commits · Sep 27, 2024

Conversation

@northwestwitch (Member) commented Sep 20, 2024

Description

Added/Changed/Fixed

How to test

  • Deploy on stage and compare coverage completeness results with main branch

Expected outcome

  • Stats from using this branch should be quicker

Review

  • Tests executed by CR, Jakob37
  • "Merge and deploy" approved by Jakob37

This version is a

  • MAJOR - when you make incompatible API changes
  • MINOR - when you add functionality in a backwards compatible manner
  • PATCH - when you make backwards compatible bug fixes or documentation/instructions

@codecov-commenter commented Sep 26, 2024

Codecov Report

Attention: Patch coverage is 93.93939% with 2 lines in your changes missing coverage. Please review.

Project coverage is 91.34%. Comparing base (719f1b8) to head (4263b2e).
Report is 25 commits behind head on main.

Files with missing lines | Patch % | Lines
src/chanjo2/meta/handle_d4.py | 60.00% | 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #349      +/-   ##
==========================================
+ Coverage   90.18%   91.34%   +1.15%     
==========================================
  Files          30       30              
  Lines        1457     1467      +10     
==========================================
+ Hits         1314     1340      +26     
+ Misses        143      127      -16     
Flag | Coverage Δ
unittests | 91.34% <93.93%> (+1.15%) ⬆️

Flags with carried forward coverage won't be shown.


@northwestwitch (Member Author) commented Sep 26, 2024

Test with a scout case and OMIM-AUTO panel (4637 genes)

Main branch on stage -> report created in about 120 seconds:

[screenshot]

This test on stage -> report created in 13 seconds:

[screenshot]

So report creation is about 10 times faster when using perc_cov from d4tools. The coverage numbers are the same, but the percentage of fully covered genes differs (compare the 2 screenshots). I'll check which one is correct!

@northwestwitch (Member Author) commented Sep 26, 2024

> So report creation is about 10 times faster when using perc_cov from d4tools. The coverage numbers are the same, but the percentage of fully covered genes differs (compare the 2 screenshots). I'll check which one is correct!

I've compared the results locally with a real d4 file but fewer genes: the main branch and this branch give exactly the same results. The code for computing the percentage of fully covered genes hasn't changed in this PR, so I assume it must be the changes in the stats over the intervals. The Python method and d4tools perc_cov might give slightly different results: if, for instance, one method returns a coverage completeness of 0.9999999 instead of 1, then the gene interval will count as not fully covered. I'll test with more gene panels
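A float tolerance would be one way to make the two backends agree on "fully covered"; this is only an illustrative sketch (the helper name and tolerance are assumptions, not code from this PR):

```python
import math

def is_fully_covered(completeness: float, rel_tol: float = 1e-6) -> bool:
    """Count an interval as fully covered when its completeness is 1.0
    within a small tolerance, so 0.9999999 and 1.0 agree."""
    return completeness >= 1.0 or math.isclose(completeness, 1.0, rel_tol=rel_tol)
```

With a strict `== 1.0` comparison, a value like 0.9999999 from one backend would silently drop the gene from the fully covered count.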

@Jakob37 (Collaborator) commented Sep 26, 2024

Cool to see this moving! Close to goal now :D

One thing I wanted to ask: are you handling that the output format sometimes comes in scientific notation and sometimes in numeric?

I think in the end, the logic is like this:

  • If > 0.001, the value is printed in plain decimal format, e.g. 0.230
  • If < 0.001, the value is printed in scientific notation, e.g. 1.34e-5 (this will also apply to == 0 I think; not perfect, but I didn't have the heart to squeeze cardemich for any more updates)

@northwestwitch (Member Author):

> Are you handling that the output format sometimes comes in scientific notation and sometimes in numeric?

The output from d4tools you mean?

@Jakob37 (Collaborator) commented Sep 26, 2024

> Are you handling that the output format sometimes comes in scientific notation and sometimes in numeric?
>
> The output from d4tools you mean?

Yes. Here is how the test output looks now:

https://github.com/38/d4-format/blob/master/d4tools/test/stat/perc_cov-multiple-ranges/output.txt

chr	0	2	0.000e0	0.000e0
chr	0	8	0.625	0.375
chr	0	10	0.600	0.300
chr	1	6	0.600	0.400
chr	3	9	1.000	0.500
chr	4	5	1.000	1.000
chr	5	10	0.800	0.400

(No cases > 0 && < 0.001 here unfortunately, but those would use the e syntax as well)
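For reference, the sample output above is tab-separated, with chrom/start/end followed by one value per requested threshold, and Python's float() accepts both notations. A hypothetical parser (names are illustrative, not Chanjo2 code) could look like:

```python
def parse_perc_cov(lines, thresholds):
    """Parse d4tools `stat -s perc_cov=...` output lines: chrom, start,
    end, then one completeness value per coverage threshold."""
    results = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        # float() handles both "0.625" and "0.000e0"
        values = [float(v) for v in fields[3:]]
        results[(chrom, start, end)] = dict(zip(thresholds, values))
    return results

stats = parse_perc_cov(["chr\t0\t8\t0.625\t0.375"], [10, 20])
```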

@Jakob37 (Collaborator) commented Sep 26, 2024

> I've compared the results locally with a real d4 file but fewer genes: the main branch and this branch give exactly the same results. The code for computing the percentage of fully covered genes hasn't changed in this PR, so I assume it must be the changes in the stats over the intervals. The Python method and d4tools perc_cov might give slightly different results: if, for instance, one method returns a coverage completeness of 0.9999999 instead of 1, then the gene interval will count as not fully covered. I'll test with more gene panels

Hmm. I'll see if I can test this a bit as well.

@northwestwitch (Member Author) commented Sep 26, 2024

Mmm, in theory the scientific notation should already be taken care of on line 32 of src/chanjo2/meta/handle_completeness_stats.py, where the value gets converted to a float.
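As a quick sanity check, Python's float() parses both formats d4tools emits, so the conversion on that line should normalize the notation:

```python
# float() parses both the plain and the scientific notation d4tools emits
assert float("0.230") == 0.23
assert float("1.34e-5") == 1.34e-5
assert float("0.000e0") == 0.0
```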

@northwestwitch marked this pull request as ready for review September 26, 2024 07:32
@northwestwitch (Member Author):

Mmm, there is something weird with some genes on this branch. For instance, I have a panel with 44 genes:

Main branch report:

[screenshot]

Note that BRAFP1 is not fully covered.

Same report with this branch:

[screenshot]

BRAFP1 seems fully covered at 10X.

But it's obviously not, because when you check the coverage for just that gene you get this:

[screenshot]

@Jakob37 (Collaborator) commented Sep 26, 2024

Hmm, has the expected format for the report end-point changed?

This command has worked for me before:

curl -X 'POST' \
  'http://localhost:9000/report' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "build": "GRCh38",
  "hgnc_gene_ids": [2933,6842,10447,11947],
  "interval_type": "genes",
  "default_level": 10,
  "panel_name": "Custom panel",
  "case_display_name": "Test sample",
  "samples": [
    {
      "name": "string",
      "coverage_file_path": "/home/worker/app/data/hg002.d4",
      "case_name": "Test case",
      "analysis_date": "2023-10-04T08:22:06.980106"
    }
  ]
}'

I am getting:

{"detail":"[{\"type\":\"enum\",\"loc\":[\"build\"],\"msg\":\"Input should be 'GRCh37' or 'GRCh38'\",\"input\":null,\"ctx\":{\"expected\":\"'GRCh37' or 'GRCh38'\"},\"url\":\"https://errors.pydantic.dev/2.8/v/enum\"},{\"type\":\"enum\",\"loc\":[\"interval_type\"],\"msg\":\"Input should be 'genes', 'transcripts', 'exons' or 'custom_intervals'\",\"input\":null,\"ctx\":{\"expected\":\"'genes', 'transcripts', 'exons' or 'custom_intervals'\"},\"url\":\"https://errors.pydantic.dev/2.8/v/enum\"},{\"type\":\"list_type\",\"loc\":[\"samples\"],\"msg\":\"Input should be a valid list\",\"input\":null,\"url\":\"https://errors.pydantic.dev/2.8/v/list_type\"}]"}

I'll continue digging, but suspecting this is something obvious :)

@northwestwitch (Member Author) commented Sep 26, 2024

> Hmm, has the expected format for the report end-point changed?
>
> This command has worked for me before:

curl -X 'POST' \
  'http://localhost:9000/report' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "build": "GRCh38",
  "hgnc_gene_ids": [2933,6842,10447,11947],
  "interval_type": "genes",
  "default_level": 10,
  "panel_name": "Custom panel",
  "case_display_name": "Test sample",
  "samples": [
    {
      "name": "string",
      "coverage_file_path": "/home/worker/app/data/hg002.d4",
      "case_name": "Test case",
      "analysis_date": "2023-10-04T08:22:06.980106"
    }
  ]
}'

Yes, I removed support for JSON requests, so the endpoint accepts only x-www-form-urlencoded form data (the same requests that are sent from Scout). I did this with the idea that the "coverage" endpoints should accept JSON instead. It makes maintenance easier in the long run, I think.

Could you try this request instead?

curl -X 'POST' \
  'http://localhost:9000/report' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/x-www-form-urlencoded' \
  -d 'build=GRCh38' \
  -d 'hgnc_gene_ids=2933' \
  -d 'hgnc_gene_ids=6842' \
  -d 'hgnc_gene_ids=10447' \
  -d 'hgnc_gene_ids=11947' \
  -d 'interval_type=genes' \
  -d 'default_level=10' \
  -d 'panel_name=Custom panel' \
  -d 'case_display_name=Test sample' \
  -d 'samples[0][name]=string' \
  -d 'samples[0][coverage_file_path]=/home/worker/app/data/hg002.d4' \
  -d 'samples[0][case_name]=Test case' \
  -d 'samples[0][analysis_date]=2023-10-04T08:22:06.980106' 

@Jakob37 (Collaborator) commented Sep 26, 2024

Thanks! That put me on the right track. This worked:

curl -X POST http://localhost:9000/report \
  -H "Accept: application/json" \
  -H "Content-Type: application/x-www-form-urlencoded" \
  -d "build=GRCh38" \
  -d "hgnc_gene_ids=2933,6842,10447,11947" \
  -d "interval_type=genes" \
  -d "default_level=10" \
  -d "panel_name=Custom panel" \
  -d "case_display_name=Test sample" \
  -d 'samples=[{"name":"string","coverage_file_path":"/home/worker/app/data/hg002.d4","case_name":"Test case","analysis_date":"2023-10-04T08:22:06.980106"}]'
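For reference, the same form-encoded body can be built in Python with just the standard library. This is a sketch mirroring the working curl command above; the endpoint format and field values are copied from it, not verified elsewhere:

```python
import json
from urllib.parse import urlencode

# Mirror the working curl request: plain form fields plus a
# JSON-encoded `samples` list in a single form field.
samples = [{
    "name": "string",
    "coverage_file_path": "/home/worker/app/data/hg002.d4",
    "case_name": "Test case",
    "analysis_date": "2023-10-04T08:22:06.980106",
}]
payload = urlencode({
    "build": "GRCh38",
    "hgnc_gene_ids": "2933,6842,10447,11947",
    "interval_type": "genes",
    "default_level": 10,
    "panel_name": "Custom panel",
    "case_display_name": "Test sample",
    "samples": json.dumps(samples),
})
# POST `payload` with Content-Type: application/x-www-form-urlencoded
```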

@northwestwitch (Member Author):

Ah, I just found the bug. d4tools changes the order of the provided intervals and writes a result file sorted by chrom, start and stop, so I have to take care of this. Will fix!
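The actual fix is in the pushed commits; conceptually, one way to handle the reordering is to match results back to the requested intervals by key rather than by position (a sketch with assumed names, not the actual Chanjo2 code):

```python
def map_stats_to_query_order(query_intervals, d4_result_lines):
    """d4tools sorts its output by chrom/start/stop, so look up each
    originally requested interval by key instead of relying on order."""
    by_key = {}
    for line in d4_result_lines:
        chrom, start, end, *values = line.split("\t")
        by_key[(chrom, int(start), int(end))] = [float(v) for v in values]
    return [by_key[interval] for interval in query_intervals]

ordered = map_stats_to_query_order(
    [("chr", 3, 9), ("chr", 0, 2)],
    ["chr\t0\t2\t0.000e0", "chr\t3\t9\t1.000"],
)
```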


sonarcloud bot commented Sep 26, 2024

@northwestwitch (Member Author) commented Sep 26, 2024

Ok, the fix that I've just pushed covers the BRAFP1 problem described above. The gene now shows:

[screenshot]

But the number of fully covered genes is still different from the main branch. I'll debug some more 🤔

@Jakob37 (Collaborator) commented Sep 26, 2024

I have the new version up and running now on our dev server.

Looking at a previous case, this does not look right 🤔

[screenshot: chanjo2_test]

Edit: Numbers are low (and plausible) for a single gene, but go crazy when two genes are studied at the same time.

I'll see if I can tell whether it comes from d4tools or Chanjo2.

@northwestwitch (Member Author):

> Edit: Numbers are low (and plausible) for a single gene, but go crazy when two genes are studied at the same time.

Whoa, that's a lot of coverage! 😆
I get more or less the same values as before; the only number slightly changed is the fully covered genes. Are you sure you are on this branch and its latest version? I had a couple of other branches where I started to address this change that are now gone.

@Jakob37 (Collaborator) commented Sep 26, 2024

Hmmm, I will check, but I think I am in the right place. Also, I get some concerning numbers from d4tools itself:

worker@8f332e90467a:~/app$ d4tools stat -s perc_cov=10,15,20,50,100 --region /tmp/tmpy6598chg /fs1/results_dev/wgs/cov/<id>-d4_wgs_1_coverage.d4 -H
#Chr    Start       End         10x     15x     20x     50x     100x
12      21797389    21942543    0.123   0.113   0.102   0.010   0.000e0
17      7217125     7225266     339.552 339.512 339.474 0.179   0.000e0

@northwestwitch (Member Author):

> worker@8f332e90467a:~/app$ d4tools stat -s perc_cov=10,15,20,50,100 --region /tmp/tmpy6598chg /fs1/results_dev/wgs/cov/-d4_wgs_1_coverage.d4 -H
> #Chr Start End 10x 15x 20x 50x 100x
> 12 21797389 21942543 0.123 0.113 0.102 0.010 0.000e0
> 17 7217125 7225266 339.552 339.512 339.474 0.179 0.000e0

Ok. Are you using the latest main branch of d4tools? Perhaps you didn't update or something?

@Jakob37 (Collaborator) commented Sep 26, 2024

> worker@8f332e90467a:~/app$ d4tools stat -s perc_cov=10,15,20,50,100 --region /tmp/tmpy6598chg /fs1/results_dev/wgs/cov/-d4_wgs_1_coverage.d4 -H
> #Chr Start End 10x 15x 20x 50x 100x
> 12 21797389 21942543 0.123 0.113 0.102 0.010 0.000e0
> 17 7217125 7225266 339.552 339.512 339.474 0.179 0.000e0
>
> Ok. Are you using the latest main branch of d4tools? Perhaps you didn't update or something?

The command above is executed inside the Chanjo2 container, so if the container is up to date, the d4tools binary is too.

I'll test around some more.

@Jakob37 (Collaborator) commented Sep 26, 2024

Hmm, running the same ranges for other datasets, things look fine. The one above is one of the first d4 files I created, so maybe something was weird back then.

Looks fine for a whole bunch of others:

12      21797389        21942543        0.122   0.111   0.099   0.006   0.000e0
17      7217125 7225266 0.910   0.864   0.809   0.149   0.000e0
12      21797389        21942543        0.123   0.111   0.101   0.006   0.000e0
17      7217125 7225266 0.918   0.873   0.820   0.228   0.000e0
12      21797389        21942543        0.124   0.114   0.103   0.008   0.000e0
17      7217125 7225266 0.912   0.874   0.824   0.135   0.000e0
12      21797389        21942543        0.132   0.125   0.119   0.079   0.000e0
17      7217125 7225266 0.942   0.927   0.898   0.700   0.000e0
12      21797389        21942543        0.127   0.118   0.108   0.025   0.000e0
17      7217125 7225266 0.923   0.886   0.849   0.239   0.000e0
12      21797389        21942543        0.117   0.102   0.089   5.511e-5        0.000e0
17      7217125 7225266 0.884   0.816   0.736   0.000e0 0.000e0
12      21797389        21942543        0.124   0.113   0.102   0.007   0.000e0
17      7217125 7225266 0.903   0.849   0.794   0.096   0.000e0
12      21797389        21942543        0.994   0.993   0.990   0.212   0.000e0
17      7217125 7225266 1.000   0.983   0.959   0.284   0.000e0

I'll continue testing, but slightly less concerned now.

@Jakob37 (Collaborator) commented Sep 26, 2024

OK, I could reproduce it for that specific d4 file after compiling and running the latest version of d4tools locally.

But I cannot reproduce it for any other files.

I suspect I indexed the d4 files at the beginning in our pipeline, and that this is related to that (38/d4-format#80). Later I removed the indexing, which fits with all later samples looking OK.

Conclusion: I don't think it is something of concern here.

@Jakob37 (Collaborator) commented Sep 26, 2024

More testing. Here is the report for a real case:

[screenshot]

Inside the Chanjo2 container, I can confirm that the numbers of fully covered genes check out:

$ d4tools stat -s perc_cov=10,15,20,50,100 --region /tmp/tmpwsbck5aw /access/wgs/cov/<ID>_coverage.d4 > /tmp/out.txt
$ wc -l /tmp/out.txt
71
$ cat /tmp/out.txt | cut -f4 | grep -v "1.000" | wc -l
34
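The shell pipeline above counts intervals whose 10x column is not 1.000; the equivalent check in Python (illustrative only) would be:

```python
def count_fully_covered(stat_lines, column=3):
    """Count intervals whose completeness at the default threshold
    (column 4, index 3) equals 1.0 - the complement of
    `cut -f4 | grep -v "1.000" | wc -l`."""
    fully = sum(1 for line in stat_lines
                if float(line.split("\t")[column]) == 1.0)
    return fully, len(stat_lines)
```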

@northwestwitch (Member Author) commented Sep 26, 2024

> Inside the Chanjo2 container, I can confirm that the numbers of fully covered genes check out:

That's also what I'm seeing with our demo case. Looks like the 100% covered genes are the same in chanjo2 and d4tools.

@northwestwitch (Member Author):

And it's faster right?

@Jakob37 (Collaborator) commented Sep 26, 2024

> And it's faster right?

I timed a case now with 680 IDs. It loaded the page in ~40 seconds. (This is with 5 thresholds though, which might be a bit greedy :) )

Running just the d4 part, it finished in 6 seconds, which is much faster than before.

So it looks like the bottleneck has shifted somewhere else now.

But I think ~40 seconds will be an acceptable loading time for the geneticists ...

@Jakob37 (Collaborator) commented Sep 26, 2024

I think it looks great, looking forward to finally putting this into action 👌

@northwestwitch (Member Author) commented Sep 26, 2024

> But I think ~40 seconds will be an acceptable loading time for the geneticists ...

I think so too. And I know there is room for improvement. This PR was meant to be a non-invasive, non-problematic alternative to the code we have on the main branch. Shall I merge?

@Jakob37 (Collaborator) commented Sep 26, 2024

Yes, let's merge 🎉

@Jakob37 self-requested a review September 26, 2024 14:45
@Jakob37 (Collaborator) left a review comment:


Worked nicely in testing. Didn't see anything to mention when looking through the code.
Nice 👌

@northwestwitch merged commit 000f53e into main Sep 27, 2024
7 checks passed
@northwestwitch deleted the remove_completeness_multitasker branch October 8, 2024 08:48
Development

Successfully merging this pull request may close these issues.

Improve performance of get_coverage_completeness.py
3 participants