Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Value Error in prune_ngsLD.py #43

Closed
TrishikaS opened this issue Feb 16, 2023 · 16 comments
Closed

Value Error in prune_ngsLD.py #43

TrishikaS opened this issue Feb 16, 2023 · 16 comments

Comments

@TrishikaS
Copy link

Hi again:)

I am trying now to run prune_ngsLD.py script, but I get an error:

Checking if file is gzipped...
Reading in data...
Filtering edges by distance...
Filtering edges by weight...
Beginning pruning by dropping heaviest position...
Starting with 77988 positions with 393573 edges between them...
Traceback (most recent call last):
File "/home/ubuntu/ngsLD/scripts/prune_ngsLD.py", line 160, in
map_property_values(G.ep["weight"], edge_weight, lambda x: int(x * weight_precision))
File "/home/ubuntu/anaconda3/envs/ngsLD_N/lib/python3.11/site-packages/graph_tool/init.py", line 1191, in map_property_values
libcore.property_map_values(u._Graph__graph,
File "/home/ubuntu/ngsLD/scripts/prune_ngsLD.py", line 160, in
map_property_values(G.ep["weight"], edge_weight, lambda x: int(x * weight_precision))
^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: cannot convert float NaN to integer


Would you have any suggestion and ideas how to fix this issue?

Thanks in advance

@fgvieira
Copy link
Owner

Do you have any missing values on your input file?

@TrishikaS
Copy link
Author

Yes, I generate LD_input file from beagle file. And I guess I do have some NaNs in there.
Should I and how, remove them?

@TrishikaS
Copy link
Author

If this helps, these are my steps ...

angsd -GL 2 -out $OUT/chr25_gl -minMapQ 30 -minQ 20 -nThreads 10 -doGlf 2 -doMajorMinor 1 -SNP_pval 1e-6 -doMaf 1 -bam $BAM\

#2.make LD input file

ngsLD --geno chr25_gl.beagle.gz --probs --ignore_miss_data --rnd_sample 0.05 --seed 1 --n_ind 20 --n_sites 78000 --pos sites --out LD_input

#3. run py script for pruning

prune_ngsLD.py --input LD_input --max_dist 50000 --min_weight 0.1 --out testLD_unlinked.pos

@fgvieira
Copy link
Owner

Can you paste here some lines where you have missing data?

@TrishikaS
Copy link
Author

25:449 25:3558 3109 0.135633 -0.000000 nan nan
25:449 25:4145 3696 0.015543 -0.023482 0.999814 0.035279
25:449 25:4525 4076 0.006558 -0.017165 0.224540 0.007366
25:449 25:5174 4725 0.169403 0.169219 0.999968 0.999910
25:449 25:6389 5940 0.032531 -0.103020 0.999272 0.227801
25:449 25:7569 7120 0.153657 -0.080698 1.000000 0.157663
25:449 25:7800 7351 0.121338 -0.125412 0.999920 0.310249
25:449 25:14675 14226 0.135194 -0.145148 0.999973 0.408850
25:449 25:15575 15126 0.133985 -0.145086 0.999972 0.408497
25:449 25:30626 30177 0.024597 -0.000000 nan nan

@fgvieira
Copy link
Owner

fgvieira commented Feb 16, 2023

I suspect that the problem is that some of those sites are not variable. If so, you can try increasing the option --min_maf 0.03.

If not, can you send me the information for sites 25:449 and 25:3558 on the maf and beagle files?

@TrishikaS
Copy link
Author

Thanks for your reply,
I just tried it out with --min_maf 0.03 and it gave the same error. When I remove lines with nan (around 10% of all sites) the script works just fine.
So here is the information for sites 25:449 and 25:3558

Mafs
25 449 T C 0.269521 4.790291e-08 7

Beagle
25_449 3 1 0.333333 0.333333 0.333333 0.333333 0.333333 0.333333 0.666622 0.333333 0.000044 0.333333 0.333333 0.333333 0.000044 0.333333 0.666622 0.333333 0.333333 0.333333 0.333333 0.333333 0.333333 0.333333 0.333333 0.333333 0.666622 0.333333 0.000044 0.333333 0.333333 0.333333 0.333333 0.333333 0.333333 0.665962 0.333333 0.000704 0.333333 0.333333 0.333333 0.333333 0.333333 0.333333 0.333333 0.333333 0.333333 0.799979 0.200021 0.000000 0.000044 0.333333 0.666622 0.333333 0.333333 0.333333 0.333333 0.333333 0.333333 0.665962 0.333333 0.000704

Mafs
25 3558 C T 0.285733 4.107843e-08 7

Beagle
25_3558 1 3 0.333333 0.333333 0.333333 0.333333 0.333333 0.333333 0.333333 0.333333 0.333333 0.333333 0.333333 0.333333 0.666622 0.333333 0.000044 0.000044 0.333333 0.666622 0.000044 0.333333 0.666622 0.333333 0.333333 0.333333 0.333333 0.333333 0.333333 0.666622 0.333333 0.000044 0.666622 0.333333 0.000044 0.333333 0.333333 0.333333 0.333333 0.333333 0.333333 0.665962 0.333333 0.000704 0.333333 0.333333 0.333333 0.333333 0.333333 0.333333 0.666622 0.333333 0.000044 0.333333 0.333333 0.333333 0.333333 0.333333 0.333333 0.333333 0.333333 0.333333

@fgvieira
Copy link
Owner

What is the coverage of these samples?
Can you try running ngsLD with --min_maf 0.03 but without --ignore_miss_data.

@TrishikaS
Copy link
Author

The coverage is ~10x and we want to run analysis also with 4x and 2x coverage in the future.
I tried running ngsLD without --ignore_miss_data, and again the same error appeared:(

Checking if file is gzipped...
Reading in data...
Filtering edges by distance...
Filtering edges by weight...
Beginning pruning by dropping heaviest position...
Starting with 76695 positions with 395834 edges between them...
Traceback (most recent call last):
File "/home/ubuntu/ngsLD/scripts/prune_ngsLD.py", line 160, in
map_property_values(G.ep["weight"], edge_weight, lambda x: int(x * weight_precision))
File "/home/ubuntu/anaconda3/envs/ngsLD_N/lib/python3.11/site-packages/graph_tool/init.py", line 1191, in map_property_values
libcore.property_map_values(u._Graph__graph,
File "/home/ubuntu/ngsLD/scripts/prune_ngsLD.py", line 160, in
map_property_values(G.ep["weight"], edge_weight, lambda x: int(x * weight_precision))

@fgvieira
Copy link
Owner

But with 10x coverage, how do you have so much missing data (those two sites have 70% missing data)?! Is it whole genome or targeted sequencing?

@TrishikaS
Copy link
Author

yeah 70% seems a lot. It's whole genome sequencing, I used only one chromosome just to check if the script is running ok.

@fgvieira
Copy link
Owner

Could it be some repetitive regions? If so, maybe it is good to use a minInd when running angsd.

@TrishikaS
Copy link
Author

So, I tested with -minind 10 (I have 20 ind overall), and then I ran ngsLD with min_maf 0.03 and no missing data, and I still get the same error... Quick explanation: Those 20 samples are from 2 populations (10 ind per pop) with Fst 2%. The species I'm working with has low diversity anyways.

Despite this issue, should the py script ignore NaNs? Is there some way to circumvent this issue? When I deleted lines with nan, the py script worked just fine.

@fgvieira
Copy link
Owner

It should be ok to remove NaNs, but I am just a bit surprised by the amount of missing data on samples that were sequenced at 10x.

Just noticed that it seems you run angsd with just one BAM? But then, how do you get the beagle?

@TrishikaS
Copy link
Author

Ah sorry, I guess it got lost in copy pasting, I have a bamlist file with all 20 bam files:)
some samples have the coverage of ~8x, and some go up to 15x.
If I may ask, what amount of missing data would you expect for samples at 10x? (I'll soon do the analysis with more populations sequenced at 10x and 15x, so I'll check if I get less miss_data)
I'll run the analysis now with the whole genome of these 2 pop., and remove NaNs.

Thanks a lot for your quick replies!

@fgvieira
Copy link
Owner

fgvieira commented Mar 20, 2023

If you used angsd with -minInd 10 and have 20 individuals, you should not have more than 50% missing data, no?
How does your samples have 70%?

Can you also paste here the extended output of some sites with missing data?

I also just made a faster alternative to the pruning scripts; do you think you could give it a try and see how it goes?

@fgvieira fgvieira closed this as completed Jan 9, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants