Adjusting genomescope parameters for long reads #180

Open
bioannap opened this issue Dec 5, 2024 · 2 comments

Comments

@bioannap

bioannap commented Dec 5, 2024

Dear developers,

We are trying to use GenomeScope and smudgeplot to infer the ploidy of a non-model plant. We expect it to be polyploid, but we don't have any evidence for that yet.

We have long-read data generated on a PacBio Revio.

To create the .histo file for GenomeScope, we used the recommended parameters:

$ kmc -k21 -t10 -m64 -ci1 -cs10000 myrawreads.fastq reads tmp/
$ kmc_tools transform reads histogram reads.histo -cx10000

For visualization we used the online platform http://genomescope.org/, setting k-mer length = 21, Ploidy = 2 (only because we don't know the ploidy), Max k-mer coverage = -1, and Average k-mer coverage = -1.

And this is the result:
[Image: GenomeScope plot, ploidy = 2]

With ploidy = 4 instead, the result is this:
[Image: GenomeScope plot, ploidy = 4]

The non-log (linear) plot doesn't seem to have any peak, and we don't understand how to interpret the log-scale plot. Also, the model fit is about 0%. Would you suggest using different parameters for long reads?

Thank you very much in advance!

@KamilSJaron
Owner

Hi @bioannap, so sorry for the very slow response; this issue somehow slipped through the cracks (I am usually good about not marking issues as read if I haven't responded).

Long reads are totally fine, but what kind of long reads are we talking about? HiFi, duplex, or corrected nanopore reads are quite alright, but older long reads can be a bit messy. Nevertheless, your dataset... looks a bit funny; I don't understand why the non-log version is so... error dominated. I would load the spectrum in R and replot it manually to see how it looks on a non-log scale when the y axis is sanely scaled (you can exclude the first 40x of coverage; your genome has 1n coverage of 80x anyway, so you won't exclude any of the genomic k-mers). Alternatively, you can make the same cut in your histogram file and re-upload it to the web server; I am sure it will show something more... reasonable.
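
For readers who want to try the histogram cut described above without R, here is a minimal shell sketch. It assumes the two-column layout that `kmc_tools transform ... histogram` writes (coverage, then number of distinct k-mers). The file names `toy.histo` and `toy.trimmed.histo` and the numbers in them are made up for illustration; with real data you would point `awk` at `reads.histo` instead. Whether the web server accepts a histogram that no longer starts at coverage 1 is not confirmed in this thread, so treat the trimmed file mainly as a way to inspect the spectrum.

```shell
#!/bin/sh
# Toy histogram standing in for reads.histo (made-up numbers).
# Column 1: k-mer coverage, column 2: number of distinct k-mers.
printf '1\t900000\n2\t50000\n39\t120\n40\t150\n80\t4000\n160\t300\n' > toy.histo

# Cutoff taken from the suggestion above: drop everything below 40x.
MIN_COV=40

# Keep only rows whose coverage is at least MIN_COV.
awk -v min="$MIN_COV" '$1 + 0 >= min + 0' toy.histo > toy.trimmed.histo

# Report the coverage holding the most k-mers among the remaining rows,
# i.e. the tallest peak once the error-dominated region is gone.
PEAK=$(awk 'NR == 1 || $2 > best { best = $2; cov = $1 } END { print cov }' toy.trimmed.histo)
echo "tallest remaining peak at ${PEAK}x"
```

On the toy data the tallest remaining peak sits at 80x. On a real spectrum the same check gives a quick sanity test of where the main coverage peak lies before any model fitting.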

Also, did you use http://genomescope.org/genomescope2/? I presume so, given that you mention trying a higher ploidy. What does the transformed plot look like? I imagine that one makes more sense, no?

@bioannap
Author

Hi @KamilSJaron thank you for your answer!

Don't worry, in the meantime I had time to practice a bit more and to try out FastK for k-mer counting.
We have PacBio HiFi reads, so I guess they are fine for GenomeScope 2.0.
As you suggested, the reason for that strange-looking plot could have been the axis scaling, which is not adjusted automatically when the histogram is visualized on the web page. I generated the plot with the command-line interface of GenomeScope 2.0 instead, and the linear plot looks much better!
Another reason could be the tool used for k-mer counting, but that would be unusual.

Here's the newly generated plot:
[Image: GenomeScope plot generated with the command-line interface]

Based on this I would say it's a diploid, but I will also run smudgeplot to be sure.

Thanks again for your help!
Anna
