Ploidy fixes #135

Merged
merged 15 commits into from
Sep 26, 2023

Conversation

stschiff
Member

A new feature in validate in version 1.4.0.0 checks whether Genotype_Ploidy from the Janno file is consistent with genotype data.

When running this on the community-archive, I noticed that many samples did indeed have heterozygous genotypes, even though they were marked as haploid in the Janno file. I have updated the respective packages. There could be some more cases, as I have only run the check on the usual first 100 SNPs, but I think I have caught most of them.
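For readers unfamiliar with the check, here is a minimal Python sketch of the idea, not trident's actual implementation. It assumes EIGENSTRAT-style genotype codes (0, 1, 2, 9, with 1 meaning heterozygous) and that Genotype_Ploidy holds the string "haploid" or "diploid". The function name and the input dictionaries are hypothetical.

```python
from typing import Dict, List, Optional

# Assumption: genotypes use EIGENSTRAT-style codes, where "1" is a heterozygous call.
HET = "1"


def haploid_with_hets(
    ploidy: Dict[str, str],
    genotypes: Dict[str, str],
    max_snps: Optional[int] = 100,
) -> List[str]:
    """Return the IDs of samples marked haploid that carry a heterozygous call.

    ploidy maps a sample ID to its Genotype_Ploidy value; genotypes maps a
    sample ID to a string with one genotype code per SNP. Only the first
    max_snps SNPs are inspected; pass None to inspect all of them.
    """
    flagged = []
    for sample_id, calls in genotypes.items():
        if ploidy.get(sample_id) != "haploid":
            continue  # diploid or unannotated samples may legitimately be heterozygous
        window = calls if max_snps is None else calls[:max_snps]
        if HET in window:
            flagged.append(sample_id)
    return flagged
```

With a limit on the number of SNPs, a haploid-marked sample whose first heterozygous call lies beyond the window goes unnoticed, which is why a scan of the full genotype data is the stricter test.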

@nevrome
Member

nevrome commented Sep 20, 2023

Brilliant! I love it when the validation helps to make the data better.

I also ran validate with --fullGeno on the entire dataset and did not find any more packages with this issue. You seem to have caught all of them by looking at the first 100 SNPs.

Why did you also remove/set to n/a the Capture_Type column for some of the affected .janno files?

@stschiff
Member Author

Thanks, good to know that --fullGeno also didn't report more.

Well, the Capture_Type issue just came up along the way: for modern samples we don't use capture, so the field should be set to n/a. I don't want to make this a general rule, though, since one could in principle sequence a modern-day genome with capture. It's just that it typically isn't done.

@nevrome
Member

nevrome commented Sep 20, 2023

Good catch! Did you search for that systematically as well? Or did you just fix it when it came up together with the ploidy issue?

@stschiff
Member Author

Hmm, I think I did not, no. Not sure exactly now.

@nevrome
Member

nevrome commented Sep 20, 2023

I think this would be the relevant query, right?

qjanno "SELECT Poseidon_ID,Capture_Type FROM d(.) WHERE Date_Type = 'modern' AND Capture_Type IS NOT NULL"

Which reminds me that qjanno should add a column with the path to the .janno file a given sample comes from, or even the name of the package. The R package already has a feature like this. It would make it much easier to determine which packages are affected here. I think I'll squeeze this into poseidon-framework/qjanno#4.
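Until such a column exists, the same question can be answered with a short script. The following is a rough Python sketch, not part of qjanno or trident. It assumes that .janno files are tab-separated with a header row and that missing values are written as n/a. The function name and the `allowed` parameter are hypothetical.

```python
import csv
from pathlib import Path


def modern_with_capture(root, allowed=("n/a", "")):
    """List modern samples whose Capture_Type is not among the allowed values.

    Walks root recursively for .janno files and returns tuples of
    (Poseidon_ID, Capture_Type, path to the .janno file).
    """
    hits = []
    for janno in sorted(Path(root).rglob("*.janno")):
        with janno.open(newline="", encoding="utf-8") as handle:
            # Assumption: .janno files are tab-separated with one header row.
            for row in csv.DictReader(handle, delimiter="\t"):
                if row.get("Date_Type") != "modern":
                    continue
                capture = (row.get("Capture_Type") or "").strip()
                if capture not in allowed:
                    hits.append((row.get("Poseidon_ID"), capture, str(janno)))
    return hits
```

Calling `modern_with_capture(".")` from the root of the archive would list each affected sample together with its source file. Values that are acceptable for modern data can be added to `allowed` to keep them out of the report.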

@stschiff
Member Author

Actually, Shotgun is OK for modern data, e.g. in the case of 1000 Genomes data. I will check again which packages need updating in that respect.

@stschiff
Member Author

OK, I think I've got them all.

@stschiff
Member Author

So, to summarise, I've changed the Genotype_Ploidy where necessary. There is one remaining issue with one of the 1000 Genomes samples, which has heterozygotes in the genotype data even though it should be a pseudo-haploid pulldown. We need to check whether that's an error that came from the AADR.

I also fixed a lot of modern samples that had Capture_Type set to OtherCapture, which in most cases should be n/a (because there is no capture).

@AyGhal do you want to approve this quickly? Because our releases of trident-1.4.0.0 and, in particular, xerxes-1.0.0.0 rely on this change, I'd appreciate quick feedback. Otherwise I will take the risk and merge this in myself. No pressure.
