Add package 2022_Lazaridis_SouthernArk #209

Open
wants to merge 5 commits into
base: master

Contributor

@93Boy 93Boy commented Aug 23, 2024

PR Checklist for a new package submission

  • The package does not exist already in the community archive, also not with a different name.
  • The package title in the POSEIDON.yml conforms to the general title structure suggested here: <Year>_<Last name of first author>_<Region, time period or special feature of the paper>, e.g. 2021_Zegarac_SoutheasternEurope, 2021_SeguinOrlando_BellBeaker or 2021_Kivisild_MedievalEstonia.
  • The package is stored in a directory that is named like the package title.

  • The package is complete and features the following elements:
    • Genotype data in binary PLINK format (not EIGENSTRAT format).
    • A POSEIDON.yml file with not just the file-referencing fields, but also the following meta-information fields present and filled: poseidonVersion, title, description, contributor, packageVersion, lastModified (see here for their definition)
    • A reasonably filled .janno file (for a list of available fields look here and here for more detailed documentation about them).
    • A .bib file with the necessary literature references for each sample in the .janno file.
  • Every file in the submission is correctly referenced in the POSEIDON.yml file and there are no additional, supplementary files in the submission that are not documented there.
  • Genotype data, .janno and .bib file are all named after the package title and only differ in the file extension.
  • The package version in the POSEIDON.yml file is 1.0.0.
  • The poseidonVersion of the package in the POSEIDON.yml file is set to the latest version of the Poseidon schema.
  • The POSEIDON.yml file contains the corresponding checksums for the fields genoFile, snpFile, indFile, jannoFile and bibFile.
  • There is either no CHANGELOG file or one with a single entry for version 1.0.0.

  • The Publication column in the .janno file is filled and the respective .bib file has complete entries for the listed keys.
  • The .janno file does not include any empty columns or columns only filled with n/a.
  • The order of columns in the .janno file adheres to the standard order as defined in the Poseidon schema here.
  • The .janno and the .ssf files are not fully quoted, so they only use single- or double quotes ("...", '...') to enclose text fields where it is strictly necessary (i.e. their entry includes a TAB).

  • The package passes a validation with trident validate --fullGeno.

  • Large genotype data files are properly tracked with Git LFS and not directly pushed to the repository. For an instruction on how to set up Git LFS please look here. If you accidentally pushed the files the wrong way you can fix it with git lfs migrate import --no-rewrite path/to/file.bed (see here).

93Boy added 5 commits August 22, 2024 13:37
github.com:poseidon-framework/community-archive into
2022_Lazaridis_SouthernArk
Updating branch
@stschiff
Member

Just a quick comment: the ind file has >5000 samples, the .janno file >700. They must have the same number of entries!
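
A count mismatch like this can be checked by comparing the sample IDs on both sides. The following is a minimal sketch, not part of trident. It assumes the genotype sample file is whitespace-delimited with the sample ID in the first column (EIGENSTRAT `.ind`) or the second column (PLINK `.fam`, use `id_column=1`), and that the `.janno` file is tab-separated with a `Poseidon_ID` header column. The function names are hypothetical.

```python
import csv


def read_ind_ids(lines, id_column=0):
    """Collect sample IDs from the lines of a whitespace-delimited sample file.

    Assumption: id_column=0 fits an EIGENSTRAT .ind file, id_column=1 a PLINK .fam file.
    """
    ids = []
    for line in lines:
        fields = line.split()
        if fields:  # skip blank lines
            ids.append(fields[id_column])
    return ids


def read_janno_ids(lines, id_field="Poseidon_ID"):
    """Collect sample IDs from the lines of a tab-separated .janno file."""
    reader = csv.DictReader(lines, delimiter="\t")
    return [row[id_field] for row in reader]


def compare_ids(geno_ids, janno_ids):
    """Report which IDs occur on only one side and which are shared."""
    geno, janno = set(geno_ids), set(janno_ids)
    return {
        "only_in_genotypes": sorted(geno - janno),
        "only_in_janno": sorted(janno - geno),
        "shared": sorted(geno & janno),
    }
```

Both readers accept any iterable of lines, so an open file handle can be passed directly, for example `read_janno_ids(open(path, newline=""))` with a path of your choice.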

@stschiff
Member

stschiff commented Sep 2, 2024

Have you checked this, @93Boy ?

@93Boy
Contributor Author

93Boy commented Sep 3, 2024

I have gone through the paper and the supplementary materials. The publication also has 778 entries, so I compared the genotype data with the supplementary data. I have attached the analysis here; could you kindly check it? I also checked around 30 random IDs for a match in our Poseidon database, but did not get a positive match for any of them either.
SouthernArc_mismatches.csv

@stschiff
Member

stschiff commented Sep 4, 2024

I think you've taken in genotype data which contains way more data than just the newly published individuals from within that study! You will have to either extract the correct individuals from the AADR, or if you want to use the package provided on David's website, you will have to extract the correct individuals from there.
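
One way to prepare such an extraction is to turn the list of published sample IDs into a selection file for `trident forge`. The sketch below assumes the forge selection syntax in which an individual is written as `<ID>` and entries in a forge file may be separated by newlines. The function name is hypothetical, and the ID list would have to come from the paper's supplementary table.

```python
def forge_entities(sample_ids):
    """Format sample IDs as individual entities (<ID>), one per line.

    Empty entries and repeated IDs are dropped; the first-seen order is kept.
    """
    seen = set()
    lines = []
    for sample_id in sample_ids:
        sample_id = sample_id.strip()
        if sample_id and sample_id not in seen:
            seen.add(sample_id)
            lines.append(f"<{sample_id}>")
    return "\n".join(lines) + "\n"
```

The returned text can be written to a file and passed to `trident forge` via `--forgeFile`. IDs that are absent from the source package will still cause an error, so comparing the ID lists beforehand is advisable.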

@93Boy
Contributor Author

93Boy commented Sep 4, 2024

I downloaded the genotype data directly from the Reich lab website, so I will filter out the rest.

@nevrome changed the title from "2022 lazaridis southern ark" to "Add package 2022_Lazaridis_SouthernArk" on Sep 6, 2024
@93Boy
Contributor Author

93Boy commented Sep 12, 2024

I tried to use trident forge on the genotype data available on the Reich lab website, but it failed because multiple IDs were not present in the genotype data. Then I tried to forge them from AADR v54, but it threw the same error. May I manually remove those entries from my .janno file? These are the IDs that are not available in the genotype data:
Screenshot from 2024-09-12 23-13-48

@stschiff
Member

Has there been an update on this one? I think we discussed that merging isn't necessary, right? You can just extract the data from the AADR.

@93Boy
Contributor Author

93Boy commented Sep 23, 2024

Hello Stephan, sorry for the late response. I encountered another mismatch when fetching the data from the Poseidon AADR: it returned 1566 entries. Almost all entries appear to be duplicated relative to the source data. I have attached a small analysis of the duplicate values and their number of occurrences. Can you tell me what I should do with these duplicates? There is also a mismatch with the publication and AADR v54.1, which contain 736 unique IDs: when I remove the duplicates from the Poseidon AADR file, 786 unique values remain.
duplicate_values.txt
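
A duplicate analysis of this kind can be reproduced with a few lines of Python. This is a minimal sketch assuming the sample IDs have already been read into a list; it is not the script behind the attached file, and the function names are hypothetical.

```python
from collections import Counter


def duplicate_counts(sample_ids):
    """Map every ID that occurs more than once to its number of occurrences."""
    counts = Counter(sample_ids)
    return {sample_id: n for sample_id, n in counts.items() if n > 1}


def summarize(sample_ids):
    """Return the total entry count, the unique ID count and the duplicated IDs."""
    ids = list(sample_ids)
    return {
        "total": len(ids),
        "unique": len(set(ids)),
        "duplicated": duplicate_counts(ids),
    }
```

Comparing `total` with `unique` shows how many entries are redundant, and `duplicated` lists which IDs are affected.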

@stschiff
Member

OK, I see. I will have to take a look at this, which will not happen this week.

@stschiff
Member

stschiff commented Oct 8, 2024

@AyGhal will take a look at this one.

@stschiff
Member

stschiff commented Dec 3, 2024

@AyGhal any update?
