Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add 2021 posth sci adv #214

Merged
merged 10 commits into from
Oct 25, 2024
Merged

Add 2021 posth sci adv #214

merged 10 commits into from
Oct 25, 2024

Conversation

stschiff
Copy link
Member

@stschiff stschiff commented Oct 11, 2024

PR Checklist for a new package submission

  • The package does not exist already in the community archive, also not with a different name.
  • The package title in the POSEIDON.yml conforms to the general title structure suggested here: <Year>_<Last name of first author>_<Region, time period or special feature of the paper>, e.g. 2021_Zegarac_SoutheasternEurope, 2021_SeguinOrlando_BellBeaker or 2021_Kivisild_MedievalEstonia.
  • The package is stored in a directory that is named like the package title.

  • The package is complete and features the following elements:
    • Genotype data in binary PLINK format (not EIGENSTRAT format).
    • A POSEIDON.yml file with not just the file-referencing fields, but also the following meta-information fields present and filled: poseidonVersion, title, description, contributor, packageVersion, lastModified (see here for their definition)
    • A reasonably filled .janno file (for a list of available fields look here and here for more detailed documentation about them).
    • A .bib file with the necessary literature references for each sample in the .janno file.
  • Every file in the submission is correctly referenced in the POSEIDON.yml file and there are no additional, supplementary files in the submission that are not documented there.
  • Genotype data, .janno and .bib file are all named after the package title and only differ in the file extension.
  • The package version in the POSEIDON.yml file is 1.0.0.
  • The poseidonVersion of the package in the POSEIDON.yml file is set to the latest version of the Poseidon schema.
  • The POSEIDON.yml file contains the corresponding checksums for the fields genoFile, snpFile, indFile, jannoFile and bibFile.
  • There is either no CHANGELOG file or one with a single entry for version 1.0.0.

  • The Publication column in the .janno file is filled and the respective .bib file has complete entries for the listed mentioned keys.
  • The .janno file does not include any empty columns or columns only filled with n/a.
  • The order of columns in the .janno file adheres to the standard order as defined in the Poseidon schema here.
  • The .janno and the .ssf files are not fully quoted, so they only use single- or double quotes ("...", '...') to enclose text fields where it is strictly necessary (i.e. their entry includes a TAB).

  • The package passes a validation with trident validate --fullGeno.

  • Large genotype data files are properly tracked with Git LFS and not directly pushed to the repository. For an instruction on how to set up Git LFS please look here. If you accidentally pushed the files the wrong way you can fix it with git lfs migrate import --no-rewrite path/to/file.bed (see here).

@stschiff
Copy link
Member Author

OK, so this adds Posth et al. 2021 Science Advances. The Janno-File was kindly prepared by @EleniSef, and the genotype data provided by the authors themselves. I fixed alignment issues between the Janno and the genotype data (removed some samples from Elenis Janno and brought the genotype data into the right order).

There are some issues left to fix, in particular with the Janno.
@EleniSef, the longitude/latitude entries are wrong (they are numbers containing two decimal points). Can you please check? You may not have the right to edit this janno here in this PR, but perhaps you could simply download the new Janno-file from this PR, fix the Lat/Long and email me the corrected version, so I can upload it here and continue with this PR?

Thanks a lot!

@EleniSef
Copy link
Contributor

@stschiff thank you for pointing that out.
The edited janno is in your inbox.

@stschiff
Copy link
Member Author

Great, I've added @EleniSef's new Janno file, and fixed some other minor things. I think the package is ready for review.

@stschiff stschiff marked this pull request as ready for review October 22, 2024 07:47
@stschiff
Copy link
Member Author

I have no idea why the validation check fails. This package validates on my Computer (MacOS). The janno-checksum is correct. Any hint, @nevrome ?

@nevrome
Copy link
Member

nevrome commented Oct 22, 2024

Maybe it's again this obscure issue? poseidon-framework/poseidon-hs#302

@stschiff
Copy link
Member Author

Uff, it probably is... which means that I should rerun rectify on a Linux machine. I'll try that.

@stschiff
Copy link
Member Author

Hmm... no. On our server I get the exact same checksum. I have no idea how to proceed. Can someone perhaps pull this branch and run validate on their linux machine?

@nevrome
Copy link
Member

nevrome commented Oct 22, 2024

On my system the validation fails because of a wrong checksum for 2021_Posth_Etruscans.janno. The correct checksum is 64903fc4201e9deed1e18710569c0fca.

I don't know where this discrepancy is coming from. I was convinced this must be an issue of your proprietary operating system, but if you observed the same behaviour on a proper (Linux) computer, then I'm as lost as you. Could this be something about Git and autocrlf? We have set it to

* text eol=lf

@stschiff
Copy link
Member Author

Yes, that was it. The Janno-file had CRLF line-endings. Perhaps because I used @EleniSef's file as as a start, and she used Windows? On Mac, default line endings are LF. I wonder whether we can have trident validate check this. I will open an issue for it, so at least on our individual local systems we can already catch these.

With respect to git-attributes, I wonder why we need this attribute-setting. It seems to me that it breaks the consistency of our checksum-system. If a package validates locally, I cannot trust anymore that it validates on another computer. That's bad, right? I will open another issue just for that.

I fixed the Janno and the checksum, so this is good to review. Thanks.

@stschiff
Copy link
Member Author

stschiff commented Oct 23, 2024

OK, hold on, there is another issue that our check found: We seem to have a duplicate individual with a sample in 2021 Biagini et al:

[Error]     IndividualInfo {indInfoName = "MAS003", indInfoGroups = ["Spanish"], indInfoPac = *2019_Biagini_Spain-2.2.1*}
[Error]     IndividualInfo {indInfoName = "MAS003", indInfoGroups = ["C.Italy_Imperial"], indInfoPac = *2021_Posth_Etruscans-1.0.0*}

@EleniSef if you have any input here, that'd be welcome, but I will also have a look at the other paper myself. Would be good to know whether that is really the same sample, or just a coincidence, in which case we need to rename it.

@EleniSef
Copy link
Contributor

For your comment above, I used google spreadsheet to prepare the .janno file, as I could easily convert to tsv. I am not sure if that was the problem.
For the duplicate individuals, I had a look at this paper: DOI: 10.1038/s41431-019-0361-1 and it seems like both papers used the same sample code but they are different individuals.
2019_Biagini_Spain - Catalunya_Central MAS003
2021_Posth_Etruscans - Necropolis of Le Pianacce of Poggio Pozzino MAS003

Here you can also validate this:
for the Biagini paper https://figshare.com/articles/dataset/People_from_Ibiza_dataset/7162025?file=13179191
for Cosimo's paper is on page 10 of Supplementary Text

@stschiff
Copy link
Member Author

OK wonderful, thanks for investigating. OK, then we need to find a new Poseidon_ID for MAS003 in Posth et al. I would the suggest to rename all MAS00? samples to Marsiliana00?. In other words:

MAS001 -> Marsiliana001
MAS002 -> Marsiliana002
MAS003 -> Marsiliana003
MAS004 -> Marsiliana004

Would that be acceptable, you think? Perhaps @AyGhal can briefly comment?

@AyGhal
Copy link
Contributor

AyGhal commented Oct 23, 2024

That works and we can have the original IDs in the Alternative_IDs. But didn't we discuss this in one of the pandora or big data meeting, something like MAS001_eva, or was it a prefix?

@stschiff
Copy link
Member Author

stschiff commented Oct 23, 2024

Hmm, good point. Calling in @TCLamnidis @wolfgangaroo and @nevrome -> Can you remember how we wanted to deal with Pandora IDs in publications in cases where there already is an ID from another lab? As you see above, I propose to rename four individuals from Posth et al. 2021 with explicit site names (so MAS001 -> Marsiliana001). @AyGhal's proposal (MAS001 -> MAS001_eva) seems fine, but in my view it emphasises our laboratory a bit too much. In the end the term "eva" has little to do with the actual samples, right?

@nevrome
Copy link
Member

nevrome commented Oct 23, 2024

I think the last time we extended the name (GOR/Gordion?) , so I would vote for MAS001 -> Marsiliana001.

@stschiff
Copy link
Member Author

OK, this is now good for review and merge, @AyGhal.

@AyGhal
Copy link
Contributor

AyGhal commented Oct 25, 2024

Okay, I see that you have changed the Poseidon_ID for these individuals. However, the original IDs from the paper should be recorded in the Alternative_IDs. We don't want to loose this information.

@stschiff
Copy link
Member Author

Good catch. I just added the original names as Alternative_ID, and I also put a note in the Note field.

@AyGhal
Copy link
Contributor

AyGhal commented Oct 25, 2024

Looks good to me, thanks!

@AyGhal AyGhal merged commit 8af3a88 into master Oct 25, 2024
1 check passed
@AyGhal AyGhal deleted the add_2021_Posth_SciAdv branch October 25, 2024 13:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants