Adding 2023_Koptekin_SouthwestAsia #181

Merged: 7 commits, Jul 12, 2024

Contributor

@93Boy 93Boy commented May 22, 2024

PR Checklist for a new package submission

  • The package does not exist already in the community archive, also not with a different name.
  • The package title in the POSEIDON.yml conforms to the general title structure suggested here: <Year>_<Last name of first author>_<Region, time period or special feature of the paper>, e.g. 2021_Zegarac_SoutheasternEurope, 2021_SeguinOrlando_BellBeaker or 2021_Kivisild_MedievalEstonia.
  • The package is stored in a directory that is named like the package title.

  • The package is complete and features the following elements:
    • Genotype data in binary PLINK format (not EIGENSTRAT format).
    • A POSEIDON.yml file with not just the file-referencing fields, but also the following meta-information fields present and filled: poseidonVersion, title, description, contributor, packageVersion, lastModified (see here for their definition)
    • A reasonably filled .janno file (for a list of available fields look here and here for more detailed documentation about them).
    • A .bib file with the necessary literature references for each sample in the .janno file.
  • Every file in the submission is correctly referenced in the POSEIDON.yml file and there are no additional, supplementary files in the submission that are not documented there.
  • Genotype data, .janno and .bib file are all named after the package title and only differ in the file extension.
  • The package version in the POSEIDON.yml file is 1.0.0.
  • The poseidonVersion of the package in the POSEIDON.yml file is set to the latest version of the Poseidon schema.
  • The POSEIDON.yml file contains the corresponding checksums for the fields genoFile, snpFile, indFile, jannoFile and bibFile.
  • There is either no CHANGELOG file or one with a single entry for version 1.0.0.

  • The Publication column in the .janno file is filled and the respective .bib file has complete entries for all listed keys.
  • The .janno file does not include any empty columns or columns only filled with n/a.
  • The order of columns in the .janno file adheres to the standard order as defined in the Poseidon schema here.

  • The package passes a validation with trident validate --fullGeno.

  • Large genotype data files are properly tracked with Git LFS and not pushed directly to the repository. For instructions on how to set up Git LFS, please look here. If you accidentally pushed the files the wrong way, you can fix it with git lfs migrate import --no-rewrite path/to/file.bed (see here).
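
The naming rules in the checklist above can be pre-checked locally before opening a PR. The following is a minimal sketch, not part of trident: the regular expression is only an assumed reading of the <Year>_<Last name>_<Feature> structure, and the helper names title_is_well_formed and misnamed_files are hypothetical.

```python
import re
from pathlib import Path
from typing import List

# Assumed reading of <Year>_<LastName>_<Feature>; the checklist does not
# define the exact character rules, so this pattern is an illustration only.
TITLE_PATTERN = re.compile(r"^\d{4}_[A-Za-z]+_[A-Za-z0-9]+$")


def title_is_well_formed(title: str) -> bool:
    """Return True if the title matches the assumed pattern."""
    return TITLE_PATTERN.fullmatch(title) is not None


def misnamed_files(package_dir: Path) -> List[str]:
    """Return data files whose base name differs from the directory name."""
    title = package_dir.name
    checked = {".bed", ".bim", ".fam", ".janno", ".bib"}
    return sorted(
        p.name
        for p in package_dir.iterdir()
        if p.suffix in checked and p.stem != title
    )
```

An empty list from misnamed_files only means the file names agree with the directory name; it does not replace trident validate --fullGeno.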

@nevrome nevrome changed the title Initial commit Adding 2023_Koptekin_SouthwestAsia May 23, 2024
@stschiff
Member

Thanks for submitting, @93Boy. Could you please fill in the checkboxes and make sure the automatic checks pass?

@nevrome
Member

nevrome commented Jun 7, 2024

Thanks for preparing this package, @93Boy.

Maybe you could apply some more minor changes to the .janno file:

  1. Add the bibtex key of the paper (2023_Koptekin_SouthwestAsia) to the Publication column.
  2. Remove the excessive backticks (''') where they pop up.
  3. Add temporal information in the Date_* columns based on the information in this supplementary table: https://www.cell.com/cms/10.1016/j.cub.2022.11.034/attachment/e5290170-172b-41de-8d59-0be86b60a590/mmc2.xlsx The radiocarbon dating columns are at the end of the table.

The validation now fails because the archive already contains a sample with the Poseidon_ID GOR001, in another, entirely unrelated package.

[Error]   [2024-06-06 21:42:05] There are duplicated individuals in this package collection. Set --ignoreDuplicates to ignore this issue.
[Error]   [2024-06-06 21:42:05] Duplicate individual "GOR001"
[Error]   [2024-06-06 21:42:05]   IndividualInfo {indInfoName = "GOR001", indInfoGroups = ["Anatolia_Gordion_IA"], indInfoPac = *2023_Koptekin_SouthwestAsia-1.0.0*}
[Error]   [2024-06-06 21:42:05]   IndividualInfo {indInfoName = "GOR001", indInfoGroups = ["VolgaOka_MA1"], indInfoPac = *2023_Peltola_VolgaOka-1.1.0*}

What should the process be here, @stschiff, @AyGhal and @TCLamnidis? I suggest we rename the newly added GOR001 to GOR001Anatolia.
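
The clash reported in the log can be reproduced with a small check across packages. This is a minimal sketch of the idea only, not how trident detects duplicates: find_duplicate_ids is a hypothetical helper, and the IDs starting with HYPO are made-up placeholders.

```python
from collections import defaultdict


def find_duplicate_ids(packages):
    """Map each Poseidon_ID found in more than one package to those packages.

    `packages` maps a package name to the list of Poseidon_IDs it contains.
    """
    seen = defaultdict(list)
    for package, ids in packages.items():
        for poseidon_id in ids:
            seen[poseidon_id].append(package)
    return {pid: pkgs for pid, pkgs in seen.items() if len(pkgs) > 1}


# Illustrative subset: only GOR001 is taken from the log above,
# the HYPO* IDs are placeholders.
archive = {
    "2023_Koptekin_SouthwestAsia": ["GOR001", "HYPO001"],
    "2023_Peltola_VolgaOka": ["GOR001", "HYPO002"],
}
print(find_duplicate_ids(archive))
```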

@nevrome
Member

nevrome commented Jun 7, 2024

I actually see now that you had already added the C14 dates, but then removed them again in 69b2347, @dhananjaya93. The way you added them was incorrect, though: the information from the supplementary material must be split across the .janno columns.

Please check again here what the correct structure for the Date_ columns is. Let me know if this is confusing, then we can have a look together.

@TCLamnidis
Member

Maybe we should suffix with (part of) the Site name, to avoid future clashes with other samples with the same ID from the same region?

@nevrome
Member

nevrome commented Jun 17, 2024

In this case the site is "Gordion", so GOR001Gordion?

@stschiff
Member

OK, so first of all, there are quotation marks all over again, @93Boy.
Regarding GOR001 and GOR002, I would indeed suggest renaming those two:
GOR001 -> Gordion001
GOR002 -> Gordion002

Is that OK with people?

@TCLamnidis
Member

GOR001 -> Gordion001
GOR002 -> Gordion002
sounds pretty good to me.

@nevrome
Member

nevrome commented Jun 22, 2024

Gordion... sounds like a good solution in this case.
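
The agreed renaming has to be applied consistently to the .janno file and the genotype data. A minimal sketch, assuming that Poseidon_ID is the first tab-separated column of the .janno file and that the individual ID is the second column of the PLINK .fam file; the helper names are hypothetical and the sample lines in the usage are illustrative.

```python
RENAMES = {"GOR001": "Gordion001", "GOR002": "Gordion002"}


def rename_janno_line(line: str, renames: dict) -> str:
    """Rename the first tab-separated field (assumed to be Poseidon_ID)."""
    fields = line.rstrip("\n").split("\t")
    fields[0] = renames.get(fields[0], fields[0])
    return "\t".join(fields)


def rename_fam_line(line: str, renames: dict) -> str:
    """Rename the second whitespace-separated field (PLINK individual ID)."""
    fields = line.split()
    fields[1] = renames.get(fields[1], fields[1])
    return "\t".join(fields)


# Illustrative lines, not copied from the actual package files.
print(rename_janno_line("GOR001\tM\tAnatolia_Gordion_IA\n", RENAMES))
print(rename_fam_line("Anatolia_Gordion_IA GOR002 0 0 1 1", RENAMES))
```

After such an edit the checksums in the POSEIDON.yml file no longer match the files and would need to be refreshed before validation.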

@93Boy
Contributor Author

93Boy commented Jul 3, 2024

"Can't read sample in ./2023_Koptekin_SouthwestAsia.janno in line 2: parse error (Failed reading: conversion error: Age 2700 later than 2023, which is impossible. Did you accidentally enter a BP date?)" I keep getting this error for the C14 start and end dates. I have gone through the supplementary materials multiple times, and they all use the BCE format. Could you please check and give me some guidance?
mmc3.xlsx

@93Boy
Contributor Author

93Boy commented Jul 3, 2024

After discussing with @nevrome, we have assumed that the dates are in BCE format and made the changes accordingly. Please go through the data and give feedback. Apart from this issue, I found the contamination data but could not find Contamination_Meas in the supplementary data. Therefore trident rectify cannot recognize the values. Other than these issues, I think this package is ready to publish.
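
The error quoted earlier suggests that a bare 2700 was read as a calendar year after the present. A minimal sketch of the conversion, assuming the .janno date range columns expect signed calendar years with BCE values written as negative numbers; to_signed_year is a hypothetical helper and the "2700 BCE" label format is an assumption, not taken from the supplementary table.

```python
import re


def to_signed_year(text: str) -> int:
    """Convert labels such as '2700 BCE' or '150 CE' to a signed year."""
    match = re.fullmatch(
        r"\s*(\d+)\s*(BCE|BC|CE|AD)\s*", text, flags=re.IGNORECASE
    )
    if match is None:
        # Refuse bare numbers: without an era the sign cannot be decided.
        raise ValueError(f"unrecognized date label: {text!r}")
    year = int(match.group(1))
    era = match.group(2).upper()
    return -year if era in ("BCE", "BC") else year


print(to_signed_year("2700 BCE"))
```

Rejecting labels without an era is deliberate here, since guessing the sign is exactly what led to the parse error.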

@nevrome
Member

nevrome commented Jul 12, 2024

Looks good to me now - will merge.

@nevrome nevrome merged commit 057d6b9 into master Jul 12, 2024
1 check passed
@nevrome nevrome deleted the 2023_Koptekin_SouthwestAsia branch July 12, 2024 13:46
4 participants