Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

information for modern samples added #147

Merged
merged 23 commits into from
Jun 22, 2024
Merged

information for modern samples added #147

merged 23 commits into from
Jun 22, 2024

Conversation

93Boy
Copy link
Contributor

@93Boy 93Boy commented Nov 16, 2023

PR Checklist for modifying one or multiple existing packages

  • The changes maintain the structural integrity of the affected packages.
  • The checksums of the modified files in the respective POSEIDON.yml files were adjusted properly.
  • Every file in the submission is correctly referenced in the relevant POSEIDON.yml files and there are no additional, supplementary files in the submission that are not documented there.

  • The packageVersion numbers of the affected packages were increased in their POSEIDON.yml files.
  • The changes were documented in the respective CHANGELOG files. If no CHANGELOG files existed previously it was added here.
  • The lastModified fields of the affected POSEIDON.yml files were updated.
  • The contributor fields were updated with name, email and orcid of the relevant, new contributors.

  • All affected packages pass a validation with trident validate --fullGeno.

  • Large genotype data files are properly tracked with Git LFS and not directly pushed to the repository. For an instruction on how to set up Git LFS please look here. If you accidentally pushed the files the wrong way you can fix it with git lfs migrate import --no-rewrite path/to/file.bed (see here).

@nevrome
Copy link
Member

nevrome commented Nov 17, 2023

Cool that you started to add this info, @93Boy! Could you go through the checklist above and mark what you have done?

@93Boy
Copy link
Contributor Author

93Boy commented Nov 17, 2023

Sure , I have a few more fields to complete in KilincSciAdv2021. I will check the package information afterwards

@stschiff
Copy link
Member

quotations!

@stschiff
Copy link
Member

And location string should not contain Y-haplogroup, that should go into the right column

@stschiff
Copy link
Member

Also there are encoding problems. Make sure the text export is UTF8 (e.g. "Tai–Kadai")

@stschiff
Copy link
Member

Can we get an update on this one, @93Boy, please?

@stschiff
Copy link
Member

Please go through the checklist, and in particular update the versions. We also have seen backticks in locations and haplogroups in the Janno-file. Please clean them up @93Boy.

@stschiff
Copy link
Member

Please update the Changelog and the version in the YAML. Also there are still single quotes (''''''') in the Wang_EastAsia janno file. Please fix.

@93Boy
Copy link
Contributor Author

93Boy commented Feb 29, 2024

I have made the informed changes. Please check and merge if everything is good

@stschiff
Copy link
Member

stschiff commented Mar 5, 2024

Hi @93Boy. Thanks, but please remember to always check the "Files Changes" tab. You can easily see that the Wang-Janno has quotations. Please remove them.

@93Boy
Copy link
Contributor Author

93Boy commented Mar 5, 2024

I am extremely sorry for this quotation issue. It's a default setting in Libre Office which I can't bypass. It will simply change the structure of a file, even if I open it to check. I have fixed it with the below commit.

Copy link
Member

@stschiff stschiff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, looks like this is finally good to merge. @AyGhal, @nevrome do you agree?

Copy link
Contributor

@AyGhal AyGhal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. Thank you!
My only question is do we need to update lastModified: for 2021_Wang_EastAsia ?

@stschiff
Copy link
Member

Excellent catch, @AyGhal. I have just done that inline. Now it should be good to merge.

@nevrome
Copy link
Member

nevrome commented Mar 15, 2024

I see many new warnings of the form:

[Warning] [2024-03-15 22:31:08] Issue in data/2021_Kilinc_northeastAsia/Kilinc_northeastAsia.janno in line 2 (Poseidon_ID: bla001): Date_Type is "C14", but either Date_C14_Uncal_BP or Date_C14_Uncal_BP_Err are empty

Note that this check was only added to trident in v1.4.1.0. It should be an error, but we have some legacy packages with this issue, so we decided to only make it a warning. We should absolutely avoid it for new packages.

@93Boy: You seem to get this error, because Date_Type is set to C14 in 2021_Kilinc_northeastAsia, but Date_C14_Uncal_BP and Date_C14_Uncal_BP_Err are empty. We only seem to have contextual dates in this .janno file. These columns seem actually to be completely empty, so they might as well be removed from the .janno file.

@stschiff and @AyGhal: I made it a habit to search the GitHub validation log for the keyword warning in my reviews. Maybe we should also add this to the submission checklist.

@stschiff
Copy link
Member

OK, I will run the validator and fix the outstanding issues quickly

@stschiff
Copy link
Member

stschiff commented Mar 25, 2024

OK, I think indeed the Kilinc package lists C14 dates but no IDs or uncalibrated dates. I will take a look at the paper myself and give Dana a pointer.

@stschiff
Copy link
Member

OK, so I've checked the paper (https://www.science.org/doi/10.1126/sciadv.abc4587) and here attach the supplementary Tables that list the C14 dates. @93Boy can you please go through these and add the C14 IDs and uncalibrated dates to the Janno file? Alternatively, if you feel that would take too much time for now, we can also merge this PR for now and then fix this separately. But I think it could be a relatively quick task of about one hour to add all of them.

abc4587_data_file_s1.xlsx

@93Boy
Copy link
Contributor Author

93Boy commented Apr 22, 2024

@nevrome I have removed the typos caused by my editor and changed group names. I have changed .fam files accordingly but trident-rectify throws the following error. Cross-file consistency issue in package 2021_Kilinc_northeastAsia: Individual GroupID mismatch between genotype data (left) and .janno files (right). Note that this could be due to a wrong Plink file population-name encoding (see the --inPlinkPopName option). (Russia_AginBuryat_M.SG = Russia_ArgunRiver_M.SG)

@93Boy
Copy link
Contributor Author

93Boy commented Apr 22, 2024

Regarding the 2021_Wang_EastAsia , I havent changed anything in the location information but I can manually trim it off.

@stschiff
Copy link
Member

stschiff commented Apr 23, 2024

Regarding 2021_Kilinc_northeastAsia:

  • There is a mismatch on line 7, where one says Russia_AginBuryat_M.SG (fam-file) and the other Russia_ArgunRiver_M.SG (jango-file). According to the Supplementary File above (column "Region" for sample cta016), Russia_ArgunRiver_M.SG seems right, so the janno file is correct, the fam file is incorrect. Please double-check and fix if necessary.
  • Regarding the "Collection ID", I think here we should actually take the entries in the column "Sample_Name" in the supplementary file. The column "Lab_ID" seems to match the Poseidon_ID. So please add the "Sample_Name" info to as Collection_ID

Regarding 2021_Wang_EastAsia:

  • There are many samples in this package where the "Location" has a string like Tibetic, Bodish, Sino-Tibetan, Gannan, Gansu, Y.haplogroup=J2b2. This is a mix of language families and Y-haplogroups, but not locations. Please remove the language groups and move the haplogroup info into the right column, without the "Y.haplogroup=" prefix.
  • Do we have a source table where the right location info can be extracted?

@93Boy
Copy link
Contributor Author

93Boy commented Apr 23, 2024

@stschiff Regarding the 2021_Wang_EastAsia , when I took over this package there were 383 entries that were incomplete starting from the poseidon_ID GN01 and ending from Dong007. They are the modern samples according to my knowledge. Later on, I extracted 191 ancient entries from AADR v50. They are complete and in the correct format. This issue has been there even before the commit you mentioned above.

@stschiff
Copy link
Member

@93Boy I've updated my comment. The history how this came about isn't too important. The issues are relatively clear. Can you work on fixing them, please, and let us know where you couldn't? Thanks.

@93Boy
Copy link
Contributor Author

93Boy commented May 1, 2024

I have made the following recommended changes to 2021_Wang_EastAsia Janno,

  1. Y_haplo group information was extracted from the Location column for all the 383 modern samples and updated.
  2. After examining the Group names, they are more likely locations. Therefore I swapped group names and location data. (Please let me know , if Im wrong I can easily change it)

I have thoroughly reviewed supplementary documents, but I could not find the information for 383 modern samples. I appreciate it if you could guide me about this. Herewith I have attached the supplementary documents.
41586_2021_3336_MOESM3_ESM (2).zip

@stschiff
Copy link
Member

stschiff commented May 3, 2024

Notes from Meeting on May 3rd:

For Wang et al. 2021:

  • please remove quotes
  • What you currently added as "Group_Name" for the modern samples, are in fact languages. Please remove them. The original group names from the Master branch were correct, please put them back in for the modern samples. They are now in "location", where they can also stay. So you can - for the modern samples only - have same entries in group and location.
  • Probably fine to leave "Collection_ID" as "n/a" for the modern samples.
  • For the ancient samples, the "Collection_ID" is redundantly copied from the Poseidon_ID. Please use column "Skeletal Code" from Online Table 1.
  • For ancient samples: Please put back the location info from the master branch. It got lost.

@93Boy
Copy link
Contributor Author

93Boy commented May 8, 2024

Wang et al. 2021:
I have done the requested changes. Furthermore I have notices several (30+) poseidon_IDs were actually Library_ids according to the supplementary materials and I could have extracted a proper ID key from those supplementary materials. But now trident rectify can not recognize these new poseidon_Ids. May I change fam files too?

@93Boy
Copy link
Contributor Author

93Boy commented May 16, 2024

suspicious_IDs.csv
I have done a small analysis regarding the problematic poseidon_IDs of Wang et.al 2021 and finding are given below,

  • There are 28 poseidon_IDs which are most likely library codes.
  • I was able to match a valid poseidon_IDs for all of these entries from the old collection_IDs which we replaced with skeletal codes under the commit "e0b4031"
  • AADR v54.1 also contain 18 of these suspicious_IDs, they also follow the format of old collection_IDs

@stschiff
Copy link
Member

stschiff commented May 17, 2024

Notes from May 17th:

Thanks for catching these. As discussed. Please replace library names like "S15162.Y1.E1.L1" with the Master ID in Supplementary Table 2, in this case it would be "I15162", wherever this is possible. Of course, these changes should be made in both the fam- and the janno-file.

@stschiff
Copy link
Member

Update, @93Boy ?

@93Boy
Copy link
Contributor Author

93Boy commented Jun 11, 2024

I completed all the requested changes during the last review. I think now this is ready for the second review

@stschiff stschiff requested review from stschiff and nevrome June 14, 2024 08:02
@nevrome
Copy link
Member

nevrome commented Jun 14, 2024

2021_Kilinc_northeastAsia looks good to me now.

@nevrome
Copy link
Member

nevrome commented Jun 14, 2024

2021_Wang_EastAsia also looks good now, imho.

Copy link
Member

@stschiff stschiff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Super. Good to go from my end.

@nevrome nevrome merged commit a97d5ff into master Jun 22, 2024
1 check passed
@nevrome nevrome deleted the missing_location_info branch June 22, 2024 09:44
@nevrome
Copy link
Member

nevrome commented Jun 22, 2024

After an epic seven month long quest this PR got finally merged 🚀

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants