Reevaluate haplogroup .janno columns #76

Open
nevrome opened this issue Apr 19, 2024 · 6 comments

@nevrome
Member

nevrome commented Apr 19, 2024

As discussed in today's meeting with @stschiff and @AyGhal the variables MT_Haplogroup and Y_Haplogroup could be specified a bit further. I see two possible ToDos:

  1. Come up with a minimal validation scheme for entries in these columns to ensure their machine-readability (to be specified in the schema and then later implemented in trident). Maybe experts like @wolfgangaroo and @BenRohrlach could help us out here. My naive understanding is that we could enforce rules such as:
    • Every entry has to start with a capital letter
    • Entries can only contain capital letters, numbers and - signs
    • ...?
  2. Add specific _Note fields for both haplogroup columns, so that supplementary free-text information has a clear place to go.
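
To make the naive rules from item 1 concrete, here is a minimal sketch in Python. The function name `is_valid_naive` is made up for illustration and is not part of the Poseidon schema or trident; the pattern encodes only the two rules listed above.

```python
import re

# Assumption: only the two naive rules from the list above are enforced.
# Rule 1: the entry starts with a capital letter.
# Rule 2: the entry contains only capital letters, digits and hyphens.
NAIVE_HAPLOGROUP = re.compile(r"[A-Z][A-Z0-9-]*")


def is_valid_naive(entry: str) -> bool:
    """Return True if the whole entry satisfies both naive rules."""
    return NAIVE_HAPLOGROUP.fullmatch(entry) is not None
```

Taken literally, these two rules reject any entry with lowercase sub-clade letters, for example `R1a-Z645`, so rule 2 would probably have to allow lowercase letters as well.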
@nevrome
Member Author

nevrome commented Apr 19, 2024

I got some good input from Ben for the Y-chromosome haplogroups:

Names in terminal SNP notation (usually) start with a capital letter, followed by a hyphen, and then the terminal SNP. There are exceptions, e.g.:

  • Some sub-clades start with two capital letters, such as NO. These are rare, but exist.
  • Some sub-clades are split further, e.g., we say J1-L255 and J2-M172, but others go one level deeper, such as R1a-Z645 and R1b-Z2103. This is because R2 is super rare, so the further split at R1a/R1b is more informative.
  • Terminal SNPs don't follow a fantastic naming rule set, but mostly look like [capital_letters:numbers]. However, there exist SNPs like "Page13".
  • Some people prefer to use one terminal SNP over another, e.g., some people call R1b1 "R1b-L754", and others call it "R1b-PF6269", while the actual SNP is called "L754/PF6269/YSC0000022" (with the different SNP names given in alphabetic order, I think). Sadly, both forms can be found in the literature, so there's no consensus.
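
Ben's description could be turned into a pattern roughly like the following. This is only a sketch under my own assumptions: the pattern and the name `parse_terminal_snp` are hypothetical, it covers only the "usual" form plus the exceptions listed above, and it does not resolve the question of which of several equivalent SNP names to prefer.

```python
import re

# Assumed shape: <clade>-<terminal SNP>
#   clade: one or two capital letters, optionally one digit and one
#          lowercase letter (R, NO, J1, R1a)
#   SNP:   a capital letter, optional further letters, then digits
#          (M172, Z2103, PF6269, Page13)
Y_TERMINAL_SNP = re.compile(
    r"(?P<clade>[A-Z]{1,2}(?:\d[a-z]?)?)"
    r"-"
    r"(?P<snp>[A-Z][A-Za-z]*\d+)"
)


def parse_terminal_snp(entry: str):
    """Return (clade, snp) for a well-formed entry, otherwise None."""
    match = Y_TERMINAL_SNP.fullmatch(entry)
    if match is None:
        return None
    return match.group("clade"), match.group("snp")
```

Deeper clade prefixes and any entry with spaces or annotations are deliberately not matched here, so a pattern like this would only be a starting point for the discussion.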

@stschiff
Member

Oh boy, did I really want to know that? 😅

But thanks for finding out, @nevrome. Do you think we can work with that in terms of validation?

@nevrome
Member Author

nevrome commented Apr 24, 2024

Well - in the archives there are a lot of odd things that don't fit this description of the Y_Haplogroup syntax. For example:

C2a1a(xC2a1a1,C2a1a2,C2a1a3)
Q1a2a1a4a~ (YP817)
R1a1a1b2a2a3b~ (YP1456, YP1710)
J2b2a1a1a1a1a1a1~
J2a-M410 (J2a7-Z2397) 
G2a3-F1193-F2291
R1b1’5-P312
R1a5-YP1301 (under YP1272)
R1a1'2-Z645(xZ283)
R1a?
R1a*
R1a - Z645
G2a2a1a2a2a1~-Z31430
n/a (female)
...

I know that the Poseidon schema is independent of the archives, but they still serve as an indicator of what users expect.

Maybe there are a number of semantic patterns we can extract from what we see in the data and use to come up with our own specified syntax. Things like (YP1456, YP1710), for example, look like an OR to my untrained eyes. n/a (female) is a very specific, recurring expression. R1a? indicates that it should be possible to encode uncertainty. And so forth...
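
As a rough illustration of how such patterns could be detected in existing entries, here is a small Python sketch. The marker names and the function `describe_markers` are invented for this example, and the reading of a parenthesised, comma-separated list as alternatives is only my guess from above, not an established convention.

```python
import re

# Assumption: a parenthesised, comma-separated list that does not start
# with "x" is read as a list of alternative SNPs, e.g. "(YP1456, YP1710)".
SNP_ALTERNATIVES = re.compile(r"\((?!x)[^()]*,[^()]*\)")


def describe_markers(entry: str) -> set:
    """Collect the semantic markers found in a raw Y_Haplogroup entry."""
    text = entry.strip()
    markers = set()
    if text.lower().startswith("n/a"):
        markers.add("not_applicable")    # e.g. "n/a (female)"
    if text.endswith("?"):
        markers.add("uncertain")         # e.g. "R1a?"
    if SNP_ALTERNATIVES.search(text):
        markers.add("snp_alternatives")  # e.g. "(YP1456, YP1710)"
    return markers
```

Running something like this over the archives would at least tell us how often each pattern occurs before we decide which ones deserve their own syntax.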

@stschiff
Member

phew... OK, I think this requires a meeting with the Y-chrom experts. I can imagine that we may introduce a "Y_Haplogroup_Strict" column, or something like it, in which only secure haplogroups following a clear schema are entered. Not sure though.

@nevrome
Member Author

nevrome commented Apr 30, 2024

I also talked to Luka about this now. He made me aware that the x in C2a1a(xC2a1a1,C2a1a2,C2a1a3) means NOT. This further confirms my suspicion that there is some semantic structure we can encode in a reliable, documented DSL.
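
To show what parsing this NOT notation might look like, here is a minimal Python sketch. The name `parse_exclusion` and the exact pattern are assumptions for illustration only; a real DSL would need a proper grammar and agreement on how it combines with the other markers.

```python
import re

# Assumed shape: <base>(x<excluded>[,<excluded>...])
# e.g. "C2a1a(xC2a1a1,C2a1a2,C2a1a3)" or "R1a1'2-Z645(xZ283)"
EXCLUSION = re.compile(r"(?P<base>[^()\s]+)\(x(?P<excluded>[^()]+)\)")


def parse_exclusion(entry: str):
    """Return (base, [excluded, ...]) or None if the entry has no x-list."""
    match = EXCLUSION.fullmatch(entry.strip())
    if match is None:
        return None
    excluded = [part.strip() for part in match.group("excluded").split(",")]
    return match.group("base"), excluded
```

With this, `C2a1a(xC2a1a1,C2a1a2,C2a1a3)` would be read as "C2a1a, but not C2a1a1, C2a1a2 or C2a1a3".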

Maybe we should indeed introduce a couple of new columns:

  1. One for a representation of Y-haplogroups in Terminal SNP notation with a custom, strictly enforced DSL
  2. (Maybe) One for a representation of Y-haplogroups in YCC longhand nomenclature with another DSL
  3. One to document the Y-Chromosome reference tree that was used
  4. One arbitrary, free text column for further notes on the Y-haplogroup

The current (by then old) column can just stick around, but we would discourage its use.

@stschiff
Member

stschiff commented May 6, 2024

Hmm, indeed. Thanks for this succinct suggestion, @nevrome, I think I like it! Of course, in some sense the same applies to the MT haplogroup.
