Completing 2020_Margaryan_Viking with jannocoalesce #5

nevrome · 2024-02-07T11:27:19Z

I ran the following command. This finds a match for every sample in the target file from the source file.

trident jannocoalesce \
  -s ../../community-archive/2020_Margaryan_Viking/Margaryan_Viking.janno \
  -t 2020_Margaryan_Viking.janno \
  --stripIdRegex "(\.SG$)|(_MNT$)"

Here are my observations:

jannocoalesce seems to work reasonably well. I only see two potential changes we could consider: As it is very verbose for large files we may want to reduce the amount of [Info] output, and --fillColumns could get negative selection. Maybe I propose something in Jannocoalesce poseidon-hs#282
The minotaur pipeline does not seem to add Group_Names, but instead only fills the columns with Unknown. I think I understand why this is the case and it's a pretty big issue. We can not fix this with jannocoalesce as things stand right now.
We should try to specify Main_ID, RateErrX, RateErrY, RateX, RateY.
I'm not sure if we still need the AADR and AADRv443 citations here.
Same applies for the decision in the Note field (e.g. PASS (literature))
jannocoalesce does not fill .bib files, so the MargaryanWillerslevNature2020 citation is missing.

(2) and (6) somehow point to the need for a packagecoalesce command 🤔

nevrome · 2024-02-21T12:12:28Z

@stschiff and I had a very productive discussion about these issues yesterday. Here's what we came up with:

I will apply the desired changes to jannocoalesce asap.
Instead of adding a custom jannocoalesce option to overwrite Unknown in the Group_Name column we can just use the --force flag to do this for us:

trident jannocoalesce  \
   -s ../../community-archive/2020_Margaryan_Viking/Margaryan_Viking.janno  \
   -t 2020_Margaryan_Viking.janno \
   --stripIdRegex "(\.SG$)|(_MNT$)" \
   --fillColumns "Group_Name" \
   --force

So this problem can be solved just by running jannocoalesce twice.

I opened an issue for the sex rate columns: Specify sex determination variables RateX, RateY, RateErrX, RateErrY poseidon-schema#73. The Main_ID is a bigger challenge and I'm still not sure if and how to approach this.
Stephan and I agree that it's not necessary to cite the AADR for each sample in the minotaur-archive when we only take the context info from the .janno files from the community archive. These citations should be removed from the Publication column (or not even added in the first place, see 6.).
We didn't really discuss what to do with this AADR data quality levels. What I'm wondering now: @TCLamnidis would it be possible to come up with a sample quality metric in the minotaur workflow? Some broad summary statistic that indicates how trustworthy the genomic data in a given sample may be? Quantifying and defining such a statistic could be a major achievement 🤔
Stephan suggested not to fill the Publication column at all from the community-archive .janno files, but instead leave this entirely to the human editor.

2., 4. and 6. all point to the need for a checklist for the data editor and reviewers. Maybe this is the next step in specifying this process: We have to come up with a concrete list of ToDos and then write it up for the documentation. I imagine this list to be available on GitHub as well through a GitHub action: When the PR gets opened, @delphis-bot posts the checklist there for the human editor to go through.

TCLamnidis · 2024-03-13T15:32:33Z

We would then need to update the fam/ind files with these changes as well, as they all need to match. That is something I have already implemented in other workflows, so could just port that over to the poseidon-eager repo as an additional helper script.
The background implementation of the Main_ID is a huge undertaking ofc, but keeping this column around is simple enough. And then it is there when needed. Similarly, I wonder if columns like Eager_ID should be dropped. It is useful for debugging but there is a sweet spot between not losing any meta-info and over-encumbering a package janno with "irrelevant" columns. Thoughts?
I think we should have a clear selection of columns that we fill-in anyway. There are plenty of values that might be missing in Minotaur but not the community archive that we might not want to port over. e.g. Endogenous DNA has a pretty clear specification, but in cases where the SG data was not uploaded to the ENA, it would potentially be misleading to fill the information in with the reported values in PMA packages 🤔
That is quite an undertaking. Would happily discuss this further, but given the heterogeneity in the input for minotaur, I am not sure this goal is very realistic as an automated task.
I agree. At present, Minotaur processes data by PUBLICATION (though one could easily add raw data from other papers in the SSF 🤔 ). As such, we have an issue with the doi of the relevant paper, and could use that to fill in the bib file and publication entries.

TCLamnidis · 2024-03-13T15:37:20Z

Specifically, the Main_ID column:
I think we should not remove the column from MNT jannos. Re-inferring the correct values of the column is a larger task than removing it later on. Plus, since the column does not currently exist in the PCA/PAA, we can simply tranfer the Main_IDs from MNT to the other archives when/if we start using the column for things.

Could at minimum rename it, and remove the Eager_ID column instead.

TCLamnidis

The following columns should be filled in by Minotaur, not the PCA!
(This is a TODO for my implementation, not sth for here)

Capture_Type (SSF)
UDG (TSV)
Library_Built (SSF)
Genotype_Ploidy (hard-coded)
Coverage_on_Target_SNPs (pyEager)
Genetic_Source_Accession_IDs (SSF)

stschiff · 2024-03-15T17:22:59Z

Thanks for all this, @TCLamnidis. Some comments

We would then need to update the fam/ind files with these changes as well, as they all need to match. That is something I have already implemented in other workflows, so could just port that over to the poseidon-eager repo as an additional helper script.

Yes, great! But again, not necessarily something that needs to work automatically. Our validation CI would catch this and provide useful pointers to the user to fix it manually within the PR after minotaur!

The background implementation of the Main_ID is a huge undertaking ofc, but keeping this column around is simple enough. And then it is there when needed. Similarly, I wonder if columns like Eager_ID should be dropped. It is useful for debugging but there is a sweet spot between not losing any meta-info and over-encumbering a package janno with "irrelevant" columns. Thoughts?

I like keeping Eager_ID. I have no problem if adding a whole bunch of Minotaur/Eager-related columns that are not in the schema. It's exactly why we made this flexible.

I think we should have a clear selection of columns that we fill-in anyway. There are plenty of values that might be missing in Minotaur but not the community archive that we might not want to port over. e.g. Endogenous DNA has a pretty clear specification, but in cases where the SG data was not uploaded to the ENA, it would potentially be misleading to fill the information in with the reported values in PMA packages 🤔

Hmm, so do you want to check whether it's SG and then fill it, but not when it's capture?

That is quite an undertaking. Would happily discuss this further, but given the heterogeneity in the input for minotaur, I am not sure this goal is very realistic as an automated task.

Yes, fine to not do this for now.

I agree. At present, Minotaur processes data by PUBLICATION (though one could easily add raw data from other papers in the SSF 🤔 ). As such, we have an issue with the doi of the relevant paper, and could use that to fill in the bib file and publication entries.

Yes, but let's please not spend weeks on getting a doi->bib converter ready. Again something the user / we could fill manually.

stschiff · 2024-04-12T07:53:48Z

As of April 12, new processing is on the way.

TCLamnidis · 2024-04-29T08:24:57Z

This PR and discussion was very helpful. I have now recreated the 2020_Margaryan_Viking minotaur package, and used jannocoalesce to fill in metadata.
I will therefore close this PR, as it is no longer relevant.

simple jannocoalesce run

a882e0d

nevrome requested review from stschiff and TCLamnidis February 7, 2024 11:27

stschiff mentioned this pull request Feb 21, 2024

Jannocoalesce poseidon-framework/poseidon-hs#282

Merged

TCLamnidis reviewed Mar 13, 2024

View reviewed changes

TCLamnidis closed this Apr 29, 2024

TCLamnidis deleted the 2020_Margaryan_Viking_jannocoalesce branch April 30, 2024 11:14

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Completing 2020_Margaryan_Viking with jannocoalesce #5

Completing 2020_Margaryan_Viking with jannocoalesce #5

nevrome commented Feb 7, 2024

nevrome commented Feb 21, 2024

TCLamnidis commented Mar 13, 2024

TCLamnidis commented Mar 13, 2024

TCLamnidis left a comment •

edited

Loading

stschiff commented Mar 15, 2024

stschiff commented Apr 12, 2024

TCLamnidis commented Apr 29, 2024

Completing 2020_Margaryan_Viking with jannocoalesce #5

Completing 2020_Margaryan_Viking with jannocoalesce #5

Conversation

nevrome commented Feb 7, 2024

nevrome commented Feb 21, 2024

TCLamnidis commented Mar 13, 2024

TCLamnidis commented Mar 13, 2024

TCLamnidis left a comment • edited Loading

Choose a reason for hiding this comment

stschiff commented Mar 15, 2024

stschiff commented Apr 12, 2024

TCLamnidis commented Apr 29, 2024

TCLamnidis left a comment •

edited

Loading