Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Completing 2020_Margaryan_Viking with jannocoalesce #5

Closed

Conversation

nevrome
Copy link
Member

@nevrome nevrome commented Feb 7, 2024

I ran the following command. This finds a match for every sample in the target file from the source file.

trident jannocoalesce \
  -s ../../community-archive/2020_Margaryan_Viking/Margaryan_Viking.janno \
  -t 2020_Margaryan_Viking.janno \
  --stripIdRegex "(\.SG$)|(_MNT$)"

Here are my observations:

  1. jannocoalesce seems to work reasonably well. I only see two potential changes we could consider: As it is very verbose for large files we may want to reduce the amount of [Info] output, and --fillColumns could get negative selection. Maybe I propose something in Jannocoalesce poseidon-hs#282
  2. The minotaur pipeline does not seem to add Group_Names, but instead only fills the columns with Unknown. I think I understand why this is the case and it's a pretty big issue. We can not fix this with jannocoalesce as things stand right now.
  3. We should try to specify Main_ID, RateErrX, RateErrY, RateX, RateY.
  4. I'm not sure if we still need the AADR and AADRv443 citations here.
  5. Same applies for the decision in the Note field (e.g. PASS (literature))
  6. jannocoalesce does not fill .bib files, so the MargaryanWillerslevNature2020 citation is missing.

(2) and (6) somehow point to the need for a packagecoalesce command 🤔

@nevrome
Copy link
Member Author

nevrome commented Feb 21, 2024

@stschiff and I had a very productive discussion about these issues yesterday. Here's what we came up with:

  1. I will apply the desired changes to jannocoalesce asap.
  2. Instead of adding a custom jannocoalesce option to overwrite Unknown in the Group_Name column we can just use the --force flag to do this for us:
trident jannocoalesce  \
   -s ../../community-archive/2020_Margaryan_Viking/Margaryan_Viking.janno  \
   -t 2020_Margaryan_Viking.janno \
   --stripIdRegex "(\.SG$)|(_MNT$)" \
   --fillColumns "Group_Name" \
   --force

So this problem can be solved just by running jannocoalesce twice.

  1. I opened an issue for the sex rate columns: Specify sex determination variables RateX, RateY, RateErrX, RateErrY poseidon-schema#73. The Main_ID is a bigger challenge and I'm still not sure if and how to approach this.
  2. Stephan and I agree that it's not necessary to cite the AADR for each sample in the minotaur-archive when we only take the context info from the .janno files from the community archive. These citations should be removed from the Publication column (or not even added in the first place, see 6.).
  3. We didn't really discuss what to do with this AADR data quality levels. What I'm wondering now: @TCLamnidis would it be possible to come up with a sample quality metric in the minotaur workflow? Some broad summary statistic that indicates how trustworthy the genomic data in a given sample may be? Quantifying and defining such a statistic could be a major achievement 🤔
  4. Stephan suggested not to fill the Publication column at all from the community-archive .janno files, but instead leave this entirely to the human editor.

2., 4. and 6. all point to the need for a checklist for the data editor and reviewers. Maybe this is the next step in specifying this process: We have to come up with a concrete list of ToDos and then write it up for the documentation. I imagine this list to be available on GitHub as well through a GitHub action: When the PR gets opened, @delphis-bot posts the checklist there for the human editor to go through.

@TCLamnidis
Copy link
Member

  1. We would then need to update the fam/ind files with these changes as well, as they all need to match. That is something I have already implemented in other workflows, so could just port that over to the poseidon-eager repo as an additional helper script.
  2. The background implementation of the Main_ID is a huge undertaking ofc, but keeping this column around is simple enough. And then it is there when needed. Similarly, I wonder if columns like Eager_ID should be dropped. It is useful for debugging but there is a sweet spot between not losing any meta-info and over-encumbering a package janno with "irrelevant" columns. Thoughts?
  3. I think we should have a clear selection of columns that we fill-in anyway. There are plenty of values that might be missing in Minotaur but not the community archive that we might not want to port over. e.g. Endogenous DNA has a pretty clear specification, but in cases where the SG data was not uploaded to the ENA, it would potentially be misleading to fill the information in with the reported values in PMA packages 🤔
  4. That is quite an undertaking. Would happily discuss this further, but given the heterogeneity in the input for minotaur, I am not sure this goal is very realistic as an automated task.
  5. I agree. At present, Minotaur processes data by PUBLICATION (though one could easily add raw data from other papers in the SSF 🤔 ). As such, we have an issue with the doi of the relevant paper, and could use that to fill in the bib file and publication entries.

@TCLamnidis
Copy link
Member

Specifically, the Main_ID column:
I think we should not remove the column from MNT jannos. Re-inferring the correct values of the column is a larger task than removing it later on. Plus, since the column does not currently exist in the PCA/PAA, we can simply tranfer the Main_IDs from MNT to the other archives when/if we start using the column for things.

Could at minimum rename it, and remove the Eager_ID column instead.

Copy link
Member

@TCLamnidis TCLamnidis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The following columns should be filled in by Minotaur, not the PCA!
(This is a TODO for my implementation, not sth for here)

Capture_Type (SSF)
UDG (TSV)
Library_Built (SSF)
Genotype_Ploidy (hard-coded)
Coverage_on_Target_SNPs (pyEager)
Genetic_Source_Accession_IDs (SSF)

@stschiff
Copy link
Member

Thanks for all this, @TCLamnidis. Some comments

  1. We would then need to update the fam/ind files with these changes as well, as they all need to match. That is something I have already implemented in other workflows, so could just port that over to the poseidon-eager repo as an additional helper script.

Yes, great! But again, not necessarily something that needs to work automatically. Our validation CI would catch this and provide useful pointers to the user to fix it manually within the PR after minotaur!

  1. The background implementation of the Main_ID is a huge undertaking ofc, but keeping this column around is simple enough. And then it is there when needed. Similarly, I wonder if columns like Eager_ID should be dropped. It is useful for debugging but there is a sweet spot between not losing any meta-info and over-encumbering a package janno with "irrelevant" columns. Thoughts?

I like keeping Eager_ID. I have no problem if adding a whole bunch of Minotaur/Eager-related columns that are not in the schema. It's exactly why we made this flexible.

  1. I think we should have a clear selection of columns that we fill-in anyway. There are plenty of values that might be missing in Minotaur but not the community archive that we might not want to port over. e.g. Endogenous DNA has a pretty clear specification, but in cases where the SG data was not uploaded to the ENA, it would potentially be misleading to fill the information in with the reported values in PMA packages 🤔

Hmm, so do you want to check whether it's SG and then fill it, but not when it's capture?

  1. That is quite an undertaking. Would happily discuss this further, but given the heterogeneity in the input for minotaur, I am not sure this goal is very realistic as an automated task.

Yes, fine to not do this for now.

  1. I agree. At present, Minotaur processes data by PUBLICATION (though one could easily add raw data from other papers in the SSF 🤔 ). As such, we have an issue with the doi of the relevant paper, and could use that to fill in the bib file and publication entries.

Yes, but let's please not spend weeks on getting a doi->bib converter ready. Again something the user / we could fill manually.

@stschiff
Copy link
Member

As of April 12, new processing is on the way.

@TCLamnidis
Copy link
Member

This PR and discussion was very helpful. I have now recreated the 2020_Margaryan_Viking minotaur package, and used jannocoalesce to fill in metadata.
I will therefore close this PR, as it is no longer relevant.

@TCLamnidis TCLamnidis closed this Apr 29, 2024
@TCLamnidis TCLamnidis deleted the 2020_Margaryan_Viking_jannocoalesce branch April 30, 2024 11:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants