Completing 2020_Margaryan_Viking with jannocoalesce #5
@stschiff and I had a very productive discussion about these issues yesterday. Here's what we came up with:
So this problem can be solved just by running
2., 4. and 6. all point to the need for a checklist for the data editor and reviewers. Maybe this is the next step in specifying this process: We have to come up with a concrete list of ToDos and then write it up for the documentation. I imagine this list to be available on GitHub as well through a GitHub action: When the PR gets opened, @delphis-bot posts the checklist there for the human editor to go through.
Specifically, the Could at minimum rename it, and remove the Eager_ID column instead.
The following columns should be filled in by Minotaur, not the PCA!
(This is a TODO for my implementation, not something for here)
Capture_Type (SSF)
UDG (TSV)
Library_Built (SSF)
Genotype_Ploidy (hard-coded)
Coverage_on_Target_SNPs (pyEager)
Genetic_Source_Accession_IDs (SSF)
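For whoever ends up checking this by hand: a rough sketch of how one could list which of these columns are still empty in a .janno file after a run. It assumes the file is tab-separated, has a Poseidon_ID column, and marks missing values with n/a. The function name and the two sample rows are made up for illustration; this is not part of Minotaur or trident.

```python
import csv
import io

# Columns from the list above that Minotaur is expected to fill.
EXPECTED = [
    "Capture_Type",
    "UDG",
    "Library_Built",
    "Genotype_Ploidy",
    "Coverage_on_Target_SNPs",
    "Genetic_Source_Accession_IDs",
]
MISSING = {"", "n/a"}  # assumption: these two spellings mean "not filled"


def unfilled_columns(janno_text, columns=EXPECTED):
    """Map each column to the Poseidon_IDs whose value is missing (or the column is absent)."""
    rows = list(csv.DictReader(io.StringIO(janno_text), delimiter="\t"))
    report = {}
    for col in columns:
        ids = [
            row.get("Poseidon_ID", "?")
            for row in rows
            if (row.get(col) or "").strip() in MISSING
        ]
        if ids:
            report[col] = ids
    return report


# Hypothetical two-sample janno excerpt, for illustration only.
example = (
    "Poseidon_ID\tCapture_Type\tUDG\tGenotype_Ploidy\n"
    "sample1\tShotgun\tminus\thaploid\n"
    "sample2\tn/a\tminus\thaploid\n"
)
report = unfilled_columns(example)
```

Columns that are absent from the file altogether are reported for every sample, which is probably what an editor would want to see in a PR.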
Thanks for all this, @TCLamnidis. Some comments
Yes, great! But again, not necessarily something that needs to work automatically. Our validation CI would catch this and provide useful pointers to the user to fix it manually within the PR after minotaur!
I like keeping Eager_ID. I have no problem with adding a whole bunch of Minotaur/Eager-related columns that are not in the schema. It's exactly why we made this flexible.
Hmm, so do you want to check whether it's SG and then fill it, but not when it's capture?
Yes, fine to not do this for now.
Yes, but let's please not spend weeks on getting a doi->bib converter ready. Again something the user / we could fill manually.
As of April 12, new processing is on the way.
This PR and discussion was very helpful. I have now recreated the
I ran the following command. This finds a match for every sample in the target file from the source file.
trident jannocoalesce \
  -s ../../community-archive/2020_Margaryan_Viking/Margaryan_Viking.janno \
  -t 2020_Margaryan_Viking.janno \
  --stripIdRegex "(\.SG$)|(_MNT$)"
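To make the matching step a bit more concrete, here is a small sketch of what the regex does to sample IDs. It assumes the pattern is stripped from the IDs on both sides before they are compared, and it uses Python's `re` in place of whatever regex engine trident uses (for this particular pattern the two should behave the same). The sample IDs are invented.

```python
import re

# Same pattern as passed to --stripIdRegex in the command above.
STRIP_ID = re.compile(r"(\.SG$)|(_MNT$)")


def strip_id(sample_id):
    """Remove a trailing '.SG' or '_MNT' suffix, if present."""
    return STRIP_ID.sub("", sample_id)


def match_ids(source_ids, target_ids):
    """For each target ID, find the source ID with the same stripped key (or None)."""
    source_by_key = {strip_id(s): s for s in source_ids}
    return {t: source_by_key.get(strip_id(t)) for t in target_ids}


# Hypothetical IDs: the source uses '.SG', the target uses '_MNT'.
source_ids = ["VK1.SG", "VK2.SG"]
target_ids = ["VK1_MNT", "VK2_MNT", "VK3_MNT"]
matches = match_ids(source_ids, target_ids)
```

Here `VK3_MNT` has no counterpart in the source, so it maps to `None`; in the actual run every target sample found a match.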
Here are my observations:
1. jannocoalesce seems to work reasonably well. I only see two potential changes we could consider: As it is very verbose for large files we may want to reduce the amount of [Info] output, and --fillColumns could get negative selection. Maybe I propose something in Jannocoalesce poseidon-hs#282
2. [...] Group_Names, but instead only fills the columns with Unknown. I think I understand why this is the case and it's a pretty big issue. We can not fix this with jannocoalesce as things stand right now.
3. [...] Main_ID, RateErrX, RateErrY, RateX, RateY.
4. [...] AADR and AADRv443 citations here.
5. [...] Note field (e.g. PASS (literature))
6. jannocoalesce does not fill .bib files, so the MargaryanWillerslevNature2020 citation is missing.

(2) and (6) somehow point to the need for a packagecoalesce command 🤔
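Until something like that exists, the missing-citation case could be caught with a small check along these lines. This is only a sketch: it assumes the citation keys sit in the Publication column separated by `;`, that n/a marks a missing value, and that .bib entries start with `@type{key,`. The function names and the placeholder .bib entry are hypothetical.

```python
import re

# Matches the key in entries such as '@article{SomeKey2020,'.
BIB_KEY = re.compile(r"@\w+\s*\{\s*([^,\s]+)\s*,")


def bib_keys(bib_text):
    """Collect all entry keys defined in a .bib file's text."""
    return set(BIB_KEY.findall(bib_text))


def missing_citations(publication_cells, bib_text):
    """Return keys cited in the Publication cells that have no .bib entry."""
    cited = {
        key.strip()
        for cell in publication_cells
        for key in cell.split(";")
        if key.strip() not in {"", "n/a"}
    }
    return sorted(cited - bib_keys(bib_text))


# Illustrative input: two Publication cells and a .bib with one placeholder entry.
publication_cells = ["MargaryanWillerslevNature2020;AADRv443", "MargaryanWillerslevNature2020"]
bib_text = "@misc{AADRv443,\n  title = {placeholder}\n}\n"
missing = missing_citations(publication_cells, bib_text)
```

With this input, `missing` contains only `MargaryanWillerslevNature2020`, which mirrors the situation described in point 6.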