Support for Additional Profiles and Clusters #29
…lusters for reference samples from database parameters
Looks good to me!
main.nf
if ((params.db_profiles && !params.db_clusters) || (!params.db_profiles && params.db_clusters)) {
You don't have to change this, because it's probably easier to understand the way you have it, but you could do this with an XOR: true only if one of the two is true.
I don't think Python (edit: Nextflow/Groovy) has an actual logical XOR operator, so it might reduce to something like:
if bool(params.db_profiles) != bool(params.db_clusters):
but again, just a comment, not necessary to change.
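As a minimal sketch of the XOR idea in plain Python (not the pipeline's Groovy code), the check below is true only when exactly one of the two values is set. The function name `exactly_one_set` and the example file names are hypothetical, used only for illustration.

```python
def exactly_one_set(db_profiles, db_clusters):
    """Return True when exactly one of the two parameters is truthy (logical XOR)."""
    # bool() normalises None, "" and non-empty paths before the comparison.
    return bool(db_profiles) != bool(db_clusters)


# Hypothetical parameter values, for illustration only.
print(exactly_one_set("profiles.tsv", None))            # True: only one supplied
print(exactly_one_set("profiles.tsv", "clusters.tsv"))  # False: both supplied
print(exactly_one_set(None, None))                      # False: neither supplied
```

A check like this would flag the invalid case (only one parameter given), which is the same condition as the longer `&&`/`||` expression in the diff.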
Great suggestion Eric - played around a bit and got it working here: 7d54942
It looks good, nothing to change. I always have suggestions though! This could be made clearer in the README.md (or with additional error reporting, but that is more work). Basically, it should be emphasized that any address levels in the additional databases that are not in the samplesheet addresses will be dropped. The error could be triggered if the maximum address size in the samplesheet is smaller than the number of columns/levels in the database, as those extra levels will be dropped (I believe, based on how csvtk concat works). For the README.md, you could emphasize that headers must match and do something like this for the example reference database.
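To illustrate the concern, here is a small Python sketch of a row-wise concatenation in which only the columns of the first table are kept, which is the behaviour the comment above attributes to csvtk concat. It does not call csvtk and is not the pipeline's code; the function name `concat_keep_first_header` and the column names (`id`, `level_1`, and so on) are made up for the example.

```python
import csv
import io


def concat_keep_first_header(first_tsv, second_tsv):
    """Concatenate two TSV strings by rows, keeping only the first table's columns."""
    first = csv.DictReader(io.StringIO(first_tsv), delimiter="\t")
    second = csv.DictReader(io.StringIO(second_tsv), delimiter="\t")
    header = list(first.fieldnames or [])
    rows = []
    for row in list(first) + list(second):
        # Columns absent from the first header are dropped;
        # missing ones are filled with "" in this sketch.
        rows.append({col: row.get(col, "") for col in header})
    dropped = [col for col in (second.fieldnames or []) if col not in header]
    return header, rows, dropped


# Hypothetical data: the samplesheet addresses have two levels, the database has three.
samplesheet = "id\tlevel_1\tlevel_2\nS1\t1\t1\n"
database = "id\tlevel_1\tlevel_2\tlevel_3\nD1\t1\t2\t3\n"
header, rows, dropped = concat_keep_first_header(samplesheet, database)
print(dropped)  # ['level_3']
```

Reporting the dropped columns, as `dropped` does here, is one way the additional error reporting mentioned above could surface the mismatch.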
Thanks so much for making these changes Kyla. Great work 😄
A few in-line comments below
…laps and loci mismatches with samples from input
…d_profiles and append_clusters functions
…ust be provided together
Thank you everyone for your review and suggestions!
Thanks so much Kyla for addressing all my comments. And for adding all those tests 😄. Amazing work.
I just have one more question given in-line.
Thanks so much for all the great work you've done with this Kyla. It looks amazing and handles so many more situations with sample names. I really appreciate it 😄
# Calculate the frequency of each sample_id across both sources
csvtk freq -t -f id combined_profiles.tsv > sample_counts.tsv

# For any sample_id that appears in both the reference and database, add a 'db_' prefix to the sample_id from the database
That's really cool. This would solve the issue with duplicates in all situations then 😄 . Thanks so much.
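A rough Python sketch of the renaming step described in the diff comments above: count how often each sample ID occurs across both sources, then prefix the database copy of any shared ID with `db_`. This is only an illustration, not the pipeline's csvtk-based implementation; the function name `prefix_database_duplicates` and the sample IDs are hypothetical.

```python
from collections import Counter


def prefix_database_duplicates(reference_ids, database_ids, prefix="db_"):
    """Prefix database sample IDs that also occur among the reference sample IDs."""
    # Count each unique sample_id once per source, so a count of 2 means
    # the ID is present in both the reference and the database.
    counts = Counter(set(reference_ids)) + Counter(set(database_ids))
    renamed = []
    for sample_id in database_ids:
        if counts[sample_id] > 1:
            renamed.append(prefix + sample_id)
        else:
            renamed.append(sample_id)
    return renamed


# Hypothetical sample IDs: "sample2" appears in both sources.
print(prefix_database_duplicates(["sample1", "sample2"], ["sample2", "sample3"]))
# ['db_sample2', 'sample3']
```

Only the database side is renamed, so the IDs from the input samplesheet stay unchanged.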
}
}

test("Test pipeline when appended profiles or clusters have sample_id overlap") {
Thank you for adding this test 😄
This update enhances the pipeline by integrating additional reference profiles and clusters from user-provided database parameters:
- --db_profiles will be incorporated through the APPEND_PROFILES process (which follows LOCIDEX_MERGE_REF).
- --db_clusters will be integrated via the APPEND_CLUSTERS process (which follows CLUSTER_FILE).
Each parameter is required by its respective process, and the two must be provided together; it is not possible to supply only one.
PR checklist
- nf-core lint
- nextflow run . -profile test,docker --outdir <OUTDIR>
- nextflow run . -profile debug,test,docker --outdir <OUTDIR>
- docs/output.md is updated.
- CHANGELOG.md is updated.
- README.md is updated (including new tool citations and authors/contributors).