fix: classify mature miRNAs #146
base: dev
Merge with dev branch
…olanlab/mirflowz into 126-docs-describe-workflow-rationale
In some places, the documentation of the script is written as if the script is a general purpose script for adding intersecting features to SAM files as a tag. But in other places, it becomes clear that there are several assumptions that limit the scope of the script (e.g., `miRNA_ID`, shifts/extensions) to miRNAs and isomiRs. I think you could clarify that a bit better and perhaps also make a reference to the scripts that produce valid inputs to this file, because it is highly unlikely that someone would create the inputs for this script manually.
Btw, it would have actually been nice to design this script such that it actually is a general purpose script for adding "name" tags for intersecting features to SAM files and do all the other stuff (dealing with alignments that don't have an intersecting feature, dealing with maximum extensions etc) elsewhere. As it is, this script is quite complicated and has basically zero chance of reuse outside of this workflow.
Anyway, not important now - just lessons for the future :)
So please just clarify the scope of the script in the module-level docstring and I think we are ready to go.
…zavolanlab/mirflowz into 143-fix-classify-correctly-mature-mirna
I believe the script per se is pretty general: it adds a custom tag showing which features an alignment intersects with. I think the problem comes when defining variables and how it is documented. In this sense, the script appears to be only for miRNAs whose annotations might or might not have been previously extended. But if the word I suggest to try and make the descriptions and names more general and if you do not see it clear, I will just revert the commit and document a more restricted scope.
Yes, you can do that if you like. The shift stuff is still quite specific, but I think it's best to keep it, so that we can come to an end on this soon and publish the workflow :)
Apart from the documentation issues, it should be fine.
Changes here should be overridden by #150, once merged, so I won't review
         assert captured.out == out_file.read()

     def test_main_bed_sam_extension_file(
         self, monkeypatch, capsys, bed_sam_extension
     ):
-        """Test main function with extension equals 6."""
+        """Test main function with extension and allowed shit equal to 6."""
I think you mean `shift` - though I understand how you could mix those up 🤣
@@ -52,30 +56,47 @@
hsa-miR-1323|0|0|21M|21|TCAAAACTGAGGGGCATTTTC
out SAM record:
48-1_1 0 19 5338 255 21M * 0 0 TCAAAACTGAGGGGCATTTTC * MD:Z:21 NH:i:1 NM:i:0 YW:Z:hsa-miR-1323|0|0|21M|21|TCAAAACTGAGGGGCATTTTC
explanation:
The aligned read and the annotated featrue have the same start and end
`feature` instead of `featrue`
explanation:
The aligned read and the annotated featrue have the same start and end
positions. Given that no extension are provided in the script call, no
coordinates adjustments are made. And there is no shift on ether end,
`either` instead of `ether`
If you think the functionality is truly generic now, please give the script a more appropriate name.
-has the features start and end coordinates extended, the number of additional
-nucleotides must be specified using the CLI option `--extension`. The SAM file
-must contain only the reads that have an intersecting feature.
+GFF3 file and `-b` a BAM file. The SAM file must only
You are jumping from BAM to SAM, it's a bit confusing. Make clear that the BAM file is an input to `bedtools intersect` in the previous call and the SAM file is an input to this script. Probably just add a line break and use list notation:
- The BED file [...] BAM file.
/ line break
- The SAM file [...]
-nucleotides must be specified using the CLI option `--extension`. The SAM file
-must contain only the reads that have an intersecting feature.
+GFF3 file and `-b` a BAM file. The SAM file must only
+contain alignments with an intersecting feature. If either the BED or the SAM
What does "must only contain alignments with an intersecting feature" mean?
I mean, I think I kinda know what it means, but it's not very clear. Intersecting with what? And how can the user create such a file, i.e., how can they be sure? And why don't we just ignore any alignments that do not intersect with any feature in BED?
 For each alignment, the name of the intersecting feature follows the
-format miRNA_ID|5'-shift|3'-shift|CIGAR|MD|READ_SEQ. The CLI option `--id`
-specifies the feature identifier to be used as miRNA_ID from the attributes
+format FEATURE_ID|5'-shift|3'-shift|CIGAR|MD|READ_SEQ. The CLI option `--id`
+specifies the feature identifier to be used as FEATURE_ID from the attributes
 column in the BED file. The 5' and 3' shift values are the difference between
-the feature (extended) start and end coordinates and the alignment ones. If
-`--extension` is provided, the feature start and end positions are adjusted by
-adding and subtracting respectively the given value. If both, the 5' and
-3'-end shifts, are within the range +/- extension (or equal 0 if no value is
-provided) the feature name is added to the alignment as the new tag "YW".
-Multiple intersecting feature names are separated by a semi-colon.
+the alignment and its intersecting feature(s) start and end coordinates
+respectively. If `--extension` is provided, features start and end positions
+are adjusted by adding and subtracting respectively the given value. If
+`--shift` is provided, and both, the 5' and 3'-end shifts, are within the
+range +/- `--shift` the feature name is added to the alignment as the new tag
I still find this hard to follow. The examples help, of course, but I think the description on its own should give the users a clear understanding of what `--shift` and `--extension` are for.
I would suggest that you use ChatGPT or sth. to improve this (but of course double check).
By the way, might be a good idea to use a GenAI chat bot to improve the docstrings, example explanations and CLI arg descriptions in this script as well, while you are at it, I've seen a few typos without reading into them too much.
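For reference, here is a minimal sketch of how the docstring under review can be read: `--extension` undoes the padding previously added to the feature coordinates, and `--shift` is the tolerance applied to the resulting 5' and 3' differences. The names `Feature` and `build_tag` are hypothetical and are not taken from `iso_name_tagging.py`. The sign convention (alignment coordinate minus feature coordinate) and the 1-based inclusive coordinates are assumptions as well.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Feature:
    """Hypothetical annotated feature with 1-based, inclusive coordinates."""

    feature_id: str
    start: int
    end: int


def build_tag(
    feat: Feature,
    aln_start: int,
    aln_end: int,
    cigar: str,
    md: str,
    seq: str,
    extension: int = 0,
    shift: int = 0,
) -> Optional[str]:
    """Return FEATURE_ID|5'-shift|3'-shift|CIGAR|MD|READ_SEQ or None."""
    # Undo the extension to recover the original feature coordinates.
    start = feat.start + extension
    end = feat.end - extension
    # Assumed sign convention: alignment coordinate minus feature coordinate.
    shift_5p = aln_start - start
    shift_3p = aln_end - end
    # Tag the alignment only if both ends lie within +/- shift.
    if abs(shift_5p) > shift or abs(shift_3p) > shift:
        return None
    fields = [feat.feature_id, str(shift_5p), str(shift_3p), cigar, md, seq]
    return "|".join(fields)


# Feature and read share start and end; no extension, no shift allowed.
exact = Feature("hsa-miR-1323", 5338, 5358)
print(build_tag(exact, 5338, 5358, "21M", "21", "TCAAAACTGAGGGGCATTTTC"))
```

Under this reading, `--extension` never widens the set of accepted alignments on its own; it only restores the original annotation, and `--shift` alone decides how far the read ends may deviate from it.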
This PR closes #143.
The isomiR notation used until now was ambiguous and led to incorrect counts, because the CIGAR and MD strings can be identical for different isomiR sequences. To resolve this, the read sequence is added to the isomiR name.
The changes required to accomplish that are:
- `iso_name_tagging.py` script to add the read sequence on the isomiR name
- `iso_name_tagging.py` to account for the isomiR new name format, and the corresponding files. Given that the previous unit test file was not testing for the functions alone, those tests have been added.
- `mirna_quantification.py` script to account for the new name format.
- `mirna_quantification.py` to account for the new name format, and the corresponding files.

In addition and to generalize the script scope:
- `iso_name_tagging.py` (`--shift`)
- `iso_name_tagging.py` in a more general way
- `quantify.smk` workflow
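
The ambiguity that motivates this PR can be illustrated with a small sketch. The SAM MD tag records the reference base at a mismatch, not the read base, so two reads that differ only in the mismatched nucleotide share the same CIGAR and MD strings. The alignments below are made up for illustration, the helper `name` is hypothetical, and the old name format (the same fields without the read sequence) is an assumption.

```python
from collections import Counter

# Hypothetical alignments to one miRNA: (CIGAR, MD, read sequence).
# All three mismatch the reference base C at the last position, so the
# CIGAR and MD strings are identical although two sequences are present.
alignments = [
    ("21M", "20C0", "TCAAAACTGAGGGGCATTTTA"),
    ("21M", "20C0", "TCAAAACTGAGGGGCATTTTG"),
    ("21M", "20C0", "TCAAAACTGAGGGGCATTTTA"),
]


def name(feature_id, cigar, md, seq=None):
    """Build an isomiR name, with or without the read sequence."""
    fields = [feature_id, "0", "0", cigar, md]
    if seq is not None:
        fields.append(seq)
    return "|".join(fields)


# Assumed old format: distinct sequences collapse into a single name.
old_counts = Counter(name("hsa-miR-1323", c, m) for c, m, _ in alignments)
# New format: each distinct read sequence gets its own name.
new_counts = Counter(name("hsa-miR-1323", c, m, s) for c, m, s in alignments)

print(len(old_counts), len(new_counts))
```

Counting by the shorter name merges the two sequences into one isomiR, while appending the read sequence keeps them apart.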