fix: classify mature miRNAs #146

deliaBlue · 2024-08-18T17:29:07Z

This PR closes #143.

The isomiR notation used until now was ambiguous and led to incorrect counts, because the CIGAR and MD strings can be identical for different isomiR sequences. To resolve this, the read sequence is added to the isomiR name.
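To illustrate the ambiguity, here is a minimal sketch (not the code in this PR). It assumes the previous name format was miRNA_ID|5'-shift|3'-shift|CIGAR|MD; the helper names `md_ungapped`, `old_name` and `new_name` are hypothetical. The MD string records the reference base at a mismatch, not the read base, so two reads with different substitutions at the same position collapse into one name.

```python
def md_ungapped(ref: str, read: str) -> str:
    """Build an MD string for an ungapped, equal-length alignment."""
    md, matches = [], 0
    for ref_base, read_base in zip(ref, read):
        if ref_base == read_base:
            matches += 1
        else:
            # MD stores the number of matches, then the REFERENCE base.
            md.append(f"{matches}{ref_base}")
            matches = 0
    md.append(str(matches))
    return "".join(md)


def old_name(mirna_id, shift_5p, shift_3p, cigar, md):
    """Assumed previous format, without the read sequence."""
    return f"{mirna_id}|{shift_5p}|{shift_3p}|{cigar}|{md}"


def new_name(mirna_id, shift_5p, shift_3p, cigar, md, read_seq):
    """Format with the read sequence appended."""
    return f"{old_name(mirna_id, shift_5p, shift_3p, cigar, md)}|{read_seq}"


reference = "TCAAAACTGAGGGGCATTTTC"  # hsa-miR-1323 example from the docstring
read_a = "TCAAAACTGAAGGGCATTTTC"     # G>A at position 11
read_b = "TCAAAACTGACGGGCATTTTC"     # G>C at position 11

for read in (read_a, read_b):
    md = md_ungapped(reference, read)
    print(old_name("hsa-miR-1323", 0, 0, "21M", md))
    print(new_name("hsa-miR-1323", 0, 0, "21M", md, read))
```

Both reads yield the same old-style name, while the names that include the read sequence differ.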

The changes required to accomplish that are:

- Modify the iso_name_tagging.py script to add the read sequence to the isomiR name.
- Modify the unit tests for iso_name_tagging.py, and the corresponding test files, to account for the new isomiR name format. Because the previous unit test file did not test the functions in isolation, those tests have been added.
- Modify the mirna_quantification.py script to account for the new name format.
- Modify the unit tests for mirna_quantification.py, and the corresponding test files, to account for the new name format.
- Update the expected output of the pipeline to include the read sequence in the isomiR names.

In addition, to generalize the scope of the script:

- Add a new CLI argument to iso_name_tagging.py (--shift)
- Document iso_name_tagging.py in a more general way
- Modify the unit tests to account for the new CLI argument
- Add the new argument to the quantify.smk workflow

uniqueg

In some places, the documentation of the script is written as if the script is a general purpose script for adding intersecting features to SAM files as a tag. But in other places, it becomes clear that there are several assumptions that limit the scope of the script (e.g., miRNA_ID, shifts/extensions) to miRNAs and isomiRs. I think you could clarify that a bit better and perhaps also make a reference to the scripts that produce valid inputs to this file, because it is highly unlikely that someone would create the inputs for this script manually.

Btw, it would have actually been nice to design this script such that it actually is a general purpose script for adding "name" tags for intersecting features to SAM files and do all the other stuff (dealing with alignments that don't have an intersecting feature, dealing with maximum extensions etc) elsewhere. As it is, this script is quite complicated and has basically zero chance of reuse outside of this workflow.

Anyway, not important now - just lessons for the future :)

So please just clarify the scope of the script in the module-level docstring and I think we are ready to go.

scripts/iso_name_tagging.py

deliaBlue · 2024-08-31T20:54:18Z

I believe the script per se is pretty general: it adds a custom tag showing which features an alignment intersects with. I think the problem lies in how the variables are named and how the script is documented. As written, the script appears to apply only to miRNAs whose annotations may or may not have been extended beforehand. But if the word extension is changed to shift or range, and the description becomes "Allowed shift range between either end of the alignment and the intersecting feature" (or a better description, this is just to give you the idea), then it becomes more general and the script does not have to change much.
Another example is the way the tag format is specified. If I use intersecting_feature instead of miRNA_ID, any kind of sequence can be used as long as it has been intersected using Bedtools intersect.

I suggest trying to make the descriptions and names more general; if you do not find the result clear, I will just revert the commit and document a more restricted scope.
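As a rough illustration of the generic behaviour described here, the sketch below appends a custom tag listing the intersecting feature names to a SAM record. It is not the script's code: the function name `add_feature_tag` is hypothetical, and the "YW" tag and the semicolon separator are taken from the docstring quoted in the review of this PR.

```python
def add_feature_tag(sam_record: str, feature_names: list, tag: str = "YW") -> str:
    """Append a string-typed tag with the intersecting feature names."""
    fields = sam_record.rstrip("\n").split("\t")
    # Multiple intersecting feature names are joined with a semicolon.
    fields.append(f"{tag}:Z:{';'.join(feature_names)}")
    return "\t".join(fields)


record = "\t".join([
    "48-1_1", "0", "19", "5338", "255", "21M", "*", "0", "0",
    "TCAAAACTGAGGGGCATTTTC", "*", "MD:Z:21", "NH:i:1", "NM:i:0",
])
name = "hsa-miR-1323|0|0|21M|21|TCAAAACTGAGGGGCATTTTC"
print(add_feature_tag(record, [name]))
```

Nothing in this step depends on the feature being a miRNA; only the way the name itself is built carries the miRNA-specific assumptions.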

uniqueg · 2024-09-02T09:46:04Z

Yes, you can do that if you like. The shift stuff is still quite specific, but I think it's best to keep it, so that we can come to an end on this soon and publish the workflow :)

uniqueg

Apart from the documentation issues, it should be fine.

uniqueg · 2025-02-20T02:14:36Z

.github/workflows/tests.yml

Changes here should be overridden by #150, once merged, so I won't review

uniqueg · 2025-02-20T02:17:14Z

scripts/tests/test_iso_name_tagging.py

            assert captured.out == out_file.read()

    def test_main_bed_sam_extension_file(
        self, monkeypatch, capsys, bed_sam_extension
    ):
-        """Test main function with extension equals 6."""
+        """Test main function with extension and allowed shit equal to 6."""


I think you mean shift - though I understand how you could mix those up 🤣

uniqueg · 2025-02-20T02:20:24Z

scripts/iso_name_tagging.py

@@ -52,30 +56,47 @@
        hsa-miR-1323|0|0|21M|21|TCAAAACTGAGGGGCATTTTC
    out SAM record:
        48-1_1	0	19	5338	255	21M	*	0	0	TCAAAACTGAGGGGCATTTTC	*	MD:Z:21	NH:i:1	NM:i:0	YW:Z:hsa-miR-1323|0|0|21M|21|TCAAAACTGAGGGGCATTTTC
+    explanation:
+        The aligned read and the annotated featrue have the same start and end


feature instead of featrue

uniqueg · 2025-02-20T02:20:49Z

scripts/iso_name_tagging.py

+    explanation:
+        The aligned read and the annotated featrue have the same start and end
+        positions. Given that no extension are provided in the script call, no
+        coordinates adjustments are made. And there is no shift on ether end,


either instead of ether

uniqueg · 2025-02-20T02:22:31Z

scripts/iso_name_tagging.py

If you think the functionality is truly generic now, please give the script a more appropriate name.

uniqueg · 2025-02-20T02:28:01Z

scripts/iso_name_tagging.py

-has the features start and end coordinates extended, the number of additional
-nucleotides must be specified using the CLI option `--extension`. The SAM file
-must contain only the reads that have an intersecting feature.
+GFF3 file and `-b` a BAM file. The SAM file must only


You are jumping from BAM to SAM, it's a bit confusing. Make clear that the BAM file is an input to bedtools intersect in the previous call and the SAM file is an input to this script. Probably just add a line break and use list notation:

- The BED file [...] BAM file. / line break
- The SAM file [...]

uniqueg · 2025-02-20T02:34:16Z

scripts/iso_name_tagging.py

-nucleotides must be specified using the CLI option `--extension`. The SAM file
-must contain only the reads that have an intersecting feature.
+GFF3 file and `-b` a BAM file. The SAM file must only
+contain alignments with an intersecting feature. If either the BED or the SAM


What does this mean?

must only contain alignments with an intersecting feature

I mean, I think I kinda know what it means, but it's not very clear. Intersecting with what? And how can the user create such a file, i.e., how can they be sure? And why don't we just ignore any alignments that do not intersect with any feature in BED?
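One way the "just ignore them" alternative could look, as a sketch only: skip SAM records whose read name does not appear among the reads reported by the intersection step. The function `keep_intersecting` and its inputs are hypothetical and not part of the script. Note that a read name alone does not uniquely identify an alignment for multimapping reads, so a real implementation would probably need to match on position as well.

```python
def keep_intersecting(sam_lines, intersecting_read_names):
    """Yield header lines and only those alignments with an intersecting feature."""
    for line in sam_lines:
        if line.startswith("@"):
            # Header lines are passed through unchanged.
            yield line
            continue
        read_name = line.split("\t", 1)[0]
        if read_name in intersecting_read_names:
            yield line


sam = [
    "@HD\tVN:1.0\tSO:unsorted\n",
    "48-1_1\t0\t19\t5338\t255\t21M\t*\t0\t0\tTCAAAACTGAGGGGCATTTTC\t*\n",
    "99-1_1\t0\t19\t9000\t255\t21M\t*\t0\t0\tACGTACGTACGTACGTACGTA\t*\n",
]
for line in keep_intersecting(sam, {"48-1_1"}):
    print(line, end="")
```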

uniqueg · 2025-02-20T02:48:29Z

scripts/iso_name_tagging.py

 For each alignment, the name of the intersecting feature follows the
-format miRNA_ID|5'-shift|3'-shift|CIGAR|MD|READ_SEQ. The CLI option `--id`
-specifies the feature identifier to be used as miRNA_ID from the attributes
+format FEATURE_ID|5'-shift|3'-shift|CIGAR|MD|READ_SEQ. The CLI option `--id`
+specifies the feature identifier to be used as FEATURE_ID from the attributes
 column in the BED file. The 5' and 3' shift values are the difference between
-the feature (extended) start and end coordinates and the alignment ones. If
-`--extension` is provided, the feature start and end positions are adjusted by
-adding and subtracting respectively the given value. If both, the 5' and
-3'-end shifts, are within the range +/- extension (or equal 0 if no value is
-provided) the feature name is added to the alignment as the new tag "YW".
-Multiple intersecting feature names are separated by a semi-colon.
+the alignment and its intersecting feature(s) start and end coordinates
+respectively. If `--extension` is provided, features start and end positions
+are adjusted by adding and subtracting respectively the given value. If
+`--shift` is provided, and both, the 5' and 3'-end shifts, are within the
+range +/- `--shift` the feature name is added to the alignment as the new tag


I still find this hard to follow. The examples help, of course, but I think the description on its own should give the users a clear understanding of what --shift and --extension are for.

I would suggest that you use ChatGPT or sth. to improve this (but of course double check).

By the way, it might be a good idea to use a GenAI chat bot to improve the docstrings, example explanations and CLI arg descriptions in this script as well, while you are at it. I've seen a few typos without reading into them too much.
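For readers trying to follow the description, here is one possible reading of how `--extension` and `--shift` interact, written as a small sketch. It is an interpretation of the docstring, not the script's code: the function names, the sign convention (alignment minus feature) and the decision to ignore strand are all assumptions.

```python
def compute_shifts(aln_start, aln_end, feat_start, feat_end, extension=0):
    """Return (5'-shift, 3'-shift) of an alignment relative to a feature.

    If the feature coordinates were extended beforehand, `extension` is
    added to the start and subtracted from the end to recover the
    original coordinates. Strand is ignored in this sketch.
    """
    orig_start = feat_start + extension
    orig_end = feat_end - extension
    return aln_start - orig_start, aln_end - orig_end


def within_allowed_shift(shift_5p, shift_3p, max_shift=0):
    """True if both end shifts lie within +/- max_shift."""
    return abs(shift_5p) <= max_shift and abs(shift_3p) <= max_shift


# Alignment starting one nucleotide upstream of a feature that was
# extended by 6 nt on each side (coordinates are illustrative).
shifts = compute_shifts(5337, 5358, 5332, 5364, extension=6)
print(shifts, within_allowed_shift(*shifts, max_shift=6))
```

Under this reading, `--extension` only describes how the annotation was prepared, while `--shift` decides which alignments receive the tag.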

deliaBlue and others added 30 commits December 22, 2023 18:45

docs: expand workflow description

7ed9971

docs: expand rule descriptions

ecc920c

Merge branch 'dev' into 126-docs-describe-workflow-rationale

b71d23b

Merge with dev branch

docs: fix typos

cde4393

Merge branch 'dev' into 126-docs-describe-workflow-rationale

8ce1821

docs: update rules

ce4e5b7

docs: update rules

4dfe8be

Merge branch 'dev' into 126-docs-describe-workflow-rationale

b5c7200

merge with dev

fix: set correct wildcard

9f8f99e

docs: complete rules improvement

1e345dd

revert: undo refactoring

a4049b1

Merge branch 'dev' into 126-docs-describe-workflow-rationale

ed8d595

Merge branch 'dev' into 126-docs-describe-workflow-rationale

8f23697

docs: update main README

1eeb07a

docs: update pipeline documentation

ada71c0

refactor: correct shift assessment

9030112

docs: update pipeline documentation

f37af52

docs: update README

dc5028e

fix typos2

bbb0727

docs: extend workflow description

c1bc82c

refactor: account for new notation

26d7813

Merge branch 'dev' into 126-docs-describe-workflow-rationale

077cf8b

Merge branch 'dev' into 126-docs-describe-workflow-rationale

9e3c639

Merge branch '126-docs-describe-workflow-rationale' of github.com:zav…

cfba29e

…olanlab/mirflowz into 126-docs-describe-workflow-rationale

merge dev branch

adc5833

docs: complete documentation

b57a5aa

docs: complete documentation

7741065

refactor: adapt script to new name format

694a537

refactor: add read seq to isomiR tag

177c7ae

test: update unit test files

c578d9e

deliaBlue added 3 commits August 18, 2024 13:45

test: adjust for isomir new name format

68fb655

revert: undo documetation branch merge

590b93d

revert: undo documentation branch merge

e1643ca

deliaBlue added the bug and high priority labels Aug 18, 2024

deliaBlue requested a review from uniqueg August 18, 2024 17:29

deliaBlue self-assigned this Aug 18, 2024

deliaBlue linked an issue Aug 18, 2024 that may be closed by this pull request

fix: classify correctly mature miRNA #143

Open

deliaBlue added 4 commits August 18, 2024 19:33

ci: fix static code analysis

032af37

ci: fix static code analysis

55712c7

ci: fix static code analysis

ce0d897

ci: fix static code analysis

fcc5ac4

uniqueg changed the title ~~fix: classify correctly mature mirna~~ fix: classify mature mirnas Aug 19, 2024

deliaBlue added 3 commits August 19, 2024 12:30

Merge branch 'dev' into 143-fix-classify-correctly-mature-mirna

380cf76

docs: add new isomiR notation

012645d

docs: add new isomiR notation

a516667

uniqueg requested changes Aug 19, 2024

deliaBlue and others added 2 commits August 31, 2024 17:37

Merge branch 'dev' into 143-fix-classify-correctly-mature-mirna

8173bf6

Merge branch '143-fix-classify-correctly-mature-mirna' of github.com:…

4fd3f4b

…zavolanlab/mirflowz into 143-fix-classify-correctly-mature-mirna

test: remove duplicated file

ccfd504

deliaBlue added 6 commits January 12, 2025 18:10

refactor: add CLI argument

85090da

refactor: modify tests

c40efce

refactor: generalise script and add CLI

6a384eb

ci: upgrade miniconda to v3

e2e9bf3

ci: pin ubuntu version

3add762

ci: set strict channel priority to true

fbf26bb

uniqueg requested changes Feb 20, 2025

uniqueg changed the title ~~fix: classify mature mirnas~~ fix: classify mature miRNAs Feb 20, 2025
