fix!: `SequenceLocation` `start`/`end` for `TranscriptSegmentElement` #172

korikuzma · 2024-07-29T14:42:30Z

close #171

The tests for transcript segment elements are kind of confusing IMO. The `transcript_segments` test fixture in `test_models` leverages `sequence_locations`. Some of these sequence locations have both `start` and `end`, but we only use one or the other. I didn't change these because I don't think the values are actually tested. We can revisit in the future.

jarbesfeld · 2024-07-29T15:52:17Z

@ahwagner My question pertains to how to use start and end when representing a TranscriptSegmentElement. Assume we have a transcript segment that covers exons 1 to 8 for the gene TPM3, which aligns to the negative strand. Does the example below use start and end correctly? I'm assuming this would be reversed when using the positive strand. Additionally, should start and end be described using 0-indexed or 1-indexed values?

from fusor.models import TranscriptSegmentElement


def transcript_segment_element():
    """Create transcript segment element test fixture"""
    params = {
        "type": "TranscriptSegmentElement",
        "exonEnd": 8,
        "exonEndOffset": 0,
        "exonStart": 1,
        "exonStartOffset": 0,
        "gene": {
            "id": "hgnc:12012",
            "label": "TPM3",
            "type": "Gene",
        },
        "transcript": "refseq:NM_152263.3",
        "elementGenomicStart": {
            "id": "ga4gh:SL.2K1vML0ofuYrYncrzzXUQOISRFJldZrO",
            "type": "SequenceLocation",
            "sequenceReference": {
                "id": "refseq:NC_000001.11",
                "refgetAccession": "SQ.Ya6Rs7DHhDeg7YaOSg1EoNi3U_nQ9SvO",
                "type": "SequenceReference",
            },
            "end": 154192135,  # upper genomic boundary of exon 1 (negative strand)
        },
        "elementGenomicEnd": {
            "id": "ga4gh:SL.rtR6x2NnJEpROlxiT_DY9C-spf6ijYQi",
            "type": "SequenceLocation",
            "sequenceReference": {
                "id": "refseq:NC_000001.11",
                "refgetAccession": "SQ.Ya6Rs7DHhDeg7YaOSg1EoNi3U_nQ9SvO",
                "type": "SequenceReference",
            },
            "start": 154170399,  # lower genomic boundary of exon 8 (negative strand)
        },
    }
    return TranscriptSegmentElement(**params)

korikuzma · 2024-07-29T16:06:23Z

@ahwagner Here is what UTA provides. Note that UTA coordinates are inter-residue based.

uta=> select * from tx_exon_aln_v where alt_ac = 'NC_000001.11' and tx_ac = 'NM_152263.3';
+------+-------------+--------------+----------------+------------+-----+------------+----------+-------------+-----------+-------+---------+----------+----------------+-----------------+------------+-------------+-------------+
| hgnc |    tx_ac    |    alt_ac    | alt_aln_method | alt_strand | ord | tx_start_i | tx_end_i | alt_start_i | alt_end_i | cigar | tx_aseq | alt_aseq | tx_exon_set_id | alt_exon_set_id | tx_exon_id | alt_exon_id | exon_aln_id |
+------+-------------+--------------+----------------+------------+-----+------------+----------+-------------+-----------+-------+---------+----------+----------------+-----------------+------------+-------------+-------------+
| TPM3 | NM_152263.3 | NC_000001.11 | splign         |         -1 |   0 |          0 |      234 |   154191901 | 154192135 | 234=  |         |          |         103034 |          789739 |     987226 |     6770368 |     4285931 |
| TPM3 | NM_152263.3 | NC_000001.11 | splign         |         -1 |   1 |        234 |      360 |   154191185 | 154191311 | 126=  |         |          |         103034 |          789739 |     987227 |     6770369 |     4285885 |
| TPM3 | NM_152263.3 | NC_000001.11 | splign         |         -1 |   2 |        360 |      494 |   154176114 | 154176248 | 134=  |         |          |         103034 |          789739 |     987228 |     6770370 |     4285916 |
| TPM3 | NM_152263.3 | NC_000001.11 | splign         |         -1 |   3 |        494 |      612 |   154173083 | 154173201 | 118=  |         |          |         103034 |          789739 |     987229 |     6770371 |     4285871 |
| TPM3 | NM_152263.3 | NC_000001.11 | splign         |         -1 |   4 |        612 |      683 |   154172907 | 154172978 | 71=   |         |          |         103034 |          789739 |     987230 |     6770372 |     4285877 |
| TPM3 | NM_152263.3 | NC_000001.11 | splign         |         -1 |   5 |        683 |      759 |   154171412 | 154171488 | 76=   |         |          |         103034 |          789739 |     987231 |     6770373 |     4285929 |
| TPM3 | NM_152263.3 | NC_000001.11 | splign         |         -1 |   6 |        759 |      822 |   154170648 | 154170711 | 63=   |         |          |         103034 |          789739 |     987232 |     6770374 |     4285883 |
| TPM3 | NM_152263.3 | NC_000001.11 | splign         |         -1 |   7 |        822 |      892 |   154170399 | 154170469 | 70=   |         |          |         103034 |          789739 |     987233 |     6770375 |     4285905 |
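
Reading these rows against the fixture: for a negative-strand transcript, exon 1 (`ord` 0) begins at its upper genomic boundary (`alt_end_i`) and exon 8 (`ord` 7) ends at its lower boundary (`alt_start_i`). A minimal sketch of that lookup follows. `ExonAln` and `segment_boundaries` are hypothetical names used only for illustration (they are not UTA, Cool-Seq-Tool, or FUSOR API), and only the two relevant rows are copied in.

```python
from typing import NamedTuple


class ExonAln(NamedTuple):
    """Hypothetical subset of a tx_exon_aln_v row (inter-residue coordinates)."""

    ord: int          # 0-based exon order within the transcript
    alt_start_i: int  # lower genomic boundary
    alt_end_i: int    # upper genomic boundary
    alt_strand: int   # 1 or -1


def segment_boundaries(exons, exon_start, exon_end):
    """Return (coordinate where the segment begins, coordinate where it ends).

    exon_start and exon_end are 1-based exon numbers, as in the fixture.
    """
    by_ord = {e.ord: e for e in exons}
    first, last = by_ord[exon_start - 1], by_ord[exon_end - 1]
    if first.alt_strand == -1:
        # Negative strand: the transcript runs from high to low genomic coordinates.
        return first.alt_end_i, last.alt_start_i
    return first.alt_start_i, last.alt_end_i


rows = [
    ExonAln(0, 154191901, 154192135, -1),
    ExonAln(7, 154170399, 154170469, -1),
]
seg_begin, seg_finish = segment_boundaries(rows, exon_start=1, exon_end=8)
print(seg_begin, seg_finish)  # 154192135 154170399
```

These two values match the `end` of `elementGenomicStart` and the `start` of `elementGenomicEnd` in the fixture.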

jarbesfeld · 2024-07-29T21:43:59Z

@ahwagner I was wondering if you could also comment on the correct use of exonStartOffset and exon_end_offset when describing a genomic coordinate that occurs in the following two situations:

The genomic coordinate occurs on an exon
The genomic coordinate occurs in intronic space

ahwagner · 2024-07-29T22:58:11Z

> Does the example below use start and end correctly?

Yes.

> I'm assuming this would be reversed when using the positive strand.

Depending on what is meant by "reversed", yes. Specifically, you should always expect that the lower genomic coordinate is defined by a SequenceLocation start, and the upper genomic coordinate is defined by a SequenceLocation end.

> Additionally, should start and end be described using 0-indexed or 1-indexed values?

These are VRS SequenceLocations, so it should be interresidue following VRS conventions. I want to be clear that 0- or 1- indexing generally corresponds to interresidue or residue, respectively, but I don't use these terms interchangeably as one could very well 1-index interresidue coordinates or 0-index residue.
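
To make the distinction concrete under the usual pairing (1-indexed residue coordinates, 0-indexed inter-residue coordinates): a residue-based inclusive range [s, e] covers the same bases as the inter-residue range (s - 1, e). The helper names below are made up for this sketch and are not part of VRS or FUSOR.

```python
def residue_to_interresidue(start: int, end: int) -> tuple[int, int]:
    """Convert a 1-indexed, inclusive residue range to 0-indexed inter-residue."""
    return start - 1, end


def interresidue_to_residue(start: int, end: int) -> tuple[int, int]:
    """Convert a 0-indexed inter-residue range to 1-indexed, inclusive residue."""
    return start + 1, end


# Exon 1 of NM_152263.3 on NC_000001.11, taken from the UTA output in this thread.
ir_start, ir_end = 154191901, 154192135
print(interresidue_to_residue(ir_start, ir_end))  # (154191902, 154192135)
print(ir_end - ir_start)  # 234, the exon length, matching the 234= CIGAR
```

With inter-residue coordinates the length is simply `end - start`, which is one reason the SequenceLocation values can be taken from UTA without adjustment.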

> was wondering if you could also comment on the correct use of exonStartOffset and exon_end_offset when describing a genomic coordinate that occurs in the following two situations?:
>
> The genomic coordinate occurs on an exon
>
> The genomic coordinate occurs in intronic space

This follows the conventions outlined in the VICC Fusions Spec. Specifically, this image:

Offset is always with respect to the exon-based representation, and therefore with respect to the transcript (not genomic) sequence. To get here from a genomic location, you need coordinates, a direction (specified by use of start or end in the SequenceLocation), and a target transcript. Depending on the scenario or function you are building, one or more of these things may be inferred from your input.

Also, is the mixed use of camel case and snake case intentional for exonStartOffset and exon_end_offset? I recommend consistency, with a preference for camel case since that is what the GKS specs use.

To answer your question about the correct use of offset, I am assuming the use case is the convert coordinates tool supported by our fusion curation interface:

Note

Incidentally, the interface for this tool might be clarified by replacing "start position" and "end position" with "Coordinate" and "Junction Direction", and removing the toggle for "Strand" (which should be captured from the transcript alignment).

This tool should support specifying whether the returned object represents a transcript segment starting at a given genomic coordinate (genomic SeqLoc uses start and a positive-strand transcript, or end and a negative-strand transcript), or ending at a given genomic coordinate (genomic SeqLoc uses end and a positive-strand transcript, or start and a negative-strand transcript).
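
The strand and direction pairing can be written as a small lookup. This is only a sketch: `seqloc_field` is a hypothetical helper, not something in FUSOR or the curation tool. It agrees with the TPM3 fixture earlier in the thread, where a negative-strand transcript puts `end` on `elementGenomicStart` and `start` on `elementGenomicEnd`.

```python
def seqloc_field(strand: int, segment_side: str) -> str:
    """Return which SequenceLocation field ("start" or "end") holds the coordinate.

    strand: 1 for positive, -1 for negative (as in UTA's alt_strand).
    segment_side: "start" if the transcript segment begins at the coordinate,
                  "end" if it finishes there.
    """
    if strand not in (1, -1):
        raise ValueError("strand must be 1 or -1")
    if segment_side not in ("start", "end"):
        raise ValueError('segment_side must be "start" or "end"')
    if strand == 1:
        # Positive strand: transcript order matches genomic order.
        return segment_side
    # Negative strand: transcript order is the reverse of genomic order.
    return "end" if segment_side == "start" else "start"


for strand in (1, -1):
    for side in ("start", "end"):
        print(strand, side, "->", seqloc_field(strand, side))
```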

Under these conditions, a genomic coordinate falling within an exon represents a positive offset from the beginning of that exon if it is starting at that coordinate, or a negative offset from the end of that exon if it is ending at that coordinate. Likewise, an RNA that includes exon-adjacent intronic sequence would have a genomic coordinate with a negative offset from the beginning of that exon if it is starting at that coordinate, or a positive offset from the end of that exon if it is ending at that coordinate.
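
A rough sketch of that sign convention, assuming inter-residue genomic coordinates and a known exon alignment. `exon_offset` and its parameters are hypothetical and for illustration only; this is not how FUSOR or Cool-Seq-Tool necessarily computes offsets.

```python
def exon_offset(coord: int, exon_start_i: int, exon_end_i: int,
                strand: int, segment_side: str) -> int:
    """Offset of coord relative to an exon boundary, measured along the transcript.

    exon_start_i / exon_end_i: lower / upper genomic boundary of the exon.
    strand: 1 or -1.
    segment_side: "start" measures from the exon's beginning (transcript-wise),
                  "end" measures from the exon's end.
    """
    if strand == 1:
        anchor = exon_start_i if segment_side == "start" else exon_end_i
        return coord - anchor
    # Negative strand: the exon begins at its upper genomic boundary, and
    # moving along the transcript means decreasing genomic coordinates.
    anchor = exon_end_i if segment_side == "start" else exon_start_i
    return anchor - coord


# TPM3 exon 8 (ord 7) from the UTA output, negative strand.
print(exon_offset(154170399, 154170399, 154170469, -1, "end"))  # 0: exactly at the exon end
print(exon_offset(154170409, 154170399, 154170469, -1, "end"))  # -10: ends inside the exon
print(exon_offset(154170389, 154170399, 154170469, -1, "end"))  # 10: ends in the adjacent intron
```

The same function gives a positive offset when a segment starts inside an exon and a negative one when it starts in the preceding intron, on either strand.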

github-actions · 2024-07-31T13:31:52Z

This PR is stale because it has been open 1 day(s) with no activity. Please review this PR.

github-actions · 2024-08-02T13:31:55Z

This PR is stale because it has been open 1 day(s) with no activity. Please review this PR.

korikuzma · 2024-08-20T11:05:56Z

If #176 is pulling SequenceLocations directly from Cool-Seq-Tool, then we do not need this. I'm going to be adding @jarbesfeld as a reviewer to #176 just in case and will close this.

katiestahl

looks good to me!

korikuzma · 2024-08-22T14:44:42Z

Ah rip. Didn't realize I had an exclamation in PR title

katiestahl and others added 30 commits July 17, 2024 13:12

build!: remove vrsatile

442083c

wip: remove gene descriptor

4a538bb

wip: remove gene descriptor

c0e8626

progress updating models and adding back gene element wrapper

24e1a4c

adding back gene element

3e52ff2

Revert "progress updating models and adding back gene element wrapper"

12ee931

This reverts commit 24e1a4c.

Revert "adding back gene element"

6120473

This reverts commit 3e52ff2.

converting descriptors

6573780

remove todo

44e0574

wip: adding back GeneElement wrapper, updating to camelCase, removing…

7796732

… obsolete code

updating models

c67d588

fix: gene element type

6ad8e33

wip: update constructors with updated param names from models

c1e8fad

Merge branch 'main' into issue-95-take2

42d1224

update constructors from model changes

7eabc25

Merge branch 'issue-95-take2' of https://github.com/cancervariants/fusor

6cd7cfa

into issue-95-take2

minor fixes

1e18144

fix: updating variable casing

76ef031

updating docstring

6f740e9

fix: variable casing and error messages

c928b35

revert featureId back to string

a2d2e10

Update src/fusor/models.py

fe85297

Co-authored-by: Kori Kuzma <[email protected]>

Update pyproject.toml

eb9da54

Co-authored-by: Kori Kuzma <[email protected]>

Update pyproject.toml

bbae4bc

Co-authored-by: Kori Kuzma <[email protected]>

Update src/fusor/models.py

7634327

Co-authored-by: Kori Kuzma <[email protected]>

fixes from pr comments

a593c3b

fixes from pr comments

1c3959b

Merge branch 'issue-95-take2' of https://github.com/cancervariants/fusor

c8c90a7

into issue-95-take2

wip: updating test examples with new models

838e2a2

adding back unreachable else because ruff will complain otherwise

f5f5689

This was referenced Jul 29, 2024

Updates to genomic coordinate conversion utility cancervariants/fusion-curation#294

Open

GenomicData should include start/end for both start_exon and end_exon GenomicMedLab/cool-seq-tool#327

Closed

github-actions bot added the stale label Jul 31, 2024

korikuzma removed the stale label Jul 31, 2024

jsstevenson added the keep-alive label Jul 31, 2024

github-actions bot added the stale label Aug 2, 2024

Base automatically changed from issue-95-take2 to main August 2, 2024 14:33

korikuzma added stale-exempt and removed stale labels Aug 2, 2024

korikuzma closed this Aug 20, 2024

Merge branch 'main' into issue-171

cccfc21

korikuzma reopened this Aug 22, 2024

revert

6c5c8a3

korikuzma mentioned this pull request Aug 22, 2024

Update examples to VRS 2.0 #151

Draft

korikuzma added 2 commits August 22, 2024 10:09

update transcript segment element location

29ecc8c

revert examples

406b717

korikuzma requested a review from katiestahl August 22, 2024 14:14

korikuzma marked this pull request as ready for review August 22, 2024 14:14

katiestahl approved these changes Aug 22, 2024

korikuzma merged commit 7f095e4 into main Aug 22, 2024
4 checks passed

korikuzma deleted the issue-171 branch August 22, 2024 14:17
