fix!: SequenceLocation start/end for TranscriptSegmentElement #172

Merged
korikuzma merged 94 commits into main from issue-171
Aug 22, 2024

Member

@korikuzma korikuzma commented Jul 29, 2024

close #171

The tests for transcript segment elements are kind of confusing IMO. The `transcript_segments` test fixture in `test_models` leverages `sequence_locations`. Some of these sequence locations have both start and end, but we only use one or the other. I didn't change these because I don't think the values are actually tested. We can revisit in the future.

katiestahl and others added 30 commits July 17, 2024 13:12
Co-authored-by: Kori Kuzma <[email protected]>
Co-authored-by: Kori Kuzma <[email protected]>
@jarbesfeld
Contributor

jarbesfeld commented Jul 29, 2024

@ahwagner My question pertains to how to use start and end when representing a TranscriptSegmentElement. Assume we have a transcript segment that covers exons 1 to 8 of the gene TPM3, which aligns to the negative strand. Does the example below use start and end correctly? I'm assuming this would be reversed when using the positive strand. Additionally, should start and end be described using 0-indexed or 1-indexed values?

def transcript_segment_element():
    """Create transcript segment element test fixture"""
    params = {
        "type": "TranscriptSegmentElement",
        "exonEnd": 8,
        "exonEndOffset": 0,
        "exonStart": 1,
        "exonStartOffset": 0,
        "gene": {
            "id": "hgnc:12012",
            "label": "TPM3",
            "type": "Gene",
        },
        "transcript": "refseq:NM_152263.3",
        "elementGenomicStart": {
            "id": "ga4gh:SL.2K1vML0ofuYrYncrzzXUQOISRFJldZrO",
            "type": "SequenceLocation",
            "sequenceReference": {
                "id": "refseq:NC_000001.11",
                "refgetAccession": "SQ.Ya6Rs7DHhDeg7YaOSg1EoNi3U_nQ9SvO",
                "type": "SequenceReference",
            },
            "end": 154192135
        },
        "elementGenomicEnd": {
            "id": "ga4gh:SL.rtR6x2NnJEpROlxiT_DY9C-spf6ijYQi",
            "type": "SequenceLocation",
            "sequenceReference": {
                "id": "refseq:NC_000001.11",
                "refgetAccession": "SQ.Ya6Rs7DHhDeg7YaOSg1EoNi3U_nQ9SvO",
                "type": "SequenceReference",
            },
            "start": 154170399
        },
    }
    return TranscriptSegmentElement(**params)

@korikuzma
Member Author

korikuzma commented Jul 29, 2024

@ahwagner here is what UTA provides. Note that UTA is inter-residue based

uta=> select * from tx_exon_aln_v where alt_ac = 'NC_000001.11' and tx_ac = 'NM_152263.3';
+------+-------------+--------------+----------------+------------+-----+------------+----------+-------------+-----------+-------+---------+----------+----------------+-----------------+------------+-------------+-------------+
| hgnc |    tx_ac    |    alt_ac    | alt_aln_method | alt_strand | ord | tx_start_i | tx_end_i | alt_start_i | alt_end_i | cigar | tx_aseq | alt_aseq | tx_exon_set_id | alt_exon_set_id | tx_exon_id | alt_exon_id | exon_aln_id |
+------+-------------+--------------+----------------+------------+-----+------------+----------+-------------+-----------+-------+---------+----------+----------------+-----------------+------------+-------------+-------------+
| TPM3 | NM_152263.3 | NC_000001.11 | splign         |         -1 |   0 |          0 |      234 |   154191901 | 154192135 | 234=  |         |          |         103034 |          789739 |     987226 |     6770368 |     4285931 |
| TPM3 | NM_152263.3 | NC_000001.11 | splign         |         -1 |   1 |        234 |      360 |   154191185 | 154191311 | 126=  |         |          |         103034 |          789739 |     987227 |     6770369 |     4285885 |
| TPM3 | NM_152263.3 | NC_000001.11 | splign         |         -1 |   2 |        360 |      494 |   154176114 | 154176248 | 134=  |         |          |         103034 |          789739 |     987228 |     6770370 |     4285916 |
| TPM3 | NM_152263.3 | NC_000001.11 | splign         |         -1 |   3 |        494 |      612 |   154173083 | 154173201 | 118=  |         |          |         103034 |          789739 |     987229 |     6770371 |     4285871 |
| TPM3 | NM_152263.3 | NC_000001.11 | splign         |         -1 |   4 |        612 |      683 |   154172907 | 154172978 | 71=   |         |          |         103034 |          789739 |     987230 |     6770372 |     4285877 |
| TPM3 | NM_152263.3 | NC_000001.11 | splign         |         -1 |   5 |        683 |      759 |   154171412 | 154171488 | 76=   |         |          |         103034 |          789739 |     987231 |     6770373 |     4285929 |
| TPM3 | NM_152263.3 | NC_000001.11 | splign         |         -1 |   6 |        759 |      822 |   154170648 | 154170711 | 63=   |         |          |         103034 |          789739 |     987232 |     6770374 |     4285883 |
| TPM3 | NM_152263.3 | NC_000001.11 | splign         |         -1 |   7 |        822 |      892 |   154170399 | 154170469 | 70=   |         |          |         103034 |          789739 |     987233 |     6770375 |     4285905 |
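As a rough cross-check of the fixture against these rows, the sketch below picks the genomic boundaries for a segment spanning exons 1 to 8. The coordinates are copied from the query output above; the `segment_bounds` helper and its return shape are hypothetical and are not part of FUSOR, Cool-Seq-Tool, or UTA.

```python
# (ord, alt_start_i, alt_end_i) copied from the UTA output above.
# Coordinates are inter-residue; ord is 0-based, exon numbers are 1-based.
EXONS = [
    (0, 154191901, 154192135),
    (1, 154191185, 154191311),
    (2, 154176114, 154176248),
    (3, 154173083, 154173201),
    (4, 154172907, 154172978),
    (5, 154171412, 154171488),
    (6, 154170648, 154170711),
    (7, 154170399, 154170469),
]
ALT_STRAND = -1


def segment_bounds(exons, strand, exon_start, exon_end):
    """Hypothetical helper: return (elementGenomicStart, elementGenomicEnd).

    Each value is a {field: coordinate} dict, where field is the
    SequenceLocation attribute ("start" or "end") that carries the boundary.
    """
    by_ord = {ord_: (alt_start, alt_end) for ord_, alt_start, alt_end in exons}
    first = by_ord[exon_start - 1]
    last = by_ord[exon_end - 1]
    if strand == 1:
        # Positive strand: the segment begins at the lower genomic coordinate.
        return {"start": first[0]}, {"end": last[1]}
    # Negative strand: the segment begins at the upper genomic coordinate.
    return {"end": first[1]}, {"start": last[0]}


genomic_start, genomic_end = segment_bounds(EXONS, ALT_STRAND, 1, 8)
print(genomic_start, genomic_end)
```

For the negative-strand TPM3 alignment this selects `end` 154192135 for the first exon and `start` 154170399 for the last, which are the values used in the fixture.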

@jarbesfeld
Contributor

@ahwagner I was wondering if you could also comment on the correct use of exonStartOffset and exon_end_offset when describing a genomic coordinate that occurs in the following two situations:

  1. The genomic coordinate occurs on an exon
  2. The genomic coordinate occurs in intronic space

@ahwagner
Member

ahwagner commented Jul 29, 2024

Does the example below use start and end correctly?

Yes.

I'm assuming this would be reversed when using the positive strand.

Depending on what is meant by "reversed", yes. Specifically, you should always expect that the lower genomic coordinate is defined by a SequenceLocation start, and the upper genomic coordinate is defined by a SequenceLocation end.

Additionally, should start and end be described using 0-indexed or 1-indexed values?

These are VRS SequenceLocations, so it should be interresidue following VRS conventions. I want to be clear that 0- or 1- indexing generally corresponds to interresidue or residue, respectively, but I don't use these terms interchangeably as one could very well 1-index interresidue coordinates or 0-index residue.
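To make the distinction concrete, here is a minimal sketch of converting between the two coordinate systems. It assumes the common pairing of 0-based inter-residue coordinates with 1-based, inclusive residue coordinates; the function names are hypothetical, and the sample range is exon 1 from the UTA output earlier in the thread.

```python
def interresidue_to_residue(start, end):
    """Convert a non-empty 0-based inter-residue range to a 1-based inclusive residue range."""
    if end <= start:
        raise ValueError("an empty inter-residue range has no residue equivalent")
    return start + 1, end


def residue_to_interresidue(start, end):
    """Convert a 1-based inclusive residue range to a 0-based inter-residue range."""
    return start - 1, end


# Exon 1 of NM_152263.3 on NC_000001.11, inter-residue, as reported by UTA.
ir_range = (154191901, 154192135)
res_range = interresidue_to_residue(*ir_range)

# Length is end - start in inter-residue terms, and end - start + 1 in residue terms.
ir_length = ir_range[1] - ir_range[0]
res_length = res_range[1] - res_range[0] + 1
print(res_range, ir_length, res_length)
```

Both lengths come out to 234, matching the `234=` CIGAR string for that exon.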

was wondering if you could also comment on the correct use of exonStartOffset and exon_end_offset when describing a genomic coordinate that occurs in the following two situations?:

  1. The genomic coordinate occurs on an exon
  2. The genomic coordinate occurs in intronic space

This follows the conventions outlined in the VICC Fusions Spec. Specifically, this image:
image

Offset is always with respect to the exon-based representation, and therefore with respect to the transcript (not genomic) sequence. To get here from a genomic location, you need coordinates, a direction (specified by use of start or end in the SequenceLocation), and a target transcript. Depending on the scenario or function you are building, one or more of these things may be inferred from your input.

Also, is the mixed use of camel case and snake case intentional for exonStartOffset and exon_end_offset? I recommend consistency, with a preference for camel case since that is what the GKS specs use.

To answer your question about the correct use of offset, I am assuming the use case is the convert coordinates tool supported by our fusion curation interface:
image

Note

Incidentally, the interface for this tool might be clarified by replacing "start position" and "end position" with "Coordinate" and "Junction Direction", and removing the toggle for "Strand" (which should be captured from the transcript alignment).

This tool should support specifying whether the returned object represents a transcript segment starting at a given genomic coordinate (genomic SeqLoc uses start and a positive-strand transcript, or end and a negative-strand transcript), or ending at a given genomic coordinate (genomic SeqLoc uses end and a positive-strand transcript, or start and a negative-strand transcript).
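The four strand and direction combinations can be sketched as a lookup table. This reads the final case as a negative-strand transcript; the `SEQLOC_FIELD` table, the `seqloc_for_junction` helper, and the role names are hypothetical and only illustrate the pairing, not the tool's actual interface.

```python
# Hypothetical lookup: (transcript strand, segment role) -> SequenceLocation field.
SEQLOC_FIELD = {
    (1, "starts_at"): "start",
    (-1, "starts_at"): "end",
    (1, "ends_at"): "end",
    (-1, "ends_at"): "start",
}


def seqloc_for_junction(coordinate, strand, role):
    """Return a partial SequenceLocation dict with only the relevant field set."""
    try:
        field = SEQLOC_FIELD[(strand, role)]
    except KeyError:
        raise ValueError(f"unsupported strand/role: {strand!r}, {role!r}") from None
    return {"type": "SequenceLocation", field: coordinate}


# Negative-strand TPM3 segment from the fixture: begins at the upper
# coordinate (end) and finishes at the lower coordinate (start).
segment_start = seqloc_for_junction(154192135, -1, "starts_at")
segment_end = seqloc_for_junction(154170399, -1, "ends_at")
print(segment_start)
print(segment_end)
```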

Under these conditions, a genomic coordinate falling within an exon represents a positive offset from the beginning of that exon if it is starting at that coordinate, or a negative offset from the end of that exon if it is ending at that coordinate. Likewise, an RNA that includes exon-adjacent intronic sequence would have a genomic coordinate with a negative offset from the beginning of that exon if it is starting at that coordinate, or a positive offset from the end of that exon if it is ending at that coordinate.
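The sign rules above can be sketched as a small calculation. This assumes inter-residue genomic coordinates and that the reference exon has already been chosen; `exon_offset` is a hypothetical helper, not a FUSOR or Cool-Seq-Tool function, and the exon bounds come from the UTA output earlier in the thread.

```python
def exon_offset(coord, exon_alt_start, exon_alt_end, strand, segment_starts_here):
    """Signed offset of a genomic coordinate relative to an exon boundary.

    The offset is measured along the transcript: positive values lie
    downstream (3') of the boundary and negative values upstream (5').
    The boundary is the beginning of the exon when the segment starts at
    the coordinate, and the end of the exon when the segment ends there.
    """
    if strand not in (1, -1):
        raise ValueError("strand must be 1 or -1")
    if strand == 1:
        tx_begin, tx_end = exon_alt_start, exon_alt_end
    else:
        # On the negative strand the exon begins at its upper genomic coordinate.
        tx_begin, tx_end = exon_alt_end, exon_alt_start
    boundary = tx_begin if segment_starts_here else tx_end
    return strand * (coord - boundary)


# TPM3 (negative strand): exon 1 is 154191901-154192135, exon 8 is 154170399-154170469.
start_in_exon = exon_offset(154192125, 154191901, 154192135, -1, True)
start_in_intron = exon_offset(154192140, 154191901, 154192135, -1, True)
end_in_exon = exon_offset(154170409, 154170399, 154170469, -1, False)
end_in_intron = exon_offset(154170394, 154170399, 154170469, -1, False)
print(start_in_exon, start_in_intron, end_in_exon, end_in_intron)
```

Working in inter-residue coordinates keeps the arithmetic free of plus-or-minus-one adjustments, because exon boundaries and junction coordinates are both positions between residues.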

github-actions bot commented Aug 2, 2024

This PR is stale because it has been open 1 day(s) with no activity. Please review this PR.

@github-actions github-actions bot added the stale label Aug 2, 2024
Base automatically changed from issue-95-take2 to main August 2, 2024 14:33
@korikuzma
Member Author

If #176 is pulling SequenceLocations directly from Cool-Seq-Tool, then we do not need this. I'm going to be adding @jarbesfeld as a reviewer to #176 just in case and will close this.

@korikuzma korikuzma closed this Aug 20, 2024
@korikuzma korikuzma reopened this Aug 22, 2024
@korikuzma korikuzma requested a review from katiestahl August 22, 2024 14:14
@korikuzma korikuzma marked this pull request as ready for review August 22, 2024 14:14
Contributor

@katiestahl katiestahl left a comment

looks good to me!

@korikuzma korikuzma merged commit 7f095e4 into main Aug 22, 2024
4 checks passed
@korikuzma korikuzma deleted the issue-171 branch August 22, 2024 14:17
@korikuzma
Member Author

Ah rip. Didn't realize I had an exclamation in PR title

korikuzma added a commit that referenced this pull request Aug 22, 2024
…172)

Successfully merging this pull request may close these issues.

Use latest conceptual framework for transcipt segment start/end
5 participants