[DISCUSSION] Summary workflow: Defining a 'terminal exon ID' for the relative quantification benchmark #466

SamBryce-Smith · 2022-08-26T14:32:05Z

SamBryce-Smith
Aug 26, 2022
Collaborator

This issue formalises a discussion point made by @mrgazzara in #399. In my proof-of-concept notebook (see #399) I have further concluded this is essential.

Partially blocks #301 which intends to implement the definition of this identifier.

In brief, for the relative quantification benchmark we want to filter ground truth polyA site files to two representative sites that overlap terminal exons. As some of Joseph's proposed metrics are at the 'terminal-exon level', we need GT files to contain annotation of the terminal exon. This will also be necessary for the 'prediction'/execution workflow output files in order to define 'false positives' at the terminal exon level.

TL;DR: my proposal is a slight modification to @mrgazzara's. The 'Name' field of the GT & prediction BED files will be populated with a 'terminal exon ID' of the following structure:

<gene_id>|<tx_id1>,<tx_id2>|<prediction_group_identifier>|<pas_number>

<gene_id> - Ensembl gene ID of the union terminal exon
<tx_id1>,<tx_id2> - comma-separated string of Ensembl transcript IDs contributing to union terminal exon
<prediction_group_identifier> - ID that groups together predicted PAS for which relative expression has been calculated.
- Typically this can be an Ensembl transcript ID.
- For ground truth PAS this field can be populated with 'NA'.
- it must not contain | characters as this will lead to inconsistent number of fields in the ID.
- See below for further clarification
<pas_number> - 5'-3' 1..n rank of polyA site on the terminal exon (terminal exon & prediction group for prediction BEDs).
- '1' is always the proximal site, n is the most distal site.
- For ground truth files this will always be 1 & 2 for proximal and distal sites for each terminal exon
- For prediction files this allows > 2 sites to be annotated to a terminal exon whilst maintaining a unique ID.
- For the terminal exon BED file this field can be populated with 'NA'.
- Note: I'm not tied down to labelling with the rank, it's just how I implemented in the proof-of-concept so there is code ready to go!
The first two fields (gene ID & tx ID list) are common between prediction and ground truth files and can be used to match terminal exon IDs (although the transcript ID string on its own should be sufficient...)
Gene ID is also required for proposed expression filtering (e.g. gene is expressed above > x TPM in RNA-seq data).
Remaining fields ('prediction_group_identifier' & 'pas_number') provide the minimal metadata to compute the proposed metrics.

What is a union terminal exon?

It is common for annotated transcripts to share last exon coordinates (because they differ upstream in the transcript model) or to have slightly different 5'/3'ends. To avoid over-duplication of transcripts/PAS within the output, we propose to 'merge'/'collapse' overlapping terminal exons into a single unified region ('union terminal exon'). We will then use these union terminal exons to find overlapping ground truth sites. See graphic below:

- - exon
| - exon boundaries
^ - splice junction

gene1
5’                                                        3’
|--|^          ^|------                                                tx1           TE_1

|--|^          ^|-----------                                           tx2           TE_1

|--|^                                      ^|-----                     tx3           TE_2

|--|^                                      ^|-------------             tx4           TE_2

|--|^                                   ^|--------                     tx5           TE_2


                |-----------             |----------------             union terminal exons

The 'terminal exon IDs' for PAS on these TEs would be as follows:

TE_1
- Ground truth
  - gene1|tx1,tx2|NA|<pas_number>
- Prediction (assuming tool runs tx-by-tx)
  - gene1|tx1,tx2|tx1|<pas_number>
  - gene1|tx1,tx2|tx2|<pas_number>
TE_2
- Ground truth
  - gene1|tx3,tx4,tx5|NA|<pas_number>
- Prediction (assuming tool runs tx-by-tx)
  - gene1|tx3,tx4,tx5|tx3|<pas_number>
  - gene1|tx3,tx4,tx5|tx4|<pas_number>
  - gene1|tx3,tx4,tx5|tx5|<pas_number>

What is a 'prediction group identifier'?

Several tools qualifying for the benchmark (e.g. DaPars, DaPars2) will predict/quantify polyA sites 'transcript by transcript' I.e. for every input transcript model, regardless of whether they share identical or slightly differing but overlapping terminal exons. We therefore need a way to track which transcript, or 'prediction group' polyA sites were quantified, as it is only for this group that the respective polyA sites's fractional relative usages sum to 1. See my proof-of-concept notebook for scenarios where this is practically essential, especially when matching predicted to ground-truth sites.

I do not think we should enforce this to be the transcript ID as other tools in the challenge perform their own 'collapsing' of the reference annotation to avoid this issue. For example, QAPA will report quantifications for polyA sites on a single terminal exon per gene, so this field can be filled with e.g. Ensembl gene_id. The above described problem will not apply in these cases but it will be important to maintain consistency with the format.

The one possible exception to this is IsoSCM which isn't reference annotation based. I would need to double-check how/if we could make its outputs compatible with this format, although I will put that further down on the priority list as we still have question marks around its overall compatibility with the challenge overall.

Feedback would be much appreciated, especially from @mrgazzara. In #301 @yuukiiwa & I going ahead with my proposal to get us started, but we won't merge until come to an agreement on this point.

SamBryce-Smith · 2022-08-26T16:21:33Z

SamBryce-Smith
Aug 26, 2022
Collaborator Author

Note: we may have to consider updating the execution workflow output specifications for the format 04 BED file so we have the 'prediction group' information in the output files. Looking like a massive oversight from me (my bad...).

0 replies

mrgazzara · 2022-08-31T12:19:07Z

mrgazzara
Aug 31, 2022
Maintainer

@SamBryce-Smith I think this sounds great. It's a great way to handle, like you mention, the DaPars outputs with several similar TEs. We will just need to take care in the SWF to chose to do stats on the match between the ground truth union TE with the best prediction tx-by-tx TE.

With SWF prioritizing matches with more PAS matches, this should happen automatically, I think. Perhaps dPAS match should be weighted higher than pPAS matches though...but this is more of a discussion for SWF implementation.

For the above reason I think having some special notation for the dPAS may be beneficial when it comes to the SWF matching. We know pPAS is always 1, perhaps the dPAS should always be -1 or have some non-integer character in it (e.g. 4d).

0 replies

SamBryce-Smith · 2022-08-31T13:44:42Z

SamBryce-Smith
Aug 31, 2022
Collaborator Author

That's a good point RE special notation for dPAS. My thinking is that as we will only ever have 2 sites in the ground truth, we will always know the 'pas number' for the distal site will be 2. So let's say we'd done some kind of 'overlap join' to combine the two BED files according, we'd have 'pas_number_gt' and 'pas_number_pred' columns for matches. We'd then classify according to 'pas_number_gt' (which can only ever be 1 for proximal or 2 for distal) for each ground truth terminal exon (and could then prioritise distal matches as you suggest). I think that is how I did it in my proof-of-concept notebook anyway.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[DISCUSSION] Summary workflow: Defining a 'terminal exon ID' for the relative quantification benchmark #466

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 3 comments

{{title}}

{{title}}

{{title}}

Select a reply

[DISCUSSION] Summary workflow: Defining a 'terminal exon ID' for the relative quantification benchmark #466

SamBryce-Smith Aug 26, 2022 Collaborator

What is a union terminal exon?

What is a 'prediction group identifier'?

Replies: 3 comments

SamBryce-Smith Aug 26, 2022 Collaborator Author

mrgazzara Aug 31, 2022 Maintainer

SamBryce-Smith Aug 31, 2022 Collaborator Author

SamBryce-Smith
Aug 26, 2022
Collaborator

SamBryce-Smith
Aug 26, 2022
Collaborator Author

mrgazzara
Aug 31, 2022
Maintainer

SamBryce-Smith
Aug 31, 2022
Collaborator Author