[DISCUSSION] Summary workflow: Defining a 'terminal exon ID' for the relative quantification benchmark #466
Replies: 3 comments
-
Note: we may have to consider updating the execution workflow output specifications for the format 04 BED file so we have the 'prediction group' information in the output files. Looking like a massive oversight from me (my bad...). |
Beta Was this translation helpful? Give feedback.
-
@SamBryce-Smith I think this sounds great. It's a great way to handle, like you mention, the DaPars outputs with several similar TEs. We will just need to take care in the SWF to chose to do stats on the match between the ground truth union TE with the best prediction tx-by-tx TE. With SWF prioritizing matches with more PAS matches, this should happen automatically, I think. Perhaps dPAS match should be weighted higher than pPAS matches though...but this is more of a discussion for SWF implementation. For the above reason I think having some special notation for the dPAS may be beneficial when it comes to the SWF matching. We know pPAS is always 1, perhaps the dPAS should always be -1 or have some non-integer character in it (e.g. 4d). |
Beta Was this translation helpful? Give feedback.
-
That's a good point RE special notation for dPAS. My thinking is that as we will only ever have 2 sites in the ground truth, we will always know the 'pas number' for the distal site will be 2. So let's say we'd done some kind of 'overlap join' to combine the two BED files according, we'd have 'pas_number_gt' and 'pas_number_pred' columns for matches. We'd then classify according to 'pas_number_gt' (which can only ever be 1 for proximal or 2 for distal) for each ground truth terminal exon (and could then prioritise distal matches as you suggest). I think that is how I did it in my proof-of-concept notebook anyway. |
Beta Was this translation helpful? Give feedback.
-
This issue formalises a discussion point made by @mrgazzara in #399. In my proof-of-concept notebook (see #399) I have further concluded this is essential.
Partially blocks #301 which intends to implement the definition of this identifier.
In brief, for the relative quantification benchmark we want to filter ground truth polyA site files to two representative sites that overlap terminal exons. As some of Joseph's proposed metrics are at the 'terminal-exon level', we need GT files to contain annotation of the terminal exon. This will also be necessary for the 'prediction'/execution workflow output files in order to define 'false positives' at the terminal exon level.
TL;DR: my proposal is a slight modification to @mrgazzara's. The 'Name' field of the GT & prediction BED files will be populated with a 'terminal exon ID' of the following structure:
<gene_id>|<tx_id1>,<tx_id2>|<prediction_group_identifier>|<pas_number>
<gene_id> - Ensembl gene ID of the union terminal exon
<tx_id1>,<tx_id2> - comma-separated string of Ensembl transcript IDs contributing to union terminal exon
<prediction_group_identifier> - ID that groups together predicted PAS for which relative expression has been calculated.
|
characters as this will lead to inconsistent number of fields in the ID.<pas_number> - 5'-3' 1..n rank of polyA site on the terminal exon (terminal exon & prediction group for prediction BEDs).
The first two fields (gene ID & tx ID list) are common between prediction and ground truth files and can be used to match terminal exon IDs (although the transcript ID string on its own should be sufficient...)
Gene ID is also required for proposed expression filtering (e.g. gene is expressed above > x TPM in RNA-seq data).
Remaining fields ('prediction_group_identifier' & 'pas_number') provide the minimal metadata to compute the proposed metrics.
What is a union terminal exon?
It is common for annotated transcripts to share last exon coordinates (because they differ upstream in the transcript model) or to have slightly different 5'/3'ends. To avoid over-duplication of transcripts/PAS within the output, we propose to 'merge'/'collapse' overlapping terminal exons into a single unified region ('union terminal exon'). We will then use these union terminal exons to find overlapping ground truth sites. See graphic below:
-
- exon|
- exon boundaries^
- splice junctionThe 'terminal exon IDs' for PAS on these TEs would be as follows:
TE_1
gene1|tx1,tx2|NA|<pas_number>
gene1|tx1,tx2|tx1|<pas_number>
gene1|tx1,tx2|tx2|<pas_number>
TE_2
gene1|tx3,tx4,tx5|NA|<pas_number>
gene1|tx3,tx4,tx5|tx3|<pas_number>
gene1|tx3,tx4,tx5|tx4|<pas_number>
gene1|tx3,tx4,tx5|tx5|<pas_number>
What is a 'prediction group identifier'?
Several tools qualifying for the benchmark (e.g. DaPars, DaPars2) will predict/quantify polyA sites 'transcript by transcript' I.e. for every input transcript model, regardless of whether they share identical or slightly differing but overlapping terminal exons. We therefore need a way to track which transcript, or 'prediction group' polyA sites were quantified, as it is only for this group that the respective polyA sites's fractional relative usages sum to 1. See my proof-of-concept notebook for scenarios where this is practically essential, especially when matching predicted to ground-truth sites.
I do not think we should enforce this to be the transcript ID as other tools in the challenge perform their own 'collapsing' of the reference annotation to avoid this issue. For example, QAPA will report quantifications for polyA sites on a single terminal exon per gene, so this field can be filled with e.g. Ensembl gene_id. The above described problem will not apply in these cases but it will be important to maintain consistency with the format.
The one possible exception to this is IsoSCM which isn't reference annotation based. I would need to double-check how/if we could make its outputs compatible with this format, although I will put that further down on the priority list as we still have question marks around its overall compatibility with the challenge overall.
Feedback would be much appreciated, especially from @mrgazzara. In #301 @yuukiiwa & I going ahead with my proposal to get us started, but we won't merge until come to an agreement on this point.
Beta Was this translation helpful? Give feedback.
All reactions