
SR point-wise evaluation with measuring "stitcher" performance #60

Closed
keighrim opened this issue Jul 10, 2024 · 10 comments · Fixed by #62
Comments

@keighrim
Member

keighrim commented Jul 10, 2024

New Feature Summary

#55 added evaluation software for apps like SWT that only uses TimePoint annotations, but that evaluation can easily be expanded into an evaluation of the "stitcher" component that turns TP annotations into TimeFrame annotations. The idea relies on the fact that all existing stitcher implementations use "label remapping", exposed as a runtime parameter and recorded in the view metadata, which enables us to reconstruct a point-wise but remapped label value list.

So the idea is to update the eval.py file so that it (a rough sketch in code follows this list):

  1. Reads TP annotations as usual, constructing a list of "raw" classification results; let's call it raw
  2. Reads the gold files as usual, constructing a list of gold labels; let's call it gold
  3. Grabs the "latest" view with TimeFrame annotations and the remapper config (map for the SWT built-in stitcher, labelMap for simple-stitcher)
  4. Uses the remapper to map raw and gold into secondary remapped lists (these new lists should be shorter than the originals, since not all raw/gold labels are remapped into the secondary (TF) labels); let's call them raw-remap and gold-remap respectively
  5. Iterates through the TF annotations, constructing a third list of stitched, remapped labels by following the pointers in the targets prop (which must point to TP annotations so the timepoints can be traced); let's call this list stitched
  6. Computes P/R/F between
    • raw vs. gold (this should already be there in the current eval.py)
    • raw-remap vs. gold-remap
    • stitched vs. gold-remap
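Here is a minimal sketch of that flow. The helper names, the index-keyed dict representation, and the use of scikit-learn for P/R/F are my assumptions for illustration, not the actual eval.py code:

```python
# Hypothetical sketch of the proposed eval.py flow; reading MMIF/gold files is out of scope here.
from sklearn.metrics import precision_recall_fscore_support

def remap(labels, label_map):
    """Step 4: keep only timepoints whose label is covered by the remapper config."""
    return {i: label_map[l] for i, l in enumerate(labels) if l in label_map}

def prf(pred, gold):
    """Point-wise macro P/R/F over the timepoint indices present in both label dicts."""
    idxs = sorted(set(pred) & set(gold))
    p, r, f, _ = precision_recall_fscore_support(
        [gold[i] for i in idxs], [pred[i] for i in idxs],
        average='macro', zero_division=0)
    return p, r, f

# raw / gold: point-wise label lists from the TP view and the gold files (steps 1-2)
# label_map: the remapper config grabbed from the latest TimeFrame view (step 3)
# stitched: {timepoint index: TF label}, traced via each TF's `targets` pointers (step 5)
def evaluate(raw, gold, label_map, stitched):
    raw_d, gold_d = dict(enumerate(raw)), dict(enumerate(gold))
    raw_remap, gold_remap = remap(raw, label_map), remap(gold, label_map)
    return {                                       # step 6: the three comparisons
        'raw vs gold': prf(raw_d, gold_d),         # already in the current eval.py
        'raw-remap vs gold-remap': prf(raw_remap, gold_remap),
        'stitched vs gold-remap': prf(stitched, gold_remap),
    }
```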

Related

Resolving this issue will also properly address clamsproject/app-swt-detection#61

Alternatives

No response

Additional context

No response

@keighrim keighrim added the ✨N New feature or request label Jul 10, 2024
@marcverhagen

Some of this was done in 22bee5c (work in progress that I just pushed up to make it visible), but not quite in the way described above. One difference is that it uses the output of the updated process.py script from the annotations repository.

@marcverhagen

Is there overlap here with #43?

@keighrim
Member Author

The evaluation scheme here is based on point-wise evaluation, hence it is not compatible with the "old" interval-level gold data (from around 2020). I don't think this is a duplicate of #43. Eventually, I believe the proposed method will be an evaluation of stitcher components, largely independent of image classification model performance.

I think the most "overlapping" effort to this issue was evaluate.py1 in clamsproject/app-swt-detection@5590a3e , but that file was

  1. never used to produce a public report or associated with any actual (archived) SWT MMIF output files
  2. written a fair while ago, so its current status/compatibility is unknown

so I thought it wouldn't be easy to verify whether the old code still works (plus, this repo is the repo for evaluation code). That led me to open this new issue.

Footnotes

  1. which wasn't easy to dig out, since the most closely related issue (presumably https://github.com/clamsproject/app-swt-detection/issues/61) was closed without mentioning the file, the commit, or an explicit PR merge.

@keighrim
Member Author

Since the existing timepoint evaluation and the proposed timeframe (stitcher) evaluation can be applied to any point-wise classification task, I think it would be more representative to rename the subdirectory to pointclassification_eval (or something like that).

@kla7
Contributor

kla7 commented Aug 21, 2024

To analyze the stitcher's performance, I compared the evaluation scores from filtered (corresponding to raw-remap vs. gold-remap) and stitched (corresponding to stitched vs. gold-remap). To optimize the stitched scores, we needed to perform a grid search over the stitcher's parameters. I compiled all results and analyzed the bar charts generated by a new see_results.py script, using @keighrim's grid search efforts with the following grid (a sketch of the enumeration follows the grid):

minTFDuration = {1000, 2000, 3000, 4000, 5000}
minTPScore = {0.001, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5}
minTFScore = {0.001, 0.01, 0.1, 0.5}
labelMapPreset = {swt-v4-4way, swt-v4-6way}
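For reference, here is a sketch of how such a grid can be enumerated; the runner that actually invokes the stitcher and the evaluation script for one configuration is a hypothetical placeholder:

```python
# Hypothetical grid enumeration; run_stitcher_eval stands in for whatever runs
# the stitcher + eval.py for a single parameter combination.
from itertools import product

grid = {
    'minTFDuration': [1000, 2000, 3000, 4000, 5000],
    'minTPScore': [0.001, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5],
    'minTFScore': [0.001, 0.01, 0.1, 0.5],
    'labelMapPreset': ['swt-v4-4way', 'swt-v4-6way'],
}

def run_stitcher_eval(**config):
    """Placeholder: run the stitcher + evaluation under one configuration, return scores."""
    return {}

results = []
for values in product(*grid.values()):   # 5 x 7 x 4 x 2 = 280 configurations
    config = dict(zip(grid.keys(), values))
    results.append({**config, **run_stitcher_eval(**config)})
```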

For bars, every configuration produced perfect 1.0 scores across all metrics (F1, Precision, Recall). Here is a sample chart from the output visualization for bars1:

image

Across all labels (besides bars), I noticed that higher minTPScore values tended to produce higher scores. Because of this, I focused on the subplot where minTPScore = 0.5 so that I could observe differences between the other three parameters.

Aside from higher minTPScore values, higher minTFDuration values also result in better scores. This makes sense because both of these parameters determine how many TPs and TFs are allowed to be included at all, and higher values for each mean that TPs with lower scores and TFs with shorter durations are excluded, leading to a better chance of success.

labelMapPreset=swt-v4-6way tends to result in lower scores than when labelMapPreset=swt-v4-4way, which makes sense because TFs might be misrecognized as other_opening or other_text when these additional labels are available as options to choose from, creating more room for error.

When minTFScore={0.001, 0.01, 0.1} and minTFDuration & labelMapPreset are fixed, the results across corresponding minTPScore values are identical. minTFScore=0.5 tends to result in slightly higher scores than the smaller minTFScore values.

credits sees the most improvement when comparing the results from filtered to those of stitched. In particular, Recall scores were already fairly high to begin with, but there is an increase of at least 0.02 across all configs. When only considering results where labelMapPreset=swt-v4-4way, F1 scores see an increase of around 0.13 across all configs and Precision sees an increase of 0.2 across all configs. Here is a sample chart from the output visualization for credits, where the metrics are the highest among all configurations2:

image

From all of my observations, I have concluded that the best scores result from the following configuration:

minTFDuration = 5000
minTPScore = 0.5
minTFScore = 0.5
labelMapPreset = swt-v4-4way

Footnotes

  1. Please note that F, P, and R correspond to the average F1, Precision, and Recall scores retrieved from the evaluation script, aggregated across all GUIDs in the given dataset; they are independent of each other.

  2. When observing the ideal configuration highlighted at the bottom of this comment, the Precision score for credits is actually slightly lower than when minTFDuration=4000, but the difference is practically negligible (0.002).

@keighrim
Member Author

For future reference, here are the "result" files used for this grid search/evaluation:
swt6.1-stitcher3.0-results.zip

@keighrim
Member Author

A few follow-up questions:

  • how would stitching without binning (remap) perform? (i.e., just using the single-letter labels)
  • why does 6way remap work poorly in general?
  • 5 seconds seems too long a threshold for chyrons; what happened to chyron?

And future directions:

  • merge the stitcher into SWT, using the smoother implementation from the SDK
  • make SWT work in three modes: TP-only, TF-only (stitcher mode), and TP+TF mode
    • a separate stitcher can add better modularization, but since SWT is the only app that uses this stitching implementation, that benefit doesn't outweigh the user-friendliness of having a single app with all related features built in.

@kla7
Contributor

kla7 commented Sep 14, 2024

how would stitching without binning (remap) perform? (i.e., just using the single-letter labels)

I'm unsure what you mean by this since the current stitching mechanics involve remapping TPs into TFs.

5 seconds seems too long a threshold for chyrons; what happened to chyron?

I don't remember why I did not specifically report on chyron or slate, but both seemed to show great results all around. I will add further reports below to expand on the previous findings and highlight chyron and slate.

why does 6way remap work poorly in general?

My assumption is that some timeframes predicted to be other_text or other_opening should actually be labeled as chyron, credits, or slate, and vice versa. I will include visualizations comparing 6way vs. 4way remap while fixing minTFDuration = 1000 (to better support chyrons, addressing the threshold concern mentioned above) and minTFScore = 0.5 (which I previously found to produce the best results).

For chyron and slate, the F/P/R scores tend to be around 0.02 points lower using 6way remap compared to 4way remap.

Sample comparison between the two label map presets for chyron to demonstrate these findings:
image

Sample comparison between the two label map presets for slate to demonstrate these findings:
image

For credits, R scores are minimally affected (<0.01 difference), but P scores are between 0.04 and 0.15 points lower using 6way remap compared to 4way remap, while F1 scores are around 0.03-0.04 points lower. From these numbers, I suspect that many timeframes that should not be labeled as credits are predicted as such, judging by the near-perfect R scores in contrast with the lower P scores (supported by lower F1 scores).

Sample comparison between the two label map presets for credits to demonstrate these findings:
image

I am now acknowledging my prior misunderstanding of how minTFDuration affects the results, since setting it to 5000ms can lead to the loss of valuable data, as not every TF lasts 5000ms. When minTFDuration is set to 1000ms, the difference between the filtered and stitched F/P/R scores is still significant.

When observing results where minTFDuration = 1000 and minTFScore = 0.5, it is interesting to see that F/P/R scores across the three labels (again, bars has been excluded since it scores perfectly across all configs) are slightly higher when minTPScore = 0.001 compared to when minTPScore = 0.5. I can't figure out a reason why this might be the case. To my understanding, minTPScore determines which TPs are included, so I seem to be missing an intuitive piece here.

@keighrim
Member Author

keighrim commented Nov 1, 2024

241101 experiment with raw labels

Experiment setup

The goals of this experiment are

  1. to find explanations for these two questions raised in a previous comment:
    • how would stitching without binning (remap) perform? (i.e., just using the single-letter labels)
    • 5 seconds seems too long a threshold for chyrons; what happened to chyron?
  2. to find a "good-enough" set of values that works fairly well for "generally complex" label spaces.
    The goal is NOT to find the best set of parameters that fits one optimized binning scheme; rather, it is to find values that hold up reasonably well in general.

Input data

  • 62 videos from the aapb-collaboration-27-[abcd] batches
  • 61 of the 62 videos (all except the d batch) were used as the training set for the scene recognition model that produced the timepoint annotations fed to the stitcher in this experiment, so we should expect the accuracy of the SR model on these videos to be better than usual; hence we won't look at absolute measurements of model accuracy. We'll discuss the evaluation method used in this experiment later in this report.

TimePoint annotations

  • (relevant) configurations
        "app": "http://apps.clams.ai/swt-detection/v6.1",        
        "appConfiguration": {
          "sampleRate": 1000, 
          "modelName": "convnext_lg",
          "usePosModel": true, 
        }                                                                                                     

TimeFrame annotations

  • (relevant) configurations

        "appConfiguration": {
          "useStitcher": true,
          "tfMinTPScore": OR(0.001, 0.055, 0.5),
          "tfMinTFScore": OR(0.055, 0.5),
          "tfMinNegTFDuration": OR(1, 1000, 2000),
          "tfMinTFDuration": OR(2000, 3000, 5000),
          "tfAllowOverlap": OR(True, False)
        }

  • In the above, values in OR() notation are used for the "gridsearch" in the experiment. The total number of parameter combinations is 3 x 2 x 3 x 3 x 2 = 108.
  • Why 0.055 for the score thresholds? That's roughly 1/18 (an even distribution over all labels).
  • The tfMinNegTFDuration parameter is only used for this experimental setup, and won't be exposed in the release version. This parameter was proposed in https://brandeis-llc.slack.com/archives/C06B0B3TYVB/p1727397593356179

"binning"

  • no binning for TP (prebin)
    • At the TP level, we have a total of 18 (BSWLOMINEPYKUGTFCR) "raw" labels (ignoring subtypes of slates)
    • In 6.1, U was omitted from the model by mistake (this will be fixed in 7.1, not 7.0)
    • As a result, BSWLOMINEPYKGTFCR- (raw - U + -, total 18) were used in the TP view of the experimental data
  • no binning for TF (postbin)
    • Namely, the labels for time frames after stitching were also the (buggy) "raw" single-letter labels.
  • some "bin names" used in the rest of this report are simply meant for averaging the scores of multiple raw labels.

Results

Evaluation method

  • Based on the discussion in SR point-wise evaluation with measuring "stitcher" performance #60 (and the implementation in Add new evaluation methods for remapped labels and stitcher #62), we calculated point-wise precision/recall/F-1 scores for "unfiltered" and "stitched" labels (since we didn't use any binning/mapping, the "unfiltered" and "filtered" results are identical, so the latter is ignored).
  • To measure the contribution of the stitcher, we subtract the unfiltered P/R/F scores from the stitched P/R/F scores. In other words, we consider the differences in point-wise P/R/F scores to be the contribution of the stitcher (a larger positive difference means a larger improvement, and vice versa for negative differences).
  • We then visualized the results with their associated parameter values as parallel coordinates using the HiPlot Python library (a sketch of this step follows).
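As an illustration of this step, here is a sketch of the diff computation and the HiPlot export. The row layout (one dict per parameter combination with its point-wise scores) and the dummy values are assumptions; only the `Experiment.from_iterable`/`to_html` calls are HiPlot's documented API:

```python
import hiplot as hip

# One row per parameter combination, with its point-wise F-1 scores (illustrative dummy values).
rows = [
    {'tfMinTPScore': 0.5,   'tfMinTFDuration': 5000, 'unfiltered_f1': 0.71, 'stitched_f1': 0.84},
    {'tfMinTPScore': 0.001, 'tfMinTFDuration': 2000, 'unfiltered_f1': 0.71, 'stitched_f1': 0.65},
]
for row in rows:
    # positive diff = the stitcher improved over the raw point-wise labels
    row['f1_diff'] = row['stitched_f1'] - row['unfiltered_f1']

# render the rows as a parallel-coordinates plot in a standalone HTML file
hip.Experiment.from_iterable(rows).to_html('stitcher-results-F-diff.html')
```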

Result files

  • html files (stitcher-results-241101.zip)

    • visualization files are named stitcher-results-{AVERAGE_BIN}-{[PRF]}-diff.html
    • AVERAGE_BIN is NOT the pre- or postbinning used in the SWT/stitcher app. These bins are used only in the visualization, to compute average P/R/F scores over certain groups of raw (single-letter) labels

Analysis

  • All plots and analysis below are based on F-1 scores.
  • To see the plots for each comparison, click to expand.

When looking at the all labels

Stitcher contribution by all labels

  • The W label showed the widest variance in the grid search, from top to bottom, but treating that as an outlier, most labels showed differences in the range between -0.3 and 0.3 F-1 score.

    visuals

    Screenshot_20241102_091811

Labels showed positive contribution

  • the top improvement of W is intentionally excluded as a noisy outlier in the plot below.

  • 82.6% of all parameter combinations showed a positive impact.

  • most of the larger improvements (red-colored lines) are observed in the T, O, G, and F labels, which are not of our immediate interest

    visuals

    Screenshot_20241102_095023

Point-wise softmax threshold vs stitcher contribution

  • the extremely low threshold (0.001) that was frequently used (to handle negative smoothing) in previous iterations of the SWT app showed the widest impact on the stitcher contribution, including lots of negative ones.

  • the middle value (even distribution, or the "by-chance" baseline) showed mostly positive effects, but with some negatives.

  • the true-majority threshold (0.5) showed only improvements.

    visuals

    Screenshot_20241102_092230
    Screenshot_20241102_092354

  • when the threshold is high, minNegTFDuration values (used for smoothing over short noisy misclassifications in the middle of longer sequences) didn't make meaningful differences.

    visuals

    Screenshot_20241102_092520

  • the same goes for minTFScore and minTFDuration.

    visuals

    Screenshot_20241102_092553
    Screenshot_20241102_092625

Conclusion

How would stitching without binning (remap) perform?

In general, stitching works as intended (82.6% of cases). Using minTPScore=0.5 will make a generally better-working stitcher.

When looking at labels in "relaxed" binning scheme

Stitcher contribution by all labels

  • Similar to the above (all the raw labels)

    visuals

    Screenshot_20241102_100914
    Screenshot_20241102_100232

Average of all "interested" labels

  • The top-ranked instances showed ~0.16 F-1 score improvements on average across all the "interesting" labels.

    visuals

    Screenshot_20241102_092727

  • And we can re-affirm that minTPScore=0.5 resulted in positive impact all the time.

    visuals

    Screenshot_20241102_092812

By bins

  • When the softmax threshold is fixed at 0.5, "other-text" from the relaxed scheme showed the largest improvement (0.18 - 0.22 increase) after stitching, while "chyron" and "credits" showed a ~0.10 difference, and for "slate" and "bars" the stitcher showed only a limited increase.

    visuals

    Screenshot_20241102_092824
    Screenshot_20241102_092829
    Screenshot_20241102_092836
    Screenshot_20241102_092841
    Screenshot_20241102_092845

  • Lower signs of improvement in "slate" and "bars" labels are expected, as the model performance before stitching was already very high.

    visuals

    Screenshot_20241102_093612

Minimum TF duration threshold vs stitcher contribution

  • In a previous discussion, we suspected that the 5-second minTFDuration threshold could be too harsh for chyron types. Interestingly, however, when minTPScore is high, the 5-sec duration threshold showed no signs of performance degradation compared to the shorter duration thresholds.

  • Note, again, that the I and K labels recorded high scores before stitching, and hence the stitcher showed relatively smaller improvements.

    visuals

    Screenshot_20241102_093734
    Screenshot_20241102_093740

  • So, we already saw in @kla7 's previous comment that the stitcher showed large improvements in labels other than "chyron" with the 5-sec duration threshold. Does that still hold when we use "nobin" stitching?

  • Yes, the longer threshold showed the largest improvements in all other labels (except "bars", which we can't really improve further).

    visuals

    Screenshot_20241102_093910
    Screenshot_20241102_093851
    Screenshot_20241102_093855
    Screenshot_20241102_093900
    Screenshot_20241102_093904

Conclusion

As shown in the previous pilot report by @kla7, minTFDuration=5000 will make a generally better-working stitcher (in combination with minTPScore=0.5).

  • 5 seconds seems too long a threshold for chyrons; what happened to chyron?

The 5-sec threshold wasn't actually too long for chyron types.

Conclusion and next steps

I will fix the tfMinTPScore=0.5 and tfMinTFDuration=5000 parameters and conduct a similar experiment with the "relaxed" postbin applied at stitching time.

@keighrim
Member Author

keighrim commented Nov 3, 2024

241102 experiment with "relaxed" postbin

Experiment setup

This is an experiment following up on the findings from the 241101 experiment on stitcher contribution without any binning. The goal of this experiment is to find a set of proper default values for the remainder of the stitcher configuration that works well for the relatively simpler label spaces resulting from "post-binning" of the single-letter raw labels.

Input data

TimeFrame annotations

  • (relevant) configurations

        "appConfiguration": {
          "useStitcher": true,
          "tfMinTPScore": 0.5,
          "tfMinTFScore": OR(0.5, 0.75, 0.9),
          "tfLabelMapFn": OR("sum", "max"),
          "tfMinNegTFDuration": OR(1, 1000, 2000),
          "tfMinTFDuration": 5000,
          "tfAllowOverlap": OR(True, False)
        }

  • In the above, values in OR() notation are used for the "gridsearch" in the experiment. The total number of parameter combinations is 3 x 2 x 3 x 2 = 36.
  • A new experimental parameter is added in this round: tfLabelMapFn. This is the function used to aggregate point-wise confidence scores when binning two or more sublabels. In previous versions of SWT, this function was fixed to max, meaning that given a binning from [I, N] to chyron, the score of chyron, or S(chyron), is determined by the maximum of [S(I), S(N)]. In this experiment, I tried switching it to the sum function (i.e., S(chyron) = S(I) + S(N)) to see if that makes any difference (a small sketch of the two functions follows this list).
  • As we found in the 241101 experiment, I fixed the minTPScore and minTFDuration values to 0.5 and 5 seconds, respectively.
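A minimal sketch of the two aggregation functions, just to make the max vs. sum difference concrete; the score-dict layout and values are illustrative, not the stitcher's actual data structures:

```python
label_map = {'I': 'chyron', 'N': 'chyron'}       # binning [I, N] -> chyron

def bin_scores(tp_scores, label_map, fn=max):
    """Aggregate raw-label softmax scores into bin scores using `fn` (max or sum)."""
    binned = {}
    for raw_label, score in tp_scores.items():
        bin_label = label_map.get(raw_label)
        if bin_label is not None:
            binned.setdefault(bin_label, []).append(score)
    return {b: fn(scores) for b, scores in binned.items()}

tp_scores = {'I': 0.30, 'N': 0.25, 'B': 0.05}    # illustrative point-wise softmax scores
print(bin_scores(tp_scores, label_map, fn=max))  # {'chyron': 0.3}
print(bin_scores(tp_scores, label_map, fn=sum))  # {'chyron': 0.55}
```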

"binning"

  • no binning for TP (prebin): the same (buggy) raw labels were still used as the output space of the SR model.
  • the "relaxed" postbin ("Overall-relaxed" in here) was applied to time point annotations before they were stitched into time frames.

Results

Evaluation method

  • I used essentially the same method as in the previous experiment, except that now, since we applied postbinning, the P/R/F scores of "unfiltered" (before postbinning) and "filtered" (after postbinning) labels are different. To measure the contribution of the stitching step, which only happens after postbinning, I use the difference between the "stitched" and "filtered" P/R/F scores as the measurement.
  • Then I used the same visualization method (parallel coordinates using the HiPlot Python library).

Result files

  • html files (stitcher-results-241102.zip)

    • visualization files are named stitcher-results-!-{[PRF]}-{CONDITION}.html
    • CONDITION is one of filtered, stitched, or diff, meaning the point-wise accuracy of "filtered" labels, "stitched" labels, and the difference ("stitched" - "filtered"), respectively.

Analysis

Stitcher contribution by all bins

  • Across all parameter combinations, I couldn't see any of them negatively impacting the final accuracy. In particular, "credits" showed a wider range of stitcher contribution, followed by "slate" and "other-text"; "chyron" and "bars" seemed already quite stable before stitching.

    visuals

    Screenshot_20241102_211948

  • As seen in the "filtered" scores, "bars", "slate", and "chyron" all reached pretty high accuracy before stitching (as expected given the "leak" of the training data).

    visuals

    Screenshot_20241102_212256

Softmax aggregation method vs stitcher contribution

  • No significant difference is found between max and sum functions. This is probably because we are already using a true majority (0.5) as the softmax threshold.

    visuals

    Screenshot_20241102_212434
    Screenshot_20241102_212437

Frame-wise score (average) threshold

  • Since we use minTPScore=0.5, I experimented with values greater than or equal to that (0.5, 0.75, 0.9) for the minTFScore parameter.

  • Overall, the higher the threshold, the more positively the stitcher contributed.

    visuals

    Screenshot_20241102_212955
    Screenshot_20241102_212615

  • The 0.9 frame-wise score threshold boosts both precision and recall more than 0.75 does.

    visuals

    Screenshot_20241102_220601
    Screenshot_20241102_220534
    Screenshot_20241102_220612

  • Especially for "credits", and less so for "slate" and "other-text".

    visuals

    Screenshot_20241102_213050
    Screenshot_20241102_213048
    Screenshot_20241102_213046
    Screenshot_20241102_213044
    Screenshot_20241102_213042

Smoothing negative "noises"

  • The minNegTFDuration parameter is used to filter out short negative noise.

  • Note that in the previous stitcher implementation (SWT v6.1 or older), this "negative smoothing" wasn't implemented. In the new implementation, minNegTFDuration values less than or equal to the timepoint sampling rate are equivalent to the old implementation (i.e., no smoothing), with minNegTFDuration=1 being the absolute "kill switch" for the negative smoothing feature. (A toy sketch of the smoothing idea follows this list.)

  • Since the sampling rate of the timepoints in this experiment is 1000 ms, I simply ignored minNegTFDuration=1 and compared the 1000 and 2000 ms values.

  • Between the 1- and 2-second thresholds, looking at the average over all bins, 2-sec showed a wider range of positive contribution: sometimes more, sometimes less than the 1-sec threshold.

    visuals

    Screenshot_20241102_213719

  • But when combined with the minTFScore parameter, the longer negative smoothing threshold clearly benefits from the higher frame-wise score threshold, especially for "credits", without hurting the accuracy on other bins.

    visuals
    Screencast_20241102_214818.webm
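To make the negative-smoothing idea concrete, here is a toy sketch of my reading of it (this is not the SDK smoother code; the exact threshold semantics are inferred from the description above):

```python
def smooth_negatives(labels, sample_rate_ms=1000, min_neg_duration_ms=2000):
    """Absorb negative ('-') runs shorter than min_neg_duration_ms when they sit
    between two runs of the same positive label; one label per sampled timepoint."""
    smoothed = list(labels)
    i = 0
    while i < len(smoothed):
        if smoothed[i] == '-':
            j = i
            while j < len(smoothed) and smoothed[j] == '-':
                j += 1
            gap_ms = (j - i) * sample_rate_ms
            # only fill gaps that are short enough and flanked by the same label
            if (0 < i and j < len(smoothed)
                    and smoothed[i - 1] == smoothed[j]
                    and gap_ms < min_neg_duration_ms):
                smoothed[i:j] = [smoothed[i - 1]] * (j - i)
            i = j
        else:
            i += 1
    return smoothed

print(smooth_negatives(list('IIII-III')))    # 1-sec gap absorbed -> all 'I'
print(smooth_negatives(list('IIII--III')))   # 2-sec gap kept (not shorter than 2000 ms)
```

With min_neg_duration_ms at or below the 1000 ms sampling rate, no gap is ever shorter than the threshold, which matches the "equivalent to the old implementation" behavior described above.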

Lastly, allowing frame overlap

  • When the minTFScore and minNegTFDuration parameters are fixed to the best configuration based on the findings so far (0.9 and 2 sec, respectively), the allowOverlap parameter didn't show any differences.

    visuals

    Screenshot_20241102_174057

  • This is probably because of the high frame-wise score threshold, which makes overlapping time frames less likely to occur.

  • Since overlapping time frames would pose significant complexity for downstream apps, I recommend using allowOverlap=False as the default value.

Conclusion

  • In general, stitching works as intended for the smaller label space after postbinning.
  • For the two "experimental" parameters:
    • labelMapFn is likely shadowed by the high minTPScore=0.5 threshold. I recommend keeping it as max for consistency with previous versions.
    • minNegTFDuration (negative smoothing) alone showed mixed results (compared to turning it off), but when combined with a higher minTFScore threshold, it showed a clear accuracy boost. I recommend hardcoding the 2000 ms threshold and not exposing it to the user, to reduce the complexity of the app.
  • Finally, for the default values of the stitcher parameters, I recommend minTFScore=0.9 and allowOverlap=False.
