
SR point-wise evaluation with measuring "stitcher" performance #60

Closed
keighrim opened this issue Jul 10, 2024 · 10 comments · Fixed by #62
Comments

@keighrim
Member

keighrim commented Jul 10, 2024

New Feature Summary

#55 added evaluation software for apps like SWT that only uses TimePoint annotations, but that evaluation can easily be expanded into an evaluation of the "stitcher" component that turns TP annotations into TimeFrame annotations. The idea relies on the fact that all existing stitcher implementations use "label remapping", exposed as a runtime parameter and recorded in the view metadata, which enables us to reconstruct a point-wise but remapped label value list.

So the idea is to update the eval.py file so that it (a rough sketch in code follows this list):

  1. Reads TP annotations as usual, constructing a list of "raw" classification results; let's call it raw
  2. Reads the gold files as usual, constructing a list of gold labels; let's call it gold
  3. Grabs the "latest" view with TimeFrame annotations and the remapper config (map for the SWT built-in stitcher, labelMap for simple-stitcher)
  4. Uses the remapper to map raw and gold into secondary remapped lists (these new lists should be shorter than the originals, since not all raw/gold labels are remapped into the secondary (TF) labels); let's call them raw-remap and gold-remap respectively
  5. Iterates through the TF annotations, constructing a third list of stitched, remapped labels by following the pointers in the targets prop (which must point to TP annotations so the timepoints can be traced); let's call this list stitched
  6. Computes P/R/F between
    • raw vs. gold (this should already be there in the current eval.py)
    • raw-remap vs. gold-remap
    • stitched vs. gold-remap
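Here is a minimal sketch of that flow. The helper names, the index-keyed dict representation, and the use of scikit-learn for P/R/F are my assumptions for illustration, not the actual eval.py code:

```python
# Hypothetical sketch of the proposed eval.py flow; reading MMIF/gold files is out of scope here.
from sklearn.metrics import precision_recall_fscore_support

def remap(labels, label_map):
    """Step 4: keep only timepoints whose label is covered by the remapper config."""
    return {i: label_map[l] for i, l in enumerate(labels) if l in label_map}

def prf(pred, gold):
    """Point-wise macro P/R/F over the timepoint indices present in both label dicts."""
    idxs = sorted(set(pred) & set(gold))
    p, r, f, _ = precision_recall_fscore_support(
        [gold[i] for i in idxs], [pred[i] for i in idxs],
        average='macro', zero_division=0)
    return p, r, f

# raw / gold: point-wise label lists from the TP view and the gold files (steps 1-2)
# label_map: the remapper config grabbed from the latest TimeFrame view (step 3)
# stitched: {timepoint index: TF label}, traced via each TF's `targets` pointers (step 5)
def evaluate(raw, gold, label_map, stitched):
    raw_d, gold_d = dict(enumerate(raw)), dict(enumerate(gold))
    raw_remap, gold_remap = remap(raw, label_map), remap(gold, label_map)
    return {                                       # step 6: the three comparisons
        'raw vs gold': prf(raw_d, gold_d),         # already in the current eval.py
        'raw-remap vs gold-remap': prf(raw_remap, gold_remap),
        'stitched vs gold-remap': prf(stitched, gold_remap),
    }
```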

Related

Resolving this issue will also properly address clamsproject/app-swt-detection#61

Alternatives

No response

Additional context

No response

@keighrim keighrim added the ✨N New feature or request label Jul 10, 2024
@marcverhagen

Some of this was done in 22bee5c (work in progress that I just pushed up to make it visible), but not quite in the way described above. One difference is that it uses the output of the updated process.py script from the annotations repository.

@marcverhagen

Is there overlap here with #43?

@keighrim
Member Author

The evaluation scheme here is based on point-wise evaluation, hence it is not compatible with the "old" interval-level gold data (from around 2020). I don't think this is a duplicate of #43. Eventually, I believe the proposed method will be an evaluation of stitcher components, largely independent of image classification model performance.

I think the most "overlapping" effort to this issue was evaluate.py1 in clamsproject/app-swt-detection@5590a3e , but that file was

  1. never used to produce a public report or associated with any actual (archived) SWT MMIF output files
  2. written a fair while ago, so its current status/compatibility is unknown

so I thought it wouldn't be easy to verify whether the old code still works (plus, this repo is the repo for evaluation code). That led me to open this new issue.

Footnotes

  1. which wasn't easy to dig out, since the most closely related issue (presumably https://github.com/clamsproject/app-swt-detection/issues/61) was closed without mentioning the file, the commit, or an explicit PR merge.

@keighrim
Member Author

Since the existing timepoint evaluation and the proposed timeframe (stitcher) evaluation can be applied to any point-wise classification task, I think it would be more representative to rename the subdirectory to pointclassification_eval (or something like that).

@kla7
Contributor

kla7 commented Aug 21, 2024

To analyze the stitcher's performance, I compared the evaluation scores from filtered (corresponding to raw-remap vs. gold-remap) and stitched (corresponding to stitched vs. gold-remap). To optimize the stitched scores, we needed to perform a grid search over the stitcher's parameters. I compiled all results and analyzed the bar charts generated by a new see_results.py script, using @keighrim's grid search efforts with the following grid (a sketch of the enumeration follows the grid):

minTFDuration = {1000, 2000, 3000, 4000, 5000}
minTPScore = {0.001, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5}
minTFScore = {0.001, 0.01, 0.1, 0.5}
labelMapPreset = {swt-v4-4way, swt-v4-6way}
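For reference, here is a sketch of how such a grid can be enumerated; the runner that actually invokes the stitcher and the evaluation script for one configuration is a hypothetical placeholder:

```python
# Hypothetical grid enumeration; run_stitcher_eval stands in for whatever runs
# the stitcher + eval.py for a single parameter combination.
from itertools import product

grid = {
    'minTFDuration': [1000, 2000, 3000, 4000, 5000],
    'minTPScore': [0.001, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5],
    'minTFScore': [0.001, 0.01, 0.1, 0.5],
    'labelMapPreset': ['swt-v4-4way', 'swt-v4-6way'],
}

def run_stitcher_eval(**config):
    """Placeholder: run the stitcher + evaluation under one configuration, return scores."""
    return {}

results = []
for values in product(*grid.values()):   # 5 x 7 x 4 x 2 = 280 configurations
    config = dict(zip(grid.keys(), values))
    results.append({**config, **run_stitcher_eval(**config)})
```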

For bars, every configuration produced perfect 1.0 scores across all metrics (F1, Precision, Recall). Here is a sample chart from the output visualization for bars1:

image

Across all labels (besides bars), I noticed that higher minTPScore values tended to produce higher scores. Because of this, I focused on the subplot where minTPScore = 0.5 so that I could observe differences between the other three parameters.

Aside from higher minTPScore values, higher minTFDuration values also result in better scores. This makes sense because both of these parameters determine how many TPs and TFs are allowed to be included at all, and higher values for each mean that TPs with lower scores and TFs with shorter durations are excluded, leading to a better chance of success.

labelMapPreset=swt-v4-6way tends to result in lower scores than when labelMapPreset=swt-v4-4way, which makes sense because TFs might be misrecognized as other_opening or other_text when these additional labels are available as options to choose from, creating more room for error.

When minTFScore={0.001, 0.01, 0.1} and minTFDuration & labelMapPreset are fixed, the results across corresponding minTPScore values are identical. minTFScore=0.5 tends to result in slightly higher scores than the smaller minTFScore values.

credits sees the most improvement when comparing the results from filtered to those of stitched. In particular, Recall scores were already fairly high to begin with, but there is an increase of at least 0.02 across all configs. When only considering results where labelMapPreset=swt-v4-4way, F1 scores see an increase of around 0.13 across all configs and Precision sees an increase of 0.2 across all configs. Here is a sample chart from the output visualization for credits, where the metrics are the highest among all configurations2:

image

From all of my observations, I have concluded that the best scores result from the following configuration:

minTFDuration = 5000
minTPScore = 0.5
minTFScore = 0.5
labelMapPreset = swt-v4-4way

Footnotes

  1. Please note that F, P, and R correspond to the average F1, Precision, and Recall scores retrieved from the evaluation script, aggregated across all GUIDs in the given dataset; they are independent of each other.

  2. When observing the ideal configuration highlighted at the bottom of this comment, the Precision score for credits is actually slightly lower than when minTFDuration=4000, but the difference is practically negligible (0.002).

@keighrim
Member Author

For future reference, here are the "result" files used for this grid search/evaluation:
swt6.1-stitcher3.0-results.zip

@keighrim
Member Author

A few follow-up questions:

  • how would stitching without binning (remap) perform? (i.e., just using the single-letter labels)
  • why does 6way remap work poorly in general?
  • 5 seconds seems too long a threshold for chyrons; what happened to chyron?

And future directions:

  • merge the stitcher into SWT, using the smoother implementation from the SDK
  • make SWT work in three modes: TP-only, TF-only (stitcher mode), and TP+TF mode
    • a separate stitcher can add better modularization, but since SWT is the only app that uses this stitching implementation, that benefit doesn't outweigh the user-friendliness of having a single app with all related features built in.

@kla7
Contributor

kla7 commented Sep 14, 2024

how would stitching without binning (remap) perform? (i.e., just using the single-letter labels)

I'm unsure what you mean by this since the current stitching mechanics involve remapping TPs into TFs.

5 seconds seems too long a threshold for chyrons; what happened to chyron?

I don't remember why I did not specifically report on chyron or slate, but both seemed to show great results all around. I will add further reports below to expand on the previous findings and highlight chyron and slate.

why does 6way remap work poorly in general?

My assumption is that some timeframes predicted to be other_text or other_opening should actually be labeled as chyron, credits, or slate, and vice versa. I will include visualizations comparing 6way vs. 4way remap while fixing minTFDuration = 1000 (to better support chyrons, addressing the threshold concern mentioned above) and minTFScore = 0.5 (which I previously found to produce the best results).

For chyron and slate, the F/P/R scores tend to be around 0.02 points lower using 6way remap compared to 4way remap.

Sample comparison between the two label map presets for chyron to demonstrate these findings:
image

Sample comparison between the two label map presets for slate to demonstrate these findings:
image

For credits, R scores are minimally affected (<0.01 difference), but P scores are between 0.04 and 0.15 points lower using 6way remap compared to 4way remap, while F1 scores are around 0.03-0.04 points lower. From these numbers, I suspect that many timeframes that should not be labeled as credits are predicted as such, judging by the near-perfect R scores in contrast with the lower P scores (supported by lower F1 scores).

Sample comparison between the two label map presets for credits to demonstrate these findings:
image

I am now acknowledging my prior misunderstanding of how minTFDuration affects the results, since setting it to 5000ms can lead to the loss of valuable data, as not every TF lasts 5000ms. When minTFDuration is set to 1000ms, the difference between the filtered and stitched F/P/R scores is still significant.

When observing results where minTFDuration = 1000 and minTFScore = 0.5, it is interesting to see that F/P/R scores across the three labels (again, bars has been excluded since it scores perfectly across all configs) are slightly higher when minTPScore = 0.001 compared to when minTPScore = 0.5. I can't figure out a reason why this might be the case. To my understanding, minTPScore determines which TPs are included, so I seem to be missing an intuitive piece here.

@keighrim
Member Author

keighrim commented Nov 1, 2024

241101 experiment with raw labels

Experiment setup

The goals of this experiment are

  1. to find explanations for these two questions raised in a previous comment:
    • how would stitching without binning (remap) perform? (i.e., just using the single-letter labels)
    • 5 seconds seems too long a threshold for chyrons; what happened to chyron?
  2. to find a "good-enough" set of values that works fairly well for "generally complex" label spaces.
    The goal is NOT to find the best set of parameters that fits one optimized binning scheme; rather, it is to find values that hold up reasonably well in general.

Input data

  • 62 videos from the aapb-collaboration-27-[abcd] batches
  • 61 of the 62 videos (all except the d batch) were used as the training set for the scene recognition model that produced the timepoint annotations fed to the stitcher in this experiment, so we should expect the accuracy of the SR model on these videos to be better than usual; hence we won't look at absolute measurements of model accuracy. We'll discuss the evaluation method used in this experiment later in this report.

TimePoint annotations

  • (relevant) configurations
        "app": "http://apps.clams.ai/swt-detection/v6.1",        
        "appConfiguration": {
          "sampleRate": 1000, 
          "modelName": "convnext_lg",
          "usePosModel": true, 
        }                                                                                                     

TimeFrame annotations

  • (relevant) configurations

        "appConfiguration": {
          "useStitcher": true,
          "tfMinTPScore": OR(0.001, 0.055, 0.5),
          "tfMinTFScore": OR(0.055, 0.5),
          "tfMinNegTFDuration": OR(1, 1000, 2000),
          "tfMinTFDuration": OR(2000, 3000, 5000),
          "tfAllowOverlap": OR(True, False)
        }

  • In the above, values in OR() notation are used for the "gridsearch" in the experiment. The total number of parameter combinations is 3 x 2 x 3 x 3 x 2 = 108.
  • Why 0.055 for the score thresholds? That's roughly 1/18 (an even distribution over all labels).
  • The tfMinNegTFDuration parameter is only used for this experimental setup, and won't be exposed in the release version. This parameter was proposed in https://brandeis-llc.slack.com/archives/C06B0B3TYVB/p1727397593356179

"binning"

  • no binning for TP (prebin)
    • At the TP level, we have a total of 18 (BSWLOMINEPYKUGTFCR) "raw" labels (ignoring subtypes of slates)
    • In 6.1, U was omitted from the model by mistake (this will be fixed in 7.1, not 7.0)
    • As a result, BSWLOMINEPYKGTFCR- (raw - U + -, total 18) were used in the TP view of the experimental data
  • no binning for TF (postbin)
    • Namely, the labels for time frames after stitching were also the (buggy) "raw" single-letter labels.
  • some "bin names" used in the rest of this report are simply meant for averaging the scores of multiple raw labels.

Results

Evaluation method

  • Based on the discussion in SR point-wise evaluation with measuring "stitcher" performance #60 (and the implementation in Add new evaluation methods for remapped labels and stitcher #62), we calculated point-wise precision/recall/F-1 scores for "unfiltered" and "stitched" labels (since we didn't use any binning/mapping, the "unfiltered" and "filtered" results are identical, so the latter is ignored).
  • To measure the contribution of the stitcher, we subtract the unfiltered P/R/F scores from the stitched P/R/F scores. In other words, we consider the differences in point-wise P/R/F scores to be the contribution of the stitcher (a larger positive difference means a larger improvement, and vice versa for negative differences).
  • We then visualized the results with their associated parameter values as parallel coordinates using the HiPlot Python library (a sketch of this step follows).
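As an illustration of this step, here is a sketch of the diff computation and the HiPlot export. The row layout (one dict per parameter combination with its point-wise scores) and the dummy values are assumptions; only the `Experiment.from_iterable`/`to_html` calls are HiPlot's documented API:

```python
import hiplot as hip

# One row per parameter combination, with its point-wise F-1 scores (illustrative dummy values).
rows = [
    {'tfMinTPScore': 0.5,   'tfMinTFDuration': 5000, 'unfiltered_f1': 0.71, 'stitched_f1': 0.84},
    {'tfMinTPScore': 0.001, 'tfMinTFDuration': 2000, 'unfiltered_f1': 0.71, 'stitched_f1': 0.65},
]
for row in rows:
    # positive diff = the stitcher improved over the raw point-wise labels
    row['f1_diff'] = row['stitched_f1'] - row['unfiltered_f1']

# render the rows as a parallel-coordinates plot in a standalone HTML file
hip.Experiment.from_iterable(rows).to_html('stitcher-results-F-diff.html')
```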

Result files

  • html files (stitcher-results-241101.zip)

    • visualization files are named stitcher-results-{AVERAGE_BIN}-{[PRF]}-diff.html
    • AVERAGE_BIN is NOT the pre- or postbinning used in the SWT/stitcher app. These bins are used only in the visualization, to compute average P/R/F scores over certain groups of raw (single-letter) labels

Analysis

  • All plots and analysis below are based on F-1 scores.
  • To see the plots for each comparison, click to expand.

When looking at the all labels

Stitcher contribution by all labels

  • The W label showed the widest variance in the grid search, from top to bottom, but treating that as an outlier, most labels showed differences in the range between -0.3 and 0.3 F-1 score.

    visuals

    Screenshot_20241102_091811

Labels showed positive contribution

  • the top improvement of W is intentionally excluded as a noisy outlier in the plot below.

  • 82.6% of all parameter combinations showed a positive impact.

  • most of the larger improvements (red-colored lines) are observed in the T, O, G, and F labels, which are not of our immediate interest

    visuals

    Screenshot_20241102_095023

Point-wise softmax threshold vs stitcher contribution

  • the extremely low threshold (0.001) that was frequently used (to handle negative smoothing) in previous iterations of the SWT app showed the widest impact on the stitcher contribution, including lots of negative ones.

  • the middle value (even distribution, or the "by-chance" baseline) showed mostly positive effects, but with some negatives.

  • the true-majority threshold (0.5) showed only improvements.

    visuals

    Screenshot_20241102_092230
    Screenshot_20241102_092354

  • when the threshold is high, minNegTFDuration values (used for smoothing over short noisy misclassifications in the middle of longer sequences) didn't make meaningful differences.

    visuals

    Screenshot_20241102_092520

  • the same goes for minTFScore and minTFDuration.

    visuals

    Screenshot_20241102_092553
    Screenshot_20241102_092625

Conclusion

How would stitching without binning (remap) perform?

In general, stitching works as intended (82.6% of cases). Using minTPScore=0.5 will make a generally better-working stitcher.

When looking at labels in "relaxed" binning scheme

Stitcher contribution by all labels

  • Similar to the above (all the raw labels)

    visuals

    Screenshot_20241102_100914
    Screenshot_20241102_100232

Average of all "interested" labels

  • The top-ranked instances showed ~0.16 F-1 score improvements on average across all the "interesting" labels.

    visuals

    Screenshot_20241102_092727

  • And we can re-affirm that minTPScore=0.5 resulted in positive impact all the time.

    visuals

    Screenshot_20241102_092812

By bins

  • When the softmax threshold is fixed at 0.5, "other-text" from the relaxed scheme showed the largest improvement (0.18 - 0.22 increase) after stitching, while "chyron" and "credits" showed a ~0.10 difference, and for "slate" and "bars" the stitcher showed only a limited increase.

    visuals

    Screenshot_20241102_092824
    Screenshot_20241102_092829
    Screenshot_20241102_092836
    Screenshot_20241102_092841
    Screenshot_20241102_092845

  • Lower signs of improvement in "slate" and "bars" labels are expected, as the model performance before stitching was already very high.

    visuals

    Screenshot_20241102_093612

Minimum TF duration threshold vs stitcher contribution

  • In a previous discussion, we suspected that the 5-second minTFDuration threshold could be too harsh for chyron types. Interestingly, however, when minTPScore is high, the 5-sec duration threshold showed no signs of performance degradation compared to the shorter duration thresholds.

  • Note, again, that the I and K labels recorded high scores before stitching, and hence the stitcher showed relatively smaller improvements.

    visuals

    Screenshot_20241102_093734
    Screenshot_20241102_093740

  • So, we already saw in @kla7 's previous comment that the stitcher showed large improvements in labels other than "chyron" with the 5-sec duration threshold. Does that still hold when we use "nobin" stitching?

  • Yes, the longer threshold showed the largest improvements in all other labels (except "bars", which we can't really improve further).

    visuals

    Screenshot_20241102_093910
    Screenshot_20241102_093851
    Screenshot_20241102_093855
    Screenshot_20241102_093900
    Screenshot_20241102_093904

Conclusion

As shown in the previous pilot report by @kla7, minTFDuration=5000 will make a generally better-working stitcher (in combination with minTPScore=0.5).

  • 5 seconds seems too long a threshold for chyrons; what happened to chyron?

The 5-sec threshold wasn't actually too long for chyron types.

Conclusion and next steps

I will fix the tfMinTPScore=0.5 and tfMinTFDuration=5000 parameters and conduct a similar experiment with the "relaxed" postbin applied at stitching time.

@keighrim
Member Author

keighrim commented Nov 3, 2024

241102 experiment with "relaxed" postbin

Experiment setup

This is an experiment following up on the findings from the 241101 experiment on stitcher contribution without any binning. The goal of this experiment is to find a set of proper default values for the remainder of the stitcher configuration that works well for the relatively simpler label spaces resulting from "post-binning" of the single-letter raw labels.

Input data

TimeFrame annotations

  • (relevant) configurations

        "appConfiguration": {
          "useStitcher": true,
          "tfMinTPScore": 0.5,
          "tfMinTFScore": OR(0.5, 0.75, 0.9),
          "tfLabelMapFn": OR("sum", "max"),
          "tfMinNegTFDuration": OR(1, 1000, 2000),
          "tfMinTFDuration": 5000,
          "tfAllowOverlap": OR(True, False)
        }

  • In the above, values in OR() notation are used for the "gridsearch" in the experiment. The total number of parameter combinations is 3 x 2 x 3 x 2 = 36.
  • A new experimental parameter is added in this round: tfLabelMapFn. This is the function used to aggregate point-wise confidence scores when binning two or more sublabels. In previous versions of SWT, this function was fixed to max, meaning that given a binning from [I, N] to chyron, the score of chyron, or S(chyron), is determined by the maximum of [S(I), S(N)]. In this experiment, I tried switching it to the sum function (i.e., S(chyron) = S(I) + S(N)) to see if that makes any difference (a small sketch of the two functions follows this list).
  • As we found in the 241101 experiment, I fixed the minTPScore and minTFDuration values to 0.5 and 5 seconds, respectively.
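A minimal sketch of the two aggregation functions, just to make the max vs. sum difference concrete; the score-dict layout and values are illustrative, not the stitcher's actual data structures:

```python
label_map = {'I': 'chyron', 'N': 'chyron'}       # binning [I, N] -> chyron

def bin_scores(tp_scores, label_map, fn=max):
    """Aggregate raw-label softmax scores into bin scores using `fn` (max or sum)."""
    binned = {}
    for raw_label, score in tp_scores.items():
        bin_label = label_map.get(raw_label)
        if bin_label is not None:
            binned.setdefault(bin_label, []).append(score)
    return {b: fn(scores) for b, scores in binned.items()}

tp_scores = {'I': 0.30, 'N': 0.25, 'B': 0.05}    # illustrative point-wise softmax scores
print(bin_scores(tp_scores, label_map, fn=max))  # {'chyron': 0.3}
print(bin_scores(tp_scores, label_map, fn=sum))  # {'chyron': 0.55}
```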

"binning"

  • no binning for TP (prebin): the same (buggy) raw labels were still used as the output space of the SR model.
  • the "relaxed" postbin ("Overall-relaxed" in here) was applied to time point annotations before they were stitched into time frames.

Results

Evaluation method

  • I used essentially the same method as in the previous experiment, except that now, since we applied postbinning, the P/R/F scores of "unfiltered" (before postbinning) and "filtered" (after postbinning) labels are different. To measure the contribution of the stitching step, which only happens after postbinning, I use the difference between the "stitched" and "filtered" P/R/F scores as the measurement.
  • Then I used the same visualization method (parallel coordinates using the HiPlot Python library).

Result files

  • html files (stitcher-results-241102.zip)

    • visualization files are named stitcher-results-!-{[PRF]}-{CONDITION}.html
    • CONDITION is one of filtered, stitched, or diff, meaning the point-wise accuracy of "filtered" labels, "stitched" labels, and the difference ("stitched" - "filtered"), respectively.

Analysis

Stitcher contribution by all bins

  • Across all parameter combinations, I couldn't see any of them negatively impacting the final accuracy. In particular, "credits" showed a wider range of stitcher contribution, followed by "slate" and "other-text"; "chyron" and "bars" seemed already quite stable before stitching.

    visuals

    Screenshot_20241102_211948

  • As seen in the "filtered" scores, "bars", "slate", and "chyron" all reached pretty high accuracy before stitching (as expected given the "leak" of the training data).

    visuals

    Screenshot_20241102_212256

Softmax aggregation method vs stitcher contribution

  • No significant difference is found between max and sum functions. This is probably because we are already using a true majority (0.5) as the softmax threshold.

    visuals

    Screenshot_20241102_212434
    Screenshot_20241102_212437

Frame-wise score (average) threshold

  • Since we use minTPScore=0.5, I experimented with values greater than or equal to that (0.5, 0.75, 0.9) for the minTFScore parameter.

  • Overall, the higher the threshold, the more positively the stitcher contributed.

    visuals

    Screenshot_20241102_212955
    Screenshot_20241102_212615

  • The 0.9 frame-wise score threshold boosts both precision and recall more than 0.75 does.

    visuals

    Screenshot_20241102_220601
    Screenshot_20241102_220534
    Screenshot_20241102_220612

  • Especially for "credits", and less so for "slate" and "other-text".

    visuals

    Screenshot_20241102_213050
    Screenshot_20241102_213048
    Screenshot_20241102_213046
    Screenshot_20241102_213044
    Screenshot_20241102_213042

Smoothing negative "noises"

  • The minNegTFDuration parameter is used to filter out short negative noise.

  • Note that in the previous stitcher implementation (SWT v6.1 or older), this "negative smoothing" wasn't implemented. In the new implementation, minNegTFDuration values less than or equal to the timepoint sampling rate are equivalent to the old implementation (i.e., no smoothing), with minNegTFDuration=1 being the absolute "kill switch" for the negative smoothing feature. (A toy sketch of the smoothing idea follows this list.)

  • Since the sampling rate of the timepoints in this experiment is 1000 ms, I simply ignored minNegTFDuration=1 and compared the 1000 and 2000 ms values.

  • Between the 1- and 2-second thresholds, looking at the average over all bins, 2-sec showed a wider range of positive contribution: sometimes more, sometimes less than the 1-sec threshold.

    visuals

    Screenshot_20241102_213719

  • But when combined with the minTFScore parameter, the longer negative smoothing threshold clearly benefits from the higher frame-wise score threshold, especially for "credits", without hurting the accuracy on other bins.

    visuals
    Screencast_20241102_214818.webm
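To make the negative-smoothing idea concrete, here is a toy sketch of my reading of it (this is not the SDK smoother code; the exact threshold semantics are inferred from the description above):

```python
def smooth_negatives(labels, sample_rate_ms=1000, min_neg_duration_ms=2000):
    """Absorb negative ('-') runs shorter than min_neg_duration_ms when they sit
    between two runs of the same positive label; one label per sampled timepoint."""
    smoothed = list(labels)
    i = 0
    while i < len(smoothed):
        if smoothed[i] == '-':
            j = i
            while j < len(smoothed) and smoothed[j] == '-':
                j += 1
            gap_ms = (j - i) * sample_rate_ms
            # only fill gaps that are short enough and flanked by the same label
            if (0 < i and j < len(smoothed)
                    and smoothed[i - 1] == smoothed[j]
                    and gap_ms < min_neg_duration_ms):
                smoothed[i:j] = [smoothed[i - 1]] * (j - i)
            i = j
        else:
            i += 1
    return smoothed

print(smooth_negatives(list('IIII-III')))    # 1-sec gap absorbed -> all 'I'
print(smooth_negatives(list('IIII--III')))   # 2-sec gap kept (not shorter than 2000 ms)
```

With min_neg_duration_ms at or below the 1000 ms sampling rate, no gap is ever shorter than the threshold, which matches the "equivalent to the old implementation" behavior described above.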

Lastly, allowing frame overlap

  • When the minTFScore and minNegTFDuration parameters are fixed to the best configuration based on the findings so far (0.9 and 2 sec, respectively), the allowOverlap parameter didn't show any differences.

    visuals

    Screenshot_20241102_174057

  • This is probably because of the high frame-wise score threshold, which makes overlapping time frames less likely to occur.

  • Since overlapping time frames would pose significant complexity for downstream apps, I recommend using allowOverlap=False as the default value.

Conclusion

  • In general, stitching works as intended for the smaller label space after postbinning.
  • For the two "experimental" parameters:
    • labelMapFn is likely shadowed by the high minTPScore=0.5 threshold. I recommend keeping it as max for consistency with previous versions.
    • minNegTFDuration (negative smoothing) alone showed mixed results (compared to turning it off), but when combined with a higher minTFScore threshold, it showed a clear accuracy boost. I recommend hardcoding the 2000 ms threshold and not exposing it to the user, to reduce the complexity of the app.
  • Finally, for the default values of the stitcher parameters, I recommend minTFScore=0.9 and allowOverlap=False.
