evaluation of stitcher #61

Closed
keighrim opened this issue Jan 29, 2024 · 2 comments
@keighrim (Member)

Because

We want to see whether the stitcher/smoothing added via #33 is doing well, independently of the accuracy of the image-level classification model.

Done when

A controlled evaluation is done to measure the effectiveness of the stitcher. At a high level, the evaluation should measure the performance difference between raw image-classification results and image-classification results reconstructed from TimeFrame annotations.

Additional context

The original idea for this evaluation was proposed by @owencking in his email on 12/15/2023. Here's an excerpt:


Suppose we have a set of still frames F from a full video. And suppose we have human/gold labels of all those frames. Suppose we have a CV model M and a stitching algorithm S. Then a full SWT app is the composite M+S and can take a video as input.

Generate two sets of image classification predictions for all the frames in F:

  • The first prediction set P1 just uses M (choosing the label with max score across the output categories) to predict one label for each frame.
  • The second prediction set P2 is generated by using M+S on the original video to generate time-based annotations, then producing a label prediction for each frame in F according to the label in the annotation for the time period containing the frame.

Compare both P1 and P2 to the gold labels of the frames in F. Then we can evaluate how good S is according to how much P2 improves over P1. This gives us a way of evaluating different stitching algorithms (somewhat) independently of the CV model. So it will allow us to tell whether performance improvements for time-based eval metrics are coming from the image classifier or the stitching algorithm.
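
To make the comparison concrete, here is a minimal sketch of the P1-vs-P2 scoring, assuming per-frame predictions and gold labels are keyed by frame time in milliseconds and the stitched output is flattened to (start, end, label) spans; these shapes and names are illustration-only assumptions, not the actual SWT/MMIF format.

```python
from typing import Dict, List, Tuple

def label_from_timeframes(time_frames: List[Tuple[int, int, str]], t: int,
                          default: str = "none") -> str:
    """Label of the stitched span containing time t, or a default if none does."""
    for start, end, label in time_frames:
        if start <= t <= end:
            return label
    return default

def accuracy(pred: Dict[int, str], gold: Dict[int, str]) -> float:
    """Fraction of gold-labeled frames whose prediction matches the gold label."""
    return sum(1 for t, g in gold.items() if pred.get(t) == g) / len(gold)

def compare_p1_p2(p1: Dict[int, str],
                  time_frames: List[Tuple[int, int, str]],
                  gold: Dict[int, str]) -> Tuple[float, float]:
    # P2: re-derive one label per gold frame from the stitched TimeFrames (M+S),
    # then score both prediction sets against the same gold labels.
    p2 = {t: label_from_timeframes(time_frames, t) for t in gold}
    return accuracy(p1, gold), accuracy(p2, gold)
```

The gap between the two accuracies is then attributable to S rather than M.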

@keighrim added this to the swt-v3 milestone on Jan 29, 2024
@keighrim (Member, Author) commented Feb 7, 2024

Given the new output format that's being discussed in #41, the evaluation plan is as follows (steps 3 and 4 are sketched in code after the list):

  1. run SWT on an "unseen" video with dense annotation, using the same sample rate that was used in the annotation
    • the model deployed via Model update for v3 #63 was trained using a fold size of 2, meaning only two videos (possibly one) are unseen to the model
    • hence we might need to re-train a model with a larger fold size so that we have more videos to evaluate against
  2. get the output MMIF with TimePoints and TimeFrames.
  3. iterate through the targets list and compare the frameType value of each TimeFrame with the label value of its target TimePoints, collecting pairs that differ
  4. using the timePoint value of the TimePoint annotations in the collected "disagreeing" pairs, look up the gold label, judge which one is correct, and count scores (1 for correct)
  5. normalize (somehow) the counted scores and return them as the evaluation result
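
A minimal sketch of steps 3 and 4, reading the MMIF as plain JSON rather than through the mmif-python SDK; the property names used here (frameType, label, timePoint, targets) follow the format being discussed in #41 and may not match the final output exactly.

```python
import json

def collect_disagreements(mmif_path: str):
    """Collect (timePoint, TimeFrame frameType, TimePoint label) triples that disagree."""
    with open(mmif_path) as f:
        mmif = json.load(f)
    # index every annotation by id so TimeFrame targets can be resolved
    ann_index, timeframes = {}, []
    for view in mmif.get("views", []):
        for ann in view.get("annotations", []):
            ann_index[ann["properties"]["id"]] = ann
            if "TimeFrame" in ann["@type"]:   # loose type check, for illustration only
                timeframes.append(ann)
    disagreements = []
    for tf in timeframes:
        tf_label = tf["properties"].get("frameType")
        for target_id in tf["properties"].get("targets", []):
            tp = ann_index.get(target_id)
            if tp is None:
                continue
            if tp["properties"].get("label") != tf_label:
                # keep the time point so the gold label can be looked up in step 4
                disagreements.append((tp["properties"].get("timePoint"),
                                      tf_label, tp["properties"].get("label")))
    return disagreements
```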

@marcverhagen (Contributor)
Done pretty much as described above, with one difference. Mimicking the sample rate was not possible, since the app at the moment only accepts milliseconds and the rate used for the annotation was specified as some number of frames, I think. So for each annotation I just used a frame that was within at most 32 ms.
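
For illustration, a hypothetical helper for that tolerance match: given an annotated frame's time, pick the closest predicted TimePoint and accept it only if it lies within 32 ms. The names and data shapes here are assumptions, not the script that was actually used.

```python
from typing import Dict, Optional

def match_within_tolerance(annotated_ms: int,
                           predictions: Dict[int, str],
                           tolerance_ms: int = 32) -> Optional[str]:
    """Predicted label closest in time to annotated_ms, if within the tolerance."""
    if not predictions:
        return None
    closest = min(predictions, key=lambda t: abs(t - annotated_ms))
    return predictions[closest] if abs(closest - annotated_ms) <= tolerance_ms else None
```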
