evaluation of stitcher #61

Closed
keighrim opened this issue Jan 29, 2024 · 2 comments
@keighrim (Member)

Because

We want to see whether the stitcher/smoothing added via #33 is doing well, independently of the accuracy of the image-level classification model.

Done when

A controlled evaluation is done to measure the effectiveness of the stitcher. At a high level, the evaluation should measure the performance difference between raw image-classification results and image-classification results reconstructed from TimeFrame annotations.

Additional context

The original idea for this evaluation was proposed by @owencking in his email on 12/15/2023. Here's an excerpt:


Suppose we have a set of still frames F from a full video. And suppose we have human/gold labels of all those frames. Suppose we have a CV model M and a stitching algorithm S. Then a full SWT app is the composite M+S and can take a video as input.

Generate two sets of image classification predictions for all the frames in F:

  • The first prediction set P1 just uses M (choosing the label with max score across the output categories) to predict one label for each frame.
  • The second prediction set P2 is generated by using M+S on the original video to generate time-based annotations, then producing a label prediction for each frame in F according to the label in the annotation for the time period containing the frame.

Compare both P1 and P2 to the gold labels of the frames in F. Then we can evaluate how good S is according to how much P2 improves over P1. This gives us a way of evaluating different stitching algorithms (somewhat) independently of the CV model. So it will allow us to tell whether performance improvements for time-based eval metrics are coming from the image classifier or the stitching algorithm.
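
To make the comparison concrete, here is a minimal sketch of the P1-vs-P2 scoring, assuming per-frame predictions and gold labels are keyed by frame time in milliseconds and the stitched output is flattened to (start, end, label) spans; these shapes and names are illustration-only assumptions, not the actual SWT/MMIF format.

```python
from typing import Dict, List, Tuple

def label_from_timeframes(time_frames: List[Tuple[int, int, str]], t: int,
                          default: str = "none") -> str:
    """Label of the stitched span containing time t, or a default if none does."""
    for start, end, label in time_frames:
        if start <= t <= end:
            return label
    return default

def accuracy(pred: Dict[int, str], gold: Dict[int, str]) -> float:
    """Fraction of gold-labeled frames whose prediction matches the gold label."""
    return sum(1 for t, g in gold.items() if pred.get(t) == g) / len(gold)

def compare_p1_p2(p1: Dict[int, str],
                  time_frames: List[Tuple[int, int, str]],
                  gold: Dict[int, str]) -> Tuple[float, float]:
    # P2: re-derive one label per gold frame from the stitched TimeFrames (M+S),
    # then score both prediction sets against the same gold labels.
    p2 = {t: label_from_timeframes(time_frames, t) for t in gold}
    return accuracy(p1, gold), accuracy(p2, gold)
```

The gap between the two accuracies is then attributable to S rather than M.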

@keighrim added this to the swt-v3 milestone on Jan 29, 2024
@keighrim (Member, Author) commented Feb 7, 2024

Given the new output format that's being discussed in #41, the evaluation plan is as follows (steps 3 and 4 are sketched in code after the list):

  1. run SWT on an "unseen" video with dense annotation, using the same sample rate that was used in the annotation
    • the model deployed via Model update for v3 #63 was trained using a fold size of 2, meaning only two videos (possibly one) are unseen to the model
    • hence we might need to re-train a model with a larger fold size so that we have more videos to evaluate against
  2. get the output MMIF with TimePoints and TimeFrames.
  3. iterate through the targets list and compare the frameType value of each TimeFrame with the label value of its target TimePoints, collecting pairs that differ
  4. using the timePoint value of the TimePoint annotations in the collected "disagreeing" pairs, look up the gold label, judge which one is correct, and count scores (1 for correct)
  5. normalize (somehow) the counted scores and return them as the evaluation result
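
A minimal sketch of steps 3 and 4, reading the MMIF as plain JSON rather than through the mmif-python SDK; the property names used here (frameType, label, timePoint, targets) follow the format being discussed in #41 and may not match the final output exactly.

```python
import json

def collect_disagreements(mmif_path: str):
    """Collect (timePoint, TimeFrame frameType, TimePoint label) triples that disagree."""
    with open(mmif_path) as f:
        mmif = json.load(f)
    # index every annotation by id so TimeFrame targets can be resolved
    ann_index, timeframes = {}, []
    for view in mmif.get("views", []):
        for ann in view.get("annotations", []):
            ann_index[ann["properties"]["id"]] = ann
            if "TimeFrame" in ann["@type"]:   # loose type check, for illustration only
                timeframes.append(ann)
    disagreements = []
    for tf in timeframes:
        tf_label = tf["properties"].get("frameType")
        for target_id in tf["properties"].get("targets", []):
            tp = ann_index.get(target_id)
            if tp is None:
                continue
            if tp["properties"].get("label") != tf_label:
                # keep the time point so the gold label can be looked up in step 4
                disagreements.append((tp["properties"].get("timePoint"),
                                      tf_label, tp["properties"].get("label")))
    return disagreements
```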

@marcverhagen (Contributor)
Done pretty much as described above, with one difference. Mimicking the sample rate was not possible, since the app at the moment only accepts milliseconds and the rate used for the annotation was specified as some number of frames, I think. So for each annotation I just used a frame that was within at most 32 ms.
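
For illustration, a hypothetical helper for that tolerance match: given an annotated frame's time, pick the closest predicted TimePoint and accept it only if it lies within 32 ms. The names and data shapes here are assumptions, not the script that was actually used.

```python
from typing import Dict, Optional

def match_within_tolerance(annotated_ms: int,
                           predictions: Dict[int, str],
                           tolerance_ms: int = 32) -> Optional[str]:
    """Predicted label closest in time to annotated_ms, if within the tolerance."""
    if not predictions:
        return None
    closest = min(predictions, key=lambda t: abs(t - annotated_ms))
    return predictions[closest] if abs(closest - annotated_ms) <= tolerance_ms else None
```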
