
fix hardcoded 94 min limit on positional encoding #100

Closed
keighrim opened this issue May 6, 2024 · 11 comments · Fixed by #108
Labels
✨N New feature or request

Comments

@keighrim
Member

keighrim commented May 6, 2024

New Feature Summary

In the first rounds of training, we used a 94-minute hard cap (the length of the longest video in the training data in those rounds) on the sinusoidal positional vectors. However, we have now realized that

  1. 94 min is too short
  2. "absolute" positional encoding intuitively doesn't make a lot of sense

So before moving on to the next rounds of training (with "hard" examples Owen is currently annotating), we'd like to tweak the positional encoding, and make sure the experiment results we saw in the first rounds (absolute encoding performed the best) are reproducible.

A few ideas for other hybrid positional encodings:

  1. use "absolute" encoding for the first 5 mins (and last N mins) and use relative encoding via linear mapping
  2. precompute many sinusoidal matrices with some bins (5min, 10min, 30min, 60min, 120min, ...)
  3. use all images in the training sets twice, once with positional vectors, second without
  4. different granularity of positions (we used minute and second in the past, but try for example 15 secs)
  5. positional encoding delegated to the "stitcher" part
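For reference, the sinusoidal table that the 94-minute cap bounds can be sketched as follows. This is a minimal pure-Python version of the usual transformer-style construction (sin on even dimensions, cos on odd ones); the project's actual `get_sinusoidal_embeddings` implementation may differ in details.

```python
import math

def get_sinusoidal_embeddings(n_pos, dim):
    """Build a (n_pos x dim) table of sinusoidal positional vectors:
    even columns hold sin, odd columns hold cos, with wavelengths
    increasing geometrically from 2*pi to 10000*2*pi."""
    table = []
    for pos in range(n_pos):
        row = [0.0] * dim
        for i in range(0, dim, 2):
            angle = pos / (10000 ** (i / dim))
            row[i] = math.sin(angle)
            if i + 1 < dim:
                row[i + 1] = math.cos(angle)
        table.append(row)
    return table

# A 94-row table with minute granularity covers only 94 minutes;
# raising n_pos simply adds rows, so longer videos no longer overflow it.
table = get_sinusoidal_embeddings(94, 256)
```

Since each row depends only on its own position index, enlarging `n_pos` leaves the existing rows unchanged, which is why raising the cap is safe for previously seen lengths.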

Related

No response

Alternatives

No response

Additional context

No response

@keighrim keighrim added the ✨N New feature or request label May 6, 2024
@keighrim
Member Author

So after a recent discussion, we decided to first try to implement the first "hybrid" approach.
Concretely, this will involve:

  1. updating this method

    def get_sinusoidal_embeddings(self, n_pos, dim):

    to use a constant, sufficiently high value for n_pos (we can empirically obtain the best size for this by hyperparameterizing it)

  2. then, updating

    self.pos_encoder = pos_enc_name
    self.pos_dim = pos_enc_dim
    self.pos_unit = pos_unit

    to add self.pos_abs_th_front and self.pos_abs_th_end attributes to configure the first N mins (and last M mins) of absolute lookup

  3. and finally, updating

    def encode_position(self, cur_time, tot_time, img_vec):

    to look up the positional vector using the self.pos_abs_... attributes. Specifically, something like this:

    if cur_time < self.pos_abs_th_front or tot_time - cur_time < self.pos_abs_th_end:
        pos_lookup_col = cur_time
    else:
        pos_lookup_col = cur_time / tot_time * self.pos_vec_lookup.shape[1]
    pos_vec = self.pos_vec_lookup[pos_lookup_col]
  4. In addition to that, it'd be a good idea to add one more argument to the encode_position method to regularize the impact of positional encoding. Something like this:

    def encode_position(self, cur_time, tot_time, img_vec, pos_vec_coeff):
        ...
        pos_vec = self.pos_vec_lookup[pos_lookup_col] * pos_vec_coeff
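Putting the steps above together, here is a minimal, self-contained sketch of the hybrid lookup. Function and variable names here are illustrative, not the project's actual API; note also that the relative branch uses integer arithmetic (`cur_time * n_pos // tot_time`) so the index stays a valid in-range integer.

```python
def lookup_position(cur_time, tot_time, pos_vec_lookup,
                    pos_abs_th_front, pos_abs_th_end, pos_vec_coeff):
    """Hybrid lookup: absolute positions near the start and end of the
    video, linearly rescaled (relative) positions in the middle.

    cur_time/tot_time are in the same unit as the rows of pos_vec_lookup
    (e.g. minutes); pos_vec_lookup is a list of positional vectors.
    """
    n_pos = len(pos_vec_lookup)
    if cur_time < pos_abs_th_front or tot_time - cur_time < pos_abs_th_end:
        col = cur_time                       # absolute lookup near the edges
    else:
        col = cur_time * n_pos // tot_time   # relative (linear) mapping
    # scale the whole vector to regularize its impact
    return [v * pos_vec_coeff for v in pos_vec_lookup[col]]

# toy table: 100 positions, 4-dim vectors whose entries equal the row index
lookup = [[float(i)] * 4 for i in range(100)]
early = lookup_position(2, 60, lookup, 5, 5, 0.5)   # absolute branch
middle = lookup_position(30, 60, lookup, 5, 5, 1.0)  # relative branch
```

With a 60-minute video and thresholds of 5, minute 2 hits the absolute branch (row 2) while minute 30 maps to row 50 of the 100-row table.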

@kla7
Contributor

kla7 commented Jun 26, 2024

With the three new hyperparameters pos_abs_th_front, pos_abs_th_end, and pos_enc_coeff, I conducted a gridsearch with the following hyperparameter values:

num_splits = {20}
num_epochs = {10}
num_layers = {4}
pos_enc_name = {"sinusoidal-add"}
input_length = {6000000}
pos_unit = {60000}
pos_enc_dim = {256}
dropouts = {0.1}
img_enc_name = {"convnext_lg"}
pos_abs_th_front = {0, 3, 5, 10}
pos_abs_th_end = {0, 3, 5, 10}
pos_enc_coeff = {1, 0.75, 0.5, 0.25}
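Only the last three parameters vary here, so the sweep covers 4 × 4 × 4 = 64 configurations. A sketch of how such a grid can be enumerated (parameter names taken from the list above; the actual gridsearch script may do this differently):

```python
from itertools import product

# only the varying hyperparameters; the rest are fixed singletons
grid = {
    "pos_abs_th_front": [0, 3, 5, 10],
    "pos_abs_th_end": [0, 3, 5, 10],
    "pos_enc_coeff": [1, 0.75, 0.5, 0.25],
}

# Cartesian product of all value lists, one dict per configuration
configs = [dict(zip(grid, values)) for values in product(*grid.values())]
print(len(configs))  # 64 configurations
```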

Using see_results.py to retrieve visualizations of every possible hyperparameter configuration, I looked through each label's results to determine which configuration gives the best F1-score. Some of the labels had particularly low F1-scores, so I decided to focus on B and I. Below are the compiled F1-score results for labels B and I, put into spreadsheets.

Label B results

[heat-map of F1-scores for label B]

From the image above, it seems that some of the highest F1-scores result for label B when pos_abs_th_front is 0 or 10 and pos_abs_th_end is 3 or 10. Among those scores, pos_enc_coeff seems to result in higher scores when its value is set to 0.5, 0.75, or 1.

Here are the plots for the above-mentioned configurations retrieved from running see_results.py on label B:

(plots omitted; F1-scores transcribed from the plot captions)

| pos_enc_coeff | pos_abs_th_end = 3 | pos_abs_th_end = 10 |
| --- | --- | --- |
| 0.5 | F1 = 0.9296 (pos_abs_th_front = 0), 0.9357 (pos_abs_th_front = 10) | F1 = 0.9394 (pos_abs_th_front = 0), 0.9359 (pos_abs_th_front = 10) |
| 0.75 | F1 = 0.9444 (pos_abs_th_front = 0), 0.9458 (pos_abs_th_front = 10) | F1 = 0.9450 (pos_abs_th_front = 0), 0.9417 (pos_abs_th_front = 10) |
| 1 | F1 = 0.9356 (pos_abs_th_front = 0), 0.9428 (pos_abs_th_front = 10) | F1 = 0.9384 (pos_abs_th_front = 0), 0.9400 (pos_abs_th_front = 10) |

Label I results

[heat-map of F1-scores for label I]

From the image above, it seems that some of the highest F1-scores result for label I when pos_abs_th_front is 0 and pos_abs_th_end is 5 or 10. Among those scores, pos_enc_coeff seems to result in higher scores when its value is set to 0.75.

Here are the plots for the above-mentioned configurations retrieved from running see_results.py on label I:

(plots omitted; F1-scores transcribed from the plot captions)

| pos_enc_coeff | pos_abs_th_end = 5 | pos_abs_th_end = 10 |
| --- | --- | --- |
| 0.75 | F1 = 0.7654 (pos_abs_th_front = 0) | F1 = 0.7851 (pos_abs_th_front = 0) |

@keighrim
Member Author

Wow, this is a helpful way of analyzing the results. From our domain knowledge, I suspect the labels most impacted by positional information would be slates (S) (which almost always occur in the first few mins) and credits (C) (always toward the end). Considering that, here are some requests:

  • Can you apply the same heat-map based analysis on C and S labels?
  • Can you also try pos_enc_name = {"none", "sinusoidal-add"} with all pos_enc-related parameters fixed?

@kla7
Contributor

kla7 commented Jun 27, 2024

Here are the F1-score results for labels S and C, compiled in spreadsheets from the previously performed gridsearch.

Label S results

[heat-map of F1-scores for label S]

From the image above, it seems that some of the highest F1-scores result for label S when pos_abs_th_front is 0 or 10 and pos_abs_th_end is 5. Among those scores, pos_enc_coeff seems to result in higher scores when its value is set to 0.5 or 1.

Here are the plots for the above-mentioned configurations retrieved from running see_results.py on label S:

(plots omitted; F1-scores transcribed from the plot captions)

| pos_enc_coeff | pos_abs_th_end = 5 |
| --- | --- |
| 0.5 | F1 = 0.5830 (pos_abs_th_front = 0), 0.6666 (pos_abs_th_front = 10) |
| 1 | F1 = 0.5939 (pos_abs_th_front = 0), 0.6009 (pos_abs_th_front = 10) |

Label C results

[heat-map of F1-scores for label C]

From the image above, it seems that some of the highest F1-scores result for label C when pos_abs_th_front is 0 or 5 and pos_abs_th_end is 5. Among those scores, pos_enc_coeff seems to result in higher scores when its value is set to 0.25 or 0.75.

(plots omitted; F1-scores transcribed from the plot captions)

| pos_enc_coeff | pos_abs_th_end = 5 |
| --- | --- |
| 0.25 | F1 = 0.5153 (pos_abs_th_front = 0), 0.5044 (pos_abs_th_front = 5) |
| 0.75 | F1 = 0.4998 (pos_abs_th_front = 0), 0.5346 (pos_abs_th_front = 5) |

@kla7
Contributor

kla7 commented Jun 27, 2024

Using fixed values found from the previous observations for the three hyperparameters pos_abs_th_front, pos_abs_th_end, and pos_enc_coeff, I conducted gridsearch with the following hyperparameter values:

num_splits = {20}
num_epochs = {10}
num_layers = {4}
pos_enc_name = {"none", "sinusoidal-add"}
input_length = {6000000}
pos_unit = {60000}
pos_enc_dim = {256}
dropouts = {0.1}
img_enc_name = {"convnext_lg"}
pos_abs_th_front = {0}
pos_abs_th_end = {5}
pos_enc_coeff = {0.75}

I once again used see_results.py to retrieve visualizations of the two possible hyperparameter configurations, each using a different pos_encoder set by pos_enc_name. Below are the plots retrieved from see_results.py for labels B, C, I, and S to observe the performance difference between using sinusoidal-add and using no pos_encoder.

Label B results

[comparison plot for label B]

The results seem about equal when using sinusoidal-add or no pos_encoder.

Label C results

[comparison plot for label C]

While these numbers are quite low on their own, it is interesting to note that the recall for sinusoidal-add is 0.1 points higher than using no pos_encoder. This may suggest that sinusoidal-add is slightly better at detecting credit scenes than using no pos_encoder.

Label I results

[comparison plot for label I]

For label I, it seems that using sinusoidal-add is no better than using no pos_encoder. After observing the heat map for label I from the previous comment, it may be worth performing gridsearch again but with pos_abs_th_end = 10 instead, since the resulting F1-score for the relevant configuration was about as high as using no pos_encoder this time around.

Label S results

[comparison plot for label S]

Again, these numbers are quite low to begin with and there is no significant difference, but it is interesting to note that sinusoidal-add performs better than using no pos_encoder across precision, recall, and F1-score by about a 0.07 point difference. This may suggest that sinusoidal-add is slightly better at detecting scenes containing slates than using no pos_encoder.

@keighrim
Member Author

So it looks like our hypothesis proves mostly true here, except that positional encoding "hurts" prediction performance for the I (chyron) class; I had hypothesized that positional encoding wouldn't make a significant difference for classes that usually occur in the middle of the input stream.

@keighrim
Member Author

keighrim commented Jul 1, 2024

Maybe for the upcoming rounds of experiments, we can also try to see the impact of pos_enc in terms of input video length (duration), i.e., does pos_enc work better for 30-min videos than for 60-min ones?

@marcverhagen
Contributor

The F1-scores worry me a bit since some of them seem to be very close to the lowest of precision and recall. For example, for label S we have P=0.7083 and R=0.5910, but F=0.5964, where my back-of-the-napkin calculation puts it at 0.64, which intuitively makes more sense to me.

In one case, label C, it is even below the lowest of P&R.

@kla7
Contributor

kla7 commented Jul 1, 2024

The line of code to retrieve the relative position was incorrect, so I have altered it and re-ran gridsearch using the same hyperparameters as in #100 (comment).

The following line

    pos_lookup_col = cur_time // tot_time * self.pos_vec_lookup.shape[1]

was changed to

    pos_lookup_col = cur_time * self.pos_vec_lookup.shape[0] // tot_time
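The fix matters because of evaluation order: with integer division, `cur_time // tot_time` is 0 for every frame before the end of the video, collapsing all relative positions to the first row (the change from `shape[1]` to `shape[0]` also switches the index to the position axis, assuming rows are positions). A quick check of both orderings with illustrative numbers:

```python
# minute 30 of a 60-minute video, looking up into a 100-row table
cur_time, tot_time, n_rows = 30, 60, 100

buggy = cur_time // tot_time * n_rows    # (30 // 60) * 100 == 0 for any mid-video time
fixed = cur_time * n_rows // tot_time    # 30 * 100 // 60 == 50, a proper relative index

print(buggy, fixed)
```

So before the fix, every frame of every video shorter than its own total length was assigned the position-0 vector, which plausibly explains the earlier underwhelming results for the relative branch.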

I have also opted to recreate the visualizations with the correct F1-scores using Google Sheets following the F1 calculation issue found by @marcverhagen (I wasn't able to determine the source of the issue).

Label B Results

[comparison plot for label B]
The results still suggest that sinusoidal-add can detect bars about as well as using no pos_enc.

Label C Results

[comparison plot for label C]
The results now show a greater difference in favor of using sinusoidal-add, with an increase of 0.1 points compared to no pos_enc, suggesting that closing credits can be detected better with positional encoding.

Label I Results

[comparison plot for label I]
The results still show that sinusoidal-add does not detect chyrons any better than no pos_enc, which is what was found previously.

Label S Results

[comparison plot for label S]
The results now show a significant difference in favor of using sinusoidal-add, pushing the F1-score to 0.8, nearly a 0.25 point increase from using no pos_enc. This suggests that positional encoding performs fairly well for detecting slates.

My next plan is to perform gridsearch again with the configuration from #100 (comment) to see if there is any improvement following the change in the script. The F1-scores will be more accurate in the next gridsearch report.

@keighrim
Member Author

keighrim commented Jul 2, 2024

Regarding the unexpected range of F1 scores: this is because the result aggregation/plotting script calculates arithmetic means of the P, R, and F numbers from all k-fold rounds independently of each other.

    for row in csv_reader:
        macro_avg[row['Label']]['Accuracy'] += float(row['Accuracy'])
        macro_avg[row['Label']]['Precision'] += float(row['Precision'])
        macro_avg[row['Label']]['Recall'] += float(row['Recall'])
        macro_avg[row['Label']]['F1-Score'] += float(row['F1-Score'])
    if file.endswith(".yml"):
        with open(file, "r") as f:
            data = yaml.safe_load(f)
            data['bins'] = bins.index(data['bins'])  # set bin config as an index of the bin
            # delete unnecessary items
            del data['block_guids_train']
            del data['block_guids_valid']
            del data['num_splits']
            configs[key] = data
    # Calculate macro averages
    for k, v in macro_avg.items():
        for metric in v:
            v[metric] = v[metric] / float(i)
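This averaging scheme explains @marcverhagen's observation: per-fold F1 is a harmonic mean, so the arithmetic mean of per-fold F1 scores is generally not equal to the F1 computed from the averaged P and R, and it can even fall below the lower of the two averages when folds are unbalanced. A toy two-fold illustration (numbers invented for the demonstration, not taken from the experiments):

```python
def f1(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if p + r else 0.0

# two imaginary k-fold rounds with very unbalanced (P, R) pairs
folds = [(0.9, 0.1), (0.6, 0.9)]

avg_p = sum(p for p, _ in folds) / len(folds)           # 0.75
avg_r = sum(r for _, r in folds) / len(folds)           # 0.5
avg_f1 = sum(f1(p, r) for p, r in folds) / len(folds)   # mean of per-fold F1s

# F1 of the averages (0.6) vs. average of the F1s (0.45)
print(round(f1(avg_p, avg_r), 3), round(avg_f1, 3))
```

Here the averaged per-fold F1 (0.45) lands below both the averaged precision (0.75) and the averaged recall (0.5), which is exactly the pattern observed for label C.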

@kla7
Contributor

kla7 commented Jul 3, 2024

Gridsearch Results

The following are results from running gridsearch using the same hyperparameters as in #100 (comment), following the change in the script. The format is the heatmap created in spreadsheets as before, and the values shown are average F1-scores, all retrieved from the visualization outputs of see_results.py.

Label B

[heat-map of F1-scores for label B]
While the differences are very minimal, it seems that some of the highest F1-scores result when pos_abs_th_front is 3 or 5 and pos_abs_th_end is 5 or 10. Among those scores, pos_enc_coeff seems to result in higher scores when its value is set to 0.5.

Label C

[heat-map of F1-scores for label C]
Compared to the results found in #100 (comment), these scores look a lot better; however, they are still fairly low generally. Some of the highest F1-scores result when pos_abs_th_front is 3 or 5, pos_abs_th_end is 5 or 10, and pos_enc_coeff is 0.5.

Label I

[heat-map of F1-scores for label I]
These results look fairly similar to the ones found in #100 (comment). Some of the highest F1-scores result when pos_abs_th_front is 0 or 3, pos_abs_th_end is 3 or 10, and pos_enc_coeff is 0.5 or 1.

Label S

[heat-map of F1-scores for label S]
This label seems to have the most drastic (positive) change compared to the previous gridsearch results in #100 (comment). Some of the highest F1-scores result when pos_abs_th_front is 3 or 10, pos_abs_th_end is 5 or 10, and pos_enc_coeff is 0.75 or 1.

Conclusion

With these findings, I believe that an ideal configuration for the three hyperparameters is as follows:

pos_abs_th_front: 3
pos_abs_th_end: 10
pos_enc_coeff: 0.5

Comparing pos_enc performances

Using the configuration mentioned above, we can compare the performance of using sinusoidal-add as opposed to no pos_enc for the model.

Label B

[comparison plot for label B]
As found previously, there doesn't seem to be much of a difference in performance between using positional encoding compared to not using a pos_enc for detecting bars.

Label C

[comparison plot for label C]
These results show that using positional encoding may allow for the model to detect closing credits better than not using a pos_enc.

Label I

[comparison plot for label I]
Once again, these results show that using positional encoding does not perform any better than using no pos_enc for detecting chyrons and in fact might be slightly worse.

Label S

[comparison plot for label S]
These results are as drastic as in #100 (comment): positional encoding performs nearly 0.3 points better than using no pos_enc, which suggests the model can detect slates fairly well using sinusoidal-add.
