
fix hardcoded 94 min limit on positional encoding #100

Closed
keighrim opened this issue May 6, 2024 · 11 comments · Fixed by #108
Labels
✨N New feature or request

Comments

@keighrim
Member

keighrim commented May 6, 2024

New Feature Summary

In the first rounds of training, we used a 94-minute hard cap (the length of the longest video in the training data in those rounds) on the sinusoidal positional vectors. However, we have now realized that

  1. 94 min is too short
  2. "absolute" positional encoding intuitively doesn't make a lot of sense

So before moving on to the next rounds of training (with "hard" examples Owen is currently annotating), we'd like to tweak the positional encoding, and make sure the experiment results we saw in the first rounds (absolute encoding performed the best) are reproducible.

A few ideas for other hybrid positional encodings:

  1. use "absolute" encoding for the first 5 mins (and last N mins) and use relative encoding via linear mapping
  2. precompute many sinusoidal matrices with some bins (5min, 10min, 30min, 60min, 120min, ...)
  3. use all images in the training sets twice, once with positional vectors, second without
  4. different granularity of positions (we used minute and second in the past, but try for example 15 secs)
  5. positional encoding delegated to the "stitcher" part
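For reference, the sinusoidal table that the 94-minute cap bounds can be sketched as follows. This is a minimal pure-Python version of the usual transformer-style construction (sin on even dimensions, cos on odd ones); the project's actual `get_sinusoidal_embeddings` implementation may differ in details.

```python
import math

def get_sinusoidal_embeddings(n_pos, dim):
    """Build a (n_pos x dim) table of sinusoidal positional vectors:
    even columns hold sin, odd columns hold cos, with wavelengths
    increasing geometrically from 2*pi to 10000*2*pi."""
    table = []
    for pos in range(n_pos):
        row = [0.0] * dim
        for i in range(0, dim, 2):
            angle = pos / (10000 ** (i / dim))
            row[i] = math.sin(angle)
            if i + 1 < dim:
                row[i + 1] = math.cos(angle)
        table.append(row)
    return table

# A 94-row table with minute granularity covers only 94 minutes;
# raising n_pos simply adds rows, so longer videos no longer overflow it.
table = get_sinusoidal_embeddings(94, 256)
```

Since each row depends only on its own position index, enlarging `n_pos` leaves the existing rows unchanged, which is why raising the cap is safe for previously seen lengths.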

Related

No response

Alternatives

No response

Additional context

No response

@keighrim keighrim added the ✨N New feature or request label May 6, 2024
@keighrim
Member Author

So after a recent discussion, we decided to first try to implement the first "hybrid" approach.
Concretely, this will involve:

  1. updating this method

    def get_sinusoidal_embeddings(self, n_pos, dim):

    to use a constant, sufficiently high value for n_pos (we can empirically obtain the best size for this by hyperparameterizing it)

  2. then, updating

    self.pos_encoder = pos_enc_name
    self.pos_dim = pos_enc_dim
    self.pos_unit = pos_unit

    to add self.pos_abs_th_front and self.pos_abs_th_end attributes to configure the first N mins (and last M mins) of absolute lookup

  3. and finally, updating

    def encode_position(self, cur_time, tot_time, img_vec):

    to look up the positional vector using the self.pos_abs_... attributes. Specifically, something like this:

    if cur_time < self.pos_abs_th_front or tot_time - cur_time < self.pos_abs_th_end:
        pos_lookup_col = cur_time
    else:
        pos_lookup_col = cur_time / tot_time * self.pos_vec_lookup.shape[1]
    pos_vec = self.pos_vec_lookup[pos_lookup_col]
  4. In addition to that, it'd be a good idea to add one more argument to the encode_position method to regularize the impact of positional encoding. Something like this:

    def encode_position(self, cur_time, tot_time, img_vec, pos_vec_coeff):
        ...
        pos_vec = self.pos_vec_lookup[pos_lookup_col] * pos_vec_coeff
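Putting the steps above together, here is a minimal, self-contained sketch of the hybrid lookup. Function and variable names here are illustrative, not the project's actual API; note also that the relative branch uses integer arithmetic (`cur_time * n_pos // tot_time`) so the index stays a valid in-range integer.

```python
def lookup_position(cur_time, tot_time, pos_vec_lookup,
                    pos_abs_th_front, pos_abs_th_end, pos_vec_coeff):
    """Hybrid lookup: absolute positions near the start and end of the
    video, linearly rescaled (relative) positions in the middle.

    cur_time/tot_time are in the same unit as the rows of pos_vec_lookup
    (e.g. minutes); pos_vec_lookup is a list of positional vectors.
    """
    n_pos = len(pos_vec_lookup)
    if cur_time < pos_abs_th_front or tot_time - cur_time < pos_abs_th_end:
        col = cur_time                       # absolute lookup near the edges
    else:
        col = cur_time * n_pos // tot_time   # relative (linear) mapping
    # scale the whole vector to regularize its impact
    return [v * pos_vec_coeff for v in pos_vec_lookup[col]]

# toy table: 100 positions, 4-dim vectors whose entries equal the row index
lookup = [[float(i)] * 4 for i in range(100)]
early = lookup_position(2, 60, lookup, 5, 5, 0.5)   # absolute branch
middle = lookup_position(30, 60, lookup, 5, 5, 1.0)  # relative branch
```

With a 60-minute video and thresholds of 5, minute 2 hits the absolute branch (row 2) while minute 30 maps to row 50 of the 100-row table.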

@kla7
Contributor

kla7 commented Jun 26, 2024

With the three new hyperparameters pos_abs_th_front, pos_abs_th_end, and pos_enc_coeff, I conducted a gridsearch with the following hyperparameter values:

num_splits = {20}
num_epochs = {10}
num_layers = {4}
pos_enc_name = {"sinusoidal-add"}
input_length = {6000000}
pos_unit = {60000}
pos_enc_dim = {256}
dropouts = {0.1}
img_enc_name = {"convnext_lg"}
pos_abs_th_front = {0, 3, 5, 10}
pos_abs_th_end = {0, 3, 5, 10}
pos_enc_coeff = {1, 0.75, 0.5, 0.25}
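Only the last three parameters vary here, so the sweep covers 4 × 4 × 4 = 64 configurations. A sketch of how such a grid can be enumerated (parameter names taken from the list above; the actual gridsearch script may do this differently):

```python
from itertools import product

# only the varying hyperparameters; the rest are fixed singletons
grid = {
    "pos_abs_th_front": [0, 3, 5, 10],
    "pos_abs_th_end": [0, 3, 5, 10],
    "pos_enc_coeff": [1, 0.75, 0.5, 0.25],
}

# Cartesian product of all value lists, one dict per configuration
configs = [dict(zip(grid, values)) for values in product(*grid.values())]
print(len(configs))  # 64 configurations
```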

Using see_results.py to retrieve visualizations of every possible hyperparameter configuration, I looked through each label's results to determine which configuration gives the best F1-score. Some of the labels had particularly low F1-scores, so I decided to focus on B and I. Below are the compiled F1-score results for labels B and I, put into spreadsheets.

Label B results

[heat-map of F1-scores for label B]

From the image above, it seems that some of the highest F1-scores result for label B when pos_abs_th_front is 0 or 10 and pos_abs_th_end is 3 or 10. Among those scores, pos_enc_coeff seems to result in higher scores when its value is set to 0.5, 0.75, or 1.

Here are the plots for the above-mentioned configurations retrieved from running see_results.py on label B:

(plots omitted; F1-scores transcribed from the plot captions)

| pos_enc_coeff | pos_abs_th_end = 3 | pos_abs_th_end = 10 |
| --- | --- | --- |
| 0.5 | F1 = 0.9296 (pos_abs_th_front = 0), 0.9357 (pos_abs_th_front = 10) | F1 = 0.9394 (pos_abs_th_front = 0), 0.9359 (pos_abs_th_front = 10) |
| 0.75 | F1 = 0.9444 (pos_abs_th_front = 0), 0.9458 (pos_abs_th_front = 10) | F1 = 0.9450 (pos_abs_th_front = 0), 0.9417 (pos_abs_th_front = 10) |
| 1 | F1 = 0.9356 (pos_abs_th_front = 0), 0.9428 (pos_abs_th_front = 10) | F1 = 0.9384 (pos_abs_th_front = 0), 0.9400 (pos_abs_th_front = 10) |

Label I results

[heat-map of F1-scores for label I]

From the image above, it seems that some of the highest F1-scores result for label I when pos_abs_th_front is 0 and pos_abs_th_end is 5 or 10. Among those scores, pos_enc_coeff seems to result in higher scores when its value is set to 0.75.

Here are the plots for the above-mentioned configurations retrieved from running see_results.py on label I:

(plots omitted; F1-scores transcribed from the plot captions)

| pos_enc_coeff | pos_abs_th_end = 5 | pos_abs_th_end = 10 |
| --- | --- | --- |
| 0.75 | F1 = 0.7654 (pos_abs_th_front = 0) | F1 = 0.7851 (pos_abs_th_front = 0) |

@keighrim
Member Author

Wow, this is a helpful way of analyzing the results. From our domain knowledge, I suspect the labels most impacted by positional information would be slates (S) (which almost always occur in the first few mins) and credits (C) (always toward the end). Considering that, here are some requests:

  • Can you apply the same heat-map based analysis on C and S labels?
  • Can you also try pos_enc_name = {"none", "sinusoidal-add"} with all pos_enc-related parameters fixed?

@kla7
Contributor

kla7 commented Jun 27, 2024

Here are the F1-score results for labels S and C, compiled in spreadsheets from the previously performed gridsearch.

Label S results

[heat-map of F1-scores for label S]

From the image above, it seems that some of the highest F1-scores result for label S when pos_abs_th_front is 0 or 10 and pos_abs_th_end is 5. Among those scores, pos_enc_coeff seems to result in higher scores when its value is set to 0.5 or 1.

Here are the plots for the above-mentioned configurations retrieved from running see_results.py on label S:

(plots omitted; F1-scores transcribed from the plot captions)

| pos_enc_coeff | pos_abs_th_end = 5 |
| --- | --- |
| 0.5 | F1 = 0.5830 (pos_abs_th_front = 0), 0.6666 (pos_abs_th_front = 10) |
| 1 | F1 = 0.5939 (pos_abs_th_front = 0), 0.6009 (pos_abs_th_front = 10) |

Label C results

[heat-map of F1-scores for label C]

From the image above, it seems that some of the highest F1-scores result for label C when pos_abs_th_front is 0 or 5 and pos_abs_th_end is 5. Among those scores, pos_enc_coeff seems to result in higher scores when its value is set to 0.25 or 0.75.

(plots omitted; F1-scores transcribed from the plot captions)

| pos_enc_coeff | pos_abs_th_end = 5 |
| --- | --- |
| 0.25 | F1 = 0.5153 (pos_abs_th_front = 0), 0.5044 (pos_abs_th_front = 5) |
| 0.75 | F1 = 0.4998 (pos_abs_th_front = 0), 0.5346 (pos_abs_th_front = 5) |

@kla7
Contributor

kla7 commented Jun 27, 2024

Using fixed values found from the previous observations for the three hyperparameters pos_abs_th_front, pos_abs_th_end, and pos_enc_coeff, I conducted gridsearch with the following hyperparameter values:

num_splits = {20}
num_epochs = {10}
num_layers = {4}
pos_enc_name = {"none", "sinusoidal-add"}
input_length = {6000000}
pos_unit = {60000}
pos_enc_dim = {256}
dropouts = {0.1}
img_enc_name = {"convnext_lg"}
pos_abs_th_front = {0}
pos_abs_th_end = {5}
pos_enc_coeff = {0.75}

I once again used see_results.py to retrieve visualizations of the two possible hyperparameter configurations, each using a different pos_encoder set by pos_enc_name. Below are the plots retrieved from see_results.py for labels B, C, I, and S to observe the performance difference between using sinusoidal-add and using no pos_encoder.

Label B results

[comparison plot for label B]

The results seem about equal when using sinusoidal-add or no pos_encoder.

Label C results

[comparison plot for label C]

While these numbers are quite low on their own, it is interesting to note that the recall for sinusoidal-add is 0.1 points higher than using no pos_encoder. This may suggest that sinusoidal-add is slightly better at detecting credit scenes than using no pos_encoder.

Label I results

[comparison plot for label I]

For label I, it seems that using sinusoidal-add is no better than using no pos_encoder. After observing the heat map for label I from the previous comment, it may be worth performing gridsearch again but with pos_abs_th_end = 10 instead, since the resulting F1-score for the relevant configuration was about as high as using no pos_encoder this time around.

Label S results

[comparison plot for label S]

Again, these numbers are quite low to begin with and there is no significant difference, but it is interesting to note that sinusoidal-add performs better than using no pos_encoder across precision, recall, and F1-score by about a 0.07 point difference. This may suggest that sinusoidal-add is slightly better at detecting scenes containing slates than using no pos_encoder.

@keighrim
Member Author

So it looks like our hypothesis proves mostly true here, except that positional encoding "hurts" prediction performance for the I (chyron) class; I had hypothesized that positional encoding wouldn't make a significant difference for classes that usually occur in the middle of the input stream.

@keighrim
Member Author

keighrim commented Jul 1, 2024

Maybe for the upcoming rounds of experiments, we can also try to see the impact of pos_enc in terms of input video length (duration), i.e., does pos_enc work better for 30-min videos than for 60-min ones?

@marcverhagen
Contributor

The F1-scores worry me a bit since some of them seem to be very close to the lowest of precision and recall. For example, for label S we have P=0.7083 and R=0.5910, but F=0.5964, where my back-of-the-napkin calculation puts it at 0.64, which intuitively makes more sense to me.

In one case, label C, it is even below the lowest of P&R.

@kla7
Contributor

kla7 commented Jul 1, 2024

The line of code to retrieve the relative position was incorrect, so I have altered it and re-ran gridsearch using the same hyperparameters as in #100 (comment).

The following line

    pos_lookup_col = cur_time // tot_time * self.pos_vec_lookup.shape[1]

was changed to

    pos_lookup_col = cur_time * self.pos_vec_lookup.shape[0] // tot_time
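The fix matters because of evaluation order: with integer division, `cur_time // tot_time` is 0 for every frame before the end of the video, collapsing all relative positions to the first row (the change from `shape[1]` to `shape[0]` also switches the index to the position axis, assuming rows are positions). A quick check of both orderings with illustrative numbers:

```python
# minute 30 of a 60-minute video, looking up into a 100-row table
cur_time, tot_time, n_rows = 30, 60, 100

buggy = cur_time // tot_time * n_rows    # (30 // 60) * 100 == 0 for any mid-video time
fixed = cur_time * n_rows // tot_time    # 30 * 100 // 60 == 50, a proper relative index

print(buggy, fixed)
```

So before the fix, every frame of every video shorter than its own total length was assigned the position-0 vector, which plausibly explains the earlier underwhelming results for the relative branch.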

I have also opted to recreate the visualizations with the correct F1-scores using Google Sheets following the F1 calculation issue found by @marcverhagen (I wasn't able to determine the source of the issue).

Label B Results

[comparison plot for label B]
The results still suggest that sinusoidal-add can detect bars about as well as using no pos_enc.

Label C Results

[comparison plot for label C]
The results now show a greater difference in favor of using sinusoidal-add, with an increase of 0.1 points compared to no pos_enc, suggesting that closing credits can be detected better with positional encoding.

Label I Results

[comparison plot for label I]
The results still show that sinusoidal-add does not detect chyrons any better than no pos_enc, which is what was found previously.

Label S Results

[comparison plot for label S]
The results now show a significant difference in favor of using sinusoidal-add, pushing the F1-score to 0.8, nearly a 0.25 point increase from using no pos_enc. This suggests that positional encoding performs fairly well for detecting slates.

My next plan is to perform gridsearch again with the configuration from #100 (comment) to see if there is any improvement following the change in the script. The F1-scores will be more accurate in the next gridsearch report.

@keighrim
Member Author

keighrim commented Jul 2, 2024

Regarding the unexpected range of F1 scores: this is because the result aggregation/plotting script calculates arithmetic means of the P, R, and F numbers from all k-fold rounds independently of each other.

    for row in csv_reader:
        macro_avg[row['Label']]['Accuracy'] += float(row['Accuracy'])
        macro_avg[row['Label']]['Precision'] += float(row['Precision'])
        macro_avg[row['Label']]['Recall'] += float(row['Recall'])
        macro_avg[row['Label']]['F1-Score'] += float(row['F1-Score'])
    if file.endswith(".yml"):
        with open(file, "r") as f:
            data = yaml.safe_load(f)
            data['bins'] = bins.index(data['bins'])  # set bin config as an index of the bin
            # delete unnecessary items
            del data['block_guids_train']
            del data['block_guids_valid']
            del data['num_splits']
            configs[key] = data
    # Calculate macro averages
    for k, v in macro_avg.items():
        for metric in v:
            v[metric] = v[metric] / float(i)
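This averaging scheme explains @marcverhagen's observation: per-fold F1 is a harmonic mean, so the arithmetic mean of per-fold F1 scores is generally not equal to the F1 computed from the averaged P and R, and it can even fall below the lower of the two averages when folds are unbalanced. A toy two-fold illustration (numbers invented for the demonstration, not taken from the experiments):

```python
def f1(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if p + r else 0.0

# two imaginary k-fold rounds with very unbalanced (P, R) pairs
folds = [(0.9, 0.1), (0.6, 0.9)]

avg_p = sum(p for p, _ in folds) / len(folds)           # 0.75
avg_r = sum(r for _, r in folds) / len(folds)           # 0.5
avg_f1 = sum(f1(p, r) for p, r in folds) / len(folds)   # mean of per-fold F1s

# F1 of the averages (0.6) vs. average of the F1s (0.45)
print(round(f1(avg_p, avg_r), 3), round(avg_f1, 3))
```

Here the averaged per-fold F1 (0.45) lands below both the averaged precision (0.75) and the averaged recall (0.5), which is exactly the pattern observed for label C.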

@kla7
Contributor

kla7 commented Jul 3, 2024

Gridsearch Results

The following are results from running gridsearch using the same hyperparameters as in #100 (comment), following the change in the script. The format is the heatmap created in spreadsheets as before, and the values shown are average F1-scores, all retrieved from the visualization outputs of see_results.py.

Label B

[heat-map of F1-scores for label B]
While the differences are very minimal, it seems that some of the highest F1-scores result when pos_abs_th_front is 3 or 5 and pos_abs_th_end is 5 or 10. Among those scores, pos_enc_coeff seems to result in higher scores when its value is set to 0.5.

Label C

[heat-map of F1-scores for label C]
Compared to the results found in #100 (comment), these scores look a lot better; however, they are still fairly low generally. Some of the highest F1-scores result when pos_abs_th_front is 3 or 5, pos_abs_th_end is 5 or 10, and pos_enc_coeff is 0.5.

Label I

[heat-map of F1-scores for label I]
These results look fairly similar to the ones found in #100 (comment). Some of the highest F1-scores result when pos_abs_th_front is 0 or 3, pos_abs_th_end is 3 or 10, and pos_enc_coeff is 0.5 or 1.

Label S

[heat-map of F1-scores for label S]
This label seems to have the most drastic (positive) change compared to the previous gridsearch results in #100 (comment). Some of the highest F1-scores result when pos_abs_th_front is 3 or 10, pos_abs_th_end is 5 or 10, and pos_enc_coeff is 0.75 or 1.

Conclusion

With these findings, I believe that an ideal configuration for the three hyperparameters is as follows:

pos_abs_th_front: 3
pos_abs_th_end: 10
pos_enc_coeff: 0.5

Comparing pos_enc performances

Using the configuration mentioned above, we can compare the performance of using sinusoidal-add as opposed to no pos_enc for the model.

Label B

[comparison plot for label B]
As found previously, there doesn't seem to be much of a difference in performance between using positional encoding compared to not using a pos_enc for detecting bars.

Label C

[comparison plot for label C]
These results show that using positional encoding may allow for the model to detect closing credits better than not using a pos_enc.

Label I

[comparison plot for label I]
Once again, these results show that using positional encoding does not perform any better than using no pos_enc for detecting chyrons and in fact might be slightly worse.

Label S

[comparison plot for label S]
These results are as drastic as in #100 (comment): positional encoding performs nearly 0.3 points better than using no pos_enc, which suggests the model can detect slates fairly well using sinusoidal-add.
