
new implementation for positional encoding #103

Closed
keighrim opened this issue Jun 17, 2024 · 1 comment · Fixed by #109
Labels
✨N New feature or request

Comments

@keighrim
Member

New Feature Summary

At the moment we have three positional encoding schemes implemented in the trainer code, but the way they are implemented for training, as well as shipped in the app package (as checkpoint files), is very confusing for both devs and users. I'd like to address this problem by re-implementing positional encoding in a simpler, more robust way.

Current status (#26) and problems

  1. we have three positional encoding schemes (fractional, sinusoidal-concat, sinusoidal-add), but each behaves differently, which adds
    1. confusion for model developers in terms of the different dimensions of the vector representation:
      • fractional adds 1 extra dimension to the backbone CNN vectors
      • sinusoidal-concat adds an arbitrary number of dimensions (some power of 2, usually 256 or 512) to the backbone CNN vectors
      • sinusoidal-add adds no extra dimensions
    2. confusion between the concepts of "absolute" and "relative" encoding. Sinusoidal encodings are inspired by the transformer paper, where the original notion of "position" was token position in natural language sentences. That made sense there, since the main purpose of positional encoding in the transformer architecture is to provide the relative difference "between tokens" and to proxy a sequential representation. However, the current CNN-based classification SWT model is not sequential at all, so the absoluteness of the sinusoidal encoding does not make much sense here.
    3. difficulties in grid search and result probing, simply from adding too much variation to the equation.
    4. an unnecessary limitation on the input data format (fix hardcoded 94 min limit on positional encoding #100)
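To make the dimensionality differences concrete, here is a minimal NumPy sketch of the three schemes. The function names, the 10000 sinusoid base, and the example dims are illustrative only, not the trainer's actual code; even feature dims are assumed.

```python
import numpy as np

def sinusoidal(idx, dim):
    # classic transformer-style sinusoid for a single position (dim assumed even)
    i = np.arange(dim // 2)
    angles = idx / np.power(10000.0, 2 * i / dim)
    enc = np.empty(dim)
    enc[0::2] = np.sin(angles)
    enc[1::2] = np.cos(angles)
    return enc

def fractional_enc(feats, idx, total):
    # appends ONE extra dim: the normalized position in [0, 1]
    pos = np.full((feats.shape[0], 1), idx / total)
    return np.concatenate([feats, pos], axis=1)

def sinusoidal_concat(feats, idx, enc_dim=256):
    # widens vectors by enc_dim (e.g., 256 or 512)
    enc = np.tile(sinusoidal(idx, enc_dim), (feats.shape[0], 1))
    return np.concatenate([feats, enc], axis=1)

def sinusoidal_add(feats, idx):
    # adds the encoding element-wise: output keeps the backbone dimensionality
    return feats + sinusoidal(idx, feats.shape[1])
```

With 512-dim backbone vectors, the three schemes produce 513-, 768- (with `enc_dim=256`), and 512-dim outputs respectively, which is exactly the inconsistency described in point 1 above.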

Proposal

I'm proposing that

  1. we always use "sinusoidal-add": this does not change the dimension of the feature vectors, so we keep the CNN-backbone dims regardless of whether positional encoding is used
  2. we always use "relative" position: the absolute difference between time 00:00:07 and 00:00:08 is not relevant information for the current isolated image-based classifier
  3. we expose an "on-off" switch for positional encoding as a runtime parameter: this implies that we always train two models for a given set of hyperparams - one with sinusoidal-add and one without - and delegate the evaluation to a separate evaluation module, possibly as part of evaluating the new SWT app aapb-evaluations#43
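A minimal sketch of what proposals 1 and 2 could look like together, assuming "relative" position means the frame's position as a fraction of the video length mapped onto a fixed number of buckets. The `max_pos` bucket count and the function name are hypothetical, not from the actual trainer:

```python
import numpy as np

def relative_sinusoidal_add(feats, frame_idx, total_frames, max_pos=1000):
    # map the absolute frame index to a relative position in [0, max_pos)
    rel = int((frame_idx / total_frames) * max_pos)
    dim = feats.shape[-1]  # dim assumed even
    i = np.arange(dim // 2)
    angles = rel / np.power(10000.0, 2 * i / dim)
    enc = np.empty(dim)
    enc[0::2] = np.sin(angles)
    enc[1::2] = np.cos(angles)
    # additive, so the backbone dimensionality is unchanged
    return feats + enc
```

Under this scheme, the frame halfway through a 2-minute video and the frame halfway through a 2-hour video get the same encoding, which is the point of dropping absolute position.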

Related

No response

Alternatives

No response

Additional context

No response

@keighrim keighrim added the ✨N New feature or request label Jun 17, 2024
@keighrim
Member Author

keighrim commented Jul 5, 2024

As we can approximate pos_enc_name="none" by using pos_enc_coeff=0, we decided to completely drop the pos_enc_name config key.
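This works if the coefficient simply scales the additive encoding before it is applied, as in this sketch (the helper name is hypothetical):

```python
import numpy as np

def apply_pos_enc(feats, enc, pos_enc_coeff=1.0):
    # pos_enc_coeff=0 zeroes out the encoding entirely,
    # so the result equals the raw backbone features ("none")
    return feats + pos_enc_coeff * enc
```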
