What does "media" mean in the input to PerceiverResampler? Why is a time embedding given to each media item? Shouldn't the time embedding be given to each frame instead?
#301
Open
Yang-bug-star opened this issue
Jun 4, 2024
· 0 comments
According to the original paper, the input to PerceiverResampler should have shape (b, T, v, d), where T is the number of video frames along the time axis and v is the number of visual tokens in one frame. However, I'm confused about the concept of "media" here.
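To make the question concrete, here is a minimal NumPy sketch of the behaviour I would expect from the paper's description: one time embedding per frame, shared across that frame's visual tokens. All names and sizes (`x`, `time_emb`, `b, T, v, d = 2, 4, 3, 8`) are hypothetical and chosen for illustration. This is not the repository's code, and the random `time_emb` only stands in for a learned parameter.

```python
import numpy as np

# Hypothetical sizes, for illustration only.
b, T, v, d = 2, 4, 3, 8  # batch, frames, visual tokens per frame, feature dim

rng = np.random.default_rng(0)
x = rng.standard_normal((b, T, v, d))      # per-frame visual features
time_emb = rng.standard_normal((T, 1, d))  # one vector per frame (stand-in for a learned parameter)

# Broadcasting (T, 1, d) against (b, T, v, d):
# every visual token of frame t receives the same time_emb[t].
x_timed = x + time_emb

# Flatten frames and tokens into one sequence of length T * v.
flat = x_timed.reshape(b, T * v, d)

print(x_timed.shape, flat.shape)  # (2, 4, 3, 8) (2, 12, 8)
```

If "media" in the code plays the role of T above, then the time embedding is effectively per frame. That is the part I would like to confirm.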