What does "media" mean in the input to PerceiverResampler? Why is the time embedding given to different media? Shouldn't it be given to different frames? #301

Yang-bug-star · 2024-06-04T06:55:12Z

According to the original paper, the input to PerceiverResampler should have shape (b, T, v, d), where T is the number of video frames along the time axis and v is the number of visual tokens in one frame. But I'm confused about the concept of "media" here.
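
To make the shapes concrete, here is a minimal NumPy sketch of the layout described above: a learned time embedding is added per frame (the same vector for all v tokens of that frame), then the T and v axes are flattened before resampling. This is only an illustration under those assumptions, not this repository's code; the function name `add_time_embedding_and_flatten`, the embedding table `time_emb`, and the sizes are hypothetical.

```python
import numpy as np


def add_time_embedding_and_flatten(x, time_emb):
    """Add a per-frame time embedding, then merge the frame and token axes.

    x:        array of shape (b, T, v, d)
    time_emb: array of shape (max_T, d), a hypothetical learned table
    returns:  array of shape (b, T * v, d)
    """
    b, T, v, d = x.shape
    # One embedding vector per time index, broadcast over the v tokens of that frame.
    x = x + time_emb[:T][None, :, None, :]
    # Flatten (T, v) into one sequence axis; token i of frame t lands at index t * v + i.
    return x.reshape(b, T * v, d)


rng = np.random.default_rng(0)
b, T, v, d = 2, 8, 64, 16  # illustrative sizes only
x = rng.normal(size=(b, T, v, d))
time_emb = rng.normal(size=(T, d))
out = add_time_embedding_and_flatten(x, time_emb)
print(out.shape)
```

In this sketch the time index runs over frames, which is the reading I would expect from the paper; my question is how "media" relates to that axis.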
