
POS_EMB #15

Closed
polarbear55688 opened this issue Jul 10, 2024 · 3 comments

@polarbear55688

Hello, when I was looking at the config.yaml file, I saw that POS_EMB on line 54 was commented out. I'd like to know how the pos_emb.npy file was generated.

@polarbear55688 (Author)

There is another question: I plan to feed my own dataset (RGB videos) into this model. Besides converting the data to 30 fps first, are there any other adjustments I need to make?

@polarbear55688 (Author)

One last question. The paper mentions using a ViT-like method that cuts the image into multiple patches and feeds them into the model, but the patch size on line 52 of config.yaml is commented out. I'd like to know how you divide the patches: is each video frame treated as one patch, or is each frame cut into fixed-size pieces that serve as patches? If it's the latter, why is the patch value not defined?
Sorry to bother you with so many questions.

@simoneangarano simoneangarano self-assigned this Jul 18, 2024
@simoneangarano (Member) commented Jul 20, 2024

Hi @polarbear55688
Let me try to answer your questions:

Hello, when I was looking at the config.yaml file, I saw that POS_EMB on line 54 was commented out. I'd like to know how the pos_emb.npy file was generated.

That file was part of an experiment with a smarter positional embedding, but we ultimately decided to discard that idea. Just ignore it.
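For context, a transformer like AcT typically falls back to the standard approach here: a positional embedding learned as a trainable weight and added to the token sequence. Below is a minimal Keras sketch of that kind of layer; the class name, shapes, and the `seq_len`/`d_model` values in the usage line are illustrative placeholders, not the repo's exact implementation.

```python
import tensorflow as tf

# Minimal sketch of a standard learnable positional embedding, the kind of
# layer the discarded pos_emb.npy experiment would have replaced.
class LearnablePositionalEmbedding(tf.keras.layers.Layer):
    def __init__(self, seq_len, d_model, **kwargs):
        super().__init__(**kwargs)
        # One trainable vector per position, broadcast over the batch.
        self.pos_emb = self.add_weight(
            name="pos_emb", shape=(1, seq_len, d_model),
            initializer="random_normal", trainable=True)

    def call(self, tokens):
        # tokens: (batch, seq_len, d_model)
        return tokens + self.pos_emb

# Hypothetical usage (31 = 30 frames + a class token, 64 = model width):
# x = LearnablePositionalEmbedding(seq_len=31, d_model=64)(x)
```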

There is another question: I plan to feed my own dataset (RGB videos) into this model. Besides converting the data to 30 fps first, are there any other adjustments I need to make?

No other corrections are needed. One remark: the AcT model takes human poses (skeletal data) as input, so you need to process your videos to extract human poses before feeding them to AcT.
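For anyone preparing their own videos, here is a minimal sketch of that preprocessing step. It uses MediaPipe purely as an illustrative pose extractor (the MPOSE2021 data the paper builds on comes from OpenPose/PoseNet keypoints, so the 33-landmark layout below is an assumption, not the authors' pipeline); resampling to 30 fps can be done beforehand, e.g. with `ffmpeg -i in.mp4 -filter:v fps=30 out.mp4`.

```python
import cv2
import numpy as np
import mediapipe as mp

def extract_poses(video_path):
    """Return per-frame 2D keypoints as a (T, 33, 2) array, coords in [0, 1]."""
    pose = mp.solutions.pose.Pose(static_image_mode=False)
    cap = cv2.VideoCapture(video_path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        # MediaPipe expects RGB input; OpenCV decodes frames as BGR.
        result = pose.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        if result.pose_landmarks is None:
            # No detection: repeat the last pose (zeros for the first frame).
            frames.append(frames[-1] if frames else np.zeros((33, 2)))
        else:
            frames.append(np.array(
                [(lm.x, lm.y) for lm in result.pose_landmarks.landmark]))
    cap.release()
    pose.close()
    return np.stack(frames)
```

The resulting array would then be windowed into fixed-length clips and flattened per frame before being fed to the model.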

One last question. The paper mentions using a ViT-like method that cuts the image into multiple patches and feeds them into the model, but the patch size on line 52 of config.yaml is commented out. I'd like to know how you divide the patches: is each video frame treated as one patch, or is each frame cut into fixed-size pieces that serve as patches? If it's the latter, why is the patch value not defined? Sorry to bother you with so many questions.

Yeah, that line shouldn't be commented out, but we found that a patch size of 1 works best, so nothing changes since it's the default value.
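To make the patching concrete: with a patch size of 1, each frame's flattened keypoint vector becomes one transformer token, so no spatial cutting of the image happens at all; a larger patch size would group consecutive frames into a single token. A small sketch with illustrative dimensions (the keypoint and channel counts are assumptions, not the repo's exact values):

```python
import numpy as np

# Illustrative dimensions only: 30 frames, 13 keypoints, 4 channels
# (e.g., x, y plus velocities); the repo's exact values may differ.
T, K, C = 30, 13, 4
seq = np.random.rand(T, K * C)           # one pose sequence: (30, 52)

patch = 1                                # the default patch size in AcT
tokens = seq.reshape(T // patch, patch * K * C)
print(tokens.shape)                      # (30, 52): one token per frame
```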

Hope this helps!
