The generation of the sub-task 【fine-grained action】 in MVBench #252

Open
yxsysu opened this issue Dec 16, 2024 · 1 comment

@yxsysu

yxsysu commented Dec 16, 2024

Hello authors,

In your paper, you mention that the answer candidates for questions in the sub-task 【fine-grained action】 are generated using UMT-L. Could you please clarify whether you use a pre-trained UMT-L model to encode the videos and the 339 category names (the total number of categories in the Moments in Time dataset), and then compute the text-visual similarity?

Thank you!

@Andy1621
Collaborator

Yes, we use the UMT-L model to encode the video, and then select the top-10 most similar categories based on the prediction score.
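
For readers following along, the selection step described above might look roughly like the sketch below. It is not the authors' code: the embeddings are random stand-ins rather than real UMT-L outputs, the embedding size of 768 is a placeholder, cosine similarity is assumed as the "prediction score", and `top_k_candidates` is a hypothetical helper name. Only the 339 categories and the top-10 cutoff come from the thread.

```python
import numpy as np


def top_k_candidates(video_emb, text_embs, labels, k=10):
    """Rank category labels by cosine similarity to one video embedding.

    Hypothetical helper: the score used in MVBench may differ from
    plain cosine similarity.
    """
    # L2-normalize so that the dot product equals cosine similarity.
    v = video_emb / np.linalg.norm(video_emb)
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    scores = t @ v  # shape: (num_classes,)
    # Indices of the k highest scores, best first.
    order = np.argsort(-scores)[:k]
    return [(labels[i], float(scores[i])) for i in order]


# Stand-in data: in practice these would come from the video and text
# encoders of a pre-trained model such as UMT-L.
rng = np.random.default_rng(0)
num_classes, dim = 339, 768  # 339 categories; 768 is an assumed size
labels = [f"class_{i}" for i in range(num_classes)]
text_embs = rng.normal(size=(num_classes, dim))
# Fake a video whose embedding lies close to category 42.
video_emb = text_embs[42] + 0.1 * rng.normal(size=dim)

candidates = top_k_candidates(video_emb, text_embs, labels, k=10)
for label, score in candidates:
    print(f"{label}\t{score:.3f}")
```

The ground-truth category would then be combined with distractors drawn from this ranked list to form the multiple-choice options; how exactly that is done is not stated in this thread.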
