Has anyone used MusicGen to try and generate embeddings for audio/music datasets? Specifically the language model part, not just EnCodec. I have been trying to do this myself for a research project, and I am struggling to achieve any meaningful separation, even between dramatically different datasets.
Generally, causal (left-to-right, autoregressive) models don't make great embeddings, because the early tokens are missing a lot of context due to the attention structure. Masked language models are better suited for embeddings. That's why many projects (including audiocraft) use T5 for text embeddings, even though larger and newer (but autoregressive) models are available.
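As a toy illustration of the context asymmetry (plain Python, no model needed — the sequence length here is arbitrary): under a causal mask, position i can only attend to positions 0..i, so the earliest hidden states summarize almost nothing of the clip. This is one reason pooling all positions uniformly tends to give weak embeddings from autoregressive models, while a bidirectional/masked model gives every position the full context.

```python
T = 8  # toy sequence length (arbitrary)

# Causal (lower-triangular) attention mask: position i may attend to j <= i.
mask = [[j <= i for j in range(T)] for i in range(T)]

# Number of tokens visible to each position under the causal mask.
context = [sum(row) for row in mask]
print(context)  # [1, 2, 3, 4, 5, 6, 7, 8]
```

The first position sees only itself, while the last sees everything, so if you do stick with a causal model, mean-pooling later layers or taking the final position's state usually separates datasets better than averaging all positions.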
Perhaps MAGNeT would be better for what you're trying to achieve, since it's non-autoregressive.