Gold-Caps

This is a growing repository of AI-generated caption datasets. A caption is a short descriptive or explanatory text that accompanies content. We intend Gold-Caps to be used for research on topics such as cross-modal modelling.

At this time, Gold-Caps contains captions for the LMD-matched subset of the Lakh MIDI Dataset (~30,000 tracks, each with an accompanying MIDI file). These captions were generated by the gpt-4-1106-preview chat endpoint, prompted to describe each track based on the track title and artist. The captions have not been filtered or post-processed in any way. The prompt used was:

Give a general description of the track <title> by <artist_name> in one sentence.
Don't mention the title or artist.
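
For illustration, a caption for a single track could be requested roughly as follows. This is a minimal sketch, assuming the OpenAI Python SDK (v1); caption_track and the example title/artist are hypothetical and not part of this repository.

# Minimal sketch: request a one-sentence caption for one track.
# Assumes the OpenAI Python SDK v1 and OPENAI_API_KEY set in the environment.
from openai import OpenAI

client = OpenAI()

def caption_track(title: str, artist_name: str) -> str:
    # Build the prompt shown above from the track metadata.
    prompt = (
        f"Give a general description of the track {title} by {artist_name} "
        "in one sentence.\nDon't mention the title or artist."
    )
    response = client.chat.completions.create(
        model="gpt-4-1106-preview",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Hypothetical example call:
print(caption_track("Some Track", "Some Artist"))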

Check out some example captions on the demo page

(The demo page also includes example captions from alternative prompts.)

Use the datasets

The dataset is hosted on Zenodo: https://doi.org/10.5281/zenodo.10178563
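
As a rough sketch of programmatic access, the files attached to the Zenodo record can be listed through the Zenodo REST API; the record ID below comes from the DOI, and the JSON field names are assumptions about the API response that may need adjusting.

# Minimal sketch: list the files of the Zenodo record via the REST API.
# Field names ("files", "key", "links"/"self") are assumptions about the response.
import requests

RECORD_ID = "10178563"  # from doi:10.5281/zenodo.10178563

record = requests.get(f"https://zenodo.org/api/records/{RECORD_ID}", timeout=30)
record.raise_for_status()

for entry in record.json().get("files", []):
    # Print each file name together with its download link.
    print(entry.get("key"), entry.get("links", {}).get("self"))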

Comparison with Related Datasets

[1] Noise2Music: Text-conditioned Music Generation with Diffusion Models, Qingqing Huang et al.

In order to build their text conditioning, the authors “[…] take a pseudo-labeling approach via leveraging MuLan (Huang et al., 2022), a pre-trained text and music audio joint embedding model, together with LaMDA (Thoppilan et al., 2022), a pre-trained large language model, to assign pseudo labels with finegrained semantic to unlabeled music audio clips.” The process involves creating a large number of pseudo-captions using LaMDA and filtering them according to their similarity to the audio as computed with MuLan. The models and datasets are not publicly available.

[2] LP-MusicCaps: LLM-Based Pseudo Music Captioning, SeungHeon Doh et al.

The authors use GPT-3.5-Turbo to turn sets of tags associated with the songs in three datasets (MusicCaps, MagnaTagATune, Million Song Dataset) into captions. This is achieved using various prompting strategies and evaluated with both objective and subjective metrics. Models and datasets are released in the authors' repository.

[3] LLark: A Multimodal Foundation Model for Music, Josh Gardner et al.

The authors built a model capable of addressing many tasks in music understanding, including captioning. The model features a pretrained generative audio encoder, a pretrained language model, and a simple multimodal projection module that maps encoded audio into the LLM embedding space. Variants of ChatGPT were used to merge the heterogeneous information in various datasets into uniform inputs for instruction tuning. The resulting captions are not made available by the authors.

Acknowledgments

To cite this project, use the following entry:

@dataset{jonason_2023_10178563,
  author       = {Jonason, Nicolas and
                  Casini, Luca and
                  Sturm, Bob},
  title        = {Gold-Caps\_LMD-Matched\_General},
  month        = nov,
  year         = 2023,
  publisher    = {Zenodo},
  version      = {0.0.0},
  doi          = {10.5281/zenodo.10178563},
  url          = {https://doi.org/10.5281/zenodo.10178563}
}

This work was supported in part by the grant ERC-2019-COG No. 864189 MUSAiC: Music at the Frontiers of Artificial Creativity and Criticism.
