LAION-Audio-630K Dataset

LAION-Audio-630K is a large-scale audio-text dataset consisting of 633,526 pairs with a total duration of 4,325.39 hours. It contains audio of human activities, natural sounds, and audio effects, drawn from 8 data sources (see the data source table below) on publicly available websites. We collected these datasets by downloading the audio and the relevant text descriptions. To the best of our knowledge, as of 2022-11-05 LAION-Audio-630K is the largest publicly available audio-text dataset, an order of magnitude larger than previous audio-text datasets.

Content

Of the 8 datasets, we release only 4: BBC Sound Effects, Epidemic Sound, Audiostock, and Freesound. The first 3 are provided in csv format, since anyone can access the audio publicly through the URL links provided by the corresponding websites. For Freesound, we released the whole dataset (audio files + text captions) on Hugging Face. We do not release the remaining datasets (Free To Use Sounds, Sonniss Game Effects, We Sound Effects, and Paramount Motion Sound Effects) because they were purchased by LAION.

CSV Format

The CSV files have the following structure:

url caption1 caption2 ... caption_t5 {metadata1} {metadata2} ...
  • url: The URL of the audio file
  • caption_i: the i-th caption of the audio file
  • caption_t5: For Epidemic Sound, we applied keywords-to-caption data augmentation using the T5 model. Details can be found in the datacard of Epidemic Sound.
  • {metadata_i}: Additional metadata, for example the Freesound id of the audio.
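
As a rough illustration of the layout above, the following sketch reads such a file with Python's standard `csv` module and separates captions from metadata. The sample rows, the comma delimiter, the exact column names (`caption1`, `caption2`, `freesound_id`), and the helper `read_pairs` are assumptions made for this example; the released files may differ, so check the header of each csv before relying on it.

```python
import csv
import io

# Hypothetical sample in the layout described above. Real files may use a
# different delimiter or different caption/metadata column names.
SAMPLE_CSV = """url,caption1,caption2,caption_t5,freesound_id
https://example.com/a.wav,dog barking,,a dog is barking,123
https://example.com/b.wav,rain,heavy rain on a roof,,456
"""


def read_pairs(csv_file):
    """Yield (url, captions, metadata) for each row of an audio-text CSV."""
    reader = csv.DictReader(csv_file)
    for row in reader:
        captions = []
        metadata = {}
        for name, value in row.items():
            if name == "url":
                continue
            if name.startswith("caption"):
                # Keep only non-empty captions (caption1, ..., caption_t5).
                if value and value.strip():
                    captions.append(value.strip())
            else:
                # Every remaining column is treated as metadata.
                metadata[name] = value
        yield row["url"], captions, metadata


pairs = list(read_pairs(io.StringIO(SAMPLE_CSV)))
for url, captions, metadata in pairs:
    print(url, captions, metadata)
```

To read a real file, pass an open file handle (for example `open(path, newline="", encoding="utf-8")`) instead of the in-memory sample.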

Datacards

We provide a datacard for each dataset we processed, which records how we processed it. To learn more about caption generation and the details of keywords-to-caption data augmentation, please read the datacards available here (for the Epidemic Sound dataset).

About Freesound

We provide two versions of the Freesound dataset.

  • Freesound (full): The original Freesound dataset. Details can be found in its datacard.
  • Freesound (no overlap): Based on Freesound (full), with samples from ESC50, FSD50K, Urbansound8K and Clotho removed.

We have released the processed Freesound dataset in WebDataset format to a Hugging Face repository.
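
For readers unfamiliar with WebDataset: a shard is an ordinary tar archive in which files that share the same name up to the first dot form one sample. The sketch below shows that grouping using only the standard library. The file extensions (`.flac`, `.json`), the `text` field, and the helpers `make_demo_shard` and `iter_samples` are hypothetical choices for illustration, not a description of the released shards; inspect a downloaded shard to see its actual contents.

```python
import io
import json
import tarfile


def make_demo_shard():
    """Build a tiny in-memory tar shard (fake data, for demonstration only)."""
    files = {
        "000001.flac": b"fake-audio-bytes",
        "000001.json": json.dumps({"text": ["a dog barks"]}).encode(),
        "000002.flac": b"more-fake-audio",
        "000002.json": json.dumps({"text": ["rain falls"]}).encode(),
    }
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    buffer.seek(0)
    return buffer


def iter_samples(fileobj):
    """Group consecutive tar members that share a key into one sample dict."""
    current_key, sample = None, {}
    with tarfile.open(fileobj=fileobj, mode="r") as tar:
        for member in tar:
            if not member.isfile():
                continue
            # "dir/000001.flac" -> key "dir/000001", extension "flac".
            directory, _, filename = member.name.rpartition("/")
            stem, _, extension = filename.partition(".")
            key = f"{directory}/{stem}" if directory else stem
            if key != current_key:
                if sample:
                    yield sample
                current_key, sample = key, {"__key__": key}
            sample[extension] = tar.extractfile(member).read()
    if sample:
        yield sample


samples = list(iter_samples(make_demo_shard()))
for s in samples:
    print(s["__key__"], json.loads(s["json"])["text"], len(s["flac"]), "bytes")
```

In practice the `webdataset` Python package handles this grouping (plus decoding and shuffling); the standard-library version is shown only to make the file layout concrete.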

Data Sources

| Name | Duration | Number of Samples | Data Type | Source | Data Card |
|---|---|---|---|---|---|
| Freesound (no overlap) | 2817.31hrs | 460801 | 1-2 captions per audio, audio | website; licenses file; Hugging Face repository | |
| Freesound (full) | 3033.38hrs | 515581 | 1-2 captions per audio, audio | website; licenses file; Hugging Face repository | data card |
| Epidemic Sound | 220.41hrs | 75645 | 2 captions per audio, audio | website; csv (including T5-generated de-biased captions) | data card |
| Audiostock | 453.36hrs | 10000 | 1 caption per audio, audio | website; csv | data card |
| Audiostock (raw) | 11305.42hrs | 251618 | 1 caption per audio, audio | website; csv | data card |
| BBC Sound Effects | 463.48hrs | 15973 | 1 caption per audio, audio | website; csv* (no longer available, see the explanation below) | data card |
| Free To Use Sounds | 175.73hrs | 6370 | Filename as caption, audio | website (purchase required) | |
| Sonniss Game Effects | 84.6hrs | 5049 | Filename as caption, audio | website (purchase required) | |
| We Sound Effects | 12.00hrs | 488 | Filename as caption, audio | website (purchase required) | |
| Paramount Motion Sound Effects | 19.49hrs | 4420 | Filename as caption, audio | website (purchase required) | |

*About BBC Sound Effects

BBC Sound Effects recently changed the structure of its website. As a result, only 300 samples are available for download, and unfortunately we can no longer generate the csv file using our old scripts. In the meantime, several scrapers exist on GitHub, such as https://github.com/alisomay/bbc-sound-effects-downloader. You may try them to see whether they work.

Keyword-to-Caption Augmentation

We use a keyword-to-caption model to turn the labels of AudioSet and Epidemic Sound into corresponding captions with the aid of the pre-trained language model T5. We also de-bias these captions by replacing, for example, "woman" and "man" with "person", aiming to eliminate potential gender discrimination. We hereby release the augmented captions for Epidemic Sound and AudioSet (in csv format).

| Epidemic Sound | AudioSet |
|---|---|
| Epidemic_all_debiased.csv | csv files for AudioSet balanced_train, unbalanced_train, and eval splits |
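
The de-biasing step described above can be pictured as a whole-word substitution. The sketch below is only an illustration: it covers just the two example words named in the text ("woman" and "man"), and the `REPLACEMENTS` table and `debias` helper are hypothetical names, not the script used to produce the released csv files.

```python
import re

# Only the two replacements given as examples in the text. The full word
# list used for the released captions is not specified here.
REPLACEMENTS = {"woman": "person", "man": "person"}

# \b anchors make sure we match whole words only, so "human" is untouched.
_PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, REPLACEMENTS)) + r")\b",
    re.IGNORECASE,
)


def debias(caption):
    """Replace gendered words in a caption with their neutral counterpart."""

    def swap(match):
        word = match.group(0)
        replacement = REPLACEMENTS[word.lower()]
        # Preserve a leading capital letter, e.g. "Woman" -> "Person".
        return replacement.capitalize() if word[0].isupper() else replacement

    return _PATTERN.sub(swap, caption)


print(debias("A woman is singing while a man plays guitar."))
# prints "A person is singing while a person plays guitar."
```

Matching on word boundaries rather than plain substrings is the main design point: a naive `str.replace("man", "person")` would also rewrite words such as "human".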

Credits & Licence

  • !!!TERMS OF USE!!!: By downloading audio through the links provided in the csv files, you agree to use the audio for research purposes only, unless the owners of the data source give you permission to use it for other purposes.

Acknowledgement

The whole collection process, as well as all usage of LAION-Audio-630K, is conducted by LAION, a German non-profit organization dedicated purely to research. All contributors and collectors of the dataset are considered open source contributors affiliated with LAION. These community contributors (Discord ids) include, but are not limited to: @marianna13#7139, @Chr0my#0173, @PiEquals4#1909, @Yuchen Hui#8574, @Antoniooooo#4758, @IYWO#9072, krishna#1648, @dicknascarsixtynine#3885, and @turian#1607. We would like to thank all of them for their efforts on the LAION-Audio-630K dataset.