Audio Dataset for training CLAP and other models. In this readme, we define the standard and method for storing and processing the audio data. Please feel free to propose ideas or comments on this documentation; we will iterate over several rounds to reach a final version. If something confuses you while reading this doc, you can refer to the complete example at the end. If you are still confused after that, please do contact us on Discord!
For audio datasets, our data processing pipeline is:
Raw dataset -> Processed dataset (audio + json) -> Webdataset (set of `.tar` files).
The raw dataset refers to the raw form of the dataset as downloaded (for example, datasets released on Zenodo). Raw datasets may come in various file formats, and may have metadata, captions, or labels stored in different formats. We take the raw dataset and process it into a unified data storage format.
Please find the list of all the datasets we have found here in our GitHub repository.
!! Important !!: For the terminology used below (3 terms: caption, class label, and tag), please find the explanations here.
The processed dataset contains only audio files and their labels. The audio is saved in `.flac` format with a sample rate of 48000 Hz. The labels of the audio, including captions/class labels/tags/metadata, are stored in a `.json` file with the same filename as the `.flac` file. Files are renamed in the processed dataset to a numeric id (`1.flac`, `1.json`) to avoid parsing errors in subsequent processing caused by the original file names.
The labels of the audio are saved in a `.json` file in dictionary form. The keys of the dictionary and their formats are specified as follows:
The value of the key `text` should be a list of strings where each string is a caption. The example below demonstrates a `text` entry with 5 captions in its list:
```json
{
    "text": ["A wooden door creaks open and closed multiple times.",
             "A creaky door opens and then is shut.",
             "A vent releases air in the background and a hinge creaks close by.",
             "An air vent in the background and a hinge creaking in the foreground.",
             "The squeaky door gets louder the more it moves, Then it suddenly slams shut."]
}
```
The captions in the `text` list will be used to form audio-text pairs, which are indispensable for model training, so the list must not be empty. Therefore, for curated datasets that do not offer any captions, we have to make up a caption from their class labels.
- You may refer to the method we adopted for making up captions for AudioSet and FSD50K: if we have class labels A, B and C, then we let the caption be "The sounds of A, B and C".

For speech datasets with transcripts, we prefer to make up the caption as `The person is saying "<transcript>"`. If a dataset offers more information, such as the emotion while speaking, please contact Yuchen Hui and discuss with him how to adapt the make-up method so that the caption includes these extra elements. A minimal sketch of both make-up rules is shown below.
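For illustration, a minimal Python sketch of the two make-up rules above (the function names are ours, not part of the repository):

```python
def caption_from_labels(labels):
    """Make up a caption from class labels, e.g. ["A", "B", "C"] -> "The sounds of A, B and C"."""
    if len(labels) == 1:
        return f"The sounds of {labels[0]}"
    return f"The sounds of {', '.join(labels[:-1])} and {labels[-1]}"


def caption_from_transcript(transcript):
    """Make up a caption from a speech transcript."""
    return f'The person is saying "{transcript}"'


# caption_from_labels(["Musical instrument", "Harp", "Music"])
# -> 'The sounds of Musical instrument, Harp and Music'
```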
Note: the entry name is `tag`, not `tags`! This entry holds the tags of the audio. Its value is a list of strings, i.e. `"tag": ["str1", "str2", ...]`, where each string can be either a class label (e.g., from AudioSet) or a tag (in terms of this definition). Note that even if some class labels are used to make up captions for the `text` key, they should still be listed here. See the complete example below for a list containing both class labels used in caption fabrication and several tags.
The `original_data` entry holds any form of original data associated with the audio. It can be in an arbitrary form, as long as it is consistent within the dataset. For example, if the original annotations of the audio are not in the form of tags or text descriptions, you can save them here. In fact, all metadata other than `text` and `tag` should be stored here.
```json
{
    "text": ["The sounds of Musical instrument, Harp and Music"], // made-up text description from class labels
    "tag": ["Music", "Oriental", "Game-development", "Intro", "Film-production", "Harp", "Oriental-harp", "Musical instrument", "Bonus-sound"],
    "original_data": { // metadata
        "title": "Oriental Harp 3",
        "description": "A professional quality sound effect of an Oriental Harp intro/fill. Suited to game development, film production and media use.",
        "license": "http://creativecommons.org/licenses/by/3.0/",
        "uploader": "Soughtaftersounds",
        "fname": "145450",
        "mids(class_label_id)": ["/m/04szw", "/m/03m5k", "/m/04rlf"]
    }
}
```
In the `data_preprocess` folder, you can find the code and scripts for each raw dataset. If you contribute to processing a new dataset, please add your scripts to the `data_preprocess` folder.
You can find the code for processing audio files in `utils/audio_utils`. Reading this code is a good starting point if you would like to contribute.
Examples of preprocessing a raw dataset can be found at `data_preprocess/preprocess_clotho.py` and `data_preprocess/preprocess_MACS_yuchen.py`.
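For orientation only, here is a simplified sketch of what such a preprocessing script typically does; the paths, the caption, and the label content are placeholders, and the actual scripts in `data_preprocess` remain the reference:

```python
import json
import os
import subprocess

# Hypothetical paths, for illustration only.
RAW_DIR = "raw_dataset"
OUT_DIR = "processed_dataset"
os.makedirs(OUT_DIR, exist_ok=True)

wav_files = sorted(f for f in os.listdir(RAW_DIR) if f.endswith(".wav"))
for idx, fname in enumerate(wav_files):
    # Convert to 48 kHz .flac with FFmpeg and rename to a numeric id.
    out_flac = os.path.join(OUT_DIR, f"{idx}.flac")
    subprocess.run(
        ["ffmpeg", "-y", "-i", os.path.join(RAW_DIR, fname), "-ar", "48000", out_flac],
        check=True,
    )
    # Write the label dictionary to a .json file with the same numeric id.
    label = {
        "text": ["The sounds of ..."],                  # made-up caption (dataset specific)
        "tag": [],                                      # class labels / tags of this audio
        "original_data": {"original_filename": fname},  # everything else goes here
    }
    with open(os.path.join(OUT_DIR, f"{idx}.json"), "w") as f:
        json.dump(label, f)
```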
For each raw dataset, we should leave out part of the dataset as a test set. When generating the `.flac` and `.json` files, please also split the dataset: the `.flac` and `.json` files should be generated under a folder named after the split.
For datasets that come with their own split (e.g., Clotho or AudioSet), use the dataset's split and name the folders train/valid/test (for only two splits, use train/test). For datasets that have custom splits, name the folders according to the dataset's splits. If the dataset has no split, please randomly leave out 10% of it as the test set (a sketch of such a random split is shown after the directory tree below).
Note that we have a tool for this; its path is `/data_preprocess/split_and_rename.py`.
```
preprocessed_dataset_dir
├── Dataset_A
│   ├── train
│   └── test
├── Dataset_B (if it has a train/valid/test split)
│   ├── train
│   ├── valid
│   └── test
└── Dataset_C (if it has custom splits)
    ├── train
    ├── custom_split_1
    └── custom_split_2
```
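If a dataset has no split of its own, the 90%/10% leave-out could be done along the lines of the sketch below (`split_and_rename.py` is the actual tool; this only illustrates the idea, and the paths are placeholders):

```python
import os
import random
import shutil

data_dir = "processed_dataset"   # folder with numbered .flac/.json pairs (placeholder)
output_dir = "split_dataset"     # placeholder

ids = sorted(int(f[:-5]) for f in os.listdir(data_dir) if f.endswith(".flac"))
random.seed(0)
random.shuffle(ids)

n_test = int(len(ids) * 0.1)               # leave out 10% as the test set
splits = {"test": ids[:n_test], "train": ids[n_test:]}

for split, split_ids in splits.items():
    split_dir = os.path.join(output_dir, split)
    os.makedirs(split_dir, exist_ok=True)
    for new_id, old_id in enumerate(split_ids):
        # Copy each pair into its split folder and renumber from 0.
        for ext in (".flac", ".json"):
            shutil.copy(os.path.join(data_dir, f"{old_id}{ext}"),
                        os.path.join(split_dir, f"{new_id}{ext}"))
```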
To get to the Webdataset format, we need the auxiliary function `tardir` in `make_tar_utils.py` (it is used by `make_tar.py`). This function creates the `.tar` files that include the audio and text files located in the same folder. One can indicate how many pairs of files should go into each `.tar` file. For example, calling `make_tar_utils.tardir(file_path='PATH\TO\WHERE\AUDIO_TEXT_PAIRS\LOCATE', tar_name='PATH\TO\THE\OUTPUT\FOLDER\TARFILENAME', n_entry_each=<some integer>)` will give you `n_entry_each` (audio, text) file pairs in each tar file, named `TARFILENAME0`, `TARFILENAME1`, etc. All the `.flac` audio and `.json` text files in `file_path` will be packed up.
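For intuition, a simplified sketch of this packing logic (the real implementation is the `tardir` function in `make_tar_utils.py`; this is not its actual code):

```python
import os
import tarfile


def tardir_sketch(file_path, tar_name, n_entry_each):
    """Pack the (.flac, .json) pairs in file_path into tar files of n_entry_each pairs each."""
    keys = sorted(f[:-5] for f in os.listdir(file_path) if f.endswith(".flac"))
    for shard, start in enumerate(range(0, len(keys), n_entry_each)):
        with tarfile.open(f"{tar_name}{shard}.tar", "w") as tar:
            for key in keys[start:start + n_entry_each]:
                # The audio and its label json share the same key inside the tar.
                tar.add(os.path.join(file_path, key + ".flac"), arcname=key + ".flac")
                tar.add(os.path.join(file_path, key + ".json"), arcname=key + ".json")
```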
The function `load_from_tar` loads (audio, text, name) tuples from a specific `.tar` file with some choice of audio decoding parameters; see the documentation of the function for details. And, of course, we have a different function for the dataloader. `load_from_tar` is only meant for debugging and temporarily reading `.tar` files.
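For quick debugging without the repository code, a `.tar` shard can also be inspected with only the standard library and `soundfile`; the following sketch is not the actual `load_from_tar`:

```python
import io
import json
import tarfile

import soundfile as sf


def peek_tar(tar_path, n=3):
    """Print the first n (audio, text, name) tuples found in a tar file."""
    with tarfile.open(tar_path) as tar:
        flac_names = [m.name for m in tar.getmembers() if m.name.endswith(".flac")]
        for name in flac_names[:n]:
            key = name[:-5]
            audio, sr = sf.read(io.BytesIO(tar.extractfile(key + ".flac").read()))
            meta = json.loads(tar.extractfile(key + ".json").read())
            print(key, audio.shape, sr, meta["text"][0])
```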
We use webdataset as the final format to save the data, for better data-loading performance when training the models. The webdataset packs all the files of a processed dataset into several `.tar` files. Each `.tar` file contains a subset of the processed dataset files. These `.tar` files are the ones read by the dataloader when we train the models.
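For example, one possible way to iterate over such shards with the `webdataset` package (decoding is done by hand here; the actual training dataloader of this project may differ):

```python
import io
import json

import soundfile as sf
import webdataset as wds

# A local shard path or a "pipe:aws s3 cp ..." URL both work here.
dataset = wds.WebDataset("name0.tar")

for sample in dataset:
    audio, sr = sf.read(io.BytesIO(sample["flac"]))  # raw flac bytes
    meta = json.loads(sample["json"])                # the label dictionary
    print(sample["__key__"], audio.shape, sr, meta["text"][0])
    break
```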
The standard of the webdataset and the way to create it:

```bash
python make_tar.py --input /mnt/audio_clip/processed_datasets/audiocaps/ --output /mnt/audio_clip/webdataset_tar/audiocaps/ --dataclass all --num_element 512 --filename name
```

- Note that `make_tar.py` makes use of the function `tardir` mentioned above.
- It is suggested to use absolute paths for the `--input` and `--output` arguments.
Meaning of this command:
- We expect (`.flac`, `.json`) file pairs in `/mnt/audio_clip/processed_datasets/audiocaps/{}/`, where `{}` can be `train`, `test`, or `valid`, which should be indicated in `--dataclass`. (See the example at the end of the document.)
```
......
├── preprocessed_dataset_dir
│   ├── audiocaps
│   │   ├── train
│   │   ├── valid
│   │   └── test
```
- The output tar files will be like `/mnt/audio_clip/webdataset_tar/audiocaps/train/name0.tar`. Each tar includes 512 (`.flac`, `.json`) file pairs.
- A `sizes.json` file will also be output, indicating the size of each `tar` file in the folder.
```
......
├── webdataset_tar
│   ├── audiocaps
│   │   ├── train
│   │   │   ├── sizes.json
│   │   │   ├── name0.tar
│   │   │   ├── name1.tar
│   │   │   └── ...
│   │   ├── valid
│   │   │   ├── sizes.json
│   │   │   ├── name0.tar
│   │   │   ├── name1.tar
│   │   │   └── ...
│   │   └── test
│   └── ...
```
The output `sizes.json` will look like:

```json
{
    "name0.tar": 512,
    "name1.tar": 512,
    ...
}
```
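If you ever need to regenerate such a file by hand, counting the pairs per shard is enough; `make_tar.py` writes this file automatically, so the following is only a sketch (the folder path is a placeholder):

```python
import glob
import json
import os
import tarfile

folder = "webdataset_tar/audiocaps/train"   # placeholder path
sizes = {}
for tar_path in sorted(glob.glob(os.path.join(folder, "*.tar"))):
    with tarfile.open(tar_path) as tar:
        # One .flac per (audio, text) pair, so counting .flac members counts the pairs.
        sizes[os.path.basename(tar_path)] = sum(
            1 for m in tar.getmembers() if m.name.endswith(".flac")
        )

with open(os.path.join(folder, "sizes.json"), "w") as f:
    json.dump(sizes, f, indent=2)
```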
To make sure that all datasets uploaded to `s3://s-laion-audio` are bug-free and usable, we invite every contributor to follow the three data-check steps below (a sketch of the audio check used in steps 1 and 2 is shown after this list):

1. Before using `FFmpeg` to convert any audio file to `.flac` format, apply the method `soundfile.read()` from the Python `soundfile` module to every audio file in the raw dataset. If an exception is thrown when applying `soundfile.read()` to an audio file, we consider that file broken; discard it and do not convert it to `.flac` (i.e., it will not be included in the final dataset). You can use `check_audio.py` for this. (See the example at the end of this document.)
2. After using `FFmpeg` to convert all audio files into `.flac` format, but before they are packed up with the json files: repeat the process described in point 1 and discard any flac files for which `soundfile.read()` throws an exception. You can use `check_audio.py` for this. (See the example at the end of the document.)
3. Once a dataset is converted to the Webdataset format and uploaded to `s3://s-laion-audio/webdataset_tar/`, we check it one last time: use the script described here to verify that the tar files are indeed not broken.
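Conceptually, the audio check in steps 1 and 2 boils down to the sketch below (use the actual `check_audio.py` in practice; the function name here is ours):

```python
import glob
import os

import soundfile as sf


def find_broken_audios(directory, ext):
    """Return the files under `directory` with the given extension that soundfile cannot read."""
    broken = []
    for path in glob.glob(os.path.join(directory, f"**/*.{ext}"), recursive=True):
        try:
            sf.read(path)
        except Exception:
            broken.append(path)   # broken file: discard it, do not convert or pack it
    return broken


# e.g. find_broken_audios("raw_dataset", "wav") before conversion,
#      find_broken_audios("processed_dataset", "flac") after conversion.
```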
Aiming at archiving the methods we used to process datasets, we decided to make a "data card" for each dataset. As a result, we invite you to make a data card for the datasets you processed. You may refer to the data card of the freesound dataset as a template.
There are two essential parts in a data card:
- Data collection: in this part we describe how we collected the dataset, covering two aspects:
  - the source of the dataset, usually a URL
  - the collection method, which could be, for example, downloading directly or crawling a website, etc.
- Preprocessing principles:
  - We hope you can include 2 audio-json pairs as an illustration of the preprocessing principles. However, the audio has to be in `.mp4` or `.mov` format (i.e., a video format; an audio format will not work) to be supported by GitHub's markdown engine. You can refer to this: https://stackoverflow.com/questions/44185716/add-audio-in-github-readme-md
  - json file generation principles (most important part):
    - what the content of the "text" entry is (most important, since we use this to train the model)
    - what the content of the "tag" entry is
    - what the content of the "original_data" entry is
  - Audio filtering principles and audio format specification: the same for all datasets, so just copy that of the freesound dataset.
We provide an example to illustrate the entire process. The example is based on the MACS dataset.
```bash
#!/bin/bash
# /fsx/MACS/TAU2019/TAU-urban-acoustic-scenes-2019-development/audio : raw dataset path
# /fsx/yuchen/macs/processed/ : processed dataset path
cd /fsx/yuchen/macs/
# upload raw dataset to s3
aws s3 cp --recursive /fsx/MACS/TAU2019/TAU-urban-acoustic-scenes-2019-development/audio/ s3://s-laion-audio/raw_dataset/MACS/
# data check before processing:
python3 check_audio.py --dir /fsx/MACS/TAU2019/TAU-urban-acoustic-scenes-2019-development/audio --ext wav
# data preprocess:
python3 preprocess_MACS_yuchen.py
# data check after processing
python3 check_audio.py --dir /fsx/yuchen/macs/processed --ext flac
# split (train 90% / test 10%)
python3 split_and_rename.py --data_dir /fsx/yuchen/macs/processed/ --output_dir /fsx/yuchen/macs/split/
cd /fsx/yuchen/utils/
# make tar (train)
python3 make_tar.py --input /fsx/yuchen/macs/split --output /fsx/yuchen/macs/webdataset/ --dataclass train --num_element 512
# make tar (test)
python3 make_tar.py --input /fsx/yuchen/macs/split --output /fsx/yuchen/macs/webdataset/ --dataclass test --num_element 512
# upload to s3
aws s3 cp --recursive /fsx/yuchen/macs/webdataset/ s3://s-laion-audio/webdataset_tar/MACS/
cd /fsx/yuchen/check_tar/
# data check after uploading
python3 test_tars.py --tar-path "pipe:aws s3 --cli-connect-timeout 0 cp s3://s-laion-audio/webdataset_tar/MACS/train" --start 0 --end 7 --batch-size 1 --order
python3 test_tars.py --tar-path "pipe:aws s3 --cli-connect-timeout 0 cp s3://s-laion-audio/webdataset_tar/MACS/test" --start 0 --end 1 --batch-size 1 --order
```
To contribute, please create a branch of your own and make pull requests to the main branch.
If you contribute to processing a new dataset, please add your processing/scraping scripts to the `audio-dataset/data_preprocess` folder. If possible, please also upload the processed (not yet packed up) dataset to the `s3://s-laion-audio` bucket.
If you contribute to processing a new dataset, please move the final webdataset to the AWS S3 bucket: `S3://s-laion-audio/webdataset_tar/`.