We introduce an efficient pipeline for collecting ambient audio. It starts by analyzing automatic transcriptions of online videos to identify non-speech segments. Our captioning model, AutoCap, then generates captions, and segments whose captions contain music- or speech-related keywords are filtered out. Relying on time-aligned transcriptions reduces the filtering rate and streamlines the process, since the audio files never need to be downloaded or processed during selection.
To initialize your environment, please refer to the general README.
- We currently provide the following datasets:
  - AutoReCapXL: more than 47M audio-text pairs, filtered to have a LAION CLAP similarity above 0.1.
  - AutoReCapXL-MQ: more than 20.7M audio-text pairs, filtered to have a LAION CLAP similarity above 0.4.
  - AutoReCapXL-MQ-L: more than 14.7M audio-text pairs, filtered to have a LAION CLAP similarity above 0.4 and audio clips longer than 5 seconds.
  - AutoReCapXL-HQ: more than 10.7M audio-text pairs, filtered to have a LAION CLAP similarity above 0.5.
The AutoReCap datasets are derived from YouTube videos and consist mainly of ambient audio clips, with a small number of speech and music clips. Please refer to the paper for more details on these datasets. They can be filtered by CLAP similarity threshold and minimum audio clip length, as described below.
```bash
python download.py --save_dir <path-to-save-dir> --dataset_name <dataset-subset>

# Example
python download.py --save_dir data/datasets/autocap --dataset_name AutoReCapXL-HQ --audio_only

# Example of filtering according to CLAP similarity and audio clip length
python download.py --save_dir data/datasets/autocap --dataset_name AutoReCapXL --clap_threshold 0.4 --min_audio_len 5 --audio_only

# Example of downloading only a subset of the dataset
python download.py --save_dir data/datasets/autocap --dataset_name AutoReCapXL-HQ --start_idx 0 --end_idx 100000 --audio_only
```
By default, the script will download videos along with their metadata.
We provide the following helpful arguments (a combined example is shown after the list):
- `--sampling_rate`: Specifies the sampling rate at which the audio files are stored.
- `--audio_only`: Download only the audio files and discard the videos. This is helpful to save storage space.
- `--files_per_folder`: Downloaded files are organized into many folders. This argument specifies how many files to store per folder.
- `--start_idx`, `--end_idx`: Download only a subset of the dataset.
- `--proxy`: For large downloads, YouTube might block your address. You may SSH to another machine at a specific port and provide it using this argument.
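As a sketch of how these flags can be combined, the command below performs an audio-only download with a fixed sampling rate, folder size, index range, and proxy. The specific values and the proxy address are illustrative placeholders, not recommended settings.

```bash
# Audio-only download at a 16 kHz sampling rate, 5000 files per folder,
# restricted to the first 500k entries and routed through a proxy.
# All values and the proxy address are illustrative placeholders.
python download.py --save_dir data/datasets/autocap --dataset_name AutoReCapXL-MQ \
    --audio_only --sampling_rate 16000 --files_per_folder 5000 \
    --start_idx 0 --end_idx 500000 --proxy <address:port>
```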
Once the dataset finishes downloading, run the following script:
```bash
python organize_dataset.py --save_dir <path-to-dataset> \
    --dataset_name <key-to-store-dataset> \
    --split <split-type> \
    --files_per_subset <number_of_files_per_subset>

# Example
python organize_dataset.py --save_dir data/datasets/autocap --dataset_name autocap --split train
```
- Important: Use different `dataset_name` values for different splits (see the example after this list).
- If `--files_per_subset` is set to more than one, the dataset keys will be named `dataset_name_subset_1`, `dataset_name_subset_2`, etc.
- The dataset details can be found at `data/metadata/dataset_root.json`.
- Add the dataset keys under the `data` attribute in your config file for the audio generation and captioning experiments.
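For instance, a validation split can be registered under its own key and sharded into multiple subsets. The split name `val`, the key `autocap_val`, and the subset count below are assumptions made for illustration; adjust them to your setup.

```bash
# Register a validation split under its own dataset key and shard it into
# 4 subsets (keys become autocap_val_subset_1 ... autocap_val_subset_4).
# The split name, dataset key, and subset count are illustrative assumptions.
python organize_dataset.py --save_dir data/datasets/autocap \
    --dataset_name autocap_val --split val --files_per_subset 4
```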
You need to arrange your audio files in one folder using the following structure:
- Folder
  - 000000
    - Id_1.wav
    - Id_1.json
    - Id_2.wav
    - Id_2.json
  - 000001
    - Id_3.wav
    - Id_3.json
  - ...
- In the JSON files, add metadata such as `title`, `description`, `video_caption`, and `gt_audio_caption` (a sketch is shown after this list).
- Organize your dataset following the instructions in Dataset Organization.
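As a minimal sketch, a per-clip metadata file such as `000000/Id_1.json` might look like the following. The field values are placeholders, and the exact set of fields your experiments require may differ.

```json
{
  "title": "Morning walk in the park",
  "description": "Raw clip recorded during a morning walk.",
  "video_caption": "A person walks along a tree-lined path in a park.",
  "gt_audio_caption": "Birds chirping with distant traffic noise."
}
```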
We provide a script for downloading the WavCaps datasets. Run the following commands to download and organize each of these datasets:
```bash
python download_external_datasets.py --save_root <path-to-save-root> \
    --dataset_names "dataset_key_1" "dataset_key_2" ...

# Organize each downloaded dataset
python organize_dataset.py --save_dir <path-to-downloaded-dataset> \
    --dataset_name <key-to-store-dataset>
```
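As a hypothetical end-to-end example, the two steps can be chained as follows. The dataset key `freesound` and the assumption that downloads land under `<save_root>/<dataset_key>` are purely illustrative; check the script's supported keys and output layout before relying on them.

```bash
# Download one external dataset and register it for training.
# "freesound" is a hypothetical dataset key, and the output path below assumes
# a <save_root>/<dataset_key> layout; verify both against the actual script.
python download_external_datasets.py --save_root data/datasets/external \
    --dataset_names "freesound"
python organize_dataset.py --save_dir data/datasets/external/freesound \
    --dataset_name freesound --split train
```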