We introduce an efficient pipeline for collecting ambient audio. It starts by analyzing automatic transcriptions of online videos to identify non-speech segments. Our captioning model, AutoCap, then generates captions, and segments whose captions contain music- or speech-related keywords are filtered out. Relying on time-aligned transcriptions reduces the filtering rate and streamlines the process, since the audio files never need to be downloaded or processed during selection.
To initialize your environment, please refer to the general README.
- We currently provide the following datasets:
  - AutoReCapXL: more than 47M audio-text pairs, filtered to have a LAION CLAP similarity above 0.1.
  - AutoReCapXL-MQ: more than 20.7M audio-text pairs, filtered to have a LAION CLAP similarity above 0.4.
  - AutoReCapXL-MQ-L: more than 14.7M audio-text pairs, filtered to have a LAION CLAP similarity above 0.4 and audio clips longer than 5 seconds.
  - AutoReCapXL-HQ: more than 10.7M audio-text pairs, filtered to have a LAION CLAP similarity above 0.5.
The AutoReCap datasets are derived from YouTube videos and consist mainly of ambient audio clips, with a small number of speech and music clips. Please refer to the paper for more details on these datasets. They can be filtered by CLAP similarity threshold and minimum audio clip length, as described below.
```bash
python download.py --save_dir <path-to-save-dir> --dataset_name <dataset-subset>

# Example
python download.py --save_dir data/datasets/autocap --dataset_name AutoReCapXL-HQ --audio_only

# Example of filtering according to CLAP similarity and audio clip length
python download.py --save_dir data/datasets/autocap --dataset_name AutoReCapXL --clap_threshold 0.4 --min_audio_len 5 --audio_only

# Example of downloading only a subset of the dataset
python download.py --save_dir data/datasets/autocap --dataset_name AutoReCapXL-HQ --start_idx 0 --end_idx 100000 --audio_only
```
By default, the script will download videos along with their metadata.
We provide the following helpful arguments (a combined example is shown after the list):
- `--sampling_rate`: Specifies the sampling rate at which the audio files are stored.
- `--audio_only`: Download only the audio files and discard the videos. This is helpful to save storage space.
- `--files_per_folder`: Downloaded files are organized into many folders. This argument specifies how many files to store per folder.
- `--start_idx`, `--end_idx`: Download only a subset of the dataset.
- `--proxy`: For large downloads, YouTube might block your address. You may SSH to another machine at a specific port and provide it using this argument.
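As a sketch of how these flags can be combined, the command below performs an audio-only download with a fixed sampling rate, folder size, index range, and proxy. The specific values and the proxy address are illustrative placeholders, not recommended settings.

```bash
# Audio-only download at a 16 kHz sampling rate, 5000 files per folder,
# restricted to the first 500k entries and routed through a proxy.
# All values and the proxy address are illustrative placeholders.
python download.py --save_dir data/datasets/autocap --dataset_name AutoReCapXL-MQ \
    --audio_only --sampling_rate 16000 --files_per_folder 5000 \
    --start_idx 0 --end_idx 500000 --proxy <address:port>
```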
Once the dataset finishes downloading, run the following script:
```bash
python organize_dataset.py --save_dir <path-to-dataset> \
    --dataset_name <key-to-store-dataset> \
    --split <split-type> \
    --files_per_subset <number_of_files_per_subset>

# Example
python organize_dataset.py --save_dir data/datasets/autocap --dataset_name autocap --split train
```
- Important: Use different `dataset_name` values for different splits (see the example after this list).
- If `--files_per_subset` is set to more than one, the dataset keys will be named `dataset_name_subset_1`, `dataset_name_subset_2`, etc.
- The dataset details can be found at `data/metadata/dataset_root.json`.
- Add the dataset keys under the `data` attribute in your config file for the audio generation and captioning experiments.
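For instance, a validation split can be registered under its own key and sharded into multiple subsets. The split name `val`, the key `autocap_val`, and the subset count below are assumptions made for illustration; adjust them to your setup.

```bash
# Register a validation split under its own dataset key and shard it into
# 4 subsets (keys become autocap_val_subset_1 ... autocap_val_subset_4).
# The split name, dataset key, and subset count are illustrative assumptions.
python organize_dataset.py --save_dir data/datasets/autocap \
    --dataset_name autocap_val --split val --files_per_subset 4
```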
You need to arrange your audio files in one folder using the following structure:
- Folder
  - 000000
    - Id_1.wav
    - Id_1.json
    - Id_2.wav
    - Id_2.json
  - 000001
    - Id_3.wav
    - Id_3.json
  - ...
- In the JSON files, add metadata such as `title`, `description`, `video_caption`, and `gt_audio_caption` (a sketch is shown after this list).
- Organize your dataset following the instructions in Dataset Organization.
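As a minimal sketch, a per-clip metadata file such as `000000/Id_1.json` might look like the following. The field values are placeholders, and the exact set of fields your experiments require may differ.

```json
{
  "title": "Morning walk in the park",
  "description": "Raw clip recorded during a morning walk.",
  "video_caption": "A person walks along a tree-lined path in a park.",
  "gt_audio_caption": "Birds chirping with distant traffic noise."
}
```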
We provide a script for downloading the WavCaps datasets. Run the following commands to download and organize each of these datasets:
```bash
python download_external_datasets.py --save_root <path-to-save-root> \
    --dataset_names "dataset_key_1" "dataset_key_2" ...

# Organize each downloaded dataset
python organize_dataset.py --save_dir <path-to-downloaded-dataset> \
    --dataset_name <key-to-store-dataset>
```
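As a hypothetical end-to-end example, the two steps can be chained as follows. The dataset key `freesound` and the assumption that downloads land under `<save_root>/<dataset_key>` are purely illustrative; check the script's supported keys and output layout before relying on them.

```bash
# Download one external dataset and register it for training.
# "freesound" is a hypothetical dataset key, and the output path below assumes
# a <save_root>/<dataset_key> layout; verify both against the actual script.
python download_external_datasets.py --save_root data/datasets/external \
    --dataset_names "freesound"
python organize_dataset.py --save_dir data/datasets/external/freesound \
    --dataset_name freesound --split train
```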