To set up your environment, please refer to the general README.
To quickly generate a caption for an input audio file, run:
```shell
python scripts/audio_to_text.py --wav_path <path-to-wav-file>

# Example inference
python scripts/audio_to_text.py --wav_path samples/ood_samples/loudwhistle-91003.wav
```
- This will automatically download the `autocap-full` model and run inference with the default parameters. You may change these parameters or provide your custom model config file and checkpoint path.
- For more accurate captioning, provide metadata using the `--title`, `--description`, and `--video_caption` arguments, as in the sketch below.
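For instance, a call that supplies metadata might look like the following sketch; the metadata strings are invented purely for illustration:
```shell
# Hypothetical example: the metadata values below are made up for illustration
python scripts/audio_to_text.py \
    --wav_path samples/ood_samples/loudwhistle-91003.wav \
    --title "Loud whistle" \
    --description "A sharp whistle recorded outdoors" \
    --video_caption "A referee blows a whistle several times"
```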
To caption all audio files within a folder:
- Prepare all target audio files in a single folder
- Optionally, provide metadata information in a `yaml` file with the following structure:
```yaml
file_name.wav:
  title: "video title"
  description: "video description"
  video_caption: "video caption"
```
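As a concrete illustration, a metadata file covering two clips might look like this; the entries below are hypothetical, and only the first filename matches the bundled sample:
```yaml
# Hypothetical example entries; replace with your own files and metadata
loudwhistle-91003.wav:
  title: "Loud whistle"
  description: "A sharp whistle recorded outdoors"
  video_caption: "A referee blows a whistle several times"
street_ambience.wav:
  title: "Street ambience"
  description: "Traffic noise and footsteps on a busy street"
  video_caption: "Cars pass by while people walk on the sidewalk"
```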
Then run the following script:
```shell
python scripts/inference_folder.py --folder_path <path-to-audio-folder> --meta_data_file <path-to-metadata-yaml-file>

# Example inference
python scripts/inference_folder.py --folder_path samples/ood_samples --meta_data_file samples/ood_samples/meta_data.yaml
```
If you want to caption a large dataset, we provide a script that supports multi-GPU inference for faster processing.
- Prepare your custom dataset by following the instructions in the dataset preparation README and run:
```shell
python scripts/caption_dataset.py \
    --caption_store_key <key-to-store-generated-captions> \
    --beam_size 2 \
    --start_idx 0 \
    --end_idx 1000000 \
    --dataset_keys "dataset_1" "dataset_2" ...

# Example
python scripts/caption_dataset.py \
    --caption_store_key autocap_caption \
    --beam_size 2 \
    --start_idx 0 \
    --end_idx 100 \
    --dataset_keys "wavcaps_soundbible"
```
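To distribute a large captioning job, the `--start_idx` and `--end_idx` arguments (described below) can be used to process disjoint index ranges in independent runs; a hypothetical two-way split could look like:
```shell
# Hypothetical sketch: caption the first half of the dataset in one run ...
python scripts/caption_dataset.py \
    --caption_store_key autocap_caption \
    --beam_size 2 \
    --start_idx 0 \
    --end_idx 50000 \
    --dataset_keys "wavcaps_soundbible"

# ... and the second half in a separate run (e.g., on another machine)
python scripts/caption_dataset.py \
    --caption_store_key autocap_caption \
    --beam_size 2 \
    --start_idx 50000 \
    --end_idx 100000 \
    --dataset_keys "wavcaps_soundbible"
```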
- Provide your dataset keys as registered in the dataset preparation process
- Captions will be generated and stored in each file's json file under the specified `caption_store_key`
- The `start_idx` and `end_idx` arguments can be used to resume or distribute captioning experiments (as in the split sketched above)
- Add your `caption_store_key` under `keys_synonyms: gt_audio_caption` in the target yaml config file so that it is selected whenever a ground-truth caption is not available in your audio captioning or audio generation experiments, as sketched below.
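A minimal sketch of that config entry, assuming `keys_synonyms` maps `gt_audio_caption` to a list of acceptable keys (check your target yaml file for the exact structure):
```yaml
# Hypothetical sketch; verify the exact nesting in your target config file
keys_synonyms:
  gt_audio_caption:
    - gt_audio_caption
    - autocap_caption   # the caption_store_key used above
```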
Please refer to the dataset preparation README for instructions on downloading our dataset or preparing your own dataset.
To pretrain the model:
- Specify your model parameters in a config yaml file. A sample yaml file is given under `settings/pretraining.yaml`
- Specify your project name and provide your wandb key in the config file. A wandb key can be obtained from https://wandb.ai/authorize
- Optionally, provide your S3 bucket and folder to save intermediate checkpoints.
- By default, checkpoints will be saved under `run_logs/train`
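A trimmed config sketch covering the fields mentioned above might look as follows; the key names are illustrative assumptions, so consult `settings/pretraining.yaml` for the actual schema:
```yaml
# Hypothetical sketch; key names are illustrative, not the actual schema
project_name: "autocap_pretraining"
wandb_key: "<your-wandb-key>"   # obtained from https://wandb.ai/authorize

# optional: upload intermediate checkpoints to S3
s3_bucket: "<your-s3-bucket>"
s3_folder: "autocap/checkpoints"
```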
Then start pretraining with:
```shell
python train.py -c settings/pretraining.yaml
```
To finetune the model:
- Prepare your finetuning config file in a similar way as for the pretraining stage. Typically, you only need to provide `pretrain_path` pointing to your pretraining checkpoint, adjust the learning rate, and untoggle the freeze option for the `text_decoder`, as sketched below.
- A sample finetuning config is provided under `settings/finetuning.yaml`
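A minimal sketch of those overrides, again with hypothetical key names (see `settings/finetuning.yaml` for the real ones):
```yaml
# Hypothetical sketch; consult settings/finetuning.yaml for the actual key names
pretrain_path: "<path-to-pretraining-checkpoint>"
learning_rate: 1.0e-5          # typically lower than the pretraining learning rate
freeze_text_decoder: false     # unfreeze the text decoder for finetuning
```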
Then run:
```shell
python train.py -c settings/finetuning.yaml
```
- By default, the model will periodically log metrics on the validation set to wandb during training, as specified in the config file.
- We exclude the `spice`, `spider`, and `meteor` metrics during training as they tend to stall multi-GPU training. You may include them by changing the configuration.
- A file with the captions predicted during evaluation will be saved under `run_logs/train`, and metrics can be found in a file named `output.txt` under the logging folder.
- To run the evaluation on the test set after training finishes, run:
```shell
python evaluate.py -c <path-to-config> -ckpt <path-to-checkpoint>
```
If you find our work useful, please consider citing it:
```bibtex
@misc{hajiali2024tamingdatatransformersaudio,
      title={Taming Data and Transformers for Audio Generation},
      author={Moayed Haji-Ali and Willi Menapace and Aliaksandr Siarohin and Guha Balakrishnan and Sergey Tulyakov and Vicente Ordonez},
      year={2024},
      eprint={2406.19388},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2406.19388},
}
```
We sincerely thank the authors of the following work for sharing their code publicly: