diff --git a/AutoCap/README.md b/AutoCap/README.md
index 6cdddc4..a08d729 100644
--- a/AutoCap/README.md
+++ b/AutoCap/README.md
@@ -1,106 +1,119 @@
-[![arXiv](ARXIV ICON)](ARXIV LINK)
+
-# AutoCap inference, training and evaluation
+# GenAU inference, training, and evaluation
+- [Introduction](#introduction)
+- [Environment setup](#environment-initialization)
 - [Inference](#inference)
-  * [Audio to text script](#audio-to-text)
-  * [Gradio demo](#gradio-demo)
-  * [Caption a list of audio files](#caption-list-of-audio-files)
-  * [Caption your custom dataset](#caption-a-dataset)
+  * [Text to audio script](#text-to-audio)
+  * [Inference a list of prompts](#inference-a-list-of-prompts)
 - [Training](#training)
+  * [GenAU](#genau)
+  * [Finetuning GenAU](#finetuning-genau)
+  * [1D-VAE (optional)](#1d-vae-optional)
 - [Evaluation](#evaluation)
 - [Cite this work](#cite-this-work)
 - [Acknowledgements](#acknowledgements)
-# Environment initalization
+# Introduction
+We introduce GenAU, a transformer-based audio latent diffusion model leveraging the FIT architecture. Our model compresses mel-spectrogram data into a 1D representation and utilizes layered attention processes to achieve state-of-the-art audio generation results among open-source models.
+
+# Environment initialization
 For initializing your environment, please refer to the [general README](../README.md).

 # Inference
-## Audio to Text
-To quickly generate a caption for an input audio, run
+## Text to Audio
+To quickly generate audio from an input text prompt, run
 ```shell
-python scripts/audio_to_text.py --wav_path
-
-# Example inference
-python scripts/audio_to_text.py --wav_path samples/ood_samples/loudwhistle-91003.wav
+python scripts/text_to_audio.py --prompt "Horses growl and clop hooves." --model "genau-full-l"
 ```
-- This will automatically download `TODO` model and run the inference with the default parameters. You may change these parameters or provide your cutome model config file and checkpoint path.
-- For more accurate captioning, provide meta data using `--title`, `description`, and `--video_caption` arguments.
+- This will automatically download and use the model `genau-full-l` with default settings. You may change these parameters or provide your custom model config file and checkpoint path.
+- Available models include `genau-full-l` (1.25B parameters) and `genau-full-s` (493M parameters).
+- These models are trained to generate ambient sounds and are incapable of generating speech or music.
+- Outputs will be saved by default at `samples/model_output`, using the provided prompt as the file name.
-## Gradio Demo
-A local Gradio demo is also available by running
+
-## Caption list of audio files
-- Prepare all target audio files in a single folder
-- Optionally, provide meta data information in `yaml` file using the following structure
 ```yaml
-file_name.wav:
-  title: "video title"
-  description: "video description"
-  video_caption: "video caption"
 ```
+## Inference a list of prompts
+Optionally, you may prepare a `.txt` file with your target prompts and run
-Then run the following script
 ```shell
-python scripts/inference_folder.py --folder_path --meta_data_file
+python scripts/inference_file.py --list_inference --model
-# Example inference
-python scripts/inference_folder.py --folder_path samples/ood_samples --meta_data_file samples/ood_samples/meta_data.yaml
+# Example
+python scripts/inference_file.py --list_inference samples/prompts_list.txt --model "genau-full-l"
 ```
-## Caption your custom dataset
-If you want to caption a large dataset, we provide a script that works with multigpus for faster inference.
-- Prepare your custom dataset by following the instruction in the dataset prepeartion README (TODO) and run
+## Training
-```shell
-python scripts/caption_dataset.py \
-    --caption_store_key \
-    --beam_size 2 \
-    --start_idx 0 \
-    --end_idx 1000000 \
-    --dataset_keys "dataset_1" "dataset_2" ...
+### Dataset
+Please refer to the [dataset preparation README](../dataset_preperation/README.md) for instructions on downloading our dataset or preparing your own.
+
+### GenAU
+- Prepare a yaml config file for your experiments. A sample config file is provided at `settings/simple_runs/genau.yaml`.
+- Specify your project name and provide your Wandb key in the config file (see the illustrative sketch after this list). A Wandb key can be obtained from [https://wandb.ai/authorize](https://wandb.ai/authorize).
+- Optionally, provide your S3 bucket and folder to save intermediate checkpoints.
+- By default, checkpoints will be saved under `run_logs/genau/train` at the same level as the config file.
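+The exact schema comes from the sample file `settings/simple_runs/genau.yaml`; the snippet below is only an illustrative sketch of the fields mentioned in the list above, and these key names are assumptions rather than the project's actual configuration keys:
+```yaml
+# Illustrative sketch only; key names are assumptions, copy the real ones from settings/simple_runs/genau.yaml
+project_name: "genau-experiments"   # experiment name reported to Wandb
+wandb_key: "<your-wandb-key>"       # obtained from https://wandb.ai/authorize
+s3_bucket: "my-checkpoint-bucket"   # optional: S3 bucket for intermediate checkpoints
+s3_folder: "genau/checkpoints"      # optional: folder inside the bucket
+```
+With the config in place, launch training as shown below.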
+```shell
+# Training GenAU from scratch
+python train/genau.py -c settings/simple_runs/genau.yaml
 ```
-- Provide your dataset keys as registered in the dataset preperation (TODO)
-- Captions will be generated and stores in each file json file with the specified caption_ store_key
-- `start_idx` and `end_idx` arugments can be used to resume or distribute captioning experiments
-- Add your `caption_store_key` under `keys_synonyms:gt_audio_caption` in the target yaml config file for it to be selected when the ground truth caption is not available in your audio captioning or audio generation experiments.
+For multinode training, run
+```shell
+python -m torch.distributed.run --nproc_per_node=8 train/genau.py -c settings/simple_runs/genau.yaml
+```
+### Finetuning GenAU
-# Training
-### Dataset
-Please refer to the dataset README (TODO) for instructions on downloading our dataset or preparing your own dataset.
+- Prepare your custom dataset and obtain the dataset keys following the [dataset preparation README](../dataset_preperation/README.md).
+- Make a copy of the default config file of `genau-full-l`, which you can find under `pretrained_models/genau/genau-full-l.yaml`, and adjust it.
+- Add ids for your dataset keys under the `dataset2id` attribute in the config file.
-### Stage 1 (pretraining)
-- Specify your model parameters in a config yaml file. A sample yaml file is given under `settings/pretraining.yaml`
-- Specify your project name and provide your wandb key in the config file. A wandb key can be obtained from [https://wandb.ai/authorize](https://wandb.ai/authorize)
-- Optionally, provide your S3 bucket and folder to save intermediate checkpoints.
-- By default, checkpoints will be save under `run_logs/train`
 ```shell
-python train.py -c settings/pretraining.yaml
+# Finetuning GenAU
+python train/genau.py --reload_from_ckpt 'genau-full-l' \
+                      --config \
+                      --dataset_keys "" "" ...
 ```
-### Stage 2 (finetuning)
-- Prepare your finetuning config file in a similar way as the pretraining stage. Typically, you only need to provide `pretrain_path` to your pretraining checkpoint, adjust learning rate, and untoggle the freeze option for the `text_decoder`.
-- A sample fintuning config is provided under `settings/finetuning.yaml`
+
+### 1D VAE (Optional)
+By default, we offer a pre-trained 1D-VAE for GenAU training. If you prefer, you can train your own VAE by following the provided instructions.
+- Prepare your own dataset following the instructions in the [dataset preparation README](../dataset_preperation/README.md).
+- Prepare your yaml config file in a similar way to the GenAU config file.
+- A sample config file is provided at `settings/simple_runs/1d_vae.yaml`.
 ```shell
-python train.py -c settings/finetuning.yaml
+python train/1d_vae.py -c settings/simple_runs/1d_vae.yaml
 ```
+## Evaluation
+- We follow [audioldm](https://github.com/haoheliu/AudioLDM-training-finetuning) to perform our evaluations.
+- By default, the models will be evaluated periodically during training as specified in the config file. For each evaluation, a folder with the generated audio will be saved under `run_logs/train` at the same level as the specified config file.
+- The code identifies the test dataset in an already existing folder according to the number of samples. If you would like to test on a new test dataset, register it in `scripts/generate_and_eval`.
-# Evalution
-- By default, the models will be log metrics on the validation set to wandb periodically during training as specified in the config file.
-- We exclude the `spice`, `spideer` and `meteor` metrics during training as they tend to hang out the training during multigpu training. You man inlcude them by changing the configruation.
-- A file with the predicted captions during evaluation will be saved under `run_logs/train` and metrics can be found in a file named `output.txt` under the logging folder.
-- To run the evaluation on the test set, after the training finishes, run:
 ```shell
-python evaluate.py -c -ckpt
+
+# Evaluate an existing generated folder
+python scripts/evaluate.py --log_path
+
+# Generate test audio from a pre-trained checkpoint and run evaluation
+python scripts/generate_and_eval.py -c -ckpt
 ```
+The evaluation result will be saved in a JSON file at the same level as the generated audio folder.

 # Cite this work
 If you found this useful, please consider citing our work
@@ -109,6 +122,5 @@ If you found this useful, please consider citing our work
 ```
 # Acknowledgements
-We sincerely thank the authors of the following work for sharing their code publicly:
-- [WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research](https://github.com/XinhaoMei/WavCaps)
-- [Audio Captioning Transformer](https://github.com/XinhaoMei/ACT/tree/main/coco_caption)
+Our audio generation and evaluation codebase relies on [audioldm](https://github.com/haoheliu/AudioLDM-training-finetuning). We sincerely appreciate the authors for sharing their code openly.
+
diff --git a/GenAU/README.md b/GenAU/README.md
index 7408c6d..eedc009 100644
--- a/GenAU/README.md
+++ b/GenAU/README.md
@@ -1,9 +1,10 @@
-[![arXiv](ARXIV ICON)](ARXIV LINK)
+
 # GenAU inference, training and evaluation
+- [Introduction](#introduction)
+- [Environment setup](#environment-initalization)
 - [Inference](#inference)
-  * [Audio to text script](#text-to-audio)
-  * [Gradio demo](#gradio-demo)
+  * [Text to audio script](#text-to-audio)
   * [Inference a list of promots](#inference-a-list-of-prompts)
 - [Training](#training)
   * [GenAU](#genau)
@@ -13,6 +14,16 @@
 - [Cite this work](#cite-this-work)
 - [Acknowledgements](#acknowledgements)
+# Introduction
+We introduce GenAU, a transformer-based audio latent diffusion model leveraging the FIT architecture. Our model compresses mel-spectrogram data into a 1D representation and utilizes layered attention processes to achieve state-of-the-art audio generation results among open-source models.
+
+ # Environment initalization For initializing your environment, please refer to the [general README](../README.md). @@ -28,11 +39,11 @@ python scripts/text_to_audio.py --prompt "Horses growl and clop hooves." --model - These models are trained to generate ambient sounds and is incapable of generating speech or music. - Outputs will be saved by default at `samples/model_output` using the provided prompt as the file name. -## Gradio Demo + ## Inference a list of prompts Optionally, you may prepare a `.txt` file with your target prompts and run diff --git a/README.md b/README.md index 52da238..01b86bb 100644 --- a/README.md +++ b/README.md @@ -1 +1,61 @@ -# README \ No newline at end of file +

+# Taming Data and Transformers for Audio Generation

+
+This is the official GitHub repository of the paper Taming Data and Transformers for Audio Generation.
+
+**[Taming Data and Transformers for Audio Generation](https://snap-research.github.io/GenAU)**
+
+[Moayed Haji-Ali](https://tsaishien-chen.github.io/),
+[Willi Menapace](https://www.willimenapace.com/),
+[Aliaksandr Siarohin](https://aliaksandrsiarohin.github.io/aliaksandr-siarohin-website/),
+[Guha Balakrishnan](https://www.guhabalakrishnan.com),
+[Sergey Tulyakov](http://www.stulyakov.com/),
+[Vicente Ordonez](https://vislang.ai/)
+
+*arXiv 2024*
+
+[![Project Page](https://img.shields.io/badge/Project-Page-green.svg)](https://snap-research.github.io/GenAU) [![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/taming-data-and-transformers-for-audio/audio-captioning-on-audiocaps)](https://paperswithcode.com/sota/audio-captioning-on-audiocaps?p=taming-data-and-transformers-for-audio)
+
+
+# Introduction
+
+Generating ambient sounds and effects is a challenging problem due to data scarcity and often insufficient caption quality, making it difficult to employ large-scale generative models for the task. In this work, we tackle the problem by introducing two new models. First, we propose AutoCap, a high-quality and efficient automatic audio captioning model. We show that by leveraging metadata available with the audio modality, we can substantially improve the quality of captions. AutoCap reaches a CIDEr score of 83.2, marking a 3.2% improvement from the best available captioning model at four times faster inference speed. We then use AutoCap to caption clips from existing datasets, obtaining 761,000 audio clips with high-quality captions, forming the largest available audio-text dataset. Second, we propose GenAU, a scalable transformer-based audio generation architecture that we scale up to 1.25B parameters and train with our new dataset. When compared to state-of-the-art audio generators, GenAU obtains significant improvements of 15.7% in FAD score, 22.7% in IS, and 13.5% in CLAP score, indicating significantly improved quality of generated audio compared to previous works. This shows that the quality of data is often as important as its quantity. Moreover, since AutoCap is fully automatic, new audio samples can be added to the training dataset, unlocking the training of even larger generative models for audio synthesis. For more details, please visit our project webpage.
+
+# Updates
+- **2024.06.28**: Paper and code released!
+
+# TODOs
+- [ ] Add GenAU Gradio demo
+- [ ] Add AutoCap Gradio demo
+
+# Setup
+Initialize a [conda](https://docs.conda.io/en/latest) environment named genau by running:
+```
+conda env create -f environment.yaml
+conda activate genau
+```
+# Dataset Preparation
+See [Dataset Preparation](./dataset_preperation/README.md) for details on downloading and preparing the AutoCap dataset, as well as more information on organizing your custom dataset.
+
+# Audio Captioning (AutoCap)
+See the [AutoCap](./AutoCap/README.md) README for details on inference, training, and evaluating our audio captioner AutoCap.
+
+# Audio Generation (GenAU)
+See the [GenAU](./GenAU/README.md) README for details on inference, training, finetuning, and evaluating our audio generator GenAU.
+
+
+## Citation
+If you find this paper useful in your research, please consider citing our work:
+```
+TODO
+```
diff --git a/assets/autocap.png b/assets/autocap.png
new file mode 100644
index 0000000..ce03c46
Binary files /dev/null and b/assets/autocap.png differ
diff --git a/assets/framework.jpg b/assets/framework.jpg
new file mode 100644
index 0000000..c1115bc
Binary files /dev/null and b/assets/framework.jpg differ
diff --git a/assets/genau.png b/assets/genau.png
new file mode 100644
index 0000000..3a025b9
Binary files /dev/null and b/assets/genau.png differ
diff --git a/assets/logo.png b/assets/logo.png
new file mode 100644
index 0000000..87833f6
Binary files /dev/null and b/assets/logo.png differ
diff --git a/docs/index.html b/docs/index.html
index 3f01de9..735efb2 100644
--- a/docs/index.html
+++ b/docs/index.html
@@ -51,7 +51,7 @@

 Taming Data and Transformers for Audio Generation< Willi Menapace,&nbsp; Aliaksandr
+ href=https://aliaksandrsiarohin.github.io/aliaksandr-siarohin-website/>Aliaksandr
 Siarohin,&nbsp; Guha Balakrishnan,