
Commit

update readme
MoayedHajiAli committed Jun 25, 2024
1 parent 9810fce commit 2ecf7af
Showing 8 changed files with 158 additions and 75 deletions.
148 changes: 80 additions & 68 deletions AutoCap/README.md
@@ -1,106 +1,119 @@
[![arXiv](ARXIV ICON)](ARXIV LINK)
<!-- [![arXiv](ARXIV ICON)](ARXIV LINK) -->

# AutoCap inference, training and evaluation
# GenAU inference, training, and evaluation
- [Introduction](#introduction)
- [Environment setup](#environment-initialization)
- [Inference](#inference)
* [Audio to text script](#audio-to-text)
* [Gradio demo](#gradio-demo)
* [Caption a list of audio files](#caption-list-of-audio-files)
* [Caption your custom dataset](#caption-a-dataset)
  * [Text to audio script](#text-to-audio)<!-- * [Gradio demo](#gradio-demo) -->
  * [Inference a list of prompts](#inference-a-list-of-prompts)
- [Training](#training)
* [GenAU](#genau)
* [Finetuning GenAU](#finetuning-genau)
* [1D-VAE (optional)](#1d-vae-optional)
- [Evaluation](#evaluation)
- [Cite this work](#cite-this-work)
- [Acknowledgements](#acknowledgements)

# Environment initalization
# Introduction
We introduce GenAU, a transformer-based audio latent diffusion model leveraging the FIT architecture. Our model compresses mel-spectrogram data into a 1D representation and utilizes layered attention processes to achieve state-of-the-art audio generation results among open-source models.
<br/>

<div align="center">
<img src="../assets/genau.png" width="900" />
</div>

<br/>

# Environment initialization
For initializing your environment, please refer to the [general README](../README.md).

# Inference

## Audio to Text
To quickly generate a caption for an input audio, run
## Text to Audio
To quickly generate an audio based on an input text prompt, run
```shell
python scripts/audio_to_text.py --wav_path <path-to-wav-file>

# Example inference
python scripts/audio_to_text.py --wav_path samples/ood_samples/loudwhistle-91003.wav
python scripts/text_to_audio.py --prompt "Horses growl and clop hooves." --model "genau-full-l"
```
- This will automatically download the `TODO` model and run inference with the default parameters. You may change these parameters or provide your custom model config file and checkpoint path.
- For more accurate captioning, provide metadata using the `--title`, `--description`, and `--video_caption` arguments.
- This will automatically download and use the model `genau-full-l` with default settings. You may change these parameters or provide your custom model config file and checkpoint path.
- Available models include `genau-full-l` (1.25B parameters) and `genau-full-s` (493M parameters); a short comparison sketch follows this list.
- These models are trained to generate ambient sounds and are incapable of generating speech or music.
- Outputs will be saved by default at `samples/model_output` using the provided prompt as the file name.
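
As a quick way to compare the two released checkpoints, both can be run on the same prompt; the prompt below is only an example, and the flags are the documented ones:

```shell
# Generate the same prompt with both model sizes; outputs land in
# samples/model_output by default, named after the prompt.
python scripts/text_to_audio.py --prompt "Rain falls on a tin roof while thunder rumbles." --model "genau-full-l"
python scripts/text_to_audio.py --prompt "Rain falls on a tin roof while thunder rumbles." --model "genau-full-s"
```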

## Gradio Demo
A local Gradio demo is also available by running
<!-- ## Gradio Demo
Run a local interactive demo with Gradio:
```shell
python app_audio2text.py
```
python app_text2audio.py
``` -->

## Caption list of audio files
- Prepare all target audio files in a single folder
- Optionally, provide metadata in a `yaml` file using the following structure:
```yaml
file_name.wav:
title: "video title"
description: "video description"
video_caption: "video caption"
```
## Inference a list of prompts
Optionally, you may prepare a `.txt` file with your target prompts and run

Then run the following script
```shell
python scripts/inference_folder.py --folder_path <path-to-audio-folder> --meta_data_file <path-to-metadata-yaml-file>
python scripts/inference_file.py --list_inference <path-to-prompts-file> --model <model_name>

# Example inference
python scripts/inference_folder.py --folder_path samples/ood_samples --meta_data_file samples/ood_samples/meta_data.yaml
# Example
python scripts/inference_file.py --list_inference samples/prompts_list.txt --model "genau-full-l"
```
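
For example, here is a minimal end-to-end sketch, assuming the prompts file lists one prompt per line (the file name and prompts are illustrative):

```shell
# Write a small prompts file (assumed format: one prompt per line)
cat > my_prompts.txt << 'EOF'
A dog barks while birds chirp in the background.
Waves crash against rocks on a windy beach.
EOF

# Generate one audio clip per prompt with the large model
python scripts/inference_file.py --list_inference my_prompts.txt --model "genau-full-l"
```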

## Caption your custom dataset

If you want to caption a large dataset, we provide a script that works with multiple GPUs for faster inference.
- Prepare your custom dataset by following the instructions in the dataset preparation README (TODO) and run
## Training

```shell
python scripts/caption_dataset.py \
--caption_store_key <key-to-store-generated-captions> \
--beam_size 2 \
--start_idx 0 \
--end_idx 1000000 \
--dataset_keys "dataset_1" "dataset_2" ...
### Dataset
Please refer to the [dataset preparation README](../dataset_preperation/README.md) for instructions on downloading our dataset or preparing your own.

### GenAU
- Prepare a yaml config file for your experiments. A sample config file is provided at `settings/simple_runs/genau.yaml`
- Specify your project name and provide your Wandb key in the config file. A Wandb key can be obtained from [https://wandb.ai/authorize](https://wandb.ai/authorize); an environment-variable alternative is sketched after the training command below.
- Optionally, provide your S3 bucket and folder to save intermediate checkpoints.
- By default, checkpoints will be saved under `run_logs/genau/train` at the same level as the config file.

```shell
# Training GenAU from scratch
python train/genau.py -c settings/simple_runs/genau.yaml
```
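
If you prefer not to store the Wandb key in the config file, the snippet below is a sketch of an alternative; it assumes the training script initializes wandb in the standard way, which reads the `WANDB_API_KEY` environment variable:

```shell
# Assumption: wandb picks up WANDB_API_KEY when the config does not provide a key.
export WANDB_API_KEY=<your-wandb-key>
python train/genau.py -c settings/simple_runs/genau.yaml
```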
- Provide your dataset keys as registered in the dataset preparation (TODO)
- Captions will be generated and stored in each JSON file under the specified `caption_store_key`
- The `start_idx` and `end_idx` arguments can be used to resume or distribute captioning experiments, as sketched below
- Add your `caption_store_key` under `keys_synonyms:gt_audio_caption` in the target yaml config file so that it is selected when the ground-truth caption is not available in your audio captioning or audio generation experiments.
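
Here is a sketch of splitting the captioning job over index ranges using the documented `--start_idx`/`--end_idx` arguments; the chunk size, caption key, and dataset key below are placeholders:

```shell
# Caption the dataset in chunks of 100k samples; in practice each chunk would
# typically be submitted as a separate job or pinned to its own GPU.
CHUNK=100000
for START in 0 100000 200000 300000; do
  python scripts/caption_dataset.py \
      --caption_store_key my_autocap_caption \
      --beam_size 2 \
      --start_idx ${START} \
      --end_idx $((START + CHUNK)) \
      --dataset_keys "dataset_1"
done
```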

For multi-GPU or multinode training, run
```shell
python -m torch.distributed.run --nproc_per_node=8 train/genau.py -c settings/simple_runs/genau.yaml
```
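
The command above uses 8 GPUs on a single machine. To span several machines, `torch.distributed.run` accepts the usual rendezvous arguments; below is a sketch for two nodes, where the address and port are placeholders and the exact launch procedure depends on your cluster:

```shell
# Node 0 (substitute your head node's address; run the matching command on node 1)
python -m torch.distributed.run --nnodes=2 --node_rank=0 --nproc_per_node=8 \
    --master_addr=<head-node-ip> --master_port=29500 \
    train/genau.py -c settings/simple_runs/genau.yaml

# Node 1
python -m torch.distributed.run --nnodes=2 --node_rank=1 --nproc_per_node=8 \
    --master_addr=<head-node-ip> --master_port=29500 \
    train/genau.py -c settings/simple_runs/genau.yaml
```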
### Finetuning GenAU

# Training
### Dataset
Please refer to the dataset README (TODO) for instructions on downloading our dataset or preparing your own dataset.
- Prepare your custom dataset and obtain the dataset keys by following the [dataset preparation README](../dataset_preperation/README.md)
- Make a copy of the default `genau-full-l` config file, which you can find under `pretrained_models/genau/genau-full-l.yaml`, and adjust it (a placeholder instantiation is sketched after the command below).
- Add ids for your dataset keys under the `dataset2id` attribute in the config file.

### Stage 1 (pretraining)
- Specify your model parameters in a config yaml file. A sample yaml file is given under `settings/pretraining.yaml`
- Specify your project name and provide your wandb key in the config file. A wandb key can be obtained from [https://wandb.ai/authorize](https://wandb.ai/authorize)
- Optionally, provide your S3 bucket and folder to save intermediate checkpoints.
- By default, checkpoints will be saved under `run_logs/train`
```shell
python train.py -c settings/pretraining.yaml
# Finetuning GenAU
python train/genau.py --reload_from_ckpt 'genau-full-l' \
--config <path-to-config-file> \
--dataset_keys "<dataset_key_1>" "<dataset_key_2>" ...
```
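
A concrete instantiation of the command is sketched below, where the config path and dataset key are placeholders for your own adjusted copy and registered dataset:

```shell
# Placeholder config path and dataset key; substitute your adjusted copy of
# pretrained_models/genau/genau-full-l.yaml and your registered dataset keys.
python train/genau.py --reload_from_ckpt 'genau-full-l' \
    --config pretrained_models/genau/genau-full-l-finetune.yaml \
    --dataset_keys "my_dataset"
```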

### Stage 2 (finetuning)
- Prepare your finetuning config file in a similar way to the pretraining stage. Typically, you only need to set `pretrain_path` to your pretraining checkpoint, adjust the learning rate, and disable the freeze option for the `text_decoder`.
- A sample finetuning config is provided under `settings/finetuning.yaml`

### 1D VAE (Optional)
By default, we offer a pre-trained 1D-VAE for GenAU training. If you prefer, you can train your own VAE by following the provided instructions.
- Prepare your own dataset following the instructions in the [dataset preparation README](../dataset_preperation/README.md)
- Prepare your yaml config file in a similar way to the GenAU config file
- A sample config file is provided at `settings/simple_runs/1d_vae.yaml`

```shell
python train.py -c settings/finetuning.yaml
python train/1d_vae.py -c settings/simple_runs/1d_vae.yaml
```
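
If you want to use several GPUs for the VAE as well, here is a sketch under the assumption that `train/1d_vae.py` supports the same distributed launcher as `train/genau.py`:

```shell
# Assumption: the 1D-VAE trainer can be launched with torch.distributed.run;
# otherwise fall back to the single-process command above.
python -m torch.distributed.run --nproc_per_node=8 train/1d_vae.py -c settings/simple_runs/1d_vae.yaml
```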

## Evaluation
- We follow [audioldm](https://github.com/haoheliu/AudioLDM-training-finetuning) to perform our evaluations.
- By default, the models will be evaluated periodically during training as specified in the config file. For each evaluation, a folder with the generated audio will be saved under `run_logs/train` at the same level as the specified config file.
- The code identifies the test dataset in an existing folder by its number of samples. If you would like to test on a new test dataset, register it in `scripts/generate_and_eval`.

# Evaluation
- By default, the models will log metrics on the validation set to wandb periodically during training, as specified in the config file.
- We exclude the `spice`, `spider`, and `meteor` metrics during training as they tend to hang multi-GPU training. You may include them by changing the configuration.
- A file with the predicted captions during evaluation will be saved under `run_logs/train` and metrics can be found in a file named `output.txt` under the logging folder.
- To run the evaluation on the test set, after the training finishes, run:
```shell
python evaluate.py -c <path-to-config> -ckpt <path-to-checkpoint>

# Evaluate an existing generated folder
python scripts/evaluate.py --log_path <path-to-the-experiment-folder>

# Generate test audios from a pre-trained checkpoint and run evaluation
python scripts/generate_and_eval.py -c <path-to-config> -ckpt <path-to-pretrained-ckpt>
```
The evaluation result will be saved in a JSON file at the same level as the generated audio folder.
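
A sketch of the full post-training workflow, where the config and checkpoint paths are placeholders and the exact log-folder layout may differ:

```shell
# Generate test audios from a trained checkpoint and evaluate them
# (placeholder paths; point them at your own config and checkpoint).
python scripts/generate_and_eval.py -c settings/simple_runs/genau.yaml \
    -ckpt run_logs/genau/train/checkpoints/last.ckpt

# Locate the resulting metrics (output.txt and the JSON summary).
find run_logs -name "output.txt" -o -name "*.json"
```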

# Cite this work
If you found this useful, please consider citing our work
@@ -109,6 +122,5 @@ If you found this useful, please consider citing our work
```

# Acknowledgements
We sincerely thank the authors of the following works for sharing their code publicly:
- [WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research](https://github.com/XinhaoMei/WavCaps)
- [Audio Captioning Transformer](https://github.com/XinhaoMei/ACT/tree/main/coco_caption)
Our audio generation and evaluation codebase relies on [audioldm](https://github.com/haoheliu/AudioLDM-training-finetuning). We sincerely appreciate the authors for sharing their code openly.

21 changes: 16 additions & 5 deletions GenAU/README.md
@@ -1,9 +1,10 @@
[![arXiv](ARXIV ICON)](ARXIV LINK)
<!-- [![arXiv](ARXIV ICON)](ARXIV LINK) -->

# GenAU inference, training and evaluation
- [Introduction](#introduction)
- [Environment setup](#environment-initialization)
- [Inference](#inference)
* [Audio to text script](#text-to-audio)
* [Gradio demo](#gradio-demo)
  * [Text to audio script](#text-to-audio)<!-- * [Gradio demo](#gradio-demo) -->
  * [Inference a list of prompts](#inference-a-list-of-prompts)
- [Training](#training)
* [GenAU](#genau)
@@ -13,6 +14,16 @@
- [Cite this work](#cite-this-work)
- [Acknowledgements](#acknowledgements)

# Introduction
We introduce GenAU, a transformer-based audio latent diffusion model leveraging the FIT architecture. Our model compresses mel-spectrogram data into a 1D representation and utilizes layered attention processes to achieve state-of-the-art audio generation results among open-source models.
<br/>

<div align="center">
<img src="../assets/genau.png" width="900" />
</div>

<br/>

# Environment initialization
For initializing your environment, please refer to the [general README](../README.md).

@@ -28,11 +39,11 @@ python scripts/text_to_audio.py --prompt "Horses growl and clop hooves." --model
- These models are trained to generate ambient sounds and are incapable of generating speech or music.
- Outputs will be saved by default at `samples/model_output` using the provided prompt as the file name.

## Gradio Demo
<!-- ## Gradio Demo
Run a local interactive demo with Gradio:
```shell
python app_text2audio.py
```
``` -->

## Inference a list of prompts
Optionally, you may prepare a `.txt` file with your target prompts and run
62 changes: 61 additions & 1 deletion README.md
@@ -1 +1,61 @@
# README
<h1 align="center">
<img src="assets/logo.png" width="50" style="vertical-align: middle;"/>
Taming Data and Transformers for Audio Generation
</h1>

This is the official GitHub repository of the paper Taming Data and Transformers for Audio Generation.

**[Taming Data and Transformers for Audio Generation](https://snap-research.github.io/GenAU)**
</br>
[Moayed Haji-Ali](https://tsaishien-chen.github.io/),
[Willi Menapace](https://www.willimenapace.com/),
[Aliaksandr Siarohin](https://aliaksandrsiarohin.github.io/aliaksandr-siarohin-website/),
[Guha Balakrishnan](https://www.guhabalakrishnan.com),
[Sergey Tulyakov](http://www.stulyakov.com/),
[Vicente Ordonez](https://vislang.ai/)
</br>
*arXiv 2024*

[![Project Page](https://img.shields.io/badge/Project-Page-green.svg)](https://snap-research.github.io/GenAU) [![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/taming-data-and-transformers-for-audio/audio-captioning-on-audiocaps)](https://paperswithcode.com/sota/audio-captioning-on-audiocaps?p=taming-data-and-transformers-for-audio)


# Introduction

<div align="justify">
<div>
<img src="assets/framework.jpg" width="1000" />
</div>
</br>
Generating ambient sounds and effects is a challenging problem due to data scarcity and often insufficient caption quality, making it difficult to employ large-scale generative models for the task. In this work, we tackle the problem by introducing two new models. First, we propose AutoCap, a high-quality and efficient automatic audio captioning model. We show that by leveraging metadata available with the audio modality, we can substantially improve the quality of captions. AutoCap reaches a CIDEr score of 83.2, marking a 3.2% improvement from the best available captioning model at four times faster inference speed. We then use AutoCap to caption clips from existing datasets, obtaining 761,000 audio clips with high-quality captions, forming the largest available audio-text dataset. Second, we propose GenAU, a scalable transformer-based audio generation architecture that we scale up to 1.25B parameters and train with our new dataset. When compared to state-of-the-art audio generators, GenAU obtains significant improvements of 15.7% in FAD score, 22.7% in IS, and 13.5% in CLAP score, indicating significantly improved quality of generated audio compared to previous works. This shows that the quality of data is often as important as its quantity. Moreover, since AutoCap is fully automatic, new audio samples can be added to the training dataset, unlocking the training of even larger generative models for audio synthesis. For more details, please visit our <a href='https://snap-research.github.io/GenAU'>project webpage</a>.
</div>
<br>


# Updates
- **2024.06.28**: Paper and code released!

# TODOs
- [ ] Add GenAU Gradio demo
- [ ] Add AutoCap Gradio demo

# Setup
Initialize a [conda](https://docs.conda.io/en/latest) environment named genau by running:
```
conda env create -f environment.yaml
conda activate genau
```
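After creating the environment, a quick sanity check (this assumes a CUDA-capable GPU; on a CPU-only machine the last value will print `False`):
```shell
conda activate genau
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```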
# Dataset Preparation
See [Dataset Preparation](./dataset_preperation/README.md) for details on downloading and preparing the AutoCap dataset, as well as more information on organizing your custom dataset.

# Audio Captioning (AutoCap)
See the [AutoCap](./AutoCap/README.md) README for details on inference, training, and evaluating our audio captioner AutoCap.

# Audio Generation (GenAU)
See [GenAU](./GenAU/README.md) README for details on inference, training, finetuning, and evaluating our audio generator GenAU.


## Citation
If you find this paper useful in your research, please consider citing our work:
```
TODO
```
Binary file added assets/autocap.png
Binary file added assets/framework.jpg
Binary file added assets/genau.png
Binary file added assets/logo.png
2 changes: 1 addition & 1 deletion docs/index.html
@@ -51,7 +51,7 @@ <h1 class="publication-title">Taming Data and Transformers for Audio Generation<
<span class="author"><a target="_blank" href=https://www.willimenapace.com>Willi
Menapace,</a>&nbsp;</span>
<span class="author"><a target="_blank"
href=https://aliaksandrsiarohin.github.io/aliaksandr-siarohin-website />Aliaksandr
href=https://aliaksandrsiarohin.github.io/aliaksandr-siarohin-website/>Aliaksandr
Siarohin,</a>&nbsp;</span>
<span class="author"><a target="_blank" href=https://www.guhabalakrishnan.com>Guha
Balakrishnan,</a>&nbsp;</span>