Update: SingVisio citation, resources links, and Emilia TODOs #274

Merged
merged 4 commits into from
Sep 23, 2024
15 changes: 8 additions & 7 deletions README.md
@@ -4,7 +4,7 @@
<a href="https://arxiv.org/abs/2312.09911"><img src="https://img.shields.io/badge/arXiv-Paper-<COLOR>.svg"></a>
<a href="https://huggingface.co/amphion"><img src="https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-Amphion-pink"></a>
<a href="https://openxlab.org.cn/usercenter/Amphion"><img src="https://cdn-static.openxlab.org.cn/app-center/openxlab_app.svg"></a>
<a href="https://discord.com/invite/ZxxREr3Y"><img src="https://img.shields.io/badge/Discord-Join%20chat-blue.svg">
<a href="https://discord.com/invite/ZxxREr3Y"><img src="https://img.shields.io/badge/Discord-Join%20chat-blue.svg"></a>
<a href="egs/tts/README.md"><img src="https://img.shields.io/badge/README-TTS-blue"></a>
<a href="egs/svc/README.md"><img src="https://img.shields.io/badge/README-SVC-blue"></a>
<a href="egs/tta/README.md"><img src="https://img.shields.io/badge/README-TTA-blue"></a>
@@ -31,11 +31,12 @@ In addition to the specific generation tasks, Amphion includes several **vocoder
## 🚀 News
- **2024/09/01**: [Amphion](https://arxiv.org/abs/2312.09911) and [Emilia](https://arxiv.org/abs/2407.05361) got accepted by IEEE SLT 2024! 🤗
- **2024/08/28**: Welcome to join Amphion's [Discord channel](https://discord.com/invite/ZxxREr3Y) to stay connected and engage with our community!
- **2024/08/20**: [SingVisio](https://arxiv.org/abs/2402.12660) got accepted by Computers & Graphics, [available here](https://www.sciencedirect.com/science/article/pii/S0097849324001936)! 🎉
- **2024/08/27**: *The Emilia dataset is now publicly available!* Discover the most extensive and diverse speech generation dataset with 101k hours of in-the-wild speech data now at [![hf](https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-Dataset-yellow)](https://huggingface.co/datasets/amphion/Emilia-Dataset) or [![OpenDataLab](https://img.shields.io/badge/OpenDataLab-Dataset-blue)](https://opendatalab.com/Amphion/Emilia)! 👑👑👑
- **2024/07/01**: Amphion now releases **Emilia**, the first open-source multilingual in-the-wild dataset for speech generation with over 101k hours of speech data, and the **Emilia-Pipe**, the first open-source preprocessing pipeline designed to transform in-the-wild speech data into high-quality training data with annotations for speech generation! [![arXiv](https://img.shields.io/badge/arXiv-Paper-COLOR.svg)](https://arxiv.org/abs/2407.05361) [![hf](https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-Dataset-yellow)](https://huggingface.co/datasets/amphion/Emilia) [![demo](https://img.shields.io/badge/WebPage-Demo-red)](https://emilia-dataset.github.io/Emilia-Demo-Page/) [![readme](https://img.shields.io/badge/README-Key%20Features-blue)](preprocessors/Emilia/README.md)
- **2024/06/17**: Amphion has a new release for its **VALL-E** model! It uses Llama as its underlying architecture and has better model performance, faster training speed, and more readable code compared to our first version. [![readme](https://img.shields.io/badge/README-Key%20Features-blue)](egs/tts/VALLE_V2/README.md)
- **2024/03/12**: Amphion now supports **NaturalSpeech3 FACodec** and releases pretrained checkpoints. [![arXiv](https://img.shields.io/badge/arXiv-Paper-COLOR.svg)](https://arxiv.org/abs/2403.03100) [![hf](https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-model-yellow)](https://huggingface.co/amphion/naturalspeech3_facodec) [![hf](https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-demo-pink)](https://huggingface.co/spaces/amphion/naturalspeech3_facodec) [![readme](https://img.shields.io/badge/README-Key%20Features-blue)](models/codec/ns3_codec/README.md)
- **2024/02/22**: The first Amphion visualization tool, **SingVisio**, release. [![arXiv](https://img.shields.io/badge/arXiv-Paper-COLOR.svg)](https://arxiv.org/abs/2402.12660) [![openxlab](https://cdn-static.openxlab.org.cn/app-center/openxlab_app.svg)](https://openxlab.org.cn/apps/detail/Amphion/SingVisio) [![Video](https://img.shields.io/badge/Video-Demo-orange)](https://github.com/open-mmlab/Amphion/assets/33707885/0a6e39e8-d5f1-4288-b0f8-32da5a2d6e96) [![readme](https://img.shields.io/badge/README-Key%20Features-blue)](egs/visualization/SingVisio/README.md)
- **2024/02/22**: The first Amphion visualization tool, **SingVisio**, release. [![arXiv](https://img.shields.io/badge/arXiv-Paper-COLOR.svg)](https://arxiv.org/abs/2402.12660) [![openxlab](https://cdn-static.openxlab.org.cn/app-center/openxlab_app.svg)](https://openxlab.org.cn/apps/detail/Amphion/SingVisio) [![Video](https://img.shields.io/badge/Video-Demo-orange)](https://drive.google.com/file/d/15097SGhQh-SwUNbdWDYNyWEP--YGLba5/view) [![readme](https://img.shields.io/badge/README-Key%20Features-blue)](egs/visualization/SingVisio/README.md)
- **2023/12/18**: Amphion v0.1 release. [![arXiv](https://img.shields.io/badge/arXiv-Paper-<COLOR>.svg)](https://arxiv.org/abs/2312.09911) [![hf](https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-Amphion-pink)](https://huggingface.co/amphion) [![youtube](https://img.shields.io/badge/YouTube-Demo-red)](https://www.youtube.com/watch?v=1aw0HhcggvQ) [![readme](https://img.shields.io/badge/README-Key%20Features-blue)](https://github.com/open-mmlab/Amphion/pull/39)
- **2023/11/28**: Amphion alpha release. [![readme](https://img.shields.io/badge/README-Key%20Features-blue)](https://github.com/open-mmlab/Amphion/pull/2)

@@ -87,7 +88,7 @@ Amphion provides a comprehensive objective evaluation of the generated audio. Th

Amphion provides visualization tools to interactively illustrate the internal processing mechanism of classic models. This provides an invaluable resource for educational purposes and for facilitating understandable research.

Currently, Amphion supports [SingVisio](egs/visualization/SingVisio/README.md), a visualization tool of the diffusion model for singing voice conversion. [![arXiv](https://img.shields.io/badge/arXiv-Paper-COLOR.svg)](https://arxiv.org/abs/2402.12660) [![openxlab](https://cdn-static.openxlab.org.cn/app-center/openxlab_app.svg)](https://openxlab.org.cn/apps/detail/Amphion/SingVisio) [![Video](https://img.shields.io/badge/Video-Demo-orange)](https://github.com/open-mmlab/Amphion/assets/33707885/0a6e39e8-d5f1-4288-b0f8-32da5a2d6e96)
Currently, Amphion supports [SingVisio](egs/visualization/SingVisio/README.md), a visualization tool of the diffusion model for singing voice conversion. [![arXiv](https://img.shields.io/badge/arXiv-Paper-COLOR.svg)](https://arxiv.org/abs/2402.12660) [![openxlab](https://cdn-static.openxlab.org.cn/app-center/openxlab_app.svg)](https://openxlab.org.cn/apps/detail/Amphion/SingVisio) [![Video](https://img.shields.io/badge/Video-Demo-orange)](https://drive.google.com/file/d/15097SGhQh-SwUNbdWDYNyWEP--YGLba5/view)


## 📀 Installation
@@ -158,9 +159,9 @@ Amphion is under the [MIT License](LICENSE). It is free for both research and co

```bibtex
@inproceedings{amphion,
author={Zhang, Xueyao and Xue, Liumeng and Gu, Yicheng and Wang, Yuancheng and Li, Jiaqi and He, Haorui and Wang, Chaoren and Song, Ting and Chen, Xi and Fang, Zihao and Chen, Haopeng and Zhang, Junan and Tang, Tze Ying and Zou, Lexiao and Wang, Mingxuan and Han, Jun and Chen, Kai and Li, Haizhou and Wu, Zhizheng},
title={Amphion: An Open-Source Audio, Music and Speech Generation Toolkit},
booktitle={Proc.~of SLT},
year={2024}
author={Zhang, Xueyao and Xue, Liumeng and Gu, Yicheng and Wang, Yuancheng and Li, Jiaqi and He, Haorui and Wang, Chaoren and Song, Ting and Chen, Xi and Fang, Zihao and Chen, Haopeng and Zhang, Junan and Tang, Tze Ying and Zou, Lexiao and Wang, Mingxuan and Han, Jun and Chen, Kai and Li, Haizhou and Wu, Zhizheng},
title={Amphion: An Open-Source Audio, Music and Speech Generation Toolkit},
booktitle={{IEEE} Spoken Language Technology Workshop, {SLT} 2024},
year={2024}
}
```
2 changes: 1 addition & 1 deletion egs/visualization/README.md
@@ -2,7 +2,7 @@

## Quick Start

We provides a **[beginner recipe](SingVisio/)** to demonstrate how to implement interactive visualization for classic audio, music and speech generative models. Specifically, it is also an official implementation of the paper "[SingVisio: SingVisio: Visual Analytics of Diffusion Model for Singing Voice Conversion](https://arxiv.org/pdf/2402.12660.pdf)". The **SingVisio** can be experienced [here](https://openxlab.org.cn/apps/detail/Amphion/SingVisio).
We provide a **[beginner recipe](SingVisio/)** to demonstrate how to implement interactive visualization for classic audio, music and speech generative models. Specifically, it is also an official implementation of the paper "SingVisio: Visual Analytics of the Diffusion Model for Singing Voice Conversion", which can be accessed via [arXiv](https://arxiv.org/abs/2402.12660) or [Computers & Graphics](https://www.sciencedirect.com/science/article/pii/S0097849324001936). The **SingVisio** can be experienced [here](https://openxlab.org.cn/apps/detail/Amphion/SingVisio).

## Supported Models

36 changes: 29 additions & 7 deletions egs/visualization/SingVisio/README.md
@@ -2,17 +2,19 @@

[![arXiv](https://img.shields.io/badge/arXiv-Paper-COLOR.svg)](https://arxiv.org/abs/2402.12660)
[![openxlab](https://cdn-static.openxlab.org.cn/app-center/openxlab_app.svg)](https://openxlab.org.cn/apps/detail/Amphion/SingVisio)
[![Video](https://img.shields.io/badge/Video-Demo-orange)](https://github.com/open-mmlab/Amphion/assets/33707885/0a6e39e8-d5f1-4288-b0f8-32da5a2d6e96)
[![Video](https://img.shields.io/badge/Video-Demo-orange)](https://drive.google.com/file/d/15097SGhQh-SwUNbdWDYNyWEP--YGLba5/view)

<div align="center">
<img src="../../../imgs/visualization/SingVisio_system.png" width="85%">
<img src="../../../imgs/visualization/SingVisio_system.jpg" width="85%">
</div>

This is the official implementation of the paper "[SingVisio: Visual Analytics of the Diffusion Model for Singing Voice Conversion](https://arxiv.org/abs/2402.12660)." **SingVisio** system can be experienced [here](https://openxlab.org.cn/apps/detail/Amphion/SingVisio).
This is the official implementation of the paper "SingVisio: Visual Analytics of the Diffusion Model for Singing Voice Conversion", which can be accessed via [arXiv](https://arxiv.org/abs/2402.12660) or [Computers & Graphics](https://www.sciencedirect.com/science/article/pii/S0097849324001936).

The online **SingVisio** system can be experienced [here](https://openxlab.org.cn/apps/detail/Amphion/SingVisio).

**SingVisio** system comprises two main components: a web-based front-end user interface and a back-end generation model.

- The web-based user interface was developed using [D3.js](https://d3-graph-gallery.com/index.html), a JavaScript library designed for creating dynamic and interactive data visualizations. The code can be accessed [here](../../../visualization/SingVisio/webpage/).
- The web-based user interface was developed using [D3.js](https://d3js.org/), a JavaScript library designed for creating dynamic and interactive data visualizations. The code can be accessed [here](../../../visualization/SingVisio/webpage/).
- The core generative model, [MultipleContentsSVC](https://arxiv.org/abs/2310.11160), is a diffusion-based model tailored for singing voice conversion (SVC). The code for this model is available in Amphion, with the recipe accessible [here](../../svc/MultipleContentsSVC/).

## Development Workflow for Visualization Systems
@@ -57,12 +59,32 @@ The user inference of **SingVisio** is comprised of five views:

## Detailed System Introduction of SingVisio

For a detailed introduction to **SingVisio** and user instructions, please refer to [this online document](https://x8gvg3n7v3.feishu.cn/docx/IMhUdqIFVo0ZjaxlBf6cpjTEnvf?from=from_copylink) (with animation) or [offline document](../../../visualization/SingVisio/System_Introduction_of_SingVisio.pdf) (without animation).
For a detailed introduction to **SingVisio** and user instructions, please refer to [this document](../../../visualization/SingVisio/System_Introduction_of_SingVisio_V2.pdf).

Additionally, explore the SingVisio demo to see the system's functionalities and usage in action.

[SingVisio_Demo](https://github.com/open-mmlab/Amphion/assets/33707885/0a6e39e8-d5f1-4288-b0f8-32da5a2d6e96)

## User Study of SingVisio

Participate in the [user study](https://www.wjx.cn/vm/wkIH372.aspx#) of **SingVisio** if you're interested. We encourage you to conduct the study after experiencing the **SingVisio** system. Your valuable feedback is greatly appreciated.

## Citations 📖

Please cite the following papers if you use **SingVisio** in your research:

```bibtex
@article{singvisio,
author={Xue, Liumeng and Wang, Chaoren and Wang, Mingxuan and Zhang, Xueyao and Han, Jun and Wu, Zhizheng},
title={SingVisio: Visual Analytics of the Diffusion Model for Singing Voice Conversion},
journal={Computers \& Graphics},
year={2024}
}
```

```bibtex
@inproceedings{amphion,
author={Zhang, Xueyao and Xue, Liumeng and Gu, Yicheng and Wang, Yuancheng and Li, Jiaqi and He, Haorui and Wang, Chaoren and Song, Ting and Chen, Xi and Fang, Zihao and Chen, Haopeng and Zhang, Junan and Tang, Tze Ying and Zou, Lexiao and Wang, Mingxuan and Han, Jun and Chen, Kai and Li, Haizhou and Wu, Zhizheng},
title={Amphion: An Open-Source Audio, Music and Speech Generation Toolkit},
booktitle={{IEEE} Spoken Language Technology Workshop, {SLT} 2024},
year={2024}
}
```
Binary file removed imgs/visualization/SingVisio_demo.png
Binary file not shown.
Binary file added imgs/visualization/SingVisio_system.jpg
Binary file removed imgs/visualization/SingVisio_system.png
Binary file not shown.
17 changes: 16 additions & 1 deletion preprocessors/Emilia/README.md
@@ -186,6 +186,21 @@ The processed audio (default 24k sample rate) files will be saved into `input_fo
]
```

## TODOs 📝

Here are some potential improvements for the Emilia-Pipe pipeline:

- [x] Optimize the pipeline for better processing speed.
- [ ] Support input audio files larger than 4GB (calculated in WAVE format).
- [ ] Update source separation model to better handle noisy audio (e.g., reverberation).
- [ ] Ensure single speaker in each segment in the speaker diarization step.
- [ ] Move VAD to the first step to filter out non-speech segments (for better speed).
- [ ] Extend the maximum input length supported by ASR beyond 30s while keeping the speed.
- [ ] Fine-tune the ASR model to improve transcription accuracy on punctuation.
- [ ] Add multimodal features to the pipeline for better transcription accuracy.
- [ ] Filter segments with unclean background noise, speaker overlap, hallucinated transcriptions, etc.
- [ ] Label the data: speaker info (e.g., gender, age, native language, health), emotion, speaking style (pitch, rate, accent), acoustic features (e.g., fundamental frequency, formants), and environmental factors (background noise, microphone setup). Besides, non-verbal cues (e.g., laughter, coughing, silence, filters) and paralinguistic features could be labeled as well.
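
For a sense of scale on the 4GB item, the sketch below estimates how many hours of uncompressed PCM audio fit in that size. It is only an illustration under stated assumptions: 16-bit mono samples at the pipeline's default 24 kHz rate, 4GB read as 4 × 1024³ bytes, and header overhead ignored. The function name `max_pcm_seconds` is hypothetical, and the source does not say where the limit originates.

```python
def max_pcm_seconds(
    limit_bytes=4 * 1024**3,  # assumption: "4GB" read as 4 GiB
    sample_rate=24000,        # pipeline default mentioned in the README
    channels=1,               # assumption: mono
    bytes_per_sample=2,       # assumption: 16-bit PCM
):
    """Return how many seconds of raw PCM audio fit in limit_bytes."""
    bytes_per_second = sample_rate * channels * bytes_per_sample
    return limit_bytes / bytes_per_second


if __name__ == "__main__":
    seconds = max_pcm_seconds()
    print(f"{seconds:.1f} s, about {seconds / 3600:.2f} h")
```

Under these assumptions the limit corresponds to roughly a day of continuous audio; stereo or 32-bit samples would shrink that proportionally.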

## Acknowledgement 🔔
We acknowledge the wonderful work by these excellent developers!
- Source Separation: [UVR-MDX-NET-Inst_HQ_3](https://github.com/TRvlvr/model_repo/releases/tag/all_public_uvr_models)
@@ -209,7 +224,7 @@ If you use the Emilia dataset or the Emilia-Pipe pipeline, please cite the follo
@inproceedings{amphion,
author={Zhang, Xueyao and Xue, Liumeng and Gu, Yicheng and Wang, Yuancheng and Li, Jiaqi and He, Haorui and Wang, Chaoren and Song, Ting and Chen, Xi and Fang, Zihao and Chen, Haopeng and Zhang, Junan and Tang, Tze Ying and Zou, Lexiao and Wang, Mingxuan and Han, Jun and Chen, Kai and Li, Haizhou and Wu, Zhizheng},
title={Amphion: An Open-Source Audio, Music and Speech Generation Toolkit},
booktitle={Proc.~of SLT},
booktitle={{IEEE} Spoken Language Technology Workshop, {SLT} 2024},
year={2024}
}
```
118 changes: 118 additions & 0 deletions preprocessors/Emilia/main_multi.py
@@ -0,0 +1,118 @@
# Copyright (c) 2024 Amphion.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.

import argparse
import multiprocessing
import os
import subprocess
import time

from utils.logger import Logger
from utils.tool import get_gpu_nums


def run_script(args, gpu_id, self_id):
    """
    Run the script by passing the GPU ID and self ID to environment variables and execute the main.py script.

    Args:
        args (argparse.Namespace): Parsed command-line arguments.
        gpu_id (int): ID of the GPU.
        self_id (str): ID of the process.

    Returns:
        None
    """
    env = os.environ.copy()
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    env["SELF_ID"] = str(self_id)

    # Activate the conda environment, then forward any extra options to main.py.
    command = (
        f"source {args.conda_path} && "
        'eval "$(conda shell.bash hook)" && '
        f"conda activate {args.conda_env_name} && "
        f"python main.py {args.main_command_args}"
    )

    try:
        process = subprocess.Popen(command, shell=True, env=env, executable="/bin/bash")
        return_code = process.wait()
        # Only report success when the child process actually exited cleanly.
        if return_code == 0:
            logger.info(f"Process for GPU {gpu_id} completed successfully.")
        else:
            logger.error(f"Process for GPU {gpu_id} exited with code {return_code}.")
    except KeyboardInterrupt:
        logger.warning(f"Multi - GPU {gpu_id}: Interrupted by keyboard, exiting...")
    except Exception as e:
        logger.error(f"Error occurred for GPU {gpu_id}: {e}")


def main(args, self_id):
    """
    Start multiple script tasks using multiple processes, each process using one GPU.

    Args:
        args (argparse.Namespace): Parsed command-line arguments.
        self_id (str): Identifier for the current process.

    Returns:
        None
    """
    disabled_ids = []
    if args.disabled_gpu_ids:
        disabled_ids = [int(i) for i in args.disabled_gpu_ids.split(",")]
        logger.info(f"CUDA_DISABLE_ID is set, not using: {disabled_ids}")

    gpus_count = get_gpu_nums()

    available_gpus = [i for i in range(gpus_count) if i not in disabled_ids]
    processes = []

    # Launch one worker process per available GPU, staggered by one second.
    for gpu_id in available_gpus:
        process = multiprocessing.Process(
            target=run_script, args=(args, gpu_id, self_id)
        )
        process.start()
        logger.info(f"GPU {gpu_id}: started...")
        time.sleep(1)
        processes.append(process)

    for process in processes:
        process.join()


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--self_id", type=str, default="main_multi", help="Log ID")
    parser.add_argument(
        "--disabled_gpu_ids",
        type=str,
        default="",
        help="Comma-separated list of disabled GPU IDs, default uses all available GPUs",
    )
    parser.add_argument(
        "--conda_path",
        type=str,
        default="/opt/conda/etc/profile.d/conda.sh",
        help="Conda path",
    )
    parser.add_argument(
        "--conda_env_name",
        type=str,
        default="AudioPipeline",
        help="Conda environment name",
    )
    parser.add_argument(
        "--main_command_args",
        type=str,
        default="",
        help="Main command args, check available options by `python main.py --help`",
    )
    args = parser.parse_args()

    self_id = args.self_id
    if "SELF_ID" in os.environ:
        self_id = f"{self_id}_#{os.environ['SELF_ID']}"

    logger = Logger.get_logger(self_id)

    logger.info(f"Starting main_multi.py with self_id: {self_id}, args: {vars(args)}.")
    main(args, self_id)
    logger.info("Exiting main_multi.py...")
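
The GPU selection and per-process environment setup in `main_multi.py` can be looked at in isolation. The following is a minimal sketch, not the project's code: `select_gpus` and `build_env` are hypothetical helper names, and the GPU count is passed in as a plain integer instead of being queried through `get_gpu_nums`.

```python
import os


def select_gpus(gpu_count, disabled_gpu_ids=""):
    """Return GPU indices in [0, gpu_count) that are not in the comma-separated list."""
    # Skip empty fragments so "" and "1," do not raise ValueError.
    disabled = {int(i) for i in disabled_gpu_ids.split(",") if i.strip()}
    return [i for i in range(gpu_count) if i not in disabled]


def build_env(gpu_id, self_id, base_env=None):
    """Copy an environment and pin the child process to a single GPU."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_id)  # the child sees only this GPU
    env["SELF_ID"] = str(self_id)              # used by the child for its log ID
    return env


if __name__ == "__main__":
    gpus = select_gpus(4, "1,3")
    print(gpus)
    print(build_env(gpus[0], "main_multi", base_env={}))
```

With the launcher itself, the equivalent selection would be requested as `python main_multi.py --disabled_gpu_ids 1,3`; each worker then receives its own `CUDA_VISIBLE_DEVICES` value, so `main.py` does not need to know how many GPUs the host has.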
Binary file not shown.
Binary file not shown.
2 changes: 1 addition & 1 deletion visualization/SingVisio/webpage/README.md
@@ -1,6 +1,6 @@
## SingVisio Webpage

This is the source code for the SingVisio Webpage. This README file will introduce the project and provide an installation guide.
This is the source code for the SingVisio Webpage. This README file will introduce the project and provide an installation guide. For an introduction to SingVisio, please check this [README.md](../../../egs/visualization/SingVisio/README.md) file.

### Tech Stack
