-
CLIP-VAD: Exploiting Vision-Language Models for Voice Activity Detection,
arXiv, 2410.14509
, arxiv, pdf, citation: -1
Andrea Appiani, Cigdem Beyan
-
Vocal Track Separation with Encoder-Decoder Architecture
· (keras-io - keras-team)
-
ClearerVoice-Studio - modelscope
-
svoice - facebookresearch
Speaker Voice Separation using Neural Nets · (arxiv)
-
· (vocal-separate - jianchang512) · (bilibili) · (ultimatevocalremovergui - Anjok07)
- 3D-Speaker - modelscope
- DiariZen - BUTSpeechFIT
- Speaker Verification with ECAPA-TDNN embeddings on Voxceleb 🤗
- wavesurfer - pengzhendong
-
🌟 AudioSetCaps: An Enriched Audio-Caption Dataset using Automated Generation Pipeline with Large Audio and Language Models,
arXiv, 2411.18953
, arxiv, pdf, citation: -1
Jisheng Bai, Haohe Liu, Mou Wang, ..., Woon-Seng Gan, Jianfeng Chen · (AudioSetCaps - JishengBai)
-
· (huggingface)
-
· (speechbot.github)
-
SonicSim: A customizable simulation platform for speech processing in moving sound source scenarios,
arXiv, 2410.01481
, arxiv, pdf, citation: -1
Kai Li, Wendi Sang, Chang Zeng, ..., Guo Chen, Xiaolin Hu · (cslikai) · (SonicSim - JusperLee) · (mp.weixin.qq)
-
EmotionCaps: Enhancing Audio Captioning Through Emotion-Augmented Data Generation,
arXiv, 2410.12028
, arxiv, pdf, citation: -1
Mithun Manivannan, Vignesh Nethrapalli, Mark Cartwright
-
versa - shinjiwlab
-
Vision Language Models Are Few-Shot Audio Spectrogram Classifiers,
arXiv, 2411.12058
, arxiv, pdf, citation: -1
Satvik Dixit, Laurie M. Heller, Chris Donahue
-
llama-recipes - meta-llama
An Open Source version of NotebookLM