GitHub - Audio-WestlakeU/RealMAN: A Real-Recorded and Annotated Microphone Array Dataset for Dynamic Speech Enhancement and Localization

Description

The Real-recorded and annotated Microphone Array speech&Noise (RealMAN) dataset provides annotated multi-channel speech and noise recordings for dynamic speech enhancement and localization:

A 32-channel array with high-fidelity microphones is used for recording
A loudspeaker is used for playing source speech signals
A total of 83 hours of speech signals (48 hours for a static speaker and 35 hours for a moving speaker) are recorded in 32 different scenes, and 144 hours of background noise are recorded in 31 different scenes
Both speech and noise recording scenes cover various common indoor, outdoor, semi-outdoor and transportation environments
The azimuth angle of the loudspeaker is annotated with an omnidirectional fisheye camera and is used for training source localization networks
The direct-path signal is obtained by filtering the played speech signal with an estimated direct-path propagation filter, and is used for the training of speech enhancement networks.
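
As a rough illustration of the last point, the sketch below filters a played signal with a direct-path filter using a plain causal FIR loop. The filter taps here are hypothetical (a pure 3-sample delay with gain 0.5); how the dataset authors actually estimate the direct-path propagation filter is described in the paper, not here. In practice a library routine such as `numpy.convolve` would normally replace the explicit loop.

```python
def fir_filter(signal, taps):
    """Causal FIR filtering: y[n] = sum_k taps[k] * signal[n - k].

    The output has the same length as the input signal.
    """
    out = []
    for n in range(len(signal)):
        acc = 0.0
        for k, tap in enumerate(taps):
            if n - k >= 0:
                acc += tap * signal[n - k]
        out.append(acc)
    return out


# Hypothetical direct-path filter: 3-sample propagation delay, gain 0.5.
dp_filter = [0.0, 0.0, 0.0, 0.5]

# Toy "played speech" signal standing in for a real utterance.
played = [1.0, 2.0, 3.0, 4.0, 0.0, 0.0]

direct_path = fir_filter(played, dp_filter)
print(direct_path)
```

The result is the played signal delayed and attenuated, which is the kind of clean target the direct-path signal provides for training speech enhancement networks.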

The RealMAN dataset is valuable in two aspects:

Benchmark speech enhancement and localization algorithms in real scenarios
Offer a substantial amount of real-world training data for potentially improving the performance of real-world applications

The details of the RealMAN dataset are described in the following paper: [arXiv]

Download

To download the entire dataset, visit the Original data page or the AISHELL page. The dataset comprises the following components:

File	Size	Description
`train.rar`	521.76 GB	The training set, consisting of 36.6 hours of static speaker speech and 26.6 hours of moving speaker speech (`ma_speech`), 106.3 hours of noise recordings (`ma_noise`), direct-path speech for channel 0 (`dp_speech`), and sound source locations (`train_*_source_location.csv`).
`val_raw.rar`	65.57 GB	The raw validation set, consisting of 4.5 hours of static speaker speech and 3.3 hours of moving speaker speech (`ma_speech`), 16.0 hours of noise recordings (`ma_noise`), direct-path speech for channel 0 (`dp_speech`), and sound source locations (`val_*_source_location.csv`).
`val.rar`	25.57 GB	The validation set, consisting of mixed noisy speech recordings (`ma_noisy_speech`), direct-path speech for channel 0 (`dp_speech`), and sound source locations (`val_*_source_location.csv`).
`test_raw.rar`	91.75 GB	The raw test set, consisting of 6.9 hours of static speaker speech and 4.8 hours of moving speaker speech (`ma_speech`), 22.2 hours of noise recordings (`ma_noise`), direct-path speech for channel 0 (`dp_speech`), and sound source locations (`test_*_source_location.csv`).
`test.rar`	38.02 GB	The test set, consisting of mixed noisy speech recordings (`ma_noisy_speech`), direct-path speech for channel 0 (`dp_speech`), and sound source locations (`test_*_source_location.csv`).
`dataset_info.rar`	127.9 MB	The dataset information file, including scene photos, scene information (T60, recording duration, etc.), and speaker information
`transcriptions.trn`	2.4 MB	The transcription file of speech for the dataset
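
As a quick sanity check, the per-split hours in the table can be summed and compared with the totals quoted in the Description. The numbers below are copied from the table; the dictionary layout and variable names are only illustrative.

```python
# Hours per raw split, copied from the table:
# (static speech, moving speech, noise)
hours = {
    "train": (36.6, 26.6, 106.3),
    "val_raw": (4.5, 3.3, 16.0),
    "test_raw": (6.9, 4.8, 22.2),
}

static_total = round(sum(h[0] for h in hours.values()), 1)
moving_total = round(sum(h[1] for h in hours.values()), 1)
noise_total = round(sum(h[2] for h in hours.values()), 1)
speech_total = round(static_total + moving_total, 1)

print(static_total, moving_total, speech_total, noise_total)
# prints: 48.0 34.7 82.7 144.5
```

These sums agree with the 48, 35, 83 and 144 hours stated in the Description, to within rounding of the per-split values.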

The dataset is organized into the following directory structure:

RealMAN
├── transcriptions.trn
├── dataset_info
│   ├── scene_images
│   ├── scene_info.json
│   └── speaker_info.csv
└── train|val|test|val_raw|test_raw
    ├── train_moving_source_location.csv
    ├── train_static_source_location.csv
    ├── dp_speech
    │   ├── BadmintonCourt2
    │   │   ├── moving
    │   │   │   ├── 0010
    │   │   │   │   ├── TRAIN_M_BAD2_0010_0003.flac
    │   │   │   │   └── ...
    │   │   │   └── ...
    │   │   └── static
    │   └── ...
    ├── ma_speech|ma_noisy_speech
    │   ├── BadmintonCourt2
    │   │   ├── moving
    │   │   │   ├── 0010
    │   │   │   │   ├── TRAIN_M_BAD2_0010_0003_CH0.flac
    │   │   │   │   └── ...
    │   │   │   └── ...
    │   │   └── static
    │   └── ...
    └── ma_noise
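
Because `dp_speech` and `ma_speech` mirror each other in the tree above, a recorded channel file can be mapped to its direct-path counterpart by swapping the folder name and dropping the channel suffix. The following is a minimal sketch under that assumption; the helper name `direct_path_for` is hypothetical, and the `_CH<k>` suffix pattern is inferred from the single example file name shown in the tree.

```python
import re
from pathlib import PurePosixPath


def direct_path_for(recorded: PurePosixPath) -> PurePosixPath:
    """Map .../ma_speech/<scene>/<mode>/<speaker>/<stem>_CH<k>.flac
    to .../dp_speech/<scene>/<mode>/<speaker>/<stem>.flac."""
    parts = list(recorded.parts)
    # Swap the recorded-signal folder for the direct-path folder.
    for i, part in enumerate(parts):
        if part in ("ma_speech", "ma_noisy_speech"):
            parts[i] = "dp_speech"
            break
    else:
        raise ValueError(f"not a recorded-speech path: {recorded}")
    # Drop the trailing channel suffix (assumed to look like _CH0, _CH1, ...).
    stem = re.sub(r"_CH\d+$", "", recorded.stem)
    parts[-1] = stem + recorded.suffix
    return PurePosixPath(*parts)


recorded = PurePosixPath(
    "RealMAN/train/ma_speech/BadmintonCourt2/moving/0010/"
    "TRAIN_M_BAD2_0010_0003_CH0.flac"
)
print(direct_path_for(recorded))
```

Using `PurePosixPath` keeps the sketch independent of the local filesystem; with the dataset extracted, `pathlib.Path` would work the same way.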

The naming convention is as follows:

# Recorded Signal
[TRAIN|VAL|TEST]_[M|S]_scene_speakerId_utteranceId_channelId.flac

# Direct-Path Signal
[TRAIN|VAL|TEST]_[M|S]_scene_speakerId_utteranceId.flac

# Source Location
[train|val|test]_[moving|static]_source_location.csv
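
The audio file naming convention can be parsed with a regular expression. The sketch below is one possible parser, not part of the dataset tooling: the `RealManName` type and `parse_name` function are hypothetical, and the pattern assumes that the scene code and IDs contain no underscores and that the channel ID looks like `CH0`, as in the example file names in the directory tree.

```python
import re
from typing import NamedTuple, Optional


class RealManName(NamedTuple):
    split: str                 # TRAIN, VAL or TEST
    motion: str                # M (moving) or S (static)
    scene: str                 # scene code, e.g. BAD2
    speaker_id: str
    utterance_id: str
    channel_id: Optional[str]  # None for direct-path files


# Assumed pattern: fields separated by underscores, optional _CH<k> suffix.
_NAME_RE = re.compile(
    r"^(TRAIN|VAL|TEST)_([MS])_([^_.]+)_([^_.]+)_([^_.]+)(?:_(CH\d+))?\.flac$"
)


def parse_name(filename: str) -> RealManName:
    """Split a recorded or direct-path file name into its fields."""
    match = _NAME_RE.match(filename)
    if match is None:
        raise ValueError(f"unexpected file name: {filename}")
    return RealManName(*match.groups())


print(parse_name("TRAIN_M_BAD2_0010_0003_CH0.flac"))
print(parse_name("TRAIN_M_BAD2_0010_0003.flac"))
```

A recorded signal yields a `channel_id` such as `CH0`, while a direct-path signal yields `None`, which makes it easy to tell the two file types apart when scanning a split.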

Baseline

License

The dataset is licensed under the Creative Commons Attribution 4.0 International (CC-BY-4.0) license.
