This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 United States License.
Transform-average-concatenate (TAC) for end-to-end microphone permutation and number invariant multi-channel speech separation
This repository provides the model implementation and dataset generation scripts for the paper "End-to-end Microphone Permutation and Number Invariant Multi-channel Speech Separation" by Yi Luo, Zhuo Chen, Nima Mesgarani, and Takuya Yoshioka. The paper introduces transform-average-concatenate (TAC), a simple module that allows end-to-end multi-channel separation systems to be invariant to microphone permutation (indexing) and number. Although designed for ad-hoc arrays, TAC also provides a significant performance improvement for fixed-geometry microphone configurations, showing that it can serve as a general design paradigm for end-to-end multi-channel processing systems.
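Below is a minimal PyTorch sketch of the TAC idea. The layer sizes, activations, and residual connection are illustrative assumptions; please refer to utility/models in this repository for the actual implementation.

```python
import torch
import torch.nn as nn

class TACSketch(nn.Module):
    """Illustrative transform-average-concatenate module (not the repo's exact design)."""
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        # "transform": shared across microphones, applied to each channel independently
        self.transform = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.PReLU())
        # "average": the cross-channel mean goes through another shared layer
        self.average = nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.PReLU())
        # "concatenate": per-channel feature + averaged feature, projected back to input_dim
        self.concat = nn.Sequential(nn.Linear(hidden_dim * 2, input_dim), nn.PReLU())

    def forward(self, x):
        # x: (batch, n_mic, seq_len, input_dim); n_mic may differ between utterances
        B, M, T, D = x.shape
        f = self.transform(x)                        # per-channel transform
        g = self.average(f.mean(dim=1))              # average across microphones
        g = g.unsqueeze(1).expand(-1, M, -1, -1)     # broadcast the average back to every channel
        out = self.concat(torch.cat([f, g], dim=-1))
        return x + out                               # residual connection (an assumption here)
```

Because the only cross-microphone communication happens through the channel average, the output does not depend on microphone ordering and the same module works for any number of microphones.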
We implement TAC in the framework of the filter-and-sum network (FaSNet), a recently proposed multi-channel speech separation model that operates in the time domain. FaSNet is a neural beamformer that performs standard filter-and-sum beamforming in the time domain, with the beamforming coefficients estimated by a neural network in an end-to-end fashion. For details please refer to the original paper: "FaSNet: Low-latency Adaptive Beamforming for Multi-microphone Audio Processing".
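For orientation, here is a hedged sketch of the time-domain filter-and-sum operation itself, assuming the per-channel filters have already been estimated by the network. FaSNet performs this at the frame level, so the details differ; this only illustrates the basic operation.

```python
import torch
import torch.nn.functional as F

def filter_and_sum(mixture, filters):
    """
    mixture: (n_mic, sig_len) multi-channel time-domain signal
    filters: (n_mic, filter_len) estimated time-domain filters, one per microphone
    returns: beamformed single-channel output
    """
    n_mic, filter_len = filters.shape
    pad = filter_len // 2
    # apply each channel's filter to that channel via a grouped 1-D convolution
    filtered = F.conv1d(mixture.unsqueeze(0),           # (1, n_mic, sig_len)
                        filters.unsqueeze(1),           # (n_mic, 1, filter_len)
                        padding=pad, groups=n_mic)      # (1, n_mic, sig_len')
    # sum the filtered channels across microphones
    return filtered.sum(dim=1).squeeze(0)
```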
In this paper we make two main modifications to the original FaSNet:
- The original two-stage architecture is replaced by a single-stage architecture.
- TAC is applied throughout the filter estimation module to synchronize the information across microphones and allow the model to make global decisions while estimating the filter coefficients (see the sketch after this list).
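To make the second modification concrete, the sketch below shows one way TAC can be interleaved with per-channel processing blocks. `TACSketch` is the illustrative module from the sketch above, and the LSTM stand-ins are assumptions for brevity; the actual filter estimation blocks in this repository are DPRNN-based, as described below.

```python
import torch.nn as nn

class FilterEstimatorSketch(nn.Module):
    """Illustrative filter estimation stack: per-channel block -> TAC, repeated."""
    def __init__(self, feature_dim, hidden_dim, n_blocks=4):
        super().__init__()
        # stand-in per-channel blocks (the repository uses DPRNN blocks instead)
        self.blocks = nn.ModuleList(
            [nn.LSTM(feature_dim, feature_dim, batch_first=True) for _ in range(n_blocks)])
        # one TAC module after every block to exchange information across microphones
        self.tac = nn.ModuleList(
            [TACSketch(feature_dim, hidden_dim) for _ in range(n_blocks)])

    def forward(self, x):
        # x: (batch, n_mic, seq_len, feature_dim)
        B, M, T, D = x.shape
        for block, tac in zip(self.blocks, self.tac):
            y, _ = block(x.reshape(B * M, T, D))   # process each microphone independently
            x = tac(y.reshape(B, M, T, D))         # synchronize across microphones
        return x
```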
The figure below shows different designs of FaSNet models.
The building blocks for the filter estimation modules are based on dual-path RNNs (DPRNNs), a simple yet effective method for organizing RNN layers to allow successful modeling of extremely long sequential data. For details about DPRNN please refer to "Dual-path RNN: efficient long sequence modeling for time-domain single-channel speech separation". The implementation of DPRNN, as well as the combination of DPRNN and TAC, can be found in utility/models.
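As a rough illustration, below is a minimal sketch of a single DPRNN block, assuming the input has already been segmented into overlapping chunks and that bidirectional LSTMs are used for both paths. The normalization and projection choices are assumptions; see utility/models for the exact implementation used here.

```python
import torch
import torch.nn as nn

class DPRNNBlockSketch(nn.Module):
    """Illustrative dual-path RNN block: intra-chunk pass followed by inter-chunk pass."""
    def __init__(self, feature_dim, hidden_dim):
        super().__init__()
        # intra-chunk RNN: models the short sequence inside each chunk
        self.intra_rnn = nn.LSTM(feature_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.intra_proj = nn.Linear(hidden_dim * 2, feature_dim)
        self.intra_norm = nn.GroupNorm(1, feature_dim)
        # inter-chunk RNN: models the much shorter sequence of chunks
        self.inter_rnn = nn.LSTM(feature_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.inter_proj = nn.Linear(hidden_dim * 2, feature_dim)
        self.inter_norm = nn.GroupNorm(1, feature_dim)

    def forward(self, x):
        # x: (batch, feature_dim, chunk_len, n_chunks)
        B, D, K, S = x.shape
        # intra-chunk pass: treat every chunk as an independent short sequence
        intra = x.permute(0, 3, 2, 1).reshape(B * S, K, D)
        intra, _ = self.intra_rnn(intra)
        intra = self.intra_proj(intra).reshape(B, S, K, D).permute(0, 3, 2, 1)
        x = x + self.intra_norm(intra)
        # inter-chunk pass: treat each within-chunk position as a sequence over chunks
        inter = x.permute(0, 2, 3, 1).reshape(B * K, S, D)
        inter, _ = self.inter_rnn(inter)
        inter = self.inter_proj(inter).reshape(B, K, S, D).permute(0, 3, 1, 2)
        return x + self.inter_norm(inter)
```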
The model is evaluated on both ad-hoc array and fixed-geometry array configurations. We simulate two datasets based on the publicly available Librispeech corpus. For data generation, please refer to the data folder.