
Implement comparison method of Pfeifenberger et al 2017 #22

Open
mim opened this issue May 22, 2017 · 13 comments


mim commented May 22, 2017

Lukas Pfeifenberger, Matthias Zöhrer, Franz Pernkopf. "DNN-based Speech Mask Estimation for Eigenvector Beamforming." In ICASSP 2017.

Slides from their talk at ICASSP

nateanl self-assigned this May 24, 2017

nateanl commented Jun 6, 2017

I'm confused about "kernelized DNN". For each point in the spectrogram, there is a feature vector. But the kernels for different frequency bins are different. Does this mean I need to build 257 different autoencoder layers and merge the output together to feed into the regression layer?


mim commented Jun 6, 2017

The slide numbered 14 (actually page 29) in the presentation shows a flowchart of the network structure. Does that answer your question?


mim commented Jun 6, 2017

But yes, it looks like there is a separate small DNN for each frequency channel and their outputs are combined by the final regression layer.
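As a rough illustration of that structure, here is a minimal numpy forward pass, not the authors' implementation: the layer sizes, tanh/sigmoid nonlinearities, and random weights are all assumptions, and 257 bins corresponds to a 512-point FFT.

```python
import numpy as np

rng = np.random.default_rng(0)

F = 257     # frequency bins (512-point FFT assumed)
D_IN = 6    # input feature dimension per bin (hypothetical)
H = 8       # hidden units in each small per-bin DNN (hypothetical)

# One small two-layer network per frequency bin, stored as stacked weights.
W1 = rng.standard_normal((F, H, D_IN)) * 0.1
W2 = rng.standard_normal((F, 1, H)) * 0.1

# Final regression layer combining all per-bin outputs into F mask values.
W_reg = rng.standard_normal((F, F)) * 0.1

def forward(x):
    """x: (F, D_IN) features for one time frame -> (F,) mask estimate."""
    h = np.tanh(np.einsum('fhd,fd->fh', W1, x))   # per-bin hidden layer
    z = np.einsum('foh,fh->fo', W2, h)[:, 0]      # per-bin scalar output
    return 1 / (1 + np.exp(-(W_reg @ z)))         # shared regression layer, sigmoid

mask = forward(rng.standard_normal((F, D_IN)))
```

The point is only that the per-bin networks never share weights, while the regression layer sees all bins at once.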


nateanl commented Jun 6, 2017

Ok, I see.


nateanl commented Jul 11, 2017

One question: I tried to compute the PSD matrix of the clean speech. According to CHiME3's documentation, the reference for the simulated set is in tr05_ORG, but it contains only single-channel audio.
In this case, what is the formula for the PSD matrix? Do I just repeat the spectrogram six times to get a 6×1 vector?


mim commented Jul 12, 2017

If the power spectral density is supposed to be a 6x6 matrix per frequency, then you need to use the spatial image of the clean speech, not the original clean speech source signal. The spatial image of the clean speech is in the "reverberated" directory. If you need one power per frequency, then you can just average the speech power in the original clean speech source signals across time.
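Both cases above can be sketched in numpy. The shapes (6 mics, 257 bins) match the discussion, but the random STFT data here is purely illustrative, not CHiME3 audio:

```python
import numpy as np

rng = np.random.default_rng(0)
M, F, T = 6, 257, 100                  # mics, freq bins, frames (illustrative)
X = rng.standard_normal((M, F, T)) + 1j * rng.standard_normal((M, F, T))

# Case 1: 6x6 spatial PSD per frequency from the multichannel spatial image,
# averaged over frames: Phi[f] = (1/T) * sum_t x(f,t) x(f,t)^H
Phi = np.einsum('mft,nft->fmn', X, X.conj()) / T

# Case 2: a single power per frequency from one channel,
# averaged over time (for the single-channel clean source).
power = np.mean(np.abs(X[0]) ** 2, axis=-1)
```

Each `Phi[f]` is Hermitian and positive semidefinite by construction, which is what the eigenvector beamformer relies on.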


nateanl commented Sep 3, 2017

I can't find the "reverberated" directory in /home/data/CHiME3...
Also, in the paper, when the authors compute the ground truth, the formula is:

(formula image from the paper, not reproduced here)

What is the meaning of "Tr"?


mim commented Sep 3, 2017

Huh, that's strange, but I can confirm it is gone. It looks like Felix Las modified the /home/data/CHiME3/data/audio directory at the end of June. I guess just download it again from the CHiME3 website.


mim commented Sep 3, 2017

Tr means trace of the matrix, the sum of the diagonal, which is also the sum of the eigenvalues.
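A quick numerical check of that identity (the matrix here is an arbitrary symmetric example):

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])

tr = np.trace(A)                    # sum of the diagonal: 2 + 3 = 5.0
eigvals = np.linalg.eigvalsh(A)     # eigenvalues of the symmetric matrix

# The trace equals the sum of the eigenvalues.
assert np.isclose(tr, eigvals.sum())
```

The same holds for the Hermitian PSD matrices in the paper, where the eigenvalues are real and nonnegative.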


nateanl commented Sep 3, 2017

Got it.


mim commented Sep 3, 2017

Wait, there is no reverberated directory for CHiME3, that's just CHiME2. CHiME3 has channel 0 as the reference.


nateanl commented Sep 3, 2017

The PSD matrix of noise can be computed in this way. What about the PSD of speech? Does CHiME3 have 6 channels of speech audio?


mim commented Sep 4, 2017

Yes, it is available for (some of?) the simulated mixtures. They are different between training, dev, and eval, so check each one. Also read the CHiME3 paper.

Equation (15) in the Pfeifenberger paper is just to show that it works in that visualization (figure 1), you don't actually need it for the deployable version of the algorithm.

When the spatial images (6-channel recordings) of the speech and noise are available separately, you can use those directly to compute the PSD of the speech and noise. For an observed mixture, there are several ways to estimate them.
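One common way to estimate the PSDs from an observed mixture is to weight the mixture's per-frame outer products by a time-frequency mask. This is a generic sketch, not specifically the Pfeifenberger et al. estimator; the mask here is random stand-in data:

```python
import numpy as np

rng = np.random.default_rng(0)
M, F, T = 6, 257, 100                  # mics, freq bins, frames (illustrative)
Y = rng.standard_normal((M, F, T)) + 1j * rng.standard_normal((M, F, T))
speech_mask = rng.uniform(size=(F, T))  # hypothetical speech mask in [0, 1]

# Mask-weighted PSD estimate:
# Phi_s[f] ~= sum_t m(f,t) y(f,t) y(f,t)^H / sum_t m(f,t)
num = np.einsum('ft,mft,nft->fmn', speech_mask, Y, Y.conj())
Phi_s = num / speech_mask.sum(axis=-1)[:, None, None]
```

The noise PSD follows the same recipe with `1 - speech_mask` (or a dedicated noise mask) as the weight.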
