title | date | comments | categories | tags | |||
---|---|---|---|---|---|---|---|
Generative Adversarial Networks based Black-box Metric Scores Optimization for Speech Enhancement(MetricGAN) |
2020-08-10 11:10:00 -0700 |
true |
|
|
MetricGAN: Generative Adversarial Networks based Black-box Metric Scores Optimization for Speech Enhancement
The paper proposes a novel MetricGAN approach that aims to optimize the generator with respect to one or more evaluation metrics.
The metric can be chosen to suit your own task.
This paper uses the speech enhancement task.
Moreover, these metrics are generally complex and could not be fully optimized by
The adversarial loss should make the generated data indistinguishable from real data.
For ASR, simpler regression approaches may be preferable to GAN-based enhancement. The reason is that the discriminator only judges whether its input is real or fake, which is not fully related to the metrics we care about, so the adversarial loss still does not match the evaluation metrics. This problem is called discriminator-evaluation mismatch (DEM).(Since the more less adversarial loss(
The main advantages of MetricGAN are as follows:
- The surrogate function (the discriminator) is learned while the metric stays in a black-box setting: no details of the metric function are required.
- MetricGAN increases the metric score more efficiently than the $L_p$ loss.
- MetricGAN has the flexibility to generate speech with specific evaluation scores.
- Under some non-extreme conditions, MetricGAN can even achieve multi-metrics assignments by employing multiple discriminators.
LSGAN with
LSGAN with
To improve the optimized metric scores, D should be associated with the metric itself.
The two differences between (2) and (4):
- The target label is discrete (0 or 1) in CGAN, whereas it is continuous (between 0 and 1) in MetricGAN.
- The condition used in the D of CGAN is the noisy speech x, which is different from the condition used in the proposed MetricGAN (the clean speech y).
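The two differences can be made concrete with a small sketch. It assumes a least-squares (LSGAN-style) discriminator loss and that clean speech compared with itself receives the maximum normalized score of 1; this is one reading of the setup, not a copy of equations (2) and (4). The function names and sample numbers are hypothetical, and the `d_*` arguments stand in for outputs that a real discriminator network would produce.

```python
def cgan_d_loss(d_clean_given_x, d_enhanced_given_x):
    """Least-squares CGAN discriminator loss for one utterance (assumed form).

    Both D outputs are conditioned on the noisy speech x.
    Targets are discrete: 1 for clean speech, 0 for enhanced speech.
    """
    return (d_clean_given_x - 1.0) ** 2 + (d_enhanced_given_x - 0.0) ** 2


def metricgan_d_loss(d_clean_given_y, d_enhanced_given_y, q_enhanced):
    """Least-squares MetricGAN-style discriminator loss (assumed form).

    Both D outputs are conditioned on the clean speech y.
    Targets are continuous: the normalized metric score in [0, 1].
    Clean speech scored against itself is assumed to get the maximum, 1.
    """
    if not 0.0 <= q_enhanced <= 1.0:
        raise ValueError("normalized metric score must lie in [0, 1]")
    return (d_clean_given_y - 1.0) ** 2 + (d_enhanced_given_y - q_enhanced) ** 2


# Hypothetical numbers: D outputs 0.9 for clean and 0.6 for enhanced speech,
# and the enhanced speech has a normalized metric score of 0.6.
loss_cgan = cgan_d_loss(0.9, 0.6)
loss_metricgan = metricgan_d_loss(0.9, 0.6, q_enhanced=0.6)
```

With these numbers the CGAN loss penalizes D for giving the enhanced speech a score of 0.6, while the MetricGAN-style loss does not, because 0.6 is exactly the score the metric assigns.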
Training of G: because the adversarial loss in MetricGAN is efficient at raising the metric score, G is trained with the adversarial loss alone.
Overall, G tries to fool D into assigning the specified score, while D tries not to be fooled and to predict the true score.
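The opposing objectives of G and D can be sketched as follows. This assumes squared-error objectives, where `target_score` plays the role of the specified score s and `true_score` is the normalized output of the black-box metric; both function names are hypothetical and the numbers are made up for illustration.

```python
def generator_loss(d_enhanced_given_y, target_score=1.0):
    """G wants D's predicted score for G(x) to equal the specified score s."""
    return (d_enhanced_given_y - target_score) ** 2


def discriminator_term(d_enhanced_given_y, true_score):
    """D wants its predicted score for G(x) to equal the true metric score."""
    return (d_enhanced_given_y - true_score) ** 2


# Hypothetical situation: the metric gives the enhanced speech 0.5,
# D currently predicts 0.5, and G is asked to reach s = 1.0.
d_prediction = 0.5
g_loss = generator_loss(d_prediction, target_score=1.0)
d_loss = discriminator_term(d_prediction, true_score=0.5)
```

Here D is already accurate (its loss is zero), so the only way for G to lower its own loss is to produce speech that D, and therefore the metric D imitates, scores closer to s.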
Model: generator --> a BLSTM with two bidirectional layers of 200 nodes each, followed by two fully connected layers with 300 LeakyReLU nodes and 257 sigmoid nodes, respectively. The sigmoid layer performs mask estimation.
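As a rough illustration of what the 257 sigmoid outputs are for, the sketch below applies a mask to a noisy magnitude spectrogram. It assumes the mask is multiplied element-wise with the noisy magnitudes and that 257 corresponds to the frequency bins of a 512-point FFT (512 / 2 + 1); neither detail is stated in these notes. `apply_mask` and `N_BINS` are hypothetical names, and the logits are placeholders for the last layer's pre-activation values.

```python
import math

N_BINS = 512 // 2 + 1  # 257 frequency bins, assuming a 512-point FFT


def sigmoid(z):
    """Squash a real value into (0, 1), as the last generator layer does."""
    return 1.0 / (1.0 + math.exp(-z))


def apply_mask(noisy_magnitude, mask_logits):
    """Scale each time-frequency bin of the noisy spectrogram by its mask value.

    noisy_magnitude, mask_logits: lists of frames, each frame a list of
    per-frequency values of equal length.
    """
    enhanced = []
    for mag_frame, logit_frame in zip(noisy_magnitude, mask_logits):
        if len(mag_frame) != len(logit_frame):
            raise ValueError("frame lengths must match")
        enhanced.append([sigmoid(z) * m for z, m in zip(logit_frame, mag_frame)])
    return enhanced


# One hypothetical frame: a logit of 0 gives a mask of 0.5, halving every bin.
noisy = [[2.0] * N_BINS]
logits = [[0.0] * N_BINS]
enhanced = apply_mask(noisy, logits)
```

Because the sigmoid keeps every mask value between 0 and 1, a mask of this kind can only attenuate a bin, never amplify it.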
discriminator --> four 2-D convolutional layers with [number of filters, (kernel size)] of [15,(5,5)], [25,(7,7)], [40,(9,9)], and [50,(11,11)], then a global average pooling layer that reduces the feature maps to 50 dimensions, followed by three fully connected layers with 50, 10, and 1 nodes.
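The role of the global average pooling layer can be shown with a minimal sketch: each of the 50 feature maps from the last convolutional layer is averaged down to a single value, so the fully connected layers always receive 50 inputs regardless of utterance length. The function name and the toy feature maps are hypothetical, and the nested lists stand in for real tensors.

```python
def global_average_pool(feature_maps):
    """Average each channel over all of its time-frequency positions.

    feature_maps: list of channels; each channel is a list of rows
    (time frames), each row a list of values (frequency positions).
    Returns one value per channel.
    """
    pooled = []
    for channel in feature_maps:
        total = sum(sum(row) for row in channel)
        count = sum(len(row) for row in channel)
        pooled.append(total / count)
    return pooled


# Two hypothetical inputs with 50 channels but different numbers of frames.
# Channel c is filled with the constant value c to make the result easy to read.
short_utterance = [[[float(c)] * 4 for _ in range(3)] for c in range(50)]
long_utterance = [[[float(c)] * 4 for _ in range(10)] for c in range(50)]

pooled_short = global_average_pool(short_utterance)
pooled_long = global_average_pool(long_utterance)
```

Both inputs yield a 50-dimensional vector, which is what lets a fixed-size fully connected stack follow convolutional layers that see variable-length speech.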
Adam with
Experiments use the TIMIT dataset with PESQ and STOI scores. MetricGAN (P): PESQ optimization (denoted as PE policy grad (P)). MetricGAN (S): STOI optimization (denoted as ST policy grad (S)). For the target score s, the higher the assigned s, the more precise the picture of the generated speech.