solved
batch items should have the same shape
Feature concatenation in the GS-fusion block: shape mismatch.
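A hedged sketch of aligning the two feature streams before concatenation, assuming the mismatch is along the time axis (~100 Hz audio features vs 25 fps video, hence the factor 4); the actual GS-fusion block may differ:

```python
import torch

def fuse(audio_feats: torch.Tensor, video_feats: torch.Tensor) -> torch.Tensor:
    # audio_feats: (batch, T_audio, C_audio) at ~100 Hz
    # video_feats: (batch, T_video, C_video) at 25 fps
    # Repeat each video frame 4x along time so both streams have the same
    # length, then concatenate along the channel dimension.
    video_up = video_feats.repeat_interleave(4, dim=1)
    video_up = video_up[:, : audio_feats.shape[1]]   # trim a possible off-by-one
    return torch.cat([audio_feats, video_up], dim=-1)
```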
ShuffleNet is for image classification; need to take the features without the last FC layer.
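A minimal sketch of dropping the classification head, assuming torchvision's shufflenet_v2_x1_0; the variant and weights actually used may differ:

```python
import torch
import torch.nn as nn
from torchvision.models import shufflenet_v2_x1_0

# Drop the classification head so the network returns pooled features
# instead of the 1000 ImageNet class logits.
backbone = shufflenet_v2_x1_0(weights="IMAGENET1K_V1")
backbone.fc = nn.Identity()          # replace the last fc layer with a pass-through

faces = torch.randn(8, 3, 224, 224)  # batch of face crops
features = backbone(faces)           # -> (8, 1024) feature vectors
```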
find_unused_parameters=True in the DDP constructor: unused parameters in the ShuffleNet and LSTM block (typo).
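A minimal sketch of the DDP wrapping; wrap_model is a hypothetical helper and local_rank comes from the launcher (torch.distributed.init_process_group must already have run):

```python
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

def wrap_model(model: torch.nn.Module, local_rank: int) -> DDP:
    # find_unused_parameters=True lets DDP tolerate parameters that get no
    # gradient in a given forward pass (here: parts of the ShuffleNet and
    # LSTM blocks), at the cost of an extra traversal of the autograd graph.
    return DDP(
        model.to(local_rank),
        device_ids=[local_rank],
        find_unused_parameters=True,
    )
```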
metrics
I HAD TO ADD `export PYTHONPATH="/export/home/7alieksi/ba/src"` to .bashrc to run this
open
ShuffleNet: normalize the input, input size 224.
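A hedged preprocessing sketch for the 224x224 ShuffleNet input, assuming the standard torchvision ImageNet normalization statistics:

```python
from torchvision import transforms

# Resize the face crops to 224x224 and normalize channel-wise with
# the (assumed) ImageNet statistics.
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),                      # HWC uint8 -> CHW float in [0, 1]
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
```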
loss function too close to zero.
WER metric +
RTF metric +
How does real-time work? +
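A minimal sketch of how WER and RTF can be computed; jiwer and the 16 kHz sample rate are assumptions, the thesis code may use other tooling:

```python
import time
import jiwer  # assumed WER tooling; any edit-distance WER implementation works

def word_error_rate(reference: str, hypothesis: str) -> float:
    # WER = (substitutions + deletions + insertions) / number of reference words
    return jiwer.wer(reference, hypothesis)

def real_time_factor(process_fn, waveform, sample_rate: int = 16_000) -> float:
    # RTF = processing time / audio duration; RTF < 1 means faster than real time.
    start = time.perf_counter()
    process_fn(waveform)
    return (time.perf_counter() - start) / (len(waveform) / sample_rate)
```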
layernorm
dataset: LRS3 only, from Julius:
train twice: first with the mix and s1 as target speaker, second with the mix and s2 as target speaker. s1 is louder than s2.
cropped to 2 seconds
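A rough sketch of turning one mixture into the two training examples with a 2-second crop; the 16 kHz sample rate and the random crop position are assumptions:

```python
import random
import torch

SAMPLE_RATE = 16_000      # assumed
CROP = 2 * SAMPLE_RATE    # 2-second crop

def make_examples(mix: torch.Tensor, s1: torch.Tensor, s2: torch.Tensor):
    # One mixture is used twice: once with s1 as the target, once with s2.
    start = random.randint(0, mix.shape[-1] - CROP)
    sl = slice(start, start + CROP)
    return [(mix[..., sl], s1[..., sl]), (mix[..., sl], s2[..., sl])]
```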
todo
compare AO vs AV vs naive AV
new dataset with interfering speakers
Different SNR
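A hedged sketch of mixing an interfering speaker at a chosen SNR, using the usual power-ratio definition; the actual mixing script may differ:

```python
import torch

def mix_at_snr(target: torch.Tensor, interferer: torch.Tensor, snr_db: float) -> torch.Tensor:
    # Scale the interferer so that 10 * log10(P_target / P_interferer) == snr_db,
    # then add it to the target speaker.
    p_target = target.pow(2).mean()
    p_interf = interferer.pow(2).mean()
    scale = torch.sqrt(p_target / (p_interf * 10 ** (snr_db / 10)))
    return target + scale * interferer
```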
chunk the input for real-time: 160 samples = 10 ms = 1 audio frame; 4 audio frames (640 samples = 40 ms) = 1 video frame
LSTM at 100 Hz.
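A minimal sketch of the chunking arithmetic, assuming 16 kHz audio and 25 fps video:

```python
SAMPLE_RATE = 16_000   # assumed audio sample rate
HOP = 160              # 160 samples = 10 ms = one audio frame (100 Hz LSTM rate)
VIDEO_CHUNK = 640      # 640 samples = 40 ms = 4 audio frames = 1 video frame (25 fps)

def audio_chunks(waveform, chunk=VIDEO_CHUNK):
    # Yield one video frame's worth of audio (4 audio frames) at a time so the
    # audio and video streams stay aligned during streaming inference.
    for start in range(0, len(waveform) - chunk + 1, chunk):
        yield waveform[start:start + chunk]
```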
dataset is too complicated
add images to the BA (input waveform -> input features -> mask -> applied mask -> output features -> output waveform)
add audio samples to GitHub and write about them in the BA
BA structure:
\section{Introduction}
tell story, non-technical
0. General introduction
1. Audio-only with limitations
2. AV SE benefits Video importance
3. AV-SE in real time on CPU
ADD structure of BA
\section{Problem}
Problem formulation (source: avseOverview, chapter 2)
real-time capabilities on CPU
try to avoid repetition from introduction
\section{Related work}
add more details, technical
AO-SE problem with interfering speakers
Personalized speech enhancement
Assistance of video: AV-SE
About deep learning:
Summary of E3Net + Conv-TasNet
- Summary of deeper model architecture papers
ResNet
DenseNet
\section{Methodology}
Model overview
Audio:
Conv encoder, and conv encoder vs STFT
LayerNorm?
Enhancement network (masking network): what is an LSTM (RNN, Seq2Seq models overview 7b); see the sketch after this outline
Audio decoder
Dense connection module: look at it more closely!
Video:
ShuffleNetV2 vs alternatives ([22, 23]); preprocessing
only crops of the target faces are considered (overview ch 5a)
Fusion module
Describe all metrics (WER, PESQ, SDR, RTF)
\section{Training, Testing and Evaluation}
Preparation of datasets (LRS3_30h)
Results
\section{Conclusion}
pros and cons, limitations and benefits of the method
Compared to knowledge-based approaches
Audio: several targeted speakers (?)
Visual: illumination changes, occlusion and pose variations.
\subsection{Future work}
Assistance of text? Find a source.
Assistance of a picture only (source: overview, 5d)
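Referenced from the methodology outline above: a rough sketch of the conv-encoder / LSTM-mask / conv-decoder idea. Channel count, kernel length, and stride are illustrative placeholders, not the thesis values:

```python
import torch
import torch.nn as nn

class MaskingEnhancer(nn.Module):
    """Conv encoder -> LSTM mask estimator -> conv decoder (illustrative sizes)."""

    def __init__(self, channels: int = 256, kernel: int = 320, stride: int = 160):
        super().__init__()
        self.encoder = nn.Conv1d(1, channels, kernel, stride=stride)            # learned analysis filterbank
        self.lstm = nn.LSTM(channels, channels, batch_first=True)               # enhancement (masking) network
        self.mask = nn.Sequential(nn.Linear(channels, channels), nn.Sigmoid())  # mask in [0, 1]
        self.decoder = nn.ConvTranspose1d(channels, 1, kernel, stride=stride)   # learned synthesis filterbank

    def forward(self, wav: torch.Tensor) -> torch.Tensor:    # wav: (batch, samples)
        feats = self.encoder(wav.unsqueeze(1))                # (batch, C, T)
        h, _ = self.lstm(feats.transpose(1, 2))               # (batch, T, C)
        m = self.mask(h).transpose(1, 2)                      # (batch, C, T)
        return self.decoder(feats * m).squeeze(1)             # masked features -> waveform
```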