DDSP experiments in Faust.
- What is DDSP?
- DDSP in Faust
- The
diff
library- The Autodiff Environment
- Differentiable Primitives
- Number Primitive
- Identity Function
- Add Primitive
- Subtract Primitive
- Multiply Primitive
- Divide Primitive
- Power Primitive
int
Primitivemem
Primitive@
Primitivesin
Primitivecos
Primitivetan
Primitiveasin
Primitiveacos
Primitiveatan
Primitiveatan2
Primitiveexp
Primitivelog
Primitivelog10
Primitivesqrt
Primitiveabs
Primitivemin
Primitivemax
Primitivefloor
Primitiveceil
Primitive
- Helper Functions
- Neural Networks
- Roadmap
Differentiable programming is a technique whereby a program can be differentiated with respect to its inputs, permitting the computation of the sensitivity of the program's outputs to changes in its inputs. Partial derivatives of a program can be found analytically via automatic differentiation and, coupled with an appropriate loss function, used to perform gradient descent. Differentiable programming has consequently become a key tool in solving machine learning problems.
Differentiable digital signal processing (DDSP) is the specific application of differentiable programming to audio tasks. DDSP has emerged as a key component in machine learning approaches to problems such as source separation, timbre transfer, parameter estimation, etc. DDSP is reliant on a programming language with a supporting framework for automatic differentiation.
Trigger warning: some mild-to-moderate calculus will follow
To write automatically differentiable code we need analytic expressions for the derivatives of the primitive operations in our program.
Let's consider the example of the addition primitive; in Faust one can write:
process = +;
which yields the block diagram:
So, the output signal, the result of Faust's process
, which we'll call
Note that the addition primitive doesn't know anything about its arguments, their origin, provenance, etc., it just consumes them and returns their sum. In Faust's algebra, the addition of two signals (and just about everything in Faust is a signal) is well-defined, and that's that. This idea will be important later.
Now, say
It happens that the derivative of an addition is also an addition, except this time an addition of the derivatives of the arguments with respect to the variable of interest.
In Faust, we could express this fact as follows:
process = +,+;
If we did, we'd be describing, in parallel,
This is a dual number representation, or more accurately, since we're working with Faust, a dual signal representation. Being able to pass around our algorithm and its derivative in parallel, as dual signals, is pretty handy, as we'll see later. Anyway, what we've just defined is a differentiable addition primitive.
Just as the addition primitive has no knowledge of its input signals, nor does
its differentiable counterpart.
The differentiable primitive promises the following: "give me _
.
x = hslider("x", 0, -1, 1, .1);
u = _;
v = x;
dudx = 0;
dvdx = 1;
process = u,v,dudx,dvdx : +,+;
The first output of this program is the result of an expression describing an
input signal with a DC offset
We can generalise things a bit by defining a differentiable input2 and a differentiable slider:
diffInput = _,0;
diffSlider = hslider("x", 0, -1, 1, .1),1;
Simply applying the differentiable addition primitive to these new primitives isn't going to work because inputs to the adder won't arrive in the correct order; we can fix this with a bit of routing however:
diffAdd = route(4,4,(1,1),(2,3),(3,2),(4,4)) : +,+;
Now we can write:
process = diffInput,diffSlider : diffAdd;
The outputs of our program are the same as before, but we've computed the derivative automatically — to be precise, we've implemented forward mode automatic differentiation. Now we have the makings of a modular approach to automatic differentiation based on differentiable primitives and dual signals.
The above works fine for a single variable, but what if our program has more than one variable? Consider the following non-differentiable example featuring a gain control and a DC offset:
x1 = hslider("gain", .5, 0, 1, .1);
x2 = hslider("dc", 0, -1, 1, .1);
process = _,x1 : *,x2 : +;
We can write this as:
Now, rather than being a lone ordinary derivative
Our algorithm takes two parameter inputs, and produces one output signal, so the
resulting Jacobian matrix is of dimension
Returning to dual number representation and applying the chain and product rules of differentiation, we have:
To implement the above in Faust, let's define some multivariate differentiable primitives:
diffInput(nvars) = _,par(i,nvars,0);
diffSlider(nvars,I,init,lo,hi,step) = hslider("x%I",init,lo,hi,step),par(i,nvars,i==I-1);
diffAdd(nvars) = route(nIN,nOUT,
(u,1),(v,2), // u + v
par(i,nvars,
(u+i+1,dx),(v+i+1,dx+1) // du/dx_i + dv/dx_i
with {
dx = 2*i+3; // Start of derivatives wrt ith var
}
)
) with {
nIN = 2+2*nvars;
nOUT = nIN;
u = 1;
v = u+nvars+1;
} : +,par(i, nvars, +);
diffMul(nvars) = route(nIN,nOUT,
(u,1),(v,2), // u * v
par(i,nvars,
(u,dx),(dvdx,dx+1), // u * dv/dx_i
(dudx,dx+2),(v,dx+3) // du/dx_i * v
with {
dx = 4*i+3; // Start of derivatives wrt ith var
dudx = u+i+1;
dvdx = v+i+1;
}
)
) with {
nIN = 2+2*nvars;
nOUT = 2+4*nvars;
u = 1;
v = u+nvars+1;
} : *,par(i, nvars, *,* : +);
The routing for diffAdd
and diffMul
is a bit more involved, but the same
principle applies as for the univariate differentiable addition primitive.
Our dual signal representation now consists, for each primitive, of the
undifferentiated primitive, and, in parallel, nvars
partial derivatives, each
with respect to the
Armed with the above we can write the differentiable equivalent of our gain+DC example:
NVARS = 2;
x1 = diffSlider(NVARS,1,.5,0,1,.1);
x2 = diffSlider(NVARS,2,0,-1,1,.1);
process = diffInput(NVARS),x1 : diffMul(NVARS),x2 : diffAdd(NVARS);
Estimating Hidden Parameters
Assigning the above algorithm to a variable estimate
, we can compare its
first output, groundTruth
algorithm with hard-coded gain and DC values.
We'll use Faust's default sine wave oscillator as input to both algorithms,
and, to perform the comparison, we'll use a time-domain L1-norm loss function:
import("stdfaust.lib"); // For os.osc, si.bus, etc.
process = os.osc(440.) <: groundTruth,estimate : loss,si.bus(NVARS)
with {
groundTruth = _,.5 : *,-.5 : +;
NVARS = 2;
x1 = diffSlider(NVARS,1,1,0,1,.1);
x2 = diffSlider(NVARS,2,0,-1,1,.1);
estimate = diffInput(NVARS),x1 : diffMul(NVARS),x2 : diffAdd(NVARS);
loss = ro.cross(2) : - : abs <: attach(hbargraph("loss",0,2));
};
Running this in the
Faust web IDE,
we can drag the sliders x1
and x2
around until we minimise the value
reported by the loss function, thus discovering the "hidden" parameters of the
ground truth.
TODO: loss gif
So far we haven't made use of our Faust program's partial derivatives. Our next step is to automate parameter estimation by incorporating these derivatives into a gradient descent algorithm.
Gradients are found as the derivative of the loss function with respect to
In Faust, we can't programmatically update the value of a slider.3 What we ought to do at this point, to automate the estimation of parameter values, is invert our approach; we'll use sliders for our "hidden" parameters, and define a differentiable variable to represent their "learnable" counterparts:
diffVar(nvars,I,graph) = -~_ <: attach(graph),par(i,nvars,i+1==I);
diffVar
handles the subtraction of the scaled gradient, and we can pass it a
bargraph to display the current parameter value.
To supply gradients to the learnable parameters the program has to be set up as a rather grand recursion:
import("stdfaust.lib");
process = os.osc(440.)
: hgroup("DDSP",(route(1+NVARS,2+NVARS,(1+NVARS,1),(1+NVARS,2),par(i,NVARS,(i+1,i+3)))
: vgroup("[0]Parameters",groundTruth,learnable)
: route(2+NVARS,4+NVARS,(1,1),(2,2),(1,3),(2,4),par(i,NVARS,(i+3,i+5)))
: vgroup("[1]Loss & Gradients",loss,gradients)
)) ~ (!,si.bus(NVARS))
with {
groundTruth = vgroup("Hidden",
_,hslider("[0]gain",.5,0,1,.1) : *,hslider("[1]DC",-.5,-1,1,.1) : +
);
NVARS = 2;
x1 = diffVar(NVARS,1,hbargraph("[0]gain", 0, 1));
x2 = diffVar(NVARS,2,hbargraph("[1]DC", -1, 1));
learnable = vgroup("Learned", diffInput(NVARS),x1,_ : diffMul(NVARS),x2 : diffAdd(NVARS));
loss = ro.cross(2) : - : abs <: attach(hbargraph("[1]loss",0.,2));
alpha = hslider("[0]Learning rate [scale:log]", 1e-4, 1e-6, 1e-1, 1e-6);
gradients = (ro.cross(2): -),si.bus(NVARS)
: route(NVARS+1,2*NVARS+1,(1,1),par(i,NVARS,(1,i*2+3),(i+2,2*i+2)))
: (abs,1e-10 : max),par(i,NVARS, *)
: route(NVARS+1,NVARS*2,par(i,NVARS,(1,2*i+2),(i+2,2*i+1)))
: par(i,NVARS, /,alpha : * <: attach(hbargraph("gradient %i",-1e-2,1e-2)));
};
Running this code in the web IDE, we see the learned gain and DC values leap (more or less eagerly depending on the learning rate) to meet the hidden values.
Note that we actually needn't compute the loss function, unless we wanted to
use some low threshold on route()
instance in vgroup("DDSP",...)
.
The example we've just considered is a pretty basic one, and if the inputs to
groundTruth
and learnable
were out of phase by, say, 25 samples, it
would be a lot harder to minimise the loss function.
To work around this we might take time-domain loss over windowed chunks of
input, or compute phase-invariant loss in the frequency domain.
To include the diff
library, use Faust's library
expression:
df = library("/path/to/diff.lib");
The library defines a selection of differentiable primitives and helper functions for describing differentiable Faust programs.
diff
uses Faust's pattern matching feature where possible.
To avoid having to pass the number of differentiable parameters to each
primitive, differentiable primitives are defined within an environment
expression named df.env
.
Begin by defining parameters with df.vars
and then call df.env
, passing
in the parameters as an argument, e.g.:
df = library("diff.lib");
...
vars = df.vars((x1,x2))
with {
x1 = -~_ <: attach(hbargraph("x1",0,1));
x2 = -~_ <: attach(hbargraph("x2",0,1));
};
d = df.env(vars);
Having defined a differentiable environment in this way, primitives can be called as follows, and the appropriate number of partial derivatives will be calculated:
process = d.diff(+);
Additionally, parameters themselves can be accessed with vars.var(n)
, where
n
is the parameter index, starting from 1:
df = library("diff.lib");
vars = df.vars((gain))
with {
gain = -~_ <: attach(hbargraph("gain",0,1));
};
d = df.env(vars);
process = d.input,vars.var(1) : d.diff(*);
The number of parameters can be accessed with vars.N
:
...
learnable = d.input,si.bus(vars.N) // A differentiable input, N gradients
...
For the examples for the primitives that follow, assume the following boilerplate:
df = library("diff.lib");
vars = df.vars((x1,x2)) with { x1 = -~_; x2 = -~_; };
d = df.env(vars);
diff(x)
- Input: a constant numerical expression, i.e. a signal of constant value
x
- Output: one dual signal consisting of the constant signal and
vars.N
partial derivatives, which all equal$0$ .
ma = library("maths.lib");
process = d.diff(2*ma.PI);
diff(_)
- Input: one dual signal
- Output: the unmodified dual signal
process = d.diff(_);
diff(!)
- Input: one dual signal
- Output: None (no signals returned)
process = d.diff(!), _;
diff(+)
- Input: two dual signals
- Output: one dual signal consisting of the sum and
vars.N
partial derivatives
process = d.diff(+);
diff(-)
- Input: two dual signals
- Output: one dual signal consisting of the difference and
vars.N
partial derivatives
process = d.diff(-);
diff(*)
- Input: two dual signals
- Output: one dual signal consisting of the product and
vars.N
partial derivatives
process = d.diff(*);
diff(/)
- Input: two dual signals
- Output: one dual signal consisting of the quotient and
vars.N
partial derivatives
NB. To prevent division by zero in the partial derivatives, diff(/)
uses whichever is the largest of
process = d.diff(/);
diff(^)
- Input: two dual signals
- Output: one dual signal consisting of the first input signal raised to the
power of the second, and
vars.N
partial derivatives.
process = d.diff(^);
diff(int)
- Input: one dual signal
- Output: one dual signal consisting of the integer cast and
vars.N
partial derivatives
NB. int
is a discontinuous function, and its derivative is impulse-like at
integer values of
process = d.diff(int);
diff(mem)
- Input: one dual signal
- Output: one dual signal consisting of the delayed signal and
vars.N
delayed partial derivatives
process = d.diff(mem);
diff(@)
- Input: two dual signals
- Output: one dual signal consisting of the first input signal delayed by the
second, and
vars.N
partial derivatives of the delay expression
NB. the general time-domain expression for the derivative of a delay features
a component which is a derivative with respect to (discrete) time:
diff(@)
is of
limited use for time-variant
process = d.input,d.diff(10) : d.diff(@);
diff(sin)
- Input: one dual signal
- Output: one dual signal consisting of the sine of the input and
vars.N
partial derivatives
process = d.diff(sin);
diff(cos)
- Input: one dual signal
- Output: one dual signal consisting of the cosine of the input and
vars.N
partial derivatives
process = d.diff(cos);
diff(tan)
- Input: one dual signal
- Output: one dual signal consisting of the tangent of the input and
vars.N
partial derivatives
NB. To prevent division by zero in the partial derivatives, diff(tan,vars.N)
uses whichever is the largest of
process = d.diff(tan);
diff(asin)
- Input: one dual signal
- Output: one dual signal consisting of the arcsine of the input and
vars.N
partial derivatives
NB. To prevent division by zero in the partial derivatives, diff(asin,vars.N)
uses whichever is the largest of
process = d.diff(asin);
diff(acos)
- Input: one dual signal
- Output: one dual signal consisting of the arccosine of the input and
vars.N
partial derivatives
NB. To prevent division by zero in the partial derivatives, diff(acos,vars.N)
uses whichever is the largest of
process = d.diff(acos);
diff(atan)
- Input: one dual signal
- Output: one dual signal consisting of the arctan of the input and
vars.N
partial derivatives
process = d.diff(atan);
diff(atan2)
- Input: two dual signals
- Output: one dual signal consisting of the arctan2 of the input and
vars.N
partial derivatives
process = d.diff(atan2);
diff(exp)
- Input: one dual signals
- Output: one dual signal consisting of the exp of the input and
vars.N
partial derivatives
process = d.diff(exp);
diff(log)
- Input: one dual signals
- Output: one dual signal consisting of the log of the input and
vars.N
partial derivatives
process = d.diff(log);
diff(log10)
- Input: one dual signals
- Output: one dual signal consisting of the
$log_{10}$ of the input andvars.N
partial derivatives
process = d.diff(log10);
diff(sqrt)
- Input: one dual signals
- Output: one dual signal consisting of the sqrt of the input and
vars.N
partial derivatives
process = d.diff(sqrt);
diff(abs)
- Input: one dual signals
- Output: one dual signal consisting of the abs of the input and
vars.N
partial derivatives
process = d.diff(abs);
diff(min)
- Input: two dual signals
- Output: one dual signal consisting of the min of the input and
vars.N
partial derivatives
process = d.diff(min);
diff(max)
- Input: two dual signals
- Output: one dual signal consisting of the max of the input and
vars.N
partial derivatives
process = d.diff(max);
diff(floor)
- Input: one dual signals
- Output: one dual signal consisting of the floor of the input and
vars.N
partial derivatives
process = d.diff(floor);
diff(ceil)
- Input: one dual signals
- Output: one dual signal consisting of the floor of the input and
vars.N
partial derivatives
process = d.diff(ceil);
Remainder of the primitives are defined as the following:
This is due to the fact that these primitives (especially bitwise primitives and such) are not very defined in autodiff. These operators are naturally not defined in differentiation and hence its implementation will vary from case to case; as a result, this seems like the best solution to deal with the issue at the moment.
input
process = d.input;
rec(f~g,ngrads)
A utility for supporting the creation of differentiable recursive circuits. Facilitates the passing of gradients into the body of the recursion.
- Inputs:
f
: A differentiable expression taking two dual signals as input and producing one dual signal as output.g
: A differentiable expression taking one dual signal as input and producing one dual signal as output.ngrads
: The number of differentiable variables ing
, i.e. the number of gradients to be passed into the body of the recursion.
- Outputs: One dual signal; the result of the recursion.
E.g. a differentiable 1-pole filter with one parameter, the coefficient of the feedback component:
process = gradient,d.input : df.rec(f~g,1)
with {
vars = df.vars((a)) with { a = -~_; };
d = df.env(vars);
f = d.diff(+);
g = d.diff(_),vars.var(1) : d.diff(*);
gradient = _;
};
phasor(f0)
osc(f0)
A utility function used to iterate N
times through a summation of dual signals.
sumall(N)
This backpropagation circuit is exclusively for parameter estimation and it functions via the creation of gradients and a loss function in order to guide the learnable
parameter to the groundTruth
. The available loss functions can be found below.
backprop(groundTruth, learnable, lossFunction)
NB. this is defined outside of the autodiff environment, e.g.:
df = library("diff.lib");
...
process = df.backprop(groundTruth, learnable, lossFunction);
backpropNoInput(groundTruth, learnable, lossFunction)
NB. this is defined outside of the autodiff environment, e.g.:
df = library("diff.lib");
...
process = df.backpropNoInput(groundTruth, learnable, lossFunction);
This loss function requires a windowSize that allows Faust to record the (ba.)slidingMean of the last windowSize
inputs to the loss function. This allows the input to be averaged over a small period of time and avoid random spikes of inputs or inconsistencies in signals. This loss function also calculates the loss as well the gradients to guide the learnable
parameter to the required truth
parameter.
Furthermore, optimizer
can be substituted with a scheduler as listed below.
learnMAE(windowSize, optimizer)
- Input: windowSize, optimizer
- Output: loss, a gradient per parameter defined in the environment
Mathematically, this loss function is defined as:
where
learnMSE(windowSize, optimizer)
- Input: windowSize, optimizer
- Output: loss, a gradient per parameter defined in the environment
Mathematically, this loss function is defined as:
where
learnMSLE(windowSize, optimizer)
- Input: windowSize, optimizer
- Output: loss, a gradient per parameter defined in the environment
Mathematically, this loss function is defined as:
where
learnHuber(windowSize, optimizer, delta)
- Input: windowSize, optimizer, delta
- Output: loss, a gradient per parameter defined in the environment
Mathematically, this loss function is defined as:
while, gradients are defined in autodiff as:
where
NB. This loss function converges to the global minima for the range
learnLinearFreq(windowSize, optimizer)
This scheduler decays the learnable rate by
- Input: learning_rate, epoch, delta
- Output: resulting learning_rate
learning_rate : learning_rate_scheduler(epoch, delta)
NB. We plan to implement other schedulers such as cosine decay and more.
Optimizers are algorithms or methods used in ML to adjust the learning rate / gradients of a model in order to minimize the loss function.
One can easily introduce the concept of momentum into their optimizer by simply modifying their variables in the diff environment to the following:
df = library("diff.lib");
vars = df.vars((x1,x2))
with {
x1 = _ : +~(_ : *(momentum)) : -~_ <: attach(hbargraph("x1",0,1));
x2 = _ : +~(_ : *(momentum)) : -~_<: attach(hbargraph("x2",0,1));
momentum = 0.9;
};
We suggest the use of this only when using SGD as the optimizer.
This is a regular stochastic gradient descent optimizer which does not account for an adaptive learning rate. This performs pure gradient descent.
optimizeSGD(learningRate)
This is an optimizer, implemented as per the original Adam paper6.
optimizeAdam(learningRate, beta1, beta2)
We recommend beta1 and beta2 to be 0.9 and 0.999; similar to Kera's recommendations.
This is an optimizer, implemented as per the original RMSProp presentation7.
optimizeRMSProp(learningRate, rho)
We recommend rho to be 0.9; similar to Kera's recommendations.
An implementation of any of the above optimizers can be seen below:
df.backprop(truth,learnable,d.learnMAE(1<<5,d.optimizeSGD(1e-3)))
The core concept of NNs is a neuron. We introduce the concept of a neuron in this library. Backpropagation is a difficult in a functional language, such as Faust.
Since much of what we have covered deals with parameter estimation, we know that gradient descent is especially effective to do this task in DSP. The issue is the creation of a fully functioning ML model that can create accurate weights and biases to deal with tasks such as classification, regression and more. This can also be extended to more complex models such as the creation of generative models, such as autoencoders.
We introduce the concept of a single functioning neuron in example .\examples\experiments\single_neuron.dsp. This allows us to take a single hidden layer between the input and the output. This example serves to an classification example to illustrate how Faust deals with non-linearities and how quickly epoches occur in Faust.
As we know, the structure of a single neuron looks something like so:
In Faust, this looks something like this, along with the backpropagation algorithm:
So, what exactly happens in a neuron? Say we use a sigmoid function as a non-linear activation function in this example. The hidden layer calculates the following:
Mathematically, the backpropagation algorithm, even for a single neuron, is complex. This example utilizes the MAE / L1-norm loss function. The loss is represented as:
This seems simple enough, but for defining the gradients, we need to define:
This needs to further simplified for our usage via chain rule:
This is pretty complex. Imagine the complexity for more deeper layers! You would have chains of chains of chains ... rules. As a result, there is a definite need for a generalization for such gradients.
NB. Faust's system of running epoches is very quick (it reaches 20000 epoches in about 10 seconds) and hence the chance of overfitting is very high in this example.
- A dataset creation / storage method...
- A more generalized method for calculating gradients for weights in neurons...
- Automatic parameter normalisation...
- Reverse mode autodiff...
- Batched training data/ground truth...
- Offline training → weights → real-time inference...
Footnotes
-
This serves well enough for the example at hand, but in practice — in a machine learning implementation — a learnable parameter is more like a bargraph. We'll get to that later. ↩
-
An input isn't strictly a Faust primitive. In fact, syntactically, what we're calling an input here is indistinguishable from Faust's identity function, or wire (
_
), the derivative of which is also a wire. We need a distinct expression, however, for an arbitrary signal — mic input, a soundfile, etc. — we know to be entering our program from outside, as it were, and for which we have, in principle, no analytic description. ↩ -
Actually, programmatic parameter updates are possible via Widget Modulation, but changes aren't reflected in the UI. In the interests of keeping things intuitive and visually illustrative, we won't use widget modulation here. ↩
-
We hear the signal produced by the loss function, however; there's plenty of fun to be had (see examples/broken-osc.dsp for example) in sonifying the byproducts of the learning process. ↩
-
Yes, this is a bit of an abomination, mathematically-speaking. ↩
-
https://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf ↩