@OneAdder
Collaborator

@OneAdder OneAdder commented Feb 24, 2025

Fully-Connected Layer for 2D Shapes

Also known as MLP, FeedForward, etc. A common component of neural networks, including transformers. The idea is very simple: first linear transformation => activation => second linear transformation.

This is the last piece of the transformer architecture.
Once #203, #205 and this one are merged, we can start adding transformer encoders and decoders.

Python reference: https://github.com/OneAdder/neural-fortran-references/blob/main/fc2d_layer.py

Problem

The softmax derivative here is incorrect. The current implementation is actually the derivative of the logistic function, which is not equivalent to the derivative of softmax.
The derivative of softmax w.r.t. each element of the input requires computing the Jacobian matrix:

$jacobian_{i, j} = \pmatrix{\frac{dsoftmax_1}{dx_1} & \dots & \frac{dsoftmax_1}{dx_j} \cr \dots & \dots & \dots \cr \frac{dsoftmax_i}{dx_1} & \dots & \frac{dsoftmax_i}{dx_j} }$
$\frac{dsoftmax}{dx} = gradient \times jacobian$

Where:

  • $\frac{dsoftmax_i}{dx_j} = softmax(x_j) \cdot (\alpha - softmax(x_i))$ where $\alpha$ is $1$ for $i = j$ and $0$ otherwise

  • $x$ is the input sequence

Similar to my implementation for MultiHead Attention here.
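
To make the difference concrete, here is a minimal pure-Python sketch of the Jacobian-based backward pass next to the elementwise logistic-style derivative. It is only an illustration under assumptions: the function names are hypothetical, it is not the neural-fortran code, and it handles a single 1D input vector.

```python
import math


def softmax(x):
    """Numerically stable softmax of a 1D list of floats."""
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_jacobian(x):
    """jacobian[i][j] = softmax(x_j) * (alpha - softmax(x_i)), alpha = 1 if i == j else 0."""
    s = softmax(x)
    n = len(s)
    return [
        [s[j] * ((1.0 if i == j else 0.0) - s[i]) for j in range(n)]
        for i in range(n)
    ]


def softmax_backward(x, gradient):
    """Upstream gradient (row vector) times the Jacobian."""
    jac = softmax_jacobian(x)
    n = len(x)
    return [sum(gradient[i] * jac[i][j] for i in range(n)) for j in range(n)]


def logistic_prime_backward(x, gradient):
    """Elementwise s * (1 - s): only the diagonal of the Jacobian, so it drops the cross terms."""
    s = softmax(x)
    return [g * si * (1.0 - si) for g, si in zip(gradient, s)]
```

The elementwise version matches only the diagonal of the Jacobian, so the two functions agree only when the off-diagonal terms contribute nothing to the product.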

Possible Solutions

This is not easy to resolve because activation_function doesn't accept the input, so the options are:

  • Do nothing: I added a stopgap that throws an error when softmax is passed as the activation
  • Make softmax a layer without parameters rather than an activation function; this will work
  • Make a wrapper activation_layer that extends base_layer and accepts an activation function

@jvdp1
Collaborator

jvdp1 commented Feb 25, 2025

@OneAdder Please forgive my ignorance here. Could you please clarify the distinction between the fc2d layer and the dense layer?

@OneAdder
Collaborator Author

@jvdp1 In practice, the terms are not particularly well defined. This kind of layer is also sometimes called dense. The mathematical distinction is that dense in neural-fortran is linear transformation => activation, while my fc2d is linear transformation => activation => linear transformation. In theory, that is the same as dense(some_activation) => dense(linear_activation).
The key difference is from the software development perspective: fc2d works with 2D shapes, while dense can't handle them.
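
The equivalence described above can be sketched in pure Python. Everything here is a hypothetical illustration: `dense2d`, `fc2d` and the helpers are toy functions on nested lists, not the neural-fortran API, and the input is assumed to have shape (sequence_length, features).

```python
def matmul(a, b):
    """Multiply an (n, k) matrix by a (k, m) matrix, both given as nested lists."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)] for row in a]


def add_bias(m, bias):
    """Add a bias vector to every row of a 2D matrix."""
    return [[v + b for v, b in zip(row, bias)] for row in m]


def relu(m):
    """Example non-linear activation applied elementwise."""
    return [[max(0.0, v) for v in row] for row in m]


def linear(m):
    """Linear (identity) activation."""
    return m


def dense2d(x, weights, bias, activation):
    """linear transformation => activation, applied to a 2D input."""
    return activation(add_bias(matmul(x, weights), bias))


def fc2d(x, w1, b1, w2, b2, activation):
    """linear transformation => activation => linear transformation."""
    hidden = activation(add_bias(matmul(x, w1), b1))
    return add_bias(matmul(hidden, w2), b2)
```

With the same weights, `fc2d(x, w1, b1, w2, b2, relu)` and `dense2d(dense2d(x, w1, b1, relu), w2, b2, linear)` perform the same operations in the same order.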

@milancurcic milancurcic self-requested a review March 19, 2025 20:58
@milancurcic
Member

milancurcic commented Mar 19, 2025

Thanks @OneAdder for starting this. From your explanation I understand what this does.

Rather than introducing a composition of multiple operations as a single layer, I suggest that we build a basic building block first, and then if needed, we can add a "shallow-wrapper" layer around those elementary layers.

Specifically, rather than introducing here a new layer that does "first linear transformation => activation => second linear transformation", I suggest we simply introduce a dense2d layer which is the same as dense but that works on 2-d inputs.

Then, the operation proposed here would be: dense2d(activation) => dense2d(linear). We already have a linear activation function which allows using existing dense layers as linear layers. What do you think?

And thanks for pointing out the incorrect softmax derivative. I don't even recall how and why I did that.

@OneAdder
Collaborator Author

@milancurcic It makes sense. I can do it. Should we merge this and then refactor it or the other way around?
BTW, I think we should actually make a consistent API for combined layers. Something along these lines: combined_layer extends base_layer and implements the get and set methods for params and gradients, which point to the params and gradients of the layers that make up the combined layer. Concrete combined layers then extend combined_layer.
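
A rough sketch of that idea, written in Python only to keep it short. All class and method names below are hypothetical stand-ins for illustration, not the existing neural-fortran types, and params are assumed to be flat lists.

```python
class BaseLayer:
    """Hypothetical stand-in for base_layer."""

    def get_params(self):
        raise NotImplementedError

    def set_params(self, params):
        raise NotImplementedError

    def get_gradients(self):
        raise NotImplementedError


class ToyLayer(BaseLayer):
    """Elementary layer that owns a flat list of params and gradients."""

    def __init__(self, params):
        self.params = list(params)
        self.gradients = [0.0] * len(self.params)

    def get_params(self):
        return list(self.params)

    def set_params(self, params):
        if len(params) != len(self.params):
            raise ValueError("wrong number of params")
        self.params = list(params)

    def get_gradients(self):
        return list(self.gradients)


class CombinedLayer(BaseLayer):
    """Delegates params and gradients to the layers it is made of."""

    def __init__(self, layers):
        self.layers = list(layers)

    def get_params(self):
        return [p for layer in self.layers for p in layer.get_params()]

    def set_params(self, params):
        sizes = [len(layer.get_params()) for layer in self.layers]
        if sum(sizes) != len(params):
            raise ValueError("wrong number of params")
        offset = 0
        for layer, size in zip(self.layers, sizes):
            # Hand each sub-layer its own slice of the flat parameter list.
            layer.set_params(params[offset:offset + size])
            offset += size

    def get_gradients(self):
        return [g for layer in self.layers for g in layer.get_gradients()]
```

A concrete combined layer would then subclass `CombinedLayer` and only define which sub-layers it contains and how data flows between them.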

@milancurcic
Member

milancurcic commented Mar 26, 2025

Thanks, @OneAdder. If you agree, I suggest that here we simply provide a 2-d version of an existing dense layer (I suggest dense2d) which accepts an activation function as well as a linear activation as a special case.

Good ideas for combined_layer but let's discuss it in a separate issue. I opened #217.

@milancurcic
Member

Actually, since we already have linear2d, should we just refactor it to accept an activation function and thus call it dense2d? Then, creating dense2d(..., activation="linear") would give us linear2d.

@OneAdder
Collaborator Author

@milancurcic on it
