The typical neuron consists of a cell body, a nucleus, the axon and axon terminals, and dendrites. The dendrites split into complex trees and receive signals from other neurons. Signals are sent through the axon terminals to other neurons.
Neurons are found throughout the nervous system (brain, spinal cord, periphery) but are most concentrated in the brain. A human brain has about 100 billion neurons. Each neuron has, on average, about 7000 connections to other neurons.
This wouldn’t be particularly remarkable except for the generally-accepted theory that the brain is an “information processing machine.” That is to say, it is computing something with all those neurons. The question is, then, can properties of the brain be the reason humans are intelligent, and can those properties be used in systems we create?
The most intriguing property of the brain, apart from its massive connectivity, is that each neuron is more-or-less doing the same thing. Most neurons “activate” (discharges an electrical signal down the axon) based on whether or not neurons connected to it have recently activated themselves. Each connection to another neuron has a certain strength, so that if the connected neuron activates, a signal of a certain strength is received. The signals received add up, and may cause the receiving neuron itself to activate.
The signals have to add up in a certain time frame, otherwise they will just dissipate. If the receiving neuron does activate, it completely discharges. It is an “all or nothing” event. More received electrical charge does not create greater discharges; the amount of received charge only determines if an activation will occur.
Some connections among neurons are excitatory, which increase the chance of activation by the receiving neuron, and inhibitory which do the opposite.
The connections among neurons and their strengths are not hard-wired. “Learning” is essentially the process of changing connections (“synaptic pruning”) and their strengths (“synaptic plasiticity”).
One important theory about how these changes come about is the Hebbian learning rule (from Donald Hebb, 1949). It attempts to explain how connections strengthen:
Let us assume that the persistence or repetition of a reverberatory activity (or “trace”) tends to induce lasting cellular changes that add to its stability. […] When an axon of cell A is near enough to excite a cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells such that A’s efficiency, as one of the cells firing B, is increased.
This is often summarized as: “Cells that fire together, wire together.” The theory would basically explain associative memory: if you repeatedly hear the birthday song while you are simultaneously looking at a birthday cake, eventually whenever you see or think about a birthday cake, you are reminded of the song.
We want to build systems that exhibit some properties of the brain. These properties are:
- Simple, nearly-identical components (neurons)
- Massively-parallel computation
- A learning mechanism
These properties are important for several reasons. Systems built out of many similar (well, let’s just say “identical”) components may be more robust if we can be certain the design of the identical components is a good design. Massively-parallel computation promises that very complex tasks can be done in very little time; accomplishing this is still just pie in the sky for most systems built by humans. And if we have access to a reliable learning mechanisms, systems do not so much have to be designed but rather “trained.” Perhaps we can build systems that begin like children but learn, on their own, to be highly capable. Perhaps we can just throw together a bunch of simple “neurons,” show the system lots of training examples, and the system will do the rest of the work.
The first attempt was by Frank Rosenblatt in 1957. He designed the “perceptron,” a computational unit that can be described by the following diagram.
The perceptron is analogous to a single neuron. It receives inputs
from
\begin{equation}
f\left(∑i=1^k xi wji + wj0\right) =
\begin{cases}
0 & \text{if sum } < 0.5
1 & \text{if sum } \geq 0.5
\end{cases}
\end{equation}
Each perceptron may be connected to others, so that a perceptron’s output can be other perceptrons’ inputs. However, it turns out that the complexity of the network greatly affects how learning will work.
The first property to notice about perceptrons is that a single perceptron can compute simple logic functions, assuming we have binary inputs. For example,
Outputs of perceptrons can become inputs to other perceptrons, allowing us to build all possible logic circuits. Thus, perceptrons are capable of general computation, putting them in the large class of fantastically different systems that are Turing-complete. A Turing-complete system is able to compute anything that can be computed; examples include electrical logic gates (i.e., a typical computer), the Lambda calculus, cellular automata, and now neural networks.
But can a complex network learn? When we say “learn,” we mean: can the weights be assigned via a training process rather than assigned manually? This would be ideal: just put together a bunch of perceptrons, give it lots of examples (e.g., examples of prime numbers and composite numbers), and let it figure out how to set the weights so that it turns into a machine that computes that function (e.g., the function to determine if a number is prime or not).
Look again at the definition of the threshold function:
\begin{equation}
f\left(∑i=1^k xi wji + wj0\right) =
\begin{cases}
0 & \text{if sum } < 0.5
1 & \text{if sum } \geq 0.5
\end{cases}
\end{equation}
Suppose we want to train a single perceptron. We want that function to be 1 for positive examples and 0 for negative examples. This means we want the sum of the inputs to be at least 0.5 for positive examples, and less for negative examples.
The sum is a linear function. It can be “trained” using typical
methods that work on linear functions. One such method is gradient
descent (or ascent), a.k.a. hill-climbing. Normally, with gradient
descent, we compute the derivative (gradient) of the objective
function (the threshold function
Let
where $w’ij$ is the new weight and
A single perceptron can have many inputs; perhaps thousands. What kinds of functions can it learn? Recall the sum of the weighted inputs is a linear function. The threshold function just says “yes” or “no” based on this sum. Suppose there are two inputs: then the weighted sum describes a line in 2D. The threshold function just says “yes” if an input vector is above (or below) the line, or exactly on the line, and “no” otherwise.
Here is a diagram of this threshold line:
The answer is, therefore, that a perceptron can learn any “linearly separable” function. If the positive examples and negative examples can be plotted (possibly in higher dimensions), and there exists a line/plane that separates the two classes, then a perceptron can learn that line/plane. We know this because the line/plane can be described by a weighted sum of the inputs. This fact is why you may find perceptrons described in terms of “computational geometry” — they are effectively learning geometric functions.
It turns out that not all functions are linearly separable. The simplest case is the XOR function. You can plot it like so:
There is simply no line that separates the positive and negative examples of the XOR function. A perceptron cannot learn this function.
This is such a basic function that, when Minsky and Papert proved in 1969 that XOR could not be learned, perceptrons took a huge hit. Research in neural networks did not fully recover until the 1980s, when a new learning mechanism was discovered.
Putting lots of perceptrons together in a single-layer network changes nothing. They all learn independently. (So XOR still cannot be learned.) However, we can get more than 0/1 as an output. We can get any number of outputs we wish.
These networks are “fully connected” and “feed-forward” — fully connected because each input connects to every perceptron, and feed-forward because there are no loops. “Recurrent” networks have loops, where the output of some perceptron becomes the input to another perceptron earlier in the chain, or an input to itself. Recurrent networks are decidedly more complex (they can be dynamic non-linear systems) but are able to keep memories and other such interesting features.
When the function being learned is not linearly separable, the threshold function causes some oscillating. This is because points near the separator are continually changing classifications as the weights update and move the separator.
We also may desire probabilistic answers rather than yes/no answers. If the data cannot actually be separated, then classifications near boundary should be less confident than classifications far from the boundary. It would be nice if the perceptrons can report this confidence as a classification probability.
The threshold function we saw cannot do this for us. We need a continuous function that looks the same but gives graded answers.
A good choice is the logistic function:
where $s_j = ∑i=1^k xi wij + w0j$. This is what it looks like:
Standard logistic curve (a.k.a. sigmoid function), from Wikipedia
Since we no longer have binary outputs, but rather probabilistic ones, we need to change our evaluation function. An output is no longer either correct or incorrect. If the true answer is 1.0, and the logistic function gives 0.79, then we say the error is 0.21. For the sake of easy derivatives, we’ll go further and say the error, now called the “loss,” is the square of the difference:
where
Our learning algorithm needs to be revised. Luckily, we can just work backwards from the loss function and attempt to walk up/down the gradient. Since the logistic function and the loss function are differentiable, this gradient is easy to find for each weight separately.
\begin{eqnarray}
\frac{∂ L_j}{∂ wij}
&=& \frac{∂}{∂ wij} (d_j - f(s_j))^2
&=& 2 (d_j - f(s_j)) \frac{∂}{∂ wij} (d_j - f(s_j)) \qquad \text{chain rule} \
&=& -2 (d_j - f(s_j)) f’(s_j) \frac{∂}{∂ wij} s_j \qquad \text{chain rule}\
&=& -2 (d_j - f(s_j)) f’(s_j) \frac{∂}{∂ wij} ∑i’=1^k xi’ wij’ + wj0 \qquad \text{expand } s_j\
&=& -2 (d_j - f(s_j)) f’(s_j) xi \qquad \text{picks out } xi wij \text{, whose derivative is } xi
\end{eqnarray}
Note that
We want to update our weights in the opposite direction of the loss
gradient. So our weight update is the following, with
As an example, consider the task of classifying handwritten digits with classes 0-9. Here are some examples (intentionally pixelated to reduce the input space, i.e., the number of pixels):
For convenience later, we define the output node error gradient as:
And the hidden node error gradient (contribution of error of next node
And the total error on hidden node
i.e., the sum across all nodes
For an output node
\begin{eqnarray}
\frac{∂ L_k}{∂ wjk}
&=& -2(d_k - p_k) \frac{∂ p_k}{∂ wjk}
&=& -2(d_k - p_k) \frac{∂ f(s_k)}{∂ wjk} \
&=& -2(d_k - p_k) f’(s_k) \frac{∂ s_k}{∂ wjk} \
&=& -2(d_k - p_k) f’(s_k) \frac{∂}{∂ wjk}\Big( ∑_j
wjk p_j + w0k \Big) \
&=& -2(d_k - p_k) f’(s_k) p_j \
&=& p_j Δ_k
\end{eqnarray}
For a hidden node
\begin{eqnarray}
\frac{∂ L_k}{∂ wij}
&=& -2(d_k - p_k) \frac{∂ p_k}{∂ wij}
&=& -2(d_k - p_k) \frac{∂ f(s_k)}{∂ wij} \
&=& -2(d_k - p_k) f’(s_k) \frac{∂ s_k}{∂ wij} \
&=& Δ_k \frac{∂}{∂ wij} \Big( ∑_j wjk p_j + w0k \Big) \
&=& Δ_k wjk \frac{∂ p_j}{∂ wij} \
&=& Δ_k wjk \frac{∂ f(s_j)}{∂ wij} \
&=& Δ_k wjk f’(s_j) \frac{∂ s_j}{∂ wij} \
&=& Δ_k wjk f’(s_j) \frac{∂}{∂ wij} \Big( ∑_i wij p_i + w0j \Big) \
&=& Δ_k wjk f’(s_j) p_i \
&=& p_i Δjk
\end{eqnarray}
If
For input hidden node,
=== Evaluation on test split === === Summary === Correctly Classified Instances 551 98.0427 % Incorrectly Classified Instances 11 1.9573 % Kappa statistic 0.9782 Mean absolute error 0.0058 Root mean squared error 0.0612 Relative absolute error 3.2483 % Root relative squared error 20.4131 % Total Number of Instances 562 === Detailed Accuracy By Class === TP Rate FP Rate Precision Recall F-Measure ROC Area Class 0.962 0 1 0.962 0.981 0.999 0 0.957 0.002 0.978 0.957 0.968 0.998 1 1 0.002 0.982 1 0.991 1 2 0.939 0.004 0.969 0.939 0.954 1 3 1 0.002 0.984 1 0.992 1 4 0.982 0.008 0.931 0.982 0.956 0.999 5 1 0.002 0.982 1 0.991 1 6 0.984 0.002 0.984 0.984 0.984 1 7 0.983 0 1 0.983 0.991 0.999 8 1 0 1 1 1 1 9 Weighted Avg. 0.98 0.002 0.981 0.98 0.98 1 === Confusion Matrix === a b c d e f g h i j <-- classified as 51 0 1 0 0 0 1 0 0 0 | a = 0 0 45 0 0 1 0 0 1 0 0 | b = 1 0 0 55 0 0 0 0 0 0 0 | c = 2 0 0 0 62 0 4 0 0 0 0 | d = 3 0 0 0 0 63 0 0 0 0 0 | e = 4 0 0 0 1 0 54 0 0 0 0 | f = 5 0 0 0 0 0 0 56 0 0 0 | g = 6 0 0 0 1 0 0 0 62 0 0 | h = 7 0 1 0 0 0 0 0 0 58 0 | i = 8 0 0 0 0 0 0 0 0 0 45 | j = 9
Like a biologist taking a scalpel to a rat, he can ablate his network, removing a few cells from the simulation. These ”lesions” might cause NetTalk to act a little fuzzy, but it retains its general ability to pronounce English.
And so it is with humans. If one cuts a few wires in a conventional computer the whole system will crash. But brains have the kind of robustness demonstrated by Dr. Sejnowski’s invention. — NY Times, 1988
YouTube (38min - 49min)