Today I'll present a matrix formulation of Neil Schemenauer's back-propagation neural network code, bpnn.py, and explain how neural networks work along the way. Garden-variety Artificial Neural Networks (ANNs) are supervised learning algorithms that can learn a nonlinear decision boundary between different classes, or be used for regression. They are used for anomaly detection, time series prediction, and pattern recognition, among other applications. Deep Learning methods (not really covered in this post) have achieved state-of-the-art performance in speech recognition, topic modeling, visual pattern recognition, and more in recent years.
Individual neurons
The fundamental object in the network is an individual neuron. Each one has at least one input and one output. Each input \(x_i\) has a corresponding weighting factor \(w_i\), and the output is a nonlinear function of the weighted sum of the inputs; in the Perceptron, the function is the Indicator function (a.k.a. the Heaviside step function)
\( \theta \left(\sum_{i=1}^{N_I} w_i x_i + b \right) \)
Rather than using a separate bias term \(b\), we could place the bias constant at the beginning of the weight vector \(\textbf{w}\) (so that \(w_0 = b\)) and create a zeroth, dummy component in the input vector, \(x_0 = 1\):
\( \theta\left(\sum_{i=0}^{N_I} w_i x_i \right) = \theta \left( \textbf{w}^T \textbf{x} \right) \)
where \(\textbf{w}\) and \(\textbf{x}\) are taken to be column vectors.
An alternative nonlinear function, which we'll use from now on, is the Symmetric Sigmoid function (equivalent to the tanh function, which ranges from -1 to 1, with \(S(0) = 0\))
\( S \left( \sum_{i=0}^{N_I} w_i x_i \right) = S \left( \textbf{w}^T \textbf{x} \right) = a\left( \textbf{w}^T \textbf{x} \right) \)
where \(a\) is just a more generic name for the activation function.
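To make this concrete, here is a minimal NumPy sketch of a single tanh neuron with the bias folded into the weight vector as described above (the function name neuron_output and the sample numbers are mine, purely for illustration):

```python
import numpy as np

def neuron_output(w, x):
    """Activation of a single tanh neuron.

    w : weight vector of length I+1, with w[0] the bias weight
    x : feature vector of length I
    """
    x = np.concatenate(([1.0], x))   # prepend the dummy component x_0 = 1
    return np.tanh(np.dot(w, x))     # a(w^T x)

# A neuron with bias 0.5 and weights (1.0, -2.0) applied to a 2-feature input
w = np.array([0.5, 1.0, -2.0])
x = np.array([0.3, 0.1])
print(neuron_output(w, x))           # tanh(0.5 + 1.0*0.3 - 2.0*0.1) = tanh(0.6)
```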
tanh(x) is also known as a squashing function.
How can a neuron learn the parameters \(\textbf{w}\) from a dataset of training examples \( \left\{\textbf{x}_j,y_j | j=1,...,N \right\} \)? We can define the error for each training example to be the square of the difference between the target value \(y_j\) and the value predicted by the neuron, and add up the error from all the training examples:
\( E(\textbf{w})= \frac{1}{2}\sum_j \left( y_j - a\left( \textbf{w}^T \textbf{x}_j \right) \right)^2 = \frac{1}{2} \sum_j \left( y_j - a\left( \sum_i w_i x_{ij} \right)\right)^2 \)
where \(y_j\) is the label of the j-th training example and \(\textbf{x}_j\) are its feature values. \(E(\textbf{w})\) is called the cost function (a.k.a. the objective function). We want to minimize the cost function as a function of the parameters \(\textbf{w}\) to find the parameters that best fit our model to the data.
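As a quick sanity check, this cost can be computed in a couple of lines of NumPy; the following sketch uses made-up toy data and assumes the bias has already been folded into \(\textbf{w}\) and the feature matrix as above:

```python
import numpy as np

def cost(w, X, y):
    """E(w) = 1/2 * sum_j (y_j - tanh(w . x_j))^2.

    X : N x (I+1) matrix of training examples (first column all ones)
    y : length-N vector of labels, here coded as -1 or 1
    """
    predictions = np.tanh(X.dot(w))
    return 0.5 * np.sum((y - predictions) ** 2)

# Toy data: 3 examples with 2 features each, plus a bias column of ones
X = np.array([[1.0,  0.5, -1.2],
              [1.0, -0.3,  0.8],
              [1.0,  1.1,  0.4]])
y = np.array([1.0, -1.0, 1.0])
print(cost(np.zeros(3), X, y))   # 1.5, since tanh(0) = 0 for every example
```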
We will use the Gradient Descent method to minimize the cost: we can decrease \(E(\textbf{w})\) by repeatedly nudging \(\textbf{w}\) in the direction of the negative gradient (for a non-convex cost this may only reach a local minimum, which is often good enough in practice). Specifically, we choose to make the following step:
\( \textbf{w}_{new} = \textbf{w} - \epsilon \nabla_w E(w) \)
where \(\epsilon\) is a small constant step size factor. Under stochastic gradient descent, \(E\) has a simpler form since the descent is made one training example at a time. This potentially results in faster convergence:
\( E(\textbf{w}) = \frac{1}{2}\left( y - a\left( \sum_i w_i x_i \right) \right)^2 = \frac{1}{2}\left( y - a\left( \textbf{w}^T \textbf{x} \right) \right)^2 \)
\( \left[ \nabla_\textbf{w} E(\textbf{w}) \right]_k = \frac{\partial E}{\partial w_k} = (y-a)(-a')\left(\sum_i \delta_{ik}x_i\right) = (y-a)(-a')x_k \)

Substituting this gradient into the update rule above gives the single-neuron learning rule \( w_k^{new} = w_k + \epsilon (y-a)\, a'\, x_k \), which is the template for the network weight updates derived below.
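Putting the gradient and the update step together, a minimal stochastic gradient descent loop for a single tanh neuron might look like the sketch below (my own illustrative code, not taken from bpnn.py); recall that \( \tanh'(z) = 1 - \tanh^2(z) \):

```python
import numpy as np

def sgd_train_neuron(X, y, epsilon=0.1, epochs=200, seed=0):
    """Train a single tanh neuron with stochastic gradient descent.

    X : N x (I+1) matrix (first column all ones, for the bias)
    y : length-N vector of labels in {-1, 1}
    """
    rng = np.random.RandomState(seed)
    w = 0.1 * rng.randn(X.shape[1])              # small random initial weights
    for _ in range(epochs):
        for x_j, y_j in zip(X, y):               # one training example at a time
            a = np.tanh(np.dot(w, x_j))          # activation a(w^T x)
            a_prime = 1.0 - a ** 2               # tanh'(w^T x)
            w += epsilon * (y_j - a) * a_prime * x_j   # w_k += eps (y - a) a' x_k
    return w

# Learn the (linearly separable) AND function on inputs coded as -1 / 1
X = np.array([[1, -1, -1], [1, -1, 1], [1, 1, -1], [1, 1, 1]], dtype=float)
y = np.array([-1, -1, -1, 1], dtype=float)
w = sgd_train_neuron(X, y)
print(np.tanh(X.dot(w)))   # approaches [-1, -1, -1, 1]
```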
Neural Network
Figure from STATISTICA Help
Now consider 3 layers of neurons, comprised of:
- an input layer of \(I+1\) neurons, where \(I\) is the number of features (\(I+1\) accounts for the \(I\) features plus \(1\) bias component). Each of these neurons takes one feature value as input, applies unit weight to it, and transmits its activation value to the \(H\) nodes in the next layer.
- a hidden layer of \(H\) neurons. Each of these has a weight vector of length \(I+1\) which it applies to the input activations, and each transmits its own activation value to the \(O\) output neurons.
- an output layer of \(O\) neurons. In a simple network designed to classify an input feature vector into one of two possible classes (coded by the numbers -1 and 1), it suffices to use a single tanh output node, which returns a number between -1 and 1. (A quick check of the weight-matrix dimensions this implies follows the list.)
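As that dimension check, here is a small sketch using the same \(I\), \(H\), \(O\) names (the variable names are mine):

```python
import numpy as np

I, H, O = 4, 3, 1                  # features, hidden nodes, output nodes

rng = np.random.RandomState(0)
w_ih = 0.1 * rng.randn(I + 1, H)   # input-to-hidden weights, an (I+1) x H matrix
w_ho = 0.1 * rng.randn(H, O)       # hidden-to-output weights, an H x O matrix

x = rng.randn(I)                   # one training example's features
a_i = np.concatenate(([1.0], x))   # input activations, including the bias component

print(a_i.shape, w_ih.shape, w_ho.shape)   # (5,) (5, 3) (3, 1)
```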
Backpropagation
A popular algorithm for training a whole network of neurons is called the backpropagation algorithm. It starts with the well-defined error that we can calculate at the output nodes, which is based on the difference between the training example label (e.g. -1 or 1) and the network output values.
Compute the activation functions
Before we can update the weights, we'll need to calculate the activations that result from the current training example. This gives us \(a_i\), \(a_h\), and \(a_o\), which are the activations of the i-th input node, h-th hidden node, and o-th output node, respectively. This is a feedforward network calculation: we start with the \(a_i\)'s, which are identical to the feature vector components, then use these to compute all the \(a_h\), and finally all the \(a_o\). The activations of the input layer are defined as
\( a_i = x_i \)
The activation \(a_h\) of the h-th hidden layer node is
\( a_h = a(\sum_i w_{ih}a_i) \)
where \(w_{ih}\) is the i-th weight component of the h-th hidden layer node.
The activation \(a_o\) of the o-th output layer node is
\( a_o = a(\sum_h w_{ho}a_h) \)
where \(w_{ho}\) is the h-th weight component of the o-th output layer node.
(Each of the weight matrices is initialized with small random values.)
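In matrix form, the feedforward calculation above is just two matrix-vector products with a tanh in between; here is a minimal sketch under the same \((I+1) \times H\) and \(H \times O\) weight-matrix conventions (the function name feed_forward is mine, not bpnn.py's):

```python
import numpy as np

def feed_forward(x, w_ih, w_ho):
    """Forward pass for one training example.

    x    : feature vector of length I
    w_ih : (I+1) x H matrix of input-to-hidden weights
    w_ho : H x O matrix of hidden-to-output weights
    Returns the activations (a_i, a_h, a_o).
    """
    a_i = np.concatenate(([1.0], x))   # a_i = x_i, plus the bias component
    a_h = np.tanh(w_ih.T.dot(a_i))     # a_h = a(sum_i w_ih a_i), length H
    a_o = np.tanh(w_ho.T.dot(a_h))     # a_o = a(sum_h w_ho a_h), length O
    return a_i, a_h, a_o

# A small random 2-3-1 network
rng = np.random.RandomState(1)
w_ih = 0.1 * rng.randn(3, 3)           # (I+1) x H with I = 2, H = 3
w_ho = 0.1 * rng.randn(3, 1)           # H x O with O = 1
a_i, a_h, a_o = feed_forward(np.array([0.5, -1.0]), w_ih, w_ho)
print(a_o)                             # a single number between -1 and 1
```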
Weight updates / Weight training
For the weight update, we start by updating the weights that connect the hidden layer with the output layer: the output layer has \(O\) nodes numbered \(o = 1, \ldots, O\), each of which has a weight vector of length \(H\) that it applies to the outputs of the \(H\) hidden layer nodes. We just have to write the weight update equation for each of the \(O\) output nodes.
\( w^{new}_{ho} = w_{ho} + \epsilon [(y-a_o) a_o'] a_h = w_{ho} + \epsilon E_o a_h \)
where \(w_{ho}\) is the h-th weight of the o-th output node (a component of an \(H \times O\) matrix), and I defined \(E_o = (y - a_o)\, a_o'\). Notice that the feature vector of the current training example has been fed forward through the network until it is represented by the activations \(a_h\) of the hidden layer, which are the components of the \(H\)-dimensional activation vector.
At this point, we need a measure of the error at the hidden layer in order to update the weights of the hidden layer nodes. In the backpropagation algorithm, this error at the h-th hidden node is defined to be
\( E_h = a_h' \sum_o E_o w_{ho} \)
Why is it reasonable? It says that the errors at the output layer are propagated back to the hidden layer according to the connection strengths (weights) between those two layers, and then scaled by the slope \(a_h'\) of the hidden node's activation, just as \(E_o\) was scaled by \(a_o'\).
[As an aside, I note that backpropagation by itself isn't good enough for modern deep learning networks because the errors attenuate too much after many layers. Fortunately, however, when it comes to the simpler feedforward backpropagation network architectures, you will almost never need more than 2 hidden layers.]
The updates to the hidden layer weights are then
\( w^{new}_{ih} = w_{ih}+ \epsilon E_h a_i \)
where \(w_{ih}\) is the i-th weight going into the h-th hidden node (a component of an \((I+1) \times H\) matrix).
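The full backpropagation update for one training example can then be written with a couple of outer products. The sketch below is my own plain (no momentum) illustration of the update equations above, with \(E_o\) and \(E_h\) including the tanh derivatives:

```python
import numpy as np

def backprop_update(x, y, w_ih, w_ho, epsilon=0.1):
    """One stochastic-gradient weight update for a single training example.

    x : feature vector of length I,  y : target vector of length O
    Returns updated copies of (w_ih, w_ho); no momentum term here.
    """
    # Feedforward pass, as in the earlier sketch
    a_i = np.concatenate(([1.0], x))
    a_h = np.tanh(w_ih.T.dot(a_i))
    a_o = np.tanh(w_ho.T.dot(a_h))

    E_o = (y - a_o) * (1.0 - a_o ** 2)        # E_o = (y - a_o) a_o'
    E_h = (1.0 - a_h ** 2) * w_ho.dot(E_o)    # E_h = a_h' sum_o E_o w_ho

    w_ho = w_ho + epsilon * np.outer(a_h, E_o)   # w_ho += eps E_o a_h
    w_ih = w_ih + epsilon * np.outer(a_i, E_h)   # w_ih += eps E_h a_i
    return w_ih, w_ho

# One update step for a tiny 2-3-1 network
rng = np.random.RandomState(2)
w_ih = 0.1 * rng.randn(3, 3)
w_ho = 0.1 * rng.randn(3, 1)
w_ih, w_ho = backprop_update(np.array([0.5, -1.0]), np.array([1.0]), w_ih, w_ho)
```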
Python Code for ANN
You can see all of this implemented in Neil Schemenauer's back-propagation neural network code, bpnn.py, which also includes a momentum factor that helps prevent the algorithm from getting stuck in a local minimum. There is a built-in demo that trains a network to learn the XOR function.
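The idea behind the momentum factor is simply to add a fraction of the previous weight change to the current one, which smooths the trajectory through the cost surface; a generic sketch of the idea (not bpnn.py's exact code) looks like this:

```python
def momentum_step(w, gradient_step, prev_change, momentum=0.5):
    """Apply a gradient step plus a fraction of the previous weight change.

    gradient_step : the plain update, e.g. epsilon * E_o * a_h for one weight
    prev_change   : the change applied to this weight on the previous step
    """
    change = gradient_step + momentum * prev_change
    return w + change, change      # new weight, and the change to remember

# A single weight nudged three times by pretend gradient steps
w, prev = 0.2, 0.0
for step in (0.05, 0.04, 0.03):
    w, prev = momentum_step(w, step, prev)
print(w)
```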
I implemented a NumPy matrix version of the code, which is available on GitHub. I added some feature normalization code and an iris dataset demo which uses two tanh output nodes to make classification of all 3 classes possible. Perhaps a softmax output node would be more appropriate? (This would require some straightforward extensions to the code; see the sketch below.)
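For the softmax idea, the extension would look roughly like the sketch below: a softmax layer turns the \(O\) output sums into class probabilities, and with a cross-entropy cost the output-layer error term keeps the same simple (target minus output) form. This is my own illustration, not code from the repository:

```python
import numpy as np

def softmax(z):
    """Turn a vector of output-node sums into class probabilities."""
    e = np.exp(z - np.max(z))      # subtract the max for numerical stability
    return e / e.sum()

# For a 3-class problem (e.g. iris), the target is a one-hot vector
y = np.array([0.0, 1.0, 0.0])      # true class is the second one
z = np.array([0.2, 1.1, -0.4])     # the output nodes' weighted sums of a_h
a_o = softmax(z)

# With a cross-entropy cost, the output-layer error is simply (y - a_o),
# playing the same role as E_o does in the tanh / squared-error updates above.
E_o = y - a_o
print(a_o, E_o)
```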
Total training error (y-axis) after each training epoch (x-axis). You can see that the objective function is likely non-convex and that the momentum term was helpful in avoiding local minima and plateaus. This is from training a one-hidden-layer network to recognize handwritten digits from the MNIST dataset. A similar (smaller) handwritten digits dataset is also available through scikit-learn: from sklearn.datasets import load_digits; digits = load_digits()
An existing neural network library in Python is Pybrain. It offers a nice amount of flexibility, allowing you to define the activation function in each layer (linear, sigmoid, tanh, softmax, 'long short-term memory', etc.), and you can specify the amount of connectivity between layers, momentum, weight decay, and more.

I trained a 2-hidden-layer network in Pybrain to find a nonlinear decision boundary between two classes (red, blue) of training points after (Left) 3 epochs and (Right) 24 epochs. The stochastic nature of the training can be credited with finding a roughly circular decision boundary after just 3 passes through the training data.
Here I trained a 2-hidden-layer network in Pybrain with one softmax output to classify these 3 Gaussian blobs of training points. The predictions were thresholded in this figure, but the probabilities for each class are output by the network.
There is a Python package being developed by the Deep Learning community called Theano. By sending the matrix calculations to your GPU you can get better than a 100x speedup when you train the network -- a must for anyone doing work in this area. It's really nice that an easy GPU solution became available relatively quickly after GPUs were first used in research.