Long short-term memory (LSTM) neural networks are a state-of-the-art tool for modeling sequences (with well-publicized results in speech recognition, speech synthesis, music synthesis, language translation, nonlinear time series forecasting, etc.). They're different from ordinary neural networks in that they have a state vector (or memory) which is continually modified as each new input is presented to the network.
The innovation of the LSTM is the recurrent state vector and the rules for modifying it. The state places the current input in context, and it can remember patterns from the distant past. The state is mutated with each new input according to a couple of simple rules; the parameters of the rules are themselves learned from the training data.
Quantitatively, the basic LSTM cell is described by the following equations. It is like a standard neural network layer, but it has a state, and the output depends on both the previous state and the current input. Also, some of the information flows are masked by gates whose parameters are learned during training.
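In the notation defined in the bullets below (one standard formulation; sigma is the logistic sigmoid, * is element-wise multiplication, and the candidate update gets its own weights W^u, b^u):

g^i = sigma(W^i [h_{t-1}, i] + b^i)
g^f = sigma(W^f [h_{t-1}, i] + b^f)
g^o = sigma(W^o [h_{t-1}, i] + b^o)
h_t = g^f * h_{t-1} + g^i * tanh(W^u [h_{t-1}, i] + b^u)
o = g^o * tanh(h_t)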
- g^i, g^f, and g^o are the input gate, forget gate, and output gate, respectively. They have corresponding weight matrices W and bias vectors b.
- h_{t-1} is the previous state vector; h_t is the current state vector, which has been updated by the current input. The updated state vector is the previous state multiplied element-wise with the forget gate; we add to this an "update" which is run through the nonlinearity (typically tanh) of the cell. The update is multiplied by the input gate g^i.
- [h_{t-1}, i] is the previous state vector concatenated with the current input feature vector i
- The output o of the LSTM cell is the output gate g^o multiplied by tanh of the updated state.
- All gates, the state vector, and the output o have the same dimensionality N, the number of neurons (a hyperparameter)
- Note that an additional layer/nonlinearity should be added after the previously mentioned "output" to match the problem at hand. For example, if we're predicting which of 10 classes comes next in the sequence, the final output layer is a softmax with 10 outputs (see the sketch after this list).
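Here's a minimal numpy sketch of one cell step plus the softmax output layer, following the equations above (the function and variable names are my own, not from any particular library):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_params(n_input, n_neurons):
    """One (W, b) pair each for the input gate, forget gate, output gate, and update.
    Each W maps the concatenated [h_{t-1}, i] (length n_neurons + n_input) to n_neurons."""
    def wb():
        return (rng.normal(scale=0.1, size=(n_neurons, n_neurons + n_input)),
                np.zeros(n_neurons))
    return [p for _ in range(4) for p in wb()]

def lstm_step(x, h_prev, params):
    """One LSTM cell step: returns (updated state h_t, cell output o_t)."""
    W_i, b_i, W_f, b_f, W_o, b_o, W_u, b_u = params
    z = np.concatenate([h_prev, x])      # [h_{t-1}, i]: previous state ++ current input
    g_i = sigmoid(W_i @ z + b_i)         # input gate
    g_f = sigmoid(W_f @ z + b_f)         # forget gate
    g_o = sigmoid(W_o @ z + b_o)         # output gate
    update = np.tanh(W_u @ z + b_u)      # candidate update, run through the cell nonlinearity
    h_t = g_f * h_prev + g_i * update    # forget part of the old state, write part of the update
    o_t = g_o * np.tanh(h_t)             # gated output of the cell
    return h_t, o_t

# Example: N = 32 neurons, D = 8 input features, predicting which of 10 classes comes next.
N, D = 32, 8
params = init_params(D, N)
W_out = rng.normal(scale=0.1, size=(10, N))   # final softmax layer with 10 outputs
b_out = np.zeros(10)

h = np.zeros(N)                  # initial state
x = np.ones(D)                   # one input vector
h, o = lstm_step(x, h, params)
logits = W_out @ o + b_out
probs = np.exp(logits - logits.max())
probs /= probs.sum()             # softmax over the 10 classes
```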
Deep LSTM
For a deep LSTM, simply feed the output of the first LSTM cell into another LSTM cell (which has its own state and weights), and so on. All of these LSTM cells operate within the same time step. Put a final output layer at the end.
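As a sketch, a two-layer deep LSTM at a single time step might look like this, reusing lstm_step, init_params, and the output layer from the sketch above (again, the names are my own):

```python
# Two stacked LSTM cells at a single time step: the first cell's output feeds the second.
# Each layer has its own state vector and its own weights.
params1 = init_params(D, N)      # layer 1 sees the raw input (D features)
params2 = init_params(N, N)      # layer 2 sees layer 1's output (N features)
h1, h2 = np.zeros(N), np.zeros(N)

h1, o1 = lstm_step(x, h1, params1)    # first LSTM cell
h2, o2 = lstm_step(o1, h2, params2)   # second LSTM cell, fed by the first
logits = W_out @ o2 + b_out           # final output layer on top of the last cell
```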
Backpropagation Through Time
In training, we use a sliding window over the dataset. Note that in the above equations, the weight matrices and bias vectors do not depend on the sequence index t. These weights are referred to as tied weights, because they're the same for all times in this window. Suppose we have a sliding window containing 10 steps in the sequence. When we "unfold" the LSTM computational graph in time, each of the 10 inputs will feed into the LSTM as it exists at that point in time. All of these cells have the same weights W and b, but the state vector mutates as information flows forward in time through the graph. As we backpropagate the error back in time, we average the per-step weight updates over time to compute the overall weight update.
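In code, unfolding over the window is just a loop that reuses the same parameters at every step (a sketch reusing lstm_step and params from above; `sequence` and `t` here stand in for a hypothetical array of input vectors and a window start index):

```python
# Unfold the LSTM over a sliding window of 10 steps.
# The same params are reused at every step (tied weights), while the state h
# mutates as information flows forward through the window.
window = sequence[t : t + 10]    # 10 consecutive input vectors from the dataset
h = np.zeros(N)                  # state at the start of the window
outputs = []
for x_step in window:
    h, o = lstm_step(x_step, h, params)
    outputs.append(o)

# Backpropagating the error through this unrolled graph yields one gradient
# contribution to the shared W and b from each of the 10 steps; averaging
# those per-step contributions over time gives the overall weight update.
```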
More later!