Thursday, December 1, 2016

Long Short-Term Memory Neural Networks

Long short-term memory (LSTM) neural networks are a state-of-the-art tool for modeling sequences (with well-publicized results in speech recognition, speech synthesis, music synthesis, language translation, nonlinear time series forecasting, etc.). They differ from ordinary neural networks in that they have a state vector (or memory) which is continually modified as each new input is presented to the network.

The innovation of the LSTM is the recurrent state vector and the rules for modifying it. The state places the current input in context, and it can remember patterns from the distant past. The state is mutated with each new input according to a couple of simple rules; the parameters of the rules are themselves learned from the training data.
Quantitatively, the basic LSTM cell is described by the following equations (a NumPy sketch of a single forward step follows the list below). It is like a standard neural network layer, but it has a state, and the output depends on both the previous state and the current input. Also, some of the information flows are masked by gates whose parameters are learned during training.
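In one standard formulation, consistent with the description in the list below (\(\sigma\) is the logistic sigmoid, \(\odot\) denotes element-wise multiplication, and \(u\) is the candidate update):

\( g^i = \sigma\left( W^i [h_{t-1},i] + b^i \right) \)

\( g^f = \sigma\left( W^f [h_{t-1},i] + b^f \right) \)

\( g^o = \sigma\left( W^o [h_{t-1},i] + b^o \right) \)

\( u = \tanh\left( W^u [h_{t-1},i] + b^u \right) \)

\( h_t = g^f \odot h_{t-1} + g^i \odot u \)

\( o = g^o \odot \tanh\left( h_t \right) \)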


  • g^i, g^f, and g^o are the input gate, forget gate, and output gate, respectively. They have corresponding weight matrices W and bias factors b.
  • h_{t-1} is the previous state vector; h_t is the current state vector, which has been updated by the current input. The updated state vector is the previous state multiplied element-wise by the forget gate; we add to this an "update" which is run through the nonlinearity (typically tanh) of the cell. The update is multiplied by the input gate g^i.
  • [h_{t-1},i] is the previous state vector concatenated to the current input feature vector.
  • The output o of the LSTM cell is the output gate g^o multiplied by tanh of the updated state.
  • All gates, the state vector, and the output o have the same dimensionality N, which is the number of neurons, a hyperparameter.
  • Note that an additional layer/nonlinearity should be added after the previously mentioned "output" to match the problem at hand. For example if we're predicting which of 10 classes comes next in the sequence, a softmax with 10 outputs is the final output layer.
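As a concrete illustration, here is a minimal NumPy sketch of a single forward step of the cell just described (the names are mine, and the weights are random rather than trained):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, W, b):
    # One forward step of the cell. W and b hold a weight matrix and bias
    # vector for each of the gates 'i', 'f', 'o' and the candidate update 'u'.
    z = np.concatenate([h_prev, x])        # [h_{t-1}, i]
    g_i = sigmoid(W['i'] @ z + b['i'])     # input gate
    g_f = sigmoid(W['f'] @ z + b['f'])     # forget gate
    g_o = sigmoid(W['o'] @ z + b['o'])     # output gate
    u = np.tanh(W['u'] @ z + b['u'])       # candidate update, through the tanh nonlinearity
    h = g_f * h_prev + g_i * u             # updated state vector
    o = g_o * np.tanh(h)                   # output of the cell
    return h, o

# Toy usage: N = 4 neurons, D = 3 input features, random (untrained) parameters
N, D = 4, 3
rng = np.random.default_rng(0)
W = {k: 0.1 * rng.standard_normal((N, N + D)) for k in 'ifou'}
b = {k: np.zeros(N) for k in 'ifou'}
h, o = lstm_step(rng.standard_normal(D), np.zeros(N), W, b)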

Deep LSTM


For a deep LSTM, simply feed the output of the first LSTM cell into another LSTM cell (which has its own state and weights), and so on. All of these LSTM cells operate within the same time step. Put a final output layer at the end.
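For example, in a high-level library the stacking is typically one line per layer. A sketch using Keras (assuming TensorFlow's bundled Keras; the layer sizes, window length, and 10-class softmax output are illustrative):

import tensorflow as tf

timesteps, features = 10, 3   # e.g. a sliding window of 10 steps of 3 features

# Each LSTM layer keeps its own weights and state; return_sequences=True passes
# the first cell's output at every time step on to the second cell.
model = tf.keras.Sequential([
    tf.keras.layers.LSTM(32, return_sequences=True, input_shape=(timesteps, features)),
    tf.keras.layers.LSTM(32),
    tf.keras.layers.Dense(10, activation="softmax"),   # final output layer
])
model.compile(loss="categorical_crossentropy", optimizer="adam")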

Backpropagation Through Time


In training, we use a sliding window over the dataset. Note that in the above equations, the weight and bias matrices do not depend on the sequence index t. These weights are referred to as tied weights, because they are the same for all times in this window. Suppose we have a sliding window containing 10 steps in the sequence. When we "unfold" the LSTM computational graph in time, each of the 10 inputs feeds into the LSTM as it exists at the 10 different points in time. All of these cells share the same weights W and b, but the state vector mutates as information flows forward in time through the graph. As we backpropagate the error backward through time, we average the per-time-step weight updates to compute the overall weight update.
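A small sketch of the sliding-window setup described above (the window length of 10 matches the example; the unrolling and gradient averaging are handled by whatever framework trains the LSTM, so this only prepares the windows):

import numpy as np

def sliding_windows(series, window=10):
    # Split a 1-D series into (input window, next value) training pairs.
    # Every window is processed by the same tied LSTM weights; backpropagation
    # through time then combines each weight's gradient contributions from all
    # of the unrolled steps.
    X, y = [], []
    for start in range(len(series) - window):
        X.append(series[start:start + window])
        y.append(series[start + window])
    return np.array(X), np.array(y)

# Toy example: windows over a sine wave
X, y = sliding_windows(np.sin(np.linspace(0, 20, 200)), window=10)
print(X.shape, y.shape)   # (190, 10) (190,)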

More later!

Wednesday, January 29, 2014

Artificial Neural Networks

Today I'll write a matrix formulation of Neil Schemenauer's back-propagation neural network code, bpnn.py, and write about how neural networks work along the way. Garden variety Artificial Neural Networks (ANNs) are supervised learning algorithms that can learn a nonlinear decision boundary between different classes, or be used for regression. They are used for anomaly detection, time series prediction, and pattern recognition among other applications. Deep Learning methods (not really covered in this post) have achieved state-of-the-art performance in speech recognition, topic modeling, visual pattern recognition, and more in recent years.

Individual neurons

The fundamental object in the network is an individual neuron. Each one has at least one input, and one output. Each input \(x_i\) has a corresponding weighting factor \(w_i\), and the output is a nonlinear function of the weighted sum of the inputs; in the Perceptron, the function is the indicator function (a.k.a. Heaviside step function)



\( \theta \left(\sum_{i=1}^{N_I} w_i x_i + b \right) \)

Rather than using a bias term \(b\), we could have placed the bias constant at the beginning of the weights vector \(w\) and created a zeroth, dummy component in the input vector \(x_0=1\):

\( \theta\left(\sum_{i=0}^{N_I} w_i x_i \right) = \theta \left( \textbf{w}^T \textbf{x} \right) \)

where \(\textbf{w}\) and \(\textbf{x}\) are taken to be column vectors. An alternative nonlinear function, which we'll use from now on, is the Symmetric Sigmoid function (equivalent to the tanh function, which ranges from -1 to 1, with S=0 when x=0)

\( S \left( \sum_{i=0}^{N_I} w_i x_i \right) = S \left( \textbf{w}^T \textbf{x} \right) = a\left( \textbf{w}^T \textbf{x} \right) \)

where \(a\) is just a more generic name for the activation function.


tanh(x) is also known as a squashing function.

How can a neuron learn the parameters \(\textbf{w}\) from a dataset of training examples \( \left\{\textbf{x}_j,y_j | j=1,...,N \right\} \)? We can define the error for each training example to be the square of the difference between the target value \(y_j\) and the value predicted by the neuron, and add up the error from all the training examples:


\( E(\textbf{w})= \frac{1}{2}\sum_j \left( y_j - a\left( \textbf{w}^T \textbf{x}_j \right) \right)^2 = \frac{1}{2} \sum_j \left( y_j - a\left( \sum_i w_i x_{ij} \right)\right)^2 \)

where \(y_j\) is the label of the j-th training example and its feature values are \(\textbf{x}_j\). \(E(\textbf{w})\) is called the cost function, aka objective function. We want to minimize the cost function as a function of the parameters \(\textbf{w}\) to find the parameters that best fit our model to the data.

We will use the Gradient Descent method to minimize the cost: we can move toward a minimum of \(E(\textbf{w})\) by repeatedly nudging \(\textbf{w}\) in the direction of the negative gradient. Specifically, we choose to make the following step:

\( \textbf{w}_{new} = \textbf{w} - \epsilon \nabla_w E(w) \)

where \(\epsilon\) is a small constant step size factor. Under stochastic gradient descent, \(E\) has a simpler form since the descent is made one training example at a time. This potentially results in faster convergence:

\( E(\textbf{w}) = \frac{1}{2}\left( y - a\left( \sum_i w_i x_i \right) \right) ^2\)

\( \left[ \nabla_\textbf{w} E(\textbf{w}) \right]_k = \frac{\partial E}{\partial w_k} = (y-a)(-a')(\sum_i \delta_{ik}x_i) = (y-a)(-a')x_k \)
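As a concrete illustration, here is a minimal NumPy sketch of this stochastic update for a single tanh neuron (the function name, learning rate, and toy dataset are illustrative):

import numpy as np

def sgd_train_neuron(X, y, epochs=100, eps=0.1, seed=0):
    # Train one tanh neuron with stochastic gradient descent.
    # X: (N, I) feature matrix, y: (N,) targets in [-1, 1].
    rng = np.random.default_rng(seed)
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])      # prepend the dummy x_0 = 1 component
    w = 0.01 * rng.standard_normal(Xb.shape[1])
    for _ in range(epochs):
        for j in rng.permutation(len(y)):              # one training example at a time
            a = np.tanh(w @ Xb[j])                     # activation a(w^T x)
            a_prime = 1.0 - a**2                       # derivative of tanh
            # dE/dw_k = (y - a)(-a') x_k, so w_new = w - eps * dE/dw
            w -= eps * (y[j] - a) * (-a_prime) * Xb[j]
    return w

# Example: learn a linearly separable AND-style function of two inputs
X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]], dtype=float)
y = np.array([-1, -1, -1, 1], dtype=float)
w = sgd_train_neuron(X, y)
print(np.tanh(np.hstack([np.ones((4, 1)), X]) @ w))   # outputs approach the targets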


Neural Network


Figure from STATISTICA Help

Now consider 3 layers of neurons, consisting of:
  • an input layer of \(I+1\) neurons where \(I\) is the number of features (\(I+1\) is for \(I\) features plus \(1\) bias component). Each of these neurons has one feature input and an activation value; the activation value is transmitted to the \(H\) nodes in the next layer. We define these input neurons to apply unit weight to each input.
  • In the next layer there are \(H\) "hidden layer" neurons. Each of these has a weight vector of length \(I+1\) which is applied to the inputs. Each neuron has an activation value which is transmitted to the \(O\) output neurons.
  • \(O\) is the number of output layer neurons. In a simple network designed to classify an input feature vector into one of two possible classes (coded by the numbers -1 and 1), it suffices to use a single tanh output node which will return a number between -1 and 1.
Backpropagation

A popular algorithm for training a whole network of neurons is called the backpropagation algorithm. It starts with the well-defined error that we can calculate at the output nodes, which is based on the difference between the training example label (e.g. -1 or 1) and the network output values.

Compute the activation functions

Before we can update the weights, we'll need to calculate the activation functions that result from the current training example we are considering. This gives us \(a_i\), \(a_h\), and \(a_o\), which are the activations of the i-th input node, h-th hidden node, and o-th output node, respectively. This is a feedforward network calculation: We start with the \(a_i\)'s which are identical to the feature vector components, then use these to compute all the \(a_h\), and finally all the \(a_o\). The activations of the input layer are defined as


\( a_i = x_i \)

The activation \(a_h\) of the h-th hidden layer node is
\( a_h = a(\sum_i w_{ih}a_i) \)

where \(w_{ih}\) is the i-th weight component of the h-th hidden layer node.

The activation \(a_o\) of the o-th output layer node is

\( a_o = a(\sum_h w_{ho}a_h) \)

where \(w_{ho}\) is the h-th weight component of the o-th output layer node.

(Each of the weight matrices is initialized to small random numbers.)
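A minimal NumPy sketch of this feedforward pass, with the bias handled as the extra input component described earlier (shapes and names are illustrative):

import numpy as np

def feed_forward(x, W_ih, W_ho):
    # x: length-I feature vector; W_ih: (I+1) x H weights; W_ho: H x O weights.
    a_i = np.concatenate(([1.0], x))   # input activations, with the bias component x_0 = 1
    a_h = np.tanh(a_i @ W_ih)          # hidden activations: a_h = a(sum_i w_ih a_i)
    a_o = np.tanh(a_h @ W_ho)          # output activations: a_o = a(sum_h w_ho a_h)
    return a_i, a_h, a_o

# Small random initialization, as described above
rng = np.random.default_rng(0)
I, H, O = 4, 5, 1
W_ih = 0.1 * rng.standard_normal((I + 1, H))
W_ho = 0.1 * rng.standard_normal((H, O))
a_i, a_h, a_o = feed_forward(rng.standard_normal(I), W_ih, W_ho)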

Weight updates / Weight training

For the weight update, we start by updating the weights that connect the hidden layer with the output layer: The output layer has \(O\) nodes numbered \(o\)=1,...\(O\), each of which has a weight vector of length \(H\) that applies weights to the outputs of the \(H\) hidden layer nodes. We just have to write the weight update equation for each of the \(O\) output nodes.


\( w^{new}_{ho} = w_{ho} + \epsilon [(y-a_o) a_o'] a_h = w_{ho} + \epsilon E_o a_h \)

where \(w_{ho}\) is the h-th weight of the o-th output node (component of an \(H\) x \(O\) matrix), and I defined \(E_o=(y-a_o) a_o'\). Notice that the feature vector of the current training example has been fed forward through the network until it is represented by the activation functions \(a_h\) of the hidden layer, which are the components of the \(H\)-dimensional activation vector.

At this point, we need a measure of the error at the hidden layer in order to update the weights of the hidden layer nodes. In the backpropagation algorithm, this error at the h-th hidden node is defined to be


\( E_h = a_h' \sum_o E_o w_{ho} \)

Why is it reasonable? It says that the errors at the output layer are propagated back to the hidden layer according to the connection strengths (weights) between those two layers, scaled by the derivative of the hidden node's activation (just as \(E_o\) contains the factor \(a_o'\)).

[As an aside, I note that backpropagation by itself isn't good enough for modern deep learning networks because the errors attenuate too much after many layers. Fortunately, however, when it comes to the simpler feedforward backpropagation network architectures, you will almost never need more than 2 hidden layers.]

The update to the hidden layer weights are then


\( w^{new}_{ih} = w_{ih}+ \epsilon E_h a_i \)

where \(w_{ih}\) is the i-th weight going into the h-th hidden node (component of an \((I+1)\) x \(H\) matrix).
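Continuing the illustrative feedforward sketch from above, the two weight updates for a single training example might look like this (again a minimal sketch, using the tanh derivative \(a' = 1 - a^2\)):

def backprop_update(x, y, W_ih, W_ho, eps=0.1):
    # One stochastic backpropagation step for the 3-layer network above;
    # y is the length-O target vector. Returns the updated weight matrices.
    a_i, a_h, a_o = feed_forward(x, W_ih, W_ho)   # from the earlier sketch
    E_o = (y - a_o) * (1.0 - a_o**2)              # output error, includes a_o'
    E_h = (W_ho @ E_o) * (1.0 - a_h**2)           # error propagated back through w_ho, times a_h'
    W_ho = W_ho + eps * np.outer(a_h, E_o)        # w_ho_new = w_ho + eps * E_o * a_h
    W_ih = W_ih + eps * np.outer(a_i, E_h)        # w_ih_new = w_ih + eps * E_h * a_i
    return W_ih, W_ho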

Python Code for ANN

You can see all of this implemented in Neil Schemenauer's back-propagation neural network code, bpnn.py, which also includes a momentum factor that helps prevent the algorithm from getting stuck in a local minimum. There is a built-in demo that trains a network to learn the XOR function.

I implemented a Numpy matrix version of the code which is available at github. I added some feature normalization code and an iris dataset demo which uses two tanh output nodes to make classification of all 3 classes possible. Perhaps a softmax output node would be more appropriate? (this would require some easy extensions to the code)

Total training error (y-axis) after each training epoch (x-axis). You can see that the objective function is likely non-convex and that the momentum term was helpful in avoiding local minima and plateaus. This is from training a 1-hidden-layer network to recognize handwritten digits from the MNIST dataset. You can also get the digits through scikit-learn: from sklearn.datasets import load_digits; digits = load_digits()

An existing neural networks library in Python is Pybrain. It has a nice amount of flexibility allowing you to define the activation functions in each layer (linear, sigmoid, tanh, softmax, 'long short-term memory', etc.). You can specify the amount of connectivity between layers, momentum, weight decay, and more.


I trained a 2-hidden-layer network in Pybrain to find a nonlinear decision boundary between two classes (red, blue) of training points after (Left) 3 epochs and (Right) 24 epochs. The stochastic nature of the training can be credited with finding a circular decision boundary after just 3 rounds through the training data.

Here I trained a 2-hidden-layer network in Pybrain with one softmax output to classify these 3 Gaussian blobs of training points. The predictions were thresholded in this figure, but the probabilities for each class are output by the network.


There is a Python package being developed by the Deep Learning community called Theano. By sending the matrix calculations to your GPU you can get a better than 100x speedup when you train the network -- a must for anyone doing work in this area. It's nice that an easy GPU solution became available so soon after GPUs were first used for this kind of research.

Tuesday, January 7, 2014

Latent Semantic Analysis

In this first post I'll develop a Latent Semantic Analysis code in Python, which can be used for text analysis, a synonym- and topic-aware Search Engine, Topic Modeling and Document Classification, and finding Similar Pages, among other applications. We'll run 20k Wikipedia articles through the code to learn what topics Wikipedia is made of.

Before we get into Latent Semantic Analysis, suppose we're interested in writing a basic document search engine. Consider mapping each document into an N-dimensional bag-of-words vector, where the relative frequency of the i-th word in a dictionary is stored in the i-th component of the vector (the relative frequency is the # of occurrences of that word in the document, divided by the total # of words in the document). Treating a search query made up of a handful of words as a vector \(\textbf{q}\), you can compute the cosine similarity between \(\textbf{q}\) and document \(\textbf{d}_j\), \(\cos \theta = \textbf{q} \cdot \textbf{d}_j / (|\textbf{q}||\textbf{d}_j|)\), and rank these scores to get search results (a small code sketch of this scoring appears after the list below). This vector space model also allows you to compute the similarity between documents, which is useful for a "Similar Pages" engine. Note that the model can be improved by using tf-idf vectors instead of relative frequency vectors. This simple model has some notable limitations:

  • A search query that only contains synonyms of the words in an otherwise highly relevant document won't yield that relevant document.
  • Without doing additional analysis, no document clusters or 'principal components' of the corpus have been identified.
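To make the scoring concrete, here is the small sketch promised above: a bag-of-words vector space model on a toy corpus (the documents and names are illustrative):

import numpy as np
from collections import Counter

docs = ["the cat sat on the mat", "the dog chased the cat", "stock markets fell today"]
query = "cat on a mat"

vocab = sorted({w for d in docs for w in d.split()})
index = {w: i for i, w in enumerate(vocab)}

def to_vector(text):
    # Relative-frequency bag-of-words vector over the corpus vocabulary
    counts = Counter(w for w in text.split() if w in index)
    v = np.zeros(len(vocab))
    total = max(len(text.split()), 1)
    for w, c in counts.items():
        v[index[w]] = c / total
    return v

q = to_vector(query)
for d in docs:
    v = to_vector(d)
    cos = v @ q / (np.linalg.norm(v) * np.linalg.norm(q) + 1e-12)
    print(round(cos, 3), d)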


The term-document matrix


Now consider the rectangular \(m\) x \(n\) term vs. document matrix \(M\), where \(M\) can be expressed either as a list of side-by-side document vectors \(\textbf{d}_j\) (column vectors) or equivalently as a stack of term vectors \(\textbf{t}_i^T\) (row vectors). See the Wikipedia page for Latent Semantic Analysis to see what this matrix looks like. The collection or body of documents is called the corpus.

In other words:

  • \(\textbf{d}_j\) is \(j\)-th document vector, which is an \(m\)-component column vector of relative frequencies of words. The relative frequency of the \(i\)-th word in this vector is the number of times that word occurs in a document divided by the total number of words in that document.
  • \(\textbf{t}_i^T\) is an \(n\)-component row vector for the \(i\)-th term. The \(j\)-th component of this vector is the relative frequency of this term in document \(j\).


Singular Value Decomposition



Latent Semantic Analysis can be described most simply as the analysis of the truncated Singular Value Decomposition (SVD) of the term-document matrix.

In general, the (non-truncated) SVD of an \(m\) x \(n\) rectangular matrix \(M\) is

\(M = USV^T\)

where \(U\) is an \(m\) x \(r\) matrix with orthonormal columns (\(U^T U = I\)), which are the left singular vectors; \(S\) is a square \(r\) x \(r\) diagonal matrix whose ordered entries are the singular values; and \(V\) is an \(n\) x \(r\) matrix whose orthonormal columns are the right singular vectors (here \(r\) is the rank of \(M\)). This decomposition is possible by the proof here.

Consider the \(m\) x \(m\) symmetric covariance matrix \(C=MM^T\): if we think of the relative frequencies \(\{M_{ij}\}\) as random variables, we see that the entry \(C_{pq}\) is the covariance between term p and term q, and the expectation value integral has been taken over the space of documents. It is the term-term similarity matrix; and \(M^TM\) is the document-document similarity matrix.

Using these orthonormality conditions you can show that the columns of \(U\) are the eigenvectors of \(MM^T\), and that the entries of \(S\) are the square roots of the eigenvalues of \(MM^T\). Likewise, the columns of \(V\) are the eigenvectors of \(M^TM\). Note that \(MM^T\) and \(M^TM\) have exactly the same nonzero eigenvalues.
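These relationships are easy to check numerically on a small random matrix (a quick sanity-check sketch, separate from the LSA code below):

import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((5, 3))

U, s, Vt = np.linalg.svd(M, full_matrices=False)
evals, evecs = np.linalg.eigh(M @ M.T)   # eigen-decomposition of the symmetric matrix M M^T

# The singular values are the square roots of the nonzero eigenvalues of M M^T
print(np.allclose(np.sort(s**2), np.sort(evals[evals > 1e-10])))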

It turns out that these eigenvectors are none other than the principal components of the respective covariance matrices (click the link to see why). As such, we can think of the \(m\) x \(1\) left singular vectors as topic vectors, which are orthogonal collections of words that frequently co-occur within the document space. Another way to think about the components of individual topic vectors is that they contain synonymous words in a general sense; "war" will be correlated with "general," "brigade," and "artillery."


Latent Semantic Analysis: Truncated SVD



Here is where the truncation comes in: We keep only the \(k\) largest singular values, where \(k\) is somewhere in the range 50-500 depending on application (see this). The result is that

  • The matrix shapes are now \(U_{mk}\), \(S_{kk}\), and \(V_{nk}\)
  • We approximate the matrix \(M\) with \(M_k = U_k S_k V^T_k\), which is a matrix of rank \(k\); the original rank \(r\) has been replaced by the smaller \(k\).
  • The rows of V are documents within the lower dimensional semantic space.
  • The rows of U are the words/terms within the lower dimensional semantic space.
  • The low-frequency, possibly spurious correlations between words are discarded, which helps reduce noise.
A document vector such as a query \(q\) is placed into the semantic space by applying the transformation
\(q_k = S_k^{-1}U_k^T q\). Simply projecting one query vector is slow: O(nterm*k). [edit: This could be sped up to O(querylength*k), where querylength is the number of words in the query, if a sparse matrix (vector) is used for \(q\).]

At this point you could write an LSA-based document classifier / topic identifier by computing the cosine similarity between a document and a labeled training corpus; putting the query document into the semantic space is O(nterm*k) [edit: or O(doclength*k) for a sparse document] and comparing it against the corpus is O(ndoc*k). Clearly, an LSA search engine would be far slower than an engine that utilizes tries.
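A minimal sketch of the projection and ranking steps, assuming dense NumPy arrays U_k (nterm x k), s_k (the k singular values), and V_k (ndoc x k) from a truncated SVD (the names are illustrative):

import numpy as np

def project_query(q, U_k, s_k):
    # Map a term-space vector q into the k-dimensional semantic space: q_k = S_k^{-1} U_k^T q
    return (U_k.T @ q) / s_k

def rank_documents(q_k, V_k):
    # Cosine similarity between the projected query and each row of V_k
    # (each row of V_k is a document in the semantic space)
    sims = (V_k @ q_k) / (np.linalg.norm(V_k, axis=1) * np.linalg.norm(q_k) + 1e-12)
    return np.argsort(sims)[::-1]   # document indices, best match first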

Depending on how many dictionary words you use, the term-document matrix might have only 2% non-zero entries, or significantly less. Note that the current Python scipy.sparse.linalg.svds sparse SVD algorithm has a self-described naive implementation and is actually not sparse at all. I recommend looking into sparsesvd or gensim for a true sparse SVD algorithm.


Some snippets for a Python/numpy/scipy LSA implementation




1. Turn a set of documents (corpus) into a list of documents: each document is itself a list of words. Example: corpus_tokenized = [[word.lower() for word in document.split()] for document in corpus]

2. Create a dictionary based on the corpus. It's crucial to utilize a hash table for O(1) look-up time, for getting the index number of each word. In Python this means simply using a dictionary with words as keys and indices as values.
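For example, a minimal sketch (rdict is the word-to-index map that the later snippets refer to as self.rdict):

# Build the word -> index hash table from the tokenized corpus
vocabulary = sorted({word for document in corpus_tokenized for word in document})
rdict = {word: index for index, word in enumerate(vocabulary)}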

3. The hard part: build the sparse term vs. document matrix using SciPy's COO, CSC, and CSR sparse matrix data structures:
# Assumed module-level imports for this snippet:
import numpy as np
from collections import defaultdict
from scipy.sparse import coo_matrix

def make_term_doc_matrix_coo(self, corpus, mode='tf'):
    nterm = len(self.rdict)
    ndoc = len(corpus)

    # Build the COO triplets (term index, document index, count)
    row = []
    col = []
    count = []
    for idoc, doc in enumerate(corpus):
        tf_dictionary = defaultdict(int)
        # Count occurrences of each dictionary word in this document
        for word in doc:
            if word in self.rdict:
                tf_dictionary[word] += 1
        for word, wordcount in tf_dictionary.items():
            row.append(self.rdict[word])
            col.append(idoc)
            count.append(wordcount)

    # CSC gives fast column (per-document) slices; CSR gives fast row (per-term) slices
    td_coo = coo_matrix((count, (row, col)), shape=(nterm, ndoc))
    self.term_doc_csc = td_coo.tocsc()

    # Total word count in each document (add 1 to avoid division by zero)
    wordcount_per_doc = np.asarray(self.term_doc_csc.sum(axis=0)).ravel().astype(float) + 1.0
    inverse_wordcount_per_doc = np.reciprocal(wordcount_per_doc)

    if mode == 'tf':  # raw term frequency
        self.term_doc_csc = self.term_doc_csc.astype(float)
        return

    if mode == 'rtf':  # relative term frequency
        td_csc_float = self.term_doc_csc.astype(float)
        for idoc in range(ndoc):
            # Scale the nonzero entries of column idoc (one document) in place
            td_csc_float.data[  td_csc_float.indptr[idoc]
                              : td_csc_float.indptr[idoc+1]] *= inverse_wordcount_per_doc[idoc]
        self.term_doc_csc = td_csc_float
        return

    if mode == 'tf-idf':
        # Trick for a boolean logic operation on a sparse scipy matrix: convert the
        # integer matrix to bool, back to int, then sum to get document frequency per term.
        df_t = np.asarray(self.term_doc_csc.astype(bool).astype(int).sum(axis=1)).ravel()
        df_t = df_t.astype(float) + 1.0          # smooth to avoid division by zero
        idf_t = np.log10(ndoc * np.reciprocal(df_t))

        # Sub-linear term frequency: log(1 + count)
        td_csr_float = self.term_doc_csc.astype(float).log1p().tocsr()
        # CSR lets us select each row (term) and multiply it by idf_t
        for iterm in range(nterm):
            td_csr_float.data[  td_csr_float.indptr[iterm]
                              : td_csr_float.indptr[iterm+1]] *= idf_t[iterm]

        self.term_doc_csc = td_csr_float.tocsc()
        return

4. Compute the SVD using a sparse algorithm that utilizes SciPy's CSC format. This code snippet also calculates \(U^T\) and \(S^{-1}\) which are useful for projecting a vector into the semantic space.

import numpy as np  # assumed imported at module level

def svd_sparse(self, k):
    from sparsesvd import sparsesvd
    # Note: sparsesvd returns the largest singular value first in the list
    self.uT, self.s, self.vT = sparsesvd(self.term_doc_csc, k)
    self.u = self.uT.T
    self.v = self.vT.T
    # S is diagonal, so S^{-1} is the element-wise reciprocal of the singular values
    self.s_inv = np.reciprocal(self.s)

Bonus: A Python generator function that retrieves random pages from Wikipedia using a Wikipedia API.


def docgen_wiki(stopwords, pages=1):
    # Generator that yields the title, full plain text, and filtered tokens of
    # random Wikipedia pages, using the wikipedia Python API
    import wikipedia as wk
    while pages > 0:
        result = None
        while result is None:
            try:
                title = wk.random(pages=1)
                result = wk.WikipediaPage(title)
            except Exception:
                # Most likely exception is caused by arriving at a Wikipedia
                # disambiguation page. Try again if this happens.
                pass
        tokens = []
        for word in result.content.split():
            word_strip = word.lower().rstrip('.')
            if word_strip not in stopwords and word_strip.isalpha():
                tokens.append(word_strip)
        yield title, result.content, tokens
        pages -= 1

Some Results


I gathered 19,500 random plain text Wikipedia articles, which were used as a basis for a dictionary of the 100,000 most common words. Then I built a sparse term-document matrix of tf-idf values and ran this through a sparse SVD algorithm with k=300. This sparse matrix has a density of 0.12%. The resulting \(U\), \(S\), and \(V\) matrices give very reasonable looking results when used in a search engine. Likewise when I use one of the Wikipedia articles as the query vector, it gives very nice results as a Similar Pages engine.

In my own experiments on Wikipedia I've found that the definition of tf-idf you choose has a big impact on the topic vectors. If you use 'relative frequency' as the tf part of tf-idf, then words from numerous bot-generated articles will make an appearance in many of the topic vectors. Also, articles that list dozens of species with the same genus name will be all over the topic vectors. Here are the top 15 topic vectors (largest 8 components shown for each) when I instead use \(tf(t,d) = \log (f(t,d) + 1)\):

(Figures: the top 15 topic vectors, 8 largest components each.)
An analysis of all of Wikipedia using gensim found many Wiki meta-words and bot topics, and found more natural-looking topics when Latent Dirichlet Allocation was applied.