Tuesday, January 7, 2014

Latent Semantic Analysis

In this first post I'll develop Latent Semantic Analysis code in Python, which can be used for text analysis: a synonym- and topic-aware search engine, topic modeling, document classification, and finding similar pages, among other applications. We'll run 20k Wikipedia articles through the code to learn what topics Wikipedia is made of.

Before we get into Latent Semantic Analysis, suppose we're interested in writing a basic document search engine. Consider mapping each document into an N-dimensional bag-of-words vector, where the relative frequency of the i-th word in a dictionary is stored in the i-th component of the vector (the relative frequency is the number of occurrences of that word in the document divided by the total number of words in the document). Treating a search query comprised of a handful of words as a vector \(\textbf{q}\), you can compute the cosine similarity between \(\textbf{q}\) and document \(\textbf{d}_j\), \(\cos \theta = \textbf{q} \cdot \textbf{d}_j / (|\textbf{q}||\textbf{d}_j|)\), and rank these scores to get search results. This vector space model also allows you to compute the similarity between documents, which is useful for a "Similar Pages" engine; a minimal code sketch appears after the list below. Note that the model can be improved by using tf-idf vectors instead of relative frequency vectors. Shortcomings of this model are as follows:

  • A search query that only contains synonyms of the words in an otherwise highly relevant document won't yield that relevant document.
  • Without doing additional analysis, no document clusters or 'principal components' of the corpus have been identified.
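
Here's a minimal sketch of this vector space search in Python/NumPy; the toy vocabulary, corpus, and query are made up purely for illustration:

import numpy as np

# Toy vocabulary, corpus, and query -- for illustration only
vocab = ['cat', 'dog', 'pet', 'car', 'engine']
word_index = {w: i for i, w in enumerate(vocab)}
corpus = ['the cat is a pet', 'the dog is a pet', 'the car engine is loud']

def to_vector(text):
    # Relative-frequency bag-of-words vector over the fixed vocabulary
    words = text.lower().split()
    v = np.zeros(len(vocab))
    for w in words:
        if w in word_index:
            v[word_index[w]] += 1
    return v / max(len(words), 1)

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return 0.0 if denom == 0 else a.dot(b) / denom

doc_vectors = [to_vector(d) for d in corpus]
q = to_vector('dog pet')
scores = sorted(((cosine(q, d), j) for j, d in enumerate(doc_vectors)), reverse=True)
print(scores)   # (similarity, document index) pairs, best match first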


The term-document matrix


Now consider the rectangular \(m\) x \(n\) term-document matrix \(M\), which can be expressed either as a list of side-by-side document vectors \(\textbf{d}_j\) (column vectors) or equivalently as a stack of term vectors \(\textbf{t}_i^T\) (row vectors). See the Wikipedia page for Latent Semantic Analysis for an example of what this matrix looks like. The collection or body of documents is called the corpus.

In other words:

  • \(\textbf{d}_j\) is the \(j\)-th document vector, an \(m\)-component column vector of relative frequencies of words. The relative frequency of the \(i\)-th word in this vector is the number of times that word occurs in the document divided by the total number of words in that document.
  • \(\textbf{t}_i^T\) is an \(n\)-component row vector for the \(i\)-th term. The \(j\)-th component of this vector is the relative frequency of this term in document \(j\).


Singular Value Decomposition



Latent Semantic Analysis can be described most simply as the analysis of the truncated Singular Value Decomposition (SVD) of the term-document matrix.

In general, the (non-truncated) SVD of an \(m\) x \(n\) rectangular matrix \(M\) is

\(M = USV^T\)

where \(U\) is an \(m\) x \(r\) matrix with orthonormal columns (\(U^T U = I\)), which are the left singular vectors; \(S\) is a square \(r\) x \(r\) diagonal matrix whose ordered entries are the singular values; and \(V\) is an \(n\) x \(r\) matrix with orthonormal columns, which are the right singular vectors. Here \(r\) is the rank of \(M\). This decomposition is possible by the proof here.
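
As a quick numerical sanity check (not part of the LSA pipeline itself), NumPy's np.linalg.svd reproduces this decomposition on a small random matrix:

import numpy as np

m, n = 6, 4                                        # 6 terms, 4 documents (toy sizes)
M = np.random.rand(m, n)

U, s, Vt = np.linalg.svd(M, full_matrices=False)   # "thin" SVD
S = np.diag(s)

print(U.shape, S.shape, Vt.shape)                  # (6, 4) (4, 4) (4, 4)
print(np.allclose(M, U.dot(S).dot(Vt)))            # True: M = U S V^T
print(np.allclose(U.T.dot(U), np.eye(U.shape[1]))) # True: columns of U are orthonormal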

Consider the \(m\) x \(m\) symmetric covariance matrix \(C=MM^T\): if we think of the relative frequencies \(\{M_{ij}\}\) as random variables, we see that the entry \(C_{pq}\) is the covariance between term p and term q, and the expectation value integral has been taken over the space of documents. It is the term-term similarity matrix; and \(M^TM\) is the document-document similarity matrix.

Using the orthonormality of the columns of \(U\) and \(V\), you can show that the columns of \(U\) are the eigenvectors of \(MM^T\), and that the entries of \(S\) are the square roots of the eigenvalues of \(MM^T\). Likewise, the columns of \(V\) are the eigenvectors of \(M^TM\). Note that \(MM^T\) and \(M^TM\) have exactly the same non-zero eigenvalues.
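
This relationship is easy to verify numerically on a toy matrix:

import numpy as np

M = np.random.rand(6, 4)
U, s, Vt = np.linalg.svd(M, full_matrices=False)

# Symmetric eigendecomposition of the term-term matrix M M^T
eigvals, eigvecs = np.linalg.eigh(M.dot(M.T))
eigvals = np.sort(eigvals)[::-1][:len(s)]          # largest first, keep the top r

print(np.allclose(eigvals, s**2))                  # True: eigenvalues are the squared singular values
print(np.allclose(M.dot(M.T).dot(U), U * s**2))    # True: (M M^T) u_i = s_i^2 u_i for each column u_i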

It turns out that these eigenvectors are none other than the principal components of the respective covariance matrices (click the link to see why). As such, we can think of the \(m\) x \(1\) left singular vectors as topic vectors, which are orthogonal collections of words that frequently co-occur within the document space. Another way to think about the components of individual topic vectors is that they contain synonymous words in a general sense; "war" will be correlated with "general," "brigade," and "artillery."


Latent Semantic Analysis: Truncated SVD



Here is where the truncation comes in: We keep only the \(k\) largest singular values, where \(k\) is somewhere in the range 50-500 depending on application (see this). The result is that

  • The matrix shapes are now \(m\) x \(k\) for \(U_k\), \(k\) x \(k\) for \(S_k\), and \(n\) x \(k\) for \(V_k\).
  • We approximate the matrix \(M\) with \(M_k = U_k S_k V^T_k\), which is a matrix of smaller rank: the original rank \(r\) has been replaced with the smaller \(k\).
  • The rows of \(V_k\) are the documents within the lower-dimensional semantic space.
  • The rows of \(U_k\) are the words/terms within the lower-dimensional semantic space.
  • Low-frequency, possibly spurious correlations between words are rejected, which helps reduce noise.

A document vector such as a query \(q\) is placed into the semantic space by applying the transformation \(q_k = S_k^{-1}U_k^T q\). Simply projecting one query vector is slow, O(nterm*k). [edit: This could be sped up to O(querylength*k), where querylength is the number of words in the query, if a sparse matrix (vector) is used for \(q\).]
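
Here's a small dense sketch of the truncation and of this projection; the matrix sizes and query terms are arbitrary placeholders:

import numpy as np

M = np.random.rand(1000, 200)                      # toy term-document matrix: 1000 terms, 200 docs
k = 50

U, s, Vt = np.linalg.svd(M, full_matrices=False)   # singular values come out largest first
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]        # keep the k largest singular values

q = np.zeros(1000)                                 # bag-of-words query vector in term space
q[[3, 17, 42]] = 1.0                               # arbitrary query terms, for illustration

q_k = (1.0 / s_k) * U_k.T.dot(q)                   # q_k = S_k^{-1} U_k^T q, using the diagonal of S_k
print(q_k.shape)                                   # (50,): the query in the semantic space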

At this point you could write an LSA-based document classifier / topic identifier by computing the cosine similarity between a document and a labeled training corpus; putting the query document into the semantic space is O(nterm*k) [edit: or O(doclength*k) for a sparse document] and comparing it against the corpus is O(ndoc*k). Clearly, an LSA search engine would be far slower than an engine that utilizes tries.
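
A bare-bones sketch of such a classifier might look like the following; the training matrix and labels are random placeholders, and a dense SVD stands in for the sparse one used later:

import numpy as np

M = np.random.rand(1000, 200)                      # labeled training corpus in term space
labels = np.random.randint(0, 3, size=200)         # placeholder label for each training document
k = 50

U, s, Vt = np.linalg.svd(M, full_matrices=False)
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]
docs_k = Vt_k.T                                    # training documents in the semantic space, shape (ndoc, k)

def classify(doc_vector):
    # Project into the semantic space (O(nterm*k)), then take the label of the
    # most similar training document by cosine similarity (O(ndoc*k))
    d_k = (1.0 / s_k) * U_k.T.dot(doc_vector)
    sims = docs_k.dot(d_k) / (np.linalg.norm(docs_k, axis=1) * np.linalg.norm(d_k) + 1e-12)
    return labels[np.argmax(sims)]

print(classify(np.random.rand(1000)))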

Depending on how many dictionary words you use, the term-document matrix might have only 2% non-zero entries, or significantly less. Note that the current Python scipy.sparse.linalg.svds sparse SVD routine has a self-described naive implementation and is not actually sparse at all. I recommend looking into sparsesvd or gensim for a true sparse SVD algorithm.


Some snippets for a Python/numpy/scipy LSA implementation




1. Turn a set of documents (corpus) into a list of documents: each document is itself a list of words. Example: corpus_tokenized = [[word.lower() for word in document.split()] for document in corpus]

2. Create a dictionary based on the corpus. It's crucial to use a hash table so that looking up the index number of each word is O(1). In Python this means simply using a dict with words as keys and indices as values.
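For example, here's a sketch of this step, assuming the corpus_tokenized list from step 1 and producing the word-to-index mapping (called rdict in the code below):

from collections import Counter

# Count every word in the tokenized corpus and keep the most common ones
word_counts = Counter(word for document in corpus_tokenized for word in document)
vocabulary = [word for word, _ in word_counts.most_common(100000)]

# Hash table: word -> row index in the term-document matrix
rdict = {word: index for index, word in enumerate(vocabulary)}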

3. The hard part: build the sparse term vs. document matrix using SciPy's COO, CSC, and CSR sparse matrix data structures:
def make_term_doc_matrix_coo(self, corpus, mode='tf'):
    import numpy as np
    from collections import defaultdict
    from scipy.sparse import coo_matrix

    nterm = len(self.rdict)
    ndoc = len(corpus)

    row = []
    col = []
    count = []
    for idoc, doc in enumerate(corpus):
        tf_dictionary = defaultdict(int)
        # Count occurrences of each dictionary word in this document
        for word in doc:
            if word in self.rdict:
                tf_dictionary[word] += 1
        for word, wordcount in tf_dictionary.items():
            row.append(self.rdict[word])
            col.append(idoc)
            count.append(wordcount)

    # Build once in COO format, then keep CSC (fast column slicing; also the
    # format the sparse SVD expects) and CSR (fast row slicing) copies
    td_coo = coo_matrix((count, (row, col)), shape=(nterm, ndoc))
    self.term_doc_csc = td_coo.tocsc()
    td_csr            = td_coo.tocsr()

    # Total word count per document (the +1 avoids division by zero for empty docs)
    wordcount_per_doc = np.asarray(td_csr.sum(axis=0)).ravel() + 1.0
    inverse_wordcount_per_doc = np.reciprocal(wordcount_per_doc)

    if mode == 'tf':  # raw term frequency
        self.term_doc_csc = self.term_doc_csc.astype(float)
        return

    if mode == 'rtf':  # relative term frequency
        td_csc_float = self.term_doc_csc.astype(float)
        # For CSC, data[indptr[j]:indptr[j+1]] holds the nonzeros of column (document) j
        for idoc in range(ndoc):
            td_csc_float.data[  td_csc_float.indptr[idoc]
                              : td_csc_float.indptr[idoc+1]] *= inverse_wordcount_per_doc[idoc]
        self.term_doc_csc = td_csc_float
        return

    if mode == 'tf-idf':
        '''This is a trick for simulating a boolean operation on a sparse SciPy
        matrix: convert the integer matrix to bool, then back to int, then sum.'''
        # document frequency of each term (df_t), smoothed by +1
        df_t = np.asarray(self.term_doc_csc.astype(bool).astype(int).sum(axis=1)).ravel()
        df_t = df_t.astype(float) + 1.0
        idf_t = np.log10(ndoc * np.reciprocal(df_t))

        # tf part: log(1 + count), computed elementwise on the sparse matrix
        td_csc_float = self.term_doc_csc.astype(float).log1p()
        # Convert to CSR so each row (term) can be scaled by its idf_t
        td_csr_float = td_csc_float.tocsr()
        # For CSR, data[indptr[i]:indptr[i+1]] holds the nonzeros of row (term) i
        for iterm in range(nterm):
            td_csr_float.data[  td_csr_float.indptr[iterm]
                              : td_csr_float.indptr[iterm+1]] *= idf_t[iterm]

        self.term_doc_csc = td_csr_float.tocsc()
        return

4. Compute the SVD using a sparse algorithm that utilizes SciPy's CSC format. This code snippet also calculates \(U^T\) and \(S^{-1}\) which are useful for projecting a vector into the semantic space.

def svd_sparse(self, k):
    import numpy as np
    from sparsesvd import sparsesvd
    # sparsesvd takes a CSC matrix and returns (U^T, s, V^T), with the
    # largest singular value listed first
    self.uT, self.s, self.vT = sparsesvd(self.term_doc_csc, k)
    self.u = self.uT.T
    self.v = self.vT.T
    # Precompute S^{-1} (the reciprocals of the singular values) for
    # projecting vectors into the semantic space
    self.s_inv = np.reciprocal(self.s)

Bonus: a Python generator function that retrieves random pages from Wikipedia using the wikipedia Python package.


def docgen_wiki(stopwords, pages=1):
    # Generator that yields the title, full plain-text content, and filtered
    # tokens of random Wikipedia pages, using the wikipedia Python package
    import wikipedia as wk
    while pages > 0:
        result = None
        while result is None:
            try:
                title = wk.random(pages=1)
                result = wk.WikipediaPage(title)
            except Exception:
                # The most likely exception is caused by arriving at a Wikipedia
                # disambiguation page. Try again if this happens.
                pass
        tokens = []
        for word in result.content.split():
            word_strip = word.lower().rstrip('.')
            if word_strip not in stopwords and word_strip.isalpha():
                tokens.append(word_strip)
        yield title, result.content, tokens
        pages -= 1
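
For example, the generator can be driven like this to assemble a tokenized corpus; the stopword list here is just a placeholder:

stopwords = set(['the', 'a', 'an', 'and', 'of', 'in', 'to', 'is'])

titles = []
corpus_tokenized = []
for title, content, tokens in docgen_wiki(stopwords, pages=100):
    titles.append(title)
    corpus_tokenized.append(tokens)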

Some Results


I gathered 19,500 random plain-text Wikipedia articles, which were used as the basis for a dictionary of the 100,000 most common words. Then I built a sparse term-document matrix of tf-idf values (with a density of 0.12%) and ran it through a sparse SVD algorithm with k=300. The resulting \(U\), \(S\), and \(V\) matrices give very reasonable-looking results when used in a search engine. Likewise, when I use one of the Wikipedia articles as the query vector, it gives very nice results as a Similar Pages engine.

In my own experiments on Wikipedia I've found that the definition of tf-idf you choose has a big impact on the topic vectors. If you use 'relative frequency' as the tf part of tf-idf, then words from numerous bot-generated articles will make an appearance in many of the topic vectors. Also, articles that list dozens of species with the same genus name will be all over the topic vectors. Here are the top 15 topic vectors (largest 8 components shown for each) when I instead use \(tf(t,d) = \log (f(t,d) + 1)\):

[Figure: the top 15 topic vectors, showing the largest 8 components of each.]

An analysis of all of Wikipedia using gensim found many Wiki meta-words and bot-generated topics, and found more natural-looking topics when Latent Dirichlet Allocation was applied.