LSA-sourced association lists
Latent Semantic Analysis (LSA) is a classical tool for automatically extracting similarities between documents through dimensionality reduction. A term-document matrix is filled with weights corresponding to the importance of each term in each document (term frequency/inverse document frequency in our case) and is then reduced via Singular Value Decomposition to a lower-dimensional space called the concept space.
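As an illustration of the weighting step, the following minimal sketch builds a small term-document matrix with tf-idf weights. The exact tf-idf variant used in the experiment is not specified here, so the raw-count times log(N/df) formula, the whitespace tokenizer, the function name tfidf_matrix and the toy corpus are all assumptions made for the example.

```python
import math
from collections import Counter

def tfidf_matrix(docs):
    """Return (vocab, X), where X[i][j] is the tf-idf weight of term i in document j.

    Assumed variant: raw term count multiplied by log(N / document frequency).
    """
    tokenized = [doc.lower().split() for doc in docs]  # naive whitespace tokenizer
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(tokenized)
    # Document frequency: number of documents in which each term occurs.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    counts = [Counter(toks) for toks in tokenized]
    X = [
        [counts[j][term] * math.log(n_docs / df[term]) for j in range(n_docs)]
        for term in vocab
    ]
    return vocab, X

# Toy corpus (hypothetical); rows of X are terms, columns are documents.
docs = ["the cat sat", "the dog sat", "the cat ran"]
vocab, X = tfidf_matrix(docs)
```

With this variant, a term that occurs in every document (here "the") receives a weight of zero, which is the intended effect of the inverse document frequency factor.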
Formally, the term-document matrix X of dimension n x m (n terms and m documents) can be decomposed into two orthogonal matrices U and V and a diagonal matrix Σ through singular value decomposition:
X = U Σ V^T
This in turn can be represented through a rank-k approximation of X in a lower-dimensional space (Σ becomes a k x k matrix). We used an arbitrary rank value of 150 in our experiment:
X_k = U_k Σ_k V_k^T
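The rank-k approximation can be sketched with NumPy as follows. This is only an illustration on a small random matrix, not the setup used in the experiment; the helper name truncated_svd and the matrix sizes are assumptions.

```python
import numpy as np

def truncated_svd(X, k):
    """Keep the k largest singular triplets of X (terms x documents)."""
    # np.linalg.svd returns the singular values in descending order.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]

# Hypothetical 6-term x 4-document matrix.
rng = np.random.default_rng(0)
X = rng.random((6, 4))

U_k, s_k, Vt_k = truncated_svd(X, k=2)
# Rank-k reconstruction: U_k (n x k), diag(s_k) (k x k), Vt_k (k x m).
X_k = U_k @ np.diag(s_k) @ Vt_k
```

Each row of U_k is then the k-dimensional representation of one term, and each column of Vt_k is the representation of one document.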
This representation is often used to compare documents in the new space, but as the problem is symmetrical it can equally be used to compare words. The U_k matrix of dimensions n x k represents the model of words in the new k-dimensional concept space. We can thus measure the relative similarity of any two words by taking the cosine distance between their representations.
Each LSA-sourced list of associations consists of the words ordered by cosine distance from the given word, in a model built on each of the three corpora as described above.
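A minimal sketch of how such an ordered list could be derived from the word representations is given below. It ranks words by decreasing cosine similarity, which is the same ordering as increasing cosine distance. The function name association_list, the top_n parameter and the toy vectors are assumptions made for illustration.

```python
import numpy as np

def association_list(word, vocab, U_k, top_n=10):
    """Rank all other words by cosine similarity to `word`.

    vocab[i] is the term represented by row i of U_k.
    """
    norms = np.linalg.norm(U_k, axis=1, keepdims=True)
    unit = U_k / np.where(norms == 0, 1.0, norms)  # guard against zero rows
    sims = unit @ unit[vocab.index(word)]           # cosine with every word
    order = np.argsort(-sims)                       # most similar first
    ranked = [(vocab[i], float(sims[i])) for i in order if vocab[i] != word]
    return ranked[:top_n]

# Toy 2-dimensional word vectors (hypothetical values).
vocab = ["cat", "dog", "car"]
U_k = np.array([[1.0, 0.1],
                [0.9, 0.2],
                [0.0, 1.0]])
ranked = association_list("cat", vocab, U_k)
```

In this toy example "dog" is ranked ahead of "car", because its vector points in nearly the same direction as that of "cat".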
A crucial element in the application of Latent Semantic Analysis [LAN 08] is determining k, the number of concepts used to project the data onto the reduced k-dimensional concept space. As this parameter is a characteristic of the corpus, and to some degree of the specific application, in this case it has been determined experimentally. For each corpus (PRUS, NCP and PAP), an LSA model has been built for each dimension between 25 and 400 in increments of 25. For each corpus, the dimension chosen was the one that gave the highest sum of matching words from 10 association lists in a window of 1,000 words. The final results, as presented in section 3.4, correspond to a dimension of 75 for PRUS and NCP and 300 for PAP. The calculations were made using the gensim topic modeling library.
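The selection procedure can be sketched as a simple sweep over candidate dimensions. The code below is an assumption-laden outline rather than the procedure actually run with gensim: build_lists stands for a hypothetical function that trains a model with k dimensions and returns the ranked association list for each stimulus word, and reference_lists stands for the association lists the model output is matched against.

```python
def matching_score(model_lists, reference_lists, window=1000):
    """Sum, over all stimuli, the reference words found in the top `window` model words."""
    total = 0
    for stimulus, reference in reference_lists.items():
        top = set(model_lists.get(stimulus, [])[:window])
        total += sum(1 for word in set(reference) if word in top)
    return total

def select_dimension(build_lists, reference_lists, dims=range(25, 401, 25), window=1000):
    """Return (best_k, scores) for the candidate dimensions 25, 50, ..., 400."""
    scores = {k: matching_score(build_lists(k), reference_lists, window) for k in dims}
    # On ties, max() keeps the first maximum, i.e. the smallest such k.
    best_k = max(scores, key=scores.get)
    return best_k, scores

# Toy stand-ins (hypothetical): a fake model that does best at k = 75.
def toy_build_lists(k):
    return {"dom": ["dach", "okno"] if k == 75 else ["okno"]}

reference = {"dom": ["dach", "okno", "drzwi"]}
best_k, scores = select_dimension(toy_build_lists, reference)
```

Breaking ties in favour of the smallest dimension is a design choice of this sketch; the text does not state how ties were handled.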