
LDA-sourced lists

Latent Dirichlet Allocation (LDA) is a mechanism used for topic extraction [BLE 03]. It treats documents as probability distributions over topics, and topics as probability distributions over words. These topics are not strongly defined, as they are identified on the basis of the likelihood of co-occurrence of the words they contain.

To obtain a ranked list of words associated with a given word wn, we take the set of topics generated by LDA. Then, for each word these topics contain, we sum over all topics the weight of that word in the topic multiplied by the weight of the given word wn in the same topic.

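Formally, for N topics and $w_{ji}$ denoting the weight of the word i in the topic j, the ranking weight for the word i is computed as follows:

$$w_i = \sum_{j=1}^{N} w_{ji} \cdot w_{jn}$$

where $w_{jn}$ is the weight of the given word wn in the topic j.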

This representation allows us to create a ranked list of words associated with a given word wn based on their probability of co-occurrence in the documents.
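
The ranking step described above can be illustrated with a minimal Python sketch. It assumes the LDA output is already available as a topic-by-word weight matrix. The function name `rank_associations`, the toy vocabulary and the weights are hypothetical and chosen only for illustration; they are not taken from the experiment.

```python
def rank_associations(topic_word, vocab, target, top_k=5):
    """Rank words by their summed topic-weight products with `target`.

    topic_word[j][i] is assumed to hold the weight of word i in topic j.
    """
    n = vocab.index(target)
    scores = {}
    for i, word in enumerate(vocab):
        if i == n:
            continue  # skip the given word itself
        # Sum over topics: weight of word i times weight of the target word
        scores[word] = sum(topic[i] * topic[n] for topic in topic_word)
    # Highest score first
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]


# Hypothetical toy data: 2 topics over a 4-word vocabulary
vocab = ["dog", "cat", "bone", "car"]
topic_word = [
    [0.5, 0.3, 0.2, 0.0],
    [0.0, 0.1, 0.0, 0.9],
]
ranked = rank_associations(topic_word, vocab, "dog")
```

In this toy setting, words that share a topic with the given word score higher than words that appear only in unrelated topics.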

Association ratio-based lists

In order to evaluate the quality of the relatively advanced mechanism of Latent Semantic Analysis, we will compare its efficiency to the association ratio as presented in [CHU 90], with some minor changes related to the nature of the processed data. For two words x and y, their association ratio fw(x,y) is defined as the number of times y follows or precedes x in a window of w words. The original association ratio was asymmetric, considering only words y that follow the parameter x. That approach would, however, fail for texts written in languages with no strict word order in sentences (Polish in our case), where syntactic information is conveyed through rich inflection rather than through word order. We use the same value of w as Church and Hanks [CHU 90], who suggested a value of 5. This measure can be seen as simplistic in comparison with LSA but, as the results will show, it is useful nonetheless.
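
As a rough illustration of the symmetric count described above, the following Python sketch tallies how often y occurs within a window around x, in either direction. It assumes the window covers up to w tokens on each side of x; the text does not specify the exact boundary convention, so this is only one possible reading. The function names and the sample sentence are hypothetical.

```python
from collections import Counter


def association_counts(tokens, window=5):
    """Count, for every ordered pair (x, y), how often y occurs within
    `window` tokens before or after x (assumed boundary convention)."""
    counts = Counter()
    for pos, x in enumerate(tokens):
        lo = max(0, pos - window)
        hi = min(len(tokens), pos + window + 1)
        for other in range(lo, hi):
            if other != pos:
                counts[(x, tokens[other])] += 1
    return counts


def ranked_list(counts, x, top_k=5):
    """Words associated with x, most frequent first (ties broken alphabetically)."""
    pairs = [(y, c) for (a, y), c in counts.items() if a == x]
    return sorted(pairs, key=lambda p: (-p[1], p[0]))[:top_k]


# Hypothetical toy input; a small window keeps the example easy to follow
tokens = "the dog chased the cat and the dog barked".split()
counts = association_counts(tokens, window=2)
top_for_dog = ranked_list(counts, "dog")
```

Because both directions are counted, the resulting counts are symmetric in x and y, which is the property needed for languages with free word order.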

 