Latent Dirichlet Allocation Flashcards
LDA in 1 Sentence
A generative, probabilistic topic-modeling algorithm that assumes each document is a mixture (distribution) over topics, and each topic is a distribution over words.
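The generative story behind that sentence can be sketched directly. This is a minimal, stdlib-only illustration with made-up values: the vocabulary, topic count, and hyperparameters are all invented for the example, and the Dirichlet draw is built from Gamma samples since `random` has no Dirichlet function.

```python
import random

random.seed(0)

def dirichlet(alphas):
    """Sample from a Dirichlet distribution via normalized Gamma draws."""
    draws = [random.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def sample(weights):
    """Draw an index proportionally to the given weights."""
    r = random.random() * sum(weights)
    acc = 0.0
    for i, w in enumerate(weights):
        acc += w
        if acc >= r:
            return i
    return len(weights) - 1

# Toy setup: 3 topics over a 5-word vocabulary (all values illustrative).
vocab = ["ball", "game", "vote", "law", "tax"]
K, alpha, beta = 3, 0.5, 0.1

# Each topic is a distribution over words, drawn from Dirichlet(beta).
topics = [dirichlet([beta] * len(vocab)) for _ in range(K)]

# Generate one document: draw its topic mixture, then one topic per word slot.
theta = dirichlet([alpha] * K)   # this document's distribution over topics
doc = []
for _ in range(8):
    z = sample(theta)            # pick a topic for this word slot
    w = sample(topics[z])        # pick a word from that topic
    doc.append(vocab[w])
print(doc)
```

Inference in LDA runs this story in reverse: given only the documents, recover plausible `topics` and per-document `theta` values.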
LDA Steps
1) Randomly assign each word in each document to one of K topics
2) Based on the current word-topic assignments (initially random, so roughly uniform), calculate p(topic|document) and p(word|topic)
3) Reassign word W to a new topic with probability proportional to p(topic|document) * p(word|topic) – essentially the probability that the topic generated word W
4) Repeat for N iterations
In other words, at each step we assume that all topic assignments except the current word's are correct, and then update the current word's assignment using our model of how documents are generated.
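The steps above are essentially collapsed Gibbs sampling, which can be sketched end-to-end in plain Python. The corpus, K, and hyperparameters below are invented for illustration; the smoothing with alpha and beta matches the priors described on the next cards.

```python
import random
from collections import defaultdict

random.seed(1)

# Toy corpus: each document is a list of word tokens (all illustrative).
docs = [
    ["ball", "game", "ball", "score"],
    ["vote", "law", "vote", "tax"],
    ["game", "score", "ball", "game"],
    ["law", "tax", "law", "vote"],
]
vocab = sorted({w for d in docs for w in d})
K, alpha, beta = 2, 0.5, 0.1
V = len(vocab)

# Step 1: random initial topic assignment for every word occurrence.
z = [[random.randrange(K) for _ in doc] for doc in docs]

# Count tables from which p(topic|document) and p(word|topic) are computed.
doc_topic = [[0] * K for _ in docs]
topic_word = [defaultdict(int) for _ in range(K)]
topic_total = [0] * K
for d, doc in enumerate(docs):
    for i, w in enumerate(doc):
        t = z[d][i]
        doc_topic[d][t] += 1
        topic_word[t][w] += 1
        topic_total[t] += 1

# Steps 2-4: repeatedly resample each token's topic proportionally to
# p(topic|document) * p(word|topic), holding all other assignments fixed.
for _ in range(50):
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = z[d][i]
            # Remove the current word's assignment from the counts.
            doc_topic[d][t] -= 1
            topic_word[t][w] -= 1
            topic_total[t] -= 1
            # Smoothed p(topic|doc) * p(word|topic) for each topic.
            weights = [
                (doc_topic[d][k] + alpha) *
                (topic_word[k][w] + beta) / (topic_total[k] + beta * V)
                for k in range(K)
            ]
            # Sample the new topic proportionally to those weights.
            r = random.random() * sum(weights)
            acc, new_t = 0.0, K - 1
            for k, wt in enumerate(weights):
                acc += wt
                if acc >= r:
                    new_t = k
                    break
            # Record the new assignment.
            z[d][i] = new_t
            doc_topic[d][new_t] += 1
            topic_word[new_t][w] += 1
            topic_total[new_t] += 1

# Inspect the most frequent word in each learned topic.
for k in range(K):
    print(k, max(topic_word[k], key=topic_word[k].get))
```

On this tiny corpus the sampler typically separates the sports-like words from the politics-like words into the two topics, though with so little data any single run can vary.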
Alpha (LDA)
The Dirichlet prior on the per-document topic distribution.
A high alpha indicates each document contains a mixture of most of the topics; a low alpha indicates each document is dominated by just a few topics.
Beta (LDA)
The Dirichlet prior on the per-topic word distribution.
A high beta indicates each topic spreads its probability over most of the words; a low beta indicates each topic concentrates on a few words.
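The effect of these priors is easy to demonstrate numerically. The sketch below (stdlib only, with an arbitrary K and draw count chosen for illustration) samples Dirichlet vectors at a low and a high concentration and compares how peaked they are: the average size of the largest component is near 1.0 when one topic dominates, and near 1/K when the mixture is flat.

```python
import random

random.seed(0)

def dirichlet(alphas):
    """Dirichlet sample via normalized Gamma draws (stdlib only)."""
    draws = [random.gammavariate(a, 1.0) for a in alphas]
    s = sum(draws)
    return [d / s for d in draws]

K = 10    # number of topics (illustrative)
n = 200   # number of sampled distributions to average over

def avg_max(alpha):
    """Average largest component over n draws from Dirichlet(alpha)."""
    return sum(max(dirichlet([alpha] * K)) for _ in range(n)) / n

sparse = avg_max(0.1)   # low alpha: each document concentrates on few topics
flat = avg_max(10.0)    # high alpha: each document mixes most topics evenly
print(round(sparse, 2), round(flat, 2))
```

The same experiment read with beta instead of alpha describes topics over words: low beta gives topics peaked on a few words, high beta gives topics that spread across the vocabulary.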
Coherence Score
A method for assessing the quality of the learned topics. It scores each topic by summing a co-occurrence-based probability score over pairs (or triplets, or quadruplets) of that topic's top words.
It can help with choosing K (the number of topics): compute topic coherence on a held-out data set for a range of K and look for where coherence peaks or levels off.
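One simple variant of this idea is UMass-style coherence, sketched below with stdlib Python on an invented toy corpus. This is a simplified illustration (the standard formulation orders a topic's top words by frequency before pairing them): for each pair of top words it adds log((D(w1, w2) + 1) / D(w2)), where D counts documents containing the word(s), so topics whose top words co-occur score higher.

```python
import math
from itertools import combinations

# Toy corpus of documents as word sets (all illustrative).
docs = [
    {"ball", "game", "score"},
    {"ball", "game", "team"},
    {"vote", "law", "tax"},
    {"vote", "law", "court"},
]

def doc_freq(word):
    """Number of documents containing the word."""
    return sum(1 for d in docs if word in d)

def co_doc_freq(w1, w2):
    """Number of documents containing both words."""
    return sum(1 for d in docs if w1 in d and w2 in d)

def coherence(top_words):
    """Sum of smoothed log co-occurrence probabilities over word pairs."""
    score = 0.0
    for w1, w2 in combinations(top_words, 2):
        score += math.log((co_doc_freq(w1, w2) + 1) / doc_freq(w2))
    return score

good = coherence(["ball", "game", "score"])  # top words that co-occur
bad = coherence(["ball", "law", "score"])    # top words from mixed topics
print(round(good, 2), round(bad, 2))
```

Sweeping K, fitting a model per value, and averaging this score over topics on held-out documents gives the curve used to pick K.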