B05 Topic Modeling Flashcards
What is Topic Modeling?
Topic modeling is an analytic method for identifying the topics that best describe the information in a collection of documents.
What is a Topic?
- A topic is a label for a collection of words that occur together.
- Applied to documents, it gives us a quick idea of what a document is about.
- For example, we can infer that a document that has a high frequency of words like “rain”, “storm”, “snow”, “ice”, “winds” is a document about “weather”.
Topic modeling provides us with methods that allow us to:
- Discover the hidden topical patterns that are present across a collection of documents.
- Annotate the documents according to these topics.
- Use these annotations to organize, search and summarize the collection.
What are mixture models?
Mixture models are probabilistic models for representing the presence of sub-populations within an overall population, without requiring that an observed dataset identify the sub-population to which an individual observation belongs.
Topic models are mixture models.
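To make the idea concrete, here is a minimal sketch in base R (the means, spreads and mixing weights are made up for illustration) that draws observations from a two-component Gaussian mixture; the component label is latent, just as topic assignments are latent in a topic model.

```r
set.seed(42)
n <- 1000
# Latent sub-population label: drawn here, but not part of the "observed" data.
component <- sample(1:2, n, replace = TRUE, prob = c(0.3, 0.7))
# Each observation comes from the Gaussian of its (hidden) component.
x <- ifelse(component == 1,
            rnorm(n, mean = -2, sd = 0.5),   # sub-population 1
            rnorm(n, mean =  3, sd = 1.0))   # sub-population 2
hist(x, breaks = 50, main = "Mixture of two hidden sub-populations")
```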
Several approaches to Topic Modeling include:
- TextRank
- Latent Semantic Indexing (LSI)
- Probabilistic Latent Semantic Analysis (pLSA)
- Latent Dirichlet Allocation (LDA)
Define Latent Dirichlet Allocation
- Latent Dirichlet Allocation (LDA) is a “generative probabilistic model” that allows sets of observations to be explained by unobserved (latent) groups that explain why some parts of the data are similar.
- Applied to topic modeling, LDA assumes that the topics within a document and the words within a topic follow a Dirichlet distribution.
What are Dirichlet Distributions?
- A family of continuous multivariate probability distributions parameterized by a vector of positive reals, α.
- The values of the distribution are sampled over a probability simplex (numbers that add up to 1). For example: (0.6, 0.4), (0.1, 0.1, 0.8) or (0.05, 0.2, 0.15, 0.1, 0.3, 0.2).
- The form of the distribution changes based on the value of the input parameter α.
What assumptions does Latent Dirichlet Allocation make?
- Documents with similar topics will use similar groups of words.
- Every document is a mixture of topics.
- Every topic is a mixture of words.
What does LDA do?
Given a corpus of documents, in order to identify the k topics in each document and the word distribution for each topic, LDA backtracks from the document level to identify the words and topics that are likely to have generated the corpus.
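The generative story that LDA inverts can be sketched directly. Below is a minimal, hypothetical example in R (toy vocabulary, arbitrary Dirichlet parameters, gtools assumed installed) that generates one document the way LDA assumes documents arise:

```r
library(gtools)
set.seed(7)
vocab <- c("rain", "storm", "snow", "goal", "score", "team")
k <- 2
# Each topic is a Dirichlet-distributed distribution over the vocabulary.
beta <- rdirichlet(k, alpha = rep(0.5, length(vocab)))
# Each document gets a Dirichlet-distributed mixture of topics.
theta <- rdirichlet(1, alpha = rep(0.5, k))
doc_len <- 10
# For every word slot: pick a topic from theta, then a word from that topic.
z <- sample(1:k, doc_len, replace = TRUE, prob = theta[1, ])
words <- sapply(z, function(t) sample(vocab, 1, prob = beta[t, ]))
words  # one generated "document"; fitting LDA backtracks to recover beta and theta
```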
What does the LDA() function do?
- Takes “k” and a document-term matrix (DTM) as input.
- Returns a fitted model containing two matrices: the prevalence of topics in documents (gamma) and the probability of words belonging to topics (beta).
- The supported estimation methods are Gibbs sampling and VEM (variational expectation-maximization).
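A minimal sketch of this in R, assuming the tm and topicmodels packages are installed and using a toy corpus (the documents, k and seed are arbitrary):

```r
library(tm)
library(topicmodels)
docs <- c("rain storm snow ice winds",
          "snow ice storm rain",
          "goal score team win match",
          "team match score win")
dtm <- DocumentTermMatrix(Corpus(VectorSource(docs)))
lda_model <- LDA(dtm, k = 2, method = "Gibbs", control = list(seed = 1234))
post <- posterior(lda_model)
post$topics  # gamma: prevalence of each topic in each document
post$terms   # beta: probability of each word under each topic
```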
How do you find the best number of topics or best value of K?
Several approaches:
- Topic coherence.
- Quantitative measures of fit:
  - Log-likelihood
  - Perplexity
What is topic coherence?
- Evaluate the words in the topic to see if they make sense.
- High coherence is evident when all the words that make up the topic are relevant to the topic.
- Low coherence is evident when most of the words within a topic do not seem relevant. For example, if the expected topic is “weather”, then the word list “rain”, “rice”, “sand”, “popsicle” has low coherence.
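With topicmodels, the top terms per topic can be listed for exactly this kind of eyeballing (continuing with the lda_model fitted in the earlier sketch):

```r
terms(lda_model, 5)  # five most probable words per topic: do they hang together?
```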
What is Log-likelihood?
- This measures how plausible the model parameters are given the input data.
- The values are negative.
- The larger (i.e., closer to zero) the value, the better the fit.
What is Perplexity?
- This measures the model’s “surprise” when presented with new data.
- All values are positive.
- The smaller the number, the better the perplexity.
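Both measures are available in topicmodels; a minimal sketch, continuing with the lda_model and dtm from the earlier example:

```r
as.numeric(logLik(lda_model))  # log-likelihood: negative, closer to zero is better
perplexity(lda_model, dtm)     # perplexity: positive, smaller is better
```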
Method for finding the best K
To find the best value of “k” using the quantitative measures of fit, we:
- Fit a model for several values of “k”.
- Plot the values for both log-likelihood and perplexity.
- Pick the “k” value that corresponds to the “elbow”, where further increases in “k” yield diminishing improvements.
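A minimal sketch of this sweep, reusing the toy dtm from the earlier topicmodels example (the candidate range for “k” is an arbitrary choice):

```r
ks <- 2:6
fits <- lapply(ks, function(k) LDA(dtm, k = k, method = "Gibbs",
                                   control = list(seed = 1234)))
ll <- sapply(fits, function(m) as.numeric(logLik(m)))
px <- sapply(fits, function(m) perplexity(m, dtm))
par(mfrow = c(1, 2))
plot(ks, ll, type = "b", xlab = "k", ylab = "log-likelihood")
plot(ks, px, type = "b", xlab = "k", ylab = "perplexity")
# Choose the k at the "elbow", where further increases stop paying off.
```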