B05 Topic Modeling Flashcards
What is Topic Modeling?
Topic modeling is an analytic method for identifying the topics that best describe the information in a collection of documents.
What is a Topic?
- A topic is a label for a collection of words that occur together.
- Applied to documents, it gives us a quick idea of what a document is about.
- For example, we can infer that a document that has a high frequency of words like “rain”, “storm”, “snow”, “ice”, “winds” is a document about “weather”.
Topic modeling provides us with methods that allow us to:
- Discover the hidden topical patterns that are present across a collection of documents.
- Annotate the documents according to these topics.
- Use these annotations to organize, search and summarize the collection.
What are mixture models?
Mixture models are probabilistic models for representing the presence of sub-populations within an overall population, without requiring that an observed dataset identify the sub-population to which an individual observation belongs.
Topic models are mixture models.
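To make the idea concrete, here is a minimal sketch in base R (the means, spreads and mixing weights are made up for illustration) that draws observations from a two-component Gaussian mixture; the component label is latent, just as topic assignments are latent in a topic model.

```r
set.seed(42)
n <- 1000
# Latent sub-population label: drawn here, but not part of the "observed" data.
component <- sample(1:2, n, replace = TRUE, prob = c(0.3, 0.7))
# Each observation comes from the Gaussian of its (hidden) component.
x <- ifelse(component == 1,
            rnorm(n, mean = -2, sd = 0.5),   # sub-population 1
            rnorm(n, mean =  3, sd = 1.0))   # sub-population 2
hist(x, breaks = 50, main = "Mixture of two hidden sub-populations")
```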
Several approaches to Topic Modeling include:
- TextRank
- Latent Semantic Indexing (LSI)
- Probabilistic Latent Semantic Analysis (pLSA)
- Latent Dirichlet Allocation (LDA)
Define Latent Dirichlet Allocation
- Latent Dirichlet Allocation (LDA) is a “generative probabilistic model” that allows sets of observations to be explained by unobserved (latent) groups that explain why some parts of the data are similar.
- Applied to topic modeling, LDA assumes that the topics within a document and the words within a topic follow a Dirichlet distribution.
What are Dirichlet Distributions?
- A family of continuous multivariate probability distributions parameterized by a vector of positive reals, α.
- The values of the distribution are sampled over a probability simplex (numbers that add up to 1). For example: (0.6, 0.4), (0.1, 0.1, 0.8) or (0.05, 0.2, 0.15, 0.1, 0.3, 0.2).
- The form of the distribution changes based on the value of the input parameter α.
What assumptions does Latent Dirichlet Allocation make?
- Documents with similar topics will use similar groups of words.
- Every document is a mixture of topics.
- Every topic is a mixture of words.
What does LDA do?
Given a corpus of documents, in order to identify the k topics in each document and the word distribution for each topic, LDA backtracks from the document level to identify the words and topics that are likely to have generated the corpus.
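The generative story that LDA inverts can be sketched directly. Below is a minimal, hypothetical example in R (toy vocabulary, arbitrary Dirichlet parameters, gtools assumed installed) that generates one document the way LDA assumes documents arise:

```r
library(gtools)
set.seed(7)
vocab <- c("rain", "storm", "snow", "goal", "score", "team")
k <- 2
# Each topic is a Dirichlet-distributed distribution over the vocabulary.
beta <- rdirichlet(k, alpha = rep(0.5, length(vocab)))
# Each document gets a Dirichlet-distributed mixture of topics.
theta <- rdirichlet(1, alpha = rep(0.5, k))
doc_len <- 10
# For every word slot: pick a topic from theta, then a word from that topic.
z <- sample(1:k, doc_len, replace = TRUE, prob = theta[1, ])
words <- sapply(z, function(t) sample(vocab, 1, prob = beta[t, ]))
words  # one generated "document"; fitting LDA backtracks to recover beta and theta
```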
What does the LDA() function do?
- Takes “k” and a document-term matrix (DTM) as input.
- Returns a fitted model containing two matrices: the prevalence of topics in documents (gamma) and the probability of words belonging to topics (beta).
- The supported estimation methods are Gibbs sampling and VEM (variational expectation-maximization).
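A minimal sketch of this in R, assuming the tm and topicmodels packages are installed and using a toy corpus (the documents, k and seed are arbitrary):

```r
library(tm)
library(topicmodels)
docs <- c("rain storm snow ice winds",
          "snow ice storm rain",
          "goal score team win match",
          "team match score win")
dtm <- DocumentTermMatrix(Corpus(VectorSource(docs)))
lda_model <- LDA(dtm, k = 2, method = "Gibbs", control = list(seed = 1234))
post <- posterior(lda_model)
post$topics  # gamma: prevalence of each topic in each document
post$terms   # beta: probability of each word under each topic
```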
How do you find the best number of topics or best value of K?
Several approaches:
- Topic coherence.
- Quantitative measures of fit:
  - Log-likelihood
  - Perplexity
What is topic coherence?
- Evaluate the words in the topic to see if they make sense.
- High coherence is evident when all the words that make up the topic are relevant to the topic.
- Low coherence is evident when most of the words within a topic do not seem relevant. For example, if the expected topic is “weather”, then the word list “rain”, “rice”, “sand”, “popsicle” has low coherence.
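With topicmodels, the top terms per topic can be listed for exactly this kind of eyeballing (continuing with the lda_model fitted in the earlier sketch):

```r
terms(lda_model, 5)  # five most probable words per topic: do they hang together?
```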
What is Log-likelihood?
- This measures how plausible the model parameters are given the input data.
- The values are negative.
- The larger (i.e., closer to zero) the value, the better the fit.
What is Perplexity?
- This measures the model’s “surprise” when presented with new data.
- All values are positive.
- The smaller the number, the better the perplexity.
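Both measures are available in topicmodels; a minimal sketch, continuing with the lda_model and dtm from the earlier example:

```r
as.numeric(logLik(lda_model))  # log-likelihood: negative, closer to zero is better
perplexity(lda_model, dtm)     # perplexity: positive, smaller is better
```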
Method for finding the best K
To find the best value of “k” using the quantitative measures of fit, we:
- Fit a model for several values of “k”.
- Plot the values for both log-likelihood and perplexity.
- Pick the “k” value that corresponds to the “elbow”, where further increases in “k” yield diminishing improvements.
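A minimal sketch of this sweep, reusing the toy dtm from the earlier topicmodels example (the candidate range for “k” is an arbitrary choice):

```r
ks <- 2:6
fits <- lapply(ks, function(k) LDA(dtm, k = k, method = "Gibbs",
                                   control = list(seed = 1234)))
ll <- sapply(fits, function(m) as.numeric(logLik(m)))
px <- sapply(fits, function(m) perplexity(m, dtm))
par(mfrow = c(1, 2))
plot(ks, ll, type = "b", xlab = "k", ylab = "log-likelihood")
plot(ks, px, type = "b", xlab = "k", ylab = "perplexity")
# Choose the k at the "elbow", where further increases stop paying off.
```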