B05 Topic Modeling Flashcards

1
Q

What is Topic Modeling?

A
Topic modeling is an analytic method for identifying the topics that best describe the information in a collection of documents.
2
Q

What is a Topic?

A
  • A topic is a label for a collection of words that occur
    together.
  • Applied to documents, it gives us a quick idea of what a document is about.
  • For example, we can infer that a document that has a high frequency of words like “rain”, “storm”, “snow”, “ice”, “winds” is a document about “weather”.
3
Q

Topic modeling provides us with methods that allow us to:

A
  • Discover the hidden topical patterns that are present across a collection of documents.
  • Annotate the documents according to these topics.
  • Use these annotations to organize, search and summarize the collection.
4
Q

What are mixture models?

A

Mixture models are probabilistic models for representing the presence of sub-populations within an overall population, without requiring that each observation in the dataset be labeled with the sub-population it belongs to.
Topic models are mixture models.

5
Q

Several approaches to Topic Modeling include:

A
  • TextRank
  • Latent Semantic Indexing (LSI)
  • Probabilistic Latent Semantic Analysis (pLSA)
  • Latent Dirichlet Allocation (LDA)
6
Q

Define Latent Dirichlet Allocation

A
  • Latent Dirichlet Allocation (LDA) is a “generative probabilistic model” that allows sets of observations to be explained by unobserved (latent) groups that explain why some parts of the data are similar.
  • Applied to topic modeling, LDA assumes that the topics within a document and the words within a topic follow a Dirichlet distribution.
7
Q

What are Dirichlet Distributions?

A
  • A family of continuous multivariate probability distributions parameterized by a vector α of positive reals.
  • Samples from the distribution lie on a probability simplex (vectors of non-negative numbers that add up to 1). For example: (0.6, 0.4), (0.1, 0.1, 0.8) or (0.05, 0.2, 0.15, 0.1, 0.3, 0.2).
  • The shape of the distribution changes based on the value of the parameter α.
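As a quick illustration, one standard way to draw a sample from a Dirichlet distribution is to normalize independent Gamma draws. The sketch below uses only Python's standard library; the function name `sample_dirichlet` is illustrative, not from any particular package:

```python
import random

def sample_dirichlet(alpha):
    """Draw one sample from a Dirichlet distribution with parameter vector alpha.

    Uses the standard construction: draw independent Gamma(alpha_i, 1)
    variates and normalize them so they sum to 1, giving a point
    on the probability simplex.
    """
    draws = [random.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

random.seed(42)
sample = sample_dirichlet([0.1, 0.1, 0.8])
print(sample)        # three non-negative numbers
print(sum(sample))   # ~1.0 (up to floating-point rounding)
```

Small α values (e.g. 0.1) tend to produce samples concentrated on a few components; large values spread the mass more evenly.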
8
Q

What assumptions does Latent Dirichlet Allocation make?

A
  • Documents with similar topics will use similar groups of
    words.
  • Every document is a mixture of topics.
  • Every topic is a mixture of words.
9
Q

What does LDA do?

A

Given a corpus of documents, LDA works backwards from the observed words: to identify the k topics in each document and the word distribution for each topic, it infers the topic and word assignments that are most likely to have generated the corpus.
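The generative story that LDA inverts can be sketched as a toy simulation. This is the forward direction only (generation, not inference), and all names here (`generate_corpus`, `doc_alpha`, `topic_alpha`) are illustrative:

```python
import random

def generate_corpus(n_docs, doc_len, k, vocab, doc_alpha=0.5, topic_alpha=0.1):
    """Toy sketch of LDA's generative story: each document gets a
    Dirichlet-sampled topic mixture; each word slot picks a topic,
    then picks a word from that topic's word distribution."""
    def dirichlet(alpha, n):
        g = [random.gammavariate(alpha, 1.0) for _ in range(n)]
        s = sum(g)
        return [x / s for x in g]

    # one word distribution per topic (analogous to the "beta" matrix)
    topic_word = [dirichlet(topic_alpha, len(vocab)) for _ in range(k)]
    corpus = []
    for _ in range(n_docs):
        theta = dirichlet(doc_alpha, k)  # per-document topic mixture
        doc = []
        for _ in range(doc_len):
            z = random.choices(range(k), weights=theta)[0]          # pick a topic
            w = random.choices(vocab, weights=topic_word[z])[0]     # pick a word
            doc.append(w)
        corpus.append(doc)
    return corpus

random.seed(0)
corpus = generate_corpus(3, 10, 2, ["rain", "storm", "game", "score"])
print(corpus)
```

Inference (Gibbs sampling or variational methods) runs this process in reverse, recovering `theta` and `topic_word` from the observed documents alone.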

10
Q

What does the LDA() function do?

A
  • Takes “k” and a document-term matrix (DTM) as input.
  • Returns two matrices - prevalence of topics in documents (gamma) and probability of words belonging to topics (beta).
  • Methods supported are either Gibbs Sampling or VEM.
11
Q

How do you find the best number of topics or best value of K?

A

Several approaches:

  • Topic coherence.
  • Quantitative measures of fit:
    - Log-likelihood
    - Perplexity

12
Q

What is topic coherence?

A
  • Evaluate the words in the topic to see if they make sense.
  • High coherence is evident when all the words that make up the topic are relevant to the topic.
  • Low coherence is evident when most of the words within a topic do not seem relevant. For example, if the expected topic is “weather”, then the words “rain”, “rice”, “sand”, “popsicle” have low coherence.
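A rough, hand-rolled illustration of scoring coherence from document co-occurrence counts, loosely following the UMass coherence idea. This is a simplified toy, not the exact measure implemented by any particular library:

```python
import math
from itertools import combinations

def umass_coherence(topic_words, documents, eps=1.0):
    """Toy UMass-style coherence: for each word pair in the topic,
    add log((co-document-frequency + eps) / document-frequency).
    Words that appear together often score higher."""
    docs = [set(d) for d in documents]
    def df(*words):
        # number of documents containing all the given words
        return sum(1 for d in docs if all(w in d for w in words))
    score = 0.0
    for wi, wj in combinations(topic_words, 2):
        denom = df(wj)
        if denom:
            score += math.log((df(wi, wj) + eps) / denom)
    return score

docs = [["rain", "storm", "snow"], ["rain", "storm"], ["rice", "popsicle"]]
print(umass_coherence(["rain", "storm"], docs))  # positive: words co-occur
print(umass_coherence(["rain", "rice"], docs))   # lower: words never co-occur
```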
13
Q

What is Log-likelihood?

A
  • This measures how plausible the model parameters are given the input data.
  • The values are negative.
  • The larger the value (i.e., the closer to zero), the better the fit.
14
Q

What is Perplexity?

A
  • This measures the model’s “surprise” when presented with new data.
  • All values are positive.
  • The smaller the value, the better the fit.
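Assuming the common definition perplexity = exp(−log-likelihood / N) over N held-out tokens, the relationship between the two measures can be sketched as:

```python
import math

def perplexity(log_likelihood, n_tokens):
    """Convert a (negative) total log-likelihood over n_tokens words
    into perplexity: a positive 'surprise' score where lower is better.
    perplexity = exp(-LL / N)."""
    return math.exp(-log_likelihood / n_tokens)

# A model assigning each of 1000 tokens an average probability of 1/50
# has log-likelihood 1000 * log(1/50) and perplexity 50.
ll = 1000 * math.log(1 / 50)
print(perplexity(ll, 1000))  # ~50.0
```

This makes the two cards consistent: a larger (less negative) log-likelihood gives a smaller perplexity.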
15
Q

Method for finding the best K

A

To find the best value of “k” using the quantitative measures of fit, we:

  • Fit a model for several values of “k”.
  • Plot the values for both log-likelihood and perplexity.
  • Pick the “k” value that corresponds with the “elbow”.
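A crude way to automate the "elbow" step is to pick the k where the fit curve bends most sharply (the largest second difference). This is an illustrative heuristic only; in practice you would still eyeball the plot:

```python
def pick_elbow(ks, scores):
    """Pick the 'elbow' of a decreasing fit curve (e.g. perplexity vs k)
    as the k with the largest change in slope (maximum second difference)."""
    best_k, best_bend = ks[1], float("-inf")
    for i in range(1, len(ks) - 1):
        bend = (scores[i - 1] - scores[i]) - (scores[i] - scores[i + 1])
        if bend > best_bend:
            best_k, best_bend = ks[i], bend
    return best_k

# Perplexity drops steeply until k=4, then flattens: the elbow is at 4.
print(pick_elbow([2, 4, 6, 8, 10], [100.0, 60.0, 55.0, 53.0, 52.0]))  # 4
```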
16
Q

Strengths of LDA?

A
  • Effective tool for topic modeling.
  • Easy to understand conceptually.
  • Has been shown to produce good results over many domains.
  • There are new applications every day.
17
Q

Weaknesses of LDA?

A
  • Must know the number of topics in advance.
  • The Dirichlet topic distribution cannot capture correlations among topics.
18
Q

Some applications of Topic Modeling:

A
  • Used as classifiers for named entity recognition.
  • Topic models can be used as a form of soft clustering
    (assigns probabilities instead of binary assignments).
  • Using the counts of events as input, topic models can be used to segment customers based on event attendance.
  • Topic models can also be used to extract shipping route patterns by looking at coordinates from ship logs.