Topic Model 9 Flashcards

1
Q

What are topic models?

A
• Topic models can help you automatically discover patterns in a corpus
• Automatically group topically-related words into “topics”
• Associate tokens and documents with those topics
• Unsupervised learning (does not require labeled training data!)
2
Q

What is a Topic?

A

A grouping of words that are likely to appear in the same context.
• A hidden structure that helps determine what words are likely to appear in a corpus
• e.g. if “war” and “military” appear in a document, you probably won’t be surprised to find that “troops” appears later on
• Captures long-range context (rather than local dependencies like n-grams or syntax)

3
Q

What is the goal of topic modelling?

A

The goal of topic modelling is to uncover the latent variables (topics) that shape the meaning of our documents and corpus.

4
Q

What are the two basic assumptions of topic models?

A
• each document consists of a mixture of topics, and
• each topic consists of a collection of words.

5
Q

Pros and cons of Latent Semantic Analysis?

A

Pros:
• Easy to understand and implement
• Quick and efficient
Cons:
• Lack of interpretable embeddings (we don’t know what the topics are, and the components may be arbitrarily positive/negative)
• Need many documents to get accurate results

6
Q

Name two topic models.

A
• Latent Semantic Analysis (LSA)
• Latent Dirichlet Allocation (LDA)

7
Q

How does LSA work?

A
• The simplest topic modeling method
• Input: doc-term matrix M (BOW or TF-IDF)
• Apply truncated SVD to M to obtain a doc-topic matrix and a topic-term matrix (see the truncated SVD and word-vector cards below, and the sketch that follows)
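A minimal LSA sketch, assuming scikit-learn and a toy three-document corpus (both my own choices, not from the slides): build a TF-IDF doc-term matrix, then reduce it with truncated SVD.

```python
# Minimal LSA sketch: TF-IDF doc-term matrix + truncated SVD
# (scikit-learn and the toy corpus are illustrative assumptions, not from the slides)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "the army sent troops to the war zone",
    "the military deployed more troops",
    "the stock market rose on strong earnings",
]

vectorizer = TfidfVectorizer()
M = vectorizer.fit_transform(docs)       # doc-term matrix, shape (m, n)

t = 2                                    # number of topics (hyperparameter)
svd = TruncatedSVD(n_components=t, random_state=0)
doc_topic = svd.fit_transform(M)         # m x t doc-topic matrix (U, with singular values folded in)
topic_term = svd.components_             # t x n topic-term matrix (V)

print(doc_topic.shape, topic_term.shape)
```

Here doc_topic plays the role of the doc-topic matrix U and svd.components_ the role of the topic-term matrix V described in the later cards.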

8
Q

How does LDA work?

A
LDA models how we write a document (a generative process):
• Choose the topics we want to discuss in the document (i.e. decide a distribution of topics)
• Write sentences about the selected topics (i.e. select a word from each topic)

Go over chapter 9, slides 15 to 18. A small fitting sketch follows below.
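For concreteness, a small sketch of fitting LDA with scikit-learn's LatentDirichletAllocation on bag-of-words counts (the library choice and the toy corpus are assumptions, not from the slides):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "war military troops army battle",
    "troops deployed to the war zone",
    "stocks market earnings investors trade",
    "market investors buy stocks",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)               # bag-of-words counts (doc-term matrix)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(X)                 # per-document topic distribution (theta)
topic_term = lda.components_                     # per-topic word weights (unnormalized phi)

# Show the top words of each learned topic
vocab = vectorizer.get_feature_names_out()
for k, weights in enumerate(topic_term):
    top = weights.argsort()[::-1][:4]
    print(f"topic {k}:", [vocab[i] for i in top])
```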

9
Q

Pros and cons of Latent Dirichlet Allocation?

A
Pros:
• Works better than LSA and probabilistic LSA (pLSA)
• Generalizes to new documents easily
Cons:
• Expensive computation: Expectation-Maximization (EM) algorithm or Gibbs-sampling-based posterior estimation
• Performance is sensitive to hyper-parameters: the number of topics and the number of iterations
10
Q

How does LSA work with word vectors?

A

m: # of docs
n: size of vocabulary
t: wanted number of topics (specified by the designer)
• M: m×n, doc-term matrix
• U: m×t, doc-topic matrix
• V: t×n, topic-term matrix
• U may replace M in IR systems to represent documents
• V can be viewed as word vectors
A shape sketch follows below.
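A small numpy sketch of these shapes (the 4×5 toy count matrix and the choice to fold the singular values into U are my own illustrative assumptions):

```python
import numpy as np

# Toy doc-term count matrix: m=4 documents, n=5 vocabulary terms (values are illustrative)
M = np.array([
    [2, 1, 0, 0, 1],
    [1, 2, 0, 1, 0],
    [0, 0, 3, 1, 2],
    [0, 1, 2, 2, 1],
], dtype=float)

t = 2                                     # desired number of topics
U_full, S, V_full = np.linalg.svd(M, full_matrices=False)

U = U_full[:, :t] * S[:t]                 # m x t doc-topic matrix (singular values folded into U)
V = V_full[:t, :]                         # t x n topic-term matrix

print(U.shape)                            # (4, 2): one t-dimensional vector per document
print(V.shape)                            # (2, 5): each column is a t-dimensional word vector
```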

11
Q

How does truncated SVD work and what is it used for?

A
Truncated SVD (singular value decomposition) is a standard method for reducing matrix dimensionality.

• Factorizes any matrix M into the product of 3 separate matrices: M = USV, where S is a diagonal matrix of the singular values of M.

• Truncation: select only the t largest singular values, and keep only the first t columns of U and the first t rows of V.

• t is a hyperparameter: it reflects the number of topics we want to find (see the sketch below).
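A numpy sketch of the truncation step (the random 6×8 matrix and t=3 are purely illustrative choices of mine):

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.random((6, 8))                    # any m x n matrix (random values, purely for illustration)

t = 3                                     # number of singular values (topics) to keep
U, S, V = np.linalg.svd(M, full_matrices=False)

# Truncate: keep the t largest singular values, the first t columns of U, and the first t rows of V
U_t, S_t, V_t = U[:, :t], S[:t], V[:t, :]

M_approx = U_t @ np.diag(S_t) @ V_t       # best rank-t approximation of M
print(np.linalg.norm(M - M_approx))       # reconstruction error shrinks as t grows
```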

12
Q

What is Probabilistic Topic Modeling?

A
• Given a set of documents, we suppose:
• There is a fixed set of topics, each defined as a distribution over words (φᵢ)
• Each document draws its words from its own distribution over topics (θ)
A toy generative sketch follows below.
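A toy sketch of this generative view in numpy (the vocabulary, Dirichlet priors, and document length are my own illustrative choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["war", "troops", "military", "market", "stocks", "earnings"]
n_topics, doc_len = 2, 8

# phi: a fixed word distribution for each topic (one row per topic, each row sums to 1)
phi = rng.dirichlet(alpha=np.ones(len(vocab)) * 0.5, size=n_topics)

# theta: this document's own distribution over topics
theta = rng.dirichlet(alpha=np.ones(n_topics))

# Generate a document: for each position, pick a topic from theta, then draw a word from that topic's phi
doc = []
for _ in range(doc_len):
    k = rng.choice(n_topics, p=theta)
    w = rng.choice(len(vocab), p=phi[k])
    doc.append(vocab[w])
print(doc)
```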