Topic Models (Chapter 9) Flashcards
What are topic models?
- Topic models can help you automatically discover patterns in a corpus
- Automatically group topically-related words into “topics”
- Associate tokens and documents with those topics
- Unsupervised learning (does not require training data!)
What is a Topic?
a grouping of words that are likely to appear in the same context
• A hidden structure that helps determine what words are likely to appear in a corpus
• e.g. if “war” and “military” appear in a document, you probably won’t be surprised to find that “troops” appears later on
• long-range context (rather than local dependencies like n-grams, syntax)
What is the goal of topic modelling?
The goal of topic modelling is to uncover these latent variables — topics — that shape the meaning of our document and corpus.
What are the two basic assumptions of topic models?
- each document consists of a mixture of topics, and
- each topic consists of a collection of words.
Pros and cons of Latent Semantic Analysis?
Pros:
• Easy to understand and implement
• Quick and efficient
Cons:
• Lack of interpretable embeddings (we don’t know what the topics are, and the components may be arbitrarily positive/negative)
• Need many documents to get accurate results
Name two topic models.
- Latent Semantic Analysis (LSA)
- Latent Dirichlet Allocation (LDA)
How does LSA work?
- The simplest topic modeling method
- Input: doc-term matrix M (BOW or TF-IDF)
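As a concrete illustration of the doc-term matrix input, here is a minimal bag-of-words construction; the corpus and vocabulary are invented for this sketch:

```python
from collections import Counter

# Toy corpus -- invented for illustration
docs = [
    "war military troops war",
    "music band concert",
    "military band music",
]

# Shared vocabulary (n terms) across all documents
vocab = sorted({w for d in docs for w in d.split()})

# Doc-term matrix M: one row per document (m rows), one count column per term (n columns)
M = [[Counter(d.split())[w] for w in vocab] for d in docs]

print(vocab)  # ['band', 'concert', 'military', 'music', 'troops', 'war']
print(M[0])   # [0, 0, 1, 0, 1, 2]
```

A TF-IDF variant would reweight these raw counts, but the matrix shape (documents × terms) is the same.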
How does LDA work?
How do we write a document?
• Choose the topics we want to discuss in the document (i.e. decide a distribution of topics)
• Write sentences about the selected topics (i.e. select words from each topic)
Go over chapter 9, slides 15 to 18
Pros and cons of Latent Dirichlet Allocation?
Pros:
• Works better than LSA and probabilistic LSA (pLSA)
• Generalizes to new documents easily
Cons:
• Expensive computation: Expectation-Maximization (EM) algorithm or Gibbs-sampling-based posterior estimation
• Performance is sensitive to hyperparameters: number of topics and number of iterations
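A minimal fitting sketch using scikit-learn's LatentDirichletAllocation (which estimates the posterior with variational EM rather than Gibbs sampling); the count matrix and the choice of 2 topics are invented for illustration:

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Toy doc-term count matrix (rows: documents, columns: terms) -- invented data
M = np.array([
    [2, 1, 1, 0, 0, 0],
    [0, 0, 0, 2, 1, 1],
    [2, 2, 0, 0, 0, 1],
    [0, 0, 1, 2, 2, 0],
])

# n_components is the number of topics -- one of the sensitive hyperparameters
lda = LatentDirichletAllocation(n_components=2, max_iter=50, random_state=0)
doc_topic = lda.fit_transform(M)   # per-document topic mixture (rows sum to 1)
topic_term = lda.components_       # per-topic term weights

print(doc_topic.shape)   # (4, 2)
print(topic_term.shape)  # (2, 6)
```

Increasing `max_iter` trades computation time for a better posterior estimate, which is the cost/sensitivity trade-off noted above.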
How does LSA work with word vectors?
m: # of docs
n: size of vocabulary
t: wanted number of topics (specified by the designer)
• M: m×n, doc-term matrix
• U: m×t, doc-topic matrix
• V: t×n, topic-term matrix
• U may replace M in IR systems to represent documents
• V can be viewed as word vectors
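The U and V matrices above can be obtained with scikit-learn's TruncatedSVD; the toy corpus and the choice of t = 2 are assumptions for this sketch:

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus -- invented for illustration
docs = [
    "war military troops",
    "music band concert",
    "war troops battle",
    "band music song",
]

vectorizer = TfidfVectorizer()
M = vectorizer.fit_transform(docs)   # m x n doc-term matrix (TF-IDF weighted)

t = 2                                # wanted number of topics
svd = TruncatedSVD(n_components=t, random_state=0)
U = svd.fit_transform(M)             # m x t doc-topic matrix
V = svd.components_                  # t x n topic-term matrix

print(U.shape, V.shape)              # (4, 2) (2, 8)
```

Each row of U is a t-dimensional representation of a document (usable in IR instead of its sparse row of M), and each column of V gives a t-dimensional vector for a vocabulary word.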
How does truncated SVD work and what is it used for?
Truncated SVD (singular value decomposition) is a standard method for reducing matrix dimensionality.
• Factorizes any matrix M into the product of 3 separate matrices: M = USV, where S is a diagonal matrix of the singular values of M.
• Select only the t largest singular values, keeping only the first t columns of U and the first t rows of V.
• t is a hyperparameter: it reflects the number of topics we want to find.
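A numpy sketch of the truncation step described above; the matrix values and t = 2 are invented for illustration:

```python
import numpy as np

# Toy 4x5 doc-term matrix -- invented values
M = np.array([
    [2., 1., 0., 0., 0.],
    [1., 2., 1., 0., 0.],
    [0., 0., 0., 2., 1.],
    [0., 0., 1., 1., 2.],
])

# Full decomposition: M = U @ diag(S) @ V, singular values in descending order
U, S, V = np.linalg.svd(M, full_matrices=False)

# Keep only the t largest singular values:
# the first t columns of U and the first t rows of V
t = 2
U_t, S_t, V_t = U[:, :t], S[:t], V[:t, :]

# Best rank-t approximation of M
M_t = U_t @ np.diag(S_t) @ V_t
print(M_t.shape)   # (4, 5) -- same shape as M, but rank at most t
```

The approximation M_t keeps the directions of greatest variance, which is what makes the first t components usable as topics.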
What is Probabilistic Topic Modeling?
- Given a set of documents, we suppose:
- there is a fixed set of topics, each of which is a distribution over words (φ)
- each document draws words from its own distribution over topics (θ)
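Under those two assumptions, generating a document can be sketched with numpy; the vocabulary, topic distributions φ, and document mixture θ are all invented for this example:

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["war", "troops", "military", "music", "band", "concert"]

# phi: one word distribution per topic (each row sums to 1) -- invented values
phi = np.array([
    [0.5, 0.3, 0.2, 0.0, 0.0, 0.0],   # topic 0: "war"
    [0.0, 0.0, 0.0, 0.4, 0.4, 0.2],   # topic 1: "music"
])

# theta: this document's mixture over topics -- invented values
theta = np.array([0.7, 0.3])

# For each word slot: draw a topic from theta, then a word from that topic's phi
words = []
for _ in range(10):
    z = rng.choice(len(theta), p=theta)    # topic assignment for this slot
    w = rng.choice(len(vocab), p=phi[z])   # word drawn from that topic
    words.append(vocab[w])

print(words)
```

Topic modeling inverts this process: given only the generated words, it estimates the hidden φ and θ.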