Topic Models (Chapter 9) Flashcards
What are topic models?
- Topic models can help you automatically discover patterns in a corpus
- Automatically group topically-related words into “topics”
- Associate tokens and documents with those topics
- Unsupervised learning (does not require training data!)
What is a Topic?
a grouping of words that are likely to appear in the same context
• A hidden structure that helps determine what words are likely to appear in a corpus
• e.g. if “war” and “military” appear in a document, you probably won’t be surprised to find that “troops” appears later on
• long-range context (rather than local dependencies like n-grams, syntax)
What is the goal of topic modelling?
The goal of topic modelling is to uncover these latent variables — topics — that shape the meaning of our document and corpus.
What are the two basic assumptions of topic models?
- each document consists of a mixture of topics, and
- each topic consists of a collection of words.
Pros and cons of Latent Semantic Analysis?
Pros:
• Easy to understand and implement
• Quick and efficient
Cons:
• Lack of interpretable embeddings (we don’t know what the topics are, and the components may be arbitrarily positive/negative)
• Need many documents to get accurate results
Name two topic models.
- Latent Semantic Analysis (LSA)
- Latent Dirichlet Allocation (LDA)
How does LSA work?
- The simplest topic modeling method
- Input: doc-term matrix M (BOW or TF-IDF)
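As a concrete illustration of the doc-term matrix input, here is a minimal bag-of-words construction; the corpus and vocabulary are invented for this sketch:

```python
from collections import Counter

# Toy corpus -- invented for illustration
docs = [
    "war military troops war",
    "music band concert",
    "military band music",
]

# Shared vocabulary (n terms) across all documents
vocab = sorted({w for d in docs for w in d.split()})

# Doc-term matrix M: one row per document (m rows), one count column per term (n columns)
M = [[Counter(d.split())[w] for w in vocab] for d in docs]

print(vocab)  # ['band', 'concert', 'military', 'music', 'troops', 'war']
print(M[0])   # [0, 0, 1, 0, 1, 2]
```

A TF-IDF variant would reweight these raw counts, but the matrix shape (documents × terms) is the same.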
How does LDA work?
How do we write a document?
• Choose the topics we want to discuss in the document (i.e. decide a distribution of topics)
• Write sentences about the selected topics (i.e. select words from each topic)
Go over chapter 9, slides 15 to 18
Pros and cons of Latent Dirichlet Allocation?
Pros:
• Works better than LSA and probabilistic LSA (pLSA)
• Generalizes to new documents easily
Cons:
• Expensive computation: Expectation-Maximization (EM) algorithm or Gibbs-sampling-based posterior estimation
• Performance is sensitive to hyperparameters: number of topics and number of iterations
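A minimal fitting sketch using scikit-learn's LatentDirichletAllocation (which estimates the posterior with variational EM rather than Gibbs sampling); the count matrix and the choice of 2 topics are invented for illustration:

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Toy doc-term count matrix (rows: documents, columns: terms) -- invented data
M = np.array([
    [2, 1, 1, 0, 0, 0],
    [0, 0, 0, 2, 1, 1],
    [2, 2, 0, 0, 0, 1],
    [0, 0, 1, 2, 2, 0],
])

# n_components is the number of topics -- one of the sensitive hyperparameters
lda = LatentDirichletAllocation(n_components=2, max_iter=50, random_state=0)
doc_topic = lda.fit_transform(M)   # per-document topic mixture (rows sum to 1)
topic_term = lda.components_       # per-topic term weights

print(doc_topic.shape)   # (4, 2)
print(topic_term.shape)  # (2, 6)
```

Increasing `max_iter` trades computation time for a better posterior estimate, which is the cost/sensitivity trade-off noted above.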
How does LSA work with word vectors?
m: # of docs
n: size of vocabulary
t: wanted number of topics (specified by the designer)
• M: m×n, doc-term matrix
• U: m×t, doc-topic matrix
• V: t×n, topic-term matrix
• U may replace M in IR systems to represent documents
• V can be viewed as word vectors
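The U and V matrices above can be obtained with scikit-learn's TruncatedSVD; the toy corpus and the choice of t = 2 are assumptions for this sketch:

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus -- invented for illustration
docs = [
    "war military troops",
    "music band concert",
    "war troops battle",
    "band music song",
]

vectorizer = TfidfVectorizer()
M = vectorizer.fit_transform(docs)   # m x n doc-term matrix (TF-IDF weighted)

t = 2                                # wanted number of topics
svd = TruncatedSVD(n_components=t, random_state=0)
U = svd.fit_transform(M)             # m x t doc-topic matrix
V = svd.components_                  # t x n topic-term matrix

print(U.shape, V.shape)              # (4, 2) (2, 8)
```

Each row of U is a t-dimensional representation of a document (usable in IR instead of its sparse row of M), and each column of V gives a t-dimensional vector for a vocabulary word.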
How does truncated SVD work and what is it used for?
Truncated SVD (singular value decomposition) is a standard method for reducing matrix dimensionality.
• Factorizes any matrix M into the product of 3 separate matrices: M = USV, where S is a diagonal matrix of the singular values of M.
• Select only the t largest singular values, keeping only the first t columns of U and the first t rows of V.
• t is a hyperparameter: it reflects the number of topics we want to find.
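A numpy sketch of the truncation step described above; the matrix values and t = 2 are invented for illustration:

```python
import numpy as np

# Toy 4x5 doc-term matrix -- invented values
M = np.array([
    [2., 1., 0., 0., 0.],
    [1., 2., 1., 0., 0.],
    [0., 0., 0., 2., 1.],
    [0., 0., 1., 1., 2.],
])

# Full decomposition: M = U @ diag(S) @ V, singular values in descending order
U, S, V = np.linalg.svd(M, full_matrices=False)

# Keep only the t largest singular values:
# the first t columns of U and the first t rows of V
t = 2
U_t, S_t, V_t = U[:, :t], S[:t], V[:t, :]

# Best rank-t approximation of M
M_t = U_t @ np.diag(S_t) @ V_t
print(M_t.shape)   # (4, 5) -- same shape as M, but rank at most t
```

The approximation M_t keeps the directions of greatest variance, which is what makes the first t components usable as topics.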
What is Probabilistic Topic Modeling?
- Given a set of documents, we suppose:
- there is a fixed set of topics, each of which is a distribution over words (φ)
- each document draws words from its own distribution over topics (θ)
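Under those two assumptions, generating a document can be sketched with numpy; the vocabulary, topic distributions φ, and document mixture θ are all invented for this example:

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["war", "troops", "military", "music", "band", "concert"]

# phi: one word distribution per topic (each row sums to 1) -- invented values
phi = np.array([
    [0.5, 0.3, 0.2, 0.0, 0.0, 0.0],   # topic 0: "war"
    [0.0, 0.0, 0.0, 0.4, 0.4, 0.2],   # topic 1: "music"
])

# theta: this document's mixture over topics -- invented values
theta = np.array([0.7, 0.3])

# For each word slot: draw a topic from theta, then a word from that topic's phi
words = []
for _ in range(10):
    z = rng.choice(len(theta), p=theta)    # topic assignment for this slot
    w = rng.choice(len(vocab), p=phi[z])   # word drawn from that topic
    words.append(vocab[w])

print(words)
```

Topic modeling inverts this process: given only the generated words, it estimates the hidden φ and θ.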