Semantic Analysis: Topic Modeling Flashcards
Topic Modeling Paradigms (machine learning in the context of NLP)
Canonical: Match a preestablished list of topics for the domain (Library of Congress, Vatican, Thompson Chain-Reference Bible - one of the earliest forms of topic modeling before computers, an anointed topic expert).
Organic: Discover the “natural” topics of a corpus. Let topics bubble up out of the “lake” of unstructured documents. Most popular because it is a purely statistical approach.
Entity-centric: Topics are strongly related to sets of named entities (NEs) that may change over time (often people). Starts from an established list of named entities (e.g., find all topics related to this list of people’s names).
Canonical Topic Modeling
Match a preestablished list of topics for the domain (Library of Congress, Vatican, Thompson Chain-Reference Bible - one of the earliest forms of topic modeling before computers, an anointed topic expert).
Organic Topic Modeling
Discover the “natural” topics of a corpus. Let topics bubble up out of the “lake” of unstructured documents. Most popular because it is a purely statistical approach.
LSA: Latent semantic analysis
LDA: Latent Dirichlet allocation (not latent discriminant analysis!)
NMF: Non-negative matrix factorization
Latent semantic analysis (LSA)
Discovering clusters of words that make up a topic across a collection of documents. Works on sparse vectors, reducing a broad vocabulary down to a smaller number of dimensions - a small set of words for each group. Similar to clustering.
Starts with a large term-document matrix
Then creates a topic-to-topic matrix (how many times two topics co-occur in the same document)
Chooses topics that do a good job of separating documents; a lot of separation between documents is desirable.
Example: Fish would likely show up as a topic when looking at menus, but onions would not.
To LSA, a topic is a mix of words that commonly occur together
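A minimal LSA sketch, assuming scikit-learn's TfidfVectorizer and TruncatedSVD; the tiny "menu" corpus is made up for illustration:

```python
# LSA sketch: TF-IDF term-document matrix + truncated SVD.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "grilled fish with lemon and capers",
    "fried fish and chips with tartar sauce",
    "chocolate cake with vanilla ice cream",
    "warm chocolate brownie and ice cream",
]

# Large, sparse document-term matrix over the whole vocabulary
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)

# Reduce the broad vocabulary down to a small number of latent topics
lsa = TruncatedSVD(n_components=2, random_state=0)
doc_topics = lsa.fit_transform(X)          # per-document topic weights
terms = tfidf.get_feature_names_out()

# Each topic is a weighted mix of words that commonly occur together
for i, component in enumerate(lsa.components_):
    top = component.argsort()[::-1][:4]
    print(f"topic {i}:", [terms[j] for j in top])
```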
Latent Dirichlet Allocation (LDA)
LDA is the k-means of Topic Modeling - it has largely pushed LSA out
- Groups words with high co-occurrence in a corpus of documents
- Output topics can overlap in keywords
- Uses probability distributions over words rather than a topic-topic matrix
- Formula: (topic's distribution over the keywords) * (document's distribution over topics)
- Starts from a random seed; we decide how many topics we want to create
- Assigns a random topic to every word in every document
- Then iterates: for every word k in every document d, (1) figure out the proportion of words in document d assigned to topic z; (2) figure out the proportion of assignments to topic z, across all documents, that involve word k; (3) reassign word k in document d to whichever topic has the highest combined score from the first two steps. Repeat the steps again and again; everything has to be recomputed over and over (see the sketch after this list).
- Magic! With each iteration, words start to get assigned to the same topics until they are more correct.
- Two concentration parameters that control how many things are going to be in a topic and the overlap
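A toy sketch of the reassignment loop described above, in plain Python (the tiny corpus is made up, and the hard argmax reassignment is a simplification of real LDA inference, which samples from a distribution rather than taking the best score):

```python
# Toy sketch of iterative topic reassignment (not a full Gibbs sampler).
import random
from collections import Counter

random.seed(0)
docs = [
    "fish chips fish sauce".split(),
    "chocolate cake cream chocolate".split(),
    "fish lemon capers".split(),
]
K = 2  # number of topics, decided up front

# Step 1: assign a random topic to every word in every document
assignments = [[random.randrange(K) for _ in doc] for doc in docs]

for _ in range(20):  # repeat; everything is recomputed on each pass
    # global counts: how often each word is currently assigned to each topic
    word_topic = Counter()
    for doc, zs in zip(docs, assignments):
        for w, z in zip(doc, zs):
            word_topic[(w, z)] += 1

    for doc, zs in zip(docs, assignments):
        doc_topic = Counter(zs)  # topic counts within this document
        for i, w in enumerate(doc):
            scores = []
            for z in range(K):
                p_topic_in_doc = doc_topic[z] / len(doc)
                word_total = sum(word_topic[(w, k)] for k in range(K))
                p_word_in_topic = word_topic[(w, z)] / word_total
                scores.append(p_topic_in_doc * p_word_in_topic)
            zs[i] = scores.index(max(scores))  # reassign to best-scoring topic

for doc, zs in zip(docs, assignments):
    print(list(zip(doc, zs)))
```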
Most important aspect of LDA (Alpha, Beta)
Alpha - a high value means each document is likely to contain a mixture of many topics; a low value means a document contains only a few topics.
Beta - a high value means each topic is likely to contain a mixture of many words; a low value means a topic is more likely to contain just a few words.
- Can be controlled independently
- Arguably the most favored method because people like being able to tweak the alpha/beta parameters
- High alpha will lead to documents being more similar in terms of what topics they contain.
- A high beta value will similarly lead to topics being more similar in terms of what words they contain.
You must define k, the number of topics.
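A minimal sketch, assuming scikit-learn's LatentDirichletAllocation, where doc_topic_prior plays the role of alpha, topic_word_prior the role of beta, and n_components is k; the corpus is made up:

```python
# Sketch of setting k, alpha, and beta with scikit-learn's LDA.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "fish chips tartar sauce",
    "chocolate cake ice cream",
    "grilled fish lemon capers",
    "brownie chocolate vanilla cream",
]
X = CountVectorizer().fit_transform(docs)

lda = LatentDirichletAllocation(
    n_components=2,         # k: number of topics we decide on
    doc_topic_prior=0.1,    # alpha: low -> each document has only a few topics
    topic_word_prior=0.01,  # beta: low -> each topic uses only a few words
    random_state=0,         # results depend on the random seed
)
doc_topics = lda.fit_transform(X)  # per-document distribution over topics
print(doc_topics.round(2))
```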
Non-Negative Matrix Factorization (NMF)
A version of LDA in which the parameters have been tweaked to enforce a sparse set of topics. Tends to produce a small number of topics really well; not good for a large number of topics.
- Cheap in computation so much faster than LDA
- Doesn’t require as much tuning as LDA, especially on noisy text (text that is not topically pure)
- Works well on small corpora
- Some people say that it assigns documents to topics similarly to how humans do
- Works well for a small number of documents
- Unstable algorithm (different results on the same documents). To address this, use NNDSVD - Non-negative Double Singular Value Decomposition (see the sketch below).
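A minimal sketch, assuming scikit-learn's NMF; init="nndsvd" makes the factorization deterministic, addressing the instability noted above. The corpus is made up:

```python
# NMF topic modeling sketch with NNDSVD initialization.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

docs = [
    "fish chips tartar sauce",
    "chocolate cake ice cream",
    "grilled fish lemon capers",
    "brownie chocolate vanilla cream",
]
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)

nmf = NMF(n_components=2, init="nndsvd")  # NNDSVD gives repeatable results
doc_topics = nmf.fit_transform(X)         # W: document-topic weights
terms = tfidf.get_feature_names_out()
for i, component in enumerate(nmf.components_):  # H: topic-word weights
    top = component.argsort()[::-1][:3]
    print(f"topic {i}:", [terms[j] for j in top])
```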
Working with Organic Topic Models (LSA, LDA, NMF)
The output topics contain words like “the”, “a”, “an”, “in”
You forgot to remove stopwords (see the sketch below)
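For example, scikit-learn's vectorizers can apply a built-in English stopword list before topic modeling:

```python
# Strip stopwords at vectorization time so they never reach the topic model.
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(["the fish is in a lemon sauce"])
print(vectorizer.get_feature_names_out())  # stopwords are gone
```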
Applications of Topic Modelers
- Movie recommendations
- News article recommendations
- Book recommender
- Dating-website match recommender
Canonical Topic Modeling
Authorized source for topics. We can only select the topics from that list.
Standards org
Boss
Topic expert
How is it different from classification? We want to enable topic-driven exploration of the corpus by end users.
Constrain an organic topic model to the canonical list of topics (cut off words that don’t occur in the topic list); see the sketch after this list
Use an Information Retrieval approach
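One simple way to constrain an organic model to a canonical vocabulary (a sketch; the canonical term list and documents are made up) is to fix the vectorizer's vocabulary so off-list words are dropped:

```python
# Constrain an organic topic model to a canonical term list.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

canonical_terms = ["fish", "lemon", "chocolate", "cream"]  # hypothetical list
docs = ["grilled fish with lemon", "chocolate cake and cream"]

X = CountVectorizer(vocabulary=canonical_terms).fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
print(lda.components_.round(2))  # topics are defined only over canonical terms
```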
Extension vs. Intension
- concepts are extensionally related when they extend to some of the same referents in the world
- concepts are intensionally related when their meanings (definitions) overlap
- if concepts are both extensionally and intensionally related, group their words together for the topic model
Entity-Centric Topic Modeling
Centered around a list of named entities, e.g., people, sports teams, countries, shows, leagues, years, seasons, franchises, etc.
One approach (names first):
- Canonical list of named entities (names)
- Context harvesting: try to find topics that come up in the context of those names
- Contextual organic topics (see the sketch after these lists)
Another approach (topics first):
- List of canonical topics
- Get the names of the people, places, teams, companies, movies, etc. that we are supposed to care about.
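A sketch of the names-first approach, assuming scikit-learn; the entity names and sentences are made up for illustration:

```python
# Names-first entity-centric topics: harvest contexts mentioning each
# canonical entity, then run an organic topic model over those contexts.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

names = ["Serena Williams", "Roger Federer"]  # canonical entity list
sentences = [
    "Serena Williams won the final in straight sets",
    "Roger Federer announced his retirement from tennis",
    "Serena Williams returned from injury for the season",
]

# Context harvesting: keep only sentences that mention one of the names
contexts = [s for s in sentences if any(n in s for n in names)]

# Contextual organic topics over the harvested contexts
tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(contexts)
topics = NMF(n_components=2, init="nndsvd").fit(X)
terms = tfidf.get_feature_names_out()
for i, comp in enumerate(topics.components_):
    print(f"topic {i}:", [terms[j] for j in comp.argsort()[::-1][:3]])
```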
Singular Value Decomposition (SVD)
- Any set of vectors (A) can be expressed in terms of the lengths of their projections (S) onto some set of orthogonal axes (V): A = U S Vᵀ
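A minimal numpy illustration of the decomposition A = U S Vᵀ:

```python
# Any matrix A factors as U @ diag(S) @ Vt, where the rows of Vt are
# orthogonal axes and S holds the projection lengths along those axes.
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 3.0],
              [0.0, 2.0]])
U, S, Vt = np.linalg.svd(A, full_matrices=False)

# Reconstruct A from the factors to confirm the decomposition
print(np.allclose(A, U @ np.diag(S) @ Vt))  # True
```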
Lift score
Likelihood of a word occurring in the context of a user action (such as visiting a certain website), relative to its overall likelihood. Used to rank the words by importance; the ranking is then thresholded to obtain a set of words highly associated with the context.
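A toy sketch, assuming lift(word) = P(word | action) / P(word); the documents and threshold are made up:

```python
# Rank words by lift relative to an action's context, then threshold.
from collections import Counter

context_docs = ["cheap flights to paris", "book cheap hotel paris"]
all_docs = context_docs + ["python list comprehension", "weather in paris today"]

def word_probs(docs):
    counts = Counter(w for d in docs for w in d.split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

p_context, p_overall = word_probs(context_docs), word_probs(all_docs)
lift = {w: p_context[w] / p_overall[w] for w in p_context}

threshold = 1.5  # hypothetical cutoff for "highly associated" words
print(sorted(w for w, s in lift.items() if s >= threshold))
```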