C8 Flashcards
topic modelling
assumptions: each document consists of a mixture of topics and each topic consists of a mixture of words
unsupervised technique:
- topic labels are not given
- number of topics needs to be pre-specified
- think about it as clustering
most used technique: LDA
generative probabilistic model: LDA
Latent Dirichlet Allocation
Topic = probability distribution over a fixed vocabulary: every topic assigns a probability to every word in the vocabulary
Each document is a distribution over topics (drawn from a Dirichlet distribution; the prior set on this distribution is sparse)
Generate a document as a bag of words:
- draw a topic from the document's topic distribution (e.g. the "yellow" topic), look up that topic's word distribution, and draw a word from it; repeat for every word
- order of words doesn’t matter; the words are drawn independently of each other
We only observe the words in the documents. The topics are latent
goal of LDA
infer the underlying topic structure by only observing the documents
LDA: learn distributions
- What are the topics, what are the distributions over terms?
- For each document, what is the distribution over topics?
LDA: learn the topics from the data
Goal: learn β (the topics, i.e. the distributions over terms) and θ (the topic proportions per document)
- we only observe the words W
- start with random probability distributions of words in topics and of topics in documents
- update the probability distributions while observing the words in the documents (Bayesian framework)
- until β converges, or the maximum number of epochs has been reached
LDA: challenges
- choose the number of topics
- random initialization of the clustering => non-deterministic outcome
- interpreting the outputs: what do the topics mean?
evaluation of topic modelling
- topic coherence: measure similarity of words inside a topic and between topics
- human evaluation: word intrusion = given these 5 high-probability topic words + 1 random word, can you find the intruder?
single-document summarization examples
- news articles
- scientific articles
- meeting reports
multi-document summarization examples
- output of a search engine
- news about a single topic from multiple sources
- summarization of discussion threads
extractive summarization
a summary composed completely of material from the source
abstractive summarization
a summary that contains material not present in the source, e.g. shorter paraphrases of the source content
describe the extractive summarization method and its pros and cons
Select the most important nuggets (sentences)
Classification or ranking task:
- classification: for each sentence, select it: yes/no
- ranking: assign a score to each sentence, then select the top-k
+ feasible / easy to implement
+ reliable (literal re-use of text)
- but limited in terms of fluency
- (fixes required after sentence selection)
strong baseline: take first three sentences from the document
extractive summarization: sentence selection methods
Unsupervised methods:
- centrality-based
- (graph-based)
Supervised methods:
- feature-based
- (embeddings based)
unsupervised sentence selection (centrality-based)
- measure the cosine similarity between each sentence and the document (use either sparse or dense vectors)
- Select the sentences with the highest similarity (the most representative sentences)
supervised sentence selection
feature engineering + classifier (e.g. SVM)
features: position in the document, word count, word lengths, word frequencies, punctuation, representativeness (similarity to full document/title)