C8 Flashcards
topic modelling
assumptions: each document consists of a mixture of topics and each topic consists of a mixture of words
unsupervised technique:
- topic labels are not given
- number of topics needs to be pre-specified
- think about it as clustering
most used technique: LDA
generative probabilistic model LDA
Latent Dirichlet Allocation
Topic = probability distribution over fixed vocabulary. Every topic contains a probability for every word in the vocabulary
Each document is a distribution over topics (Dirichlet distribution, prior set on this distribution is sparse)
Generate a document as a bag of words:
- draw a topic from the distribution (e.g. yellow), look up the yellow distribution, and draw a word from the yellow distribution, etc.
- order of words doesn’t matter; the words are drawn independently of each other
We only observe the words in the documents. The topics are latent
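A minimal sketch of this generative story (illustrative only: the vocabulary, topic distributions, and Dirichlet parameter below are made-up assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["goal", "match", "election", "vote", "film", "actor"]
# beta: one probability distribution over the whole vocabulary per topic (assumed values)
beta = np.array([
    [0.45, 0.45, 0.02, 0.02, 0.03, 0.03],   # "sports" topic
    [0.02, 0.02, 0.46, 0.46, 0.02, 0.02],   # "politics" topic
    [0.03, 0.03, 0.02, 0.02, 0.45, 0.45],   # "movies" topic
])

def generate_document(n_words, alpha=0.1):
    # theta: per-document distribution over topics, drawn from a sparse Dirichlet prior
    theta = rng.dirichlet([alpha] * beta.shape[0])
    words = []
    for _ in range(n_words):
        z = rng.choice(beta.shape[0], p=theta)   # draw a topic for this word
        w = rng.choice(len(vocab), p=beta[z])    # draw a word from that topic's distribution
        words.append(vocab[w])
    return words                                  # bag of words: order is irrelevant

print(generate_document(10))
```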
goal of LDA
infer the underlying topic structure by only observing the documents
LDA: learn distributions
- What are the topics, what are the distributions over terms?
- For each document, what is the distribution over topics?
LDA: learn the topics from the data
Goal: to learn β (the topics: distributions over words) and θ (the topic proportions per document); see the sketch after this list
- we only observe the words W
- start with random probability distributions of words in topics and of topics in documents
- update the probability distributions while observing the words in the documents (Bayesian framework)
- until β converges, or the maximum number of epochs has been reached
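A hedged sketch of learning β and θ from raw documents, here using scikit-learn's LatentDirichletAllocation (the toy documents and the choice of 2 topics are illustrative assumptions):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the team won the match after a late goal",
    "voters went to the polls for the election",
    "the striker scored a goal in the final match",
    "the candidate gave a speech before the vote",
]

# Bag-of-words counts: LDA ignores word order
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

# The number of topics must be pre-specified; initialization is random
lda = LatentDirichletAllocation(n_components=2, random_state=0)
theta = lda.fit_transform(X)   # per-document topic proportions (θ)
beta = lda.components_         # per-topic word weights (unnormalized β)

terms = vectorizer.get_feature_names_out()
for k, topic in enumerate(beta):
    top = topic.argsort()[::-1][:3]
    print(f"topic {k}:", [terms[i] for i in top])
```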
LDA: challenges
- choose the number of topics
- random initialization of the clustering => non-deterministic outcome
- interpreting the outputs: what do the topics mean?
evaluation of topic modelling
- topic coherence: measure the similarity of words inside a topic and between topics (see the sketch below)
- human evaluation: word intrusion = given these 5 high-probability topic words + 1 random word, can you find the intruder?
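As an illustration of within-topic coherence, a minimal sketch of a UMass-style score that rewards topic words which frequently co-occur in the same documents (the documents and topic words are made up, and the pair ordering is simplified relative to the original metric):

```python
import math
from itertools import combinations

docs = [
    {"goal", "match", "team"},
    {"election", "vote", "candidate"},
    {"goal", "team", "season"},
]

def umass_coherence(topic_words, docs):
    """Sum of log co-occurrence ratios over word pairs (higher = more coherent)."""
    def df(*words):
        return sum(1 for d in docs if all(w in d for w in words))
    score = 0.0
    for w_i, w_j in combinations(topic_words, 2):
        score += math.log((df(w_i, w_j) + 1) / df(w_j))
    return score

print(umass_coherence(["goal", "team", "match"], docs))   # coherent topic
print(umass_coherence(["goal", "vote", "season"], docs))  # incoherent mix scores lower
```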
single-document summarization examples
- news articles
- scientific articles
- meeting reports
multi-document summarization examples
- output of a search engine
- news about a single topic from multiple sources
- summarization of discussion threads
extractive summarization
a summary composed completely of material from the source
abstractive summarization
a summary that contains material not taken verbatim from the source, e.g. shorter paraphrases of its content
describe the extractive summarization method and its pros and cons
Select the most important nuggets (sentences)
Classification or ranking task:
- classification: for each sentence, select it: yes/no
- ranking: assign a score to each sentence, then select the top-k
+ feasible / easy to implement
+ reliable (literal re-use of text)
- but limited in terms of fluency
- (fixes required after sentence selection)
strong baseline: take first three sentences from the document
extractive summarization: sentence selection methods
Unsupervised methods:
- centrality-based
- (graph-based)
Supervised methods:
- feature-based
- (embeddings based)
unsupervised sentence selection (centrality-based)
- measure the cosine similarity between each sentence and the full document (using either sparse or dense vectors)
- select the sentences with the highest similarity (the most representative sentences); see the sketch below
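A minimal sketch of centrality-based selection with sparse TF-IDF vectors (the example sentences and the value of k are illustrative):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = [
    "The storm hit the coast on Sunday night.",
    "Thousands of households lost power.",
    "Officials expect repairs to take several days.",
    "Local football fixtures were also cancelled.",
]

vectorizer = TfidfVectorizer()
sent_vecs = vectorizer.fit_transform(sentences)
doc_vec = vectorizer.transform([" ".join(sentences)])   # represent the full document

# Similarity of each sentence to the document: most representative sentences score highest
scores = cosine_similarity(sent_vecs, doc_vec).ravel()
top_k = 2
summary = [sentences[i] for i in sorted(scores.argsort()[::-1][:top_k])]
print(summary)
```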
supervised sentence selection
feature engineering + classifier (e.g. SVM)
features: position in the document, word count, word lengths, word frequencies, punctuation, representativeness (similarity to the full document/title)
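A hedged sketch of the feature-based setup: hand-crafted features per sentence plus an SVM classifier (the chosen features and the toy labels below are illustrative assumptions, not a full system):

```python
from sklearn.svm import SVC
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def sentence_features(sentences):
    """Position, word count, and representativeness features for each sentence."""
    vec = TfidfVectorizer().fit(sentences)
    sent_vecs = vec.transform(sentences)
    doc_vec = vec.transform([" ".join(sentences)])
    sims = cosine_similarity(sent_vecs, doc_vec).ravel()
    return [
        [i / len(sentences), len(s.split()), sims[i]]   # position, word count, similarity to document
        for i, s in enumerate(sentences)
    ]

# Toy training data: 1 = sentence belongs in the summary, 0 = it does not
train_sentences = [
    "The storm hit the coast on Sunday night.",
    "Thousands of households lost power.",
    "Officials expect repairs to take several days.",
    "Local football fixtures were also cancelled.",
]
labels = [1, 1, 0, 0]

clf = SVC().fit(sentence_features(train_sentences), labels)
print(clf.predict(sentence_features(train_sentences)))
```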
problems with sentence selection
- selected sentences may contain unresolved references (e.g. pronouns) to sentences that are not included in the summary, or to content that is not explicit in the original document
- improvements might be needed after sentence selection (sentence ordering, revision, fusion, compression)
abstractive summarization: method and pros/cons
Learn a text-to-text-transformation model (cf. translation)
- training data: pairs of longer and shorter texts, e.g. full scientific articles and their abstracts, or editor-written summaries of comment threads (NY Times)
- sequence-to-sequence models: learning a mapping between an input sequence and an output sequence
+ more natural/fluent result
- but a lot of training data needed
- and risk of untrue content
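For example, a pre-trained sequence-to-sequence model can be applied through the Hugging Face transformers summarization pipeline (a sketch; it assumes the transformers library is installed and downloads the pipeline's default checkpoint on first use):

```python
from transformers import pipeline

# Loads a pre-trained encoder-decoder summarization model
summarizer = pipeline("summarization")

article = (
    "The storm hit the coast on Sunday night. Thousands of households lost power. "
    "Officials expect repairs to take several days."
)
print(summarizer(article, max_length=30, min_length=5)[0]["summary_text"])
```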
Pegasus
encoder-decoder pre-training for abstractive summarization
pre-training objectives (self-supervised):
1. Masked Language Modelling (like BERT)
2. Gap Sentences Generation (GSG)
motivation:
- large-scale document-summary datasets (for supervised learning) are rare
- creating training data is expensive (‘low-resource summarization’)
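A rough illustration of how a Gap Sentences Generation training pair could be constructed; the [MASK1] token and the length-based importance heuristic are simplifying assumptions (the paper selects gap sentences by their ROUGE score against the rest of the document):

```python
def make_gsg_example(sentences, gap_ratio=0.3):
    """Mask the most 'important' sentences; the model must generate them as the target."""
    n_gap = max(1, int(len(sentences) * gap_ratio))
    # Simplified importance heuristic: prefer longer sentences (assumption)
    ranked = sorted(range(len(sentences)), key=lambda i: len(sentences[i]), reverse=True)
    gap_ids = set(ranked[:n_gap])
    source = " ".join("[MASK1]" if i in gap_ids else s for i, s in enumerate(sentences))
    target = " ".join(sentences[i] for i in sorted(gap_ids))
    return source, target

sentences = [
    "The storm hit the coast on Sunday night.",
    "Thousands of households lost power across the region.",
    "Officials expect repairs to take several days.",
]
src, tgt = make_gsg_example(sentences)
print("input: ", src)
print("target:", tgt)
```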
challenges in summarization
- Factual consistency (for abstractive summarization)
- Task subjectivity/ambiguity
- Training data bias
- Evaluation
factual consistency
main challenge of abstractive summarization models
research showed that the majority of generated summaries contained non-faithful content => human judgement is still crucial for this kind of evaluation, as automatic metrics do not correlate strongly with summary faithfulness
training data bias
Most used sets for training and evaluation summarization models are based on news data
In newspaper articles the most important information appears in the first paragraph, but this does not always hold in other domains
evaluation of summarization
- compare to reference summaries
- ask human judges
compare to reference summaries
compute overlap with human reference summary
ROUGE metrics: measure the quality of a summary by (literal) comparison with reference summaries
ROUGE
the proportion of n-grams from the reference summaries that occur in the automatically created summary (‘recall-oriented’)
ROUGE-N = (# n-grams in both the automatic and the reference summary) / (# n-grams in the reference summary)
(beginning- and end-of-sequence markers are also counted)
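A minimal sketch of ROUGE-N recall computed by hand (for simplicity it ignores the boundary markers and multi-reference handling of the official toolkit; the example texts are made up):

```python
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def rouge_n(candidate, reference, n=2):
    """Proportion of reference n-grams that also appear in the candidate summary."""
    cand = Counter(ngrams(candidate.lower().split(), n))
    ref = Counter(ngrams(reference.lower().split(), n))
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / max(1, sum(ref.values()))

reference = "the storm left thousands without power"
candidate = "thousands were left without power after the storm"
print(rouge_n(candidate, reference, n=1))  # ROUGE-1 recall
print(rouge_n(candidate, reference, n=2))  # ROUGE-2 recall
```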
rating criteria for summary judgement by humans
rating criteria:
- relevance/importance: selection of important content from the source
- consistency: factual alignment between the summary and the source
- fluency: quality of individual sentences
- coherence: collective quality of all sentences
ask multiple judges per summary
challenges in evaluation (abstractive summarization)
- ROUGE often has weak correlation with human judgements
- but human judgements of relevance (importance) and fluency are strongly correlated with each other