Lecture 9 - Text Summarization and Topic Modeling Flashcards
What is the process of summarization?
to take an information source, extract content from it, and present the most important content to the user in a condensed form and in a manner sensitive to the user’s or application’s needs
Why should we automate the summarization process?
Human-generated summaries are very expensive. But the summary, even if automated, should contain the right information, as a summary is often the only thing people read
What can act as an input in a summarization algorithm?
What about the output?
INPUT:
news or scientific articles, emails, videos, etc.
OUTPUT:
keywords, highlight information in the input, chunks directly from the input (or paraphrase and aggregate in new ways) etc.
What are some summarization algorithms?
- keyword summarization (easy, poor representation of the content)
- sentence extraction (medium hard, good representation of the content, sentences often don’t fit together)
- natural language understanding/generation
* build knowledge representation of text
* generate sentences summarizing the content
* hard to do well
What method of text summarization do most current systems use?
Sentence extraction: extract the most important sentences and piece them together to form a text (e.g., result snippets in Google Search)
Differentiate between data-driven and knowledge-based text summarization.
Data-driven => relying on features of the input documents that can be easily computed from statistical analysis
- word statistics (informativeness of words)
- cue phrases (linguistic expressions such as ‘now’ and ‘well’ that may explicitly mark the structure of a discourse)
- section headers
- sentence position
Knowledge-based => use more sophisticated natural language processing tools
- discourse information (text structure etc.)
- use external lexical resources
- use machine learning
In data-driven text summarization, frequency can be a good indicator of whether a word/sentence is important enough to be included in the summary, as important content is repeated multiple times across a document. Explain the greedy frequency method.
- compute word probability from input
- compute sentence weight as a function of word probability
- pick the best sentence
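The three steps above can be sketched in Python. This is a minimal illustration, not a reference implementation: the sentence splitter is naive, and the redundancy update at the end (squaring the probabilities of already-covered words, as in SumBasic) is an assumed extension of the basic greedy loop.

```python
import re
from collections import Counter

def greedy_frequency_summary(text, n_sentences=2):
    # Naive sentence split and tokenization (real systems use proper tokenizers).
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    tokenized = [re.findall(r"[a-z]+", s.lower()) for s in sentences]

    # Step 1: word probability = count / total tokens in the input.
    counts = Counter(w for toks in tokenized for w in toks)
    total = sum(counts.values())
    prob = {w: c / total for w, c in counts.items()}

    summary, chosen = [], set()
    for _ in range(min(n_sentences, len(sentences))):
        # Step 2: sentence weight = average probability of its words.
        def weight(i):
            toks = tokenized[i]
            return sum(prob[w] for w in toks) / len(toks) if toks else 0.0
        # Step 3: greedily pick the best sentence not yet chosen.
        best = max((i for i in range(len(sentences)) if i not in chosen), key=weight)
        chosen.add(best)
        summary.append(sentences[best])
        # SumBasic-style update: down-weight covered words to reduce redundancy.
        for w in tokenized[best]:
            prob[w] **= 2
    return summary
```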
Explain the sentence clustering “algorithm” in the context of text summarization.
- cluster sentences from the input into similar themes
- choose one sentence to represent a theme
- consider bigger themes as more important
drawback: each sentence is assigned to one theme only
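A toy sketch of this idea, assuming bag-of-words vectors, cosine similarity, and a single-pass greedy grouping with a hypothetical similarity threshold (real systems use proper clustering algorithms). Note the drawback above is visible in the code: each sentence joins exactly one theme.

```python
import re
from collections import Counter

def cosine(a, b):
    # Cosine similarity between two sparse bag-of-words vectors.
    num = sum(a[w] * b[w] for w in set(a) & set(b))
    den = (sum(v * v for v in a.values()) ** 0.5) * (sum(v * v for v in b.values()) ** 0.5)
    return num / den if den else 0.0

def cluster_summary(sentences, threshold=0.3):
    vectors = [Counter(re.findall(r"[a-z]+", s.lower())) for s in sentences]
    clusters = []  # each cluster is a list of sentence indices (a "theme")
    for i, v in enumerate(vectors):
        for c in clusters:
            # Join the first sufficiently similar theme; each sentence joins only one.
            if cosine(v, vectors[c[0]]) >= threshold:
                c.append(i)
                break
        else:
            clusters.append([i])
    # Bigger themes are more important; represent each theme by its first sentence.
    clusters.sort(key=len, reverse=True)
    return [sentences[c[0]] for c in clusters]
```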
How does text summarization with machine learning techniques work?
- ask people to select the important sentences in a text (label good and bad summary sentences)
- use these as training examples for supervised machine learning
- each sentence is represented as a number of features
- based on the features, distinguish sentences that are appropriate for the summary
- run on new inputs
- Naive Bayes classifier is a good example
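The supervised setup above can be illustrated with a tiny Naive Bayes classifier over sentence feature dictionaries. All names and features here are illustrative, and the smoothing (add-one) is an assumed detail.

```python
import math
from collections import defaultdict

def train_nb(examples):
    # examples: list of (feature_dict, label); label 1 = human-marked summary sentence.
    label_counts = defaultdict(int)
    feat_counts = defaultdict(int)  # keyed by (label, feature_name, feature_value)
    for feats, label in examples:
        label_counts[label] += 1
        for name, value in feats.items():
            feat_counts[(label, name, value)] += 1
    return label_counts, feat_counts

def predict_nb(model, feats):
    label_counts, feat_counts = model
    total = sum(label_counts.values())
    best, best_score = None, float("-inf")
    for label, count in label_counts.items():
        # log P(label) + sum of log P(feature=value | label), add-one smoothed.
        score = math.log(count / total)
        for name, value in feats.items():
            score += math.log((feat_counts[(label, name, value)] + 1) / (count + 2))
        if score > best_score:
            best, best_score = label, score
    return best
```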
Explain the algorithm of sentence extraction.
- represent each sentence as a vector of features
- compute score based on features
- select n-highest ranking sentences
- present in order in which they occur in text
- post-processing to make summary more readable or concise (eliminate redundant sentences etc.)
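The pipeline above, minus the post-processing step, can be sketched as follows. The scoring function is a made-up example combining two of the usual features (sentence position and a length cut-off); the sentence splitter is deliberately naive.

```python
import re

def simple_score(sentence, index, total):
    # Hypothetical score: earlier sentences and sentences longer than 5 words rank higher.
    position = 1.0 - index / total
    long_enough = 1.0 if len(sentence.split()) > 5 else 0.0
    return position + long_enough

def extract_summary(text, score_fn=simple_score, n=3):
    # 1. Represent the input as sentences (naive split on end punctuation).
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # 2.-3. Score each sentence, then select the n highest-ranking ones.
    scored = [(score_fn(s, i, len(sentences)), i) for i, s in enumerate(sentences)]
    top_indices = [i for _, i in sorted(scored, reverse=True)[:n]]
    # 4. Present the selected sentences in the order they occur in the text.
    return [sentences[i] for i in sorted(top_indices)]
```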
What are some good features that can be used to decide whether a sentence is important to be included in the summary?
- Fixed-phrase features = certain phrases indicate summary (e.g., “In summary”)
- Paragraph features = the first and the last paragraph are more likely to be important
- Thematic word feature = repetition is an indicator of importance
- Uppercase word features = uppercase often indicates named entities
- Sentence length cut-off = summary sentence should have more than 5 words
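The five features above can be computed roughly as below. Every threshold and heuristic here (frequency >= 3 for "thematic", capitalization beyond the first word as a named-entity proxy) is an illustrative assumption, not a standard.

```python
def sentence_features(sentence, para_index, n_paragraphs, word_freq):
    # Compute the binary features for one sentence; all names are illustrative.
    words = sentence.split()
    return {
        # Fixed-phrase feature: explicit summary cue phrases.
        "fixed_phrase": any(p in sentence.lower() for p in ("in summary", "in conclusion")),
        # Paragraph feature: sentence sits in the first or last paragraph.
        "key_paragraph": para_index == 0 or para_index == n_paragraphs - 1,
        # Thematic word feature: contains one of the frequently repeated words.
        "thematic": any(word_freq.get(w.lower(), 0) >= 3 for w in words),
        # Uppercase word feature: a capitalized word beyond position 0 (rough NE proxy).
        "uppercase": any(w[0].isupper() for w in words[1:] if w[0].isalpha()),
        # Sentence length cut-off: more than 5 words.
        "long_enough": len(words) > 5,
    }
```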
What texts are usually subject to multi-document summarization?
- a single event/person tracked over a long period of time (e.g., news coverage of someone's death as it unfolds)
- multiple events of similar nature (marathon races etc.)
- an issue with related events (gun control)
What is query-specific summarization?
Query-specific summaries are specialized for a single information need, a query
Summarization is much easier if we have a description of what the user wants
Why is multi-document summarization useful?
- Presenting and organizing search results
* many results are very similar, and grouping closely related documents helps cover more event facets
* summarizing similarities and differences between documents
- Automatic essay grading
- Topic identification
What is topic modeling?
Topic modeling (unsupervised machine learning method) provides methods for automatically organizing, understanding, searching, and summarizing large electronic archives.
What can you use topic modeling for?
- Discover topics in a corpus
- Model the evolution of topics over time
- Image annotation
- Model connections between topics
Explain what a generative model is in simple words.
A generative model assumes that a document is generated from some topics (that the topics produce the document, even though we know it is humans who actually write it). So, each document is a mixture of corpus-wide topics, each topic is a distribution over words, and each word is drawn from one of those topics.
But we only observe the documents, so our goal is to infer the underlying topic structure.
What are the three latent (hidden) variables in LDA?
- Word distribution per topic (word-topic matrix)
- Topic distribution per document (topic-doc matrix)
- Topic word assignment
These are the three distributions that we want to infer
Explain in one sentence the training and test process of LDA.
TRAIN: learn latent (hidden) variables on training data (collection of documents)
TEST: Predict topic distribution of an unseen document
What are the two goals that LDA tries to balance (even though there is a trade-off between them)?
- for each document, allocate its words to as few topics as possible
- for each topic, assign high probability to as few terms as possible
What is Gibbs sampling used for?
Gibbs sampling is often used to estimate the posterior probability over a high-dimensional random variable z
How does Gibbs sampling work?
It generates a sequence of samples from the joint probability distribution of two or more random variables.
Aim: to compute posterior distribution over latent variable z
Pre-requisites: we must know the conditional probability of z
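A compact collapsed Gibbs sampler for LDA in pure Python, as an illustration of the above (not production code). The known conditional it samples from is the standard collapsed LDA conditional, proportional to (n_dk + alpha) * (n_kw + beta) / (n_k + V*beta), with assumed symmetric priors alpha and beta.

```python
import random
from collections import defaultdict

def lda_gibbs(docs, n_topics, alpha=0.1, beta=0.01, iters=200, seed=0):
    # Collapsed Gibbs sampling for LDA on tokenized docs (lists of word strings).
    rng = random.Random(seed)
    V = len({w for d in docs for w in d})  # vocabulary size
    n_tw = defaultdict(int)  # (topic, word) counts   -> word-topic matrix
    n_dt = defaultdict(int)  # (doc, topic) counts    -> topic-doc matrix
    n_t = defaultdict(int)   # total words assigned to each topic
    z = []                   # topic assignment for every token
    for d, doc in enumerate(docs):
        z.append([])
        for w in doc:
            t = rng.randrange(n_topics)  # random initial assignment
            z[d].append(t)
            n_tw[(t, w)] += 1; n_dt[(d, t)] += 1; n_t[t] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]
                # Remove this token's current assignment from the counts.
                n_tw[(t, w)] -= 1; n_dt[(d, t)] -= 1; n_t[t] -= 1
                # Sample a new topic from the known conditional P(z=k | rest).
                weights = [(n_dt[(d, k)] + alpha) * (n_tw[(k, w)] + beta)
                           / (n_t[k] + V * beta) for k in range(n_topics)]
                t = rng.choices(range(n_topics), weights=weights)[0]
                z[d][i] = t
                n_tw[(t, w)] += 1; n_dt[(d, t)] += 1; n_t[t] += 1
    return z, n_tw, n_dt
```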
True or False? LDA assumes that the order of words does not matter.
True. LDA is a bag-of-words model: words within a document are assumed exchangeable. (LDA also assumes the order of documents does not matter; dynamic topic models relax that assumption to track how topics change over time.)