Lecture 9 - Text Summarization and Topic Modeling Flashcards

1
Q

What is the process of summarization?

A

to take an information source, extract content from it, and present the most important content to the user in a condensed form and in a manner sensitive to the user’s or application’s needs

2
Q

Why should we automate the summarization process?

A

Human-generated summaries are very expensive. But even an automated summary must contain the right information, since a summary is often the only thing people read

3
Q

What can act as an input in a summarization algorithm?

What about the output?

A

INPUT:
news or scientific articles, emails, videos, etc.

OUTPUT:
keywords, highlighted passages in the input, chunks taken directly from the input (or paraphrased and aggregated in new ways), etc.

4
Q

What are some summarization algorithms?

A
  1. keyword summarization (easy, but a poor representation of the content)
  2. sentence extraction (medium-hard; good representation of the content, but the extracted sentences often don’t fit together)
  3. natural language understanding/generation
    * build a knowledge representation of the text
    * generate sentences summarizing the content
    * hard to do well
5
Q

What method of text summarization do most current systems use?

A

Sentence extraction: extract the most important sentences and piece them together to form a text (e.g., the result snippets in Google Search)

6
Q

Differentiate between data-driven and knowledge-based text summarization.

A

Data-driven => relying on features of the input documents that can be easily computed from statistical analysis

  • word statistics (informativeness of words)
  • cue phrases (linguistic expressions such as ‘now’ and ‘well’ that may explicitly mark the structure of a discourse)
  • section headers
  • sentence position

Knowledge-based => use more sophisticated natural language processing tools

  • discourse information (text structure etc.)
  • use external lexical resources
  • use machine learning
7
Q

In data-driven text summarization, frequency can be a good indicator of the importance of a word or sentence to be included in the summary, as important content is repeated multiple times across a document. Explain the greedy frequency method.

A
  • compute word probability from input
  • compute sentence weight as a function of word probability
  • pick the best sentence
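The three steps above can be sketched in plain Python (the tokenizer, sentence splitter, and example document are all illustrative assumptions, not from the lecture):

```python
import re
from collections import Counter

def summarize_greedy(text, n_sentences=1):
    # Naive sentence split on ., !, ? (good enough for a sketch).
    sentences = [s.strip() for s in re.split(r"[.!?]", text) if s.strip()]
    # Step 1: word probability p(w) = count(w) / total number of words.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = sum(counts.values())
    p = {w: c / total for w, c in counts.items()}
    # Step 2: sentence weight = average probability of its words.
    def weight(sentence):
        toks = re.findall(r"[a-z']+", sentence.lower())
        return sum(p[t] for t in toks) / len(toks) if toks else 0.0
    # Step 3: greedily pick the highest-weight sentences.
    return sorted(sentences, key=weight, reverse=True)[:n_sentences]

doc = ("The election results were announced today. "
       "The election drew record turnout. Weather was mild yesterday.")
print(summarize_greedy(doc))  # the sentence built from frequent words wins
```

Sentences that reuse the document's frequent words ("election") outrank the off-topic one about the weather.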
8
Q

Explain the sentence clustering “algorithm” in the context of text summarization.

A
  1. cluster sentences from the input into similar themes
  2. choose one sentence to represent a theme
  3. consider bigger themes as more important

drawback: each sentence is assigned to one theme only
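The steps above can be sketched with a simple greedy clustering heuristic (Jaccard word overlap and the 0.3 threshold are assumptions; real systems often use TF-IDF or embedding similarity). The `break` makes the drawback explicit: each sentence joins exactly one theme.

```python
import re

def tokens(s):
    return set(re.findall(r"[a-z']+", s.lower()))

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def cluster_sentences(sentences, threshold=0.3):
    clusters = []  # each cluster is a list of sentences, i.e. a "theme"
    for s in sentences:
        for c in clusters:
            # Step 1: join the first theme similar enough to this sentence
            # (each sentence is assigned to one theme only).
            if jaccard(tokens(s), tokens(c[0])) >= threshold:
                c.append(s)
                break
        else:
            clusters.append([s])  # otherwise start a new theme
    # Step 3: bigger themes are more important;
    # step 2: represent each theme by its first sentence.
    clusters.sort(key=len, reverse=True)
    return [c[0] for c in clusters]

sents = ["The volcano erupted on Monday",
         "Monday the volcano erupted violently",
         "Flights were cancelled"]
print(cluster_sentences(sents))
```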

9
Q

How does text summarization with machine learning techniques work?

A
  • ask people to select the important sentences in a text (label good and bad summary sentences)
  • use these as training examples for supervised machine learning
  • each sentence is represented as a number of features
  • based on the features, distinguish sentences that are appropriate for the summary
  • run on new inputs
  • Naive Bayes classifier is a good example
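A toy sketch of this supervised setup, as a Bernoulli Naive Bayes over binary sentence features. The feature names (`cue`, `first`, `long`) and the tiny training set are hypothetical:

```python
import math
from collections import defaultdict

def train_nb(X, y, alpha=1.0):
    """X: list of binary feature dicts, y: 0/1 labels (1 = good summary sentence)."""
    vocab = {f for feats in X for f in feats}
    priors = {c: (y.count(c) + alpha) / (len(y) + 2 * alpha) for c in (0, 1)}
    counts = {0: defaultdict(float), 1: defaultdict(float)}
    for feats, label in zip(X, y):
        for f, v in feats.items():
            counts[label][f] += v
    # Laplace-smoothed P(feature = 1 | class)
    probs = {c: {f: (counts[c][f] + alpha) / (y.count(c) + 2 * alpha)
                 for f in vocab} for c in (0, 1)}
    return priors, probs

def predict(feats, priors, probs):
    scores = {}
    for c in (0, 1):
        s = math.log(priors[c])
        for f, p in probs[c].items():
            s += math.log(p if feats.get(f, 0) else 1 - p)
        scores[c] = s
    return max(scores, key=scores.get)

# Hypothetical human-labeled training sentences as feature dicts.
X = [{"cue": 1, "first": 1, "long": 1},
     {"cue": 0, "first": 1, "long": 1},
     {"cue": 0, "first": 0, "long": 0},
     {"cue": 0, "first": 0, "long": 1}]
y = [1, 1, 0, 0]
priors, probs = train_nb(X, y)
print(predict({"cue": 1, "first": 1, "long": 1}, priors, probs))  # -> 1
```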
10
Q

Explain the algorithm of sentence extraction.

A
  1. represent each sentence as a vector of features
  2. compute score based on features
  3. select the n highest-ranking sentences
  4. present them in the order in which they occur in the text
  5. post-processing to make summary more readable or concise (eliminate redundant sentences etc.)
11
Q

What are some good features that can be used to decide whether a sentence is important to be included in the summary?

A
  1. Fixed-phrase features = certain phrases indicate summary (e.g., “In summary”)
  2. Paragraph features = the first and the last paragraph are more likely to be important
  3. Thematic word feature = repetition is an indicator of importance
  4. Uppercase word features = uppercase often indicates named entities
  5. Sentence length cut-off = summary sentence should have more than 5 words
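The extraction pipeline and the features above can be combined in a short sketch; the feature weights and cue-phrase list are illustrative assumptions, not values from the lecture:

```python
CUE_PHRASES = ("in summary", "in conclusion", "to conclude")
WEIGHTS = {"fixed_phrase": 3, "position": 2, "uppercase": 1, "long_enough": 1}

def features(sentence, index, n_sentences):
    words = sentence.split()
    return {
        "fixed_phrase": any(sentence.lower().startswith(c) for c in CUE_PHRASES),
        "position": index == 0 or index == n_sentences - 1,  # first/last proxy
        "uppercase": any(w[0].isupper() for w in words[1:]),  # crude named-entity cue
        "long_enough": len(words) > 5,
    }

def extract(sentences, n=2):
    # Steps 1-2: feature vector per sentence, score = weighted feature sum.
    scored = [(sum(WEIGHTS[k] for k, v in features(s, i, len(sentences)).items() if v), i, s)
              for i, s in enumerate(sentences)]
    # Step 3: select the n highest-ranking sentences.
    top = sorted(scored, reverse=True)[:n]
    # Step 4: present them in original document order.
    return [s for _, i, s in sorted(top, key=lambda t: t[1])]

sents = ["Researchers at MIT released a new summarizer model today",
         "It is small",
         "The model beats prior baselines on two benchmarks",
         "In summary, the MIT model sets a new state of the art"]
print(extract(sents, 2))
```

The cue-phrase sentence and the opening sentence win; the short "It is small" is filtered out by the length cut-off.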
12
Q

What texts are usually subject to multi-document summarization?

A
  • a single event/person tracked over a long period of time (e.g., news coverage of someone’s death)
  • multiple events of similar nature (marathon races etc.)
  • an issue with related events (gun control)
13
Q

What is query-specific summarization?

A

Query-specific summaries are specialized for a single information need, a query

Summarization is much easier if we have a description of what the user wants

14
Q

Why is multi-document summarization useful?

A

For presenting and organizing search results.

  • many results are very similar, and grouping closely related documents helps cover more event facets
  • summarizing similarities and differences between documents

Automatic essay grading

Topic identification

15
Q

What is topic modeling?

A

Topic modeling (an unsupervised machine learning method) provides methods for automatically organizing, understanding, searching, and summarizing large electronic archives.

16
Q

What can you use topic modeling for?

A

Discover topics in a corpus
Model the evolution of topics over time
Image annotation
Model connections between topics

17
Q

Explain what a generative model is in simple words.

A

A generative model assumes that a document is built from topics, i.e., that the topics generate the document, even though we know it is actually humans who write it. So, each document is a mixture of corpus-wide topics, each topic is a distribution over words, and each word is drawn from one of those topics.

18
Q

In a generative model such as LDA, what do we actually observe, and what do we want to infer?

A

We only observe the documents (the words themselves). The topics, the per-document topic mixtures, and the per-word topic assignments are hidden, so the goal is to infer this underlying topic structure from the observed documents.

19
Q

What are the three latent (hidden) variables in LDA?

A
  1. Word distribution per topic (word-topic matrix)
  2. Topic distribution per document (topic-doc matrix)
  3. Topic assignment per word (which topic generated each word token)

These are the three distributions that we want to infer
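The three latent structures can be illustrated with toy dimensions (all numbers below are made-up assumptions, not real model output):

```python
# Toy corpus: D documents, K topics, V vocabulary words.
D, K, V = 3, 2, 5

# 1. Word distribution per topic (word-topic matrix): K x V, each row sums to 1.
phi = [[0.40, 0.30, 0.10, 0.10, 0.10],
       [0.05, 0.05, 0.30, 0.30, 0.30]]

# 2. Topic distribution per document (topic-doc matrix): D x K, each row sums to 1.
theta = [[0.9, 0.1],
         [0.5, 0.5],
         [0.2, 0.8]]

# 3. Topic assignment z: one topic index per word token in each document.
z = [[0, 0, 1],
     [0, 1, 1],
     [1, 1, 0]]

# Both phi and theta are proper distributions row by row.
assert all(abs(sum(row) - 1.0) < 1e-9 for row in phi + theta)
```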

20
Q

Explain in one sentence the training and test process of LDA.

A

TRAIN: learn latent (hidden) variables on training data (collection of documents)

TEST: Predict topic distribution of an unseen document

21
Q

What are the two goals that LDA tries to balance, even though there is a trade-off between them?

A
  1. for each document, allocate its words to as few topics as possible
  2. for each topic, assign high probability to as few terms as possible
22
Q

What is Gibbs sampling used for?

A

Gibbs sampling is often used to estimate the posterior probability over a high-dimensional random variable z

23
Q

How does Gibbs sampling work?

A

It generates a sequence of samples from the joint probability distribution of two or more random variables.

Aim: to compute the posterior distribution over the latent variable z
Prerequisite: we must know the conditional probability of z (each variable given all the others)
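Gibbs sampling for LDA itself is more involved, but the mechanics can be shown on a toy joint distribution: a bivariate normal with correlation rho, where both full conditionals are known normals. All parameters below are illustrative.

```python
import math
import random

def gibbs_bivariate_normal(rho, n_samples, burn_in=1000, seed=0):
    rng = random.Random(seed)
    sd = math.sqrt(1 - rho * rho)  # std dev of each conditional
    x = y = 0.0
    samples = []
    for i in range(burn_in + n_samples):
        # Alternate draws from the known conditionals: this is the
        # prerequisite the card mentions.
        x = rng.gauss(rho * y, sd)  # draw x given y
        y = rng.gauss(rho * x, sd)  # draw y given x
        if i >= burn_in:
            samples.append((x, y))
    return samples

samples = gibbs_bivariate_normal(rho=0.8, n_samples=20000)

# The empirical correlation of the samples should approach rho = 0.8.
n = len(samples)
mx = sum(x for x, _ in samples) / n
my = sum(y for _, y in samples) / n
cov = sum((x - mx) * (y - my) for x, y in samples) / n
var_x = sum((x - mx) ** 2 for x, _ in samples) / n
var_y = sum((y - my) ** 2 for _, y in samples) / n
corr = cov / math.sqrt(var_x * var_y)
print(round(corr, 2))
```

The sequence of samples approximates the joint distribution even though we only ever drew from the conditionals.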

24
Q

True or False? LDA assumes that the order of words does not matter.

A

True. LDA is a bag-of-words model: only word counts matter, not word order.

(Dynamic topic models relax a different exchangeability assumption, namely that the order of documents does not matter.)