Lecture 9 - Text Summarization and Topic Modeling Flashcards

1
Q

What is the process of summarization?

A

to take an information source, extract content from it, and present the most important content to the user in a condensed form and in a manner sensitive to the user’s or application’s needs

2
Q

Why should we automate the summarization process?

A

Human-generated summaries are very expensive. But even an automated summary must contain the right information, since a summary is often the only thing people read

3
Q

What can act as an input in a summarization algorithm?

What about the output?

A

INPUT:
news or scientific articles, emails, videos, etc.

OUTPUT:
keywords, highlighted passages in the input, chunks taken directly from the input (or paraphrased and aggregated in new ways), etc.

4
Q

What are some summarization algorithms?

A
  1. keyword summarization (easy, but a poor representation of the content)
  2. sentence extraction (medium-hard; good representation of the content, but the extracted sentences often don’t fit together)
  3. natural language understanding/generation
    * build a knowledge representation of the text
    * generate sentences summarizing the content
    * hard to do well
5
Q

What method of text summarization do most current systems use?

A

Sentence extraction: extract the most important sentences and piece them together to form a text (e.g., the result snippets in Google Search)

6
Q

Differentiate between data-driven and knowledge-based text summarization.

A

Data-driven => relying on features of the input documents that can be easily computed from statistical analysis

  • word statistics (informativeness of words)
  • cue phrases (linguistic expressions such as ‘now’ and ‘well’ that may explicitly mark the structure of a discourse)
  • section headers
  • sentence position

Knowledge-based => use more sophisticated natural language processing tools

  • discourse information (text structure etc.)
  • use external lexical resources
  • use machine learning
7
Q

In data-driven text summarization, frequency can be a good indicator of the importance of a word or sentence to be included in the summary, as important content is repeated multiple times across a document. Explain the greedy frequency method.

A
  • compute word probability from input
  • compute sentence weight as a function of word probability
  • pick the best sentence
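The three steps above can be sketched in plain Python (the tokenizer, sentence splitter, and example document are all illustrative assumptions, not from the lecture):

```python
import re
from collections import Counter

def summarize_greedy(text, n_sentences=1):
    # Naive sentence split on ., !, ? (good enough for a sketch).
    sentences = [s.strip() for s in re.split(r"[.!?]", text) if s.strip()]
    # Step 1: word probability p(w) = count(w) / total number of words.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = sum(counts.values())
    p = {w: c / total for w, c in counts.items()}
    # Step 2: sentence weight = average probability of its words.
    def weight(sentence):
        toks = re.findall(r"[a-z']+", sentence.lower())
        return sum(p[t] for t in toks) / len(toks) if toks else 0.0
    # Step 3: greedily pick the highest-weight sentences.
    return sorted(sentences, key=weight, reverse=True)[:n_sentences]

doc = ("The election results were announced today. "
       "The election drew record turnout. Weather was mild yesterday.")
print(summarize_greedy(doc))  # the sentence built from frequent words wins
```

Sentences that reuse the document's frequent words ("election") outrank the off-topic one about the weather.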
8
Q

Explain the sentence clustering “algorithm” in the context of text summarization.

A
  1. cluster sentences from the input into similar themes
  2. choose one sentence to represent a theme
  3. consider bigger themes as more important

drawback: each sentence is assigned to one theme only
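The steps above can be sketched with a simple greedy clustering heuristic (Jaccard word overlap and the 0.3 threshold are assumptions; real systems often use TF-IDF or embedding similarity). The `break` makes the drawback explicit: each sentence joins exactly one theme.

```python
import re

def tokens(s):
    return set(re.findall(r"[a-z']+", s.lower()))

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def cluster_sentences(sentences, threshold=0.3):
    clusters = []  # each cluster is a list of sentences, i.e. a "theme"
    for s in sentences:
        for c in clusters:
            # Step 1: join the first theme similar enough to this sentence
            # (each sentence is assigned to one theme only).
            if jaccard(tokens(s), tokens(c[0])) >= threshold:
                c.append(s)
                break
        else:
            clusters.append([s])  # otherwise start a new theme
    # Step 3: bigger themes are more important;
    # step 2: represent each theme by its first sentence.
    clusters.sort(key=len, reverse=True)
    return [c[0] for c in clusters]

sents = ["The volcano erupted on Monday",
         "Monday the volcano erupted violently",
         "Flights were cancelled"]
print(cluster_sentences(sents))
```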

9
Q

How does text summarization with machine learning techniques work?

A
  • ask people to select the important sentences in a text (label good and bad summary sentences)
  • use these as training examples for supervised machine learning
  • each sentence is represented as a number of features
  • based on the features, distinguish sentences that are appropriate for the summary
  • run on new inputs
  • Naive Bayes classifier is a good example
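A toy sketch of this supervised setup, as a Bernoulli Naive Bayes over binary sentence features. The feature names (`cue`, `first`, `long`) and the tiny training set are hypothetical:

```python
import math
from collections import defaultdict

def train_nb(X, y, alpha=1.0):
    """X: list of binary feature dicts, y: 0/1 labels (1 = good summary sentence)."""
    vocab = {f for feats in X for f in feats}
    priors = {c: (y.count(c) + alpha) / (len(y) + 2 * alpha) for c in (0, 1)}
    counts = {0: defaultdict(float), 1: defaultdict(float)}
    for feats, label in zip(X, y):
        for f, v in feats.items():
            counts[label][f] += v
    # Laplace-smoothed P(feature = 1 | class)
    probs = {c: {f: (counts[c][f] + alpha) / (y.count(c) + 2 * alpha)
                 for f in vocab} for c in (0, 1)}
    return priors, probs

def predict(feats, priors, probs):
    scores = {}
    for c in (0, 1):
        s = math.log(priors[c])
        for f, p in probs[c].items():
            s += math.log(p if feats.get(f, 0) else 1 - p)
        scores[c] = s
    return max(scores, key=scores.get)

# Hypothetical human-labeled training sentences as feature dicts.
X = [{"cue": 1, "first": 1, "long": 1},
     {"cue": 0, "first": 1, "long": 1},
     {"cue": 0, "first": 0, "long": 0},
     {"cue": 0, "first": 0, "long": 1}]
y = [1, 1, 0, 0]
priors, probs = train_nb(X, y)
print(predict({"cue": 1, "first": 1, "long": 1}, priors, probs))  # -> 1
```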
10
Q

Explain the algorithm of sentence extraction.

A
  1. represent each sentence as a vector of features
  2. compute score based on features
  3. select the n highest-ranking sentences
  4. present them in the order in which they occur in the text
  5. post-processing to make summary more readable or concise (eliminate redundant sentences etc.)
11
Q

What are some good features that can be used to decide whether a sentence is important to be included in the summary?

A
  1. Fixed-phrase features = certain phrases indicate summary (e.g., “In summary”)
  2. Paragraph features = the first and the last paragraph are more likely to be important
  3. Thematic word feature = repetition is an indicator of importance
  4. Uppercase word features = uppercase often indicates named entities
  5. Sentence length cut-off = summary sentence should have more than 5 words
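The extraction pipeline and the features above can be combined in a short sketch; the feature weights and cue-phrase list are illustrative assumptions, not values from the lecture:

```python
CUE_PHRASES = ("in summary", "in conclusion", "to conclude")
WEIGHTS = {"fixed_phrase": 3, "position": 2, "uppercase": 1, "long_enough": 1}

def features(sentence, index, n_sentences):
    words = sentence.split()
    return {
        "fixed_phrase": any(sentence.lower().startswith(c) for c in CUE_PHRASES),
        "position": index == 0 or index == n_sentences - 1,  # first/last proxy
        "uppercase": any(w[0].isupper() for w in words[1:]),  # crude named-entity cue
        "long_enough": len(words) > 5,
    }

def extract(sentences, n=2):
    # Steps 1-2: feature vector per sentence, score = weighted feature sum.
    scored = [(sum(WEIGHTS[k] for k, v in features(s, i, len(sentences)).items() if v), i, s)
              for i, s in enumerate(sentences)]
    # Step 3: select the n highest-ranking sentences.
    top = sorted(scored, reverse=True)[:n]
    # Step 4: present them in original document order.
    return [s for _, i, s in sorted(top, key=lambda t: t[1])]

sents = ["Researchers at MIT released a new summarizer model today",
         "It is small",
         "The model beats prior baselines on two benchmarks",
         "In summary, the MIT model sets a new state of the art"]
print(extract(sents, 2))
```

The cue-phrase sentence and the opening sentence win; the short "It is small" is filtered out by the length cut-off.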
12
Q

What texts are usually subject to multi-document summarization?

A
  • a single event/person tracked over a long period of time (e.g., news coverage of someone’s death)
  • multiple events of similar nature (marathon races etc.)
  • an issue with related events (gun control)
13
Q

What is query-specific summarization?

A

Query-specific summaries are specialized for a single information need, a query

Summarization is much easier if we have a description of what the user wants

14
Q

Why is multi-document summarization useful?

A

For presenting and organizing search results.

  • many results are very similar, and grouping closely related documents helps cover more event facets
  • summarizing similarities and differences between documents

Automatic essay grading

Topic identification

15
Q

What is topic modeling?

A

Topic modeling (an unsupervised machine learning method) provides methods for automatically organizing, understanding, searching, and summarizing large electronic archives.

16
Q

What can you use topic modeling for?

A

Discover topics in a corpus
Model the evolution of topics over time
Image annotation
Model connections between topics

17
Q

Explain what a generative model is in simple words.

A

A generative model assumes that a document is built from topics, i.e., that the topics generate the document, even though we know it is actually humans who write it. So, each document is a mixture of corpus-wide topics, each topic is a distribution over words, and each word is drawn from one of those topics.

18
Q

In a generative model such as LDA, what do we actually observe, and what do we want to infer?

A

We only observe the documents (the words themselves). The topics, the per-document topic mixtures, and the per-word topic assignments are hidden, so the goal is to infer this underlying topic structure from the observed documents.

19
Q

What are the three latent (hidden) variables in LDA?

A
  1. Word distribution per topic (word-topic matrix)
  2. Topic distribution per document (topic-doc matrix)
  3. Topic assignment per word (which topic generated each word token)

These are the three distributions that we want to infer
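The three latent structures can be illustrated with toy dimensions (all numbers below are made-up assumptions, not real model output):

```python
# Toy corpus: D documents, K topics, V vocabulary words.
D, K, V = 3, 2, 5

# 1. Word distribution per topic (word-topic matrix): K x V, each row sums to 1.
phi = [[0.40, 0.30, 0.10, 0.10, 0.10],
       [0.05, 0.05, 0.30, 0.30, 0.30]]

# 2. Topic distribution per document (topic-doc matrix): D x K, each row sums to 1.
theta = [[0.9, 0.1],
         [0.5, 0.5],
         [0.2, 0.8]]

# 3. Topic assignment z: one topic index per word token in each document.
z = [[0, 0, 1],
     [0, 1, 1],
     [1, 1, 0]]

# Both phi and theta are proper distributions row by row.
assert all(abs(sum(row) - 1.0) < 1e-9 for row in phi + theta)
```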

20
Q

Explain in one sentence the training and test process of LDA.

A

TRAIN: learn latent (hidden) variables on training data (collection of documents)

TEST: Predict topic distribution of an unseen document

21
Q

What are the two goals that LDA tries to balance, even though there is a trade-off between them?

A
  1. for each document, allocate its words to as few topics as possible
  2. for each topic, assign high probability to as few terms as possible
22
Q

What is Gibbs sampling used for?

A

Gibbs sampling is often used to estimate the posterior probability over a high-dimensional random variable z

23
Q

How does Gibbs sampling work?

A

It generates a sequence of samples from the joint probability distribution of two or more random variables.

Aim: to compute the posterior distribution over the latent variable z
Prerequisite: we must know the conditional probability of z (each variable given all the others)
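Gibbs sampling for LDA itself is more involved, but the mechanics can be shown on a toy joint distribution: a bivariate normal with correlation rho, where both full conditionals are known normals. All parameters below are illustrative.

```python
import math
import random

def gibbs_bivariate_normal(rho, n_samples, burn_in=1000, seed=0):
    rng = random.Random(seed)
    sd = math.sqrt(1 - rho * rho)  # std dev of each conditional
    x = y = 0.0
    samples = []
    for i in range(burn_in + n_samples):
        # Alternate draws from the known conditionals: this is the
        # prerequisite the card mentions.
        x = rng.gauss(rho * y, sd)  # draw x given y
        y = rng.gauss(rho * x, sd)  # draw y given x
        if i >= burn_in:
            samples.append((x, y))
    return samples

samples = gibbs_bivariate_normal(rho=0.8, n_samples=20000)

# The empirical correlation of the samples should approach rho = 0.8.
n = len(samples)
mx = sum(x for x, _ in samples) / n
my = sum(y for _, y in samples) / n
cov = sum((x - mx) * (y - my) for x, y in samples) / n
var_x = sum((x - mx) ** 2 for x, _ in samples) / n
var_y = sum((y - my) ** 2 for _, y in samples) / n
corr = cov / math.sqrt(var_x * var_y)
print(round(corr, 2))
```

The sequence of samples approximates the joint distribution even though we only ever drew from the conditionals.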

24
Q

True or False? LDA assumes that the order of words does not matter.

A

True. LDA is a bag-of-words model: only word counts matter, not word order.

(Dynamic topic models relax a different exchangeability assumption, namely that the order of documents does not matter.)