Module 3 Flashcards

1
Q

What are the 5 main types of tasks performed with text?

A

classification
clustering
sentiment analysis
conversation simulation
text generation

2
Q

What is sentiment analysis?

A

The process of identifying subjective information (such as opinions and attitudes) about an object from text.

3
Q

How can we perform classification or clustering on raw texts?

A

Tokenization
Stop-word removal
Stemming
Text normalization

4
Q

Define tokenization

A

A token is a unit of text processing, e.g., a sentence or a word. Tokenization is the process of splitting a text corpus into a sequence of tokens (segments), which serve as useful units for linguistic processing.
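
A minimal sketch of sentence-level and word-level tokenization using regular expressions. The function names and the splitting rules are illustrative assumptions; real tokenizers handle abbreviations, hyphens, and similar cases more carefully.

```python
import re

def sentence_tokenize(text):
    # Naive rule: split after ., ! or ? when followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def word_tokenize(text):
    # Lowercase, then keep runs of letters, digits, and apostrophes.
    return re.findall(r"[a-z0-9']+", text.lower())

text = "Tokens are units. Words are tokens!"
sentences = sentence_tokenize(text)
words = word_tokenize(text)
```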

5
Q

What are stop words?

A

Stop words are very common words that carry little meaning on their own, such as "and", "you", "is", "at", "the", "in", and "a". They are typically articles, prepositions, conjunctions, and sometimes pronouns.
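
A minimal sketch of stop-word removal. The stop list below contains only the examples from this card; it is a toy list, not a standard one.

```python
# Toy stop list taken from the card's examples (an assumption, not a standard list).
STOP_WORDS = {"and", "you", "is", "at", "the", "in", "a"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    # Keep only the tokens that are not in the stop list (case-insensitive).
    return [t for t in tokens if t.lower() not in stop_words]

filtered = remove_stop_words("the cat is in a box and you know it".split())
```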

6
Q

What is stemming?

A

The process of cutting off the end or beginning of a word (its suffixes or prefixes) to reduce it to its root form, called the stem. The stem is not always a valid dictionary word.
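
A deliberately naive suffix-stripping sketch to show the idea. The suffix list and the minimum stem length are assumptions made for illustration; established stemmers such as Porter's use many more rules.

```python
SUFFIXES = ("ing", "ed", "ly", "es", "s")  # hypothetical, tiny rule set

def naive_stem(word, min_stem_len=3):
    word = word.lower()
    for suffix in SUFFIXES:
        # Strip the first matching suffix, but never leave a very short stem.
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

stems = [naive_stem(w) for w in ["jumping", "jumped", "cats", "running"]]
```

Note that "running" becomes "runn" here: a stem is a truncated form and need not be a real word.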

7
Q

What is a widely used text-normalization method?

A

Term frequency-inverse document frequency (TF-IDF), a normalization based on vector-space representations. It evaluates how important a word is to a document within a corpus. TF measures how often a term appears in a document, and IDF measures how rare the term is across all documents in the corpus.
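
A minimal sketch of one common TF-IDF variant, assuming TF = count / document length and IDF = log(N / document frequency). Libraries often add smoothing, so their numbers will differ from this sketch.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: the number of documents each term appears in.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

docs = [["cat", "sat", "mat"], ["cat", "ate", "fish"], ["dog", "sat"]]
scores = tf_idf(docs)
```

In the first document, "mat" (found in one document) gets a higher weight than "cat" (found in two), and a term that appears in every document gets weight 0.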

8
Q

What does bag of words do?

A

This method treats the tokenized document as an unordered collection of words and ignores their position in the sentence. It transforms the tokens into a list of words and their frequencies. It is useful in text clustering algorithms.
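
A minimal sketch using `collections.Counter` from the standard library. The two example sentences are made up to show that word order is lost.

```python
from collections import Counter

def bag_of_words(tokens):
    # Map each word to its frequency; positions are discarded.
    return Counter(tokens)

a = bag_of_words("the dog bit the man".split())
b = bag_of_words("the man bit the dog".split())
same_bag = (a == b)  # both sentences produce the same bag
```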

9
Q

What do N-grams do?

A

Similar to bag of words, but it groups each word with its adjacent words, so it takes word position into account. N is the number of consecutive words captured in each group; 2 or 3 is usually enough.
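
A minimal sketch of word n-gram extraction. The function name `ngrams` is an illustrative choice.

```python
def ngrams(tokens, n):
    # Slide a window of n consecutive tokens over the sequence.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

bigrams_a = ngrams("the dog bit the man".split(), 2)
bigrams_b = ngrams("the man bit the dog".split(), 2)
```

Unlike their bags of words, the bigrams of these two sentences differ, because adjacency is preserved.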

10
Q

What does part of speech tagging do?

A

A method that annotates each word with the grammatical role it plays in a sentence (e.g., noun, verb, adjective).

11
Q

What is an example of a word embedding algorithm?

A

Word2Vec. It identifies the role of a word based on its context. It uses a neural network to convert the words of a given corpus into vectors, and it captures associations between words through the distance between their vectors (Euclidean distance or, more commonly, cosine similarity).

12
Q

What two major tasks does Word2Vec perform?

A

Using a context to predict the word that belongs in it, i.e., continuous bag-of-words (CBOW), or using a word to predict its context (skip-gram).
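
A minimal sketch of how training examples could be built for the two tasks. Only the pairing step is shown, not the neural network; the function names and the `window` parameter are assumptions.

```python
def skipgram_pairs(tokens, window=1):
    # Skip-gram: (center word -> one context word) pairs.
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def cbow_pairs(tokens, window=1):
    # CBOW: (all context words -> center word) pairs.
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = tokens[lo:i] + tokens[i + 1:hi]
        if context:
            pairs.append((context, center))
    return pairs

tokens = ["i", "like", "tea"]
sg = skipgram_pairs(tokens)
cb = cbow_pairs(tokens)
```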

13
Q

What is lemmatization?

A

Similar to stemming, but it also considers the role of the word in the sentence and returns the word's dictionary form (lemma).

14
Q

What are the downsides of bag of words and n-grams?

A

Bag of words fails to capture the structure of a sentence, and n-grams are inefficient in terms of resource use because the number of distinct n-grams grows quickly with n.

15
Q

What is sentiment analysis, and what three types of information must be identified automatically?

A

Also called opinion mining: the process that enables machines to automatically understand opinions.

The three types of info are:
- Polarity (pos/neg)
- Subject (topic being discussed)
- Opinion Holder (who is expressing opinion)

16
Q

What is subjectivity classification?

A

Classifying a sentence as a subjective or objective sentence (e.g. fact or opinion)

17
Q

What components can we break down sentiment analysis into?

A
  1. polarity - fine grained sentiment analysis
  2. emotions - usually research focuses on facial expressions because lexicons aren’t good at detecting emotion
  3. intentions (interested, not interested)
18
Q

What are the 3 popular approaches that perform sentiment analysis?

A
  1. using a dictionary
  2. machine learning based
  3. a hybrid approach
19
Q

What is a dictionary-based approach for sentiment analysis?

A

When the developer creates rules and then creates a sentiment analysis algorithm based on these rules.

20
Q

What are the steps in a dictionary-based approach to sentiment analysis?

A

Receive the input text, tokenize it, count the number of positive and negative words by looking up each token in the lexicon, and assign a sentiment score based on those counts.
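
A minimal sketch of these steps. The two word sets are a hypothetical toy lexicon, and scoring by "positive count minus negative count" is one simple choice among several.

```python
import re

# Hypothetical toy lexicon; real lexicons contain thousands of entries.
POSITIVE = {"good", "great", "love", "happy"}
NEGATIVE = {"bad", "awful", "hate", "sad"}

def lexicon_sentiment(text):
    tokens = re.findall(r"[a-z']+", text.lower())   # tokenize
    pos = sum(t in POSITIVE for t in tokens)         # count positive words
    neg = sum(t in NEGATIVE for t in tokens)         # count negative words
    score = pos - neg                                # assumed scoring rule
    label = "positive" if score > 0 else "negative" if score < 0 else "neutral"
    return {"positive": pos, "negative": neg, "score": score, "label": label}

result = lexicon_sentiment("I love this great phone, but the battery is bad.")
```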

21
Q

What are the popular baseline lexicons?

A

NRC, AFINN, and Bing

22
Q

How does NRC work?

A

It assigns each word a 0/1 score for each of ten categories: the sentiments positive and negative, and the emotions anger, anticipation, disgust, fear, joy, sadness, surprise, and trust.

23
Q

How does the scoring for AFINN work?

A

It assigns each word an integer score between -5 (most negative) and 5 (most positive).

24
Q

How does Bing scoring work?

A

It categorizes each word in a binary (0/1) fashion as either positive or negative.

25
Q

How many words are all these baseline lexicons based on?

A

A unigram, or one word

26
Q

What are some other popular word embedding methods?

A

Aside from Word2Vec, we have GloVe and FastText. GloVe addresses the problem that Word2Vec ignores whether some context words appear more often than others: it scans the entire corpus once and counts the co-occurrences for all words. The result is a matrix of all co-occurrences, i.e., a window-based global co-occurrence matrix. It is a dense
representation that presents the overall (global) statistics of word co-occurrences. GloVe
uses this matrix to learn word embeddings by considering the global statistical properties
of word co-occurrences, not just the local context (Word2Vec focuses on local context
only).
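
A minimal sketch of the counting step only: building window-based global co-occurrence counts over a whole corpus. Learning embeddings from these counts is not shown, and the function name and `window` parameter are assumptions.

```python
from collections import Counter

def cooccurrence_counts(sentences, window=2):
    # counts[(w, c)] = how often c appears within `window` words of w, corpus-wide.
    counts = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts

corpus = [["i", "like", "tea"], ["i", "like", "coffee"]]
counts = cooccurrence_counts(corpus, window=1)
```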

27
Q

Summarize the difference between GloVe and Word2Vec

A

Word2Vec tries to capture co-occurrences one window at a time, whereas GloVe captures the overall counts of how often co-occurrences appear.

28
Q

What is FastText and how is it different from the previous word embedding methods?

A

FastText represents each word through its character n-grams rather than treating the word as a single indivisible unit. Skip-gram in Word2Vec works on unigrams (one complete word) and takes
every single word as a single unit, but FastText uses n-grams and considers n characters together as a single unit, so a word's vector is built from the vectors of its character n-grams.
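
A minimal sketch of splitting a word into character n-grams, using "<" and ">" as word-boundary markers. Only the n-gram extraction is shown, not how the vectors are trained; the function name is an illustrative choice.

```python
def char_ngrams(word, n=3):
    # Add boundary markers so prefixes and suffixes get their own n-grams.
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

grams = char_ngrams("where")
```

Because units are sub-word, a word never seen in training can still share n-grams with known words.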

29
Q

What are some challenges of sentiment analysis?

A

Subjectivity and lack of tone in text, lack of context awareness, sarcasm

30
Q

What are some sentiment analysis packages?

A

Syuzhet in R, BERT, GPT models, RoBERTa, InstructGPT

31
Q

What is thematic analysis?

A

A popular qualitative analysis method: the process of identifying patterns across data objects, collected under the umbrella of a central concept.

32
Q

What are the two types of themes?

A

Semantic: within the explicit or surface meaning of the data

Latent: the underlying ideas, assumptions, and conceptualizations

33
Q

What are the steps of thematic analysis?

A
  1. Become familiar with the data
  2. Generate initial codes
  3. Search for themes
  4. Review themes
  5. Define (refine) themes
  6. Write-up
34
Q

What are some algorithmic theme extraction methods?

A

Rapid Automatic Keyword Extraction (RAKE), KeyBERT, etc.

35
Q

What is TextRank?

A

TextRank [Mihalcea ’04] is a graph-based ranking model for text processing that is
particularly useful for keyword and sentence extraction. It is based on the PageRank algorithm.
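
A simplified sketch of the idea for keyword extraction: link words that occur near each other, then run a PageRank-style iteration over that graph. The window size, damping factor, and iteration count are assumed defaults, and the usual filtering by part of speech is omitted.

```python
def textrank_keywords(tokens, window=2, damping=0.85, iterations=50):
    # Undirected co-occurrence graph: words within `window` positions are linked.
    neighbors = {t: set() for t in tokens}
    for i, word in enumerate(tokens):
        for other in tokens[i + 1:i + window + 1]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)
    # PageRank-style update: each word shares its score equally among its neighbors.
    scores = {t: 1.0 for t in neighbors}
    for _ in range(iterations):
        scores = {
            t: (1 - damping)
            + damping * sum(scores[n] / len(neighbors[n]) for n in neighbors[t])
            for t in neighbors
        }
    return sorted(scores, key=scores.get, reverse=True)

ranked = textrank_keywords("graph ranking graph model graph text".split(), window=1)
```

In this toy input every other word is linked only to "graph", so "graph" ranks first.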

36
Q

What does the RAKE theme extraction algorithm do?

A

RAKE [Rose ’10] identifies key phrases in a text by analyzing the co-occurrence of words and their
frequencies. It focuses on phrases containing words that frequently appear together and are relatively rare in
the document as a whole. RAKE is particularly good at extracting multi-word phrases as keywords.
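
A compact sketch of RAKE-style scoring, assuming the commonly described scheme: candidate phrases are runs of words between stop words or punctuation, each word is scored as degree / frequency, and a phrase scores the sum of its word scores. The stop list is a toy assumption.

```python
import re
from collections import defaultdict

STOP_WORDS = {"a", "an", "and", "are", "for", "in", "is", "of", "on", "the", "to"}  # toy list

def rake(text):
    # 1. Candidate phrases: runs of words between stop words or punctuation.
    phrases = []
    for chunk in re.split(r"[^\w\s]+", text.lower()):
        current = []
        for word in chunk.split():
            if word in STOP_WORDS:
                if current:
                    phrases.append(tuple(current))
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(tuple(current))
    # 2. Word scores: degree (co-occurrences within phrases, itself included) / frequency.
    freq, degree = defaultdict(int), defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)
    # 3. Phrase score: sum of the scores of its words.
    scores = {p: sum(degree[w] / freq[w] for w in p) for p in set(phrases)}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

text = ("Keyword extraction is useful. Rapid automatic keyword extraction "
        "is a method for keyword extraction.")
ranked_phrases = rake(text)
```

The longest multi-word phrase ranks first here, which matches RAKE's tendency to favor multi-word keywords.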

37
Q

What is the YAKE theme extraction algorithm?

A

YAKE [Campos ’18] starts by preprocessing the document (converting it to lowercase, removing stop
words, punctuation, etc.). Next, it uses five features to characterize each word in the text:
(i) Casing, (ii) Word Position (words occurring near the beginning of a document are valued more,
on the assumption that relevant keywords tend to concentrate at the beginning
of a document), (iii) Word Frequency, (iv) Word Relatedness to Context, and (v) Word DifSentence. Then, it
combines these features into a single measure that gives a keyword score for each word. This
score balances phrase coverage (how much of the document the phrase covers) and phrase significance (whether
the phrase is salient compared to other phrases). Afterward, it ranks candidates by the keyword score and
returns the top-scored candidates as the extracted keywords.
