Module 3 Flashcards
What are the 5 main types of tasks performed with text?
classification
clustering
sentiment analysis
conversation simulation
text generation
What is sentiment analysis?
The process of identifying subjective information (e.g., opinions and attitudes) about an object from text
How can we perform classification or clustering on raw texts?
Tokenization
Stop-word removal
Stemming
Text normalization
Define tokenization
A token is a unit for processing text, e.g., a sentence or a word. Tokenization is the process of splitting a text corpus into a set of such tokens (segments), which serve as useful units for linguistic processing.
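The idea can be sketched in a few lines of Python. This regex-based splitter is illustrative (the `tokenize` name and the pattern are my own, not from the module):

```python
import re

def tokenize(text):
    """Split text into lowercase word tokens (simple regex-based sketch)."""
    return re.findall(r"[a-z0-9']+", text.lower())

tokens = tokenize("Tokenization splits a corpus into units.")
# tokens -> ['tokenization', 'splits', 'a', 'corpus', 'into', 'units']
```

Real tokenizers handle punctuation, contractions, and sentence boundaries with more care, but the core operation is the same segmentation step.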
What are stop words?
Examples include and, you, is, at, the, in, and a. In particular: articles, prepositions, conjunctions, and sometimes pronouns.
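Stop-word removal is a simple filter over the token list. A minimal sketch, using the example words above as the (hand-picked, illustrative) stop-word set:

```python
# Hand-picked stop-word set from the card above (not a standard list).
STOP_WORDS = {"and", "you", "is", "at", "the", "in", "a"}

def remove_stop_words(tokens):
    """Drop tokens that carry little standalone meaning."""
    return [t for t in tokens if t not in STOP_WORDS]

filtered = remove_stop_words(["the", "cat", "is", "in", "a", "box"])
# filtered -> ['cat', 'box']
```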
What is stemming?
The process of cutting off the end or beginning of a word to reduce it to its root form.
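A crude suffix-stripping sketch shows the mechanism (real stemmers such as Porter's use ordered rule sets; the `stem` function and its suffix list here are made up for illustration):

```python
def stem(word):
    """Strip a common suffix if the remaining stem is long enough (crude sketch)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# stem("jumped") -> 'jump'; stem("cats") -> 'cat'
# Note the crudeness: stem("running") -> 'runn', not the true root 'run'.
```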
What is a widely used text-normalization method?
Term frequency/inverse document frequency (TF-IDF), a vector-space-representation-based normalization. It evaluates how important a word is to a document within a corpus: TF measures how frequently a term appears in a document, and IDF measures how rare the term is across the documents of the corpus.
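A minimal worked computation, assuming the common TF = count/length and IDF = log(N/df) formulation (other variants exist; the function name is mine):

```python
import math

def tf_idf(term, doc, corpus):
    """TF weighs frequency in the document; IDF down-weights terms
    that appear in many documents of the corpus."""
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in corpus if term in d)   # document frequency
    idf = math.log(len(corpus) / df)
    return tf * idf

corpus = [["the", "cat"], ["the", "dog"], ["the", "cat", "cat"]]
# "the" occurs in every document, so its IDF (and hence TF-IDF) is 0;
# "cat" is rarer across the corpus, so it gets a positive score.
score_the = tf_idf("the", corpus[2], corpus)   # -> 0.0
score_cat = tf_idf("cat", corpus[2], corpus)   # > 0
```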
What does bag of words do?
This method treats the tokenized document as a list of words and ignores their positions in the sentence. It transforms the tokens into a list of words and their frequencies. Useful in text-clustering algorithms.
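In Python this is essentially a frequency count; a sketch using the standard library's `Counter`:

```python
from collections import Counter

def bag_of_words(tokens):
    """Map each token to its frequency; word order is discarded."""
    return Counter(tokens)

bow = bag_of_words(["to", "be", "or", "not", "to", "be"])
# bow -> Counter({'to': 2, 'be': 2, 'or': 1, 'not': 1})
```

Note that "be to or not be to" would produce the identical representation, which is exactly the structural information bag of words throws away.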
What does N-grams do?
Similar to bag of words, but it groups each word with its adjacent words, so it takes word positioning into account. N refers to how many consecutive words are captured; N = 2 or 3 usually suffices.
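N-gram extraction is a sliding window over the token list; a minimal sketch (function name is mine):

```python
def ngrams(tokens, n=2):
    """Slide a window of size n over the tokens, preserving word order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

bigrams = ngrams(["new", "york", "city"], n=2)
# bigrams -> [('new', 'york'), ('york', 'city')]
```

Unlike bag of words, the pair ('new', 'york') survives as a unit, which is why n-grams capture some local structure at the cost of a much larger feature space.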
What does part of speech tagging do?
A method that identifies and annotates the grammatical role each word plays in a sentence (e.g., noun, verb, article).
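A toy lookup-based tagger illustrates the input/output shape (real taggers use trained statistical models; this lexicon and tag set are invented for the example):

```python
# Tiny hand-made lexicon (illustrative only).
LEXICON = {"the": "DET", "dog": "NOUN", "barks": "VERB"}

def pos_tag(tokens):
    """Pair each token with a part-of-speech tag from the lexicon."""
    return [(t, LEXICON.get(t, "UNK")) for t in tokens]

tagged = pos_tag(["the", "dog", "barks"])
# tagged -> [('the', 'DET'), ('dog', 'NOUN'), ('barks', 'VERB')]
```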
What is an example of a word embedding algorithm?
Word2Vec. It identifies the role of a word based on its context. Using a neural network, it converts the given corpus into word vectors, and it establishes associations between words by comparing those vectors (e.g., via Euclidean distance).
What two major tasks does Word2Vec perform?
Using the context to predict a word, i.e., Continuous Bag-of-Words (CBOW), or using the word to predict its context (skip-gram).
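The two tasks differ only in which side of the (context, word) pair is the input. This sketch generates the training pairs for both objectives, not the neural network itself (the function name and window size are my own choices):

```python
def training_pairs(tokens, window=1):
    """CBOW pairs (context words -> target word) and
    skip-gram pairs (target word -> each context word)."""
    cbow, skipgram = [], []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        cbow.append((context, target))
        skipgram.extend((target, c) for c in context)
    return cbow, skipgram

cbow, sg = training_pairs(["i", "like", "nlp"])
# cbow[1] -> (['i', 'nlp'], 'like')   # context predicts the word
# sg contains ('like', 'i') and ('like', 'nlp')  # word predicts the context
```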
What is lemmatization?
Similar to stemming, but it also considers the role of the word in the sentence.
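A tiny lookup-based sketch shows why the word's role matters (real lemmatizers use dictionaries plus POS information; this `LEMMAS` table is invented for illustration):

```python
# (word, part of speech) -> lemma; the same surface form can map differently.
LEMMAS = {("meeting", "VERB"): "meet",
          ("meeting", "NOUN"): "meeting",
          ("better", "ADJ"): "good"}

def lemmatize(word, pos):
    """Return the dictionary lemma for a word given its sentence role."""
    return LEMMAS.get((word, pos), word)

# lemmatize("meeting", "VERB") -> 'meet', but as a NOUN it stays 'meeting' --
# a distinction a pure suffix-stripping stemmer cannot make.
```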
What are the downsides of bag of words and n-grams
Bag of words fails to capture the structure of a sentence, and n-grams are inefficient in terms of resource use.
What is sentiment analysis, and what three types of info must be automatically identified?
Also called opinion mining; the process that enables machines to automatically understand opinions.
The three types of info are:
- Polarity (pos/neg)
- Subject (topic being discussed)
- Opinion holder (who is expressing the opinion)
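The polarity component can be sketched with a lexicon-based scorer (the lexicon and thresholds here are made up; real systems use large sentiment lexicons or trained classifiers, and identifying the subject and opinion holder requires deeper analysis):

```python
# Toy sentiment lexicon (illustrative only).
POLARITY = {"good": 1, "great": 1, "bad": -1, "terrible": -1}

def polarity(tokens):
    """Sum per-word polarity scores and map the total to a label."""
    score = sum(POLARITY.get(t, 0) for t in tokens)
    return "pos" if score > 0 else "neg" if score < 0 else "neutral"

# polarity(["the", "movie", "was", "great"]) -> 'pos'
# polarity(["the", "plot", "was", "terrible"]) -> 'neg'
```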