lecture 4 Flashcards
sources of bias
- selection phase (influences data)
- annotation (influences data)
- input representation: how language is encoded and fed to models
- models
- research design
importance of data
- datasets form the basis of model training, evaluation, and benchmarking
- the ways in which we collect/construct/share these datasets inform the kinds of problems the field pursues and the methods explored in algorithm development
- good-quality data helps ensure models perform well, are fair, and generalize across various contexts
text classification
corpora help us with text classification
goal: assign a label or category to a specific piece of text
why use text classification
- categorize language at word, sentence, and document level
- predict future outcomes
- find patterns
sentiment analysis
goal: predict the sentiment expressed in a piece of text (+, -, scale rating)
why is sentiment analysis hard
- sentiment is a measure of a speaker’s private state, which is unobservable
- sometimes words are a good indicator of sentiment, but many times it requires deep world + contextual knowledge
other text classification problems
- language identification: which language the text is in
- spam classification
- authorship attribution
- genre classification
- sentiment analysis: understanding public opinion
questions when building a sentiment classifier
- what is the input for each prediction (e.g., sentence, text, etc.)
–> requires substantial data
- what are the possible outputs (e.g., +, -, scale)
- how will the model decide (model decision mechanism)
- how to measure effectiveness (evaluation metrics)
–> requires substantial data
data-driven evaluation
choose a dataset for evaluation before you build a system
why is data-driven evaluation important
- controlled experimentation
- benchmarks: serve as reference points to evaluate the performance of a system
- your intuitions about inputs are probably wrong
where to get a corpus
- many corpora are prepared specifically for linguistic/NLP research with text from providers
- collect a new one by scraping websites
gold labels
annotations used to evaluate and compare sentiment analyzers
these can be
1. derived automatically from the original data artifact (metadata such as star ratings)
2. added by a human annotator who reads the text (raising the question of how annotators decide on labels and how to resolve disagreement between them)
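for option 1, a minimal sketch of deriving labels from star-rating metadata; the 1-5 scale and the thresholds are illustrative assumptions, not the lecture's rule:

```python
# A sketch of deriving gold labels from star-rating metadata.
# Assumes a 1-5 scale; the thresholds are illustrative.
def star_rating_to_label(stars):
    if stars >= 4:
        return "positive"
    if stars <= 2:
        return "negative"
    return None  # middle ratings are ambiguous; exclude them

reviews = [("Great movie!", 5), ("Waste of time.", 1), ("It was okay.", 3)]
labeled = [(text, star_rating_to_label(s))
           for text, s in reviews if star_rating_to_label(s) is not None]
print(labeled)  # [('Great movie!', 'positive'), ('Waste of time.', 'negative')]
```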
sentiment analysis training data
(X,Y) pairs to learn h(X)
–> (input, output)
–> relies heavily on accurately labeled data
–> this is text classification
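a minimal sketch of learning h(X) from (X, Y) pairs with scikit-learn; the toy texts and the bag-of-words + logistic regression choice are illustrative assumptions, not the lecture's prescribed setup:

```python
# A sketch of learning h(X) from labeled (X, Y) pairs.
# Toy data; bag-of-words + logistic regression is one possible choice.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

X = ["loved this film", "terrible acting", "great plot", "boring and dull"]
Y = ["pos", "neg", "pos", "neg"]

h = make_pipeline(CountVectorizer(), LogisticRegression())
h.fit(X, Y)                          # learn the mapping h: X -> Y
print(h.predict(["a great film"]))   # expect ['pos'] on this toy data
```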
accuracy
- #correct / #total
- simplest measure
- not a good measure under class imbalance: a classifier that always predicts the majority class looks accurate but is useless in practice
- doesn't show the quality of predictions
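a small sketch of the imbalance pitfall described above, on made-up labels:

```python
# Why accuracy misleads under class imbalance: a classifier that
# always predicts the majority class looks accurate.
gold = ["neg"] * 95 + ["pos"] * 5   # imbalanced gold labels
pred = ["neg"] * 100                # always predicts the majority class

accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
print(accuracy)  # 0.95, yet it finds none of the positive cases
```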
confusion matrix
- gives more detailed insight into classification
- used for precision, recall, F1 score
precision
- accuracy of positive predictions (how often is my prediction correct)
- TP / (TP + FP)
- measure of quality
precise model
might not find all positives, but the ones that the model does classify as positive are very likely to be correct
not precise model
may find a lot of positives, but its selection is noisy: it wrongly flags many instances that aren't true positives
recall (sensitivity)
- how well are we capturing positive instances (how many of the positive instances do i find)
- TP / (TP + FN)
- measure of quantity
model with high recall
succeeds in finding all positive cases, even though it might also wrongly identify some negative cases as positive cases
model with low recall
not able to find all or a large part of positive cases
when to use precision vs recall
- precision: when we prioritize the quality of positive predictions over finding all positive instances
- recall: when the aim is to capture all positive cases, even if it leads to some false positives
tuning for high precision
the system should not make a mistake
tuning for high recall
the system should not miss a case
F1 measure
- balance between precision and recall (harmonic mean)
- offers better insight about model performance based on quality
- especially important for class imbalance
- 2 · (precision · recall) / (precision + recall)
F1 score
1. high: both P and R are high
2. low: both P and R are low
3. medium: one of P and R is low and the other is high
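a minimal sketch computing precision, recall, and F1 from confusion-matrix counts, following the formulas above (binary case; the toy labels are made up):

```python
# Precision, recall, and F1 from TP/FP/FN counts ("pos" = positive class).
def confusion_counts(gold, pred, positive="pos"):
    tp = sum(g == positive and p == positive for g, p in zip(gold, pred))
    fp = sum(g != positive and p == positive for g, p in zip(gold, pred))
    fn = sum(g == positive and p != positive for g, p in zip(gold, pred))
    return tp, fp, fn

def precision_recall_f1(gold, pred):
    tp, fp, fn = confusion_counts(gold, pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = ["pos", "pos", "neg", "neg", "pos"]
pred = ["pos", "neg", "neg", "pos", "pos"]
print(precision_recall_f1(gold, pred))  # (0.667, 0.667, 0.667)
```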
random baseline
- method to provide a reference point for evaluating classification model performance
- labels are assigned to observations at random
- fix random seed
- repeat n times
- average results
- serves as benchmark against which better models are evaluated
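a minimal sketch of the random baseline steps above (fixed seed, n repetitions, averaged accuracy); the toy labels are made up:

```python
# Random baseline: assign labels at random with a fixed seed,
# repeat n times, and average the accuracy.
import random

def random_baseline(gold, labels, n=100, seed=42):
    rng = random.Random(seed)          # fix random seed
    scores = []
    for _ in range(n):                 # repeat n times
        pred = [rng.choice(labels) for _ in gold]
        scores.append(sum(g == p for g, p in zip(gold, pred)) / len(gold))
    return sum(scores) / len(scores)   # average results

gold = ["pos"] * 50 + ["neg"] * 50
print(random_baseline(gold, ["pos", "neg"]))  # ~0.5 for two balanced classes
```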
majority baseline
- assign most frequent class label to all instances, calculate results
- results in high accuracy when one class significantly outweighs the other, but poor performance in identifying the minority class
- comparing against this baseline ensures that models not only achieve high overall accuracy, but can also correctly identify less frequent classes
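a minimal sketch of the majority baseline on made-up imbalanced labels:

```python
# Majority baseline: assign the most frequent class to every instance.
from collections import Counter

def majority_baseline(gold):
    majority = Counter(gold).most_common(1)[0][0]
    pred = [majority] * len(gold)
    accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
    return pred, accuracy

gold = ["neg"] * 90 + ["pos"] * 10
_, acc = majority_baseline(gold)
print(acc)  # 0.9 accuracy, yet recall on the minority class is 0
```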
evaluation for multiple classes
- calculate precision and recall for every class separately
- average the results over classes
–> macro average: does not take class imbalance into account
–> weighted average: weighted by class size
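a minimal sketch of macro vs weighted averaging, shown for per-class recall (the same pattern applies to precision and F1); the toy labels are made up:

```python
# Macro average treats every class equally; weighted average
# weights each class by its size.
from collections import Counter

def per_class_recall(gold, pred):
    recalls = {}
    for c in sorted(set(gold)):
        tp = sum(g == c and p == c for g, p in zip(gold, pred))
        fn = sum(g == c and p != c for g, p in zip(gold, pred))
        recalls[c] = tp / (tp + fn) if tp + fn else 0.0
    return recalls

def macro_and_weighted(gold, pred):
    recalls = per_class_recall(gold, pred)
    sizes = Counter(gold)
    macro = sum(recalls.values()) / len(recalls)  # ignores class imbalance
    weighted = sum(recalls[c] * sizes[c] for c in recalls) / len(gold)
    return macro, weighted

gold = ["a", "a", "a", "a", "b"]
pred = ["a", "a", "a", "b", "b"]
print(macro_and_weighted(gold, pred))  # macro 0.875, weighted 0.8
```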
sentiment lexicon
- predefined list of words classified as positive/negative
- count positive and negative words within the text. predict whichever is greater.
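a minimal sketch of the count-and-compare rule above; the tiny word lists stand in for a real lexicon:

```python
# Lexicon-based sentiment: count positive and negative words,
# predict whichever is greater. The word lists are illustrative.
POSITIVE = {"good", "great", "excellent", "love", "fun"}
NEGATIVE = {"bad", "terrible", "awful", "hate", "boring"}

def lexicon_sentiment(text):
    tokens = text.lower().split()
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"  # tie, or no lexicon words found

print(lexicon_sentiment("a great film with excellent acting"))  # positive
```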
problems with sentiment lexicon
- hard to know if words that seem pos/neg are actually used that way
- opinion words might describe a character's attitude rather than an evaluation of the film
- some words are semantic modifiers (e.g., negation or intensifiers such as "not" or "very")
solutions for sentiment lexicon problems
data-driven method: use frequency counts to ascertain which words in corpora tend to be positive or negative
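a minimal sketch of the frequency-count idea: compare how often each word occurs in positive vs negative documents; the smoothed log-ratio score and the toy corpus are illustrative assumptions, not the lecture's exact method:

```python
# Score each word's polarity from its frequency in positive vs
# negative documents, with add-alpha smoothing to avoid log(0).
import math
from collections import Counter

pos_docs = ["great acting great plot", "loved the film"]
neg_docs = ["terrible plot", "boring film hated the acting"]

pos_counts = Counter(w for d in pos_docs for w in d.split())
neg_counts = Counter(w for d in neg_docs for w in d.split())

def polarity(word, alpha=1.0):
    """> 0: word leans positive; < 0: word leans negative."""
    return math.log((pos_counts[word] + alpha) / (neg_counts[word] + alpha))

print(polarity("great"))     # > 0: appears mostly in positive documents
print(polarity("terrible"))  # < 0: appears mostly in negative documents
```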
h(x)
- for text classification
- a mapping h from input data x to a label y
- two components
1. representation of the data
2. formal structure of the learning method
representation of data for text classification
- sentiment analysis: only positive and negative words
- only words in isolation (BoW)
- conjunctions of words (sequential, ngrams, other nonlinear combinations)
- higher order linguistic structure (syntax)
bag of words
- simplest representation
- text is represented as counts of the words it contains
- frequency of occurrence of each word is used as a feature for training a classifier
BoW process
- tokenize
- count
- vectorize: each dimension represents a unique word in the entire corpus, and the value in each dimension is the word’s frequency in the document
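a minimal sketch of the tokenize/count/vectorize steps on a toy corpus:

```python
# BoW process: tokenize, count, vectorize. Each vector dimension is a
# unique word in the corpus; the value is that word's count in the doc.
from collections import Counter

corpus = ["the movie was good", "the plot was bad"]
tokenized = [doc.split() for doc in corpus]            # tokenize
vocab = sorted({w for doc in tokenized for w in doc})  # one dimension per word

def vectorize(tokens):
    counts = Counter(tokens)                           # count
    return [counts[w] for w in vocab]                  # vectorize

for doc in tokenized:
    print(vectorize(doc))
# [0, 1, 1, 0, 1, 1] and [1, 0, 0, 1, 1, 1] over
# vocab = ['bad', 'good', 'movie', 'plot', 'the', 'was']
```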
why is BoW not sufficient for modeling language
- insensitive to word order or semantics
- vectors are sparse and high-dimensional
- ‘words’ are not always the most meaningful units of information
ngrams
- assign probabilities to sentences
- looking at more than one word at a time
- estimate P(S = w1…wn). this is a joint probability over all the words in S.
ngrams: chain rule
P(S = w1…wn) = P(w1) · P(w2|w1) · P(w3|w1,w2) · … · P(wn|w1…wn-1), a product of conditional probabilities
problem with chain rule + solution
- problem: conditional probabilities with long histories are just as sparse as the joint, due to the vast number of possible word combinations
- solution: independence assumption. the probability of a word only depends on a fixed number of previous words (history)
P(mast | i spent three years before the mast) ≈ P(mast | before, the) (here the history is the two previous words, i.e., a trigram model)
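a minimal sketch of estimating the trigram probability above by maximum-likelihood counts; the one-sentence corpus is just for illustration:

```python
# Trigram MLE: P(word | w1, w2) = count(w1, w2, word) / count(w1, w2).
from collections import Counter

corpus = "i spent three years before the mast".split()
trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))
bigrams = Counter(zip(corpus, corpus[1:]))

def p_trigram(word, w1, w2):
    history = bigrams[(w1, w2)]
    return trigrams[(w1, w2, word)] / history if history else 0.0

print(p_trigram("mast", "before", "the"))  # 1.0 in this toy corpus
```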
ngrams usefulness for sentiment analysis
ngrams capture sentiment beyond the word level, since they have more context awareness
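a minimal sketch of why this helps, using scikit-learn's CountVectorizer on made-up texts: a unigram model cannot tell "not good" from "good", but bigram features can:

```python
# Bigram features capture context that single words miss,
# e.g. negation ("not good").
from sklearn.feature_extraction.text import CountVectorizer

docs = ["not good at all", "good and fun"]
vec = CountVectorizer(ngram_range=(1, 2))  # unigrams + bigrams
vec.fit(docs)
print(vec.get_feature_names_out())  # includes 'not good' as its own feature
```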
why do we need text corpora
- to evaluate our systems
–> good science requires controlled experimentation
–> good engineering requires benchmarks
- to help our systems work well
–> data-driven methods instead of rule-based ones
–> learning
learning
collecting statistics or patterns from corpora to govern the system’s behavior
- supervised learning
- core behavior: training
- refining behavior: tuning