lecture 4 Flashcards
sources of bias
- selection phase (influences data)
- annotation (influences data)
- input representation: how language is encoded and fed to models
- models
- research design
importance of data
- datasets form the basis of model training, evaluation, and benchmarking
- the ways in which we collect/construct/share these datasets inform the kinds of problems the field pursues and the methods explored in algorithm development
- good-quality data helps ensure models perform well, are fair, and generalize across various contexts
text classification
corpora help us with text classification
goal: assign a label or category to a specific piece of text
why use text classification
- categorize language at word, sentence, and document level
- predict future outcomes
- find patterns
sentiment analysis
goal: predict the sentiment expressed in a piece of text (+, -, scale rating)
why is sentiment analysis hard
- sentiment is a measure of a speaker’s private state, which is unobservable
- sometimes words are a good indicator of sentiment, but many times it requires deep world + contextual knowledge
other text classification problems
- language identification: which language the text is in
- spam classification
- authorship attribution
- genre classification
- sentiment analysis: understanding public opinion
questions when building a sentiment classifier
- what is the input for each prediction (e.g., sentence, text, etc.)
–> requires substantial data
- what are the possible outputs (e.g., +, -, scale)
- how will the model decide (model decision mechanism)
- how to measure effectiveness (evaluation metrics)
–> requires substantial data
data-driven evaluation
choose a dataset for evaluation before you build a system
why is data-driven evaluation important
- controlled experimentation
- benchmarks: serve as reference points to evaluate the performance of a system
- your intuitions about inputs are probably wrong
where to get a corpus
- many corpora are prepared specifically for linguistic/NLP research with text from providers
- collect a new one by scraping websites
gold labels
annotations used to evaluate and compare sentiment analyzers
these can be
1. derived automatically from the original data artifact (metadata such as star ratings)
2. added by a human annotator who reads the text (raising the question of how annotators decide on labels and how to resolve disagreement between them)
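for option 1, a minimal sketch of deriving labels from star-rating metadata; the 1-5 scale and the thresholds are illustrative assumptions, not the lecture's rule:

```python
# A sketch of deriving gold labels from star-rating metadata.
# Assumes a 1-5 scale; the thresholds are illustrative.
def star_rating_to_label(stars):
    if stars >= 4:
        return "positive"
    if stars <= 2:
        return "negative"
    return None  # middle ratings are ambiguous; exclude them

reviews = [("Great movie!", 5), ("Waste of time.", 1), ("It was okay.", 3)]
labeled = [(text, star_rating_to_label(s))
           for text, s in reviews if star_rating_to_label(s) is not None]
print(labeled)  # [('Great movie!', 'positive'), ('Waste of time.', 'negative')]
```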
sentiment analysis training data
(X,Y) pairs to learn h(X)
–> (input, output)
–> relies heavily on accurately labeled data
–> this is text classification
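a minimal sketch of learning h(X) from (X, Y) pairs with scikit-learn; the toy texts and the bag-of-words + logistic regression choice are illustrative assumptions, not the lecture's prescribed setup:

```python
# A sketch of learning h(X) from labeled (X, Y) pairs.
# Toy data; bag-of-words + logistic regression is one possible choice.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

X = ["loved this film", "terrible acting", "great plot", "boring and dull"]
Y = ["pos", "neg", "pos", "neg"]

h = make_pipeline(CountVectorizer(), LogisticRegression())
h.fit(X, Y)                          # learn the mapping h: X -> Y
print(h.predict(["a great film"]))   # expect ['pos'] on this toy data
```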
accuracy
- #correct / #total
- simplest measure
- not a good measure under class imbalance: a classifier that always predicts the majority class looks accurate but is useless in practice
- doesn't show the quality of predictions
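a small sketch of the imbalance pitfall described above, on made-up labels:

```python
# Why accuracy misleads under class imbalance: a classifier that
# always predicts the majority class looks accurate.
gold = ["neg"] * 95 + ["pos"] * 5   # imbalanced gold labels
pred = ["neg"] * 100                # always predicts the majority class

accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
print(accuracy)  # 0.95, yet it finds none of the positive cases
```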
confusion matrix
- gives more detailed insight into classification
- used for precision, recall, F1 score
precision
- accuracy of positive predictions (how often is my prediction correct)
- TP / (TP + FP)
- measure of quality
precise model
might not find all positives, but the ones that the model does classify as positive are very likely to be correct
not precise model
may find a lot of positives, but its selection is noisy: it wrongly flags many instances that aren't true positives
recall (sensitivity)
- how well are we capturing positive instances (how many of the positive instances do i find)
- TP / (TP + FN)
- measure of quantity
model with high recall
succeeds in finding all positive cases, even though it might also wrongly identify some negative cases as positive cases
model with low recall
not able to find all or a large part of positive cases
when to use precision vs recall
- precision: when we prioritize the quality of positive predictions over finding all positive instances
- recall: when the aim is to capture all positive cases, even if it leads to some false positives
tuning for high precision
the system should not make a mistake
tuning for high recall
the system should not miss a case
F1 measure
- balance between precision and recall (harmonic mean)
- offers better insight about model performance based on quality
- especially important for class imbalance
- 2 · (precision · recall) / (precision + recall)
F1 score
1. high: both P and R are high
2. low: both P and R are low
3. medium: one of P and R is low and the other is high
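a minimal sketch computing precision, recall, and F1 from confusion-matrix counts, following the formulas above (binary case; the toy labels are made up):

```python
# Precision, recall, and F1 from TP/FP/FN counts ("pos" = positive class).
def confusion_counts(gold, pred, positive="pos"):
    tp = sum(g == positive and p == positive for g, p in zip(gold, pred))
    fp = sum(g != positive and p == positive for g, p in zip(gold, pred))
    fn = sum(g == positive and p != positive for g, p in zip(gold, pred))
    return tp, fp, fn

def precision_recall_f1(gold, pred):
    tp, fp, fn = confusion_counts(gold, pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = ["pos", "pos", "neg", "neg", "pos"]
pred = ["pos", "neg", "neg", "pos", "pos"]
print(precision_recall_f1(gold, pred))  # (0.667, 0.667, 0.667)
```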
random baseline
- method to provide a reference point for evaluating classification model performance
- labels are assigned to observations at random
- fix random seed
- repeat n times
- average results
- serves as benchmark against which better models are evaluated
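a minimal sketch of the random baseline steps above (fixed seed, n repetitions, averaged accuracy); the toy labels are made up:

```python
# Random baseline: assign labels at random with a fixed seed,
# repeat n times, and average the accuracy.
import random

def random_baseline(gold, labels, n=100, seed=42):
    rng = random.Random(seed)          # fix random seed
    scores = []
    for _ in range(n):                 # repeat n times
        pred = [rng.choice(labels) for _ in gold]
        scores.append(sum(g == p for g, p in zip(gold, pred)) / len(gold))
    return sum(scores) / len(scores)   # average results

gold = ["pos"] * 50 + ["neg"] * 50
print(random_baseline(gold, ["pos", "neg"]))  # ~0.5 for two balanced classes
```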
majority baseline
- assign most frequent class label to all instances, calculate results
- results in high accuracy when one class significantly outweighs the other, but poor performance in identifying the minority class
- comparing against this baseline ensures that models not only achieve high overall accuracy, but can also correctly identify less frequent classes
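a minimal sketch of the majority baseline on made-up imbalanced labels:

```python
# Majority baseline: assign the most frequent class to every instance.
from collections import Counter

def majority_baseline(gold):
    majority = Counter(gold).most_common(1)[0][0]
    pred = [majority] * len(gold)
    accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
    return pred, accuracy

gold = ["neg"] * 90 + ["pos"] * 10
_, acc = majority_baseline(gold)
print(acc)  # 0.9 accuracy, yet recall on the minority class is 0
```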
evaluation for multiple classes
- calculate precision and recall for every class separately
- average the results over classes
–> macro average: does not take class imbalance into account
–> weighted average: weighted by class size
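a minimal sketch of macro vs weighted averaging, shown for per-class recall (the same pattern applies to precision and F1); the toy labels are made up:

```python
# Macro average treats every class equally; weighted average
# weights each class by its size.
from collections import Counter

def per_class_recall(gold, pred):
    recalls = {}
    for c in sorted(set(gold)):
        tp = sum(g == c and p == c for g, p in zip(gold, pred))
        fn = sum(g == c and p != c for g, p in zip(gold, pred))
        recalls[c] = tp / (tp + fn) if tp + fn else 0.0
    return recalls

def macro_and_weighted(gold, pred):
    recalls = per_class_recall(gold, pred)
    sizes = Counter(gold)
    macro = sum(recalls.values()) / len(recalls)  # ignores class imbalance
    weighted = sum(recalls[c] * sizes[c] for c in recalls) / len(gold)
    return macro, weighted

gold = ["a", "a", "a", "a", "b"]
pred = ["a", "a", "a", "b", "b"]
print(macro_and_weighted(gold, pred))  # macro 0.875, weighted 0.8
```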
sentiment lexicon
- predefined list of words classified as positive/negative
- count positive and negative words within the text. predict whichever is greater.
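a minimal sketch of the count-and-compare rule above; the tiny word lists stand in for a real lexicon:

```python
# Lexicon-based sentiment: count positive and negative words,
# predict whichever is greater. The word lists are illustrative.
POSITIVE = {"good", "great", "excellent", "love", "fun"}
NEGATIVE = {"bad", "terrible", "awful", "hate", "boring"}

def lexicon_sentiment(text):
    tokens = text.lower().split()
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"  # tie, or no lexicon words found

print(lexicon_sentiment("a great film with excellent acting"))  # positive
```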
problems with sentiment lexicon
- hard to know if words that seem pos/neg are actually used that way
- opinion words might describe a character's attitude rather than an evaluation of the film
- some words are semantic modifiers (e.g., negation or intensifiers such as "not" or "very")
solutions for sentiment lexicon problems
data-driven method: use frequency counts to ascertain which words in corpora tend to be positive or negative
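a minimal sketch of the frequency-count idea: compare how often each word occurs in positive vs negative documents; the smoothed log-ratio score and the toy corpus are illustrative assumptions, not the lecture's exact method:

```python
# Score each word's polarity from its frequency in positive vs
# negative documents, with add-alpha smoothing to avoid log(0).
import math
from collections import Counter

pos_docs = ["great acting great plot", "loved the film"]
neg_docs = ["terrible plot", "boring film hated the acting"]

pos_counts = Counter(w for d in pos_docs for w in d.split())
neg_counts = Counter(w for d in neg_docs for w in d.split())

def polarity(word, alpha=1.0):
    """> 0: word leans positive; < 0: word leans negative."""
    return math.log((pos_counts[word] + alpha) / (neg_counts[word] + alpha))

print(polarity("great"))     # > 0: appears mostly in positive documents
print(polarity("terrible"))  # < 0: appears mostly in negative documents
```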
h(x)
- for text classification
- a mapping h from input data x to a label y
- two components
1. representation of the data
2. formal structure of the learning method
representation of data for text classification
- sentiment analysis: only positive and negative words
- only words in isolation (BoW)
- conjunctions of words (sequential, ngrams, other nonlinear combinations)
- higher order linguistic structure (syntax)
bag of words
- simplest representation
- text is represented as counts of the words it contains
- frequency of occurrence of each word is used as a feature for training a classifier
BoW process
- tokenize
- count
- vectorize: each dimension represents a unique word in the entire corpus, and the value in each dimension is the word’s frequency in the document
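a minimal sketch of the tokenize/count/vectorize steps on a toy corpus:

```python
# BoW process: tokenize, count, vectorize. Each vector dimension is a
# unique word in the corpus; the value is that word's count in the doc.
from collections import Counter

corpus = ["the movie was good", "the plot was bad"]
tokenized = [doc.split() for doc in corpus]            # tokenize
vocab = sorted({w for doc in tokenized for w in doc})  # one dimension per word

def vectorize(tokens):
    counts = Counter(tokens)                           # count
    return [counts[w] for w in vocab]                  # vectorize

for doc in tokenized:
    print(vectorize(doc))
# [0, 1, 1, 0, 1, 1] and [1, 0, 0, 1, 1, 1] over
# vocab = ['bad', 'good', 'movie', 'plot', 'the', 'was']
```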
why is BoW not sufficient for modeling language
- insensitive to word order or semantics
- vectors are sparse and high-dimensional
- ‘words’ are not always the most meaningful units of information
ngrams
- assign probabilities to sentences
- looking at more than one word at a time
- estimate P(S = w1…wn). this is a joint probability over all the words in S.
ngrams: chain rule
P(S = w1…wn) = P(w1) · P(w2|w1) · P(w3|w1,w2) · … · P(wn|w1…wn-1), a product of conditional probabilities
problem with chain rule + solution
- problem: conditional probabilities with long histories are just as sparse as the joint, due to the vast number of possible word combinations
- solution: independence assumption. the probability of a word only depends on a fixed number of previous words (history)
P(mast | i spent three years before the mast) ≈ P(mast | before, the) (here the history is the two previous words, i.e., a trigram model)
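a minimal sketch of estimating the trigram probability above by maximum-likelihood counts; the one-sentence corpus is just for illustration:

```python
# Trigram MLE: P(word | w1, w2) = count(w1, w2, word) / count(w1, w2).
from collections import Counter

corpus = "i spent three years before the mast".split()
trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))
bigrams = Counter(zip(corpus, corpus[1:]))

def p_trigram(word, w1, w2):
    history = bigrams[(w1, w2)]
    return trigrams[(w1, w2, word)] / history if history else 0.0

print(p_trigram("mast", "before", "the"))  # 1.0 in this toy corpus
```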
ngrams usefulness for sentiment analysis
ngrams capture sentiment beyond the word level, since they have more context awareness
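a minimal sketch of why this helps, using scikit-learn's CountVectorizer on made-up texts: a unigram model cannot tell "not good" from "good", but bigram features can:

```python
# Bigram features capture context that single words miss,
# e.g. negation ("not good").
from sklearn.feature_extraction.text import CountVectorizer

docs = ["not good at all", "good and fun"]
vec = CountVectorizer(ngram_range=(1, 2))  # unigrams + bigrams
vec.fit(docs)
print(vec.get_feature_names_out())  # includes 'not good' as its own feature
```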
why do we need text corpora
- to evaluate our systems
–> good science requires controlled experimentation
–> good engineering requires benchmarks
- to help our systems work well
–> data-driven methods instead of rule-based ones
–> learning
learning
collecting statistics or patterns from corpora to govern the system’s behavior
- supervised learning
- core behavior: training
- refining behavior: tuning