NLP Flashcards

1
Q

Summarization

A

idea?

2
Q

Ambiguity

A

t

3
Q

morpheme

A

a meaningful morphological unit of a language that cannot be further divided (e.g. in, come, -ing, forming incoming).

4
Q

morphological

A

relating to the form or structure of things.

5
Q

Zipf’s law

A

Zipf’s law states that, given a large sample of words used, the frequency of any word is inversely proportional to its rank in the frequency table. So word number N has a frequency proportional to 1/N.
Thus the most frequent word will occur about twice as often as the second most frequent word, three times as often as the third most frequent word, etc.
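
A compact way to state this (standard formulation, not from a specific slide): freq(r) ∝ 1/r, i.e. freq(r) ≈ C / r for rank r and a corpus-dependent constant C. As a worked example, if the most frequent word occurs 60,000 times, the law predicts roughly 60,000/2 = 30,000 occurrences for the rank-2 word and roughly 60,000/3 = 20,000 for the rank-3 word.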

6
Q

arg max or argmax

A

the arguments of the maxima are the points of the domain of some function at which the function values are maximized

In other words, arg max is the set of points, x, for which f(x) attains the function’s largest value (if it exists). Arg max may be the empty set, a singleton, or contain multiple elements.
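
A minimal Python illustration of the idea (my own toy example, not from the course material): the arg max of f over a finite domain is the input at which f is largest.

    def argmax(domain, f):
        # return the point of the domain where f attains its largest value
        return max(domain, key=f)

    print(argmax(range(-3, 4), lambda x: -(x - 1) ** 2))  # -> 1, where f(x) = -(x-1)^2 peaks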

7
Q

Language model?

A
  • Models that assign probabilities to sequences of words
  • the simplest model that assigns probabilities to sentences and sequences of words is the N-gram LM
  • Goal: Assign useful probabilities P(x) to sentences x
  • related task: assign the probability P(w_i+1 | w_i) of an upcoming word

-> Probabilities should broadly indicate plausibility of sentences
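
As a sketch of how such probabilities are usually factored (the standard chain-rule formulation, not tied to a particular slide):

    P(x) = P(w_1, ..., w_n) = P(w_1) * P(w_2 | w_1) * ... * P(w_n | w_1, ..., w_n-1)

An N-gram model then approximates each factor by conditioning only on the previous N-1 words, e.g. a bigram model uses P(w_i | w_i-1).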

8
Q

Noisy Channel Model

A
  • Goal: predict sentence given acoustics
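
A hedged sketch of the usual noisy-channel decision rule (standard formulation; here w is the sentence and a the acoustics, names chosen for illustration):

    w* = argmax_w P(w | a) = argmax_w P(a | w) * P(w)

where P(w) is the language-model prior and P(a | w) is the channel (acoustic) model; this ties together the “prior probability” and “channel model probability” cards below.
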
9
Q

Unigram, bigram, trigram

A

t

10
Q

stupid backoff

A

t

11
Q

Prior probability

A

a probability as assessed before making reference to certain relevant observations, especially subjectively or on the assumption that all possible outcomes be given the same probability.

the prior of an uncertain quantity is the probability distribution that would express one’s beliefs about this quantity before some evidence is taken into account

12
Q

Channel model probability

A

t

13
Q

Markov Model / Markov Chain

A

A finite state automaton with probabilistic state transitions

Makes the Markov assumption that the next state depends only on the current state and is independent of the previous history

14
Q

Higher-order Markov Language Models

A

k-gram models are equivalent to bigram models whose state space consists of (k−1)-tuples of words (e.g. a trigram model corresponds to a bigram model over word pairs)

15
Q

Maximum likelihood estimator

A

P(w_i | w_i-1)_mle = count(w_i-1,w_i) / count(w_i-1)

  • it is the estimate that makes it most likely that the word “w_i” will occur x times in a corpus of y million words
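
A minimal sketch of this estimate on a toy corpus (illustrative only; the corpus and code are mine, not from the slides):

    from collections import Counter

    corpus = "the cat sat on the mat the cat slept".split()
    bigrams = Counter(zip(corpus, corpus[1:]))
    unigrams = Counter(corpus)

    def p_mle(w_prev, w):
        # P(w | w_prev) = count(w_prev, w) / count(w_prev)
        return bigrams[(w_prev, w)] / unigrams[w_prev]

    print(p_mle("the", "cat"))  # 2/3: "the" occurs 3 times, twice followed by "cat"
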
16
Q

Back-off Models

A
  • use trigram if you have good evidence, otherwise bigram, otherwise unigram

related idea: interpolation -> mix unigram, bigram and trigram (mix all three all the time)
-> most of the time interpolation works better than backoff

17
Q

Laplace Smoothing

A
  • add-one: pretend we saw each word one more time than we did -> add one to all counts
  • also: add-x smoothing
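
A minimal sketch of the add-one estimate (reusing Counter-style bigram/unigram counts like those in the MLE card below; V is the vocabulary size):

    from collections import Counter

    def p_laplace(w_prev, w, bigrams: Counter, unigrams: Counter):
        # (count(w_prev, w) + 1) / (count(w_prev) + V)
        V = len(unigrams)
        return (bigrams[(w_prev, w)] + 1) / (unigrams[w_prev] + V)
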
18
Q

Back-off: Linear Interpolation

A

related idea to backoff: interpolation -> mix unigram, bigram and trigram (mix all three all the time)
-> most of the time interpolation works better than backoff

-> how to set the lambdas: use a held-out corpus (separate from the training and test sets)
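
The standard simple-interpolation formula (textbook formulation; the lambdas sum to 1):

    P_interp(w_i | w_i-2, w_i-1) = lambda_1 * P(w_i | w_i-2, w_i-1)
                                 + lambda_2 * P(w_i | w_i-1)
                                 + lambda_3 * P(w_i)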

19
Q

interpolate?

A

interpolation is a method of constructing new data points within the range of a discrete set of known data points

20
Q

Extrapolation

A

extrapolation is the process of estimating, beyond the original observation range, the value of a variable on the basis of its relationship with another variable. It is similar to interpolation, which produces estimates between known observations, but extrapolation is subject to greater uncertainty and a higher risk of producing meaningless results

21
Q

gradient descent

A

tbd!

22
Q

perplexity

A

=> intrinsic evaluation of a language model
- is based on the inverse probability of the test set
-> perplexity is the most common intrinsic measure
=> perplexity is the inverse probability of the test set, normalized by the number of words

vs. extrinsic evaluation (e.g. plugging the LM into a speech recognition system or MT system)

=> perplexity is a function of the probability of the sentence
=> also: perplexity is the “weighted equivalent branching factor” (also “average branching factor”)
=> in general: the lower the perplexity, the better the model
=> minimizing perplexity is the same as maximizing probability
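
The standard definition, for a test set W = w_1 w_2 ... w_N (textbook formulation):

    PP(W) = P(w_1 w_2 ... w_N)^(-1/N) = (1 / P(w_1 w_2 ... w_N))^(1/N)

so a model that assigns higher probability to the test set has lower perplexity.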

23
Q

a better model of a text

A

is one which assigns a higher probability to the word that actually occurs

24
Q

the best language model

A

is one that best predicts an unseen test set

-> gives the highest P(sentence)

25
Q

chain rule

A

permits the calculation of any member of the joint distribution of a set of random variables using only conditional probabilities
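
Written out (standard formulation):

    P(w_1, w_2, ..., w_n) = P(w_1) * P(w_2 | w_1) * P(w_3 | w_1, w_2) * ... * P(w_n | w_1, ..., w_n-1)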

26
Q

long-distance dependencies/information

A

language has long-distance dependencies that N-gram models cannot capture, but in practice N-gram models are mostly sufficient

27
Q

how do we deal with bigrams that have zero probability?

A

simplest idea: Add-one (Laplace) smoothing

28
Q

smoothing

A

Methods for assigning probability mass to rare events

29
Q

backoff

A

use trigram if you have good evidence, otherwise back off and use bigram, otherwise unigram

in practice, interpolation works better than backoff

30
Q

interpolation

A

mix unigram, bigram, trigram

in practice, interpolation works better than backoff

31
Q

Add-one (Laplace) Smoothing

A

the simplest way of dealing with bigrams that have zero probability

-> ok for text classification, not for language modeling

32
Q

how do we deal with bigrams that have zero probability?

A

simplest idea: Add-one (Laplace) smoothing

33
Q

smoothing

A

Methods for assigning probability mass to rare events

34
Q

backoff

A

use trigram if you have good evidence, otherwise back off and use bigram, otherwise unigram

35
Q

interpolation

A

mix unigram, bigram, trigram

36
Q

log-likelihood

A

tbd

37
Q

dot-product

A

tbd

38
Q

Kneser-Ney smoothing

A

tbd

39
Q

Open vs Closed classes

A

tbd

40
Q

generative vs discriminative model

A
  • generative: models the joint probability P(x, y) of the inputs and the labels
  • discriminative: models the conditional probability P(y | x)
  • discriminative models are usually harder to train (we have to solve an optimization problem, e.g. with gradient descent); with generative models we usually just count
41
Q

log-linear model

A
  • a conditional probability model
  • sometimes also called logistic regression model
  • a discriminative model
  • they have a feature function phi of x,y
  • a weight vector
    1) model: … cf 03-19
    2) learning: argmax of w
    3) prediction: argmax of y
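
A hedged sketch of the usual log-linear form (standard formulation, using the feature function phi and weight vector w mentioned above):

    P(y | x; w) = exp(w · phi(x, y)) / sum_y' exp(w · phi(x, y'))

learning chooses w to maximize the (log-)likelihood of the training data, and prediction returns argmax_y P(y | x; w).
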
42
Q

Sequence Methods

A

e.g. HMM -> a generative model (we model the joint probability of input and output)

43
Q

named entity recognition (NER)

A

a sequence labeling task

e.g. twoDigitNum, allCaps, lowercase

see 03-24

44
Q

Part of speech tags

A
  • t
45
Q

a sequence of categories

A

e.g. verb noun conj.verb adverb

46
Q

transition & emission in HMM?

A

transition: y_i -> y_i+1
emission: y_i -> x_i

47
Q

transition probabilities

A

–> HMM: Pr(y_i | y_i−1) (the probability of the next hidden state/tag given the previous one)

48
Q

two sequences in HMM

A

• In a hidden Markov model there are two sequences. One is a Markov chain (the y_i’s) and one is emitted from the Markov chain (x_i’s)

  1. sequence: Markov chain (the labels i.e. y_i’s)
  2. sequence: emission of Markov chain (the emitted x_i’s)
49
Q

joint probability

A

A joint probability is a statistical measure in which the likelihood of two events occurring together, at the same point in time, is calculated. Joint probability is the probability of event Y occurring at the same time event X occurs.

50
Q

HMM

A

An HMM is a probabilistic sequence model: given a sequence of units (words, letters, morphemes, sentences, whatever), it computes a probability distribution over possible sequences of labels
and chooses the best label sequence.

  • > with HMM we model the joint probability of the input and the output (we also model the probability of seeing a certain sequence of words (input), not only tags)
  • > HMM makes strong assumptions
51
Q

e(x_i,y_i)

A

emission probability: prob of word x_i given category y_i

52
Q

smoothed estimation (trigram HMM)

A

1) for the transition prob: smooth the transition parameters just like in n-gram models, using lambda_1, lambda_2, lambda_3
2) for emission prob: usually pseudowords are used

  • use training corpus to do different counts (linear interpolation)
53
Q

HMM: Smoothing with Pseudowords

A

finesse this problem of low-frequency words, or words never seen in training, by closing the vocabulary: map those low-frequency words to a much smaller, finite set of perhaps 20 “word classes” which preserve information about the spellings

54
Q

HMM: Inference

A

HMM: training, smoothing
next: inference
problem statement: for a given input sentence, find the most likely sequence of tags (over all possible sequences of tags)
-> argmax over sequences of y’s
-> but naive search is exponential in the length of the sentence (|tags|^n candidate sequences)
=> cf. Viterbi algorithm -> tractable solution

55
Q

Viterbi algorithm

A

a dynamic programming algorithm for finding the most likely sequence of hidden states (sequence of tags y’s) that results in a sequence of observed events
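
A compact Python sketch of Viterbi for a bigram HMM (illustrative only; the function signature and probability tables are my own, not from the course):

    def viterbi(words, tags, start_p, trans_p, emit_p):
        # V[i][t] = probability of the best tag sequence for words[:i+1] that ends in tag t
        V = [{t: start_p[t] * emit_p[t].get(words[0], 1e-12) for t in tags}]
        back = [{}]
        for i in range(1, len(words)):
            V.append({}); back.append({})
            for t in tags:
                best_prev = max(tags, key=lambda s: V[i - 1][s] * trans_p[s][t])
                V[i][t] = V[i - 1][best_prev] * trans_p[best_prev][t] * emit_p[t].get(words[i], 1e-12)
                back[i][t] = best_prev
        last = max(tags, key=lambda t: V[-1][t])
        path = [last]
        for i in range(len(words) - 1, 0, -1):
            path.append(back[i][path[-1]])
        return list(reversed(path))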

56
Q

Dynamic programming

A

dynamic programming (also known as dynamic optimization) is a method for solving a complex problem by breaking it down into a collection of simpler subproblems, solving each of those subproblems just once, and storing their solutions. The next time the same subproblem occurs, instead of recomputing its solution, one simply looks up the previously computed solution, thereby saving computation time at the expense of a (hopefully) modest expenditure in storage space
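
A tiny, generic illustration of the idea (not NLP-specific; a classic memoization example): each subproblem is solved once and its cached result is reused instead of being recomputed.

    from functools import lru_cache

    @lru_cache(maxsize=None)
    def fib(n):
        # each fib(k) is computed once and then looked up
        return n if n < 2 else fib(n - 1) + fib(n - 2)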

57
Q

HMM makes strong assumptions

A
  • > about how the different predictions in the sequence relate to one another:
    1) The word (emitted value) x_i is independent of the rest of the variables given y_i, i.e. for each x_i we assume it is sufficient to look at y_i; we don’t need to condition on the full sequence of labels

=> BUT: Words are dependent on other words in the sentence, even given their POS
=> we want to assume more dependence between the words!

2) a POS tag is independent of past tags given the previous tag
=> BUT: POS tags have dependencies that cannot be bounded in distance

58
Q

feature in log-linear model

A
  • > the features in a log-linear model serve the same function as the emission probabilities in HMM
  • > the higher the feature value f, the higher the probability of a certain y
59
Q

prediction

A

= tagging

60
Q
  1. model
  2. inference
  3. learning
A
  2. inference: given the observed variables, how do we find the most likely y, cf. Viterbi algorithm
  3. learning: how to find the parameters
61
Q

first-order model

A

t

62
Q

multinomial

A

you have a chart / lookup table for each word and label and for each pair of labels (?)

63
Q

log linear model

A

you have a feature function

64
Q

what’s the difference between a feature-based classifier and a classifier which just has a multinomial?

A

feature-based classifier: two completely different sequences of words can still have similar features; if that is the case, there is a connection between the sequences because of the features -> they influence each other even though they are not the same

multinomial: you have a chart of labels/words whose entries are independent. It tells you: if you have this word, then it has this value. You don’t model any relation between entries.

65
Q

feature vector

A

a feature vector consists of, say, m features, many of which could be binary features or “indicator functions” that ask some question

66
Q

gradient ascent

A

hill-climbing technique for concave functions (whose local maximum is also the global maximum)
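
A minimal sketch of the update rule on a toy concave function (my own example, not from the slides): maximize f(w) = -(w - 3)^2, whose gradient is -2(w - 3).

    w, learning_rate = 0.0, 0.1
    for _ in range(100):
        gradient = -2 * (w - 3)
        w += learning_rate * gradient   # step uphill, in the direction of the gradient
    print(round(w, 3))                  # converges to the maximizer w = 3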

67
Q

named entity recognition (NER)

A

tbd

68
Q

co-reference resolution

A

t

69
Q

relation extraction

A

t

70
Q

apposition

A

t

71
Q

ontology

A

t

72
Q

supervision

A

t

73
Q

precision, recall

A

precision: % of selected items that are correct
recall: % of correct items that are selected (the complementary measure: of all correct items, how many did we find)
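
A minimal sketch in code (illustrative; the counts are assumed to come from comparing predictions with a gold standard):

    def precision_recall_f1(true_pos, false_pos, false_neg):
        precision = true_pos / (true_pos + false_pos)   # of what we selected, how much is correct
        recall = true_pos / (true_pos + false_neg)      # of what is correct, how much did we select
        f1 = 2 * precision * recall / (precision + recall)
        return precision, recall, f1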

74
Q

open information extraction system

A

should be able to identify the ontology (i.e. the tables) as well as populate the database (it does not have a predefined ontology, e.g. through Wikipedia infoboxes; it only has text!)
how does it define relations? just as strings/pieces of text

75
Q

how to cluster semantically equivalent relations? cf. 11-16: these would be 3 different relation instances but they represent the same relation
syntax does not align with semantics!

A

a first step: Semantic role labeling (SRL)

idea: take a sentence, split it into events and propositions, and be able to identify that this relation is expressed and that it is very similar to other instances of the same type of relation
-> idea: the roles are invariant across the different sentences, i.e. the role is the same regardless of what syntactic position it takes - it’s kind of an abstraction over the sentence-level paraphrasing which you can do

76
Q

Semantic Roles cf. 11-24

A
  • An underlying relationship an argument has with the main verb in a clause
  • This relation is (ideally) invariant to paraphrases
  • do not necessarily align with syntactic roles
77
Q

3 main representation schemes for SRL

A

1) FrameNet scheme
judgment frame; roles: judge, evaluee, reason; the verb “blame” invokes the judgment frame
-> very accurate, but coverage very limited (only a few thousand frames)
- semantic roles are defined by the frame
- a frame does a clustering of verbs (blame, hail, etc. have the same frame)
- is more difficult to parse
- frames are manually defined
- a big problem: frame coverage

2) PropBank scheme
idea: semantic roles are more fine-grained than in FrameNet
- idea: try to abstract over the different instantiations of the same verb
- more frequently used than FrameNet; PropBank is more agnostic and does not generalize over verbs at all
- more fine-grained: e.g. “She blames…” and “She praises…” would be put in the same frame in FrameNet but use 2 different semantic roles in PropBank

3) VerbNet scheme
- takes the most coarse-grained approach
- only 22 roles
-> very generalized; sometimes it is very difficult to pick the right role because the roles are very abstractly defined
- the crux is: use the same set of roles across different classes??

78
Q

semantic role labeling (SRL)

A

you want to abstract away from the syntax to directly represent the semantic relations, e.g. FrameNet (a big dictionary of frames); several verbs can invoke the same frame (e.g. the judgment frame); frames are manually defined

  • SRL is a simplified form of semantic representation
79
Q

different type of dictionary: semantic fields

A

clustering of words that share a meaning component

80
Q

syntactic parse

A

t

81
Q

error propagation

A

usually when you separate one prediction task into 2 stages, a situation called “error propagation” arises: you make a mistake at the first stage, and then no matter what you do, you’re already doomed when you get to the second step
-> that’s a reason why people don’t use pipelines

82
Q

sparsity

A

tbd

83
Q

classifier training vs feature?

A

t

84
Q

precision, recall, F-score

A

t

85
Q

NLP

A
  • information retrieval
  • machine translation
  • question answering
  • summarization
86
Q

information extraction

A

map text to database entries

87
Q

Ambiguity

A
  • key challenge
  • Language has much underlying structure, which is ambiguously expressed
  • Ambiguity is at all levels:
  • > words can sound the same but not mean the same (bank vs. bank)
  • > Sentences may look the same, but have different meanings
  • Alongside ambiguity, there is also redundancy: many different ways to say the same thing
88
Q

morpheme

A

is the smallest grammatical unit in a language. In other words, it is the smallest meaningful unit of a language. The linguistics field of study dedicated to morphemes is called morphology. A morpheme is not identical to a word, and the principal difference between the two is that a morpheme may or may not stand alone, whereas a word, by definition, is freestanding. When it stands by itself, it is considered as a root because it has a meaning of its own (e.g. the morpheme cat) and when it depends on another morpheme to express an idea, it is an affix because it has a grammatical function (e.g. the –s in cats to indicate that it is plural). Every word comprises one or more morphemes.

89
Q

Shannon’s game (Language Modeling Problem)

A

given a sequence of words, what’s the distribution of the next word

90
Q

N-gram

A

An N-gram is a sequence of N words: a 2-gram (or bigram) is a two-word sequence of words like “please turn”, “turn your”, or “your homework”, and a 3-gram (or trigram) is a three-word sequence of words like “please turn your”, or “turn your homework”. We’ll see how to use N-gram models to estimate the probability of the last word of an N-gram given the previous words, and also to assign probabilities to entire sequences.

=> N-grams work well only for word prediction if the test corpus looks like the training corpus!

91
Q

unigram model

A

=> simplest Markov model: unigram model -> simply estimate the prob of a sequence of words by the product of the probs of the individual words (words are independent in this model)

92
Q

bigram model

A
  • Condition on previous single word
  • slightly more intelligent than unigram model

extend further to trigram, 4-gram, 5-gram

=> in general, insufficient model of language because language has long-distance dependencies

93
Q

markov assumption

A

a simplifying assumption (because we cannot count how many times every long sequence of words occurs):
-> the next state only depends on the current state and is independent of the previous history
-> P(the | its water is so transparent that) =~ P(the | that)
or
P(the | its water is so transparent that) =~ P(the | transparent that)

more formally:
P(w_i | w_1 w_2 … w_i-1) =~ P(w_i | w_i-k … w_i-1)
=> estimate by just the last few words

=> simplest Markov model: unigram model -> simply estimate the prob of a sequence of words by the product of the probs of the individual words

94
Q

Markov Model

A
The assumption that the probability of a word depends only on the previous word is called a Markov assumption.
=> Markov models are the class of probabilistic models that assume we can predict the probability of some future unit without looking too far into the past
=> Markov models assume the probability of a sequence can be given as a product of the transition probabilities
95
Q

estimate probability of a sentence using bigram

A
P(<s> I want english food </s>) =
P(I | <s>) x
P(want | I) x
P(english | want) x
P(food | english) x
P(</s> | food)
96
Q

why log space

A
  • to avoid arithmetic underflow
  • adding is faster than multiplying

log(p_1 x p_2 x p_3) = log(p_1) + log(p_2) + log(p_3)

97
Q

a held-out corpus

A

A held-out corpus is an additional training corpus that we use to set hyperparameters like these λ values, by choosing the λ values that maximize the likelihood of the held-out corpus

98
Q

how to deal with bigrams with zero probability?

A
  • simplest idea: add-one smoothing
99
Q

text classification

A

t

100
Q

Discounting Methods

A

-> Maximum likelihood estimates tend to be too high for low count items

=> One simple way to handle this is to discount all counts by a constant term (say, 0.5)

But what do we do with the missing probability mass? -> collect it from the number of discounts made and distribute this mass to bigrams with count 0

101
Q

Kneser-Ney Smoothing

A

cf P_continuation

102
Q

classification: 3 main ideas

A
  • Representation in a feature space
  • Scoring by linear functions
  • Learning by optimizing
103
Q

joint vs. conditional probability

A

tbd

104
Q

con·di·tion·al prob·a·bil·i·ty

A

the probability of an event ( A ), given that another ( B ) has already occurred

105
Q

Naive Bayes

A
  • Simplest Generative Model

Naive Bayes is a probabilistic classifier, meaning that for a document d, out of all classes c ∈ C the classifier returns the class c^ which has the maximum posterior probability given the document
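
The standard decision rule (textbook formulation, under the bag-of-words and conditional-independence assumptions):

    c^ = argmax_{c ∈ C} P(c) * product_i P(w_i | c)

where the w_i are the words of the document, P(c) is the class prior and P(w_i | c) the per-class word likelihoods, all estimated by counting.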

106
Q

feature space

A

t

107
Q

bag of words model

A

the bag of words assumption: the position of the words is ignored

=> we make use of the frequency of each word

108
Q

supervised learning

A

for the training data, you know the labels

109
Q

feature function

A

t

110
Q

likelihood

A

or log likelihood

the prob of seeing the labels given the inputs

111
Q

regularization

A

to combat overfitting with log-linear model, we add a regularization term

-> idea: to penalize the increase of w especially for low-appearance features

112
Q

neural net

A

non-linear, non-convex architecture

vs.
log-linear models: linear, convex

113
Q

feed forward neural network

A

t

114
Q

neural network

A

t

115
Q

backpropagation algorithm

A

t

116
Q

one-hot vector

A

cf. Feed-forward Neural Network

117
Q

softmax layer

A
  • > a standard way of forming a multi-class classifier

- Softmax layers turn vector outputs into a probability distribution
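
A minimal softmax sketch in code (illustrative only):

    import math

    def softmax(scores):
        m = max(scores)                        # subtract the max for numerical stability
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        return [e / total for e in exps]       # non-negative, sums to 1

    print(softmax([2.0, 1.0, 0.1]))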

118
Q

morphological paradigms

A

t

119
Q

POS tagging problem

A

The POS tagging problem is to determine the (single) POS tag for a particular word token (instance)

120
Q

main sources of information for POS tagging

A
  1. Knowledge of word probabilities (to man is rare)

2. Knowledge of neighboring words

121
Q

Sequence Labeling as Classification

A

aim: predict category of each word

Approach 1: classify each token independently but use as input features information about the surrounding tokens (sliding window) (but without the category of neighboring words)

Approach 2: using Outputs as Inputs: better input features are usually the categories of the surrounding tokens (not easy to integrate), but these are not available yet -> use the category of either the preceding or succeeding tokens by going forward or backward and using the previous output

  • > forward classification
  • > backward classification
122
Q

probabilistic sequence model

A
  • build a model that gives you a prob for the entire sequence
  • allows integrating uncertainty over multiple, interdependent classifications and collectively determine the most likely global assignment
    => classic solution: Hidden Markov Model (HMM)
    -> a global model: means it maximizes over sequences rather than local words only i.e. gives a prob for the entire sequence of words
123
Q

hidden markov model (HMM)

A

an HMM allows us to talk about both observed events (like words that we see in the input) and hidden events (like part-of-speech tags) that we think of as causal factors in our probabilistic model

124
Q

HMM - why hidden?

A

we didn’t observe part-of-speech tags in the world; we saw words and had to infer the correct tags from the word sequence. We call the part-of-speech tags hidden
because they are not observed

125
Q

Trigram HMM

A

t

126
Q

conditional independence assumption

A

given x and y_i-1, y_i is conditionally independent of everything that preceded it
(similar to Markov assumptions)
-> in MEMM, unlike HMMs, the words are always conditioned on, there is no assumption re the words

words are conditionally independent of each other given the class

127
Q

Markov chain

A

A Markov chain is a special case of a weighted automaton in which weights are probabilities (the
probabilities on all arcs leaving a node must sum to 1) and in which the input sequence uniquely determines which states the automaton will go through

-> A Markov chain is useful when we need to compute a probability for a sequence of events that we can observe in the world

128
Q

first-order Markov chain

A

A Markov chain embodies an important assumption about these probabilities. In a first-order Markov chain, the probability of a particular state depends only on the previous state

129
Q

discriminative models

A
  • allow us to model more dependencies between different words (than e.g. an HMM) and also allow the use of feature representations
130
Q

Maximum Entropy Markov Models (MEMM)

A
  • here, I assume that p(y_i|y_i-1,x_1….x_m) is a log-linear model
  • is the product of transitions
  • unlike HMM, here I don’t assume anything about the independence of the words: I condition on the words, and there is no assumption that words are independent or conditionally independent
  • 2nd assumption: the structure of each of the factors in the product forms a log-linear model
  • f = feature function
  • w = weight vector
  • the denominator is there to normalize the expression
131
Q

why exponentiate? (e.g. in log-linear model)

A

t

132
Q

multinomial vs. feature function based model?

A
  • multinomial: you have a chart/table/lookup-table for each word and label: coarse-grained
  • feature function: fine-grained

difference:
cf. 20171119_120054[1].mp4 26:00ff

133
Q

history (context: feature function)

A

why history?

  • > y_i is conditioned on y_i-1 and all the words = this is called the history
  • > so it includes everything you condition on, not just observed values but also labels (e.g. the preceding tag)
134
Q

analogy MEMM and HMM (re transition and emission prob)

A

MEMM: with the feature function we’re trying to find relevant information for defining which y’s are more probable given everything that I condition on

  • the features with word/tag pairs in the feature function f in a MEMM serve the same purpose as the emission probability in an HMM (of course: if the emission prob is high for a certain y given an x, then the feature function together with the weight vector should also yield a high probability for this y)
  • some features in f correspond to transition probabilities in an HMM: e.g. f_103 considers the two previous tags. You can have both trigram and bigram transition probabilities in f, you don’t have to restrict yourself to one of them
135
Q

block notation

A

t

136
Q

Ratnaparkhi’s Tagger

A

05-05

  • f_100 corresponds to emission prob e(x_i | y_i) in HMM
  • f_101 etc. are additional features that an HMM could not cover
  • you can use trigram, bigram, unigram model in the same feature function set
  • transition feature f_103 corresponds to transition prob q(y_i | y_i-1) in HMM (trigram)
    and transition feature f_104 same but bigram
    and transition feature f_105 same but unigram

=> AND therefore: it corresponds to backoff model smoothing method: if f_103 does not hold then maybe f_104 holds and so forth
-> 103 would have very high weight, 104 bit less, 105 again bit less (like a backoff model)

=> you can combine any number of features! -> becomes more effective

137
Q

log-linear model: when we have probabilistic classifiers, we also need to care about 2 other things aside the model?

A
  • train it and to do inference
  • given training data, we want to estimate the parameters (=the weights w), and we need to know how to do inference: given weight values and x, what is the most likely y
138
Q

MLE in MEMM

A

sl 05-05

139
Q

differentiate the log likelihood of a log-linear model

A

sl 05-05

140
Q

inference in MEMMs

A

pretty much the same as for HMM:
use Viterbi

sl 05-10/11

141
Q

difference between inference and learning? (HMM, MEMM)

A
  • inference (ie finding the max y’s): finding a maximum path on the grid representation
  • > we have w and want to find the most likely y’s
  • learning:
142
Q

label bias

A
  • problem with MEMM, HMM
    -> occurs with any locally-normalized model
  • the model favors labels that predict the next state well over labels that predict it poorly; it will be biased towards them because of their high probability
    => your model predicts an unlikely/infrequent sequence of labels just because their probability does not decay exponentially (e.g. long book names) yet that sequence is very rare
  • could be harmful, not always
  • e.g. in POS tagging it is usually not a problem
143
Q

local normalization

A

a normalization term as denominator makes sure that the sum of probabilities of all possible values of y_j is 1

equivalent on a grid representation: the sum of all outgoing edges of a node is 1

144
Q

prediction vs inference vs learning?

A

inference := the process of inferring something. a conclusion reached on the basis of evidence and reasoning.

145
Q

Sequence Conditional Random Fields (CRFs)

A
  • CRFs are similar to MEMMs, but the normalization is global
  • this solves the label bias: the scores of the next state given the current one need not sum up to 1
  • the normalization is not done for each transition but rather the sum of all paths given the x sums up to 1
  • each transition is not a probability anymore but rather a score/(or potentials)
  • you still normalize but over the sum of all the paths
  • we assume all different paths for a sentence sum up to 1; it might be that the outgoing edges of a node do not sum up to 1
  • CRFs were the standard NLP model until NNs came
146
Q

why normalize at all?

A
  • it makes sure that if you increase the prob of some path you will decrease the prob of others
  • it serves as a regularization because otherwise you could assign infinity scores, there would be no limit, no boundaries
  • > normalization makes you work with a fixed budget, then the problem has a unique maximum that you can find
147
Q

how do we train w for CRF?

A
  • how to estimate/train w, how to find the most likely weights?
  • no close form formula
  • write conditional log likelihood and maximize it
  • > compute gradient 05-17, but has intractable subtrahend term, then transform it st it has a polynomial term Pr(…) (with number of tags squared), and compute this term Pr(…) using dynamic programming for all values y_j and y_j-1 and over all j’s
148
Q

gradient computation CRF

A

t

149
Q

forward-backward algorithm (CRF)

A
  • > similar to Viterbi but instead of max I take the sum
  • M_i: unnormalized transition weight (not a probability!) to transition from y to y’
    alpha: is the sum over all paths of length i-1 that end with y (forward term)
    beta: same but start with y and go all the way to the end (backward term)
  • how is it helpful? I can compute it (tractable)
150
Q

domain adaptation

A
  • > trained with wall street journal, then apply it to wikipedia
  • Accuracies degrade outside of domain
  • e.g. try to find instances you’re most confident in
  • e.g. use lexicons
151
Q

Sentiment Analysis: Definition

A
  • Sentiment analysis is the detection of attitudes

• Simplest task:
Is the attitude of this text positive or negative?
• More complex:
Rank the attitude of this text from 1 to 5
• Advanced:
Detect the target, source, or complex attitude types

  • a simple classifier:
    Log-linear or Naïve Bayes classifier
152
Q

Negation in Sentiment Analysis

A
  • One practice is to add NOT_ to every word between a negation and the following punctuation (useful if you want to stick to the bag-of-words model, but it still fails: “didn’t like” would still get a positive polarity)
  • if you have this heuristic of adding NOT_, then instead of “like” you would have a “NOT_like”
  • > but it is still a very crude solution
153
Q

Sentiment Analysis as Sequence Labeling

A
  • This construal of sentiment analysis attempts to capture the meaning of a word in context by encoding (parts of) the sentence as features
  • Recently: using Recurrent Neural Networks (RNNs)
  • RNNs are FSAs:
    - No longer finite in the same sense
    - The state is an embedding of the history in a continuous space
154
Q

Recurrent Neural Networks

A
  • Map from dense sequence to dense representation
  • > idea RNN: save the history as a state

06-17

  • R is parametrized and parameters are shared across steps
    -> parameters of R are learned, it’s the history, everything we’ve learned so far
    ~ representation learning
    -> it obviates the need to create a feature function -> let the system find the features -> it is the NN dogma
  • with RNN, you don’t do any global optimization (you compute a y and move to the next)
  • RNN: no dynamic programming for inference
  • > the RNN computes a classification for each word on its own
  • RNN models are able to take the context (both preceding and following) into account, as well as the linear order between the words (bag of words cannot)
155
Q

sequence labeling

A

t

156
Q

Bi-RNN

A
  • When using RNNs for sentence classification (such as sentiment analysis), it is often practical to use Bi-RNNs
  • Bi-RNN: 2 RNNs, one running forward over the sentence and the other backward; the output function is a function of both states
  • This allows the states associated with each word to encode relevant information from the words following them and preceding them
  • difference from bag-of-words: the result of the latter is just based on what a word is, with RNN it is based on what that word is and the state which encodes both the words preceding it and the words following it
  • > it is kind of a modified/enhanced bag-of-words
157
Q

sentiment analysis with Bi-RNN

A
  • bottom line: encode the state/output for each word and somehow aggregate them
  • one simple way to do sentiment analysis (or other sentence classification) with Bi-RNNs is to average the output sequence and train a binary logistic classifier for predicting the sentiment
158
Q

Neural Language Models

A

t

159
Q

Character-level Language Models

A

t

160
Q

what is grammar?

A
  • eg The rules and principles that guide the structure of sentences in language
161
Q

why do we need grammar?

A

Two main motivations:

  • Structural Analysis: decomposing complex units into simpler sub-structures can assist learning and generalization
  • Semantic Information: grammatical structure often reflects semantic structure and distinctions
  • > Grammatical analysis in NLP involves breaking the text into simpler parts and categorizing them
  • > Categorization allows information to propagate from one type to another (important!) -> allow insight that plain sequences (either a word is close or distant) do not (whereas with categorization you have a tree where words can be in all sorts of configurations to one another even if they are distant)
162
Q

a word

A
  • Practical: anything that is between two spaces
  • What about languages that don’t have spaces?
  • Phonological words
  • Grammatical words
163
Q

morphology

A

is the study of words, how they are formed, and their relationship to other words in the same language. It analyzes the structure of words and parts of words, such as stems, root words, prefixes, and suffixes. Morphology also looks at parts of speech, intonation and stress, and the ways context can change a word’s pronunciation and meaning

164
Q

clause (grammar)

A
  • basic syntactic unit
  • difficult to define
  • some kind of description of some activity, state or property
  • more formally: simplest kind of sentence
  • a sentence can have several clauses
  • simple clauses have a predicate, core arguments (usually subjects and objects), and non-core arguments (e.g., location, manner, instrument)
  • > predicate = some main relation
165
Q

Inter-Clause Relations (3x) (grammar)

A
  • basic element of grammar
    Three main inter-relations between clauses within a sentence:
    -> complement clauses:
    Clauses that serve as an argument in another clause
    •“John said he will kick the ball”

-> relative clauses:
Clauses that modify an argument of another clause
•“John, who has studied the language for years, speaks German well”

-> linked clauses:
• After he had studied it for years, John could speak German well

166
Q

Basic Units (grammar)

A
  • basic element of grammar:

1) noun phrase:
- an argument of a clause is generally a noun phrase

Two major classes:

1) Proper nouns or names
2) Common nouns

  • Noun phrases generally have a head (that determines their category, a word the phrase centers around and it determines its type (usually a noun))
    -> a word that determines the semantic type of the whole phrase, and is an essential component of its meaning:
    The man who wasn’t there

2) Syntactic Heads:
- the (main) verb is considered the head of a clause (in English at least) (take it as a convention)
- Arguments and modifiers are considered its dependents:
John ran home (“ran” is the head of the clause, “John” and “home” are its dependents)

-> Some schemes view all syntactic relations as head-dependent relations

167
Q

Dependency parse

A
  • syntax is represented as head-dependent relations between words (cf tree structure)
  • Nodes are words, edges are relations
  • constituents are subtrees, but the words are all the nodes
  • The arguments and modifiers of a word are its children
  • Edge labels mark the type of relation
  • The verb (=predicate) of the main clause is the root

=> the assumption: every phrase has a head
-> this assumption allows me to build a syntactic tree where edges between words represent bi-lexical dependencies

168
Q

predicate

A
  • the thing asserted (usually the verb)

- the property that you are asserting

169
Q

Phrase-based (constituency) parse

A
  • represent the sentence as a collection of nested phrases
  • > it brackets words into phrases, parent phrases and so on
  • Words are at the leaves, phrases/constituents are subtrees
  • Non-terminal labels represent the phrase type. This is determined by the type of the headword (e.g., Noun Phrases (NP) are phrases headed by a noun)
170
Q

what information can you get from syntactic structure?
why do we want to use hierarchy? or syntactic representation? what’s wrong with just representing sentences as sequences?

A
  • linear structure might lead us astray: two words might be far away from each other but might have a direct dependency between them (“war horses which were bought by the police like to drink”)
  • same goes for speech tags or sentiment analysis (e.g. negation scope)
  • -> this syntactic structure helps us achieve all kinds of useful things, do better language modeling, better disambiguation, negation scope
  • > linear structure is just an approximation
  • > in order to really model what’s going on we need some kind of hierarchy
171
Q

information from syntactic structure we can get?

A
  • phrases (e.g. for machine translation)
  • > use a phrase-based parse /constituency tree of the English sentence and then change subtrees in order to get order of words in Japanese (tree transformation) (word permutation only won’t help here)
  • > phrase move around in the tree and not single words
  • > same would be possible with dependency tree (then you’d move subtrees)
  • extract grammatical relations
  • > extract subject, object or all events from a sentence
  • allows to do syntactic disambiguation
  • > some are detectable directly from the syntactic tree
  • a proxy to semantics
  • > by knowing the syntactic tree I already know quite a bit of information and can answer semantic questions
172
Q

dependency vs constituency tree

A

dependency:

  • constituents are subtrees, but the words are all the nodes
  • Verbs are heads of clauses, nouns are heads of noun phrases

constituency:

  • Words are at the leaves, phrases/constituents are subtrees
  • does not have the notion of a headword, it just puts brackets and has categories but does not tell you the headword
  • > however, you can extract the headword based on the categories in the tree
  • > basically it is possible to convert constituency trees into dependency trees (vice versa is trickier)
173
Q

head

A

the head is the word that determines the type of the phrase

174
Q

examples dependency trees (universal dependency format)

A

cf

08-syntax_von_pptx_1210.pdf, sl. 30,31

175
Q

Graph-Based Dependency Parsing

A
  • task of taking a piece of text and generating a dependency tree for it
  • > what learning algorithm do we use?
176
Q

sources of information for dependency parsing

A
  • Bi-lexical affinities
  • Dependency distance (mostly nearby words)
  • Intervening material (e.g. a verb is unlikely to appear between two dependent words)
  • valency (how many dependents a word has, e.g. give has two objects)
177
Q

dependency parsing evaluation (evaluation measures)

A
  • accuracy = # correct edges / # words

- labeled/unlabeled attachment score

178
Q

Graph-based Parsing (context: dependency tree)

A
  • a “structured prediction problem”

- MST Parser: minimum spanning tree (same algorithm as maximum spanning tree)

179
Q

MST parser

A

MST: minimum spanning tree (same algorithm as maximum spanning tree)

1) build complete directed graph (except ROOT)
- > each spanning tree rooted in ROOT is a possible parse tree
- > gives the space of possible parses
- > first assign a weight to each edge how likely it is to appear in the MST
- how do I score the edges?
- find model that gives weight to edges (whether it is likely or not) based on information that tells us if this edge is relevant
- we use feature function phi (usually binary function) and a weight vector to assign scores to edges

  • usually POS tagging is part of preprocessing for MST parser (then you can use POS tags in the feature function)
180
Q

MST parser: inference and learning

A

inference: is finding the maximum directed spanning tree
- > score each edge based on its features (Chu-Liu Edmonds algorithm)

learning:

  • use a log-linear model: Pr(T) over all possible trees
  • using the gradient of the log-likelihood I can find the optimal theta to maximize the log-likelihood
  • > but computing the gradient is computationally expensive (we have to sum over all possible trees)
  • > the common way to do the optimization / find the most likely theta is not by computing the gradient but by using the averaged perceptron algorithm
181
Q

averaged/structured perceptron algorithm

A
  • sl 08-13
  • > a (supervised) learning algorithm! (for learning theta)
  • > that learned theta is used in Pr(T)
  • > Pr(T) is used to find the (best/most likely) dependency tree for a sentence
  • instead of taking the expected value, take the maximum value/tree (done by inference: you compute T’ (the MST maximum tree) by using inference, T_i is the correct tree from the training data)
  • learning of theta is done as follows: you take the sum over all the edges in the correct tree (gold standard) minus the sum over all the edges in the predicted tree
182
Q

higher-order features (graph-based parsing)

A
  • normally features are based on single edge alone
  • higher-order features are types of features that cannot be encoded on edge level alone
  • MST inference works well when the model factorizes over edges

e.g.
sibling relations:
-> features defined over two edges that have the same parent

grandparent features:
-> features defined over two edges, where one’s child is the parent of the other

valency:
-> the number of children a node has. Defined over all edges sharing a parent

=> formally, the score function factorizes over sets of edges

problem: you cannot run the MST algorithm anymore
- > if you want to use a scoring function over edge pairs or triplets you cannot run exact inference anymore (it becomes intractable), but approximate methods exist for computing inference

183
Q

transition-based parsers

A
  • In transition-based parsing, the problem is decomposed to finding a sequence of transitions
  • we don’t assume any grammar or context in transition based parsing
184
Q

arc-standard parsers

A
  • you have a buffer and a stack and a transition set, you build edges and nodes
  • transition set T:
    SHIFT, LEFT-ARC_l, RIGHT-ARC_l
  • can only parse projective trees
  • > straightforward way to define a transition system
  • bottom-up approach
185
Q

arc-eager transition system

A
  • you have a buffer and a stack and a transition set
  • more incremental
    ~ it is more eager to create arcs
  • transition set T:
    SHIFT, LEFT-ARC_l, RIGHT-ARC_l, REDUCE
  • complexity at most 2n
  • Builds left-dependents bottom-up and right-dependents top-down
  • arc-standard has to find a whole subtree in order to attach it; arc-eager does not
  • we don’t know yet how to choose the next transition
  • arc-eager does top-down and bottom-up parsing
186
Q

projectivity

A
  • intuitive: a tree in which there are no crossing arcs
  • > i.e. all sub-trees are sub-strings
  • projective trees are much easier to handle (crossing edges make parsing much more difficult)
187
Q

transition systems

A
  • Transitions operate on the parser configuration (or state):
    c = (E,B,A)

A transition system is defined as:
S = (C, T , c_s , C_t )

  • transition sequence
  • > we transition from one configuration to next by transition
  • transition: a function from configuration to configuration
188
Q

greedy-transition based parsing

A
  • want to find the most probable sequence of transitions
  • in greedy parsing, instead of modeling this joint probability (the whole sequence i…m which maximizes this product sl 09-21), we make the assumption that its enough to choose the best transition each time given what we have so far (you make that decision and can’t change it - that’s why it’s greedy)
  • a score s(t, c) estimates this probability where
    t= a candidate transition
    c= current configuration
189
Q

inference in greedy-transition based parsing?

A

argmax_{t_1…t_m} PI_{i=1}^{m} P(t_i | c_i-1)

190
Q

oracle for arc-standard/arc-eager

A
  • a rule-based algorithm, does not need training

- it gets as input a correct tree

191
Q

feedforward vs recurrent neural networks

A

t

192
Q

transition classifiers (kinds of score functions)

A
  • linear transition classifier
  • NN transition classifier
  • Bi-RNN encoder
193
Q

error propagation in transition-based parsing

A
  • once a wrong transition is picked, recovery is very difficult since parsing is greedy
194
Q

beam search algorithm

A
  • compromise between greedy and exhaustive search

- k-beam

195
Q

why is parsing hard?

A
  • An exponential number of options per sentence
  • Ambiguity
  • Many many rules to learn (exceptions)
  • Language speakers have no explicit knowledge of the rules of grammar
196
Q

Constituency Parsing and Context-free Grammars (CFGs)

A
  • you get annotated trees (e.g. from a treebank) and you derive the parsing rules from them
197
Q

Probabilistic CFG (PCFG)

A
  • not a deterministic parse
  • think of it as a multinomial (each symbol has 1-n possible derivations)
  • each tree has a probability
  • independence assumption: The problem is that the independence assumptions are very strong: Each production is independent of any other production
    e. g.:
  • All clauses have the same distribution
198
Q

A sequence of rules is the same as a parse tree

A
  • the two representations are equivalent
199
Q

what’s the idea in context free grammar?

A
  • rules are context-free: they ignore context

- each rule is independent of the others

200
Q

Grammar-based vs. Discriminative Approaches to Parsing

A

t

201
Q

model - learning - inference

A

t

202
Q

constituency parsing problem

A

you want to find the tree with the highest score

203
Q

Chomsky Normal Form (CNF)

A
  • A common binarization is the Chomsky normal form, which is a grammar where each production is either of the form A -> B C or A -> w, where A, B and C are non-terminals and w is a terminal
  • > often useful to treat all CFG rules as binary
  • > every CFG grammar can be converted to a CNF, which includes the same sentences
204
Q

Cocke-Kasami-Younger (CKY) Constituency Parsing (algorithm)

A
  • input assumed to be in CNF
  • > a recognizer that will only determine if a sentence is in the language
  • > It is simple to extend it into a parser that also constructs a parse tree
205
Q

Head Lexicalization

A
  • to overcome attachment ambiguity
  • trees are only different in terms of different rules applied
    -> nothing relates the words its only unlexicalized rules
    => a deficiency of the vanilla PCFG
    -> head lexicalization is a standard way to overcome this
    -> assume each constituent has a headword
    -> add headwords to each phrasal node
    -> now you have more info at the rule node because the headword is the most informative word of that subtree/constituent (*) -> I’ll need to retrain my PCFG but it will cause huge sparsity problems: instead of only, say, 15 non-terminals, I now have 15 times the size of the vocabulary
    -> has horrible sparsity
    => what to do now? use eg the Collins parser

bottom line: Lexicalization can resolve ambiguities and lead to huge boost in performance
- the cost is huge sparsity

(*) The head word of a phrase gives a good representation of the phrase’s structure and meaning

206
Q

headword

A

the most informative word/element in the subtree

207
Q

the Collins parser

A
  • keep the head lexicalization
  • to get over sparsity (introduced by head lexicalization), let’s make more (conditional) independence assumptions
  • > each rule is independent of the other
  • > each rule is not just a multinomial, it is a product of several terms
  • > no dependency between different siblings

=> This model still has many parameters, but it reduces each of the terms in the product to depend on 2-3 lexicalized categories, instead of n+m+2 for a regular
lexicalized PCFG of this model (cf. 10-36)

208
Q

learning and inference in constituency parsing

A

tbd

209
Q

Unsupervised Learning

A
  • paradigm of supervised learning: you have labeled data (you know the right answer)
  • clustering: main technique for generalization without having a label set
  • want to cluster words according to their part of speech without having annotated data
  • discriminative models would not make sense: we don’t have labels (relation between labels and observed features not possible to work with)
  • generative model can be useful
210
Q

POS Tagging: a structured prediction task

A

t

211
Q

unsupervised POS induction

A

Given a large corpus of raw text, partition the words into categories with respect to their grammatical function

POS tags are context-dependent -> makes the prediction much more difficult (e.g. the word ‘back’)

-> words will receive one POS tag only but if the most likely is taken the main tag will account for more than 90% of the probability mass (type-level clustering)

212
Q

Supervised vs. (Completely) Unsupervised Learning

A

t

213
Q

Distributional Clustering

A

-> words will receive one POS tag only but if the most likely is taken the main tag will account for more than 90% of the probability mass (type-level clustering)

214
Q

Unsupervised learning for Hidden Markov Models

A
  • adjectives have similar distribution (approx. definition of a POS tag)
  • nouns also

-> POS Induction through PCA over word co-occurrence Matrices

215
Q

type vs token (contexts POS)

A

tokens are instances of type

216
Q

The Prototype Tagger

A
  • another type-level POS induction algorithm

-

217
Q

structured prediction

A

we label the entire sequence

218
Q

1-to-1 accuracy measurement

A

each class is assigned exactly one cluster

219
Q

m-to-1 accuracy measurement

A

a class may be assigned to multiple clusters

220
Q

Markov Chain

A
  • simplest Markov model

- It models the state of a system with a random variable that changes through time

221
Q

conditional likelihood

A

In discriminative models we often talk about the conditional likelihood