Equations Flashcards
conditional probability, p(a|b) =
p(a,b) / p(b)
bayes, p(a|b) =
p(b|a)p(a) / p(b)
independent events, p(a,b) =
p(a)p(b)
total probability/marginalisation, p(X=x) =
sumy: p(x|y)p(y)
conditional independence assumption, p(x1,...,xn|y) =
product over i: p(xi|y)
discriminant function, f(x) =
sum j: wjxj - t
perceptron update rule, sigmoid error wj =
wj - (lrate)(f(x) - y)(xj)
sigmoid/logistic regression, f(x) =
1 / (1 + e^-z), where z = wx + b
log loss/cross entropy loss, L(f(x),y) =
-{ylogf(x) + (1-y)log(1-f(x))}
summed log loss/ cross entropy error/ negative log likelihood, E =
- sum i: {yi log f(xi) + (1-yi) log(1-f(xi))}
partial derivative of cross entropy error, dE/dw =
sum i: (f(xi) - yi)(xi)
partial derivative of sigmoid, dy/dz =
y(1-y)
partial derivative of cross entropy error, dE/df(x) =
-[y(1/f(x)) - (1-y)(1/(1-f(x)))]
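A minimal numpy sketch tying together the sigmoid, the per-sample log loss, and the summed gradient dE/dw = sum i: (f(xi) - yi)(xi) from the cards above; the names X, y, w and the toy data are illustrative assumptions, not from the deck.

```python
import numpy as np

def sigmoid(z):
    # f(x) = 1 / (1 + e^-z)
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(p, y):
    # per-sample cross entropy: -{y log f(x) + (1 - y) log(1 - f(x))}
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def cross_entropy_grad(X, y, w):
    # dE/dw = sum i: (f(xi) - yi) xi
    p = sigmoid(X @ w)
    return X.T @ (p - y)

# toy data: 3 samples, 2 features
X = np.array([[1.0, 2.0], [0.5, -1.0], [2.0, 0.3]])
y = np.array([1.0, 0.0, 1.0])
w = np.zeros(2)
print(log_loss(sigmoid(X @ w), y).sum())   # summed log loss E
print(cross_entropy_grad(X, y, w))         # gradient dE/dw
```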
specificity =
TN / (FP+TN)
precision = positive predictive value =
TP / (TP + FP)
recall = sensitivity = tp rate =
TP / P
fp rate =
FP / N
f1 measure =
2 / ((1/precision) + (1/recall))
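A small numpy sketch computing the confusion-matrix metrics above for binary labels in {0, 1}; the function name and toy arrays are illustrative assumptions, and it assumes neither class is empty (no division by zero).

```python
import numpy as np

def confusion_metrics(y_true, y_pred):
    # counts from a binary confusion matrix
    TP = np.sum((y_pred == 1) & (y_true == 1))
    FP = np.sum((y_pred == 1) & (y_true == 0))
    TN = np.sum((y_pred == 0) & (y_true == 0))
    FN = np.sum((y_pred == 0) & (y_true == 1))
    P, N = TP + FN, FP + TN
    recall = TP / P                    # sensitivity / tp rate
    precision = TP / (TP + FP)         # positive predictive value
    specificity = TN / (FP + TN)
    fp_rate = FP / N
    f1 = 2 / (1 / precision + 1 / recall)
    return dict(recall=recall, precision=precision,
                specificity=specificity, fp_rate=fp_rate, f1=f1)

print(confusion_metrics(np.array([1, 0, 1, 1, 0]), np.array([1, 0, 0, 1, 1])))
```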
pearsons correlation coefficient =
sum:(x - xbar)(y - ybar) / sqrt( (sum:(x - xbar)^2)(sum:(y - ybar)^2) )
information gain/ mutual information =
I(X;Y) = H(Y) - H(Y|X)
euclidean distance =
sqrt(sum:(x1-x2)^2)
hamming distance =
sum: delta(xi not equal xj)
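A quick numpy sketch of both distance measures on the cards above; the array names and toy vectors are illustrative, and hamming assumes equal-length vectors.

```python
import numpy as np

def euclidean(x1, x2):
    # sqrt(sum (x1 - x2)^2)
    return np.sqrt(np.sum((x1 - x2) ** 2))

def hamming(xi, xj):
    # sum of delta(xi != xj), i.e. the number of mismatching positions
    return int(np.sum(xi != xj))

print(euclidean(np.array([0.0, 3.0]), np.array([4.0, 0.0])))    # 5.0
print(hamming(np.array([1, 0, 1, 1]), np.array([1, 1, 1, 0])))  # 2
```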
neuron, y(x,w) =
f(wx + b)
softmax =
e^zi / sumk: e^zk
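A small numpy sketch of the softmax; subtracting max(z) is a standard numerical-stability trick that does not change the result. The names are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    # e^zi / sum k: e^zk, shifted by max(z) for numerical stability
    e = np.exp(z - np.max(z))
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))  # entries are positive and sum to 1
```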
gradient descent, wnew =
wold - (lrate)(dL/dw)
mean squared error loss, MSE =
1/n sum: (y-t)^2
neuron gradient, with sigmoid activation and squared loss, dL/dw =
dL/dy dy/dz dz/dw = (y - t)(y)(1 - y)(x)
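A minimal sketch of one gradient-descent step for a single sigmoid neuron under squared loss, matching the chain rule on this card; the function names, learning rate, and toy data are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gd_step(w, b, x, t, lrate=0.1):
    # forward pass through one neuron: y = f(wx + b)
    y = sigmoid(np.dot(w, x) + b)
    # chain rule: dL/dw = dL/dy * dy/dz * dz/dw = (y - t) * y(1 - y) * x
    grad_w = (y - t) * y * (1 - y) * x
    grad_b = (y - t) * y * (1 - y)
    # gradient descent: wnew = wold - lrate * dL/dw
    return w - lrate * grad_w, b - lrate * grad_b

w, b = np.zeros(2), 0.0
for _ in range(100):
    w, b = gd_step(w, b, np.array([1.0, 2.0]), t=1.0)
print(w, b)   # the output sigmoid(wx + b) moves towards the target t = 1
```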
entropy, H(X) =
- sum: p(x)logp(x)
- sum: p(x)logp(x)
entropy
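A small numpy sketch of entropy and information gain I(X;Y) = H(Y) - H(Y|X), computed from a joint distribution table p(x, y); the table, function names, and choice of log base 2 are illustrative assumptions.

```python
import numpy as np

def entropy(p):
    # H(X) = -sum p(x) log p(x); zero-probability entries contribute 0
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def information_gain(joint):
    # I(X;Y) = H(Y) - H(Y|X) for a joint distribution table p(x, y)
    p_x = joint.sum(axis=1)
    p_y = joint.sum(axis=0)
    # H(Y|X) = sum x: p(x) H(Y | X = x)
    h_y_given_x = sum(p_x[i] * entropy(joint[i] / p_x[i])
                      for i in range(len(p_x)) if p_x[i] > 0)
    return entropy(p_y) - h_y_given_x

joint = np.array([[0.25, 0.25],
                  [0.0,  0.5]])   # toy joint distribution p(x, y)
print(information_gain(joint))
```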
L = 0.5(y-t)^2
squared error loss
e^zi / sumk: e^zk
softmax
information gain =
I(X;Y) = H(Y) - H(Y|X)
mutual information =
I(X;Y) = H(Y) - H(Y|X)
recall =
TP / P
sensitivity =
TP / P
tp rate =
TP / P
precision =
TP / (TP + FP)
positive predictive value =
TP / (TP + FP)
- sum i: {yi log f(xi) + (1-yi) log(1-f(xi))}
summed log loss/ cross entropy error/ negative log likelihood
summed log loss =
- sum i: {yi log f(xi) + (1-yi) log(1-f(xi))}
cross entropy error =
- sum i: {yi log f(xi) + (1-yi) log(1-f(xi))}
negative log likelihood =
- sum i: {yi log f(xi) + (1-yi) log(1-f(xi))}
log loss, L(f(x),y) =
-{ylogf(x) + (1-y)log(1-f(x))}
cross entropy loss, L(f(x),y) =
-{ylogf(x) + (1-y)log(1-f(x))}
sigmoid, f(x) =
1 / (1+e^-z)
logistic regression, f(x) =
1 / (1+e^-z)
bias update for logistic regression, t =
t + lrate(f(x) - y)
bias update for perceptron, t =
t + lrate(yhat - y)
what is P(A or B) if
a) they are disjoint
b) they are not disjoint
a) P(A) + P(B)
b) P(A) + P(B) - P(A and B)
give the bernoulli distribution
P(X = 0) = 1 - p, P(X = 1) = p
give the binomial distribution
P(X = k) = (nCk)(p^k)(1-p)^(n-k)
give the geometric distribution
P(X=x) = (1-p)^(x-1) (p)
give the poisson distribution
P(X=x) = { lambda^x e^(-lambda) } / x!
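A minimal Python sketch of the four pmfs above using only the standard library; the function names and example arguments are illustrative assumptions.

```python
from math import comb, exp, factorial

def bernoulli(k, p):
    # P(X = 0) = 1 - p, P(X = 1) = p
    return p if k == 1 else 1 - p

def binomial(k, n, p):
    # P(X = k) = (nCk)(p^k)(1-p)^(n-k)
    return comb(n, k) * p**k * (1 - p)**(n - k)

def geometric(x, p):
    # P(X = x) = (1-p)^(x-1) p, for x = 1, 2, ...
    return (1 - p)**(x - 1) * p

def poisson(x, lam):
    # P(X = x) = lambda^x e^(-lambda) / x!
    return lam**x * exp(-lam) / factorial(x)

print(binomial(2, 5, 0.3), geometric(3, 0.5), poisson(1, 2.0))
```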
if a discrete r.v. X has a pmf f(X) what is the expected value E[g(x)]
sum i: g(Xi)f(Xi)
if a discrete r.v. X has a pmf f(X) what is the variance V[g(x)]
E[(g(X) - E(g(X)))^2]
E[g(X)^2] - E[g(X)]^2
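A quick sketch of E[g(X)] and V[g(X)] for a discrete pmf given as value/probability lists; the names and the toy pmf are assumptions.

```python
def expectation(values, pmf, g=lambda x: x):
    # E[g(X)] = sum i: g(xi) f(xi)
    return sum(g(x) * p for x, p in zip(values, pmf))

def variance(values, pmf, g=lambda x: x):
    # V[g(X)] = E[g(X)^2] - E[g(X)]^2
    return expectation(values, pmf, lambda x: g(x) ** 2) - expectation(values, pmf, g) ** 2

values = [0, 1, 2]
pmf = [0.25, 0.5, 0.25]          # a valid pmf: probabilities sum to 1
print(expectation(values, pmf))  # 1.0
print(variance(values, pmf))     # 0.5
```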
properties of Expectations
E[aX + b] =
aE[X] + b
properties of variance:
V[aX+b] =
a^2V[X]
give the equation for hinge loss
sum: -y(wx + b)
= sum: -y(yhat)
summed over ONLY the misclassified samples, i.e. those with y(yhat) < 0
when we perform minibatch sgd, what do we multiply sum: dL/dw by to scale it?
n / |S|
n samples / batch size
what is the perceptron weight update, with hinge loss?
wj = wj - (lrate)(-y)(xj), applied ONLY to the misclassified samples (y(yhat) < 0)
= wj + (lrate)(y)(xj)
correctly classified samples contribute no update
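A minimal numpy sketch of this update as one training pass: only misclassified samples move the weights and the threshold, in line with the perceptron bias/threshold card above. Labels are assumed to be in {-1, +1}, the data and epoch count are illustrative, and y(yhat) <= 0 is used so the all-zero starting weights still trigger an update.

```python
import numpy as np

def perceptron_epoch(w, t, X, y, lrate=1.0):
    # one pass over the data with the perceptron criterion:
    # only misclassified samples (y * yhat <= 0) update the weights
    for xi, yi in zip(X, y):
        yhat = np.dot(w, xi) - t            # discriminant f(x) = wx - t
        if yi * yhat <= 0:                  # misclassified
            w = w + lrate * yi * xi         # wj = wj + (lrate)(y)(xj)
            t = t - lrate * yi              # threshold moves opposite to the bias
    return w, t

X = np.array([[2.0, 1.0], [-1.0, -1.5], [1.5, 2.0], [-2.0, -0.5]])
y = np.array([1, -1, 1, -1])                # labels in {-1, +1}
w, t = np.zeros(2), 0.0
for _ in range(10):
    w, t = perceptron_epoch(w, t, X, y)
print(w, t)
```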
what is the loss function (negative log-likelihood) for SGD for logistic regression
- 1/n sumi->n:[yi log f(xi) + (1-yi) log (1-f(xi))]
the same as the summed cross entropy error, but with a 1/n factor to rescale by the sample size
the decision boundary for logistic regression is given by
d = 1 / (1+e^-z)
wx + b = log(d / (1 - d)); at d = 0.5 this gives the boundary wx + b = 0
give the equation for zero mean, unit variance normalisation
(x - x_mean) / sigma
give the equation for restrict range normalisation
(x - x_min) / (x_max - x_min)
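A small numpy sketch of both normalisations applied to a single feature vector; the toy data is an assumption.

```python
import numpy as np

def zero_mean_unit_variance(x):
    # (x - x_mean) / sigma
    return (x - x.mean()) / x.std()

def restrict_range(x):
    # (x - x_min) / (x_max - x_min), maps the feature into [0, 1]
    return (x - x.min()) / (x.max() - x.min())

x = np.array([2.0, 4.0, 6.0, 8.0])
print(zero_mean_unit_variance(x))
print(restrict_range(x))
```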
give the equation for fisher score, F=
(mean1 - mean2)^2 / (v1 + v2)
give a kernel for horizontal lines
1 1 1
0 0 0
-1 -1 -1
give a kernel for vertical lines
1 0 -1
1 0 -1
1 0 -1
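A minimal numpy sketch applying both flashcard kernels with a hand-written valid cross-correlation, so a horizontal-edge image responds only to the horizontal kernel; the toy image and helper name are illustrative assumptions.

```python
import numpy as np

# the two flashcard kernels
horizontal = np.array([[ 1,  1,  1],
                       [ 0,  0,  0],
                       [-1, -1, -1]])
vertical = horizontal.T            # 1 0 -1 in every row

def correlate2d(image, kernel):
    # valid cross-correlation: slide the 3x3 kernel over the image
    kh, kw = kernel.shape
    out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# toy image: bright top half, dark bottom half -> a horizontal edge
image = np.vstack([np.ones((3, 5)), np.zeros((3, 5))])
print(correlate2d(image, horizontal))  # strong response around the edge rows
print(correlate2d(image, vertical))    # zero response everywhere
```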
give the distribution update scheme for adaboost, i.e. what do we multiply Dj(i) by
1 / (2ej) if the classification was incorrect
1 / (2(1 - ej)) if the classification was correct
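A small numpy sketch of this reweighting for one boosting round: with ej set to the weak learner's weighted error, the updated distribution still sums to 1 and the misclassified samples end up with half of the total weight. The toy distribution and names are assumptions.

```python
import numpy as np

def adaboost_reweight(D, correct, e):
    # multiply Dj(i) by 1/(2(1 - ej)) if correct, by 1/(2ej) if incorrect
    return np.where(correct, D / (2 * (1 - e)), D / (2 * e))

D = np.full(4, 0.25)                     # current distribution Dj
correct = np.array([True, True, True, False])
e = float(np.sum(D[~correct]))           # weighted error of the weak learner
D_next = adaboost_reweight(D, correct, e)
print(D_next, D_next.sum())              # the misclassified sample now has weight 0.5
```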
if we know that A is conditionally independent of B given C, then P(A|B,C) = ?
P(A|C)
if A is conditionally independent of B given C, then P(A|B,C) = P(A|C), prove it
P(A,B|C) = P(A|C)P(B|C), conditional independence
P(A,B,C)/P(C) = [P(A,C)/P(C)] [P(B,C)/P(C)], by the definition of conditional probability
P(A,B,C) = P(A,C)P(B,C) / P(C), times by P(C)
P(A,B,C)/P(B,C) = P(A,C)/P(C), divide by P(B,C)
P(A|B,C) = P(A|C)
if A and B are conditionally independent given C then we know?
P(A,B|C) = P(A|C)P(B|C)
d e^x / dx = ?
e^x
d ln x/ dx = ?
1 / x
product rule, d(uv)/dx =
u dv/dx + v du/dx
d log f(x) / dx =
f'(x) / f(x)