Chapter 3 - Linear Models Flashcards

1
Q

what is a decision stump

A

a classifier that uses a single feature

the threshold required to switch the decision from 0 to 1 is the parameter t

2
Q

what is the decision boundary in a decision stump

A

the point at which the decision switches,

the threshold, t

3
Q

what is the learning algorithm for a decision stump

A

for t varied between min(x) and max(x):
    count the classification errors at threshold t
    if errors < minErr: set minErr = errors and keep this t

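
The search above can be sketched as a short Python function (the grid of candidate thresholds and all names are illustrative choices, not from the chapter):

```python
def fit_stump(x, y, steps=100):
    """Try thresholds between min(x) and max(x), keeping the
    one with the fewest misclassifications (minErr)."""
    lo, hi = min(x), max(x)
    best_t, min_err = lo, len(y) + 1
    for i in range(steps + 1):
        t = lo + (hi - lo) * i / steps
        # predict 1 when the single feature exceeds the threshold t
        errors = sum(int(xi > t) != yi for xi, yi in zip(x, y))
        if errors < min_err:
            min_err, best_t = errors, t
    return best_t, min_err

x = [0.1, 0.4, 0.35, 0.8, 0.9, 0.7]
y = [0, 0, 0, 1, 1, 1]
t, err = fit_stump(x, y)   # a separable 1-feature dataset, so err ends at 0
```
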
4
Q

what does linearly separable mean?

A

we can fit a linear model (i.e. draw a linear decision boundary) and perfectly separate the classes

5
Q

what is the limitation of a decision stump?

A

it works only on a single feature

6
Q

what is the discriminant function f(x)=?

A

f(x) = (sum over all features j: wjxj) - t
or in matrix notation:
f(x) = wTx - t

7
Q

what does the discriminant function describe, geometrically?

A

the equation of a plane (a line in two dimensions, a hyperplane in general)

8
Q

what is the gradient and y intercept of the decision boundary from the discriminant function in two dimensions?

A

set f(x) = w1x1 + w2x2 - t = 0 and solve for x2:
x2 = -(w1/w2)x1 + t/w2
so m = -(w1/w2) and c = t/w2

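
A quick numeric check of these formulas (the w1, w2, t values are arbitrary examples): points on the line x2 = m·x1 + c should make the discriminant zero.

```python
w1, w2, t = 2.0, 4.0, 8.0
m = -(w1 / w2)   # gradient of the decision boundary
c = t / w2       # y-intercept
for x1 in [-1.0, 0.0, 3.5]:
    x2 = m * x1 + c
    # every point on the boundary line satisfies w1*x1 + w2*x2 - t = 0
    assert abs(w1 * x1 + w2 * x2 - t) < 1e-9
```
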
9
Q

what is the perceptron decision rule?

A

if f(x) > 0 then yhat=1 else 0

10
Q

what is the perceptron parameter update rule, with sigmoid error?

A

wj = wj - (lrate)(yhat - y)(xj)

11
Q

what is the perceptron learning algorithm?

A

repeat:
    for each training sample:
        update weights: wj = wj - (lrate)(yhat - y)(xj)
        update threshold: t = t + (lrate)(yhat - y)
until the changes to the parameters are zero

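
The algorithm above, as a minimal runnable sketch (zero initial parameters and the AND-style dataset are my own illustrative choices):

```python
def train_perceptron(X, y, lrate=0.1, max_epochs=100):
    """Perceptron with the (yhat - y) update rule.
    X: list of feature lists, y: list of 0/1 labels."""
    n_features = len(X[0])
    w = [0.0] * n_features
    t = 0.0
    for _ in range(max_epochs):
        changed = False
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) - t
            yhat = 1 if z > 0 else 0
            if yhat != yi:
                # wj = wj - lrate*(yhat - y)*xj ; t = t + lrate*(yhat - y)
                for j in range(n_features):
                    w[j] -= lrate * (yhat - yi) * xi[j]
                t += lrate * (yhat - yi)
                changed = True
        if not changed:   # a full pass with no updates: converged
            break
    return w, t

# a linearly separable (AND-like) dataset
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 0, 0, 1]
w, t = train_perceptron(X, y)
```

Because the data is linearly separable, the convergence theorem guarantees this loop terminates with zero training errors.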
12
Q

what is learning rate?

A

the step size of the update

13
Q

what is the limitation of the perceptron algorithm?

A

can only solve linearly separable problems

14
Q

if …. the perceptron algorithm is guaranteed to solve the problem

A

the data is linearly separable

15
Q

what is the perceptron convergence theorem?

A

If a dataset is linearly separable, the perceptron learning algorithm will converge to a perfect classification within a finite number of training steps

16
Q

a logistic regression model has the output f(x) = ?

A

f(x) = 1 / (1 + e^-z)

where z = wTx - t

17
Q

what is the name of the function that logistic regression uses?

A

sigmoid

18
Q

what is the decision rule for logistic regression?

A

if f(x) > 0.5 then yhat = 1 else 0

19
Q

what is loss?

A

the cost incurred by a model for a prediction it makes

20
Q

what loss function does logistic regression use?

A

log loss, or cross-entropy

21
Q

what is the equation for log loss (cross entropy), L(f(x),y) = ?

A

L(f(x),y) = -{ylogf(x) + (1-y)log(1-f(x))}

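
A one-line Python version of this loss makes its behaviour easy to probe (the function name is mine):

```python
import math

def log_loss(fx, y):
    """Cross-entropy for one prediction fx in (0, 1) and label y in {0, 1}."""
    return -(y * math.log(fx) + (1 - y) * math.log(1 - fx))

low = log_loss(0.99, 1)   # confident and correct: small cost
high = log_loss(0.01, 1)  # confident and wrong: large cost
```

Note the asymmetry: the loss grows without bound as f(x) approaches the wrong extreme.
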
22
Q

what is an error function?

A

when the loss function is summed or averaged over all data points

23
Q

what is the error function (summed log loss) for logistic regression E=?

A
  • E = -(sum for each i) {yilog(f(xi)) + (1-yi)log(1-f(xi))}

24
Q

what are the names the error function for logistic regression is known by?

A

cross entropy error

negative log likelihood

25
Q

what is the rule of gradient descent, in words?

A

in order to decrease error, we should update parameters in the direction of the negative gradient

26
Q

what is the partial derivative of the cross entropy error function with the logistic regression model, with respect to parameter wj, dE / dwj = ?

A

dE/dwj = dE/df(x) · df(x)/dz · dz/dwj

= sum over i: (f(xi) - yi)xij
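
One way to trust this result is to compare the analytic gradient sum(f(xi) - yi)xij against a finite-difference estimate of dE/dwj (the toy data and parameter values below are made up):

```python
import math

def f(w, x, t):
    """Logistic regression output, sigmoid(wTx - t)."""
    z = sum(wj * xj for wj, xj in zip(w, x)) - t
    return 1.0 / (1.0 + math.exp(-z))

def cross_entropy(w, t, X, Y):
    return -sum(y * math.log(f(w, x, t)) + (1 - y) * math.log(1 - f(w, x, t))
                for x, y in zip(X, Y))

X, Y = [[0.5, 1.2], [1.5, -0.3]], [0, 1]
w, t = [0.4, -0.2], 0.1
j, eps = 0, 1e-6

# analytic gradient from the chain rule: sum_i (f(xi) - yi) * xij
analytic = sum((f(w, x, t) - y) * x[j] for x, y in zip(X, Y))

# numerical gradient: nudge wj by eps and measure the change in E
w_plus = w[:]
w_plus[j] += eps
numeric = (cross_entropy(w_plus, t, X, Y) - cross_entropy(w, t, X, Y)) / eps
```
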

27
Q

what is the algorithm for gradient descent?

A

repeat:
    for each parameter j:
        wj = wj - lrate · dE/dwj
until termination criteria met

28
Q

what is stochastic gradient descent?

A

compute the gradient for each example one by one and modify the parameters for each

29
Q

why is stochastic gradient descent often applied?

A

it works much more effectively on very large datasets

30
Q

what is the algorithm for logistic regression?

A

t = random
w = random vector
set max epochs
lrate = 0.1

for each epoch:
    for each training example x:
        for each parameter j:
            wj = wj - lrate(f(x)-y)(xj)
        t = t + lrate(f(x)-y)
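
The pseudocode above, as a runnable sketch (zero initialisation in place of random, for reproducibility; the 1-D dataset is illustrative):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(X, Y, lrate=0.1, epochs=1000):
    """SGD for logistic regression: one weight and one threshold
    update per training example."""
    w = [0.0] * len(X[0])
    t = 0.0
    for _ in range(epochs):
        for x, y in zip(X, Y):
            fx = sigmoid(sum(wj * xj for wj, xj in zip(w, x)) - t)
            for j in range(len(w)):
                w[j] -= lrate * (fx - y) * x[j]
            t += lrate * (fx - y)
    return w, t

X = [[0.0], [1.0], [4.0], [5.0]]
Y = [0, 0, 1, 1]
w, t = train_logistic(X, Y)
```
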
31
Q

what loss function does the perceptron use?

A

hinge loss

32
Q

give the equation for hinge loss

A

L = sum over ONLY the misclassified samples: -y(wx + b)
= sum: -y(yhat)

(with y in {-1, +1}; for a misclassified sample y(wx + b) is negative, so each summed term is positive)
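
A direct translation into Python, assuming labels y in {-1, +1} (the function name and data are mine):

```python
def perceptron_loss(w, b, X, Y):
    """Sum -y*(w.x + b) over misclassified samples only."""
    total = 0.0
    for x, y in zip(X, Y):
        score = sum(wj * xj for wj, xj in zip(w, x)) + b
        if y * score <= 0:   # misclassified (or exactly on the boundary)
            total += -y * score
    return total

X = [[1.0, 2.0], [2.0, 1.0], [-1.0, -1.0]]
Y = [1, 1, -1]
# these weights classify every sample correctly, so the loss is zero
loss_good = perceptron_loss([0.5, 0.5], 0.0, X, Y)
# flipping the weights misclassifies everything, giving a positive loss
loss_bad = perceptron_loss([-0.5, -0.5], 0.0, X, Y)
```
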

33
Q

stochastic gradient descent is also known as

A

mini-batch

34
Q

gradient based optimisation is possible when the loss function is

A

differentiable

35
Q

what are the 4 steps of gradient based minimisation?

A
  1. test for convergence
  2. compute search direction
  3. compute step length
  4. update the variables
36
Q

when we perform minibatch SGD, what do we multiply sum: dL/dW by to scale it

A

n / |S|

n samples / batch size

37
Q

modern machine learning has given rise to what kind of programming

A

differentiable programming

38
Q

what is differentiable programming

A

If the performance of a computer program can be represented by a loss function, we could seek to optimise that program via its parameters using a gradient based approach

39
Q

the perceptron algorithm is a … … classification algorithm

A

deterministic

binary

40
Q

what is a generative process

A

describes the way in which data is generated

41
Q

what is the perceptron weight update, with hinge loss?

A

from the gradient of the hinge loss, applied ONLY to the misclassified samples:
wj = wj - (lrate)(-y)(xj)
= wj + (lrate)(y)(xj)

42
Q

we make the iid assumption for logistic regression, this is that

A

our data are independent and identically distributed (iid).

43
Q

the iid assumption means that the outputs …

A

each output depends only on its own input, and does not depend on other inputs or other outputs.

44
Q

the iid assumption we make for logistic regression means

A

we can perform maximum likelihood estimation

i.e. we can work out the best parameters from the data by maximising the likelihood
p(D | w) = product over all samples: p(y | x, w)

45
Q

what is the loss function (negative log-likelihood) for SGD for logistic regression

A
  • E = -(1/n) (sum i=1..n) [yi log f(xi) + (1-yi) log (1-f(xi))]

the same cross-entropy error, with a 1/n factor to rescale based on sample size

46
Q

we can use logistic regression to work out p(y=1 | x, w) =

A

f(x) = 1 / (1 + e^-z)

47
Q

the decision boundary for logistic regression is given by

A

f(x) = d = 1 / (1+e^-z)

solving for z gives wx + b = log(d / (1-d))

at the boundary d = 0.5, so the boundary is wx + b = 0

48
Q

what are the 3 data properties that will cause practical challenges for a logistic regression model

A

imbalanced data - anything using MLE will try to fit the dominant class

multicollinearity - two or more predictor variables are highly linearly related.

completely separated training data

49
Q

what step can we take to minimise the impact of multicollinearity in logistic regression

A

feature selection

50
Q

benefits of logistic regression (5)

A
  • Efficient and straightforward
  • Doesn't require large amounts of computation
  • Easy to implement, easily interpretable
  • Widely used by data analysts and scientists
  • Provides a probability for predictions and observations
51
Q

limitations of logistic regression (2 general, 3 data properties)

A
  • Linear decision boundaries
  • Inability to handle complex inputs (e.g. an image)
  • Multicollinearity (correlated inputs)
  • Sparseness (lots of zero or identical inputs)
  • Complete separation (it is not a probabilistic problem!)
52
Q

limitations of perceptron (4)

A

  • Challenges with high-dimensional, multiple correlated input features
  • Linear decision boundaries only
  • Convergence can be tricky depending on the variant of perceptron used
  • Deterministic

53
Q

which algorithm: perceptron or logistic regression, doesn't converge

A

logistic regression

54
Q

why does logistic regression never converge

A

we can never reach the true decision boundary: we are trying to fit an s-shaped sigmoid to a sharp, step-like boundary. The weights grow without bound (e.g. w1 → infinity), so the fit only ever gets closer.

55
Q

what property of logarithm means we can take the log of the likelihood

A

logarithm is a monotonically increasing function.

It doesn't affect where our max/min is

56
Q

what is a monotonically increasing function, what does it mean?

A

if the value on the x-axis increases, the value on the y-axis also increases