Chapter 3 - Linear Models Flashcards
what is a decision stump
a classifier on a single feature
its parameter is the threshold t at which the decision switches from 0 to 1
what is the decision boundary in a decision stump
the point at which the decision switches,
the threshold, t
what is the learning algorithm for a decision stump
for t varied between min(x) and max(x):
    count the classification errors at threshold t
    if errors < minErr: set minErr = errors and keep this t
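the stump-learning loop above can be sketched in Python (toy data, function name, and the midpoint choice of candidate thresholds are illustrative assumptions, not from the cards):

```python
# Minimal decision-stump training sketch: try candidate thresholds t,
# keep the one with the fewest misclassifications under yhat = 1 if x > t.

def fit_stump(x, y):
    best_t, min_err = None, float("inf")
    # Candidate thresholds: midpoints between consecutive sorted values.
    xs = sorted(x)
    candidates = [(a + b) / 2 for a, b in zip(xs, xs[1:])]
    for t in candidates:
        errors = sum(1 for xi, yi in zip(x, y) if (1 if xi > t else 0) != yi)
        if errors < min_err:
            min_err, best_t = errors, t
    return best_t, min_err

x = [1.0, 2.0, 3.0, 4.0]
y = [0, 0, 1, 1]
t, err = fit_stump(x, y)   # any t between 2 and 3 separates this data perfectly
```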
what does linearly separable mean?
we can fit a linear model (i.e. draw a linear decision boundary) and perfectly separate the classes
what is the limitation of a decision stump?
it works only on a single feature
what is the discriminant function f(x)=?
(sum for all features: wjxj) - t
or in matrix notation:
wTx - t
what does the discriminant function describe, geometrically?
the equation of a plane
what is the gradient and y intercept of the decision boundary from the discriminant function in two dimensions?
set equal to zero
m = -(w1/w2)
c = t/w2
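a quick numeric check of this rearrangement (the values of w1, w2, t are made up for illustration): any point on the line x2 = m*x1 + c should satisfy the boundary equation w1*x1 + w2*x2 - t = 0.

```python
# Check the 2-D decision-boundary rearrangement numerically
# (w1, w2, t are illustrative values, not from the cards).
w1, w2, t = 2.0, 4.0, 8.0
m = -(w1 / w2)       # gradient  = -w1/w2
c = t / w2           # intercept = t/w2

x1 = 3.0
x2 = m * x1 + c              # a point on the decision boundary
f = w1 * x1 + w2 * x2 - t    # should be zero on the boundary
```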
what is the perceptron decision rule?
if f(x) > 0 then yhat=1 else 0
what is the perceptron parameter update rule, with sigmoid error?
wj = wj - (lrate)(yhat - y)(xj)
what is the perceptron learning algorithm?
for each training sample:
    update each weight: wj = wj - (lrate)(yhat - y)(xj)
    update the threshold: t = t + lrate(yhat - y)
repeat until no parameters change over a full pass
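the algorithm above, as a runnable sketch (the learning rate, epoch cap, and toy AND-style dataset are assumptions for illustration):

```python
# Perceptron training sketch using the update rule from the card.

def step(z):
    return 1 if z > 0 else 0

def train_perceptron(X, y, lrate=0.1, max_epochs=100):
    w = [0.0] * len(X[0])
    t = 0.0
    for _ in range(max_epochs):
        changed = False
        for xi, yi in zip(X, y):
            yhat = step(sum(wj * xj for wj, xj in zip(w, xi)) - t)
            if yhat != yi:
                # wj = wj - lrate*(yhat - y)*xj ;  t = t + lrate*(yhat - y)
                w = [wj - lrate * (yhat - yi) * xj for wj, xj in zip(w, xi)]
                t = t + lrate * (yhat - yi)
                changed = True
        if not changed:          # converged: no updates in a full pass
            break
    return w, t

# Linearly separable AND-style data, so convergence is guaranteed.
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 0, 0, 1]
w, t = train_perceptron(X, y)
preds = [step(sum(wj * xj for wj, xj in zip(w, xi)) - t) for xi in X]
```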
what is learning rate?
the step size of the update
what is the limitation of the perceptron algorithm?
can only solve linearly separable problems
if …. the perceptron algorithm is guaranteed to solve the problem
the data is linearly separable
what is the perceptron convergence theorem?
If a dataset is linearly separable, the perceptron learning algorithm will converge to a perfect classification within a finite number of training steps
a logistic regression model has the output f(x) = ?
1 / (1 + e^-z)
where z = wTx - t
what is the name of the function that logistic regression uses?
sigmoid
what is the decision rule for logistic regression?
if f(x) > 0.5 then yhat=1 else 0
what is loss?
the cost incurred by a model for a prediction it makes
what loss function does logistic regression use?
log loss, or cross-entropy
what is the equation for log loss (cross entropy), L(f(x),y) = ?
L(f(x),y) = -{ylogf(x) + (1-y)log(1-f(x))}
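the formula above behaves as the card suggests: confident correct predictions incur a small loss, confident wrong ones a large loss. A small sketch (the probability values 0.9 and 0.1 are illustrative):

```python
import math

# Log loss (cross-entropy) for one prediction:
# L(f(x), y) = -(y*log f(x) + (1-y)*log(1 - f(x)))
def log_loss(fx, y):
    return -(y * math.log(fx) + (1 - y) * math.log(1 - fx))

low = log_loss(0.9, 1)    # confident and correct: small loss
high = log_loss(0.1, 1)   # confident and wrong: large loss
```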
what is an error function?
when the loss function is summed or averaged over all data points
what is the error function (summed log loss) for logistic regression E=?
- (sum for each i) {yilog(f(xi)) + (1-yi)log(1-f(xi))}
what are the names the error function for logistic regression is known by?
cross entropy error
negative log likelihood
what is the rule of gradient descent, in words?
in order to decrease error, we should update parameters in the direction of the negative gradient
what is the partial derivative of the cross entropy error function with the logistic regression model, with respect to parameter wj, dE / dwj = ?
dE/df(x) x df(x)/dz x dz/dwj
= sum of i: (f(xi) - yi)xij
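the analytic gradient above can be sanity-checked against a finite-difference estimate of the error function (toy data and weights below are made up for the check):

```python
import math

# Compare sum_i (f(xi) - yi) * xij with a central-difference estimate
# of dE/dwj for the cross-entropy error.

def f(x, w, t):                       # logistic model
    z = sum(wj * xj for wj, xj in zip(w, x)) - t
    return 1 / (1 + math.exp(-z))

def error(X, y, w, t):                # summed log loss
    return -sum(yi * math.log(f(xi, w, t))
                + (1 - yi) * math.log(1 - f(xi, w, t))
                for xi, yi in zip(X, y))

X = [[1.0, 2.0], [2.0, 0.5]]
y = [1, 0]
w, t = [0.3, -0.2], 0.1

j = 0
analytic = sum((f(xi, w, t) - yi) * xi[j] for xi, yi in zip(X, y))

eps = 1e-6
numeric = (error(X, y, [w[0] + eps, w[1]], t)
           - error(X, y, [w[0] - eps, w[1]], t)) / (2 * eps)
```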
what is the algorithm for gradient descent?
repeat:
for each parameter j do
wj = wj - lrate x dE/dwj
until termination criteria met
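the generic loop above, applied to a toy 1-D error function with a known minimum (the function E(w) = (w - 3)^2, learning rate, and iteration cap are assumptions for illustration):

```python
# Gradient descent on E(w) = (w - 3)^2, whose gradient is dE/dw = 2(w - 3)
# and whose minimum sits at w = 3.

def dE_dw(w):
    return 2 * (w - 3)

w = 0.0
lrate = 0.1
for _ in range(100):              # termination criterion: fixed iteration cap
    w = w - lrate * dE_dw(w)      # step in the direction of the negative gradient
```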
what is stochastic gradient descent?
compute the gradient for each example one by one and modify the parameters for each
why is stochastic gradient descent often applied?
it is more computationally efficient on very large datasets, where computing the full-batch gradient for every update is expensive
what is the algorithm for logistic regression?
t = random
w = random vector
set max epochs
lrate = 0.1
for each epoch:
    for each training example x:
        for each parameter j:
            wj = wj - lrate(f(x)-y)(xj)
        t = t + lrate(f(x)-y)
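the full logistic-regression algorithm can be sketched end to end (the 1-D toy dataset, epoch count, and seed are illustrative assumptions):

```python
import math
import random

# SGD training of logistic regression, following the card's updates:
# wj = wj - lrate*(f(x) - y)*xj  and  t = t + lrate*(f(x) - y).

def f(x, w, t):
    z = sum(wj * xj for wj, xj in zip(w, x)) - t
    return 1 / (1 + math.exp(-z))

random.seed(0)
X = [[0.0], [1.0], [2.0], [3.0]]
y = [0, 0, 1, 1]

w = [random.random()]          # random initial weight
t = random.random()            # random initial threshold
lrate, max_epochs = 0.1, 2000

for _ in range(max_epochs):
    for xi, yi in zip(X, y):
        err = f(xi, w, t) - yi
        w = [wj - lrate * err * xj for wj, xj in zip(w, xi)]
        t = t + lrate * err

preds = [1 if f(xi, w, t) > 0.5 else 0 for xi in X]
```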
what loss function does the perceptron use?
hinge loss
give the equation for hinge loss
sum over misclassified samples: -y(wx + b)
= sum: -y(yhat)
i.e. sum -y(wx + b) over ONLY the misclassified samples (those where y(wx + b) is negative, so each term contributes a positive loss)
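the loss above, as a small function (labels in {-1, +1} and the toy weights/data are illustrative assumptions):

```python
# Perceptron criterion from the card: sum -y*(w.x + b) over ONLY the
# misclassified samples, with labels y in {-1, +1}.

def perceptron_loss(X, y, w, b):
    total = 0.0
    for xi, yi in zip(X, y):
        score = sum(wj * xj for wj, xj in zip(w, xi)) + b
        if yi * score <= 0:          # misclassified (or on the boundary)
            total += -yi * score     # positive contribution to the loss
    return total

X = [[1.0, 0.0], [0.0, 1.0]]
y = [1, -1]
w, b = [0.5, 0.5], 0.0
loss = perceptron_loss(X, y, w, b)   # only the second sample contributes
```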
stochastic gradient descent is also known as
mini-batch gradient descent (strictly, SGD updates on one sample and mini-batch on a small subset, but the terms are often used interchangeably)
gradient based optimisation is possible when the loss function is
differentiable
what are the 4 steps of gradient based minimisation?
- test for convergence
- compute search direction
- compute step length
- update the variables
when we perform mini-batch SGD, what do we multiply sum:dL/dW by to scale it?
n / |S|
(number of samples n divided by batch size |S|)
modern machine learning has given rise to what kind of programming
differentiable programming
what is differentiable programming
If the performance of a computer program can be represented by a loss function, we could seek to optimise that program via its parameters using a gradient based approach
the perceptron algorithm is a … … classification algorithm
deterministic
binary
what is a generative process
describes the way in which data is generated
what is the perceptron weight update, with hinge loss?
wj = wj - (lrate)( - yhat x y)(xj)
or if just for the misclassified
wj = wj - (lrate)( -y)(xj)
= wj + (lrate)(y)(xj)
we make the iid assumption for logistic regression, this is that
our data are independent and identically distributed (iid).
the iid assumption means that the outputs …
The outputs do not depend on multiple inputs nor on other outputs.
the iid assumption we make for logistic regression means
we can perform maximum likelihood estimation
i.e. we can work out the best parameters from the data by maximising the likelihood
p(D | w) = product over i: p(yi | xi, w)
what is the loss function (negative log-likelihood) for SGD for logistic regression
- (1/n) sum for i=1..n: [yi log f(xi) + (1-yi) log(1-f(xi))]
the same summed log loss, rescaled by 1/n for the sample size
we can use logistic regression to work out p(y=1 | x, w) =
f(x) = 1 / (1 + e^-z)
the decision boundary for logistic regression is given by
the point where d = f(x) = 0.5
inverting d = 1 / (1+e^-z) gives the logit: wx + b = log(d / (1-d))
so at d = 0.5 the boundary is wx + b = 0
what are the 3 data properties that will cause practical challenges for a logistic regression model
imbalanced data - anything using MLE will try to fit the dominant class
multicollinearity - two or more predictor variables are highly linearly related.
completely separated training data
what step can we take to minimise the impact of multicollinearity in logistic regression
feature selection
benefits of logistic regression (5)
- Efficient and straightforward,
- Doesn’t require large computation,
- Easy to implement, easily interpretable
- Used widely by data analysts and scientists.
- Provides a probability for predictions and observations.
limitations of logistic regression (2 general, 3 data properties)
- Linear decision boundaries
- Inability to handle complex inputs (e.g. an image)
- Multicollinearity (correlated inputs)
- Sparseness (lots of zero or identical inputs)
- Complete separation (it is not a probabilistic problem!)
limitations of perceptron (4)
Challenges with high dimensional multiple correlated input features
linear
Convergence can be tricky depending on the variant of perceptron used
deterministic
which algorithm: perceptron or logistic regression, doesn't converge
logistic regression
why does logistic regression never converge
on completely separated data we can never reach the true decision boundary: we are trying to fit an S-shaped sigmoid to a sharp, step-like boundary. The weights grow without bound (e.g. w1 -> infinity) as the sigmoid approaches a step function, so that limit is the closest we will get.
what property of logarithm means we can take the log of the likelihood
logarithm is a monotonically increasing function.
it doesn't affect where our max/min is
what is a monotonically increasing function, what does it mean?
if the value on the x-axis increases, the value on the y-axis also increases