02 - Linear Classifiers Flashcards

1
Q

What is one-hot encoding?

A
  • create a k-dimensional vector per desired output y, with a 1 at the position of the target class and 0s everywhere else
  • it is a way of representing categorical variables in models that require numerical input
  • it is better than label encoding, where each label is assigned a single numerical value, since that can introduce bias by giving some labels higher numbers than others (see the sketch below)
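A minimal sketch of one-hot encoding with NumPy; the function name, labels, and class count are made-up examples:

```python
import numpy as np

def one_hot(labels, num_classes):
    """Turn integer class labels into an (m, k) matrix of 0s and 1s."""
    encoded = np.zeros((len(labels), num_classes))
    encoded[np.arange(len(labels)), labels] = 1
    return encoded

# e.g. three samples with classes 0, 2, 1 and k = 3 classes
print(one_hot([0, 2, 1], 3))
# [[1. 0. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]]
```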
2
Q

What is gradient descent optimization?

A

Optimization → minimizing or maximizing some objective function f(x) (also called the criterion) by altering x. The value that minimizes the function is often denoted with a superscript star: $x^* = \arg\min_x f(x)$

In this course we often want to minimize the cost, loss, or error function.

Remember the derivative f'(x) gives the slope of f(x) at the point x:

"it specifies how to scale a small change in the input to obtain the corresponding change in the output: $f(x+\epsilon) \approx f(x) + \epsilon f'(x)$."

So the derivative tells us how to change x to decrease f(x) slightly. We know that $f(x - \epsilon \operatorname{sign}(f'(x)))$ is less than f(x) for a small enough $\epsilon$. We can therefore reduce f(x) by moving x in small steps with the opposite sign of the derivative. → Gradient Descent
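A minimal sketch of this update rule for a scalar function; the example function, starting point, and learning rate are made-up choices:

```python
def gradient_descent(f_prime, x0, epsilon=0.1, steps=100):
    """Repeatedly step against the derivative: x <- x - epsilon * f'(x)."""
    x = x0
    for _ in range(steps):
        x = x - epsilon * f_prime(x)
    return x

# Example: f(x) = (x - 3)^2, so f'(x) = 2 * (x - 3); minimum at x* = 3
x_star = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
print(x_star)  # ~3.0
```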

3
Q

What are critical points, the partial derivative, and the directional derivative?

A

Critical Points (or stationary points)
- local/global min & max, saddle points
- At critical points the gradient is zero.

Partial Derivative:
- For functions with multiple inputs, we use partial derivatives $\partial f(x)/\partial x_i$, each of which measures how f changes as only the one variable x_i changes at the point x.
- The gradient $\nabla_x f(x)$ is then the vector of all partial derivatives, with element i being the partial derivative of f with respect to x_i.

Directional Derivative:
- The slope of the function f in the direction of a unit vector u: $u^T \nabla_x f(x)$
- With the directional derivative we can find the direction in which f decreases the fastest.
- The directional derivative is minimized when u points in the direction of the negative gradient, so we decrease f fastest by moving along $-\nabla_x f(x)$.

New points are proposed by $x' = x - \epsilon \nabla_x f(x)$, where $\epsilon$ is the learning rate.
There are different approaches to choosing the learning rate: one is simply choosing a small constant, another is solving for the learning rate that makes the directional derivative vanish (a line search). See the sketch below.
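A minimal numerical sketch of these ideas with NumPy; the quadratic example function, starting point, and learning rate are made up:

```python
import numpy as np

# Example function f(x) = x_1^2 + 2 * x_2^2 with gradient [2*x_1, 4*x_2]
def grad(x):
    return np.array([2 * x[0], 4 * x[1]])

x = np.array([1.0, 1.0])
g = grad(x)

# Directional derivative in direction u (unit vector) is u^T grad f(x);
# it is most negative for u = -g / ||g||, the negative gradient direction.
u = -g / np.linalg.norm(g)
print(u @ g)              # most negative slope available at x

# One gradient-descent step with learning rate epsilon
epsilon = 0.1
x_new = x - epsilon * g
print(x_new)              # [0.8, 0.6], closer to the minimum at the origin
```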

4
Q

What is Maximum Likelihood Estimation?

A

Main goal: find the optimal way to fit a distribution to the data.
We choose different models p_model(x;θ) (different thetas) to fit a dataset drawn from p_data(x), calculate the likelihood of observing each datapoint under the model, and multiply these together into a total likelihood. The model (the θ) with the maximum likelihood "wins".

The MLE for θ is defined as $\theta_{ML} = \arg\max_\theta \prod_{i=1}^m p_{model}(x^{(i)}; \theta)$

A product over many probabilities can be inconvenient, e.g. prone to numerical underflow.
Solution: taking the logarithm of the likelihood gives an equivalent problem: $\theta_{ML} = \arg\max_\theta \sum_{i=1}^m \log p_{model}(x^{(i)}; \theta)$
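A minimal sketch of MLE, assuming a Gaussian model with unit variance so that only the mean θ is estimated; the data values and grid of candidate thetas are made up:

```python
import numpy as np

data = np.array([1.2, 0.8, 1.5, 1.1, 0.9])   # made-up samples

# Log-likelihood of the data under a Gaussian model with mean theta, unit variance
def log_likelihood(theta):
    return np.sum(-0.5 * (data - theta) ** 2 - 0.5 * np.log(2 * np.pi))

# Evaluate a grid of candidate thetas and pick the maximizer
thetas = np.linspace(-2, 3, 501)
theta_ml = thetas[np.argmax([log_likelihood(t) for t in thetas])]
print(theta_ml)   # ~= data.mean(), the closed-form MLE for the Gaussian mean
```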

5
Q

Linear Least Squares

A
  1. we choose a line/model to predict the scalar response y from an input vector x
    **$\hat{y} = f(x) = w_0 + w_1 x_1 + w_2 x_2 = w^T x$**
    We learn how to predict by finding the best values for w
  2. For each possible predictor, we calculate a loss. The predictor with the minimal loss is chosen.
    The loss L is the sum of squares of the residuals (the residuals are calculated by subtracting the estimated y values from the actual y values)

**$L(y, \hat{y}) = \frac{1}{2}\sum_{i=1}^m (y_i - \hat{y}_i)^2$**
→ now differentiate with respect to w, set to zero, and solve. The result is the normal-equation solution:
$w = (X^T X)^{-1} X^T y$

In classification the output is not a continuous value but a discrete label, which we one-hot encode.
Each class is then approximated by its own regression model:
$p(C_k|x) \approx f_k(x) = w_k^T x$
The loss and the weights are found the same way, but now W is a matrix with one model per column (remember each class has its own model). See the sketch below.
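A minimal sketch of linear least squares with NumPy on made-up data, using `np.linalg.lstsq` (which solves the same minimization as the normal equations, in a numerically stabler way):

```python
import numpy as np

# Made-up data: y is roughly 1 + 2*x1 - x2 plus a little noise
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])  # bias column + 2 features
y = X @ np.array([1.0, 2.0, -1.0]) + 0.1 * rng.normal(size=50)

# Solve min_w ||Xw - y||^2 (equivalent to w = (X^T X)^{-1} X^T y)
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w)        # close to [1, 2, -1]

y_hat = X @ w   # predictions
```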

6
Q

Draw and explain a perceptron.

A

On top of the linear transform (as in linear least squares) we now apply a non-linear activation function
- f(x) = g(w^T x), w = [w_0, w_1, w_2, …]^T, x = [1, x_1, x_2, …]
- Non-linear because our output is no longer continuous but discrete

The perceptron uses a step function: 1 if a >= 0, otherwise -1 (with a = w^T x)
**$\hat{y} = \operatorname{sign}(w^T x)$**
In this context the loss is called the perceptron criterion, summed over the set $\mathcal{M}$ of misclassified examples:
**$L(w) = -\sum_{i \in \mathcal{M}} w^T x_i y_i$**
we differentiate w.r.t. the weights to see how the loss changes when the weights change; for a single misclassified example the gradient of the loss is:
$\nabla_w L = -x_i y_i$

The loss function gives a scalar, while the gradient of the loss is a vector of partial derivatives.
We use it to update the weights. This learning scheme is Stochastic Gradient Descent:
- Stochastic: select the training examples one by one in random order
- Gradient descent: use the negative of the gradient to update the weights

w ← w - ∇_w L
w ← w + x_i y_i
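A minimal sketch of this perceptron update loop with NumPy; the toy data, epoch count, and implicit learning rate of 1 are made-up example choices:

```python
import numpy as np

def perceptron_train(X, y, epochs=20, seed=0):
    """X: (m, d) inputs with a leading bias column of 1s; y: labels in {-1, +1}."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(y)):      # stochastic: random order
            if np.sign(w @ X[i]) != y[i]:      # only misclassified examples update w
                w = w + X[i] * y[i]            # w <- w + x_i y_i
    return w

# Made-up linearly separable toy data
X = np.array([[1, 2.0, 1.0], [1, 1.0, 3.0], [1, -1.5, -1.0], [1, -2.0, 0.5]])
y = np.array([1, 1, -1, -1])
w = perceptron_train(X, y)
print(np.sign(X @ w))   # should match y on this separable toy set
```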

7
Q

Linear Regression

A
  • Predicting the value of a dependent variable based on the value of one or more independent variables.
  • This is done by fitting a linear equation to observed data. The goal is to find the line that minimizes the difference between the predicted and actual values of the dependent variable
  • Typical methods to discover the best-fit line for a set of paired data:
    • Least Squares: we find the line that minimizes the sum of the squares of the residuals
    • Maximum Likelihood Estimation
8
Q

Logistic regression

A
  • similar to linear regression, but it predicts a categorical output (true or false; red, green or blue) instead of something continuous (like size)
  • Instead of fitting a line to the data, an S-shaped logistic function is fitted (the logistic sigmoid, a smooth step function), which goes from 0 to 1
  • It tells us the probability that the data belongs to class 0 or class 1, depending on the input value
  • like linear regression, logistic regression can have multiple input variables, which can be both continuous (e.g. size) and discrete (e.g. gender)
  • we use maximum likelihood instead of least squares to fit the model
9
Q

Logistic regression prediction model

A

**prediction: $\hat{y} = \sigma(w^T x)$**
- Probability for class 1 given x:
$p(C_1|x) = \sigma(w^T x)$, with $\sigma(a) = \frac{1}{1+e^{-a}}$

  • Probability for class 2 given x:
    $p(C_2|x) = 1 - p(C_1|x)$
    This is the definition of a Bernoulli trial (binomial trial)

With MLE the likelihood function for all outputs then becomes (plugging the Bernoulli trial in as the model in the MLE equation):
$L(w) = \prod_i \sigma(w^T x_i)^{y_i} (1 - \sigma(w^T x_i))^{1-y_i}$

As mentioned earlier, it is easier to use the logarithm in MLE, so we get the log-likelihood:

**$\log L(w) = \sum_i [\, y_i \log \sigma(a_i) + (1-y_i) \log(1-\sigma(a_i)) \,]$ where $a_i = w^T x_i$**

We now want the gradient because, as with least squares, we need it to update the weights. The full derivation is long, but it ends up as:
$\nabla \log L(w) = \sum_i (y_i - \sigma(w^T x_i))\, x_i$
So now that we have the log-likelihood and its gradient, we can do logistic regression.
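A minimal sketch of logistic regression trained by gradient ascent on this log-likelihood; the toy data, learning rate, and step count are made-up choices:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fit_logistic(X, y, epsilon=0.1, steps=1000):
    """X: (m, d) with a leading bias column of 1s; y: labels in {0, 1}."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = X.T @ (y - sigmoid(X @ w))   # sum_i (y_i - sigma(w^T x_i)) x_i
        w = w + epsilon * grad              # ascend the log-likelihood
    return w

# Made-up 1-D toy data: class 1 tends to have larger x
X = np.array([[1, -2.0], [1, -1.0], [1, -0.5], [1, 0.5], [1, 1.0], [1, 2.0]])
y = np.array([0, 0, 0, 1, 1, 1])
w = fit_logistic(X, y)
print(sigmoid(X @ w) > 0.5)   # predicted classes for the training points
```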

10
Q

Go from binary to multiclass logistic regression

A
  1. k classes instead of 2
  2. k weight vectors
  3. for each class we model the density by the softmax function

So, because we now have multiple outputs as well, we have to multiply over all inputs and all outputs when calculating our likelihood, and sum over all inputs and outputs when taking the log:
$\log L(w_1, …, w_K) = \sum_{i=1}^m \sum_{k=1}^K y_{i,k} \log \hat{y}_{i,k}$
→ we again use one-hot encoding for the targets y

and get a simple gradient by differentiating with respect to each $w_k$:

$\nabla_{w_k} \log L(w_1, …, w_K) = \sum_{i=1}^m (y_{i,k} - \hat{y}_{i,k})\, x_i$

So in the end we just have a bunch of matrix operations:
$\hat{Y} = \operatorname{softmax}(W^T X)$ and $\nabla_W L = X(Y - \hat{Y})^T$
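A minimal sketch of these matrix operations, following the convention above that examples are columns of X and Y; the toy data, learning rate, and step count are made up:

```python
import numpy as np

def softmax(A):
    """Column-wise softmax of a (K, m) matrix of activations."""
    A = A - A.max(axis=0, keepdims=True)        # shift for numerical stability
    E = np.exp(A)
    return E / E.sum(axis=0, keepdims=True)

def fit_multiclass(X, Y, epsilon=0.1, steps=2000):
    """X: (d, m) inputs as columns (with a bias row of 1s); Y: (K, m) one-hot targets."""
    W = np.zeros((X.shape[0], Y.shape[0]))
    for _ in range(steps):
        Y_hat = softmax(W.T @ X)                # (K, m) predicted class probabilities
        W = W + epsilon * (X @ (Y - Y_hat).T)   # gradient ascent on the log-likelihood
    return W

# Made-up toy data: 3 classes separated along one feature
X = np.array([[1, 1, 1, 1, 1, 1],
              [-2.0, -1.5, 0.0, 0.2, 1.8, 2.2]])
Y = np.eye(3)[:, [0, 0, 1, 1, 2, 2]]            # one-hot targets, one column per example
W = fit_multiclass(X, Y)
print(np.argmax(softmax(W.T @ X), axis=0))      # should recover [0 0 1 1 2 2]
```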

11
Q

What is the softmax function?

A

softmax function: $p(C_k|x) = \frac{e^{a_k}}{\sum_{i=1}^{K} e^{a_i}}$, where $a_k = w_k^T x$
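A minimal sketch of the softmax function in NumPy; subtracting the maximum is a standard stability trick (not part of the formula above) that avoids overflow without changing the result:

```python
import numpy as np

def softmax(a):
    """Map a vector of activations a_k = w_k^T x to probabilities p(C_k|x)."""
    a = a - np.max(a)          # stability shift; cancels out in the ratio
    e = np.exp(a)
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))   # ~[0.66, 0.24, 0.10], sums to 1
```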

12
Q

MLE vs LLS

A

Maximum Likelihood Estimation (MLE) and Linear Least Squares are both methods used to estimate parameters in statistical models.

MLE is a general method for estimating the parameters of a statistical model. It aims to find the parameter values that maximize the likelihood function, which represents the probability of the observed data given the model.

Linear Least Squares is a specific method used to estimate the parameters in linear regression models. It minimizes the sum of squared differences between the observed and predicted values.

13
Q

L1 Loss and L2 Loss

A

There are many different types of loss functions that can be used in DNNs, depending on the specific task and the type of data.

The Mean Absolute Error (MAE): also called L1 Loss, computes the average of the absolute differences between actual values and predicted values.

Mean Squared Error (MSE): also called L2 Loss. This is a commonly used loss function for regression tasks. It measures the average squared difference between the predicted and true output.
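A minimal sketch of both losses in NumPy, with made-up targets and predictions:

```python
import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])   # made-up targets
y_pred = np.array([2.5,  0.0, 2.0, 8.0])   # made-up predictions

mae = np.mean(np.abs(y_true - y_pred))      # L1 loss (MAE)
mse = np.mean((y_true - y_pred) ** 2)       # L2 loss (MSE)
print(mae, mse)                             # 0.5 0.375
```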
