Lecture 8 - (Stochastic) Gradient Descent, Regularization, Artificial Neural Networks, Perceptron Flashcards

1
Q

What is Gradient Descent?

A

Gradient Descent is an iterative optimization algorithm capable of finding optimal solutions to a wide range of problems. It tweaks the parameters step by step in order to minimize a cost function.

2
Q

What is the complexity of Gradient Descent?

A

O(ndt) to run for t iterations (n = number of training examples, d = number of features)

3
Q

Explain the process of Gradient Descent step-by-step.

A
  1. Take the derivative of the Loss Function for each parameter in it
  2. Pick random values for the parameters
  3. Plug the parameter values into the derivatives
  4. Calculate the Step Size = Slope x Learning Rate
  5. Calculate New Parameters = Old Parameters - Step Size
  6. Then, go back to step 3 and repeat until step size is very small, or you reach the maximum number of steps
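
The six steps above can be sketched for a one-feature linear model fitted by minimizing the sum of squared residuals. This is only an illustration: the three data points, the fixed starting values (used instead of random ones so the run is reproducible), the learning rate and the tolerance are all assumptions, not part of the lecture.

```python
# Gradient descent for y = intercept + slope * x, minimizing the sum of
# squared residuals. Data and settings are made-up illustration values.
xs = [0.5, 2.3, 2.9]
ys = [1.4, 1.9, 3.2]

def gradients(intercept, slope):
    """Step 1: derivatives of the loss with respect to each parameter."""
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    d_intercept = sum(-2 * r for r in residuals)
    d_slope = sum(-2 * x * r for x, r in zip(xs, residuals))
    return d_intercept, d_slope

def gradient_descent(learning_rate=0.01, max_steps=10_000, tolerance=1e-9):
    intercept, slope = 0.0, 1.0                    # Step 2: starting guess
    for _ in range(max_steps):
        d_i, d_s = gradients(intercept, slope)     # Step 3: plug values in
        step_i = d_i * learning_rate               # Step 4: slope x learning rate
        step_s = d_s * learning_rate
        intercept -= step_i                        # Step 5: new = old - step size
        slope -= step_s
        if max(abs(step_i), abs(step_s)) < tolerance:
            break                                  # Step 6: steps are tiny, stop
    return intercept, slope

intercept, slope = gradient_descent()
print(round(intercept, 3), round(slope, 3))
```

With these settings the result should agree with the ordinary least squares fit of the same three points.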
4
Q

How does Gradient Descent Calculate step size?

A

Slope x Learning Rate

5
Q

How does Gradient Descent know where to stop descending on the curve to find the optimal value?

A

When the step size is very close to 0

6
Q

True or False. The step size is proportional to the Loss Function slope (in Gradient Descent)

A

True. Because Step Size = Slope x Learning Rate, the steps get gradually smaller as the slope flattens out near the minimum.

7
Q

What kind of parameter is the learning rate in Gradient Descent?

A

Hyperparameter

8
Q

What are the disadvantages of Gradient Descent?

A

The key practical problems are:

  1. converging to a local minimum can be quite slow
  2. if there are multiple local minima, there is no guarantee that the procedure will find the global minimum (Note: gradient descent also works with other error definitions, whose error surfaces may not have a single global minimum. With the sum of squares error for a linear model this is not a problem, because that error surface has only one minimum.)

In other words, there is a chance it will not reach the global minimum, either because of a plateau or because of a local minimum.

9
Q

What is the difference between Gradient Descent and (Stochastic) Gradient Descent?

A

Compared to regular Gradient Descent, which uses all of the data for every step, Stochastic Gradient Descent randomly picks one sample for each step and uses only that sample to calculate the derivatives

10
Q

True or False. In practice, as in theory, Stochastic Gradient Descent only works with ONE sample taken for each step.

A

False.
In practice, it is common to select a small subset of data (a mini-batch) for each step → it takes the best of both worlds between using one sample and using the whole data → it is faster than using all of the data, and yields more stable parameters than using only one sample
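
A minimal sketch of the mini-batch idea, assuming a one-feature linear model and made-up noiseless data on the line y = 2x + 1. The batch size, learning rate, step count and seed are illustrative assumptions; setting batch_size=1 would give the one-sample version.

```python
import random

# Made-up noiseless data on the line y = 2x + 1.
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [2 * x + 1 for x in xs]

def minibatch_sgd(batch_size=2, learning_rate=0.02, steps=5000, seed=0):
    rng = random.Random(seed)          # seeded so the run is reproducible
    intercept, slope = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(steps):
        # Each step looks only at a small random subset of the data.
        batch = rng.sample(indices, batch_size)
        residuals = [ys[i] - (intercept + slope * xs[i]) for i in batch]
        d_intercept = sum(-2 * r for r in residuals) / batch_size
        d_slope = sum(-2 * xs[i] * r for i, r in zip(batch, residuals)) / batch_size
        intercept -= learning_rate * d_intercept
        slope -= learning_rate * d_slope
    return intercept, slope

intercept, slope = minibatch_sgd()
print(round(intercept, 3), round(slope, 3))
```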

11
Q

Is Stochastic Gradient Descent going to yield better outcomes than the Gradient Descent?

A

Probably not: each step uses only a sample of the data, so the derivatives are noisier than those computed from the full dataset. But it is good enough when the other option is very time-consuming and computationally heavy

12
Q

What is the purpose of regularization?

A

To add a penalty to the complexity of a model in order to avoid overfitting.

13
Q

What are the two ways (we studied) to perform regularization?

A

L2 - Ridge Regression

L1 - Lasso Regression

14
Q

The Ridge Regression line (the equation it minimizes) is equal to …

A

the sum of the squared residuals (if linear regression) + lambda x slope^2

15
Q

Essentially, what does Ridge Regression do?

A

Ridge Regression can improve predictions on new data by making the predictions less sensitive to the training data, especially when sample sizes are relatively small

16
Q

What is lambda in Ridge Regression?

A

Lambda determines how harsh the Ridge Regression penalty is (it controls the “strength” of the regularization).

It can take any value from 0 to infinity.

  • when LAMBDA = 0, the Ridge Regression penalty is also 0 → Ridge Regression only minimizes the sum of squared residuals, so the Ridge Regression line is the same as the Least Squares line
  • when LAMBDA = 1 → the slope is smaller than the slope of the Least Squares line
  • … and the slope keeps getting smaller as LAMBDA gets larger, approaching 0 without reaching it
  • So, the larger we make LAMBDA, the less sensitive the prediction for y becomes to X
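
The effect of lambda on the slope can be checked numerically. The sketch below assumes a single feature and an unpenalized intercept, in which case minimizing the sum of squared residuals + lambda x slope^2 has the closed form slope = Sxy / (Sxx + lambda). The data points are made up.

```python
def ridge_slope(xs, ys, lam):
    """Slope minimizing sum of squared residuals + lam * slope^2 (one feature)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / (sxx + lam)

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

# lam = 0 reproduces the Least Squares slope; larger lam shrinks it toward 0.
for lam in [0, 1, 5, 100]:
    print(lam, round(ridge_slope(xs, ys, lam), 3))
```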
17
Q

How do you decide which lambda to use in a Ridge Regression?

A

We try a range of values for LAMBDA and use cross-validation (typically 10-fold) to determine which one gives the best predictions on held-out data (the lowest variance)

18
Q

Which regression models does Ridge Regression work with?

A

Ridge Regression also works when a discrete X is used to predict something continuous, and with Logistic Regression. Because Logistic Regression is solved using Maximum Likelihood, Ridge Regression there minimizes the negative log-likelihood + lambda x slope^2 instead of the sum of squared residuals. It also works with even more complicated models

19
Q

In simple words, what do low bias and high variance mean?

A

Low bias means that the regression line fits training data well.

High variance means that the regression line fits testing data poorly.

20
Q

What is the difference between Lasso (L1) and Ridge (L2) regression?

A

They have the same goal, but the difference stands in the equation that they try to minimize.

If we take the equation that Ridge Regression minimizes and plug in the absolute value of the slope instead of the squared slope, we get the equation that Lasso Regression minimizes

21
Q

The Lasso Regression line (the equation it minimizes) is equal to …

A

the sum of squared residuals (for linear regression) + lambda x |slope|

22
Q

Ridge regression can only shrink the slope asymptotically, but Lasso regression can shrink the slope all the way to 0. What implication does this have?

A

Therefore, Lasso Regression can exclude useless variables from equations → it is a little better than Ridge Regression at reducing the variance in models that contain a lot of useless variables

In contrast, Ridge Regression tends to do a little better when most variables are useful
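
A small numeric comparison of the two penalties, assuming a single feature and an unpenalized intercept. Under that assumption both problems have closed forms: Ridge gives slope = Sxy / (Sxx + lambda), and Lasso gives a soft-thresholded slope that becomes exactly 0 once lambda / 2 reaches |Sxy|. The data points are made up.

```python
import math

def centered_sums(xs, ys):
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy, sxx

def ridge_slope(xs, ys, lam):
    """Minimizes sum of squared residuals + lam * slope^2."""
    sxy, sxx = centered_sums(xs, ys)
    return sxy / (sxx + lam)

def lasso_slope(xs, ys, lam):
    """Minimizes sum of squared residuals + lam * |slope| (soft-thresholding)."""
    sxy, sxx = centered_sums(xs, ys)
    shrunk = max(abs(sxy) - lam / 2, 0.0)
    return math.copysign(shrunk, sxy) / sxx

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

# Ridge only approaches 0; Lasso hits exactly 0 for a large enough penalty.
for lam in [0, 10, 20, 40]:
    print(lam, round(ridge_slope(xs, ys, lam), 3), lasso_slope(xs, ys, lam))
```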

23
Q

True or False. There is a trade-off when one does regularization. It increases the training error (bias), but it decreases the approximation error (variance)

A

True

24
Q

True or False. Regularization should only be applied to the cost function (loss function).

A

True

25
Q

True or false. Regularization drives weights further away from the origin by adding a regularization term to the objective function.

A

False. It drives weights closer to the origin

26
Q

True or False. L1 - Regularization is a fast alternative to the “Search and Score” Feature Selection approach.

A

True. Lasso Regression automatically performs feature selection and outputs a sparse model. (because it can drive the weights of the regression parameter to 0, which means that it completely drops them)

27
Q

True or false. In L1-regularization, solution is not unique, while in L2-regularization, solution is unique

A

True. The L1 solution is not necessarily unique, while the L2 solution is unique for any lambda > 0.

28
Q

What is Early Stopping?

A

Early stopping is a form of regularization used to avoid overfitting when training a learner with an iterative method, such as gradient descent.

Process

  1. monitor the validation error as we run stochastic gradient descent
  2. if the validation error starts to increase, the model has started to overfit the data
  3. stop training at that point, i.e. as soon as the validation error has reached its minimum
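
The stopping rule can be sketched as a function over the sequence of validation errors, one per epoch. The function name and the `patience` parameter (how many epochs without improvement to tolerate, since the validation error of stochastic gradient descent is noisy) are hypothetical additions for illustration. In real training the errors would be computed one epoch at a time rather than passed in as a list.

```python
def early_stopping_epoch(validation_errors, patience=1):
    """Return the index of the epoch whose parameters should be kept."""
    best_epoch, best_error = 0, float("inf")
    for epoch, error in enumerate(validation_errors):
        if error < best_error:
            # New minimum of the validation error: remember this epoch.
            best_epoch, best_error = epoch, error
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs: stop training here.
            break
    return best_epoch

# Made-up validation errors: they fall, then start rising at epoch 4.
errors = [0.90, 0.60, 0.40, 0.35, 0.37, 0.41, 0.30]
print(early_stopping_epoch(errors, patience=2))
```

Note that training stops before the last (lower) value is ever seen, which is the price of stopping early.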
29
Q

Briefly define “overfitting” and “underfitting”.

A

Overfitting means that the model performs well on training data, but it does not generalize well

Underfitting means that the model is too simple to learn the underlying structure of the data

30
Q

A way to prevent neural networks from overfitting is called …

A

Dropout (a regularization technique for neural networks)

31
Q

Define a loss function/cost function/objective function, and give some examples

A

It is a real-valued function that provides a measure of how wrong the model is in terms of its ability to estimate a relationship between inputs and outputs

e.g., mean squared error, cross-entropy, (classification) likelihood, which is maximized rather than minimized

32
Q

In general, what does an optimization algorithm do?

A

It minimizes or maximizes a function f(x) by altering x

e.g., Gradient Descent reduces f(x) by moving x in small steps in the direction opposite to the sign of the derivative

33
Q

Briefly outline the components of a neural network.

A

x represents the input → the input is passed to the latent features (hidden units), which have weights w → the outputs of the latent features are combined using the parameters v → this gives the output

34
Q

There are two very important hyperparameters in a Neural Network. Which ones?

A

the learning rate and the activation function

35
Q

In Neural Networks, the layers of Nodes between the Input and Output Nodes are called …

A

Hidden Layers

36
Q

How does a Neural Network create a squiggle (the line it fits in the end)?

A

A Neural Network starts with identical Activation Functions but, using different Weights and Biases on the connections, it flips and stretches the Activation Functions into new shapes, which are then added together to fit a squiggle that is shifted to fit the data

37
Q

Define learning rate.

A

A learning rate is a tuning parameter in an optimization algorithm that determines the step size at each iteration while moving towards a minimum of a loss function.

38
Q

What are the risks of having a learning rate too high? What about having it too low?

IN GRADIENT DESCENT

A

A learning rate that is too high will make the updates jump over the minimum point
A learning rate that is too low will either take too long to converge or get stuck in an undesired local minimum
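
Both risks show up even on the simplest possible loss. The sketch below assumes f(x) = x^2 (slope 2x, minimum at x = 0) and three illustrative learning rates; none of these numbers come from the lecture.

```python
def descend(learning_rate, start=1.0, steps=50):
    """Run gradient descent on f(x) = x**2 and return the final x."""
    x = start
    for _ in range(steps):
        slope = 2 * x                  # derivative of x**2
        x = x - learning_rate * slope
    return x

print(descend(1.1))     # too high: each step overshoots 0 and |x| grows
print(descend(0.1))     # reasonable: x ends very close to the minimum
print(descend(0.001))   # too low: after 50 steps x has barely moved
```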

39
Q

How do you set a good learning rate in Neural Networks?

A

A good learning rate can be found by training the model for a few hundred iterations, starting with a very low learning rate and gradually increasing it at each iteration (multiplying it by a constant factor). A good choice is a rate somewhat below the point where the loss starts to climb

40
Q

Name some types of activation functions.

A

Sigmoid
ReLU
Softmax
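
For reference, the three functions written out in plain Python. This is a sketch using only the standard library; subtracting the maximum inside softmax is a common numerical-stability choice and does not change the result.

```python
import math

def sigmoid(z):
    """Squashes any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    """Passes positive values through and replaces negative values with 0."""
    return max(0.0, z)

def softmax(scores):
    """Turns a list of scores into positive values that sum to 1."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

print(sigmoid(0.0), relu(-2.0), relu(3.0))
print(softmax([1.0, 2.0, 3.0]))
```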

41
Q

True or false. All hidden layers use the same activation function, while the output layer uses a different activation function

A

True

42
Q

What is a threshold logic unit (TLU)?

A

A TLU is an object that inputs an array of weighted quantities, sums them, and if this sum meets or surpasses some threshold, outputs a quantity.

A TLU can classify. Imagine a TLU that has two inputs, whose weights equal 1, and whose theta equals 1.5. When this TLU inputs <0,0>, <0,1>, <1,0>, and <1,1>, it outputs 0, 0, 0, and 1 respectively.

In a neural network, the Threshold Logic Unit (TLU) Algorithm develops a weight matrix and a threshold matrix that describes lines that separate the various class inputs.
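
The two-input example above can be written out directly. The function name `tlu` is just an illustrative choice.

```python
def tlu(inputs, weights, theta):
    """Output 1 if the weighted sum meets or surpasses the threshold, else 0."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total >= theta else 0

# Two inputs, both weights equal to 1, theta equal to 1.5.
outputs = [tlu(pair, weights=[1, 1], theta=1.5)
           for pair in [(0, 0), (0, 1), (1, 0), (1, 1)]]
print(outputs)  # [0, 0, 0, 1]
```

With these weights and threshold the unit behaves like a logical AND.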

43
Q

What is a perceptron?

A

A Perceptron is an algorithm used for supervised learning of binary classifiers. A binary classifier decides whether an input, usually represented by a vector of numbers, belongs to a specific class. In short, a perceptron is a single-layer neural network.

44
Q

How is a perceptron based on TLU?

A

A single TLU can be used for simple linear binary classification. A perceptron is composed of a single layer of TLUs. Each TLU is connected to a number of inputs and computes a linear combination of them. If the result exceeds a threshold, it outputs the positive class

45
Q

Describe the process of a perceptron.

A
  1. A perceptron is fed one training instance at a time, and for each instance, it makes its predictions
  2. For every output neuron that produces a wrong prediction, it reinforces the connection weights from its inputs that would have contributed to the correct prediction
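
A minimal sketch of these two steps for a single output neuron, using the usual perceptron update weight += learning_rate x (target - prediction) x input. The AND data set, the zero starting weights, the learning rate of 1 and the number of epochs are illustrative assumptions.

```python
def predict(weights, bias, x):
    """Hard-threshold prediction of a single output neuron."""
    total = sum(w * x_i for w, x_i in zip(weights, x)) + bias
    return 1 if total >= 0 else 0

def train_perceptron(samples, epochs=20, learning_rate=1):
    weights = [0] * len(samples[0][0])
    bias = 0
    for _ in range(epochs):
        for x, target in samples:          # one training instance at a time
            error = target - predict(weights, bias, x)
            if error != 0:
                # Wrong prediction: move the weights of the active inputs
                # toward the answer that would have been correct.
                weights = [w + learning_rate * error * x_i
                           for w, x_i in zip(weights, x)]
                bias += learning_rate * error
    return weights, bias

AND_SAMPLES = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(AND_SAMPLES)
print([predict(weights, bias, x) for x, _ in AND_SAMPLES])
```

AND is linearly separable, so the updates stop once every instance is classified correctly.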
46
Q

True or False. Perceptrons do not output a class probability. They only make predictions based on a hard threshold

A

True

47
Q

What is a multilayer perceptron?

A

An MLP consists of at least three layers of nodes: an input layer, a hidden layer, and an output layer. Except for the input nodes, each node is a neuron that uses a nonlinear activation function. MLP utilizes a supervised learning technique called backpropagation for training. Its multiple layers and non-linear activation distinguish MLP from a linear perceptron. It can distinguish data that is not linearly separable.

48
Q

What is a feed-forward neural network?

A

A feedforward neural network is an artificial neural network wherein connections between the nodes do not form a cycle. As such, it is different from its descendant: recurrent neural networks.

49
Q

Define backpropagation

A

Backpropagation is an algorithm for supervised learning of artificial neural networks using gradient descent. Given an artificial neural network and an error function, the method calculates the gradient of the error function with respect to the neural network’s weights.

50
Q

Describe the algorithm of backpropagation.

A
  1. For each training instance: backpropagation algorithm first makes a prediction (forward pass) and then measures error
  2. Then goes through each layer in reverse to measure error contribution from each connection (backward pass)
  3. Finally, it tweaks connection weights to reduce the error (Gradient Descent Step)
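
The three steps can be sketched for a tiny network with two inputs, two sigmoid hidden units and one linear output, trained on a single made-up instance. The network shape, the squared-error loss, the starting weights and the learning rate are all assumptions chosen to keep the example short. The names W and v follow the weights/parameters naming used in the component outline earlier in these cards.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, W, v):
    """Hidden activations h = sigmoid(W x), then output = v . h."""
    h = [sigmoid(sum(w * x_j for w, x_j in zip(row, x))) for row in W]
    return h, sum(v_i * h_i for v_i, h_i in zip(v, h))

def backprop_step(x, target, W, v, learning_rate=0.1):
    # 1. Forward pass: make a prediction and measure the error.
    h, output = forward(x, W, v)
    error = output - target        # derivative of 0.5 * error**2 w.r.t. output
    # 2. Backward pass: error contribution of each connection, last layer first.
    grad_v = [error * h_i for h_i in h]
    delta_h = [error * v_i * h_i * (1 - h_i) for v_i, h_i in zip(v, h)]
    grad_W = [[d * x_j for x_j in x] for d in delta_h]
    # 3. Gradient descent step: tweak the weights to reduce the error.
    new_v = [v_i - learning_rate * g for v_i, g in zip(v, grad_v)]
    new_W = [[w - learning_rate * g for w, g in zip(row, g_row)]
             for row, g_row in zip(W, grad_W)]
    return new_W, new_v, 0.5 * error ** 2   # loss measured before the update

W = [[0.1, -0.2], [0.3, 0.4]]
v = [0.5, -0.5]
losses = []
for _ in range(200):
    W, v, loss = backprop_step([1.0, 2.0], 1.0, W, v)
    losses.append(loss)
print(losses[0], losses[-1])
```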
51
Q

What is the cost (complexity) of a forward pass in backpropagation? what about a backward pass?

A

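If you have m layers and all have k elements, then the cost of a forward pass is
O(dk + mk^2): dk to multiply the d inputs into the first layer of k units, plus k^2 for each of the (roughly m) layer-to-layer multiplications

The cost of a forward and a backward pass is the same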

52
Q

What are the downsides of Perceptron?

A
  1. Perceptron is incapable of learning complex patterns
    - Because the decision boundary of each output neuron is linear (just like Logistic Regression classifiers)
  2. Weaknesses:
    - Incapable of solving the Exclusive OR (XOR) classification problem
    - Solution: XOR can be solved by stacking multiple Perceptrons (the resulting ANN is called a Multilayer Perceptron (MLP))
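
The stacking idea can be illustrated with hand-picked weights rather than learned ones. In this sketch, which is only one of many possible choices, one hidden threshold unit computes OR, another computes AND, and the output unit fires when OR is on but AND is off.

```python
def tlu(inputs, weights, theta):
    """Threshold unit: 1 if the weighted sum reaches theta, else 0."""
    return 1 if sum(w * x for w, x in zip(weights, inputs)) >= theta else 0

def xor_mlp(x1, x2):
    # Hidden layer: two threshold units looking at the same inputs.
    h_or = tlu((x1, x2), weights=[1, 1], theta=0.5)
    h_and = tlu((x1, x2), weights=[1, 1], theta=1.5)
    # Output layer: "OR but not AND" is exactly XOR.
    return tlu((h_or, h_and), weights=[1, -1], theta=0.5)

print([xor_mlp(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]])  # [0, 1, 1, 0]
```

No single threshold unit can produce this output, because the two classes cannot be separated by one straight line.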
53
Q

True or False: Perceptron outputs a class probability, like Logistic Regression.

A

FALSE: Perceptrons do not output a class probability. Perceptrons make predictions based on a hard threshold