Chapter 4 - Training models Flashcards

1
Q

What is the linear regression model formula?

A
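The linear regression model makes a prediction by computing a weighted sum of the input features plus a constant bias term:

y^ = θ0 + θ1x1 + θ2x2 + ... + θnxn = θTx

where y^ is the predicted value, n is the number of features, xi is the ith feature value, and θj is the jth model parameter (θ0 is the bias term and θ1 to θn are the feature weights).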
2
Q

What is the most common performance measure for a regression model, and what is its formula?

A

The most common performance measure for a regression model is the mean squared error (MSE). Note that the MSE is a convex function, which means that if you pick any two points on the curve, the line segment joining them never crosses the curve. Its formula is:

MSE(θ) = (1/m) Σi=1m (θTx(i) - y(i))²

3
Q

What is the closed-form solution (i.e., a mathematical equation that gives the result directly) of the linear regression model? This equation is also called the normal equation.

A

θ^ = (XTX)-1XTy

In this equation, θ^ (theta hat) represents the value of θ (a column vector) that minimizes the cost function (most often the MSE), X is the matrix containing all the feature values of the training set, and y is the column vector of target values containing y(1) to y(m).

Note: the normal equation may not work if the matrix XTX is not invertible, for example if m < n (fewer training instances than features) or if some features are redundant.
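A minimal NumPy sketch of the normal equation, assuming a small synthetic dataset (the data-generating values below are illustrative, not from the card):

```python
import numpy as np

np.random.seed(42)
m = 100
X = 2 * np.random.rand(m, 1)              # one feature
y = 4 + 3 * X + np.random.randn(m, 1)     # linear data plus Gaussian noise

X_b = np.c_[np.ones((m, 1)), X]           # add x0 = 1 (bias term) to each instance
theta_hat = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y   # (X^T X)^-1 X^T y
print(theta_hat)                          # close to [[4.], [3.]]
```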

4
Q

What is the computational complexity of finding the parameters, using the normal equation, for a linear regression model?

A

Computing the matrix inverse in the normal equation is typically about O(n^2.4) to O(n^3) in the number of features n, depending on the implementation. The Scikit-Learn LinearRegression class (which uses an SVD approach) is about O(n²): if you double the number of features, you multiply the computation time by roughly 4. Note, however, that making predictions with a trained model is only O(n); training the model is the computationally hard part.

5
Q

What is the general idea of Gradient Descent?

A

Gradient Descent is a generic optimization algorithm capable of finding optimal solutions to a wide range of problems. It tweaks the parameters iteratively in order to minimize a cost function.

The algorithm measures the local gradient of the error function with regard to the parameter vector θ, and it goes in the direction of descending gradient (the maximum downward slope). Once the gradient is zero, you have reached a minimum!

An important parameter of gradient descent is the size of the steps, determined by the learning rate hyperparameter. If the learning rate is too low, it will take a long time to converge to a solution. On the other hand, if it is too high, you may overshoot the minimum and the algorithm may diverge, never finding a good solution.

6
Q

What are the three main variants of gradient descent?

A

1) Batch Gradient Descent
2) Stochastic Gradient Descent
3) Mini-Batch Gradient Descent

7
Q

What is the main idea of Batch Gradient Descent?

A

Batch Gradient Descent computes the gradient of the cost function (the vector of partial derivatives) with regard to each model parameter θ, using the whole training set at every step. If MSE is the cost function and the learning rate is η (eta), the update rule is:

θnext = θ - η ∇θ(MSE(θ))

Note that at every step it computes ∇θ(MSE(θ)) over the full training set, which makes it very slow on large training sets. The learning rate η is also really important: if it is too high, the solution may bounce up and down without ever converging, and if it is too low, it will take a long time to reach the solution.
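A minimal NumPy sketch of batch gradient descent for linear regression (the synthetic data and hyperparameter values are illustrative assumptions):

```python
import numpy as np

np.random.seed(42)
m = 100
X = 2 * np.random.rand(m, 1)              # one feature
y = 4 + 3 * X + np.random.randn(m, 1)     # linear data plus Gaussian noise
X_b = np.c_[np.ones((m, 1)), X]           # add x0 = 1 (bias term) to each instance

eta = 0.1                                 # learning rate
n_epochs = 1000
theta = np.random.randn(2, 1)             # random initialization

for epoch in range(n_epochs):
    gradients = 2 / m * X_b.T @ (X_b @ theta - y)   # gradient of MSE over the full set
    theta = theta - eta * gradients                  # theta_next = theta - eta * gradient

print(theta)                              # close to [[4.], [3.]]
```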

8
Q

What is the main idea of Stochastic Gradient Descent?

A

The update rule is the same as for Batch Gradient Descent. For example, if using the MSE in a linear regression problem, the formula would be:

θnext = θ - η ∇θ(MSE(θ))

But Stochastic Gradient Descent (SGD) picks a single training instance at random at every step and computes the gradient based only on that instance. This makes each step very fast, but the cost function bounces up and down, decreasing only on average.

It can be regarded as a stochastic approximation of gradient descent optimization, since it replaces the actual gradient (calculated from the entire data set) by an estimate thereof (calculated from a randomly selected subset of the data).
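A minimal Scikit-Learn sketch using SGDRegressor (the synthetic data and hyperparameter values are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

np.random.seed(42)
m = 100
X = 2 * np.random.rand(m, 1)
y = (4 + 3 * X + np.random.randn(m, 1)).ravel()

# At most max_iter epochs; eta0 is the initial learning rate of the default schedule.
sgd_reg = SGDRegressor(max_iter=1000, tol=1e-5, eta0=0.01, random_state=42)
sgd_reg.fit(X, y)
print(sgd_reg.intercept_, sgd_reg.coef_)   # roughly [4.] and [3.]
```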

9
Q

What is the main idea of mini-batch Gradient Descent?

A

Mini-Batch Gradient Descent is a combination of Batch Gradient Descent and Stochastic Gradient Descent. The update rule is the same; for example, if using the MSE in a linear regression problem, the formula would be:

θnext = θ - η ∇θ(MSE(θ))

Instead of the full training set (batch GD) or a single instance (SGD), Mini-Batch Gradient Descent computes the gradients on small random subsets of instances called mini-batches. After each iteration it draws a new mini-batch and recomputes the gradient.

It can be regarded as a stochastic approximation of gradient descent optimization, since it replaces the actual gradient (calculated from the entire data set) by an estimate thereof (calculated from a randomly selected subset of the data).

10
Q

Why is a polynomial regression model capable of finding relationships between features?

A

PolynomialFeatures adds all combinations of features up to the given degree. For example, if there were two features a and b, PolynomialFeatures with degree=3 would not only add the features a², a³, b², and b³, but also the combinations ab, a²b, and ab².

Note that a plain linear regression on the original features cannot capture such relationships, since they are of degree 2 and above. Also beware of the combinatorial explosion of the number of features!
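A minimal Scikit-Learn sketch of polynomial regression (the quadratic data generation is an illustrative assumption):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

np.random.seed(42)
m = 100
X = 6 * np.random.rand(m, 1) - 3
y = 0.5 * X ** 2 + X + 2 + np.random.randn(m, 1)   # quadratic data plus noise

poly_features = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly_features.fit_transform(X)            # adds the x^2 feature

lin_reg = LinearRegression()
lin_reg.fit(X_poly, y)
print(lin_reg.intercept_, lin_reg.coef_)           # roughly [2.] and [[1., 0.5]]
```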

11
Q

Imagine you have a function y = 0.5x₁² + x₁ + 2 + Gaussian noise generating random data points. What would happen if you used a polynomial regression model of degree 1? Of degree 100?

A

The high-degree (degree 100) polynomial regression model would severely overfit the training data, while the linear (degree 1) model would underfit it.

12
Q

Imagine you have a function y = 0.5x₁² + x₁ + 2 + Gaussian noise generating random data points. What would happen if you plotted the learning curves against the training set size for a regression model of degree 1? Of degree 10? Note: a learning curve here plots the model's error (the RMSE in this case) on the training set and on the validation set as a function of the training set size.

A

For the underfitting model (degree 1), the error on the training data goes up until it reaches a plateau, at which point adding new instances to the training set does not make the average error much better or worse. The validation error also plateaus, close to the training curve and fairly high.

The overfitting model (degree 10) looks similar, but there are two important differences:

1) The error on the training data is much lower.
2) There is a significant gap between the training curve and the validation curve, which is a sign of overfitting.
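A minimal Scikit-Learn sketch for computing such learning curves (the data and model choices are illustrative assumptions):

```python
import numpy as np
from sklearn.model_selection import learning_curve
from sklearn.linear_model import LinearRegression

np.random.seed(42)
X = 6 * np.random.rand(200, 1) - 3
y = (0.5 * X ** 2 + X + 2 + np.random.randn(200, 1)).ravel()

train_sizes, train_scores, valid_scores = learning_curve(
    LinearRegression(), X, y, train_sizes=np.linspace(0.05, 1.0, 20),
    cv=5, scoring="neg_root_mean_squared_error")

train_errors = -train_scores.mean(axis=1)   # RMSE on the training folds
valid_errors = -valid_scores.mean(axis=1)   # RMSE on the validation folds
# Plot train_errors and valid_errors against train_sizes to inspect the gap.
```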

13
Q

What is the ridge regression?

A

Ridge regression is a regularized version of linear regression: a regularization term is added to the cost function. This forces the learning algorithm to not only fit the data but also keep the model weights as small as possible. The cost function is now:

Cost = MSE(θ) + α(1/2)Σi=1n θi²

Note that the sum starts at i = 1: the bias term θ0 is not regularized. The regularization term should only be added to the cost function during training; once the model is trained, you want to use the unregularized performance measure to evaluate the model's performance. The hyperparameter α controls how much you want to regularize the model.
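A minimal Scikit-Learn sketch of ridge regression with scaling (the data and the alpha value are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

np.random.seed(42)
m = 50
X = 2 * np.random.rand(m, 1)
y = 1 + 0.5 * X + np.random.randn(m, 1) / 1.5

# Scale first: ridge regression is sensitive to the scale of the input features.
ridge_reg = make_pipeline(StandardScaler(), Ridge(alpha=1.0, solver="cholesky"))
ridge_reg.fit(X, y)
print(ridge_reg.predict([[1.5]]))
```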

14
Q

What do we need to do before performing Ridge Regression?

A

It is important to scale the data before performing Ridge Regression, as it is sensitive to the scale of the input features. This is true of most regularized models.

15
Q

What does regularization tend to do?

A

It tends to increase the bias but lower the variance.

16
Q

What is the closed form solution of the Ridge regression?

A

θ^ = (XTX + αA)-1XTy

A is the (n+1)×(n+1) identity matrix, except with a 0 in the top-left cell, corresponding to the bias term (so the bias is not regularized). Note that θ^ denotes the estimated parameter vector. The hyperparameter α controls how much you want to regularize the model.

17
Q

What is Lasso regression, and what is its formula?

A

The Least absolute shrinkage and selection operator regression (usually simply called Lasso Regression) is a regularized version of Linear Regression. The cost function is given by:

Cost = MSE(θ) + αΣi=1n |θi|
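A minimal Scikit-Learn sketch of lasso regression (the data and the alpha value are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import Lasso

np.random.seed(42)
m = 50
X = 2 * np.random.rand(m, 1)
y = (1 + 0.5 * X + np.random.randn(m, 1) / 1.5).ravel()

lasso_reg = Lasso(alpha=0.1)
lasso_reg.fit(X, y)
print(lasso_reg.intercept_, lasso_reg.coef_)
```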

18
Q

What is the difference between ridge regression and lasso regression?

A

They are both regularized versions of linear regression. Ridge uses an L2 penalty (the sum of the squared parameters) as the regularization term, while lasso uses an L1 penalty (the sum of the absolute values of the parameters).

19
Q

What is the most important characteristic of the Lasso Regression?

A

It tends to eliminate the weights of the least important features (i.e., set them to zero).

20
Q

What is the difference between the L1 norm and the L2 norm?

A

L2 norm:

sqrt(Σθi²)

L1 norm:

Σ|θi|

21
Q

How can we keep Gradient Descent from bouncing around the optimum at the end of training when using Lasso?

A

You need to gradually reduce the learning rate during training (it will still bounce around the optimum but the steps will get smaller and smaller, so it will converge).

22
Q

What is Elastic net regression?

A

It is a middle ground between ridge regression and lasso regression: the regularization term is a mix of the ridge and lasso regularization terms, controlled by a mix ratio r (r = 0 is equivalent to ridge, r = 1 is equivalent to lasso).
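A minimal Scikit-Learn sketch (the alpha value and l1_ratio, which plays the role of the mix ratio r, are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import ElasticNet

np.random.seed(42)
m = 50
X = 2 * np.random.rand(m, 1)
y = (1 + 0.5 * X + np.random.randn(m, 1) / 1.5).ravel()

elastic_net = ElasticNet(alpha=0.1, l1_ratio=0.5)   # l1_ratio is the mix ratio r
elastic_net.fit(X, y)
print(elastic_net.intercept_, elastic_net.coef_)
```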

23
Q

When should you use plain Linear regression, ridge, lasso, or elastic net?

A

It is almost always preferable to have at least a little bit of regularization, so generally you should avoid plain linear regression. Ridge is a good default, but if you suspect that only a few features are useful, you should prefer lasso or elastic net, because they tend to reduce the useless features' weights to zero.

24
Q

What do we mean by early stopping as a way to regularize a model, and why should we use it?

A

It is a different way to regularize iterative learning algorithms such as gradient descent: you stop training as soon as the validation error reaches a minimum.

To prevent the model from overfitting the training data. Geoffrey Hinton called early stopping a beautiful free lunch.
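A minimal sketch of early stopping with Scikit-Learn's SGDRegressor (the synthetic data, epoch count, and learning rate are illustrative assumptions):

```python
from copy import deepcopy
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

np.random.seed(42)
X = 2 * np.random.rand(200, 1)
y = (4 + 3 * X + np.random.randn(200, 1)).ravel()
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.5, random_state=42)

sgd_reg = SGDRegressor(eta0=0.002, random_state=42)
best_val_error = float("inf")
best_model = None

for epoch in range(500):
    sgd_reg.partial_fit(X_train, y_train)   # one pass over the training set, resuming from the previous state
    val_error = mean_squared_error(y_val, sgd_reg.predict(X_val))
    if val_error < best_val_error:          # keep a copy of the best model seen so far
        best_val_error = val_error
        best_model = deepcopy(sgd_reg)
```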

25
Q

What is the formula for logistic regression?

A

The logistic regression model computes a weighted sum of the input features (plus a bias term) and outputs the logistic (sigmoid) of that sum:

p^ = σ(θTx) = 1 / (1 + exp(-θTx))

In the equivalent statistics notation, log(p / (1 - p)) = β0 + β1x1 + ... + βnxn, where β0 is the intercept (bias) term and the other β coefficients weight the different features xi.
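A minimal NumPy sketch of the prediction (the parameter and feature values are illustrative assumptions):

```python
import numpy as np

def sigmoid(t):
    return 1 / (1 + np.exp(-t))

theta = np.array([-3.0, 1.5])   # [bias term, feature weight]
x = np.array([1.0, 2.5])        # [1 for the bias term, feature value]
p_hat = sigmoid(theta @ x)      # estimated probability of the positive class
y_pred = int(p_hat >= 0.5)      # predict class 1 if p_hat >= 0.5
```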

26
Q

What is a logistic regression model (or logit model)?

A

In statistics, the logistic model (or logit model) is used to model the probability of a certain class or event, such as pass/fail, win/lose, alive/dead, or healthy/sick. This can be extended to model several classes of events, such as determining whether an image contains a cat, dog, lion, etc. Each class would be assigned a probability between 0 and 1, with the probabilities summing to one.

27
Q

Explain the cost function of logistic regression for a single training instance.

A

cost = -log(p^) if y = 1, and cost = -log(1 - p^) if y = 0

This cost function makes sense because -log(t) grows very large as t approaches 0, so the cost is large if the model estimates a probability close to 0 for a positive instance (or close to 1 for a negative instance). On the other hand, -log(t) is close to 0 when t is close to 1, so the cost is close to 0 when the estimated probability is close to the target.

28
Q

What is the logistic regression cost function (log loss)?

A

cost(θ) = -(1/m)Σi=1m [y(i)log(p^(i)) + (1 - y(i))log(1 - p^(i))]

Note that y(i) is the target value for the ith instance; it is either 0 or 1.
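A minimal NumPy sketch of the log loss (the target and probability arrays are illustrative assumptions):

```python
import numpy as np

y = np.array([1, 0, 1, 1])                 # target classes
p_hat = np.array([0.9, 0.2, 0.7, 0.6])     # estimated probabilities for class 1

log_loss = -np.mean(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))
print(log_loss)                            # about 0.30
```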

29
Q

What is the closed-form solution for the logistic regression cost function?

A

There is no known closed-form equation to compute the value of θ that minimizes the cost function.

The good news is that this cost function is convex, so Gradient Descent (or any other optimization algorithm) is guaranteed to find the global minimum (if the learning rate is not too large and you wait long enough).

30
Q

What is the softmax regression, or multinomial logistic regression?

A

It is logistic regression generalized to support multiple classes directly, without having to train and combine multiple binary classifiers.

The idea is simple: given an instance x, the softmax regression model computes a score for each class, estimates the probability of each class from those scores, and bases its prediction on the class with the highest estimated probability.

31
Q

What is the softmax function (or multinomial logistic function)?

A

p^k = σ(s(x))k = exp(sk(x)) / (Σj=1K exp(sj(x))) is the softmax function

sk(x) = xTθ(k) is the softmax score for class k

Where K is the number of classes, s(x) is the vector containing the scores of each class for the instance x, and σ(s(x))k is the estimated probability that the instance x belongs to class k, given the scores of each class for that instance.

Note, once you have computed the score of every class for the instance x, you can estimate the probability that the instance belongs to class k by running the scores through the softmax function.
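A minimal NumPy sketch of the softmax function (the score values are illustrative assumptions):

```python
import numpy as np

def softmax(scores):
    exps = np.exp(scores - scores.max())   # subtract the max score for numerical stability
    return exps / exps.sum()

s = np.array([2.0, 1.0, 0.1])              # scores s_k(x) for K = 3 classes
p_hat = softmax(s)                          # estimated probability of each class
print(p_hat, p_hat.argmax())               # probabilities sum to 1; predicted class = 0
```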

32
Q

What do we need to keep in mind when using softmax regression?

A

The softmax regression classifier predicts only one class at a time (i.e., it is multiclass, not multioutput), so it should be used only with mutually exclusive classes, such as different species of plants. You cannot use it to recognize multiple people in one picture.

33
Q

The softmax regression uses the cross-entropy cost function. What is its formula, and what does it mean?

A

Cost(θ) = -(1/m)Σi=1mΣk=1K yk(i)log(p^k(i))

In this equation, yk(i) is the target probability that the ith instance belongs to class k. In general, it is either 1 or 0, depending on whether the instance belongs to the class or not.

The objective is to have a model that estimates a high probability for the target class and a low probability for the other classes. The cross-entropy cost function penalizes the model when it estimates a low probability for a target class.

34
Q

Which linear regression training algorithm can you use if you have a training set with millions of features?

A

You can use stochastic gradient descent or mini-batch gradient descent, and perhaps batch gradient descent if the training set fits in memory. But you cannot use the normal equation or the SVD approach, because the computational complexity grows quickly (more than quadratically) with the number of features.

35
Q

Suppose the features in your training set have very different scales. Which algorithms might suffer from this, and how? What can you do about it?

A

If the features in your training set have very different scales, the cost function will have the shape of an elongated bowl, so the gradient descent algorithm will take a long time to converge. To solve this, you should scale the data before training the model.

36
Q

Can Gradient Descent get stuck in a local minimum when training a logistic regression model?

A

Gradient descent cannot get stuck in a local minimum when training a logistic regression model, because the cost function is convex and therefore has only one (global) minimum.

37
Q

Do all gradient descent algorithms lead to the same model, provided you let them run long enough?

A

If the optimization problem is convex (such as linear regression or logistic regression), and assuming the learning rate is not too high, then all gradient descent algorithms will approach the global optimum and end up producing fairly similar models. However, unless you gradually reduce the learning rate, stochastic GD and mini-batch GD will never truly converge; instead, they will keep jumping back and forth around the global optimum.

38
Q

Suppose you use batch gradient descent and you plot the validation error at every epoch. If you notice that the validation error consistently goes up, what is likely going on? How can you fix this?

A

One possibility is that the learning rate is too high and the algorithm is diverging. If the training error also goes up, then this is clearly the problem and you should reduce the learning rate. However, if the training error is not going up, then your model is overfitting the training set and you should stop training.

39
Q

Is it a good idea to stop mini-batch gradient descent immediately when the validation error goes up?

A

Due to their random nature, neither stochastic GD nor mini-batch GD is guaranteed to make progress at every single training iteration. So if you stop immediately when the validation error goes up, you may stop much too early, before the optimum is reached. A better option is to stop only after the validation error has been above its minimum for some time, then roll back to the model that had the minimum validation error.

40
Q

Which GD algorithm will reach the vicinity of the optimal solution the fastest? Which will actually converge? how can you make the others converge as well?

A

Stochastic gradient descent has the fastest training iterations since it considers only one training instance at a time, so it is generally the first to reach the vicinity of the global optimum. However, only batch GD will actually converge, given enough training time. As mentioned, stochastic GD and mini-batch GD will bounce around the optimum unless you gradually reduce the learning rate.

41
Q

Suppose you are using polynomial regression. You plot the learning curves and you notice that there is a large gap between the training error and the validation error. What is happening? What are three ways to solve this?

A

You are likely overfitting the training set. The three solutions are:

1) Try a model with fewer degrees of freedom (e.g., reduce the polynomial degree).
2) Regularize the model, for example with ridge (L2 penalty) or lasso (L1 penalty) regularization.
3) Increase the size of the training set.

42
Q

Suppose you are using ridge regression and you notice that the training error and the validation error are almost equal and fairly high. Would you say that the model suffers from high bias or high variance? Should you increase the regularization hyperparameter α or reduce it?

A

If both the training error and the validation error are almost equal and fairly high, the model is likely underfitting the training set, which means it has high bias. You should try reducing the regularization hyperparameter α.

Note that the hyperparameter α is used to reduce overfitting; the situation described here hints that we are underfitting instead.

43
Q

Why would you want to use ridge regression instead of plain linear regression?

A

A model with some regularization typically performs better than a model without any regularization, so you should generally prefer ridge regression over plain linear regression.

44
Q

Why would you want to use Lasso instead of Ridge regression?

A

Lasso uses an L1 penalty, which tends to push the weights down to exactly zero. This leads to sparse models, where all weights are zero except for the most important ones. It is a way to perform feature selection automatically, which is good if you suspect that only a few features actually matter. When you are not sure, you should prefer ridge regression.

45
Q

Why would you want to use elastic net instead of lasso?

A

Elastic net is generally preferred over lasso, since lasso may behave erratically in some cases (when several features are strongly correlated, or when there are more features than training instances).

46
Q

Suppose you want to classify pictures as outdoor/indoor and daytime/nighttime. Should you implement two logistic regression classifiers or one softmax regression classifier?

A

Since these are not mutually exclusive classes (all four combinations are possible), you should train two logistic regression classifiers.