Week 2 - Multivariate Linear Regression Flashcards

1
Q

With multivariate linear regression, our hypothesis looks like…

hθ(x) = θ₀ + θ₁x₁ + θ₂x₂ + θ₃x₃ + ⋯ + θₙxₙ

How can we write this concisely in matrix form? Explain how we get to the matrix form.

A

hθ(x) = θᵀx, the transpose of the parameter vector θ multiplied by the feature vector x.

To get to the matrix form, you have to add an element x₀ = 1 to the feature vector, since with the intercept included there is one more parameter than feature. You can't multiply two vectors whose dimensions don't match.
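A minimal NumPy sketch (feature values and parameters are invented) of the dot-product form, with the extra x₀ = 1 element making the dimensions line up:

```python
import numpy as np

# Hypothetical numbers: one house with 3 features, plus the added x0 = 1.
x = np.array([1.0, 2104.0, 5.0, 45.0])     # [x0, size, bedrooms, age]
theta = np.array([80.0, 0.1, 10.0, -2.0])  # [theta0, theta1, theta2, theta3]

# theta-transpose times x: the concise form of the hypothesis.
h = theta @ x
# Same as writing out theta0*1 + theta1*x1 + theta2*x2 + theta3*x3:
print(h)  # 80 + 210.4 + 50 - 90 = 250.4
```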

2
Q

What is a “feature” in machine learning?

A

In machine learning and pattern recognition, a feature is an individual measurable property or characteristic of a phenomenon.

In machine learning, features are the individual independent variables that act as inputs to your model; the model uses these features to make its predictions. Through feature engineering, new features can also be derived from existing ones.
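As a small illustration (the housing numbers are invented): features are the columns of the model's input matrix, and feature engineering derives new columns from existing ones:

```python
import numpy as np

# Hypothetical housing data: each row is an example, each column a feature.
size_ft2 = np.array([2104.0, 1416.0, 1534.0])
bedrooms = np.array([5.0, 3.0, 3.0])

# Feature engineering: derive a new feature from existing ones.
size_per_bedroom = size_ft2 / bedrooms

# Stack features into the input matrix a model would consume.
X = np.column_stack([size_ft2, bedrooms, size_per_bedroom])
print(X.shape)  # (3, 3): 3 examples, 3 features
```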

3
Q

What is the gradient descent equation for a model with 2 features?

How does it compare to the gradient descent equation with a single feature?

Do you update each equation 1 at a time?

A

The formulas are identical with a couple of exceptions. The update rule for each parameter is

θⱼ := θⱼ − (α/m) · Σ (hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾) xⱼ⁽ⁱ⁾ (sum over the m training examples), for j = 0, 1, …, n

For the θ₀ term, the last component x₀⁽ⁱ⁾ equals one, so its update is identical to the single-feature equation. The only difference for the other parameter updates is that an additional subscript j is needed on the last term, since there is now more than one feature to differentiate.

No, you update all parameters simultaneously.
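A small NumPy sketch (toy numbers) of the vectorized update; computing the whole gradient from the current θ before overwriting it is what makes the update simultaneous:

```python
import numpy as np

# Hypothetical training set: x0 = 1 column plus 2 features.
X = np.array([[1.0, 2.0, 1.0],
              [1.0, 4.0, 3.0],
              [1.0, 6.0, 2.0]])
y = np.array([3.0, 7.0, 8.0])
theta = np.zeros(3)
alpha = 0.01
m = len(y)

for _ in range(1000):
    error = X @ theta - y             # h(x) - y for every example
    gradient = (X.T @ error) / m      # partial derivative for each theta_j
    theta -= alpha * gradient         # all parameters updated simultaneously
```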

4
Q

What is “feature scaling”?

Why is this important when using gradient descent?

Say I have 2 features, house size in squared feet (0-2000 ft squared), and # of bedrooms in a house (1-5 bedrooms). How could I scale these to help gradient descent?

A

Feature scaling involves dividing the input values by the feature's range (max value − min value), so that the new values span a range of about 1.

Make sure features are on a similar scale; this helps gradient descent converge more quickly. If features are on very different scales, the contours of the cost function become long and skewed, which can take gradient descent a long time to navigate.

Measure house size as size/2000 and bedrooms as (# of bedrooms)/5; that way both features take on a similar range of values, 0 ≤ xᵢ ≤ 1.
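A tiny NumPy sketch of exactly this scaling (house numbers invented):

```python
import numpy as np

size_ft2 = np.array([2104.0, 1416.0, 852.0])  # roughly 0-2000+ sq ft
bedrooms = np.array([5.0, 3.0, 2.0])          # 1-5 bedrooms

# Divide each feature by (roughly) its range so both land near 0..1.
size_scaled = size_ft2 / 2000.0
bedrooms_scaled = bedrooms / 5.0
print(size_scaled, bedrooms_scaled)
```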

5
Q

Generally, what range do we want to scale all our features to?

A

−1 ≤ xᵢ ≤ 1

But it isn’t a hard and fast rule; as long as features are roughly on this scale, it should be fine. Note that a range that is far too small is also a problem for the algorithm. This is a rule-of-thumb thing: Andrew Ng’s rule of thumb is that anything from about −3 ≤ xᵢ ≤ 3 down to about −⅓ ≤ xᵢ ≤ ⅓ is acceptable; much outside that, rescale.

6
Q

What is mean normalization?

Why would we do this?

What is an acceptable range for our features if this is done?

A

Mean normalization involves subtracting the average value of an input variable from its values, resulting in a new average of zero for that variable.

Formula: xᵢ := (xᵢ − μᵢ) / sᵢ, where μᵢ is the average value of the feature and sᵢ is either the range (max − min) or the standard deviation.

We do this to help gradient descent converge faster.

If you divide by the range, the values land in a window of width 1, roughly −0.5 ≤ xᵢ ≤ 0.5 (and always within −1 ≤ xᵢ ≤ 1).

If you divide by the standard deviation, there is no hard bound, but most values fall within a few standard deviations of zero.
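A short NumPy sketch (made-up house sizes) of both variants:

```python
import numpy as np

# Hypothetical house sizes.
x = np.array([2104.0, 1416.0, 1534.0, 852.0])
mu = x.mean()  # the normalized values will average to zero

by_range = (x - mu) / (x.max() - x.min())  # roughly within [-0.5, 0.5]
by_std = (x - mu) / x.std()                # zero mean, unit variance

print(by_range.round(2), by_std.round(2))
```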

7
Q

How can we check if gradient descent is working correctly?

What can we do if it doesn’t appear to be working correctly?

A

You can plot the cost J(θ) vs. the number of iterations. If J(θ) decreases on every iteration, gradient descent is working well; once the curve flattens out, it has converged.

If we see the plot of J(θ) vs. the number of iterations rising (or oscillating), we have probably chosen an α that is too big and should try a smaller one.
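A minimal NumPy sketch (toy data; in practice you would plot `history` with matplotlib) of tracking J(θ) across iterations:

```python
import numpy as np

def cost(X, y, theta):
    # J(theta) = (1 / 2m) * sum of squared errors
    return ((X @ theta - y) ** 2).sum() / (2 * len(y))

# Tiny made-up dataset: x0 = 1 column plus one feature.
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 4.0, 6.0])
theta = np.zeros(2)
alpha, m = 0.1, len(y)

history = []
for _ in range(100):
    theta -= alpha * (X.T @ (X @ theta - y)) / m
    history.append(cost(X, y, theta))

# Working correctly: J(theta) shrinks on every iteration.
print(history[0], history[-1])
```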

8
Q

What are recommended values to choose for the learning rate, α?

A

0.001, 0.01, 0.1, 1

Basically, try a range of values (stepping up by roughly 3×, e.g. 0.001, 0.003, 0.01, 0.03, …) and check the plot of J(θ) vs. the number of iterations for each.
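A hedged sketch (same toy-data idea as above, values invented) of sweeping candidate rates; the smallest barely moves, and the largest diverges:

```python
import numpy as np

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 4.0, 6.0])
m = len(y)

def cost_after(alpha, iters=100):
    # Run gradient descent for a fixed budget and report the final cost.
    theta = np.zeros(2)
    for _ in range(iters):
        theta -= alpha * (X.T @ (X @ theta - y)) / m
    return ((X @ theta - y) ** 2).sum() / (2 * m)

# Too small barely makes progress; too big makes the cost blow up.
for alpha in (0.001, 0.01, 0.1, 1.0):
    print(alpha, cost_after(alpha))
```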

9
Q

Why is feature scaling vital when using polynomial regression?

A

Because in polynomial regression the scales of the different features are often vastly different. For example, say we want our hypothesis to be a cubic function,

hθ(x) = θ₀ + θ₁x₁ + θ₂x₁² + θ₃x₁³

If x₁ has range 1 to 1,000, then x₁² has range 1 to 1,000,000 and x₁³ has range 1 to 1,000,000,000.
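A quick NumPy illustration of how the polynomial terms blow up in scale, and one way to rescale each of them by its own range:

```python
import numpy as np

x1 = np.array([1.0, 500.0, 1000.0])  # raw feature with range 1-1000

# The polynomial terms explode in scale:
X_poly = np.column_stack([x1, x1**2, x1**3])
print(X_poly.max(axis=0))  # [1.e+03 1.e+06 1.e+09]

# So scale each column by its own range before running gradient descent:
X_scaled = X_poly / (X_poly.max(axis=0) - X_poly.min(axis=0))
```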

10
Q

What is the “normal equation” method for computing parameters in linear regression?

A

Instead of using gradient descent, we can use matrix algebra to solve for the parameters in linear regression.

Normal equation: θ = (XᵀX)⁻¹Xᵀy

That is: the inverse of (X transpose times X), multiplied by X transpose, multiplied by the vector y.
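A minimal NumPy sketch (invented housing numbers); in code one typically solves the linear system rather than forming the inverse explicitly:

```python
import numpy as np

# Hypothetical design matrix with the x0 = 1 column already included.
X = np.array([[1.0, 2104.0, 5.0],
              [1.0, 1416.0, 3.0],
              [1.0, 1534.0, 3.0],
              [1.0,  852.0, 2.0]])
y = np.array([460.0, 232.0, 315.0, 178.0])

# theta = (X'X)^-1 X'y, computed by solving (X'X) theta = X'y.
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)
```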

11
Q

Advantages/disadvantages of gradient descent vs normal equation for finding parameters in linear regression.

A

Gradient descent: you have to choose α and it can take many iterations, but it works well even when n (the number of features) is large.

Normal equation: no need to choose α and no need to iterate, but it doesn’t work well when n is large (roughly n > 10,000), because XᵀX is an n × n matrix and computing its inverse costs roughly O(n³).

12
Q

What does it mean if a matrix is “singular” or “degenerate”?

A

It means that it is non-invertible, i.e. it doesn’t have an inverse. This happens fairly rarely in practice.

13
Q

What are the 2 likely causes of having a non-invertible matrix?

A
  1. Redundant features (linearly dependent): for example, if you have a feature for size in feet and a feature for size in meters, the two features will be linearly related. The fix is to delete the redundant feature (see the sketch after this list).
  2. Too many features (e.g. m ≤ n), i.e. more features than training examples. The fixes are to either delete some features or use regularization.
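A small NumPy sketch (made-up sizes) showing cause 1 in action: a redundant feature makes XᵀX rank-deficient, and the pseudoinverse still yields a solution (analogous to using pinv instead of inv in Octave):

```python
import numpy as np

size_ft = np.array([10.0, 20.0, 30.0, 40.0])
size_m = size_ft * 0.3048           # redundant: a linear function of size_ft
X = np.column_stack([np.ones(4), size_ft, size_m])
y = np.array([1.0, 2.0, 3.0, 4.0])

A = X.T @ X
print(np.linalg.matrix_rank(A))     # 2, not 3: A is singular (non-invertible)

# inv(A) is ill-defined here; the pseudoinverse still returns a solution.
theta = np.linalg.pinv(A) @ (X.T @ y)
print(theta)
```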