Data Modeling Flashcards

1
Q

When is linear regression the correct choice for a data model?

A

When there is a linear relationship between the dependent variable and the independent variable(s).

2
Q

What are two ways you can check a linear regression model is the correct choice as a data model?

A

Run np.corrcoef(x, y1); a linear relationship is indicated by an off-diagonal value in the returned matrix that is close to 1 or -1.
Run plt.scatter(x, y1) and see if a linear relationship exists.
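
As a rough illustration of the first check, the sketch below uses made-up data (the arrays x and y1 are hypothetical, chosen only for this example). np.corrcoef returns a 2x2 correlation matrix, so the coefficient of interest is the off-diagonal entry.

```python
import numpy as np

# Hypothetical data: y1 is roughly linear in x, with some noise added.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y1 = 3.0 * x + 2.0 + rng.normal(0, 1.0, size=x.size)

# np.corrcoef returns a 2x2 matrix; entry [0, 1] is the Pearson r of x and y1.
corr_matrix = np.corrcoef(x, y1)
r = corr_matrix[0, 1]
print(f"Pearson r = {r:.3f}")
```

A value of r near 1 or -1 here would support trying a linear model; the scatter plot is still worth checking, because a single number can hide curvature or outliers.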

3
Q

What is linear regression with multiple independent variables called?

A

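A multiple linear regression model. It is often loosely called “multivariate”, although strictly that term describes models with more than one dependent variable.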

4
Q

What is the definition of a “cost function”?

A

A cost function is a function that takes in the regression model and outputs a number indicating the quality of the fit. Lower costs are better than higher costs.

5
Q

What is a common cost function for linear regression?

A

Sum of squared errors.
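
A minimal sketch of this cost function, assuming made-up data and a hypothetical helper named sum_squared_errors (not part of any library). It compares two candidate lines on the same points; the lower cost indicates the better fit.

```python
import numpy as np

def sum_squared_errors(y_true, y_pred):
    """Return the sum of squared differences between observed and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sum((y_true - y_pred) ** 2))

# Hypothetical observations that lie exactly on the line y = 2x.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])

cost_good = sum_squared_errors(y, 2.0 * x)       # candidate line y = 2x
cost_bad = sum_squared_errors(y, 1.0 * x + 1.0)  # candidate line y = x + 1

print(cost_good, cost_bad)  # 0.0 14.0
```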

6
Q

Outline the workflow leading up to linear regression.

A
  1. Collect data
  2. Data study (EDA) and cleaning.
  3. Isolate and study relationships: how do the independent variables (features) relate to the dependent variable (scatter plots, etc.)?
  4. Modelling
7
Q

What is the conventional import for the statsmodels formula API?

A

import statsmodels.formula.api as smf

8
Q

What is the syntax for running a linear regression analysis using statsmodels?

A

diamond_reg = smf.ols("price ~ carat + depth + table", data=diamonds).fit()
diamond_reg.summary()

9
Q

What are the areas of interest in the summary obtained after running a linear regression analysis using statsmodels?

A

The “coef” value for each independent variable.
The “P>|t|” column (the p-value) for each independent variable; a large value (above roughly 0.05) suggests the variable’s effect is not statistically significant.
The R squared value (closer to 1 is better).

10
Q

How can I use the Linear Regression analysis from statsmodels to make predictions on the data?

A

This creates a new column with the predicted price. The “new_diamonds” DataFrame must contain columns with the same names as the independent variables used in the formula passed to .ols; the formula API matches columns by name, so their order does not matter.
new_diamonds['predicted'] = diamond_reg.predict(new_diamonds)
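
The sketch below tries this end to end on a made-up stand-in for the diamonds data (all values and coefficients are invented for illustration, and it assumes pandas and statsmodels are installed). It also checks what happens when the columns of new_diamonds are in a different order from the formula: the predictions come out the same, which is consistent with the formula interface looking columns up by name.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical stand-in for the diamonds data (values are made up).
rng = np.random.default_rng(1)
n = 200
diamonds = pd.DataFrame({
    "carat": rng.uniform(0.2, 2.0, n),
    "depth": rng.uniform(55, 65, n),
    "table": rng.uniform(50, 60, n),
})
diamonds["price"] = (
    5000 * diamonds["carat"]
    - 20 * diamonds["depth"]
    + 10 * diamonds["table"]
    + rng.normal(0, 100, n)
)

diamond_reg = smf.ols("price ~ carat + depth + table", data=diamonds).fit()

# New rows, with the columns deliberately in a different order from the formula.
new_diamonds = pd.DataFrame({
    "table": [55.0, 58.0],
    "carat": [0.5, 1.2],
    "depth": [61.0, 62.5],
})
pred_shuffled = diamond_reg.predict(new_diamonds)
pred_ordered = diamond_reg.predict(new_diamonds[["carat", "depth", "table"]])

# Store the predictions alongside the inputs.
new_diamonds["predicted"] = pred_shuffled
```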

11
Q

What is a residual?

A

The difference between a measurement and its model prediction.

12
Q

Discuss the x axis in regression analysis.

A

The variable we use to explain or predict the outcome.
The independent variable.
The explanatory variable.

13
Q

Discuss the Y axis in regression analysis

A

The dependent variable

Response variable

14
Q

What is the gradient called on a regression analysis chart?

A

The regression coefficient.

15
Q

What is the relationship between the Pearson coefficient, the standard deviations of the variables, and the gradient (regression coefficient)?

A

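If the standard deviations of the two variables are the same, the gradient and the Pearson coefficient are the same. In general, the gradient equals the Pearson coefficient multiplied by the ratio of the standard deviation of y to the standard deviation of x.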
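
This relationship can be checked numerically. The sketch below uses made-up data and np.polyfit for the simple regression fit; it compares the fitted gradient with r multiplied by the ratio of standard deviations, then standardizes both variables so their standard deviations are equal.

```python
import numpy as np

# Hypothetical data with different spreads in x and y.
rng = np.random.default_rng(2)
x = rng.normal(0, 2.0, 500)
y = 1.5 * x + rng.normal(0, 1.0, 500)

r = np.corrcoef(x, y)[0, 1]
slope, intercept = np.polyfit(x, y, 1)

# General relationship: slope = r * (std of y / std of x).
slope_from_r = r * np.std(y) / np.std(x)

# Standardizing gives both variables a standard deviation of 1,
# so the fitted gradient should equal the Pearson coefficient.
zx = (x - x.mean()) / x.std()
zy = (y - y.mean()) / y.std()
slope_standardized = np.polyfit(zx, zy, 1)[0]
```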

16
Q

What are the two main components of a statistical model?

A

A mathematical formula that expresses the deterministic, predictable component, and the residual error.

17
Q

What is logistic regression?

A

A form of regression developed for proportions, which ensures that the fitted curve cannot go above 100% or below 0%.
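
The bounded curve comes from the logistic (sigmoid) function, which logistic regression applies to a linear combination of the inputs. A minimal sketch, with hypothetical input values:

```python
import numpy as np

def logistic(z):
    """Map any real number to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical linear-predictor values, from very negative to very positive.
z = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
p = logistic(z)
print(p)
```

However extreme the input, the output stays between 0 and 1, and an input of 0 maps to 0.5.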

18
Q

What are n and p when describing a data set?

A

n is the number of records

p is the number of attributes (parameters)

19
Q

What is a 1-dimensional vector?

A

A single column of values.

If it contains the outcomes that we are trying to predict, by convention it is called ‘y’.

20
Q

What is a matrix of features?

A

The columns of data that will help us predict the vector ‘y’.
The number of rows should equal the number of rows of ‘y’.
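
A small sketch of these shapes in NumPy, using invented numbers: X is the feature matrix with n rows and p columns, and y is the 1-dimensional target vector with n entries.

```python
import numpy as np

# Hypothetical data set with n = 4 records and p = 2 features.
X = np.array([
    [0.5, 61.0],
    [0.7, 62.5],
    [1.0, 60.0],
    [1.2, 63.0],
])  # feature matrix, shape (n, p)
y = np.array([1500.0, 2300.0, 4100.0, 5200.0])  # target vector, shape (n,)

n, p = X.shape
print(n, p, y.shape)
```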

21
Q

What is a binary classification?

A

Predicting the outcome of one of two possibilities.

Usually these two will be classified as 1 and 0, or positive vs. null.

22
Q

What is Gini Impurity?

A

It measures how mixed the classes are within a node, and is used to judge the quality of a split in a decision tree.
A perfect split produces branches with a Gini Impurity of 0.

When training a decision tree, the best split is chosen by maximizing the Gini Gain, which is calculated by subtracting the weighted impurities of the branches from the original impurity.
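
A minimal sketch of both quantities, assuming the usual definition of Gini Impurity as 1 minus the sum of squared class proportions. The function names and labels are hypothetical.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_gain(parent, left, right):
    """Parent impurity minus the size-weighted impurities of the two branches."""
    n = len(parent)
    weighted = (
        (len(left) / n) * gini_impurity(left)
        + (len(right) / n) * gini_impurity(right)
    )
    return gini_impurity(parent) - weighted

# Hypothetical node with three samples of each class.
parent = [0, 0, 0, 1, 1, 1]
perfect_gain = gini_gain(parent, [0, 0, 0], [1, 1, 1])  # pure branches
poor_gain = gini_gain(parent, [0, 0, 1], [0, 1, 1])     # mixed branches
```

The perfect split recovers the whole parent impurity (0.5 for this node) as gain, while the mixed split gains much less.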

23
Q

What are the steps involved in training a decision tree?

A

First determine the root node (the first feature to split on). This is done by trying every candidate split (every feature and every unique threshold) and computing its Gini Gain; the split with the best gain becomes the root node.
Repeat this for each child node. When no split improves on a node (all candidate splits are equally good, or the best Gini Gain is 0), it becomes a leaf node.
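
The search at a single node can be sketched as below. This is an illustrative sketch only, not how any particular library implements it: the helper names and data are hypothetical, and a full tree would call best_split recursively on each branch until the gain is 0.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Try every feature and every unique threshold; return the best (gain, feature, threshold)."""
    parent = gini(labels)
    best = (0.0, None, None)
    for feature in range(len(rows[0])):
        for threshold in sorted({row[feature] for row in rows}):
            left = [lab for row, lab in zip(rows, labels) if row[feature] < threshold]
            right = [lab for row, lab in zip(rows, labels) if row[feature] >= threshold]
            if not left or not right:
                continue  # a split must send samples to both branches
            weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            gain = parent - weighted
            if gain > best[0]:
                best = (gain, feature, threshold)
    return best

# Hypothetical samples: feature 0 separates the classes cleanly, feature 1 does not.
rows = [[1.0, 7.0], [2.0, 3.0], [3.0, 8.0], [4.0, 2.0]]
labels = [0, 0, 1, 1]
gain, feature, threshold = best_split(rows, labels)
```

If best_split returns None for the feature, no split improved on the node, so it would become a leaf.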

24
Q

Explain RMSE in regression modeling

A

Root Mean Squared Error
The standard deviation of the residuals
Ranges from 0 to infinity, with 0 being a perfect fit; the larger the number, the greater the error
An absolute measure of goodness of fit, so it cannot be used to compare fits across different data sets, only between models fitted to the same data set
Same units as the dependent variable.
Predictions farther from the true value have a disproportionately large impact on the RMSE, so it is sensitive to outliers
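
A minimal sketch of the calculation, with a hypothetical rmse helper and invented values: square the residuals, average them, then take the square root.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt of the mean of the squared residuals."""
    residuals = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(residuals ** 2)))

# Hypothetical observations and predictions; every prediction is off by 1.
y_true = np.array([10.0, 12.0, 14.0, 16.0])
y_pred = np.array([11.0, 11.0, 15.0, 15.0])

error = rmse(y_true, y_pred)
print(error)  # 1.0
```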

25
Q

Explain MAE in regression models

A

Mean Absolute Error
The average of the absolute errors between the true values and the predicted values
Similar to RMSE: ranges from 0 to infinity, with 0 being a perfect fit, and is an absolute measure of fit
Predicted values farther away contribute proportionally to the MAE, so it is more robust to outliers than RMSE
A relatively intuitive regression metric that is simple to explain.
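
The difference in outlier sensitivity can be seen in a small sketch. The helpers and numbers below are hypothetical; the second set of predictions differs from the first only in one badly wrong value.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error: the mean of the absolute residuals."""
    return float(np.mean(np.abs(np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float))))

def rmse(y_true, y_pred):
    """Root mean squared error, for comparison."""
    residuals = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(residuals ** 2)))

y_true = np.array([10.0, 12.0, 14.0, 16.0])
close_pred = np.array([11.0, 11.0, 15.0, 15.0])    # every error is 1
outlier_pred = np.array([11.0, 11.0, 15.0, 25.0])  # last error is 9

mae_close, rmse_close = mae(y_true, close_pred), rmse(y_true, close_pred)
mae_outlier, rmse_outlier = mae(y_true, outlier_pred), rmse(y_true, outlier_pred)
```

With equal errors the two metrics agree (both 1.0 here); with one large error, MAE rises to 3.0 while RMSE rises further, to the square root of 21.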

26
Q

Explain R² (R-squared) in regression models

A

Coefficient of Determination
Proportion of variance in the dependent variable that is predicted by the independent variables
Typically ranges from 0 to 1, with 1 being a perfect fit and 0 being no better than predicting the dependent variable with its mean
Relative measure of goodness of fit
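
A minimal sketch of the calculation, assuming the common definition of 1 minus the ratio of the residual sum of squares to the total sum of squares. The helper name and values are hypothetical.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

y_true = np.array([1.0, 2.0, 3.0, 4.0])

perfect = r_squared(y_true, y_true)                                # exact predictions
baseline = r_squared(y_true, np.full_like(y_true, y_true.mean()))  # always predict the mean
print(perfect, baseline)  # 1.0 0.0
```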

27
Q

What is the “feature” and “target” in machine learning?

A

Features (X) are the things that we use to predict the target (y).

28
Q

When should you use a train/test split when fitting a model?

A

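Always, whenever you want to evaluate the model: held-out test data gives an estimate of how it performs on data it has not seen.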
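
A minimal sketch of a random split using only NumPy. The function name, the 30% test fraction, and the data are assumptions for illustration; in practice a library helper such as scikit-learn's train_test_split is commonly used instead.

```python
import numpy as np

def train_test_split_indices(n_rows, test_fraction=0.25, seed=0):
    """Shuffle row indices and return (train_indices, test_indices)."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_rows)
    n_test = int(round(n_rows * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical feature matrix (10 rows, 2 features) and target vector.
X = np.arange(20, dtype=float).reshape(10, 2)
y = np.arange(10, dtype=float)

train_idx, test_idx = train_test_split_indices(len(y), test_fraction=0.3)
X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]
```

The model would then be fitted on X_train and y_train only, and scored on X_test and y_test.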