Data Modeling Flashcards by Chris Andersson

When is linear regression the correct choice for a data model?

When there is a linear relationship between the dependent and independent variable.

What are two ways you can check that a linear regression model is the correct choice as a data model?

Run np.corrcoef(x, y1); a linear relationship is indicated by an off-diagonal value in the returned correlation matrix close to 1 or -1.
Run plt.scatter(x, y1) and see whether the points fall roughly along a straight line.
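
A minimal sketch of the correlation check, assuming NumPy is installed. The arrays x and y1 are made-up illustration data, not part of the original card. The scatter-plot call is left as a comment because it needs matplotlib and a display.

```python
import numpy as np

# Hypothetical data: y1 is roughly 2*x + 1 with small deviations.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y1 = np.array([3.1, 4.9, 7.2, 8.8, 11.1, 12.9])

# np.corrcoef returns the full 2x2 correlation matrix;
# entry [0, 1] is the Pearson correlation between x and y1.
corr_matrix = np.corrcoef(x, y1)
r = corr_matrix[0, 1]
print(f"Pearson r = {r:.4f}")

# Visual check (requires matplotlib):
# import matplotlib.pyplot as plt
# plt.scatter(x, y1)
# plt.show()
```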

What is linear regression with multiple independent variables called?

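A multiple linear regression model (often loosely called multivariate, although strictly that term refers to models with more than one dependent variable).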

What is the definition of a “cost function”?

A cost function is a function that takes in the regression model and outputs a number indicating the quality of the fit. Lower costs are better than higher costs.

What is a common cost function?

Sum of squared errors.
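
As a rough illustration, the sum of squared errors can be computed in a few lines of plain Python. The function name and the sample values are hypothetical, chosen only for this sketch.

```python
def sum_of_squared_errors(actual, predicted):
    """Return the sum of squared differences between actual and predicted values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted))


# Made-up observations and model predictions.
actual = [3.0, 5.0, 7.0]
predicted = [2.5, 5.0, 8.0]

sse = sum_of_squared_errors(actual, predicted)
print(sse)  # 0.25 + 0.0 + 1.0 = 1.25
```

A lower value means the predictions sit closer to the observations, which is why fitting a regression amounts to minimizing this number.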

Outline the workflow leading up to linear regression.

Collect data.
Data study (EDA) and cleaning.
Isolate and study relationships: how do the independent variables (features) relate to the dependent variable (scatter plots etc.)?
Modelling.

What is the conventional import for the statsmodels formula API?

import statsmodels.formula.api as smf

What is the syntax for running a linear regression analysis using statsmodels?

diamond_reg = smf.ols("price ~ carat + depth + table", data=diamonds).fit()
diamond_reg.summary()

What are the areas of interest in the summary obtained after running a linear regression analysis using statsmodels?

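The “coef” value for each independent variable.
The p-value (the “P>|t|” column) for each independent variable; if it is large (above roughly 0.05), the variable's coefficient is not statistically distinguishable from zero and the variable is probably not useful.
The R-squared value (closer to 1 is better).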

How can I use the Linear Regression analysis from statsmodels to make predictions on the data?

This creates a new column with the predicted price. Beware that “new_diamonds” must contain columns whose names match the independent variables used in the formula passed to .ols (here carat, depth and table); the match is by column name, not by column order.
new_diamonds['predicted'] = diamond_reg.predict(new_diamonds)

What is a residual?

The difference between a measurement and its model prediction.

Discuss the x axis in regression analysis.

The variable we use to explain the outcome.
The independent variable.
Explanatory variable.

Discuss the Y axis in regression analysis

The dependent variable

Response variable

What is the gradient called on a regression analysis chart?

The regression coefficient.

What is the relationship between the Pearson coefficient, and the standard deviations of the variables and gradient (regression coefficient)?

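The gradient equals the Pearson coefficient multiplied by the ratio of the standard deviations (that of y divided by that of x). If the standard deviations of the two variables are the same, the gradient and the Pearson coefficient are the same.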
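
A small sketch can make this relationship concrete. It assumes simple least-squares regression of y on x with a single independent variable; the helper name and the data are made up for illustration. Here y is a reordering of x, so both variables have the same standard deviation.

```python
import math


def slope_and_pearson(x, y):
    """Return (least-squares slope, Pearson r, ratio of std devs sd_y / sd_x)."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    slope = sxy / sxx                  # gradient of the fitted line
    r = sxy / math.sqrt(sxx * syy)     # Pearson correlation coefficient
    sd_ratio = math.sqrt(syy / sxx)    # sd_y / sd_x
    return slope, r, sd_ratio


# Hypothetical data: same values in a different order, so equal spread.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 1.0, 4.0, 3.0, 5.0]

slope, r, sd_ratio = slope_and_pearson(x, y)
print(slope, r, sd_ratio)
```

With equal standard deviations the ratio is 1, so the slope and the correlation coincide; in general the slope is r scaled by that ratio.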

What are the two main components of a statistical model?

A mathematical formula that expresses the deterministic, predictable component, and the residual error.

What is logistic regression?

A form of regression developed for proportions which ensures a curve which cannot go above 100% or below 0%.
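
The bounded curve comes from the logistic (sigmoid) function, which logistic regression applies to a linear combination of the features. The sketch below shows only that function; the name logistic and the input values are illustrative assumptions.

```python
import math


def logistic(z):
    """Map any real number z to a value strictly between 0 and 1."""
    # Two algebraically equivalent branches avoid overflow in math.exp
    # for large negative or large positive z.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


for z in (-10.0, -1.0, 0.0, 1.0, 10.0):
    print(z, logistic(z))
```

However large or small the input, the output stays between 0 and 1, which is what lets it be read as a proportion or probability.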

What is n and p when describing a data set?

n is the number of records

p is the number of attributes (parameters)

What is a 1 dimensional vector?

A column.

If it contains the outcomes that we are trying to predict, by convention it is called ‘y’.

What is a matrix of features?

The columns of data that will help us predict the vector ‘y’.
The number of rows should equal the number of rows of ‘y’.

What is a binary classification?

Predicting the outcome of one of two possibilities.

Usually these two will be classified as 1 and 0, or positive vs. null.

What is a Gini Impurity?

It measures the quality of a split in a decision tree.
Perfect splits provide a Gini Impurity of 0.

When training a decision tree, the best split is chosen by maximizing the Gini Gain, which is calculated by subtracting the weighted impurities of the branches from the original impurity.
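
The calculation described above can be sketched in plain Python. The function names and the tiny label lists are hypothetical; impurity is computed as 1 minus the sum of squared class proportions.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of a list of class labels (0.0 means perfectly pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def gini_gain(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the two branches."""
    n = len(parent)
    weighted = (len(left) / n) * gini_impurity(left) \
        + (len(right) / n) * gini_impurity(right)
    return gini_impurity(parent) - weighted


parent = [0, 0, 0, 1, 1, 1]

# A perfect split separates the classes completely.
perfect_gain = gini_gain(parent, [0, 0, 0], [1, 1, 1])

# A poor split leaves both branches mixed.
poor_gain = gini_gain(parent, [0, 0, 1], [0, 1, 1])

print(gini_impurity(parent), perfect_gain, poor_gain)
```

The perfect split recovers the whole parent impurity as gain, while the mixed split gains very little, so the first would be chosen.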

What are the steps involved in training a decision tree?

First determine the root node (the first feature to split on). This is done by trying every candidate split (every feature and every unique threshold) and computing its Gini Gain; the split with the best gain becomes the root node.
Repeat this for each child node until no split improves on the node (all candidate splits are equally good or have a Gini Gain of 0); such a node becomes a leaf node.
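
The search for a single node's best split can be sketched as follows. This is a simplified illustration for numeric features only, not a full tree builder; best_split, the "value <= threshold" rule and the sample data are assumptions made for the example.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def best_split(rows, labels):
    """Try every feature and every unique threshold.

    Returns (feature_index, threshold, gain) for the split with the
    highest Gini Gain, or (None, None, 0.0) if no split has positive gain.
    """
    parent_impurity = gini(labels)
    n = len(labels)
    best = (None, None, 0.0)
    for f in range(len(rows[0])):
        for threshold in sorted({row[f] for row in rows}):
            left = [lab for row, lab in zip(rows, labels) if row[f] <= threshold]
            right = [lab for row, lab in zip(rows, labels) if row[f] > threshold]
            if not left or not right:
                continue  # a split must send rows to both branches
            weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
            gain = parent_impurity - weighted
            if gain > best[2]:
                best = (f, threshold, gain)
    return best


# Hypothetical data: feature 0 separates the classes, feature 1 does not.
rows = [[1.0, 7.0], [2.0, 3.0], [3.0, 8.0], [4.0, 2.0]]
labels = [0, 0, 1, 1]

feature, threshold, gain = best_split(rows, labels)
print(feature, threshold, gain)
```

A full tree would call the same search again on the rows in each branch, stopping when the function reports no positive gain.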

Explain RMSE in regression modeling

Root Mean Squared Error
The standard deviation of the residuals
Ranges from 0 to infinity, with 0 being a perfect fit and the larger the number the more error
Absolute measure of goodness of fit, so it cannot be used to compare across different datasets, only between models fitted to the same dataset.
Same units as the dependent variable.
Predictions farther from the true value have a disproportionately large impact on the RMSE, so it is sensitive to outliers.
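
A short sketch of the calculation, using made-up values. The two prediction sets have the same total absolute error, but one concentrates it in a single large miss, which shows the sensitivity to outliers.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


actual = [10.0, 20.0, 30.0, 40.0]

# Every prediction is off by 1.
even_errors = [11.0, 21.0, 31.0, 41.0]

# Three perfect predictions and one that is off by 4.
one_outlier = [10.0, 20.0, 30.0, 44.0]

rmse_even = rmse(actual, even_errors)
rmse_outlier = rmse(actual, one_outlier)
print(rmse_even, rmse_outlier)
```

Squaring the errors before averaging is what makes the single large miss count for more than four small ones.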

Explain MAE in regression models

Mean Absolute Error
The average of the absolute values of the errors between the true values and the predicted values
Similar to RMSE: ranges from 0 to infinity, with 0 being a perfect fit, and is an absolute measure of fit
Predicted values farther away contribute proportionally to the MAE, so it is more robust to outliers than RMSE
Relatively intuitive and simple-to-explain regression metric
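
A minimal sketch with made-up values. Both prediction sets below have the same total absolute error, so MAE scores them identically even though one contains a single large miss.

```python
def mae(actual, predicted):
    """Mean absolute error between two equal-length sequences."""
    absolute_errors = [abs(a - p) for a, p in zip(actual, predicted)]
    return sum(absolute_errors) / len(absolute_errors)


actual = [10.0, 20.0, 30.0, 40.0]

# Every prediction is off by 1.
even_errors = [11.0, 21.0, 31.0, 41.0]

# Three perfect predictions and one that is off by 4.
one_outlier = [10.0, 20.0, 30.0, 44.0]

mae_even = mae(actual, even_errors)
mae_outlier = mae(actual, one_outlier)
print(mae_even, mae_outlier)
```

Because each error contributes in proportion to its size, one large miss does not dominate the metric.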

Explain R2 in regression models

Coefficient of Determination
Proportion of variance in the dependent variable that is predicted by the independent variables
Typically ranges from 0 to 1, with 1 being a perfect fit and 0 being no better than predicting the dependent variable with its mean
Relative measure of goodness of fit
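
A sketch of the usual formula, 1 minus the ratio of the residual sum of squares to the total sum of squares, with made-up values. It also shows why the range is only "typically" 0 to 1: predictions worse than the mean give a negative value.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total.

    Assumes the actual values are not all identical (SS_total > 0).
    """
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


actual = [1.0, 2.0, 3.0, 4.0, 5.0]

r2_perfect = r_squared(actual, actual)                      # exact predictions
r2_mean = r_squared(actual, [3.0] * 5)                      # always predict the mean
r2_worse = r_squared(actual, [5.0, 4.0, 3.0, 2.0, 1.0])     # worse than the mean

print(r2_perfect, r2_mean, r2_worse)
```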

What is the "feature" and "target" in machine learning?

Features (X) are the things that we use to predict the target (y).

When should you use a train/test split when fitting a model?

Always.
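
To illustrate the idea, here is a minimal split written in plain Python. The helper simple_train_test_split, its parameters and the sample rows are hypothetical and exist only for this sketch; libraries such as scikit-learn provide their own version.

```python
import random


def simple_train_test_split(rows, test_fraction=0.25, seed=0):
    """Randomly hold out a fraction of rows for testing.

    Returns (train_rows, test_rows); original row order is kept in each part.
    """
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    n_test = int(round(len(rows) * test_fraction))
    test_idx = set(indices[:n_test])
    train = [rows[i] for i in range(len(rows)) if i not in test_idx]
    test = [rows[i] for i in range(len(rows)) if i in test_idx]
    return train, test


rows = list(range(8))  # stand-in for 8 records
train, test = simple_train_test_split(rows, test_fraction=0.25, seed=0)
print(len(train), len(test))
```

The model is fitted on the training rows only and scored on the held-out rows, so the score reflects data the model has not seen.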

(28 cards)