Data Modeling Flashcards
When is linear regression the correct choice for a data model?
When there is a linear relationship between the dependent variable and the independent variable(s).
What are two ways you can check that linear regression is an appropriate data model?
Run np.corrcoef(x, y1). It returns a correlation matrix; an off-diagonal value close to 1 or -1 indicates a strong linear relationship.
Run plt.scatter(x, y1) and check visually whether the points follow a straight line.
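A minimal sketch of the correlation check, using made-up data (the arrays x, y1 and y2 here are synthetic, not from any real dataset). It also shows why the scatter plot is still worth doing: a clearly patterned but curved relationship can give a coefficient near 0.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)

# Synthetic examples (assumed for illustration only)
y1 = 3.0 * x + 2.0 + rng.normal(scale=1.0, size=x.size)  # roughly linear
y2 = (x - 5.0) ** 2                                      # curved, symmetric about x = 5

# np.corrcoef returns a 2x2 correlation matrix;
# the off-diagonal entry [0, 1] is the Pearson coefficient between the two inputs.
r_linear = np.corrcoef(x, y1)[0, 1]
r_curved = np.corrcoef(x, y2)[0, 1]

print(f"r for the linear data: {r_linear:.3f}")
print(f"r for the curved data: {r_curved:.3f}")
```

The curved data has an obvious pattern that plt.scatter(x, y2) would reveal, yet its coefficient is close to 0, so a low coefficient only rules out a linear relationship, not a relationship in general.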
What is linear regression with multiple independent variables called?
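A multiple linear regression model. (It is sometimes loosely called multivariate, although strictly that term refers to models with more than one dependent variable.)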
What is the definition of a “cost function”?
A cost function is a function that takes in the regression model and outputs a number indicating the quality of the fit. Lower costs are better than higher costs.
What is a common cost function?
Sum of squared errors.
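A small sketch of the sum of squared errors, comparing two candidate lines on the same toy data. The helper name sum_squared_errors and the data are assumptions made for illustration.

```python
import numpy as np

def sum_squared_errors(y_true, y_pred):
    """Return the sum of squared differences between observations and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sum((y_true - y_pred) ** 2))

# Toy data that lies exactly on y = 2x
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])

good_fit = 2.0 * x        # candidate model that matches the data exactly
poor_fit = 1.0 * x + 1.0  # candidate model with the wrong slope and intercept

sse_good = sum_squared_errors(y, good_fit)
sse_poor = sum_squared_errors(y, poor_fit)

print(sse_good, sse_poor)  # the better fit has the lower cost
```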
Outline the workflow leading up to linear regression.
- Collect data
- Study (EDA) and clean the data
- Isolate and study relationships: how do the independent variables (features) relate to the dependent variable (scatter plots etc.)?
- Modelling
What is the conventional way to import statsmodels?
import statsmodels.formula.api as smf
What is the syntax for running a linear regression analysis using statsmodels?
diamond_reg = smf.ols("price ~ carat + depth + table", data=diamonds).fit()
diamond_reg.summary()
What are the areas of interest in the summary obtained after running a linear regression analysis using statsmodels?
The “coef” value for each independent variable.
The p-value (the “P>|t|” column) for each independent variable: if it is large (above 0.05 or so), the variable is not statistically significant and is probably not useful in the model.
The R squared value (closer to 1 is better).
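The same three quantities can also be read directly from the fitted result rather than from the printed summary. A sketch on synthetic data follows; the DataFrame df, its values, and the "true" coefficients are invented for illustration and are not the diamonds data.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data (assumed): price depends on carat but not on depth
rng = np.random.default_rng(42)
n = 200
df = pd.DataFrame({
    "carat": rng.uniform(0.2, 2.0, n),
    "depth": rng.uniform(55.0, 70.0, n),
})
df["price"] = 500.0 + 4000.0 * df["carat"] + rng.normal(0.0, 300.0, n)

reg = smf.ols("price ~ carat + depth", data=df).fit()

coefs = reg.params       # the "coef" column, indexed by variable name
p_values = reg.pvalues   # the "P>|t|" column, indexed by variable name
r_squared = reg.rsquared # the R-squared value

print(coefs)
print(p_values)
print(f"R-squared: {r_squared:.3f}")
```

Because depth was generated independently of price, its p-value will usually (though not always) come out large, while the p-value for carat should be very small.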
How can I use a fitted statsmodels linear regression to make predictions on new data?
The line below creates a new column with the predicted price. Beware that the column names of “new_diamonds” in this case must match the names of the independent variables used in the formula passed to .ols; the formula looks columns up by name, so their order does not matter.
new_diamonds['predicted'] = diamond_reg.predict(new_diamonds)
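A sketch of prediction on synthetic data (the frames train and new and all values are assumptions for illustration). It checks the prediction against a manual calculation from the fitted coefficients, since the formula interface resolves variables by column name.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic training data (assumed)
rng = np.random.default_rng(1)
n = 100
train = pd.DataFrame({
    "carat": rng.uniform(0.2, 2.0, n),
    "depth": rng.uniform(55.0, 70.0, n),
})
train["price"] = (500.0 + 4000.0 * train["carat"] - 10.0 * train["depth"]
                  + rng.normal(0.0, 200.0, n))

reg = smf.ols("price ~ carat + depth", data=train).fit()

# New observations: the columns carry the same names as in the formula
new = pd.DataFrame({"depth": [60.0, 62.0], "carat": [0.5, 1.5]})
new["predicted"] = reg.predict(new)

# Manual check: intercept + coefficient * value for each named variable
manual = (reg.params["Intercept"]
          + reg.params["carat"] * new["carat"]
          + reg.params["depth"] * new["depth"])

print(new)
```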
What is a residual?
The difference between a measurement and its model prediction.
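A minimal sketch of computing residuals for a straight-line fit, using np.polyfit on toy data (the values are invented). The sign convention assumed here is observed minus predicted.

```python
import numpy as np

# Toy measurements (assumed for illustration)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Least-squares straight line: polyfit returns [slope, intercept] for deg=1
slope, intercept = np.polyfit(x, y, deg=1)

predicted = slope * x + intercept
residuals = y - predicted  # one residual per measurement

print(residuals)
print(residuals.sum())  # for a least-squares line with an intercept this is ~0
```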
Discuss the x axis in regression analysis.
The thing we use to explain the response.
The independent variable.
The explanatory variable.
Discuss the y axis in regression analysis.
The dependent variable.
The response variable.
What is the gradient called on a regression analysis chart?
The regression coefficient.
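What is the relationship between the Pearson coefficient, the standard deviations of the variables, and the gradient (regression coefficient)?
For a regression with one independent variable, the gradient equals the Pearson coefficient multiplied by the ratio of the standard deviations: gradient = r × (s_y / s_x). If the standard deviations of the variables are the same, the gradient and the Pearson coefficient are the same.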
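A numerical check of this relationship on synthetic data (the arrays and their parameters are assumptions for illustration). Standardising both variables gives them equal standard deviations, so the fitted gradient should then equal the Pearson coefficient.

```python
import numpy as np

# Synthetic data (assumed)
rng = np.random.default_rng(7)
x = rng.normal(10.0, 2.0, 500)
y = 1.5 * x + rng.normal(0.0, 3.0, 500)

r = np.corrcoef(x, y)[0, 1]          # Pearson coefficient
slope = np.polyfit(x, y, deg=1)[0]   # gradient of the least-squares line

# Gradient rebuilt from r and the ratio of standard deviations
slope_from_r = r * y.std(ddof=1) / x.std(ddof=1)

# Standardise both variables so each has standard deviation 1
zx = (x - x.mean()) / x.std(ddof=1)
zy = (y - y.mean()) / y.std(ddof=1)
slope_standardised = np.polyfit(zx, zy, deg=1)[0]

print(slope, slope_from_r)
print(slope_standardised, r)
```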