Data Science using Python and R - 11 Flashcards
What are the two remaining tasks in the Modeling Phase?
Estimation task and Association task
What is the most widespread method for performing the estimation task?
Linear regression
What does simple linear regression approximate?
The relationship between a numeric predictor and a continuous target using a straight line
What does multiple regression modeling approximate?
The relationship between a set of p > 1 predictors and a single continuous target using a p-dimensional plane or hyperplane
What is the usual form of the multiple regression model equation?
y = β0 + β1x1 + β2x2 + … + βpxp
What do the x’s and β’s represent in the multiple regression model?
x’s represent predictor variables and β’s represent unknown model parameters
What approach does the Data Science Methodology use to validate model results?
Cross-validation
What does the regression equation in descriptive regression modeling look like?
ŷ = b0 + b1x1 + b2x2 + … + bpxp
What is the provisional regression equation for estimating Sales_per_Visit?
Sales per Visit = b0 + b1(Days) + b2(CC) + b3(Web)
What is the typical p-value cutoff for retaining variables in a regression model?
0.05
Which predictor variable was determined to not belong in the model?
Web Account
What is the final regression model after removing Web Account?
Sales per Visit = 73.6209 + 0.1637(Days) + 22.1357(CC)
How much does having a store credit card increase Sales per Visit?
$22.1357
What is the estimated increase in Sales per Visit for each additional day between purchases?
$0.1637
What method is used to fit a multiple regression model in Python?
Ordinary Least Squares (OLS)
Which command is used to add a constant term to predictor variables in Python?
sm.add_constant(X)
What is the purpose of the summary() command in regression modeling?
To obtain the results of the regression model
In R, how do you specify the formula for a linear regression model?
lm(formula = Target ~ Predictors, data = dataset)
What is the standard error of the estimate denoted as?
s
What is the Mean Absolute Error (MAE) used for?
To measure the average distance between actual and predicted values
What does R² represent in regression modeling?
The proportion of variability in the response accounted for by the predictors
What does Radj² penalize in multiple regression models?
Having too many unhelpful predictors
What was the Radj² value for the regression model discussed?
0.064
Fill in the blank: The estimated sales per visit for a customer base is $73.6209 plus _______ times the number of days between purchases.
$0.1637
True or False: The typical size of the prediction error is given by the statistic s.
True
What does Radj² represent in a regression model?
The proportion of the variability in the response accounted for by the predictors, adjusted for the number of predictors.
What is the value of Radj² in the regression model discussed?
0.064
What does a Radj² value of 0.064 indicate about the model’s predictors?
6.4% of the variability in Sales per Visit is accounted for by the predictors Days since Purchase and Credit Card.
How do you specify variable values for a customer in a Python regression model?
Use the np.column_stack() command with the values in the order Constant, CC, Days.
What command is used to predict sales using the regression model in Python?
model02.predict(cust01)
How do you obtain predicted values for all customers in the test data set in Python?
Change the input of the predict() command to the test data predictor variable data frame, X_test.
What is the formula to calculate the standard error of the estimate in Python?
np.sqrt(model02.scale)
What is the purpose of the MAE in regression analysis?
To measure the average magnitude of the errors between predicted and actual values.
What is the calculated MAE from the regression model discussed?
53.39
What command is used to run stepwise regression in R?
stepAIC(object = model01)
What does stepwise regression do?
It adds helpful predictors into the model one at a time and checks if they all still belong.
What is a common baseline model for regression comparisons?
The simple y = y model.
How do you compute MAE for the baseline model?
MAE = Error / n
What conclusion can be drawn if MAERegression is less than MAEBaseline?
The regression model outperforms the baseline model.
What is the value of MAEBaseline calculated in the example?
55.53
What is the relationship between Customer A and Customer B in terms of estimated sales per visit?
Customer A has 100 more days between purchases than Customer B.
What is the significance of the variable names in the data frame when predicting in R?
They must exactly match the names of the predictor variables in the model.
True or False: Stepwise regression always finds the optimal set of predictors.
False
What is the first step to perform stepwise regression using R?
Install and open the MASS package.
What command in R generates the predicted number of sales per visit?
predict(object = model02, newdata = cust01)
What does the ‘Residual standard error’ indicate in the output from summary() in R?
It represents the standard error of the estimate.
What is the typical result of using stepwise regression on a dataset?
It converges on a final model by removing unhelpful predictors.
What is the MAE calculated from the regression model based on the test data set?
53.39
What does a higher MAE for the baseline model indicate?
It indicates that the baseline model performed worse than the regression model.
In stepwise regression, what happens if a previously helpful predictor is added that is no longer helpful?
It may be dropped from the model.
What is the importance of verifying that the regression model outperforms the baseline model?
To ensure that the predictors are helpful in estimating the response.
What is the command to calculate MAE in R using the MLmetrics package?
MAE(y_pred = ypred, y_true = ytrue)
What is the significance of the variable ‘Web’ in the context of the regression model?
It may not belong in the model based on stepwise regression results.
What does the command summary(model01_step) provide?
It gives the full summary of the final model after stepwise regression.
What should you do after running a regression model with the training set?
Validate the model using the test data set.
What does the coefficient for Sex represent in regression analysis?
The coefficient for Sex indicates the expected change in the dependent variable for a one-unit increase in the Sex variable, holding all other variables constant.
What does the coefficient for Education indicate in regression analysis?
The coefficient for Education shows the expected change in the dependent variable for a one-unit increase in the Education variable, holding all other variables constant.
What is the significance of the value of s² in regression analysis?
The value of s² represents the variance of the residuals, indicating how much the predicted values deviate from the actual values.
What does Radj represent in regression analysis?
Radj (Adjusted R-squared) measures the proportion of variance explained by the model while adjusting for the number of predictors used.
How do you determine if the regression model outperformed its baseline model?
Compare MAEBaseline and MAERegression; if MAERegression is lower than MAEBaseline, the regression model outperformed the baseline.
What is the purpose of using the white_wine_training and white_wine_test data sets?
These data sets are used to train and validate the regression model predicting Quality based on Alcohol and Sugar.
What should you do after running a regression predicting Quality based on Alcohol and Sugar?
Obtain a summary of the model to evaluate whether both predictors belong in the model.
Complete the sentence: ‘The estimated Quality equals….’
The estimated Quality equals the regression equation using the coefficients for Alcohol and Sugar.
What does the coefficient for Alcohol indicate in the regression model?
The coefficient for Alcohol indicates the expected change in Quality for a one-unit increase in Alcohol, holding Sugar constant.
What does the coefficient for Sugar indicate in the regression model?
The coefficient for Sugar indicates the expected change in Quality for a one-unit increase in Sugar, holding Alcohol constant.