Data Mining - Chapter 6 (Regression) Flashcards
What is the formula of a multiple linear regression model?
y = b0 + b1x1 + b2x2 + … + bpxp + e
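The formula can be checked with a quick sketch. The intercept, slopes, and predictor values below are made up for illustration; the noise term e is set to zero so the result is deterministic.

```python
# Sketch: evaluating y = b0 + b1*x1 + b2*x2 + e for a single record.
b0, b = 2.0, [0.5, -1.2]   # hypothetical intercept and slopes
x = [3.0, 1.0]             # hypothetical predictor values for one record
e = 0.0                    # noise term, zero here for a deterministic check

y = b0 + sum(bj * xj for bj, xj in zip(b, x)) + e
print(y)  # roughly 2.3 (= 2.0 + 1.5 - 1.2)
```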
What is the multiple linear regression model?
It is a model used to fit a relationship between a numerical outcome variable (Y) and a set of predictors (X1, X2, etc).
What are the two objectives of using multiple linear regression?
- Explanatory
Explaining or quantifying the average effect of the inputs on an outcome. (Mostly used in statistics; the goal is to find the model that best explains the underlying relationship in a population.)
- Predictive
Predicting the outcome value for new records, given their input values; the goal is to find the model that best predicts new individual records. (Useful for decision-making.)
What are the four main characteristics of an explanatory multiple regression model?
- A good model is one that fits the data closely
- The entire dataset is used for estimating the best-fit model, to maximize the amount of information we have about the population.
- Performance measures for this model measure how closely the model fits the data and how strong the average relationship is.
- Focus is on the coefficients (b)
What are the four main characteristics of a predictive multiple regression model?
- A good model is one that predicts new records accurately.
- The dataset is split into a training set and a validation/test set.
- Performance measures for this model measure the predictive accuracy.
- Focus is on the predictions (y-hat).
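The predictive workflow above can be sketched end to end: split the records, fit on the training set only, and measure predictive accuracy (here RMSE) on the held-out validation set. The data is synthetic and the 60/40 split ratio is an arbitrary illustrative choice.

```python
import numpy as np

# Synthetic data: y depends linearly on three predictors plus small noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 1.0 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

# Split records into training and validation sets.
idx = rng.permutation(100)
train, valid = idx[:60], idx[60:]

# Fit OLS on the training rows only (the column of ones adds the intercept b0).
Xt = np.column_stack([np.ones(len(train)), X[train]])
coef, *_ = np.linalg.lstsq(Xt, y[train], rcond=None)

# Predictive accuracy on new (held-out) records: RMSE of y-hat vs. actual y.
Xv = np.column_stack([np.ones(len(valid)), X[valid]])
rmse = np.sqrt(np.mean((y[valid] - Xv @ coef) ** 2))
print(rmse)  # close to the noise scale of 0.1
```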
Which method can be used to estimate the unknown parameters in a regression model?
Ordinary Least Squares (OLS)
-> It minimizes the errors associated with predicting values for the dependent variable.
Why does OLS use a least squares criterion?
You are looking at the deviations between the observed and the predicted values. If you did not square those deviations, the positive and the negative deviations would cancel each other out.
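The cancellation can be shown directly: for an OLS fit with an intercept, the raw residuals sum to (essentially) zero no matter how large the individual errors are, so only the squared deviations give a usable criterion. The data below is synthetic.

```python
import numpy as np

# Synthetic data from a known line plus noise.
rng = np.random.default_rng(1)
x = rng.normal(size=50)
y = 3.0 + 2.0 * x + rng.normal(scale=1.0, size=50)

# OLS fit (column of ones adds the intercept).
X = np.column_stack([np.ones(50), x])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ b

print(resid.sum())         # essentially 0: positive and negative deviations cancel
print((resid ** 2).sum())  # the sum of squared deviations stays positive
```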
How do you estimate the error for a single outcome?
ei = Yi - Yhat_i (the actual observation minus the predicted outcome)
How do you estimate the total error of a multiple regression model?
SSE = sum_i (Yi - Yhat_i)^2 (the sum of squared errors over all records)
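A minimal sketch of the total-error computation, using hypothetical actual and predicted values:

```python
# SSE for three hypothetical records: sum of squared (actual - predicted) errors.
y_actual = [3.0, 5.0, 4.0]
y_hat    = [2.5, 5.5, 4.0]

sse = sum((yi - yhi) ** 2 for yi, yhi in zip(y_actual, y_hat))
print(sse)  # 0.25 + 0.25 + 0.0 = 0.5
```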
Which 4 assumptions do we make when using a multiple linear regression model for prediction?
- The noise e follows a normal distribution
- The choice of predictors and their form is correct (linearity)
- The records are independent of each other
- The variability in the outcome values for a given set of predictors is the same regardless of the values of the predictors (homoskedasticity)
What are reasons to reduce the number of predictors in your model?
- May be expensive or not feasible to collect the data for all those predictors.
- Might be unable to measure all these predictors accurately.
- Parsimony -> with many predictors we get less insight into the influence of the individual predictors.
- Multicollinearity
- Using predictors that are uncorrelated with the outcome variable increases the variance of the predictions.
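The multicollinearity problem can be illustrated numerically: when one predictor is nearly a copy of another, the design matrix is close to singular and the coefficient estimates become unstable. A common diagnostic is the condition number of the design matrix; the data below is synthetic.

```python
import numpy as np

# Two designs: one with independent predictors, one with near-collinear predictors.
rng = np.random.default_rng(2)
x1 = rng.normal(size=100)
x2_indep = rng.normal(size=100)                    # unrelated second predictor
x2_coll = x1 + rng.normal(scale=0.01, size=100)    # near-copy of x1 (collinear)

cond_ok = np.linalg.cond(np.column_stack([x1, x2_indep]))
cond_bad = np.linalg.cond(np.column_stack([x1, x2_coll]))
print(cond_ok, cond_bad)  # the collinear design has a far larger condition number
```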