Regression Flashcards
What is regression?
- Predict the value of a continuous variable based on other variables assuming a linear or nonlinear model of dependency
Predicted variable (dependent variable) = ŷ (y hat); the other variables are explanatory variables
What's the difference between regression and classification?
- Classification predicts nominal (categorical) class attributes, whereas regression predicts continuous attributes
What are regression techniques?
1) Linear Regression
2) Polynomial Regression
3) Local Regression
4) Artificial Neural Networks (ANNs)
5) Deep Neural Networks (DNNs)
6) K-nearest-Neighbours Regression
Explain k-nearest neighbor regression
- Use the numeric average of the target values of the k nearest neighbours (e.g. the k nearest weather stations)
Choose k values between 1 and 20 (Rule of thumb)
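The rule above can be sketched in a few lines. This is a minimal, illustrative pure-Python sketch; the training data, feature format, and default k are assumptions, not from the flashcards.

```python
def knn_regress(train, query, k=3):
    """Predict the numeric average of the targets of the k nearest neighbours.

    train: list of (feature_vector, target) pairs
    query: feature vector to predict for
    """
    def dist(a, b):
        # Euclidean distance between two feature vectors
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    # Sort training examples by distance to the query point, keep the k nearest
    nearest = sorted(train, key=lambda pair: dist(pair[0], query))[:k]
    # Prediction = numeric average of the k nearest target values
    return sum(target for _, target in nearest) / k

# Illustrative 1-D training data
train = [((0.0,), 1.0), ((1.0,), 2.0), ((2.0,), 3.0), ((10.0,), 20.0)]
print(knn_regress(train, (1.5,), k=3))  # average of the 3 nearest targets
```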
How can you evaluate regression models?
Methods for Model Evaluation:
- Cross Validation: 10-fold
- Holdout Validation: 80 % random share for training, 20% for testing
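The 80/20 holdout split described above can be sketched as follows; the helper name, the seed, and the toy data are illustrative assumptions.

```python
import random

def holdout_split(examples, train_share=0.8, seed=42):
    """Random holdout split: train_share for training, the rest for testing."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)  # fixed seed for reproducibility
    cut = int(len(shuffled) * train_share)
    return shuffled[:cut], shuffled[cut:]

data = list(range(100))
train, test = holdout_split(data)
print(len(train), len(test))  # 80 training examples, 20 test examples
```

For 10-fold cross validation, the same idea is repeated: the data is split into 10 folds, and each fold serves once as the test set.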
What metrics for Model Evaluation can be applied?
- Mean Absolute Error (MAE): computes the average deviation between predicted value and actual value
- Mean Squared Error (MSE): Places more emphasis on larger deviation
- Root Mean Squared Error (RMSE): similar scale as MAE with more emphasis on larger deviations
- Pearson's Correlation Coefficient: scores well if high (low) actual values get high (low) predictions
- R Squared (R²): measures the part of the variation in y that is explainable from the explanatory variables
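The four error-based metrics follow directly from their definitions. A minimal sketch (the example values are illustrative):

```python
def mae(y_true, y_pred):
    # Mean Absolute Error: average absolute deviation
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mse(y_true, y_pred):
    # Mean Squared Error: squaring emphasizes larger deviations
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    # Root Mean Squared Error: back on a similar scale as MAE
    return mse(y_true, y_pred) ** 0.5

def r_squared(y_true, y_pred):
    # R² = 1 - (residual sum of squares / total sum of squares)
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.5, 5.0, 3.0, 8.0]
print(mae(y_true, y_pred), mse(y_true, y_pred), rmse(y_true, y_pred))
```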
How do you interpret R²?
R² = 1: perfect model, as the total variation of y can be completely explained from X
R² = 0: the model explains none of the variation in y
How can you apply regression trees?
- In principle the same as for classification
Differences:
1) splits are selected by maximizing the MSE reduction (not the Gini index or entropy)
2) the prediction is the average target value of the training examples in a leaf
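Difference 1 can be sketched as a split search over a single numeric feature. This is an illustrative sketch, not a full tree learner; it uses the sum of squared errors (MSE times n), which ranks splits identically.

```python
def sse(values):
    """Sum of squared errors around the mean (n times the MSE)."""
    if not values:
        return 0.0
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

def best_split(xs, ys):
    """Pick the threshold on feature xs that maximizes the error reduction."""
    best = (None, -1.0)
    for t in sorted(set(xs))[:-1]:  # candidate thresholds between values
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        reduction = sse(ys) - sse(left) - sse(right)
        if reduction > best[1]:
            best = (t, reduction)
    return best

# Illustrative data: two clearly separated groups of target values
xs = [1, 2, 3, 10, 11, 12]
ys = [1.0, 1.0, 1.0, 9.0, 9.0, 9.0]
print(best_split(xs, ys))  # splits at x <= 3, separating the two groups
```

Difference 2 then corresponds to predicting `sum(leaf) / len(leaf)` for the examples that end up in each leaf.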
What may happen if your tree has a higher depth?
- It may overfit
- The model fits individual outliers in the training data
What is the assumption of linear regression?
- The target variable y is linearly dependent on explanatory variables x
How do you fit a regression function?
Least-squares approach: find the weight vector that minimizes the sum of squared errors over all training examples
Error: difference between the estimated and the actual value in the training data
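For simple linear regression (one explanatory variable), the least-squares weights have a closed form. A minimal sketch with illustrative data:

```python
def fit_line(xs, ys):
    """Closed-form least squares for y = w0 + w1 * x."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    # Slope: covariance of x and y divided by the variance of x
    w1 = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
          / sum((x - mx) ** 2 for x in xs))
    # Intercept: the fitted line passes through the mean point (mx, my)
    w0 = my - w1 * mx
    return w0, w1

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # lies exactly on y = 1 + 2x
w0, w1 = fit_line(xs, ys)
print(w0, w1)
```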
What is ridge regularization?
- Variation of the least-squares approach (another way to fit a regression function)
- Tries to avoid overfitting by keeping the weights small
alpha = 0: ordinary least squares regression
alpha = 100: strongly regularized, flat curve (strong penalty)
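The effect of alpha can be seen in the simplest ridge setting: one feature, no intercept. This sketch uses the closed-form solution for that illustrative special case; the data is made up.

```python
def ridge_weight(xs, ys, alpha):
    """Ridge solution for y ≈ w * x (single feature, no intercept):
    minimizes sum((y - w*x)^2) + alpha * w^2, giving w = Σxy / (Σx² + alpha)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + alpha)

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]  # lies exactly on y = 2x

print(ridge_weight(xs, ys, 0))    # alpha = 0: ordinary least squares, w = 2
print(ridge_weight(xs, ys, 100))  # alpha = 100: weight shrunk strongly toward 0
```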
What problems can occur by feature selection for regression?
Problem 1: highly correlated variables (e.g. height in cm and height in inches)
- the weights become meaningless; one of the variables should be removed
Problem 2: insignificant variables
- uncorrelated variables are assigned w = 0 or relatively small weights
How can you check if a variable with a small weight really is insignificant?
- Statistical test with H0: w = 0 (the variable is insignificant)
- t-stat: number of standard deviations that w is away from 0 (the center of the distribution); a high t-stat means H0 should be rejected
- p-value: probability of wrongly rejecting H0 (a p-value close to 0 means the variable is significant)
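For simple linear regression, the t-stat of the slope can be computed directly: the estimated weight divided by its standard error. A sketch with illustrative data (the p-value would then come from the t-distribution, omitted here):

```python
import math

def slope_t_stat(xs, ys):
    """t-statistic for the slope w1 of y = w0 + w1 * x, testing H0: w1 = 0."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    w1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    w0 = my - w1 * mx
    # Standard error of w1 from the residual variance (n - 2 degrees of freedom)
    residuals = [y - (w0 + w1 * x) for x, y in zip(xs, ys)]
    se = math.sqrt(sum(r * r for r in residuals) / (n - 2) / sxx)
    return w1 / se

# Strong linear relationship -> w1 is many standard deviations from 0,
# so H0 (the variable is insignificant) is rejected
t = slope_t_stat([1.0, 2.0, 3.0, 4.0, 5.0], [2.1, 3.9, 6.2, 7.8, 10.1])
print(round(t, 1))
```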
What does Interpolation mean?
Interpolating regression:
- predicted values lie within the range of the training data values
- is regarded as safe (in contrast to extrapolation, which predicts outside that range)