Linear Regression Flashcards
Our job as Machine Learning experts
• Choose a model suitable for classifying the data according to the attributes
• Choose attributes suitable for classifying the data according to the model
• Tune the hyper-parameters of the model
Hyper-parameter tuning
• Compared to the choice of learner and feature representation, the improvement due to “parameter tuning” is typically small
- Usually used as a final stage, to get slightly higher Accuracy with respect to the development data
- Because we are evaluating lots of models, there is a risk of “over-tuning”
- The best choice of hyper-parameters for the development data may not be the best choice of hyper-parameters on the test data
- Hyper-parameter tuning is typically done via grid search (see the sketch below)
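A minimal sketch of grid search over hyper-parameters, assuming a scikit-learn kNN classifier and a separate development set; the parameter grid and variable names are illustrative only.

```python
# Illustrative grid search: pick the hyper-parameters that score best on the dev data.
from itertools import product
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

def grid_search(X_train, y_train, X_dev, y_dev):
    grid = {"n_neighbors": [1, 3, 5, 7], "weights": ["uniform", "distance"]}
    best_acc, best_params = -1.0, None
    for n, w in product(grid["n_neighbors"], grid["weights"]):
        model = KNeighborsClassifier(n_neighbors=n, weights=w).fit(X_train, y_train)
        acc = accuracy_score(y_dev, model.predict(X_dev))  # evaluate on development data
        if acc > best_acc:
            best_acc, best_params = acc, {"n_neighbors": n, "weights": w}
    # Over-tuning risk: the best params on the dev data may not be best on the test data.
    return best_params, best_acc
```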
ML can be viewed as?
optimisation problem
Maximise D(L, θ; F(T)), given an evaluation metric D (like Accuracy), a dataset T, a feature representation F(T), and a learner L with hyper-parameters θ.
Holding F(T) and L fixed, tuning amounts to optimising θ:
θ̂ = argmin_{θ ∈ Θ} Error(θ; L, F(T))
Linear Regression
continuous attributes -> continuous class
Linear regression captures the relationship between two variables, under the assumption that the relationship between them is linear:
- An outcome variable (aka response variable, dependent variable, or label)
- A predictor (aka independent variable, explanatory variable, or feature)
For a single predictor, the model is a line: ŷ = a·x + b
How to choose the best line in Linear Regression?
(1) The line that minimises the distance between all points and the line (Euclidean distance)
(2) Least squares estimation: find the line that minimises the sum of the squares of the vertical distances between the predicted ŷᵢ and the actual yᵢ
• Minimise the Residual Sum of Squares (RSS), aka Sum of Squares Due to Error (SSE): RSS = Σᵢ (yᵢ − ŷᵢ)²
• All attributes are numerical → Grid Search is :-( (not suitable here)
• Partial derivatives can be (easily!) calculated
• (RSS is convex — the local optimum is a global minimum)
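Because the partial derivatives of the RSS can be set to zero and solved directly, simple linear regression also has a closed-form least-squares solution. A minimal sketch, assuming NumPy and illustrative data:

```python
import numpy as np

def least_squares_fit(x, y):
    """Fit y ≈ a*x + b by minimising RSS = Σ (y_i - (a*x_i + b))^2."""
    x_mean, y_mean = x.mean(), y.mean()
    # Setting dRSS/da = 0 and dRSS/db = 0 gives:
    a = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    b = y_mean - a * x_mean
    return a, b

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 4.0, 6.2, 7.9])        # roughly y = 2x
a, b = least_squares_fit(x, y)
rss = np.sum((y - (a * x + b)) ** 2)       # Residual Sum of Squares of the fitted line
```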
How to find the line that has the lowest RSS then?
–> Gradient Descent
Gradient Descent
Gradient descent is an iterative optimisation algorithm for minimising a cost function, and can be applied to minimising the SSE.
We need to pick a value of a and a value of b that minimise the cost function.
If we treat the RSS/SSE as a cost, then every choice of (a, b) gives a certain cost f(a, b); since this function is convex, gradient descent can find the (a, b) that minimise it.
Iterative approximation to Error optimisation
Steps in the Gradient Descent algorithm involve:
• making a prediction for each (training) instance
• comparing the prediction with the actual value
• multiplying by the corresponding attribute value
• updating the weights after all of the training instances have been processed
–> the evaluation metric is effectively built into the model, since we compare the predictions with the actual values during training anyway
Gradient descent takes as input some seed values of a and b and iteratively improves them until we reach the minimum cost, giving us the optimal parameters a_optimal and b_optimal, and hence the desired line y = a_optimal·x + b_optimal (see the sketch below).
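A minimal sketch of batch gradient descent for simple linear regression, assuming NumPy; the seed values, learning rate α, and iteration count are illustrative:

```python
import numpy as np

def gradient_descent(x, y, alpha=0.01, n_iters=1000):
    """Minimise SSE = Σ (y_i - (a*x_i + b))^2 by batch gradient descent."""
    a, b = 0.0, 0.0                             # seed values for the parameters
    n = len(x)
    for _ in range(n_iters):
        y_hat = a * x + b                       # prediction for each training instance
        error = y_hat - y                       # compare prediction with actual value
        grad_a = (2.0 / n) * np.sum(error * x)  # multiply by the corresponding attribute value
        grad_b = (2.0 / n) * np.sum(error)
        a -= alpha * grad_a                     # update weights after all instances processed
        b -= alpha * grad_b
    return a, b
```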
α in GD
- α is a parameter of the algorithm, representing the learning rate (how big a step you take in updating θi).
- If α is too small, the algorithm converges very slowly.
- If α is too large, you might overshoot (miss) the minimum, and the algorithm may fail to converge.
Evaluation of Numeric Prediction
• It clearly doesn’t make sense to evaluate numeric prediction tasks in the same manner as classification tasks, because:
• “direct hits” (exact matches between predicted and actual values) are an unreasonable expectation
• unlike classification, we can make use of the inherent “ordering” and “scale” of the outputs
- RSS
- MSE
- RMSE
- RRSE
- Correlation Coefficient
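A minimal sketch of these metrics, assuming NumPy arrays of actual values y and predictions y_hat; RRSE is taken relative to the baseline that always predicts the mean:

```python
import numpy as np

def regression_metrics(y, y_hat):
    rss = np.sum((y - y_hat) ** 2)                    # Residual Sum of Squares
    mse = rss / len(y)                                # Mean Squared Error
    rmse = np.sqrt(mse)                               # Root Mean Squared Error
    # Root Relative Squared Error: error relative to always predicting the mean of y
    rrse = np.sqrt(rss / np.sum((y - y.mean()) ** 2))
    corr = np.corrcoef(y, y_hat)[0, 1]                # Pearson correlation coefficient
    return {"RSS": rss, "MSE": mse, "RMSE": rmse, "RRSE": rrse, "r": corr}
```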
Which Evaluation Metric to Use?
The relative ranking of methods is reasonably stable across the different metrics, so the actual choice of metric isn't crucial.
Non-linear methods for numeric prediction
- regression trees
- model trees (generalised regression trees)
- locally weighted linear regression
- support vector regression
What to do with discrete attributes in Linear Regression?
Binarization: convert each discrete attribute into one or more binary (0/1) attributes, as sketched below.
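A minimal sketch of binarizing a discrete attribute via one-hot encoding, assuming plain Python and an illustrative “colour” attribute:

```python
def binarize(values, categories=None):
    """Convert a discrete attribute into one binary (0/1) attribute per category."""
    if categories is None:
        categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]

# e.g. a "colour" attribute with values red/green/blue
encoded = binarize(["red", "blue", "red", "green"])
# -> [[0, 0, 1], [1, 0, 0], [0, 0, 1], [0, 1, 0]]   (columns: blue, green, red)
```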