Linear Models Flashcards
Linear predictors/models
Linear functions: Ld = {hw,b : w in R^d, b in R}
Linear predictor: hw,b(x) = <w,x> + b = ( SUM wi xi ) + b
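A minimal Python/NumPy sketch of evaluating such a predictor (w, b, and x below are illustrative values, not from the source):

import numpy as np

# Illustrative weight vector w in R^3 and bias b
w = np.array([2.0, -1.0, 0.5])
b = 0.3

def h(x):
    # Linear predictor hw,b(x) = <w,x> + b
    return np.dot(w, x) + b

x = np.array([1.0, 2.0, 3.0])
print(h(x))  # 2*1 - 1*2 + 0.5*3 + 0.3 = 1.8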
Linear Regression: definitions, matrix form, derivation of the best predictor, use of the generalized inverse
Hypothesis class : Hreg = Ld = { x –> <w,x> + b : w in R^d, b in R}
Commonly used loss function: squared loss: l(h,(x,y)) = (h(x) - y)^2
Empirical risk = training error = Mean squared error
Ls(h) = 1/m SUM (h(xi) - yi)^2
How do we find an ERM hypothesis? Least Squares Algorithm:
algorithm that solves the ERM problem for the hypothesis class of linear regression predictors with respect to the squared loss
Best hypothesis: argmin_w Ls(hw) = argmin_w 1/m SUM (<w,xi> - yi)^2 (we want to find w!)
Equivalent formulation: w minimizing the RSS (residual sum of squares):
argmin_w SUM (<w,xi> - yi)^2
To solve it, we compute the gradient of the objective function with respect to w and set it equal to 0.
We then obtain w = (XtX)^-1 Xty (Xt is the transpose of the design matrix X); if XtX is invertible, this is the solution of our ERM problem. If XtX is not invertible, a solution can still be obtained by replacing the inverse with the generalized (Moore-Penrose) inverse.
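A minimal NumPy sketch of the least squares solution (the data below is illustrative); np.linalg.pinv computes the generalized inverse, which also covers the case where XtX is not invertible:

import numpy as np

# Illustrative training set: m = 4 examples with d = 2 features, plus labels y
X = np.array([[1.0, 2.0],
              [2.0, 0.5],
              [3.0, 1.0],
              [4.0, 2.5]])
y = np.array([3.0, 2.5, 4.0, 6.5])

# Append a constant-1 column so the bias b is learned as an extra coordinate of w
Xb = np.hstack([X, np.ones((X.shape[0], 1))])

# Closed-form ERM solution w = (XtX)^+ Xty (generalized inverse)
w = np.linalg.pinv(Xb.T @ Xb) @ Xb.T @ y

mse = np.mean((Xb @ w - y) ** 2)  # empirical risk Ls(h), the mean squared error
print(w, mse)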
Coefficient of determination
R^2, in regression, is a statistical measure of how well the regression predictions approximate the real data points. An R^2 of 1 indicates that the regression predictions perfectly fit the data.
It is a measure of how well h performs against the best naive predictor (the constant predictor equal to the mean of the labels).
Obtained as R^2 = 1 - (residual sum of squares / total sum of squares).
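A small sketch (illustrative arrays) computing R^2 as 1 - RSS/TSS, where the naive predictor is the mean of the labels:

import numpy as np

y = np.array([3.0, 2.5, 4.0, 6.5])      # true labels (illustrative)
y_hat = np.array([3.1, 2.3, 4.2, 6.2])  # predictions of h (illustrative)

rss = np.sum((y - y_hat) ** 2)          # residual sum of squares
tss = np.sum((y - y.mean()) ** 2)       # total sum of squares (naive mean predictor)
r2 = 1.0 - rss / tss
print(r2)  # 1 = perfect fit, 0 = no better than predicting the mean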
Linear classification: perceptron
Used in binary classification problems, where h : R^d –> {-1,+1}
Hypothesis class of halfspaces: Hd = sign o Ld = {x –> sign(hw,b(x)) : hw,b in Ld}
Instances above the separating hyperplane are labeled positive, those below negative.
The commonly used loss function is the 0-1 loss.
How do we find a good hypothesis?
Good = minimizes the training error (ERM)
–> Perceptron algorithm: an algorithm that finds a good hypothesis by implementing the ERM rule (it halts with zero training error when the data are linearly separable)
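A minimal sketch of the perceptron update (the data below is illustrative and linearly separable, so the loop terminates):

import numpy as np

# Illustrative, linearly separable training set with labels in {-1, +1}
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, 1.0]])
y = np.array([1, 1, -1, -1])

# Homogeneous form: append a constant-1 feature so the bias b is part of w
Xb = np.hstack([X, np.ones((X.shape[0], 1))])
w = np.zeros(Xb.shape[1])

# While some example is misclassified, update w <- w + yi * xi
updated = True
while updated:
    updated = False
    for xi, yi in zip(Xb, y):
        if yi * np.dot(w, xi) <= 0:  # misclassified (or on the boundary)
            w += yi * xi
            updated = True

print(w, np.sign(Xb @ w))  # learned halfspace and its predictions on the training set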
VC-dimension of linear models
The class of homogeneous halfspaces in R^d has VC-dimension d; the class of non-homogeneous halfspaces (hw,b with a bias term) has VC-dimension d+1.
Logistic regression
Used to learn a function h from R^d to [0,1]; in binary classification tasks, h(x) is interpreted as the probability that the label of x is 1.
Hypothesis class: Hsig = sig o Ld, where sig : R –> [0,1] is the sigmoid function
The sigmoid is an S-shaped function, so that:
h(x) close to 1 –> high confidence that the label is 1
h(x) close to 0 –> high confidence that the label is -1
h(x) close to 1/2 –> not confident about the prediction
sig(z) = 1 / (1+e^-z) = e^z / (1+e^z)
Hsig = sig o Ld = { x –> sig(<w,x>) : w in R^d }
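A quick sketch of the sigmoid hypothesis (w and the inputs are illustrative):

import numpy as np

def sig(z):
    # sig(z) = 1 / (1 + e^-z)
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([1.0, -2.0])                # illustrative weight vector
for x in (np.array([3.0, -1.0]),         # <w,x> = 5   -> h(x) close to 1
          np.array([-3.0, 1.0]),         # <w,x> = -5  -> h(x) close to 0
          np.array([0.1, 0.05])):        # <w,x> = 0   -> h(x) = 1/2
    print(sig(np.dot(w, x)))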
Main difference from classification with halfspaces: when <w,x> is close to 0
- the halfspace prediction is exactly 1 or -1
- sig(<w,x>) is close to 1/2 –> uncertainty in the predicted label
Loss function: we need to define how bad it is to predict hw(x) in [0,1] given that the true label is y = +-1
l(hw,(x,y)) = log(1+exp(-y<w,x>))
The loss is small when y<w,x> is large: for y = +1 we want <w,x> to be large and positive (hw(x) close to 1), for y = -1 we want <w,x> to be large and negative (hw(x) close to 0).
Therefore, given a training set S, the ERM problem for logistic regression is
argmin_w 1/m SUM log(1 + e^(-yi<w,xi>))
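A minimal gradient descent sketch for this ERM problem (data, step size, and iteration count are illustrative; any convex optimizer would do, since the objective is convex in w):

import numpy as np

# Illustrative training set with labels in {-1, +1}
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, 1.0]])
y = np.array([1, 1, -1, -1])

def logistic_loss(w):
    # (1/m) SUM log(1 + exp(-yi <w,xi>))
    return np.mean(np.log1p(np.exp(-y * (X @ w))))

def gradient(w):
    # Gradient of the empirical logistic loss: -(1/m) SUM yi * sig(-yi<w,xi>) * xi
    s = 1.0 / (1.0 + np.exp(y * (X @ w)))
    return -(X.T @ (y * s)) / X.shape[0]

w = np.zeros(X.shape[1])
eta = 0.1                      # step size (illustrative)
for _ in range(1000):
    w -= eta * gradient(w)

print(w, logistic_loss(w))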
This ERM formulation is the same as the one arising from maximum likelihood estimation.
MLE is a statistical approach for finding the parameters that maximize the joint probability of a given dataset, assuming a specific parametric probability function.