empirical risk Flashcards
simple linear regression model
H(x) = w0 + w1x
slope in simple linear regression model
w1
intercept in simple linear regression model
w0
loss function
quantifies how bad a prediction is for a single data point
if our prediction is close to the actual value
we should have low loss
if our prediction is far from the actual value
we should have high loss
error
difference between actual and predicted values: yi - H(xi)
squared loss function
computes (actual - predicted)^2
squared loss for the constant model H(x) = h
Lsq(yi, h) = (yi - h)^2
another term for average squared loss
mean squared error
best prediction, h*
the prediction that minimizes the average squared loss Rsq(h) = (1/n) Σ_{i=1}^{n} (yi - h)^2
constant model
H(x) = h
simple linear regression
H(x) = w0 + w1x
how do we find h* that minimizes Rsq(h)
using calculus
minimize Rsq(h)
- take its derivative with respect to h
- set it equal to 0
- solve for the resulting h*
- perform a second derivative test to ensure we found a minimum
derivative of Rsq(h)
d/dh Rsq(h) = -(2/n) Σ_{i=1}^{n} (yi - h)
Mean minimizes…
mean squared error
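The cards above can be sanity-checked numerically; a minimal Python sketch (data values are illustrative) confirming that the mean minimizes mean squared error over a grid of candidates:

```python
# Numeric check that the mean minimizes mean squared error (Rsq).
y = [2.0, 3.0, 7.0, 8.0]

def mse(h, ys):
    """Average squared loss: Rsq(h) = (1/n) * sum((y_i - h)^2)."""
    return sum((yi - h) ** 2 for yi in ys) / len(ys)

mean = sum(y) / len(y)  # h* = 5.0

# Rsq at the mean is no larger than Rsq at any nearby candidate.
candidates = [mean + d / 10 for d in range(-50, 51)]
best = min(candidates, key=lambda h: mse(h, y))
print(best)  # 5.0, the mean
```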
absolute loss
Labs(yi, H(xi)) = |yi - H(xi)|
average absolute loss
Rabs(h) = (1/n) Σ_{i=1}^{n} |yi - h|
to minimize mean absolute error
- take its derivative with respect to h
- set it equal to 0
- solve for the resulting h*
- perform a second derivative test to ensure we found a minimum
derivative of |yi - h|
it is piecewise: -1 when h < yi, +1 when h > yi, and undefined at h = yi
derivative of Rabs(h)
d/dh [(1/n) Σ_{i=1}^{n} |yi - h|] = (1/n)[#(yi < h) - #(yi > h)]
median minimizes
mean absolute error
best constant prediction in terms of mean absolute error
median
1. when n is odd, answer is unique
2. when n is even, any number between the middle two data points also minimizes mean absolute error
3. when n is even, define the median to be the mean of the middle two data points
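The even-n case on the cards above can be checked directly; a small Python sketch (values are illustrative) showing that every h between the two middle points achieves the same minimum mean absolute error:

```python
# Numeric check that the median minimizes mean absolute error (Rabs).
y = [1.0, 4.0, 6.0, 9.0]  # even n: the middle two points are 4 and 6

def mae(h, ys):
    """Average absolute loss: Rabs(h) = (1/n) * sum(|y_i - h|)."""
    return sum(abs(yi - h) for yi in ys) / len(ys)

median = (4.0 + 6.0) / 2  # mean of the middle two points = 5.0

# Any h between the middle two points achieves the same minimum...
assert mae(4.0, y) == mae(5.0, y) == mae(6.0, y)
# ...and moving outside that interval strictly increases Rabs.
assert mae(3.9, y) > mae(5.0, y) and mae(6.1, y) > mae(5.0, y)
print(mae(median, y))  # 2.5
```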
process for minimizing average loss
empirical risk minimization
another name for “average loss”
empirical risk
corresponding empirical risk when using the squared loss function
Rsq(h) = (1/n) Σ_{i=1}^{n} (yi - h)^2
if L(yi, h) is any loss function the corresponding empirical risk is
R(h) = (1/n) Σ_{i=1}^{n} L(yi, h)
Modeling recipe
- choose a model
- choose a loss function
- minimize average loss to find optimal model parameters
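The recipe above can be sketched generically for a constant model: pick a loss, form the empirical risk R(h), and minimize it. A minimal sketch using brute-force grid search (function names and data are illustrative, and grid search stands in for the calculus-based minimization):

```python
# Generic empirical risk minimization for a constant model H(x) = h,
# minimized by grid search (illustrative, not efficient).
def empirical_risk(loss, ys, h):
    """R(h) = (1/n) * sum(L(y_i, h))."""
    return sum(loss(yi, h) for yi in ys) / len(ys)

def minimize_risk(loss, ys, step=0.01):
    """Search candidate predictions between min(ys) and max(ys)."""
    lo, hi = min(ys), max(ys)
    n_steps = round((hi - lo) / step)
    candidates = [lo + k * step for k in range(n_steps + 1)]
    return min(candidates, key=lambda h: empirical_risk(loss, ys, h))

y = [1.0, 2.0, 2.0, 7.0]
h_sq = minimize_risk(lambda yi, h: (yi - h) ** 2, y)  # ~ mean = 3.0
h_abs = minimize_risk(lambda yi, h: abs(yi - h), y)   # ~ median = 2.0
print(h_sq, h_abs)
```

Swapping in a different loss changes which summary statistic comes out, which is the point of the recipe.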
empirical risk minimization
formal name for the process of minimizing average loss
corresponding empirical risk for the squared loss Lsq(yi, h) = (yi - h)^2
Rsq(h) = (1/n) Σ_{i=1}^{n} (yi - h)^2
For the mean
sum of distances below = sum of distances above
Mean is the point where
Σ_{i=1}^{n} (yi - h) = 0
Median is the point where
#(yi < h) = #(yi > h)
Lp loss
Lp(yi, h) = |yi - h|^p
Corresponding empirical risk to Lp
Rp(h) = (1/n) Σ_{i=1}^{n} |yi - h|^p
midrange minimizes
L∞ (infinity) loss
As p → ∞,
the minimizer of mean Lp loss approaches the midpoint of the minimum and maximum values in the dataset, i.e., the midrange
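The limiting behavior above can be seen numerically; a small Python sketch (data values are illustrative) tracking the minimizer of mean Lp loss as p grows:

```python
# As p grows, the minimizer of mean Lp loss Rp(h) = (1/n)*sum(|y_i - h|^p)
# approaches the midrange (min + max) / 2.
y = [0.0, 1.0, 2.0, 10.0]
midrange = (min(y) + max(y)) / 2  # 5.0

def lp_risk(h, p):
    return sum(abs(yi - h) ** p for yi in y) / len(y)

def grid_min(p):
    candidates = [k / 100 for k in range(0, 1001)]  # 0.00 .. 10.00
    return min(candidates, key=lambda h: lp_risk(h, p))

print(grid_min(1))    # in the median region (between 1 and 2)
print(grid_min(2))    # mean = 3.25
print(grid_min(100))  # midrange = 5.0
```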
The general form of empirical risk for any loss function
R(h) = (1/n) Σ_{i=1}^{n} L(yi, h)
input h* that minimizes R(h) is…
some measure of the center of the dataset
minimum output R(h*) represents
some measure of the spread or variation in the dataset
Minimum value of Rsq(h)
Rsq(h*) = Rsq(Mean(y1, y2, …, yn))
= (1/n) Σ_{i=1}^{n} (yi - Mean(y1, y2, …, yn))^2
Variance
minimum value of Rsq(h) is the mean squared deviation from the mean, measures the squared distance of each data point from the mean, on average
standard deviation
square root of variance
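The identity above, that the minimum value of Rsq is exactly the variance, can be checked in a few lines of Python (data values are illustrative):

```python
# The minimum value of Rsq is the variance:
# Rsq(mean) = (1/n) * sum((y_i - mean)^2).
y = [1.0, 3.0, 5.0, 7.0]
n = len(y)
mean = sum(y) / n  # 4.0

variance = sum((yi - mean) ** 2 for yi in y) / n  # 5.0
std = variance ** 0.5  # standard deviation = sqrt(variance)

def r_sq(h):
    return sum((yi - h) ** 2 for yi in y) / n

print(r_sq(mean), variance)  # both 5.0
```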
empirical risk for absolute loss
Rabs(h) = (1/n) Σ_{i=1}^{n} |yi - h|
Rabs(h) is minimized when
h* = Median(y1, y2,… yn)
Minimum value of Rabs(h) is…
mean absolute deviation from the median
(1/n) Σ_{i=1}^{n} |yi - Median(y1, y2, …, yn)|
empirical risk for 0-1 Loss
R0,1(h) = (1/n) Σ_{i=1}^{n} L0,1(yi, h), where L0,1(yi, h) = 0 if yi = h, and 1 if yi ≠ h
proportion (between 0 and 1) of data points not equal to h
R0,1(h) is minimized when
h* = Mode(y1,y2…yn)
the minimum value of R0,1(h)
proportion of data points not equal to mode
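The mode cards above can be verified directly; a short Python sketch (data values are illustrative) showing that the mode minimizes the 0-1 empirical risk:

```python
# 0-1 loss: R0,1(h) = proportion of points not equal to h; minimized by the mode.
from collections import Counter

y = [2, 3, 3, 3, 5, 5]

def r01(h):
    return sum(1 for yi in y if yi != h) / len(y)

mode = Counter(y).most_common(1)[0][0]  # 3 appears most often

# R0,1 at the mode is the proportion of points not equal to the mode.
print(mode, r01(mode))  # 3 0.5
```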
when using squared loss
h* = Mean(y1, y2… yn)
Rsq(h*) = Variance(y1, y2, … yn)
When using absolute loss
h* = Median(y1, y2… yn)
Rabs(h*) = MAD from median
when using 0-1 loss
h* = Mode(y1, y2, …, yn)
R0,1(h*) = proportion of data points not equal to the mode
higher value means
less of the data is clustered at the mode
hypothesis function
H, takes in an x as input and returns a predicted y
parameters define
the relationship between the input and output of a hypothesis function
Since linear hypothesis functions are of the form H(x) = w0 + w1x, we can re-write Rsq
Rsq(w0, w1) = (1/n) Σ_{i=1}^{n} (yi - (w0 + w1 xi))^2
Minimize mean squared error
Take partial derivatives with respect to each variable
set all partial derivatives to 0
solve the resulting system of equations
ensure that you’ve found a minimum, rather than a maximum or saddle point
We have a system of two equations and two unknowns (w0 and w1)
-(2/n) Σ_{i=1}^{n} (yi - (w0 + w1 xi)) = 0
-(2/n) Σ_{i=1}^{n} (yi - (w0 + w1 xi)) xi = 0
solve for w0 in the first equation; the result, w0* = Mean(y) - w1* Mean(x), is the best intercept
plug w0* into the second equation and solve for w1*
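Carrying out the two steps above gives the standard closed-form solution; a minimal Python sketch (the function name is illustrative):

```python
# Closed-form least-squares fit for H(x) = w0 + w1*x, obtained by setting
# both partial derivatives of Rsq(w0, w1) to zero and solving the system.
def fit_simple_linear(xs, ys):
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # w1* = sum((x_i - x_bar)(y_i - y_bar)) / sum((x_i - x_bar)^2)
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    den = sum((x - x_bar) ** 2 for x in xs)
    w1 = num / den
    # w0* = y_bar - w1* * x_bar  (from the first equation)
    w0 = y_bar - w1 * x_bar
    return w0, w1

# Points lying exactly on y = 2x + 1 should be recovered exactly.
w0, w1 = fit_simple_linear([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(w0, w1)  # 1.0 2.0
```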
correlation
linear association, pattern that looks like a line
association
any pattern
correlation coefficient ,r
measures the strength of the linear association between two variables, x and y: how tightly clustered a scatter plot is around a straight line; always between -1 and 1
correlation coefficient, r is defined
average of the product of x and y when both are in standard units
slope: w1* = r (σy / σx)
units of y per units of x
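The slope identity above can be checked numerically: computing r as the average product of x and y in standard units, then r·(σy/σx), matches the direct least-squares slope. A short sketch (data values are illustrative):

```python
# Check that the least-squares slope equals r * (sigma_y / sigma_x),
# where r is the average product of x and y in standard units.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 1.0, 4.0, 5.0]
n = len(xs)

x_bar, y_bar = sum(xs) / n, sum(ys) / n
sigma_x = (sum((x - x_bar) ** 2 for x in xs) / n) ** 0.5
sigma_y = (sum((y - y_bar) ** 2 for y in ys) / n) ** 0.5

# r = average of the products of x and y in standard units
r = sum(((x - x_bar) / sigma_x) * ((y - y_bar) / sigma_y)
        for x, y in zip(xs, ys)) / n

slope_from_r = r * (sigma_y / sigma_x)

# Direct least-squares slope for comparison.
slope_ls = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
            / sum((x - x_bar) ** 2 for x in xs))
print(round(slope_from_r, 10) == round(slope_ls, 10))  # True
```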