Logistic regression Flashcards
Data-set
{(x1,y1),…,(xN,yN)}
generated independently according to
P(y|x) = f(x) if y=+1; 1-f(x) if y=-1
Objective
To learn the target function
f(x) = P[y=1 | x]
Note that the target f(x) is a probability, not the output label y.
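As a concrete illustration, here is a minimal sketch of how such a dataset could be generated; the particular target f below is a hypothetical choice, made up only for this example:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # hypothetical target: f(x) = P[y = +1 | x]; any map into [0, 1] would do
    return 1.0 / (1.0 + np.exp(-(1.5 * x[0] - 0.5 * x[1])))

N = 100
X = rng.normal(size=(N, 2))             # inputs x1, ..., xN
p = np.array([f(x) for x in X])         # P[y = +1 | xn] for each n
y = np.where(rng.random(N) < p, 1, -1)  # yn = +1 with probability f(xn), else -1
```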
Definition: logistic function
θ(s) = e^s / (1 + e^s)
Hypothesis
h(x) = θ(s)
where s is the risk score, modelled as a linear function of the input x
s = w’ x
-> h(x) = θ(w' x)
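A direct translation into code (a sketch; any bias term is assumed to be absorbed into w by prepending a constant 1 to x):

```python
import numpy as np

def theta(s):
    # logistic function θ(s) = e^s / (1 + e^s), written in the
    # numerically equivalent form 1 / (1 + e^{-s})
    return 1.0 / (1.0 + np.exp(-s))

def h(x, w):
    # hypothesis h(x) = θ(w'x); s = w'x is the linear risk score
    return theta(np.dot(w, x))
```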
Performance index
Maximum likelihood inference: maximization wrt w of Π(n=1,N) Pw(yn|xn), where Pw(yn|xn) = θ(yn w' xn); this compact form covers both labels y = ±1, since 1-θ(s) = θ(-s)
Reformulation as a minimization problem:
minimize wrt w -1/N * sum(n=1,N) ln( Pw(yn|xn) )
In general, this is a non-linear minimization problem
In-sample error in logistic regression
Ein(w) = 1/N * sum(n=1,N) ln(1+e^-yn w’ xn)
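The two formulations agree because -ln θ(s) = ln(1+e^-s). A vectorized sketch, assuming X stacks the xn as rows (with any bias column already included) and y holds the ±1 labels:

```python
import numpy as np

def Ein(w, X, y):
    # in-sample error Ein(w) = (1/N) Σ ln(1 + e^{-yn w'xn})
    margins = y * (X @ w)  # yn * w'xn, one entry per data point
    return np.mean(np.log1p(np.exp(-margins)))
```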
Gradient descent algorithm
- initialize the weights at t=0 as w^(0)
- for t=0,1,2,… do
- compute the gradient of the in-sample error using all the data
g(t) = ∇Ein(w^(t))
- set the direction v^(t) = -g(t)
- update the weights as
w^(t+1) = w^(t) + η*v^(t)
- iterate until a stopping criterion is met (see below)
- return the weights that minimize Ein(w)
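A sketch of the full loop; the gradient formula ∇Ein(w) = -(1/N) Σ yn xn / (1 + e^{yn w'xn}) follows by differentiating Ein above, and η and the iteration cap are illustrative values:

```python
import numpy as np

def gradient(w, X, y):
    # ∇Ein(w) = -(1/N) Σ yn xn / (1 + e^{yn w'xn})
    margins = y * (X @ w)
    return -(X * (y / (1.0 + np.exp(margins)))[:, None]).mean(axis=0)

def gradient_descent(X, y, eta=0.1, max_iters=1000):
    w = np.zeros(X.shape[1])   # w^(0)
    for t in range(max_iters):
        g = gradient(w, X, y)  # g(t) = ∇Ein(w^(t)), uses all N points
        w = w - eta * g        # w^(t+1) = w^(t) + η*v^(t), with v^(t) = -g(t)
    return w
```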
Stopping criteria
1) upper bound on number of iterations
2) threshold on the norm of the gradient ||g(t)||
3) combination of 1 and 2
4) combination of the previous plus a threshold on Ein
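Criterion 4, for instance, could be factored out as a helper like this (the thresholds are illustrative, and should_stop is a made-up name):

```python
import numpy as np

def should_stop(t, g, ein_value, max_iters=1000, grad_tol=1e-6, ein_tol=1e-3):
    # criterion 4: iteration cap, plus thresholds on ||g(t)|| and on Ein
    return t >= max_iters or np.linalg.norm(g) < grad_tol or ein_value < ein_tol
```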
Stochastic gradient descent
Gradient descent algorithm uses all the N data to compute ∇Ein
Stochastic GD algorithm, instead:
- pick a data point (xn,yn) at random
- consider the error on that single point: ein(w) = ln(1+e^-yn w' xn)
- compute its gradient: ∇ein(w) = …
- update rule: w^(t+1) = w^(t) - η*∇ein(w)
-> the minimization proceeds on average in the right direction, with some fluctuations, but with the advantage of requiring far fewer computations per iteration
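A sketch of the stochastic variant; the per-point gradient ∇ein(w) = -yn xn / (1 + e^{yn w'xn}) is the single-point analogue of ∇Ein, derived the same way:

```python
import numpy as np

def sgd(X, y, eta=0.1, num_steps=10_000, seed=0):
    rng = np.random.default_rng(seed)
    N, d = X.shape
    w = np.zeros(d)
    for _ in range(num_steps):
        n = rng.integers(N)                        # pick (xn, yn) at random
        m = y[n] * (X[n] @ w)                      # yn * w'xn
        grad = -y[n] * X[n] / (1.0 + np.exp(m))    # ∇ein(w) at that point
        w = w - eta * grad                         # w^(t+1) = w^(t) - η*∇ein(w)
    return w
```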
Selection of the parameter η in the gradient descent algorithm
η is a design parameter:
- it should be small enough that the first-order Taylor approximation behind the descent step remains accurate
- it should be large enough to obtain good convergence performance
- a good choice is a variable η:
η^(t) = η*||∇Ein(w^(t))||
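In code, this variable rate is applied along the normalized direction (a common convention with this rule, assumed here); note that the norms cancel, so the step reduces to the plain fixed-η update:

```python
import numpy as np

def variable_eta_step(w, g, eta=0.1):
    # η^(t) = η * ||∇Ein(w^(t))||, applied along v̂ = -g/||g|| (assumes g ≠ 0)
    eta_t = eta * np.linalg.norm(g)
    v_hat = -g / np.linalg.norm(g)
    return w + eta_t * v_hat  # identical to w - eta * g
```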