Machine Learning Flashcards
List ML algorithm categories.
- Supervised learning
- Unsupervised learning
- Reinforcement learning
- Recommender systems
Examples of supervised learning
- Regression: Predicting continuous value output
- Classification (Logistic regression): Predicting discrete value output
Examples of unsupervised learning
- Clustering: Google News story grouping, computer cluster analysis, market segmentation, social network analysis
- Non-clustering: cocktail party problem (separating overlapping audio sources)
Hypothesis (model) and cost function for linear regression with a single variable
hθ(x) = θ0 + θ1*x
J(θ) = 1/(2m) * Σ_{i=1~m} (hθ(x^(i)) - y^(i))^2
- m: number of training examples.
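As a quick sketch, the same hypothesis and cost can be written in NumPy (the function names `hypothesis` and `cost` are illustrative, not from the card):

```python
import numpy as np

def hypothesis(x, theta0, theta1):
    # h_theta(x) = theta0 + theta1 * x
    return theta0 + theta1 * x

def cost(x, y, theta0, theta1):
    # J(theta) = 1/(2m) * sum_i (h_theta(x^(i)) - y^(i))^2
    m = len(y)
    errors = hypothesis(x, theta0, theta1) - y
    return np.sum(errors ** 2) / (2 * m)
```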
How to find the parameter set for a linear regression problem?
Find a parameter set that minimizes the cost function, i.e.,
min_θ J(θ)
One way of solving this optimization problem is the gradient descent algorithm.
Describe the gradient descent algorithm.
repeat until convergence {
  for all j (simultaneously) {
    θj := θj - α * (∂/∂θj) J(θ)
  }
}
- α: learning rate
- Note that all θj are updated simultaneously.
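A minimal sketch of this update rule, assuming a caller-supplied `grad(theta)` that returns the gradient vector of J (the function name and the `alpha`/`n_iters` defaults are illustrative):

```python
import numpy as np

def gradient_descent(grad, theta, alpha=0.01, n_iters=1000):
    # theta_j := theta_j - alpha * dJ/dtheta_j for all j at once
    for _ in range(n_iters):
        theta = theta - alpha * grad(theta)  # simultaneous update of all theta_j
    return theta
```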
Discuss the learning rate of the gradient descent algorithm.
- α too small –> convergence is too slow
- α too big –> may fail to converge, or even diverge
Gradient descent algorithm for a linear regression with a single variable.
repeat until convergence {
  θ0 := θ0 - α * (1/m) * Σ_{i=1~m} (hθ(x^(i)) - y^(i))
  θ1 := θ1 - α * (1/m) * Σ_{i=1~m} (hθ(x^(i)) - y^(i)) * x^(i)
  (update θ0 and θ1 simultaneously)
}
* Note: This is batch gradient descent.
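A hedged NumPy sketch of this batch update (initial values and the `alpha`/`n_iters` defaults are arbitrary choices, not from the card):

```python
import numpy as np

def batch_gd_single_variable(x, y, alpha=0.01, n_iters=1000):
    # Batch gradient descent for h(x) = theta0 + theta1 * x;
    # each step uses all m training examples.
    m = len(y)
    theta0, theta1 = 0.0, 0.0
    for _ in range(n_iters):
        errors = (theta0 + theta1 * x) - y
        # compute both updates before assigning -> simultaneous update
        new_theta0 = theta0 - alpha * np.sum(errors) / m
        new_theta1 = theta1 - alpha * np.sum(errors * x) / m
        theta0, theta1 = new_theta0, new_theta1
    return theta0, theta1
```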
What is “batch” gradient descent?
Each step of the gradient descent uses all the training samples.
Hypothesis and cost function of a linear regression with multi-variables.
hθ(x) = θ^T·x
- θ^T = [θ0, …, θn]
- x^T = [1, x1, …, xn]  (x0 = 1)
J(θ) = 1/(2m) * Σ_{i=1~m} (hθ(x^(i)) - y^(i))^2
- m: number of training examples.
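A vectorized NumPy sketch of this cost, assuming `X` is the m×(n+1) design matrix with a leading column of ones (names are illustrative):

```python
import numpy as np

def cost_multivariate(X, y, theta):
    # X: (m, n+1) design matrix whose first column is all ones (x0 = 1)
    # theta: (n+1,) parameter vector; h_theta(x) = theta^T x  ->  X @ theta
    m = len(y)
    errors = X @ theta - y
    return (errors @ errors) / (2 * m)
```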
Gradient descent of a linear regression with multi-variables.
repeat until convergence {
  for all j in {0, 1, …, n} (simultaneously)
    θj := θj - α * (1/m) * Σ_{i=1~m} (hθ(x^(i)) - y^(i)) * xj^(i)
}
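A vectorized sketch of the same update, under the same design-matrix assumption as above:

```python
import numpy as np

def batch_gd_multivariate(X, y, alpha=0.01, n_iters=1000):
    # Vectorized batch gradient descent; X is the (m, n+1) design matrix
    # with a leading column of ones, and all theta_j are updated at once.
    m, n_plus_1 = X.shape
    theta = np.zeros(n_plus_1)
    for _ in range(n_iters):
        gradient = X.T @ (X @ theta - y) / m
        theta = theta - alpha * gradient
    return theta
```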
Feature scaling and GD
For GD to converge quickly, features should be on a similar scale. Mean normalization can be used:
x := (x - μ)/s
- μ: mean of the feature
- s: standard deviation (or max - min) of the feature
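A minimal sketch of mean normalization with NumPy, assuming `X` holds one feature per column:

```python
import numpy as np

def mean_normalize(X):
    # x := (x - mu) / s, applied column-wise; s can be the std or (max - min)
    mu = X.mean(axis=0)
    s = X.std(axis=0)
    return (X - mu) / s, mu, s  # keep mu and s to scale future inputs the same way
```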
How do you make sure GD is working?
Plot J(θ) against the number of iterations and check that it decreases at every iteration.
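One possible way to plot the cost history (a sketch assuming `cost_history` holds one J(θ) value per GD iteration):

```python
import matplotlib.pyplot as plt

def plot_cost_history(cost_history):
    # cost_history: list of J(theta) values recorded once per GD iteration
    plt.plot(range(len(cost_history)), cost_history)
    plt.xlabel("Iteration")
    plt.ylabel("J(theta)")
    plt.show()
```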
How to extend a linear regression to Polynomial regression for non-linear function?
Create new features from the existing ones.
For example,
x1 = x1
x2 = x1^2
x3 = x1^3
Then, fit the new feature set using the standard linear regression technique.
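A small sketch of building such polynomial features (the helper name `polynomial_features` is illustrative):

```python
import numpy as np

def polynomial_features(x1, degree=3):
    # Builds [x1, x1^2, ..., x1^degree] as new features for linear regression.
    # Feature scaling matters here, since the powers differ widely in range.
    return np.column_stack([x1 ** d for d in range(1, degree + 1)])
```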
Normal equation for linear regression.
θ = (X^T·X)^(-1)·X^T·y
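A NumPy sketch of the normal equation; using `pinv` here is one reasonable choice when X^T·X is near-singular:

```python
import numpy as np

def normal_equation(X, y):
    # theta = (X^T X)^(-1) X^T y; pinv adds numerical robustness
    return np.linalg.pinv(X.T @ X) @ X.T @ y
```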
Explain Logistic Regression
In solving a {0, 1} classification problem, we want the hypothesis (model) output to be in the [0, 1] range.
- For linear regression, hθ(x) = θ^T·x
- For logistic regression, hθ(x) = g(θ^T·x) = 1/(1 + exp(-θ^T·x))
- g(t) = 1/(1 + exp(-t)): sigmoid (logistic) function
- Interpretation: hθ(x) = p(y=1 | x; θ) –> probability that y = 1, given x, parameterized by θ
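A minimal NumPy sketch of the sigmoid and the logistic hypothesis (function names are illustrative; `X` is again assumed to be a design matrix with a leading column of ones):

```python
import numpy as np

def sigmoid(z):
    # g(z) = 1 / (1 + exp(-z))
    return 1.0 / (1.0 + np.exp(-z))

def logistic_hypothesis(X, theta):
    # h_theta(x) = g(theta^T x), interpreted as P(y = 1 | x; theta)
    return sigmoid(X @ theta)
```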
Decision boundary for logistic regression
Suppose
- Predict “y=1” if hθ(x) >= 0.5
- Predict “y=0” if hθ(x) < 0.5
Since g(z) >= 0.5 exactly when z >= 0, the decision boundary is θ^T·x = 0.
Cost function for Logistic Regression
J(θ) = (1/m) * Σ_{i=1~m} cost(hθ(x^(i)), y^(i))
where cost(hθ(x^(i)), y^(i)) is
- -log(hθ(x^(i))) if y^(i) = 1
- -log(1 - hθ(x^(i))) if y^(i) = 0
If you combine the above two terms, then
cost(•) = -y^(i)*log(hθ(x^(i))) - (1 - y^(i))*log(1 - hθ(x^(i)))
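A sketch of the combined cost in NumPy (the small `eps` is an added guard against log(0), not part of the formula above):

```python
import numpy as np

def logistic_cost(X, y, theta, eps=1e-12):
    # J(theta) = -(1/m) * sum(y*log(h) + (1-y)*log(1-h))
    m = len(y)
    h = 1.0 / (1.0 + np.exp(-(X @ theta)))
    return -np.sum(y * np.log(h + eps) + (1 - y) * np.log(1 - h + eps)) / m
```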
After training, you found that your ML algorithm produces a high prediction error on test data. What can you do?
- Get more training examples –> helps fix high variance
  - Not helpful if you have high bias (underfitting)
- Try a smaller set of features –> fixes high variance (overfitting)
  - Not helpful if you have high bias
- Try adding additional features –> fixes high bias (the hypothesis is too simple; extra features make it more expressive)
- Add polynomial features –> fixes high bias
- Decrease λ –> fixes high bias
- Increase λ –> fixes high variance