Week 2: Regression & Classification (Linear & Nonlinear Models) Flashcards
Perceptron Training Rule
Linear classification models can draw decision boundaries between regions, with each region representing its own class.
Error Rate for Classification Models
Error rate = 1 - \frac{1}{m}\sum_{i=1}^m \text{score}_i, with \text{score}_i = 0 for a misclassification and 1 for a correct classification
Error Rate for Regression Models
Error rate = \sum_{i=1}^m (y_i - \hat{y}_i)^2
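As a minimal sketch (assuming NumPy arrays of true values and model predictions; the function names are illustrative), both error measures can be computed directly:

```python
import numpy as np

def classification_error_rate(y_true, y_pred):
    # score_i = 1 for a correct classification, 0 for a misclassification
    score = (y_true == y_pred).astype(float)
    return 1.0 - score.mean()

def regression_error(y_true, y_pred):
    # sum of squared differences between targets and predictions
    return np.sum((y_true - y_pred) ** 2)

# example usage with made-up values
print(classification_error_rate(np.array([0, 1, 1, 0]), np.array([0, 1, 0, 0])))  # 0.25
print(regression_error(np.array([1.0, 2.0]), np.array([1.5, 1.5])))               # 0.5
```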
Training Models
This applies to both classification and regression models.
Training:
1. Select the training set
2. Initialise model parameters
3. Apply the model to all training set instances
4. Compute the error rate
5. Adjust the parameters to obtain a model with a lower error
6. Repeat from step 3 until a desirable error rate is reached
7. Output the training error
Evaluation:
1. Select the test set
2. Apply the model to all test set instances
3. Compute the error rate
4. Output the evaluation error
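A toy sketch of this loop, using a deliberately simple one-parameter model that always predicts a constant (all names and the learning rate are illustrative, not a prescribed implementation):

```python
import numpy as np

# Toy illustration of the generic train/evaluate procedure: a model with a single
# parameter c that always predicts c, adjusted by a small step on the squared error.
def train_and_evaluate(y_train, y_test, alpha=0.1, target_error=0.01, max_iters=1000):
    c = 0.0                                          # step 2: initialise the model parameter
    for _ in range(max_iters):
        y_pred = np.full_like(y_train, c)            # step 3: apply the model to all training instances
        error = np.mean((y_train - y_pred) ** 2)     # step 4: compute the error
        if error <= target_error:                    # step 6: stop once a desirable error is reached
            break
        c += alpha * np.mean(y_train - y_pred)       # step 5: adjust the parameter to lower the error
    train_error = error                              # step 7: training error
    test_error = np.mean((y_test - c) ** 2)          # evaluation: apply the model to the test set
    return train_error, test_error

print(train_and_evaluate(np.array([1.0, 2.0, 3.0]), np.array([2.0, 2.5])))
```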
Cross-validation
- Split the dataset into N approximately equal-sized folds.
- Perform N repetitions, each using one fold for testing and the remaining folds for training.
- Compute the error rate after each repetition and average the N results to yield the overall error rate.
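A minimal sketch of the fold rotation in NumPy, with a trivial stand-in model (predict the mean of the training targets) just to keep the example self-contained:

```python
import numpy as np

def cross_validate(y, n_folds=5, seed=0):
    # split the shuffled indices into N approximately equal-sized folds
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)

    errors = []
    for k in range(n_folds):
        test_idx = folds[k]                                   # fold k is the test set
        train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        prediction = y[train_idx].mean()                      # stand-in "model": predict the training mean
        errors.append(np.mean((y[test_idx] - prediction) ** 2))
    return np.mean(errors)                                    # average the N error rates

print(cross_validate(np.arange(20, dtype=float)))
```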
Spurious Correlations
Just because a correlation exists between two variables does not mean there is a causal relationship between them.
Linear Regression
\hat{y}_i = x_i w = w_0 + \sum_{j=1}^n w_j x_{i,j}
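With the common convention of prepending a constant 1 to each instance so that w_0 acts as the intercept, the prediction is a single dot product (an illustrative sketch):

```python
import numpy as np

def predict_linear(X, w):
    # prepend a column of ones so w[0] plays the role of the intercept w_0
    X1 = np.hstack([np.ones((X.shape[0], 1)), X])
    return X1 @ w

X = np.array([[1.0, 2.0], [3.0, 4.0]])   # two instances, two attributes
w = np.array([0.5, 1.0, -1.0])           # w_0, w_1, w_2
print(predict_linear(X, w))              # [-0.5, -0.5]
```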
Nonlinear Regression
These can include interaction terms and polynomial terms. Ex. \hat{y}_i = w_0 + w_1 \cdot x_i + w_2 \cdot x_i^2
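These models remain linear in the weights, so the same fitting machinery applies once the expanded features are built; a small illustrative sketch of the quadratic case:

```python
import numpy as np

def quadratic_features(x):
    # build [x_i, x_i^2] for a single-attribute input vector x
    return np.column_stack([x, x ** 2])

x = np.array([1.0, 2.0, 3.0])
w = np.array([1.0, 0.5, 2.0])            # w_0, w_1, w_2
X_poly = quadratic_features(x)
y_hat = w[0] + X_poly @ w[1:]            # \hat{y}_i = w_0 + w_1 x_i + w_2 x_i^2
print(y_hat)                             # [3.5, 10.0, 20.5]
```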
Underfitting
When the model does not fit even the training data well.
Overfitting
When the model fits the training data relatively well, but fails to generalise to unseen data.
Mean of Squared Errors
S(w) = \frac{1}{m} \sum_{i=1}^m (y_i - \hat{y}_i)^2. Linear regression models try to minimise the mean of squared errors.
Gradient Descent
Initialise weights to 0 or to random values.
Until convergence is achieved:
for i \in {1,…,m}:
for j \in {1,…,n}:
w_j \leftarrow w_j + \alpha(y_i - \hat{y}_i)x_{i,j}
Termination criterion: \left\lvert S(w^k) - S(w^{k+1}) \right\rvert < \epsilon
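A minimal runnable sketch of these updates in NumPy, using the convention x_{i,0} = 1 for the intercept and illustrative values for the learning rate and tolerance:

```python
import numpy as np

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)                 # S(w)

def gradient_descent(X, y, alpha=0.01, eps=1e-8, max_epochs=10000):
    X1 = np.hstack([np.ones((X.shape[0], 1)), X])    # x_{i,0} = 1 for the intercept
    w = np.zeros(X1.shape[1])                        # initialise weights to 0
    prev_S = mse(y, X1 @ w)
    for _ in range(max_epochs):
        for i in range(len(y)):                      # for i in {1,…,m}
            y_hat_i = X1[i] @ w
            w += alpha * (y[i] - y_hat_i) * X1[i]    # w_j <- w_j + alpha (y_i - ŷ_i) x_{i,j}
        S = mse(y, X1 @ w)
        if abs(prev_S - S) < eps:                    # |S(w^k) - S(w^{k+1})| < eps
            break
        prev_S = S
    return w
```

On data generated by, say, y = 2x, the learned weights should end up near w_0 ≈ 0 and w_1 ≈ 2.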
Single-scan/On-line Algorithm
for i \in {1,…,m}:
repeat:
for j \in {1,…,n}:
w_j \leftarrow w_j + \alpha(y_i - \hat{y}_i) x_{i,j}
until S(w) isn’t significantly changed
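Read literally, the pseudocode visits each instance once and repeats the update on that instance until S(w) stops changing noticeably; a sketch under that reading, with the same conventions and illustrative hyperparameters as above:

```python
import numpy as np

def single_scan(X, y, alpha=0.01, eps=1e-8, max_inner=1000):
    X1 = np.hstack([np.ones((X.shape[0], 1)), X])   # x_{i,0} = 1 for the intercept
    w = np.zeros(X1.shape[1])
    for i in range(len(y)):                          # one scan over the training instances
        prev_S = np.mean((y - X1 @ w) ** 2)
        for _ in range(max_inner):                   # repeat the update on instance i
            w += alpha * (y[i] - X1[i] @ w) * X1[i]
            S = np.mean((y - X1 @ w) ** 2)
            if abs(prev_S - S) < eps:                # until S(w) isn't significantly changed
                break
            prev_S = S
    return w
```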
This method updates the weights after each individual example. Other names include on-line approximation and stochastic gradient descent.
Logistic Regression
This method assumes binary classification. If \hat{y} \le 0.5, predict 0; if \hat{y} > 0.5, predict 1. \hat{y} = \frac{1}{1 + \exp[-(w_0 + w_1 x_1 + … + w_n x_n)]}. Essentially, if w^T x > 0, return 1; if w^T x \le 0, return 0.
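A small sketch of the prediction side, assuming the weights have already been learned (names and values are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_logistic(X, w):
    X1 = np.hstack([np.ones((X.shape[0], 1)), X])    # prepend 1 so w[0] is w_0
    y_hat = sigmoid(X1 @ w)                           # \hat{y} in (0, 1)
    return (y_hat > 0.5).astype(int)                  # equivalently: w^T x > 0 -> 1

X = np.array([[2.0, 1.0], [-3.0, 0.5]])
w = np.array([0.0, 1.0, -1.0])
print(predict_logistic(X, w))    # prints [1 0]
```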
Perceptrons
This method learns a hyperplane separating two classes. Perceptrons form the building blocks of neural networks, such as single-layer feed-forward neural networks. They use the perceptron learning rule.
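A minimal sketch of the perceptron learning rule for labels in {0, 1}, with an illustrative learning rate and toy data:

```python
import numpy as np

def perceptron_train(X, y, alpha=0.1, max_epochs=100):
    X1 = np.hstack([np.ones((X.shape[0], 1)), X])    # x_0 = 1 provides the bias weight
    w = np.zeros(X1.shape[1])
    for _ in range(max_epochs):
        errors = 0
        for i in range(len(y)):
            pred = 1 if X1[i] @ w > 0 else 0          # threshold the linear combination
            if pred != y[i]:
                w += alpha * (y[i] - pred) * X1[i]    # perceptron learning rule
                errors += 1
        if errors == 0:                               # data separated: hyperplane found
            break
    return w

# linearly separable toy data: class 1 when x_1 + x_2 > 3 (illustrative)
X = np.array([[1.0, 1.0], [2.0, 0.5], [3.0, 2.0], [2.5, 1.5]])
y = np.array([0, 0, 1, 1])
print(perceptron_train(X, y))
```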