Week 2: Regression & Classification (Linear & Nonlinear Models) Flashcards
Perceptron Training Rule
Linear classification models draw decision boundaries that divide the input space into regions, with each region representing its own class.
Error Rate for Classification Models
Error rate = 1 - \frac{1}{m}\sum_{i=1}^m score_i, where score_i = 1 for a correct classification and 0 for a misclassification
Error Rate for Regression Models
Error rate = \sum_{i=1}^m (y_i - \hat{y}_i)^2, i.e. the sum of squared differences between the true values and the predictions
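A minimal Python sketch of both error measures above, assuming the labels/values and predictions are NumPy arrays (the variable names are illustrative):

```python
import numpy as np

def classification_error_rate(y_true, y_pred):
    # score_i = 1 for a correct classification, 0 for a misclassification
    scores = (y_true == y_pred).astype(float)
    return 1.0 - scores.mean()

def regression_error(y_true, y_pred):
    # sum of squared differences between true and predicted values
    return np.sum((y_true - y_pred) ** 2)

print(classification_error_rate(np.array([0, 1, 1, 0]), np.array([0, 1, 0, 0])))  # 0.25
print(regression_error(np.array([1.0, 2.0]), np.array([1.5, 1.5])))               # 0.5
```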
Training Models
This applies to both classification and regression models; a code sketch of the full loop follows after the evaluation steps below.
Training:
1. Select the training set
2. Initialise model parameters
3. Apply the model to all training set instances
4. Compute the error rate
5. Adjust the parameters to obtain a model with a lower error rate
6. Repeat from step 3 until desirable error rate reached
7. Output the training error
Evaluation:
1. Select the test set
2. Apply the model to all test set instances
3. Compute the error rate
4. Output the evaluation error
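A minimal Python sketch of the training/evaluation loop above. The `model` object with `predict` and `adjust` methods, and the `error_fn` callback, are hypothetical placeholders, not a specific library API:

```python
def train(model, X_train, y_train, error_fn, target_error, max_iters=1000):
    # Steps 3-6: apply the model, compute the error, adjust, repeat
    error = error_fn(y_train, model.predict(X_train))
    for _ in range(max_iters):
        if error <= target_error:
            break
        model.adjust(X_train, y_train)                   # step 5: lower the error
        error = error_fn(y_train, model.predict(X_train))
    return error                                         # step 7: training error

def evaluate(model, X_test, y_test, error_fn):
    # Apply the trained model to the test set and report the evaluation error
    return error_fn(y_test, model.predict(X_test))
```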
Cross-validation
- Split the dataset into N approximately equal-sized folds.
- Perform N repetitions where one fold is used for testing and the remaining folds are used for training.
- Compute the error rate after each repetition (giving N error rates) and average the results to yield the overall error rate (see the sketch below).
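A minimal Python sketch of N-fold cross-validation, assuming X and y are NumPy arrays and reusing the hypothetical `train`/`evaluate` helpers sketched above, plus a `make_model` factory (also a placeholder):

```python
import numpy as np

def cross_validate(make_model, X, y, n_folds, error_fn, target_error):
    indices = np.arange(len(X))
    folds = np.array_split(indices, n_folds)          # N approximately equal folds
    errors = []
    for k in range(n_folds):
        test_idx = folds[k]
        train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        model = make_model()
        train(model, X[train_idx], y[train_idx], error_fn, target_error)
        errors.append(evaluate(model, X[test_idx], y[test_idx], error_fn))
    return np.mean(errors)                            # overall error rate
```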
Spurious Correlations
Just because a correlation exists does not mean there is a causal relationship between the variables.
Linear Regression
\hat{y}_i = x_i w = w_0 + \sum_{j=1}^n w_j x_{i,j}
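A minimal Python sketch of the prediction formula, assuming a weight vector w = (w_0, …, w_n) and a feature vector x_i = (x_{i,1}, …, x_{i,n}) (the numbers are illustrative):

```python
import numpy as np

def predict_linear(w, x_i):
    # \hat{y}_i = w_0 + sum_j w_j * x_{i,j}
    return w[0] + np.dot(w[1:], x_i)

w = np.array([1.0, 2.0, -0.5])      # w_0, w_1, w_2
x_i = np.array([3.0, 4.0])          # x_{i,1}, x_{i,2}
print(predict_linear(w, x_i))       # 1 + 2*3 - 0.5*4 = 5.0
```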
Nonlinear Regression
These can include interaction terms and polynomial terms. Ex. \hat{y}_i = w_0 + w_1 \cdot x_i + w_2 \cdot x_i^2
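A minimal Python sketch of fitting the quadratic example above with NumPy's polynomial fitting; the data is made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 50)
y = 1.0 + 2.0 * x + 0.5 * x**2 + rng.normal(scale=0.1, size=x.shape)

# Fit \hat{y} = w_0 + w_1*x + w_2*x^2 (np.polyfit returns the highest degree first)
w2, w1, w0 = np.polyfit(x, y, deg=2)
print(w0, w1, w2)   # approximately 1.0, 2.0, 0.5
```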
Underfitting
When the model doesn’t predict the training data well.
Overfitting
When the model fits the training data relatively well, but fails to generalise to unseen data.
Mean of Squared Errors
S(w) = \frac{1}{m} \sum_{i=1}^m (y_i - \hat{y}_i)^2. Linear regression models try to minimise the mean of squared errors.
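A minimal Python sketch of S(w) as a function of the true values and predictions (names are illustrative):

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    # S(w) = (1/m) * sum_i (y_i - \hat{y}_i)^2
    return np.mean((y_true - y_pred) ** 2)
```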
Gradient Descent
Initialise weights to 0 or to random values.
Until convergence is achieved:
for i \in {1,…,m}
for j \in {1,…,n}
w_j \leftarrow w_j + \alpha(y_i - \hat{y}_i)x_{i,j}
Termination criterion: \left\lvert S(w^k) - S(w^{k+1}) \right\rvert < \epsilon
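A minimal Python sketch of the loop above for linear regression, assuming a bias column of ones has already been prepended to X (so w_0 is w[0]) and that S(w) is the mean of squared errors; the example data is illustrative:

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, eps=1e-6, max_epochs=1000):
    m, n = X.shape
    w = np.zeros(n)                               # initialise weights to 0
    prev_S = np.inf
    for _ in range(max_epochs):                   # until convergence
        for i in range(m):                        # for i in {1,...,m}
            y_hat_i = X[i] @ w
            for j in range(n):                    # for j in {1,...,n}
                w[j] += alpha * (y[i] - y_hat_i) * X[i, j]
        S = np.mean((y - X @ w) ** 2)             # S(w)
        if abs(prev_S - S) < eps:                 # |S(w^k) - S(w^{k+1})| < eps
            break
        prev_S = S
    return w

# Example: fit y = 1 + 2x (bias column of ones prepended to X)
X = np.column_stack([np.ones(20), np.linspace(0, 1, 20)])
y = 1.0 + 2.0 * np.linspace(0, 1, 20)
print(gradient_descent(X, y, alpha=0.1))          # approximately [1.0, 2.0]
```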
Single-scan/On-line Algorithm
for i \in {1,…,m}:
repeat:
for j \in {1,…,n}:
w_j \leftarrow w_j + \alpha(y_i - \hat{y}_i) x_{i,j}
until S(w) isn’t significantly changed
This method updates the weights after each individual example. Other names include on-line approximation and stochastic gradient descent.
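A minimal Python sketch mirroring the single-scan pseudocode above (one pass over the examples, each processed repeatedly until S(w) stops changing significantly); the choice of S(w) as the mean of squared errors is an assumption for illustration:

```python
import numpy as np

def single_scan(X, y, alpha=0.01, eps=1e-6, max_inner=100):
    m, n = X.shape
    w = np.zeros(n)
    S = lambda w: np.mean((y - X @ w) ** 2)       # illustrative choice of S(w)
    for i in range(m):                            # single scan over the examples
        prev_S = np.inf
        for _ in range(max_inner):                # repeat until S(w) stabilises
            y_hat_i = X[i] @ w
            for j in range(n):
                w[j] += alpha * (y[i] - y_hat_i) * X[i, j]
            if abs(prev_S - S(w)) < eps:
                break
            prev_S = S(w)
    return w
```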
Logistic Regression
This method assumes binary classification. If \hat{y} \le 0.5, predict 0; if \hat{y} > 0.5, predict 1. \hat{y} = \frac{1}{1 + \exp[-(w_0 + w_1 x_1 + … + w_n x_n)]}. Equivalently, if w^T x > 0, return 1; if w^T x \le 0, return 0.
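A minimal Python sketch of the prediction step, assuming weights w = (w_0, …, w_n) and features x = (x_1, …, x_n) (the numbers are illustrative):

```python
import numpy as np

def predict_logistic(w, x):
    # \hat{y} = 1 / (1 + exp(-(w_0 + w_1*x_1 + ... + w_n*x_n)))
    z = w[0] + np.dot(w[1:], x)
    y_hat = 1.0 / (1.0 + np.exp(-z))
    return 1 if y_hat > 0.5 else 0    # equivalent to checking w_0 + w_1*x_1 + ... + w_n*x_n > 0

w = np.array([-1.0, 2.0])
print(predict_logistic(w, np.array([1.0])))   # z = 1 > 0, so predict 1
```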
Perceptrons
This method learns a hyperplane separating two classes. Perceptrons form the building blocks of neural networks, such as single-layer feed-forward neural networks. They use the perceptron learning rule.
Squared Error
The perceptron learning rule uses this metric. squared_error = \sum_{i=1}^m (y_i - \hat{y}_i)^2
Perceptron Learning Rule
It utilises gradient descent. For each i \in {1,…,m} and j \in {1,…,n}, w_j \leftarrow w_j + \alpha (y_i - \hat{y}_i)x_{i,j}
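A minimal Python sketch of perceptron training with a threshold output and the update rule above, assuming X already includes a bias column of ones; the AND data is illustrative:

```python
import numpy as np

def train_perceptron(X, y, alpha=0.1, epochs=20):
    m, n = X.shape
    w = np.zeros(n)
    for _ in range(epochs):
        for i in range(m):                            # for each i in {1,...,m}
            y_hat_i = 1 if X[i] @ w > 0 else 0        # threshold output
            for j in range(n):                        # for each j in {1,...,n}
                w[j] += alpha * (y[i] - y_hat_i) * X[i, j]
    return w

# Example: learn the AND function (bias column of ones prepended)
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)
y = np.array([0, 0, 0, 1])
w = train_perceptron(X, y)
print([(1 if x @ w > 0 else 0) for x in X])           # [0, 0, 0, 1]
```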
Feed-Forward Multilayer Neural Network
These have inputs, hidden units, and outputs. Information flows in only one direction, from the inputs towards the outputs.
Neural Network Unit
Each node has input links and output links; the incoming values are combined by an input function, passed through an activation function, and emitted through an output function.
Sigmoid Activation Function
f(x) = \frac{1}{1+e^{-x}}
Hyperbolic Tangent (Tanh) Function
tanh(x) = \frac{2}{1 + e^{-2x}} - 1
Rectified Linear Unit (ReLU)
f(x) = 0 for x < 0, x for x \ge 0
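Minimal NumPy sketches of the three activation functions above (the sample input values are illustrative):

```python
import numpy as np

def sigmoid(x):
    # f(x) = 1 / (1 + e^{-x})
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # tanh(x) = 2 / (1 + e^{-2x}) - 1
    return 2.0 / (1.0 + np.exp(-2.0 * x)) - 1.0

def relu(x):
    # f(x) = 0 for x < 0, x for x >= 0
    return np.maximum(0.0, x)

x = np.array([-2.0, 0.0, 2.0])
print(sigmoid(x), tanh(x), relu(x))
```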
Activation Functions
Ideal properties include being nonlinear (so the model can generalise well), differentiable (so weights can be updated during training), and monotonic (for fast convergence)
Neural Networks
Generally, there’s a specific cost function that is minimised when training neural networks.
Important factors to consider include:
- Number of layers
- Number of nodes per layer
- Number of incoming links per node
- Activation Function
Pros:
1. Great at nonlinear transformations of input.
2. Highly parameterised and can model even small function irregularities.
3. Even a small number of hidden layers can sufficiently model any continuous function.
4. Can be optimised to reduce overfitting.
Cons:
1. Explainability is challenging, making it difficult to infer causal relationships.
2. Computationally expensive.
3. Needs a very large dataset to work properly.
Recurrent Network
This type of neural network feeds outputs back to its own inputs. Network activation levels form a dynamic system that may reach a steady state or show oscillations and potentially chaotic behaviour.
Backpropagation
Given a specific weight w,
w \leftarrow w + \Delta w = w - \eta \frac{\partial J}{\partial w}
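A minimal Python sketch of one such update for a single sigmoid unit with a squared-error cost J = \frac{1}{2}(y - \hat{y})^2, where \frac{\partial J}{\partial w} is obtained by the chain rule; this is an illustrative special case, not the full multilayer algorithm:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(w, x, y, eta=0.1):
    # Forward pass: \hat{y} = sigmoid(w^T x)
    y_hat = sigmoid(w @ x)
    # Chain rule: dJ/dw = (y_hat - y) * y_hat * (1 - y_hat) * x
    grad = (y_hat - y) * y_hat * (1.0 - y_hat) * x
    return w - eta * grad                 # w <- w - eta * dJ/dw

w = np.array([0.5, -0.5])
x = np.array([1.0, 2.0])
print(backprop_step(w, x, y=1.0))
```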
Lasso Regression / L1 Regularisation
It uses a diamond-shaped (L1-norm) constraint region, which can shrink the coefficients of irrelevant variables to exactly 0, effectively eliminating them.
Ridge Regression / L2 Regularisation
It uses a disk-shaped (L2-norm) constraint region to reduce the coefficients of irrelevant predictors, which approach 0 but do not become exactly 0.
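A minimal sketch contrasting the two using scikit-learn's Lasso and Ridge (assuming scikit-learn is available); the data is made up, with the second feature deliberately irrelevant, and the regularisation strengths are arbitrary:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=100)   # only feature 0 matters

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)
print(lasso.coef_)   # the irrelevant feature's coefficient is typically driven to exactly 0
print(ridge.coef_)   # the irrelevant feature's coefficient is small but nonzero
```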