Week 2: Regression & Classification (Linear & Nonlinear Models) Flashcards
Perceptron Training Rule
Linear classification models draw decision boundaries that divide the input space into regions, with each region representing its own class.
Error Rate for Classification Models
Error rate = 1 - \frac{1}{m}\sum_{i=1}^m score_i, where score_i = 1 for a correct classification and 0 for a misclassification
Error Rate for Regression Models
Error rate = \sum_{i=1}^m (y_i - \hat{y}_i)^2, i.e. the sum of squared differences between the true values and the predictions
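A minimal Python sketch of both error measures above, assuming the labels/values and predictions are NumPy arrays (the variable names are illustrative):

```python
import numpy as np

def classification_error_rate(y_true, y_pred):
    # score_i = 1 for a correct classification, 0 for a misclassification
    scores = (y_true == y_pred).astype(float)
    return 1.0 - scores.mean()

def regression_error(y_true, y_pred):
    # sum of squared differences between true and predicted values
    return np.sum((y_true - y_pred) ** 2)

print(classification_error_rate(np.array([0, 1, 1, 0]), np.array([0, 1, 0, 0])))  # 0.25
print(regression_error(np.array([1.0, 2.0]), np.array([1.5, 1.5])))               # 0.5
```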
Training Models
This applies to both classification and regression models; a code sketch of the full loop follows after the evaluation steps below.
Training:
1. Select the training set
2. Initialise model parameters
3. Apply the model to all training set instances
4. Compute the error rate
5. Adjust the parameters to obtain a model with a lower error rate
6. Repeat from step 3 until desirable error rate reached
7. Output the training error
Evaluation:
1. Select the test set
2. Apply the model to all test set instances
3. Compute the error rate
4. Output the evaluation error
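A minimal Python sketch of the training/evaluation loop above. The `model` object with `predict` and `adjust` methods, and the `error_fn` callback, are hypothetical placeholders, not a specific library API:

```python
def train(model, X_train, y_train, error_fn, target_error, max_iters=1000):
    # Steps 3-6: apply the model, compute the error, adjust, repeat
    error = error_fn(y_train, model.predict(X_train))
    for _ in range(max_iters):
        if error <= target_error:
            break
        model.adjust(X_train, y_train)                   # step 5: lower the error
        error = error_fn(y_train, model.predict(X_train))
    return error                                         # step 7: training error

def evaluate(model, X_test, y_test, error_fn):
    # Apply the trained model to the test set and report the evaluation error
    return error_fn(y_test, model.predict(X_test))
```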
Cross-validation
- Split the dataset into N approximately equal-sized folds.
- Perform N repetitions where one fold is used for testing and the remaining folds are used for training.
- Compute the error rate after each repetition (giving N error rates) and average the results to yield the overall error rate (see the sketch below).
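A minimal Python sketch of N-fold cross-validation, assuming X and y are NumPy arrays and reusing the hypothetical `train`/`evaluate` helpers sketched above, plus a `make_model` factory (also a placeholder):

```python
import numpy as np

def cross_validate(make_model, X, y, n_folds, error_fn, target_error):
    indices = np.arange(len(X))
    folds = np.array_split(indices, n_folds)          # N approximately equal folds
    errors = []
    for k in range(n_folds):
        test_idx = folds[k]
        train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        model = make_model()
        train(model, X[train_idx], y[train_idx], error_fn, target_error)
        errors.append(evaluate(model, X[test_idx], y[test_idx], error_fn))
    return np.mean(errors)                            # overall error rate
```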
Spurious Correlations
Just because a correlation exists does not mean there is a causal relationship between the variables.
Linear Regression
\hat{y}_i = x_i w = w_0 + \sum_{j=1}^n w_j x_{i,j}
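A minimal Python sketch of the prediction formula, assuming a weight vector w = (w_0, …, w_n) and a feature vector x_i = (x_{i,1}, …, x_{i,n}) (the numbers are illustrative):

```python
import numpy as np

def predict_linear(w, x_i):
    # \hat{y}_i = w_0 + sum_j w_j * x_{i,j}
    return w[0] + np.dot(w[1:], x_i)

w = np.array([1.0, 2.0, -0.5])      # w_0, w_1, w_2
x_i = np.array([3.0, 4.0])          # x_{i,1}, x_{i,2}
print(predict_linear(w, x_i))       # 1 + 2*3 - 0.5*4 = 5.0
```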
Nonlinear Regression
These can include interaction terms and polynomial terms. Ex. \hat{y}_i = w_0 + w_1 \cdot x_i + w_2 \cdot x_i^2
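A minimal Python sketch of fitting the quadratic example above with NumPy's polynomial fitting; the data is made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 50)
y = 1.0 + 2.0 * x + 0.5 * x**2 + rng.normal(scale=0.1, size=x.shape)

# Fit \hat{y} = w_0 + w_1*x + w_2*x^2 (np.polyfit returns the highest degree first)
w2, w1, w0 = np.polyfit(x, y, deg=2)
print(w0, w1, w2)   # approximately 1.0, 2.0, 0.5
```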
Underfitting
When the model doesn’t predict the training data well.
Overfitting
When the model fits the training data relatively well, but fails to generalise to unseen data.
Mean of Squared Errors
S(w) = \frac{1}{m} \sum_{i=1}^m (y_i - \hat{y}_i)^2. Linear regression models try to minimise the mean of squared errors.
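A minimal Python sketch of S(w) as a function of the true values and predictions (names are illustrative):

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    # S(w) = (1/m) * sum_i (y_i - \hat{y}_i)^2
    return np.mean((y_true - y_pred) ** 2)
```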
Gradient Descent
Initialise weights to 0 or to random values.
Until convergence is achieved:
for i \in {1,…,m}
for j \in {1,…,n}
w_j \leftarrow w_j + \alpha(y_i - \hat{y}_i)x_{i,j}
Termination criterion: \left\lvert S(w^k) - S(w^{k+1}) \right\rvert < \epsilon
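A minimal Python sketch of the loop above for linear regression, assuming a bias column of ones has already been prepended to X (so w_0 is w[0]) and that S(w) is the mean of squared errors; the example data is illustrative:

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, eps=1e-6, max_epochs=1000):
    m, n = X.shape
    w = np.zeros(n)                               # initialise weights to 0
    prev_S = np.inf
    for _ in range(max_epochs):                   # until convergence
        for i in range(m):                        # for i in {1,...,m}
            y_hat_i = X[i] @ w
            for j in range(n):                    # for j in {1,...,n}
                w[j] += alpha * (y[i] - y_hat_i) * X[i, j]
        S = np.mean((y - X @ w) ** 2)             # S(w)
        if abs(prev_S - S) < eps:                 # |S(w^k) - S(w^{k+1})| < eps
            break
        prev_S = S
    return w

# Example: fit y = 1 + 2x (bias column of ones prepended to X)
X = np.column_stack([np.ones(20), np.linspace(0, 1, 20)])
y = 1.0 + 2.0 * np.linspace(0, 1, 20)
print(gradient_descent(X, y, alpha=0.1))          # approximately [1.0, 2.0]
```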
Single-scan/On-line Algorithm
for i \in {1,…,m}:
repeat:
for j \in {1,…,n}:
w_j \leftarrow w_j + \alpha(y_i - \hat{y}_i) x_{i,j}
until S(w) isn’t significantly changed
This method updates the weights after each individual example. Other names include on-line approximation and stochastic gradient descent.
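A minimal Python sketch mirroring the single-scan pseudocode above (one pass over the examples, each processed repeatedly until S(w) stops changing significantly); the choice of S(w) as the mean of squared errors is an assumption for illustration:

```python
import numpy as np

def single_scan(X, y, alpha=0.01, eps=1e-6, max_inner=100):
    m, n = X.shape
    w = np.zeros(n)
    S = lambda w: np.mean((y - X @ w) ** 2)       # illustrative choice of S(w)
    for i in range(m):                            # single scan over the examples
        prev_S = np.inf
        for _ in range(max_inner):                # repeat until S(w) stabilises
            y_hat_i = X[i] @ w
            for j in range(n):
                w[j] += alpha * (y[i] - y_hat_i) * X[i, j]
            if abs(prev_S - S(w)) < eps:
                break
            prev_S = S(w)
    return w
```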
Logistic Regression
This method assumes binary classification. If \hat{y} \le 0.5, predict 0; if \hat{y} > 0.5, predict 1. \hat{y} = \frac{1}{1 + \exp[-(w_0 + w_1 x_1 + … + w_n x_n)]}. Equivalently, if w^T x > 0, return 1; if w^T x \le 0, return 0.
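A minimal Python sketch of the prediction step, assuming weights w = (w_0, …, w_n) and features x = (x_1, …, x_n) (the numbers are illustrative):

```python
import numpy as np

def predict_logistic(w, x):
    # \hat{y} = 1 / (1 + exp(-(w_0 + w_1*x_1 + ... + w_n*x_n)))
    z = w[0] + np.dot(w[1:], x)
    y_hat = 1.0 / (1.0 + np.exp(-z))
    return 1 if y_hat > 0.5 else 0    # equivalent to checking w_0 + w_1*x_1 + ... + w_n*x_n > 0

w = np.array([-1.0, 2.0])
print(predict_logistic(w, np.array([1.0])))   # z = 1 > 0, so predict 1
```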
Perceptrons
This method learns a hyperplane separating two classes. Perceptrons form the building blocks of neural networks, such as single-layer feed-forward neural networks. They use the perceptron learning rule.
Squared Error
The perceptron learning rule uses this metric. squared_error = \sum_{i=1}^m (y_i - \hat{y}_i)^2
Perceptron Learning Rule
It utilises gradient descent. For each i \in {1,…,m} and j \in {1,…,n}, w_j \leftarrow w_j + \alpha (y_i - \hat{y}_i)x_{i,j}
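A minimal Python sketch of perceptron training with a threshold output and the update rule above, assuming X already includes a bias column of ones; the AND data is illustrative:

```python
import numpy as np

def train_perceptron(X, y, alpha=0.1, epochs=20):
    m, n = X.shape
    w = np.zeros(n)
    for _ in range(epochs):
        for i in range(m):                            # for each i in {1,...,m}
            y_hat_i = 1 if X[i] @ w > 0 else 0        # threshold output
            for j in range(n):                        # for each j in {1,...,n}
                w[j] += alpha * (y[i] - y_hat_i) * X[i, j]
    return w

# Example: learn the AND function (bias column of ones prepended)
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)
y = np.array([0, 0, 0, 1])
w = train_perceptron(X, y)
print([(1 if x @ w > 0 else 0) for x in X])           # [0, 0, 0, 1]
```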
Feed-Forward Multilayer Neural Network
These have inputs, hidden units, and outputs. Information flows in only one direction, from the inputs towards the outputs.
Neural Network Unit
Each node has input links and output links; the incoming values are combined by an input function, passed through an activation function, and emitted through an output function.
Sigmoid Activation Function
f(x) = \frac{1}{1+e^{-x}}
Hyperbolic Tangent (Tanh) Function
tanh(x) = \frac{2}{1 + e^{-2x}} - 1
Rectified Linear Unit (ReLU)
f(x) = 0 for x < 0, x for x \ge 0
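Minimal NumPy sketches of the three activation functions above (the sample input values are illustrative):

```python
import numpy as np

def sigmoid(x):
    # f(x) = 1 / (1 + e^{-x})
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # tanh(x) = 2 / (1 + e^{-2x}) - 1
    return 2.0 / (1.0 + np.exp(-2.0 * x)) - 1.0

def relu(x):
    # f(x) = 0 for x < 0, x for x >= 0
    return np.maximum(0.0, x)

x = np.array([-2.0, 0.0, 2.0])
print(sigmoid(x), tanh(x), relu(x))
```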
Activation Functions
Ideal properties include being nonlinear (so the model can generalise well), differentiable (so weights can be updated during training), and monotonic (for fast convergence)
Neural Networks
Generally, there’s a specific cost function that is minimised when training neural networks.
Important factors to consider include:
- Number of layers
- Number of nodes per layer
- Number of incoming links per node
- Activation Function
Pros:
1. Great at nonlinear transformations of input.
2. Highly parameterised and can model even small function irregularities.
3. Even a small number of hidden layers can sufficiently model any continuous function.
4. Can be optimised to reduce overfitting.
Cons:
1. Explainability is challenging, making it difficult to infer causal relationships.
2. Computationally expensive.
3. Needs a very large dataset to work properly.
Recurrent Network
This type of neural network feeds outputs back to its own inputs. Network activation levels form a dynamic system that may reach a steady state or show oscillations and potentially chaotic behaviour.
Backpropagation
Given a specific weight w,
w \leftarrow w + \Delta w = w - \eta \frac{\partial J}{\partial w}
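A minimal Python sketch of one such update for a single sigmoid unit with a squared-error cost J = \frac{1}{2}(y - \hat{y})^2, where \frac{\partial J}{\partial w} is obtained by the chain rule; this is an illustrative special case, not the full multilayer algorithm:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(w, x, y, eta=0.1):
    # Forward pass: \hat{y} = sigmoid(w^T x)
    y_hat = sigmoid(w @ x)
    # Chain rule: dJ/dw = (y_hat - y) * y_hat * (1 - y_hat) * x
    grad = (y_hat - y) * y_hat * (1.0 - y_hat) * x
    return w - eta * grad                 # w <- w - eta * dJ/dw

w = np.array([0.5, -0.5])
x = np.array([1.0, 2.0])
print(backprop_step(w, x, y=1.0))
```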
Lasso Regression / L1 Regularisation
It uses a diamond-shaped (L1-norm) constraint region, which can shrink the coefficients of irrelevant variables to exactly 0, effectively eliminating them.
Ridge Regression / L2 Regularisation
It uses a disk-shaped (L2-norm) constraint region to reduce the coefficients of irrelevant predictors, which approach 0 but do not become exactly 0.
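A minimal sketch contrasting the two using scikit-learn's Lasso and Ridge (assuming scikit-learn is available); the data is made up, with the second feature deliberately irrelevant, and the regularisation strengths are arbitrary:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=100)   # only feature 0 matters

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)
print(lasso.coef_)   # the irrelevant feature's coefficient is typically driven to exactly 0
print(ridge.coef_)   # the irrelevant feature's coefficient is small but nonzero
```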