Topic 1: Loss functions Flashcards

Question 1

Q

What is a target

Answer

A

The output y that we want to predict from the input x

Question 2

Q

What is the basic model for how a sample is generated

Answer

A

y = ftrue(x) + ϵ

ϵ = noise, modelled by a gaussian distribution
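
As a rough illustration of the model above, samples can be simulated by adding Gaussian noise to a known function. The function f_true, the noise standard deviation 0.5 and the seed are assumptions made up for this sketch, not part of the original card.

```python
import random

def f_true(x):
    # Hypothetical "true" function, chosen only for illustration
    return 2 * x + 1

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Each sample is y = f_true(x) + noise, with noise drawn from a Gaussian
samples = [(x, f_true(x) + rng.gauss(0.0, 0.5)) for x in [0.0, 1.0, 2.0, 3.0]]
```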

Question 3

Q

What is k-nn regression

Answer

A

When a new x value is observed, to calculate y
Take the k nearest neighbours to x and average their y values
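
A minimal sketch of the averaging step, assuming one-dimensional inputs and absolute difference as the distance. The function name knn_regress and the toy data are hypothetical.

```python
def knn_regress(x_new, xs, ys, k):
    """Predict y for x_new as the mean y of the k nearest training points."""
    # Sort training indices by distance to x_new and keep the k closest
    nearest = sorted(range(len(xs)), key=lambda i: abs(xs[i] - x_new))[:k]
    return sum(ys[i] for i in nearest) / k

xs = [1.0, 2.0, 3.0, 10.0]
ys = [1.0, 2.0, 3.0, 10.0]
prediction = knn_regress(2.2, xs, ys, k=3)  # averages the y values at x = 2, 3 and 1
```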

Question 4

Q

What is an instance-based algorithm

Answer

A

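An algorithm that stores the training examples and predicts by comparing new inputs to them, rather than fitting an explicit model
Not good at generalising beyond the current scenario
e.g. k-nn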

Question 5

Q

What is “fine-tuned” the same as

Answer

A

Overfitting: the model is too complex and fits the training data too closely

Question 6

Q

What are the 6 main types of function approximations

Answer

A

linear/polynomial regression
support vector machines
neural networks (CNN and logistic regression)
naive bayes (probabilistic models)
decision trees (for both regression and classification)
ensemble models

Question 7

Q

How do the 6 main types of function approximations work

Answer

A

By minimising loss functions

Question 8

Q

What are the properties of overfitting

Answer

A

high accuracy on training data
captures noise
high testing errors

Question 9

Q

What are the properties of underfitting

Answer

A

too simple
high training and testing errors

Question 10

Q

What does x ∈ R^d mean

Answer

A

x is a real-valued feature vector of length d

Question 11

Q

What is the common loss function used for classification problems

Answer

A

Cross entropy loss

Question 12

Q

What is the formula for binary CE loss

Answer

A

l(y, f(x)) = -[y ln f(x) + (1-y) ln(1-f(x))]

where y ∈ {0,1} and f(x) ∈ (0,1)
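
The formula can be checked numerically with a small sketch. The function name binary_cross_entropy and the example probabilities are assumptions for illustration; p stands for the model output f(x).

```python
import math

def binary_cross_entropy(y, p):
    """Loss for a true label y in {0, 1} and a predicted probability p in (0, 1)."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

confident_right = binary_cross_entropy(1, 0.9)  # prediction agrees with the label
confident_wrong = binary_cross_entropy(1, 0.1)  # prediction disagrees with the label
```

The loss is small when the predicted probability agrees with the label and grows quickly as it moves towards the wrong class.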

Question 13

Q

What is the common loss function used for regression problems

Answer

A

Squared loss function

Question 14

Q

What is the formula for squared loss function

Answer

A

l(y, f(x)) = (y - f(x))^2

Question 15

Q

Where does the true label y always go

Answer

A

First, before the prediction f(x)
e.g. (y - f(x))

Question 16

Q

What is the formula for squared loss training error

Answer

A

ltrain(f) = 1/n Σ (yi - f(xi))^2
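
A short sketch of the training error as the mean of the per-sample squared losses. The predictor f and the three data points are hypothetical.

```python
def squared_loss(y, prediction):
    # True label first, then the prediction
    return (y - prediction) ** 2

def training_error(f, xs, ys):
    """Mean squared loss of predictor f over the n training pairs."""
    n = len(xs)
    return sum(squared_loss(y, f(x)) for x, y in zip(xs, ys)) / n

def f(x):
    # Hypothetical predictor used only for this example
    return 2 * x

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 7.0]
error = training_error(f, xs, ys)  # per-sample losses are 0, 0 and 1
```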

Question 17

Q

What does asterisk denote

Answer

A

Optimal solution or best value

Question 18

Q

What is the general equation for w*

Answer

A

argmin(w) [ltrain(f)]
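
To make the argmin concrete, here is a brute-force sketch that tries a grid of candidate values for a single weight w and keeps the one with the lowest training loss. The model f(x) = w * x, the grid and the data are assumptions for illustration; real models have far too many parameters for this, which is why iterative methods are used instead.

```python
def train_loss(w, xs, ys):
    # Squared loss training error for the hypothetical model f(x) = w * x
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / len(xs)

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

candidates = [i / 10 for i in range(41)]  # 0.0, 0.1, ..., 4.0
w_star = min(candidates, key=lambda w: train_loss(w, xs, ys))
```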

Question 19

Q

What is a probability simplex

Answer

A

A geometric object that represents all possible probability distributions over a finite set of outcomes
E.g. for a classification problem with 3 classes:
a triangle where each vertex corresponds to a class
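
A small sketch of what it means for a vector to lie on the simplex: every entry is non-negative and the entries sum to 1. The helper name in_simplex and the tolerance are hypothetical.

```python
def in_simplex(p, tol=1e-9):
    """True if p is a valid probability distribution over len(p) outcomes."""
    return all(v >= 0 for v in p) and abs(sum(p) - 1.0) <= tol

inside = in_simplex([0.2, 0.3, 0.5])   # a point inside the triangle
vertex = in_simplex([1.0, 0.0, 0.0])   # a vertex: all probability on one class
invalid = in_simplex([0.5, 0.6, -0.1]) # sums to 1 but has a negative entry
```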

Question 20

Q

What does ∈ (0,1) mean

Answer

A

The variable can take any value in this interval but not 0 and 1 themselves

Question 21

Q

What does ∈ {0,1} mean

Answer

A

The variable can take only the value 0 or the value 1
{0,1} is a set of two elements, not an interval (the closed interval including 0 and 1 is written [0,1])

Question 22

Q

What is the purpose of gradient descent

Answer

A

To optimise complex models by iteratively adjusting the parameters to minimise the loss function

Question 23

Q

Theoretically what parameter would we like to adjust to achieve global minimum

Answer

A

wj (each individual weight)
Plotting the loss against every wj to find the minimum is too computationally expensive
Instead we use iterative methods

Question 24

Q

What is the key principle of gradient descent when gradient is negative

Answer

A

increase the parameter wj

Question 25

Q

What is the key principle of gradient descent when gradient is positive

Answer

A

decrease the parameter wj

Question 26

Q

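What does nabla ∇ denote

Answer

A

The gradient: the vector of partial derivatives of the loss with respect to the parameters w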

Question 27

Q

What is the update rule

Answer

A

w ← w - η . ∇l(y, f(x))

Question 28

Q

What is η in the update rule

Answer

A

The learning rate (step size of algorithm)
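
The update rule can be sketched on a one-parameter toy problem. The loss l(w) = (w - 3)^2, the starting point and the value of η are assumptions chosen for illustration.

```python
def gradient(w):
    # Derivative of the hypothetical loss l(w) = (w - 3)^2
    return 2 * (w - 3)

eta = 0.1  # learning rate (step size)
w = 0.0
history = [w]
for _ in range(50):
    w = w - eta * gradient(w)  # w <- w - eta * gradient
    history.append(w)
```

At w = 0 the gradient is negative, so the first step increases w, and the iterates approach the minimum at w = 3.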

Question 29

Q

What is full batch gradient descent

Answer

A

The entire dataset is used to compute the loss function gradient
w ← w - η . 1/n Σ [∇l(yi, f(xi))]
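
A minimal sketch of one full batch step, assuming the model f(x) = w * x with squared loss. The helper names, the data and η = 0.05 are hypothetical.

```python
def grad_squared_loss(w, x, y):
    # d/dw of (y - w * x)^2 for the hypothetical model f(x) = w * x
    return -2 * x * (y - w * x)

def full_batch_step(w, xs, ys, eta):
    """One update using the average gradient over all n datapoints."""
    n = len(xs)
    avg_grad = sum(grad_squared_loss(w, x, y) for x, y in zip(xs, ys)) / n
    return w - eta * avg_grad

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]
w = 0.0
for _ in range(100):
    w = full_batch_step(w, xs, ys, eta=0.05)
```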

Question 30

Q

What is mini batch gradient descent

Answer

A

Randomly picks a sample S of m datapoints
w ← w - η . 1/m Σ [∇l(yi, f(xi))]
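
The same idea with a random sample S of m datapoints per step, again assuming the model f(x) = w * x with squared loss. The helper names, the seed, the data and η = 0.02 are hypothetical.

```python
import random

def grad_squared_loss(w, x, y):
    # d/dw of (y - w * x)^2 for the hypothetical model f(x) = w * x
    return -2 * x * (y - w * x)

def mini_batch_step(w, xs, ys, eta, m, rng):
    """One update using the average gradient over m randomly chosen datapoints."""
    batch = rng.sample(range(len(xs)), m)  # indices of the sample S
    avg_grad = sum(grad_squared_loss(w, xs[i], ys[i]) for i in batch) / m
    return w - eta * avg_grad

rng = random.Random(0)  # fixed seed so the sketch is repeatable
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w = 0.0
for _ in range(200):
    w = mini_batch_step(w, xs, ys, eta=0.02, m=2, rng=rng)
```

Calling mini_batch_step with m=1 uses a single random datapoint per update.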

Question 31

Q

What is stochastic gradient descent

Answer

A

At the extreme of mini batch where m=1
(the name is sometimes also used for mini batches with m > 1)

Question 32

Q

What does SGD help with

Answer

A

Tends to help with escaping local minima

Question 33

Q

What are SGD, GD and mini batch GD all examples of

Answer

A

first order gradient descent algorithms

Question 34

Q

What are first order vs second order gd algorithms

Answer

A

First order algorithms use only the gradient (first derivative) of the loss function
Second order algorithms use the gradient but also information about the curvature of the loss function (second derivative)
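
To see the difference, the sketch below compares one gradient descent step with one step of Newton's method, which is used here as one example of a second order method. The loss l(w) = (w - 3)^2 and the learning rate are assumptions for illustration.

```python
def grad(w):
    # First derivative of the hypothetical loss l(w) = (w - 3)^2
    return 2 * (w - 3)

def curvature(w):
    # Second derivative of the same loss (constant for a quadratic)
    return 2.0

w = 0.0
w_first_order = w - 0.1 * grad(w)             # fixed step size eta = 0.1
w_second_order = w - grad(w) / curvature(w)   # Newton step: scale by the curvature
```

On this quadratic loss the Newton step lands on the minimum at w = 3 in one update, while the first order step only moves part of the way.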

Question 35

Q

When and why are decision trees useful

Answer

A

Fast to train and deploy
good for tabular data (anything that fits a spreadsheet)
not images, speech, videos

Question 36

Q

How do decision trees work

Answer

A

Recursively splits the data into subsets based on values of input features
Then fits a simple model (constant label / linear regression) to each subset
When a new input x is observed, traverse the tree to find the right prediction
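
A sketch of a single split for a regression tree with one input feature: try each threshold between neighbouring x values and keep the one that gives the lowest total squared error when each side predicts its mean. The helper names and the data are hypothetical, and a full tree would apply this recursively to each side.

```python
def mean(values):
    return sum(values) / len(values)

def sse(values):
    # Sum of squared errors when predicting the mean of the subset
    m = mean(values)
    return sum((v - m) ** 2 for v in values)

def best_split(xs, ys):
    """Return (threshold, left_mean, right_mean) minimising total squared error."""
    best = None
    points = sorted(set(xs))
    for lo, hi in zip(points, points[1:]):
        t = (lo + hi) / 2  # candidate threshold halfway between neighbours
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        cost = sse(left) + sse(right)
        if best is None or cost < best[0]:
            best = (cost, t, mean(left), mean(right))
    return best[1:]

xs = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
ys = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
threshold, left_value, right_value = best_split(xs, ys)
```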

Question 37

Q

How do classification and regression trees compare

Answer

A

Use the same branching
eg X1 > 0.83 (yes or no)
then X4 < 0.3 … etc

Except regression trees predict a continuous value at the final node - usually the mean of the target variable within that node
for classification the node is a class eg class 3
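
The branching above can be sketched as a nested dictionary that is walked from the root to a leaf. The tree layout, the leaf values and the use of a dict keyed by feature index for x are assumptions for illustration.

```python
# Hypothetical regression tree using the example tests X1 > 0.83 and X4 < 0.3
tree = {
    "feature": 1, "op": ">", "threshold": 0.83,
    "yes": {"leaf": 2.5},
    "no": {
        "feature": 4, "op": "<", "threshold": 0.3,
        "yes": {"leaf": 0.7},
        "no": {"leaf": 1.4},
    },
}

def predict(node, x):
    """Follow the yes/no branches until a leaf is reached, then return its value."""
    while "leaf" not in node:
        value = x[node["feature"]]
        if node["op"] == ">":
            passed = value > node["threshold"]
        else:
            passed = value < node["threshold"]
        node = node["yes"] if passed else node["no"]
    return node["leaf"]

x = {1: 0.5, 4: 0.1}  # X1 = 0.5, X4 = 0.1
prediction = predict(tree, x)
```

For a classification tree the same traversal applies, with each leaf holding a class label instead of a mean value.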

Question 38

Q

What is the main parameter of decision trees

Answer

A

Depth
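controls complexity in a similar way to k for k-nn, but in the opposite direction
increased depth -> increased complexity (whereas increased k -> smoother, simpler predictions)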

Question 39

Q

What is the update rule updating/optimising

Answer

A

parameters w
Determines how the parameters should be adjusted in order to minimise the loss function
