Classification and regression Flashcards

1
Q

classification

A

predicts discrete class labels

2
Q

example of classification

A

labelling emails spam or ham

3
Q

decision tree classifier

A

flowchart-like structure in which each internal node represents a test on an attribute (e.g. whether a coin flip comes up heads or tails), each branch represents an outcome of the test, and each leaf node represents a class label; a tree-like model that makes decisions by splitting data into subsets based on feature values, creating branches that lead to outcomes (class labels)

4
Q

decision tree makes a sequence

A

of partitions of the training data, one attribute at a time

5
Q

probability in classification

A

probability determines the likelihood of each class label given a set of features
relates to confidence in predictions

6
Q

ordering in classification

A

attributes are selected and split on using a measure like information gain, creating an order of importance for features

7
Q

entropy

A

entropy is a measure of uncertainty or disorder in a system

8
Q

info entropy in classification

A

entropy measures how hard it is to guess the label of a randomly drawn sample from the dataset
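As an illustration (a minimal sketch in Python/NumPy, my own example rather than from the lecture), entropy can be computed from the class proportions of the labels:

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()          # class proportions
    return -np.sum(p * np.log2(p))     # H = -sum p_i * log2(p_i)

# A 50/50 split is maximally uncertain (1 bit); a pure node has entropy 0.
print(entropy(["spam", "ham", "spam", "ham"]))  # 1.0
print(entropy(["spam", "spam", "spam"]))        # 0.0
```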

9
Q

choose level with ___ entropy as ___

A

lowest
as the data labels are more uniform, so it is easier to guess them

10
Q

how is entropy used in data splits for decision trees?

A

decision trees use information gain, based on entropy, to decide the best feature to split the data on at each node
entropy is calculated before and after the split to determine how well a feature divides the data into pure subsets
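A hedged sketch (my own Python/NumPy illustration, function names are mine) of that idea: information gain is the entropy before the split minus the weighted entropy of the subsets after the split:

```python
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(labels, attribute_values):
    """Entropy before the split minus the weighted entropy after
    splitting on the given attribute values."""
    labels = np.asarray(labels)
    attribute_values = np.asarray(attribute_values)
    before = entropy(labels)
    after = 0.0
    for v in np.unique(attribute_values):
        subset = labels[attribute_values == v]
        after += len(subset) / len(labels) * entropy(subset)
    return before - after

# Hypothetical toy data: the attribute separates the classes perfectly,
# so the gain equals the parent entropy (1 bit).
print(information_gain(["spam", "spam", "ham", "ham"],
                       ["short", "short", "long", "long"]))  # 1.0
```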

11
Q

3 steps of entropy and data splits

A

1) partition examples recursively by choosing one attribute at a time
2) choose the attribute that best separates the classes of training examples
3) choose a goodness function (info gain, gain ratio, Gini index)

12
Q

3 attribute types

A

nominal (categorical values with no order like animal, food)
ordinal (categorical values that have order like hot, warm, cold)
numerical

13
Q

how do you handle a numerical attribute in a decision tree, and what are 3 ways to do it?

A

convert it to a nominal attribute
1) assign categories to the numerical values and keep trying until you find a good split
2) use the entropy value to search for the best split threshold
3) frequency binning

14
Q

attribute resulting in ____ info gain is selected for split

A

highest

15
Q

process of splitting the decision tree by attributes is continued recursively ____

A

building tree by splitting data using features that minimise uncertainty at each step

16
Q

Th is the

A

entropy threshold

17
Q

What is the purpose of Th

A

criterion for deciding when to stop splitting the data at a node or to continue

18
Q

When entropy of a node is below Th?

A

If the entropy of a node is below a certain threshold, it means that the data at that node is sufficiently pure (i.e., it mostly contains examples of one class). As a result, the decision tree can stop splitting further at that node, and the node is labeled with the majority class

19
Q

When entropy of a node is above Th?

A

If the entropy is above the threshold, it indicates that the data at the node is still impure, meaning there’s a mix of different class labels. In this case, the decision tree continues splitting by choosing the attribute that reduces entropy the most (maximizing information gain)

20
Q

only use Th=0 when

A

example is really simple

21
Q

Th=0, Th>0

A

Th = 0: require perfect order (pure nodes)
Th > 0: can tolerate some mixed labels

22
Q

avoid overfitting by using 1) and 2) and 3)

A

entropy threshold
pruning
limit depth of tree

23
Q

gain ratio formula

A

gain ratio(A) = information gain(A) / split information(A), where split information is the entropy of the partition created by A's values

24
Q

want big or small gain ratio and why?

A

large; the denominator (split information) shrinks the gain ratio of attributes with many values, which prevents selecting attributes that overfit the model by using many small, specific splits

25
Q

gini index doesn’t rely on

A

entropy; it relies only on class proportions
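For comparison, a small sketch (my own illustration, function names are mine) of the Gini index computed purely from class proportions:

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print(gini(["spam", "ham", "spam", "ham"]))  # 0.5 (maximally impure for 2 classes)
print(gini(["spam", "spam"]))                # 0.0 (pure node)
```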

26
Q

when you would use info gain as goodness function?

A

imbalanced dataset

27
Q

when you would use gain ratio as goodness function?

A

imbalanced dataset
attributes with many values (high branching)

28
Q

when you would use gini index as goodness function?

A

binary classification

29
Q

rank fastest to slowest for goodness function evaluation?

A

fastest is Gini, middle is information gain, slowest is gain ratio

30
Q

perceptron is an

A

artificial neuron
fundamental unit in neural networks modelled after biological neuron
activity is weighted sum of its inputs + bias term passed through an activation function to produce output
adjusting weights allows neuron to learn
choice of activation function determines type of computation the neuron performs

31
Q

single neuron vs multiple computation ability wise

A

single neuron can only do simple computations but many connected in a large network can deliver any function mapping

32
Q

what is the activation function symbol

A

like a hook

33
Q

activation function

A

determines the output of a neuron based on the weighted sum of inputs
introduces non-linearity to make the model network capable of learning more complex patterns
e.g sigmoid, relu, softmax

34
Q

weights

A

coefficients that adjust the influence of certain input attributes on the output

35
Q

bias

A

threshold value added to the sum of weighted inputs to shift the activation function’s output

36
Q

why is the bias helpful?

A

shifts the decision boundary away from the origin, making the model more flexible; without it, the decision boundary would always pass through the origin

37
Q

half space

A

space divided by hyperplane which classifies the data points based on which side they fall on

38
Q

one hot encoding

A

converts categorical variables into a numerical format that machine learning models can use
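A minimal sketch (my own NumPy illustration; libraries such as scikit-learn's OneHotEncoder or pandas.get_dummies do the same job) of one-hot encoding a categorical column:

```python
import numpy as np

def one_hot(values):
    """Map each category to a binary indicator vector."""
    categories = sorted(set(values))                    # fixed column order
    index = {c: i for i, c in enumerate(categories)}
    encoded = np.zeros((len(values), len(categories)))
    for row, v in enumerate(values):
        encoded[row, index[v]] = 1.0
    return categories, encoded

cats, X = one_hot(["cat", "dog", "cat", "bird"])
print(cats)  # ['bird', 'cat', 'dog']
print(X)     # rows: [0,1,0], [0,0,1], [0,1,0], [1,0,0]
```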

39
Q

binary classification vs multi class classification

A

binary classifies data into 1 of 2 classes
multi classifies data into 1 of many classes

40
Q

how does multi class classification work?

A

uses K neurons and trains each to separate one class from all others (one-vs-rest)

41
Q

how is training done on a perceptron?

A

iteratively updating weights in a way that minimises error function (difference between actual and predicted output)

42
Q

how is a perceptron trained?

A

the goal is to learn a set of weights that allows the perceptron to correctly classify input data
trained through supervised learning
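A hedged sketch (my own illustration, assuming the classic perceptron update rule with a step activation) of supervised perceptron training: for each sample, the weights move by learning_rate × (target − prediction) × input.

```python
import numpy as np

def train_perceptron(X, y, lr=0.1, epochs=20):
    """Supervised training of a single perceptron with a step activation."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, target in zip(X, y):
            pred = 1 if xi @ w + b > 0 else 0   # weighted sum + bias, then step
            error = target - pred
            w += lr * error * xi                # update only when wrong
            b += lr * error
    return w, b

# Linearly separable toy data (logical AND), so convergence is guaranteed.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])
w, b = train_perceptron(X, y)
print([1 if x @ w + b > 0 else 0 for x in X])   # [0, 0, 0, 1]
```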

43
Q

a perceptron model is guaranteed to converge if

A

1) the learning parameter (learning rate) is small enough
2) the classification problem is linearly separable

44
Q

learning rate

A

controls magnitude of weight updates during training

45
Q

a too high learning rate can mean

A

convergence can be faster but it can lead to potentially unstable training

46
Q

epoch is a

A

one complete pass of the entire training data through the algorithm

47
Q

online training

A

one update per training sample
N updates per epoch

48
Q

batch learning

A

average updates from all training samples
one update per epoch

49
Q

mini batch learning

A

dividing the data into fixed-size batches, averaging updates over each mini-batch, and shuffling the data assigned to mini-batches between epochs
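A small sketch (my own NumPy illustration) of the mini-batch pattern: reshuffle each epoch, then slice into fixed-size batches and apply one averaged update per batch.

```python
import numpy as np

def minibatch_epochs(X, y, batch_size=32, epochs=5, seed=0):
    """Yield (X_batch, y_batch) pairs, reshuffling the data every epoch."""
    rng = np.random.default_rng(seed)
    n = len(X)
    for _ in range(epochs):
        order = rng.permutation(n)              # new shuffle each epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            yield X[idx], y[idx]                # one (averaged) update per batch

# Example: 100 samples, batch size 32 -> 4 batches per epoch (last one smaller).
X, y = np.arange(100).reshape(100, 1), np.arange(100)
for Xb, yb in minibatch_epochs(X, y, epochs=1):
    print(Xb.shape)   # (32, 1), (32, 1), (32, 1), (4, 1)
```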

50
Q

limit of perceptron model and how is it overcome?

A

linear separation: a simple single perceptron can only solve linearly separable problems, so it can't handle problems like XOR
overcome by more complex neural networks such as the MLP

51
Q

MLP is

A

a feed-forward artificial neural network that generates a set of outputs from a set of inputs, with multiple hidden layers that allow it to model more complex problems and capture non-linear patterns

52
Q

MLP structure

A

input layer
each neuron in the input layer corresponds to one feature in the input data

hidden layers
can have 1 or more
Each neuron in a hidden layer is connected to every node in the previous and next layer (fully connected)
Neurons in the hidden layers apply non-linear activation functions to the weighted sum of inputs and produce an output
learn abstract features from input data

output layer
provides the final prediction of the model

53
Q

number of neurons in the input layer is the number of

A

features

54
Q

how does a MLP learn

A

adjusting the weights between neurons to minimize error and is able to capture complex patterns in data due to its layered architecture

55
Q

SLP

A

one layer of neurons
can only solve linearly separable problems

56
Q

3 advantages to MLP

A

can solve non-linearly separable problems
can model more complex decision boundaries and patterns in the data by stacking multiple layers of neurons
can learn hierarchical features where each hidden layer captures different levels of abstraction from the data

57
Q

universal function approximation

A

a model that can approximate any continuous function given enough neurons and layers (technically even an MLP with a single hidden layer qualifies)

58
Q

ufa in classification and regression

A

can produce desired class labelling on any data

hypothesis that fits any data with arbitrarily small MSE

59
Q

3 non linear activation functions

A

relu
sigmoid
tanh

60
Q

relu (what, computation expense, pros and cons)

A

outputs the input directly if it is positive, and outputs 0 otherwise
no exponent, so computationally efficient
pros: most widely used, no vanishing/exploding gradient for positive inputs
cons: output is not 0-centred, incorrect mapping for negative values, dead ReLU

61
Q

what is dead relu and what’s an attempted solution?

A

large gradient updates can push the weights/bias to large negative values, so the neuron's input stays negative and its output is always 0
difficult to recover from, as the gradient in the 0 region is 0
leaky ReLU is an attempted fix: if the value is positive it outputs the value itself, otherwise it outputs a × value (a small slope a for negative inputs)

62
Q

sigmoid (what, computation expense, pros and cons) and what formula and graph roughly looks like

A

predicts a probability, as it squashes a real value to between 0 and 1
exponent, so computationally expensive
pros: guarantees the gradient cannot grow past a certain bound
cons: gradient is bounded so can get vanishing gradient; outputs are not 0-centred, so all weight updates for a neuron have the same sign during training
σ(v) = 1 / (1 + e^(−v))
S-shaped curve from bottom to top, with the bottom asymptote at 0

63
Q

tanh (what, computation expense, pros and cons) and what formula roughly looks like

A

squashes values to between -1 and 1
exponent, so computationally expensive
gradient is steeper than sigmoid
pros: outputs are 0-centred, which means faster learning
cons: gradient is bounded so vanishing gradient
formula has the most exponential terms: tanh(v) = (e^v − e^(−v)) / (e^v + e^(−v)); S-shaped curve like sigmoid but with the bottom asymptote at −1 instead of 0
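A compact NumPy sketch (my own illustration) of the three activation functions and their output ranges:

```python
import numpy as np

def relu(v):
    return np.maximum(0.0, v)           # [0, inf): passes positives, zeros negatives

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))     # (0, 1): not zero-centred

def tanh(v):
    return np.tanh(v)                   # (-1, 1): zero-centred, steeper than sigmoid

v = np.array([-2.0, 0.0, 2.0])
print(relu(v))     # [0.    0.    2.   ]
print(sigmoid(v))  # [0.119 0.5   0.881]  (approx.)
print(tanh(v))     # [-0.964 0.    0.964] (approx.)
```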

64
Q

which activation function to use?

A

use reLu
binary classification use sigmoid for output layer
multi class classification use softmax for output layer

65
Q

3 weight initialisation techniques

A

set all weights = 0: the neural network acts as a linear model

choose randomly: can lead to vanishing/exploding gradients, but is OK with ReLU

heuristic: multiply the random weights by some scaling value to avoid vanishing/exploding gradients

66
Q

softmax

A

converts vector of K real numbers into a probability distribution of K possible outcomes
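A small sketch (my own illustration) of a numerically stable softmax:

```python
import numpy as np

def softmax(z):
    """Turn a vector of K real scores into K probabilities that sum to 1."""
    z = z - np.max(z)              # shift for numerical stability (result unchanged)
    e = np.exp(z)
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))  # approx. [0.659, 0.242, 0.099], sums to 1
```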

67
Q

backpropogation

A

an algorithm used to train an MLP by computing the gradient of the loss function with respect to each weight in the network, systematically propagating the error backwards from the output to all preceding layers

68
Q

forward pass back propagation

A

input passes through the network layer by layer and output is computed

69
Q

loss calculation in back propagation

A

compares ŷ (predicted) with y (actual) and computes the error

70
Q

backward pass back propagation

A

the error flows backward through the network layer by layer, computing gradients (partial derivatives of the loss function with respect to each weight) using the chain rule
update the weights by moving in the opposite direction of the gradient
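A hedged sketch (my own illustration, not the lecture's notation) of one forward pass, backward pass and weight update for a tiny one-hidden-layer network with sigmoid activations and squared-error loss:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

rng = np.random.default_rng(0)
x, y = np.array([0.5, -1.0]), np.array([1.0])       # one training sample
W1, b1 = rng.normal(size=(3, 2)), np.zeros(3)       # hidden layer: 2 -> 3
W2, b2 = rng.normal(size=(1, 3)), np.zeros(1)       # output layer: 3 -> 1
lr = 0.1

# Forward pass: layer by layer.
h = sigmoid(W1 @ x + b1)
y_hat = sigmoid(W2 @ h + b2)

# Backward pass: chain rule, propagating the error from output to hidden layer.
delta_out = (y_hat - y) * y_hat * (1 - y_hat)       # dLoss/d(output pre-activation)
delta_hid = (W2.T @ delta_out) * h * (1 - h)        # dLoss/d(hidden pre-activation)

# Update: move each weight opposite to its gradient.
W2 -= lr * np.outer(delta_out, h); b2 -= lr * delta_out
W1 -= lr * np.outer(delta_hid, x); b1 -= lr * delta_hid

print(((y_hat - y) ** 2).sum())   # loss before the update
```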

71
Q

back propagation allows

A

network to learn from its mistakes by adjusting weights and biases based on error

72
Q

steepest gradient decent

A

optimization method that uses the gradients computed by backpropagation to update the weights, aiming to minimize the loss function over time

73
Q

choice of output layer act func, activation and loss function for MLP for classification

A

output layer: softmax for multi-class classification, sigmoid (or tanh) for binary (softmax takes too long for binary)
loss function: cross entropy
hidden-layer activation function: ReLU

74
Q

choice of output layer act func, activation and loss function for MLP for regression

A

output layer: linear
loss function: MSE (more sensitive to outliers) or MAE
activation function: ReLU or tanh in hidden layers

75
Q

impact of architecture on complexity and capability of MLP

A

more hidden layers allows MLP to model more complex relationships but also increase risk of overfitting

76
Q

5 steps of supervised learning and MLP

A

1) examine the problem
- how many inputs
- how many outputs
- what type of output is desired (classification or regression)
2) decide the number of hidden layers / neurons per layer
- usually related to the number of attributes
3) decide the activation functions of the hidden layers
4) for each hidden and output layer
- initialise the weight matrices (not all 0)
- initialise the bias matrices (all 0 is OK)
5) train the network on training data and test performance on test data

77
Q

example of MLP and requirements

A

MLP with one hidden layer is classified as a universal function approximator with enough neurons in the hidden layer and a sensible non-linear activation function

78
Q

filtering and expanding

A

filtering is finding only key features and expanding is feature representation

79
Q

convert pixels to

A

greyscale, as we don’t want the model to learn the colours

80
Q

logistic regression

A

has 2 modifications to the normal regression model to make it suitable for a binary classification problem where y ∈ {0, 1}
the output of the regression model is passed through a sigmoid function to convert it to a continuous value between 0 and 1, which is the probability that ŷ belongs to class 1

81
Q

how does the sigmoid function work with logistic regression

A

the output is passed through a sigmoid function which converts the number into a probability between 0 and 1: the probability of belonging to class 1
that probability is then passed through a hard limiting function: if P > 0.5 predict class 1, otherwise predict class 0
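A minimal sketch (my own illustration, with made-up weights just to show the call) of the sigmoid-plus-threshold step:

```python
import numpy as np

def predict_logistic(x, w, b):
    """Linear model -> sigmoid -> probability of class 1 -> hard threshold."""
    z = x @ w + b                      # ordinary regression output
    p = 1.0 / (1.0 + np.exp(-z))       # squash to (0, 1): P(y = 1 | x)
    return p, int(p > 0.5)             # class 1 if probability exceeds 0.5

# Hypothetical weights, just to show the call.
p, label = predict_logistic(np.array([1.0, 2.0]), np.array([0.8, -0.3]), b=0.1)
print(p, label)   # approx. 0.574, 1
```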

82
Q

cross entropy loss

A

measures difference between predicted probability distribution and actual class labels
penalises incorrect classifications more heavily when model is confident in wrong prediction
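A small sketch (my own illustration) of binary cross-entropy, showing how a confident wrong prediction is penalised far more heavily than a mildly wrong one:

```python
import numpy as np

def binary_cross_entropy(y, p, eps=1e-12):
    """-[y log p + (1 - y) log(1 - p)], averaged over samples."""
    p = np.clip(p, eps, 1 - eps)          # avoid log(0)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

y = np.array([1.0])
print(binary_cross_entropy(y, np.array([0.6])))   # mildly wrong: ~0.51
print(binary_cross_entropy(y, np.array([0.01])))  # confidently wrong: ~4.61
```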

83
Q

feature space

A

set of all possible values for a chosen set of features from chosen data
decision boundaries drawn in this feature space to separate classes based on the features

84
Q

degree of polynomial and model complexity

A

increasing the degree of the polynomial increases the flexibility of the model, but it could overfit

85
Q

steepest gradient descent

A

optimization technique that iteratively adjusts model parameters by following the steepest descent direction of the loss surface, aiming to minimize the error

86
Q

steepest gradient descent only works with

A

continuous loss function

87
Q

optimisation in ML

A

goal is to find set of parameters that minimises loss function
similar to parameter search through a space of possible parameter values
optimisation allows algorithm to learn and adapt

88
Q

loss surface

A

given a fixed dataset, the loss J evaluated on a hypothesis h(x, w) for each choice of parameters w defines a function J(w) over the space of all possible parameters; the manifold J(w) traces out in parameter space is the loss surface

89
Q

linear in parameters and not

A

the parameters (weights) have a linear relationship with the output, even if the inputs are transformed non-linearly
ŷ = w1·g1 + w2·g2 and also ŷ = w1·sin(x) are linear in parameters, but ŷ = sin(x1·w1) is not, as w1 has a non-linear (sine) relationship to ŷ

90
Q

convex loss function

A

has one minimum (global) and no local minima
guarantees gradient based optimization methods will always find best solution

91
Q

global minima

A

point on the loss surface where the loss is at its absolute minimum

92
Q

local minima

A

point on the loss surface where the loss is lower than at neighbouring points but not necessarily the lowest overall

93
Q

limitations to SGD

A

can get stuck in local minima for non-convex functions
if the gradient is near 0, the descent slows down significantly
finding the right learning rate is critical to performance

94
Q

GA for learning parameters

A

genetic algorithm is a search heuristic
reflects the process of natural selection
can be used to learn the parameters of a model, especially when the loss surface is non-convex or SGD struggles with local minima

95
Q

regression

A

predicting numerical values

96
Q

linear regression

A

models the relationship between a dependent variable (output) and 1 or more independent variables (inputs) using a straight line
uses a model hypothesis that is a weighted sum of the inputs

97
Q

goal of regression

A

goal is to find optimal weights that minimise differences between predicted values and actual values using loss function

98
Q

MSE and formula

A

average squared difference between predicted and actual values
J_MSE = (1/N) · Σ (yᵢ − ŷᵢ)²
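A one-line sketch (my own illustration) of the formula above:

```python
import numpy as np

def mse(y, y_hat):
    """Mean squared error: average of (y - y_hat)^2 over all samples."""
    y, y_hat = np.asarray(y), np.asarray(y_hat)
    return np.mean((y - y_hat) ** 2)

print(mse([3.0, 5.0, 2.0], [2.5, 5.0, 4.0]))  # (0.25 + 0 + 4) / 3 = 1.4166...
```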

99
Q

why do we care about MSE when using it for our model?

A

minimising MSE increases accuracy of regression model by decreasing prediction error

100
Q

least squares fit

A

maths procedure for finding the best-fitting curve to a given set of points by minimising the sum of squared residuals
weights are adjusted iteratively, e.g. using steepest gradient descent, to find the optimal values that minimise the loss