Classification and regression Flashcards
classification
predicts discrete class labels
example of classification
labelling emails spam or ham
decision tree classifier
flowchart-like structure in which each internal node represents a “test” on an attribute (e.g. whether a coin flip comes up heads or tails), each branch represents an outcome of the test, and each leaf node represents a class label; a tree-like model that makes decisions by splitting the data into subsets based on feature values, creating branches that lead to outcomes (class labels)
decision tree makes a sequence
of partitions of the training data, one attribute at a time
probability in classification
probability helps determine the likelihood of each class label given a set of features
relates to confidence in predictions
ordering in classification
attributes are selected and split on using a measure like information gain, creating an order of importance for the features
entropy
entropy is a measure of uncertainty or disorder in a system
info entropy in classification
entropy measures how hard it is to guess the label of a randomly taken sample from dataset
choose the split with ___ entropy as ___
lowest
as the data labels are more uniform, so it is easier to guess them
how is entropy used in data splits for decision trees?
decision trees use information gain based on entropy to decide best feature to split the data at each node
entropy is calculated before and after the split to determine how well a feature divides the data into pure subsets
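A minimal sketch of how this looks in code (not from the cards; assumes numpy, and the toy label arrays are made up):

```python
import numpy as np

def entropy(labels):
    # H = -sum(p * log2(p)) over the class proportions
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent_labels, child_label_groups):
    # entropy before the split minus the weighted entropy after it
    n = len(parent_labels)
    weighted_child = sum(len(g) / n * entropy(g) for g in child_label_groups)
    return entropy(parent_labels) - weighted_child

# toy example: splitting on some attribute separates the classes perfectly
parent = np.array(["spam", "spam", "ham", "ham", "ham", "spam"])
split = [np.array(["spam", "spam", "spam"]), np.array(["ham", "ham", "ham"])]
print(entropy(parent))                   # 1.0 (maximally mixed)
print(information_gain(parent, split))   # 1.0 (perfect split)
```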
3 steps of entropy and data splits
1) partition the examples recursively by choosing one attribute at a time
2) choose the attribute that best separates the classes of the training examples
3) choose a goodness function (information gain, gain ratio, Gini index)
3 attribute types
nominal (categorical values with no order like animal, food)
ordinal (categorical values that have order like hot, warm, cold)
numerical
how do you handle a numerical attribute in a decision tree, and what are 3 ways to do it?
convert to a nominal attribute
1) assign categories to the numerical values and keep trying until you find a good split
2) use the entropy value to search for the best split threshold
3) frequency binning
attribute resulting in ____ info gain is selected for split
highest
process of splitting the decision tree by attributes is continued recursively ____
building tree by splitting data using features that minimise uncertainty at each step
Th is the
entropy threshold
What is the purpose of Th
criterion for deciding when to stop splitting the data at a node or to continue
When entropy of a node is below Th?
If the entropy of a node is below a certain threshold, it means that the data at that node is sufficiently pure (i.e., it mostly contains examples of one class). As a result, the decision tree can stop splitting further at that node, and the node is labeled with the majority class
When entropy of a node is above Th?
If the entropy is above the threshold, it indicates that the data at the node is still impure, meaning there’s a mix of different class labels. In this case, the decision tree continues splitting by choosing the attribute that reduces entropy the most (maximizing information gain)
only use Th=0 when
example is really simple
Th=0, Th>0
Th=0 requires perfect order (pure nodes)
Th>0 can tolerate some mixed labels
avoid overfitting by using 1), 2) and 3)
entropy threshold
pruning
limit depth of tree
gain ratio formula
gain ratio(A) = information gain(A) / (number of values of A × entropy of A)
want big or small gain ratio and why?
big: the attribute with the highest gain ratio is selected; the denominator penalises attributes with many values, which prevents selecting attributes that overfit the model through many small, specific splits
gini index doesn’t rely on
entropy, only on class proportions
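Hedged sketch comparing the three goodness functions on one toy split (assumes numpy; note the gain ratio here divides by the split information, the common textbook form, which may differ slightly from the formula on the card above):

```python
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gini(labels):
    # Gini uses only class proportions, no logarithm
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def split_scores(parent, children):
    n = len(parent)
    weights = np.array([len(c) / n for c in children])
    info_gain = entropy(parent) - sum(w * entropy(c) for w, c in zip(weights, children))
    split_info = -np.sum(weights * np.log2(weights))     # penalises many-valued attributes
    gain_ratio = info_gain / split_info if split_info > 0 else 0.0
    gini_after = sum(w * gini(c) for w, c in zip(weights, children))
    return info_gain, gain_ratio, gini_after

parent = np.array([1, 1, 1, 0, 0, 0, 0, 1])
children = [np.array([1, 1, 1, 1]), np.array([0, 0, 0, 0])]
print(split_scores(parent, children))   # perfect split: (1.0, 1.0, 0.0)
```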
when would you use info gain as the goodness function?
imbalanced dataset
when would you use gain ratio as the goodness function?
imbalanced dataset
high-branching attributes (attributes with many values)
when would you use the Gini index as the goodness function?
binary classification
rank fastest to slowest for goodness function evaluation?
fastest is Gini, middle is information gain, slowest is gain ratio
perceptron is an
artificial neuron
fundamental unit in neural networks modelled after biological neuron
its activity is the weighted sum of its inputs plus a bias term, passed through an activation function to produce the output
adjusting weights allows neuron to learn
choice of activation function determines type of computation the neuron performs
single neuron vs multiple computation ability wise
single neuron can only do simple computations but many connected in a large network can deliver any function mapping
what is the activation function symbol
like a hook
activation function
determines the output of a neuron based on the weighted sum of inputs
introduces non-linearity to make the network capable of learning more complex patterns
e.g. sigmoid, ReLU, softmax
weights
coefficients that adjust the influence of certain input attributes on the output
bias
threshold value added to the sum of weighted inputs to shift the activation function’s output
why is the bias helpful?
shifts the decision boundary away from the origin, making the model more flexible; without it, the decision boundary would always pass through the origin
half space
space divided by hyperplane which classifies the data points based on which side they fall on
one hot encoding
converts categorical data variables into a numerical format that machine learning models can use
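A tiny illustration of one-hot encoding (not from the cards; assumes numpy, and the category values are made up):

```python
import numpy as np

categories = ["hot", "warm", "cold"]          # made-up categorical values
data = ["cold", "hot", "hot", "warm"]

index = {c: i for i, c in enumerate(categories)}
one_hot = np.zeros((len(data), len(categories)))
for row, value in enumerate(data):
    one_hot[row, index[value]] = 1.0          # exactly one 1 per row

print(one_hot)
```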
binary classification vs multi class classification
binary classifies data into 1 of 2 classes
multi-class classifies data into 1 of many classes
how does multi class classification work?
uses K neurons and trains each one to separate one class from all the others
how is training done on a perceptron?
iteratively updating weights in a way that minimises error function (difference between actual and predicted output)
how is a perceptron trained?
the goal is to learn a set of weights that allows the perceptron to correctly classify input data
trained through supervised learning
a perceptron model is guaranteed to converge if
1) the learning rate is small enough
2) the classes are linearly separable
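A sketch of the perceptron training rule under those conditions (not from the cards; assumes numpy, and the AND-gate data, learning rate and epoch count are made-up choices):

```python
import numpy as np

# AND gate: linearly separable, so the perceptron rule converges
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])

w = np.zeros(2)
b = 0.0
lr = 0.1                                      # small learning rate

for epoch in range(20):
    for xi, target in zip(X, y):
        pred = 1 if xi @ w + b > 0 else 0     # step activation
        error = target - pred
        w += lr * error * xi                  # nudge weights to reduce the error
        b += lr * error

print(w, b)                                    # parameters of the separating hyperplane
```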
learning rate
controls magnitude of weight updates during training
a too high learning rate can mean
computation is faster but can lead to potentially unstable training
epoch is a
one complete pass of the training data through the algorithm
online training
one update per training sample
N updates per epoch
batch learning
average updates from all training samples
one update per epoch
mini batch learning
dividing the data into fixed-size mini-batches, averaging the updates over each mini-batch, and shuffling which data goes into which mini-batch between epochs
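A sketch of the three update schedules in one loop (not from the cards; assumes numpy, and the linear-model gradient and all sizes are made-up stand-ins):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=100)

def gradient(w, Xb, yb):
    # gradient of MSE for a linear model on one batch
    return 2 * Xb.T @ (Xb @ w - yb) / len(yb)

w = np.zeros(3)
lr = 0.05
batch_size = 10

for epoch in range(50):
    order = rng.permutation(len(X))           # reshuffle mini-batch assignment each epoch
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        w -= lr * gradient(w, X[idx], y[idx])  # one update per mini-batch
        # online learning: batch_size = 1  ->  N updates per epoch
        # batch learning:  batch_size = N  ->  one averaged update per epoch

print(w)                                       # close to true_w
```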
limit of perceptron model and how is it overcome?
linear separation
simple single perceptron can only solve linearly separable problems so can’t handle XOR problems
overcome by combining perceptrons into more complex neural networks (MLPs)
MLP is
feed-forward artificial neural network that generates a set of outputs from a set of inputs, with multiple hidden layers that allow it to model more complex problems and capture non-linear patterns
MLP structure
input layer
each neuron in the input layer corresponds to one feature in the input data
hidden layers
can have 1 or more
Each neuron in a hidden layer is connected to every node in the previous and next layer (fully connected)
Neurons in the hidden layers apply non-linear activation functions to the weighted sum of inputs and produce an output
learn abstract features from input data
output layer
provides the final prediction of the model
number of neurons in the input layer is the number of
features
how does a MLP learn
adjusting the weights between neurons to minimize error and is able to capture complex patterns in data due to its layered architecture
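A minimal forward pass through an MLP with one hidden layer (not from the cards; assumes numpy, with made-up layer sizes and random weights, ReLU hidden activation and a sigmoid output chosen as one example setup):

```python
import numpy as np

rng = np.random.default_rng(1)

def relu(z):
    return np.maximum(0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# made-up sizes: 4 input features, 8 hidden neurons, 1 output (binary classification)
W1 = rng.normal(scale=0.5, size=(4, 8)); b1 = np.zeros(8)
W2 = rng.normal(scale=0.5, size=(8, 1)); b2 = np.zeros(1)

x = rng.normal(size=(1, 4))                   # one sample with 4 features

h = relu(x @ W1 + b1)                         # hidden layer: weighted sum + non-linearity
y_hat = sigmoid(h @ W2 + b2)                  # output layer: probability of class 1
print(y_hat)
```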
SLP
one layer of neurons
can only solve linearly separable problems
3 advantages to MLP
can solve non-linearly separable problems
can model more complex decision boundaries and patterns in the data by stacking multiple layers of neurons
can learn hierarchical features where each hidden layer captures different levels of abstraction from the data
universal function approximation
technically, an MLP with a single hidden layer is a
model that can approximate any continuous function given enough neurons and a suitable non-linear activation function
ufa in classification and regression
can produce desired class labelling on any data
a hypothesis that fits any data with arbitrarily small MSE
3 non linear activation functions
relu
sigmoid
tanh
relu (what, computation expense, pros and cons)
outputs the input directly if it is positive, otherwise outputs 0
no exponent, so computationally efficient
pros: most widely used, no vanishing/exploding gradient for positive inputs
cons: output is not zero-centred, incorrect mapping for negative values, dead ReLU
what is dead relu and what’s an attempted solution?
large gradient updates can push the weights/bias to large negative values, so the neuron's input stays negative and its output is stuck at 0
difficult to recover from as the gradient of a 0 output is 0
leaky ReLU is an attempted fix: if the value is positive it keeps its own value, otherwise it outputs a × value (for a small constant a)
sigmoid (what, computation expense, pros and cons) and what formula and graph roughly looks like
predicts probability as squashes real value between 0 and 1
exponent so expensive
pros: guarantees the gradient cannot grow past a certain bound
cons: the gradient is bounded so you can get vanishing gradients; outputs are not zero-centred so all neurons receive gradients of the same sign in training
sigmoid(v) = 1 / (1 + e^(-v))
S-shaped curve from bottom to top, with the bottom in line with 0
tanh (what, computation expense, pros and cons) and what formula roughly looks like
squashes values between -1 and 1
exponent is computationally expensive
gradient is steeper than sigmoid
pros: outputs are zero-centred, which means faster learning
cons: gradient is bounded so vanishing gradient
tanh(v) = (e^v - e^(-v)) / (e^v + e^(-v)) (the formula with the most e's); looks like an S from bottom to top but with the bottom in line with -1, not 0
which activation function to use?
use ReLU for the hidden layers
binary classification: use sigmoid for the output layer
multi-class classification: use softmax for the output layer
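Sketches of these activation functions, plus leaky ReLU (not from the cards; assumes numpy, and the leak factor a = 0.01 is a made-up small constant):

```python
import numpy as np

def relu(v):
    return np.maximum(0.0, v)                 # cheap: no exponent

def leaky_relu(v, a=0.01):
    return np.where(v > 0, v, a * v)          # small slope for negatives avoids dead ReLU

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))           # squashes into (0, 1)

def tanh(v):
    return np.tanh(v)                         # squashes into (-1, 1), zero-centred

v = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for f in (relu, leaky_relu, sigmoid, tanh):
    print(f.__name__, f(v))
```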
3 weight initialisation techniques
set all weights = 0: the neural network acts as a linear model
choose randomly: can lead to vanishing/exploding gradients, but OK with ReLU
heuristic: multiply the random weights by some scaling value to avoid vanishing/exploding gradients
softmax
converts vector of K real numbers into a probability distribution of K possible outcomes
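A numerically stable softmax sketch (not from the cards; assumes numpy, and the scores are made up):

```python
import numpy as np

def softmax(z):
    # subtract the max for numerical stability; result sums to 1
    e = np.exp(z - np.max(z))
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])            # made-up scores for K = 3 classes
print(softmax(scores))                         # roughly [0.66, 0.24, 0.10]
```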
backpropogation
an algorithm used to train MLPs by computing the gradient of the loss function with respect to each weight in the network, then systematically propagating the error backwards from the output to all preceding layers
forward pass back propagation
input passes through the network layer by layer and output is computed
loss calculation in back propagation
computes the difference between y hat (predicted) and y (actual) to get the error
backward pass back propagation
error flows backward through the network layer by layer, computing the gradient (partial derivative of the loss function with respect to each weight) using the chain rule
Update the weights by moving in the opposite direction of the gradient
back propagation allows
network to learn from its mistakes by adjusting weights and biases based on error
steepest gradient descent
optimization method that uses the gradients computed by backpropagation to update the weights, aiming to minimize the loss function over time
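A toy end-to-end example of forward pass, loss, backward pass and weight update on XOR (not from the cards; assumes numpy, and the hidden-layer size, learning rate, epoch count and the tanh/sigmoid/cross-entropy setup are made-up choices, one possible configuration among many):

```python
import numpy as np

rng = np.random.default_rng(0)

# XOR: not linearly separable, so one hidden layer is needed
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([[0.], [1.], [1.], [0.]])

W1 = rng.normal(size=(2, 4)); b1 = np.zeros(4)   # made-up size: 4 hidden neurons
W2 = rng.normal(size=(4, 1)); b2 = np.zeros(1)
lr = 0.5

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for _ in range(5000):
    # forward pass
    h = np.tanh(X @ W1 + b1)
    y_hat = sigmoid(h @ W2 + b2)
    # backward pass: gradients of the cross-entropy loss via the chain rule
    dz2 = (y_hat - y) / len(X)
    dW2, db2 = h.T @ dz2, dz2.sum(axis=0)
    dz1 = (dz2 @ W2.T) * (1 - h ** 2)            # tanh derivative
    dW1, db1 = X.T @ dz1, dz1.sum(axis=0)
    # steepest gradient descent: move opposite to the gradient
    W2 -= lr * dW2; b2 -= lr * db2
    W1 -= lr * dW1; b1 -= lr * db1

print(y_hat.round(2))    # typically close to [[0], [1], [1], [0]]
```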
choice of output layer act func, activation and loss function for MLP for classification
output layer: multi-class classification = softmax, binary = sigmoid (or tanh); softmax takes too long for binary
loss function: cross entropy
activation function: reLu
choice of output layer act func, activation and loss function for MLP for regression
output layer: linear
loss function: MSE (more sensitive to outliers) or MAE
activation function: reLu or tanh in hidden layers
impact of architecture on complexity and capability of MLP
more hidden layers allows MLP to model more complex relationships but also increase risk of overfitting
5 steps of supervised learning and MLP
1) examine
- how many inputs
- how many outputs
- what type is desired output (classification or regression)
2) # hidden layers
- usually related to the number of attributes
3) decide the activation functions of the hidden layers
4) for each hidden and output layer
- initialise w (not all 0)
- initialise bias matrices (all 0 is ok)
5) train network on training data test performance on test data
example of MLP and requirements
an MLP with one hidden layer is a universal function approximator, given enough neurons in the hidden layer and a sensible non-linear activation function
filtering and expanding
filtering is finding only key features and expanding is feature representation
convert pixels to
greyscale, as we don't want the model to learn the colours
logistic regression
has 2 modifications to the normal regression model to make it suitable for a binary classification problem where y ∈ {0, 1}
the output of the regression model is passed through a sigmoid function to convert it to a continuous value between 0 and 1, which is the probability of y hat being class 1
how does the sigmoid function work with logistic regression
the output is passed through a sigmoid function which converts the number into a probability between 0-1 with it being the probability of belonging to class 1
that probability is passed through a hard-limiting function: if P > 0.5 then class 1, if P < 0.5 then class 0
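A minimal sketch of that pipeline (not from the cards; assumes numpy, and the weights, bias and input are made-up numbers):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# made-up learned parameters for two features
w = np.array([1.5, -2.0])
b = 0.2

x = np.array([0.8, 0.3])
p = sigmoid(x @ w + b)                  # probability of belonging to class 1
label = 1 if p > 0.5 else 0             # hard-limiting step
print(p, label)
```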
cross entropy loss
measures difference between predicted probability distribution and actual class labels
penalises incorrect classifications more heavily when model is confident in wrong prediction
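A sketch of binary cross-entropy showing the heavier penalty for a confident wrong prediction (not from the cards; assumes numpy, and the probabilities are made up):

```python
import numpy as np

def binary_cross_entropy(y, p):
    # average negative log-likelihood of the true labels
    p = np.clip(p, 1e-12, 1 - 1e-12)    # avoid log(0)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

y_true = np.array([1.0, 0.0])
print(binary_cross_entropy(y_true, np.array([0.9, 0.1])))  # low loss: confident and correct
print(binary_cross_entropy(y_true, np.array([0.1, 0.9])))  # high loss: confident and wrong
```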
feature space
set of all possible values for a chosen set of features from chosen data
decision boundaries drawn in this feature space to separate classes based on the features
degree of polynomial and model complexity
increase degree of polynomial increases flexibility of the model however could overfit
steepest gradient descent
optimization technique that iteratively adjusts model parameters by following the steepest descent direction of the loss surface, aiming to minimize the error
steepest gradient descent only works with
continuous loss function
optimisation in ML
goal is to find set of parameters that minimises loss function
similar to parameter search through a space of possible parameter values
optimisation allows algorithm to learn and adapt
loss surface
for a fixed dataset, the loss J evaluated on some hypothesis h(x, w) for each choice of parameters w defines a function J(w) over the space of all possible parameters; the manifold that J(w) traces out in parameter space is the loss surface
linear in parameters and not
the parameters (weights) have a linear relationship with the output
y hat = w1·g1(x) + w2·g2(x) is linear in parameters, as is y hat = w1·sin(x), but y hat = sin(w1·x1) is not, because w1 has a sine (non-linear) relationship to y hat
convex loss function
has one minimum (global) and no local minima
guarantees gradient based optimization methods will always find best solution
global minima
point on the loss surface where the loss is at its absolute minimum
local minima
point on the loss surface where the loss is lower than at neighbouring points but not necessarily the lowest overall
limitations to SGD
can get stuck in local minima for non-convex functions
if gradient is near 0 the descent slows down significantly
finding right learning rate is critical to performance
GA for learning parameters
genetic algorithm is a search heuristic
reflects the process of natural selection
can be used to learn the parameters of a model, especially when the loss surface is non-convex or SGD struggles with local minima
regression
predicting numerical values
linear regression
models the relationship between a dependent variable (output) and 1 or more independent variables (inputs) using a straight line
uses model hypothesis that is weighted sum of inputs
goal of regression
goal is to find optimal weights that minimise differences between predicted values and actual values using loss function
MSE and formula
average squared difference between predicted and actual values
J_MSE = (1/N) Σ (y - y hat)²
why do we care about MSE when using it for our model?
minimising MSE increases accuracy of regression model by decreasing prediction error
least squares fit
mathematical procedure for finding the best-fitting curve to a given set of points by minimising the sum of squared residuals
weights are adjusted iteratively using steepest gradient descent to find optimal values to minimise loss
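A sketch of a least-squares fit by steepest gradient descent (not from the cards; assumes numpy, and the synthetic line y = 3x + 1, learning rate and iteration count are made up):

```python
import numpy as np

rng = np.random.default_rng(2)

# made-up 1-D data roughly on the line y = 3x + 1
x = rng.uniform(-1, 1, size=50)
y = 3.0 * x + 1.0 + rng.normal(scale=0.1, size=50)

w, b = 0.0, 0.0
lr = 0.1

for _ in range(500):
    y_hat = w * x + b
    # gradients of J_MSE = mean((y - y_hat)^2) with respect to w and b
    dw = -2 * np.mean((y - y_hat) * x)
    db = -2 * np.mean(y - y_hat)
    w -= lr * dw                         # step opposite to the gradient
    b -= lr * db

print(w, b)                              # close to the true slope 3 and intercept 1
```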