Classification and regression Flashcards

1
Q

classification

A

predicts discrete class labels

2
Q

example of classification

A

labelling emails spam or ham

3
Q

decision tree classifier

A

flowchart-like structure in which each internal node represents a test on an attribute (e.g. whether a coin flip comes up heads or tails), each branch represents an outcome of the test, and each leaf node represents a class label; a tree-like model that makes decisions by splitting data into subsets based on feature values, creating branches that lead to outcomes (class labels)

4
Q

decision tree makes a sequence

A

of partitions of the training data, one attribute at a time

5
Q

probability in classification

A

probability determines the likelihood of each class label given a set of features
relates to confidence in predictions

6
Q

ordering in classification

A

attributes are selected and split on using a measure like information gain, creating an order of importance for features

7
Q

entropy

A

entropy is a measure of uncertainty or disorder in a system

8
Q

info entropy in classification

A

entropy measures how hard it is to guess the label of a randomly drawn sample from the dataset
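As an illustration (a minimal sketch in Python/NumPy, my own example rather than from the lecture), entropy can be computed from the class proportions of the labels:

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()          # class proportions
    return -np.sum(p * np.log2(p))     # H = -sum p_i * log2(p_i)

# A 50/50 split is maximally uncertain (1 bit); a pure node has entropy 0.
print(entropy(["spam", "ham", "spam", "ham"]))  # 1.0
print(entropy(["spam", "spam", "spam"]))        # 0.0
```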

9
Q

choose level with ___ entropy as ___

A

lowest
as the data labels are more uniform, so it is easier to guess them

10
Q

how is entropy used in data splits for decision trees?

A

decision trees use information gain, based on entropy, to decide the best feature to split the data on at each node
entropy is calculated before and after the split to determine how well a feature divides the data into pure subsets
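A hedged sketch (my own Python/NumPy illustration, function names are mine) of that idea: information gain is the entropy before the split minus the weighted entropy of the subsets after the split:

```python
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(labels, attribute_values):
    """Entropy before the split minus the weighted entropy after
    splitting on the given attribute values."""
    labels = np.asarray(labels)
    attribute_values = np.asarray(attribute_values)
    before = entropy(labels)
    after = 0.0
    for v in np.unique(attribute_values):
        subset = labels[attribute_values == v]
        after += len(subset) / len(labels) * entropy(subset)
    return before - after

# Hypothetical toy data: the attribute separates the classes perfectly,
# so the gain equals the parent entropy (1 bit).
print(information_gain(["spam", "spam", "ham", "ham"],
                       ["short", "short", "long", "long"]))  # 1.0
```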

11
Q

3 steps of entropy and data splits

A

1) partition examples recursively by choosing one attribute at a time
2) choose the attribute that best separates the classes of training examples
3) choose a goodness function (info gain, gain ratio, Gini index)

12
Q

3 attribute types

A

nominal (categorical values with no order like animal, food)
ordinal (categorical values that have order like hot, warm, cold)
numerical

13
Q

how do you handle a numerical attribute in a decision tree, and what are 3 ways to do it?

A

convert it to a nominal attribute
1) assign categories to the numerical values and keep trying until you find a good split
2) use the entropy value to search for the best split threshold
3) frequency binning

14
Q

attribute resulting in ____ info gain is selected for split

A

highest

15
Q

process of splitting the decision tree by attributes is continued recursively ____

A

building tree by splitting data using features that minimise uncertainty at each step

16
Q

Th is the

A

entropy threshold

17
Q

What is the purpose of Th

A

criterion for deciding when to stop splitting the data at a node or to continue

18
Q

When entropy of a node is below Th?

A

If the entropy of a node is below a certain threshold, it means that the data at that node is sufficiently pure (i.e., it mostly contains examples of one class). As a result, the decision tree can stop splitting further at that node, and the node is labeled with the majority class

19
Q

When entropy of a node is above Th?

A

If the entropy is above the threshold, it indicates that the data at the node is still impure, meaning there’s a mix of different class labels. In this case, the decision tree continues splitting by choosing the attribute that reduces entropy the most (maximizing information gain)

20
Q

only use Th=0 when

A

example is really simple

21
Q

Th=0, Th>0

A

Th = 0: require perfect order (pure nodes)
Th > 0: can tolerate some mixed labels

22
Q

avoid overfitting by using 1) and 2) and 3)

A

entropy threshold
pruning
limit depth of tree

23
Q

gain ratio formula

A

gain ratio(A) = information gain(A) / split information(A), where split information is the entropy of the partition created by A's values

24
Q

want big or small gain ratio and why?

A

large; the denominator (split information) shrinks the gain ratio of attributes with many values, which prevents selecting attributes that overfit the model by using many small, specific splits

25
Q

gini index doesn’t rely on

A

entropy; it relies only on class proportions
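For comparison, a small sketch (my own illustration, function names are mine) of the Gini index computed purely from class proportions:

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print(gini(["spam", "ham", "spam", "ham"]))  # 0.5 (maximally impure for 2 classes)
print(gini(["spam", "spam"]))                # 0.0 (pure node)
```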

26
Q

when you would use info gain as goodness function?

A

imbalanced dataset

27
Q

when you would use gain ratio as goodness function?

A

imbalanced dataset
attributes with many values (high branching)

28
Q

when you would use gini index as goodness function?

A

binary classification

29
Q

rank fastest to slowest for goodness function evaluation?

A

fastest is Gini, middle is information gain, slowest is gain ratio

30
Q

perceptron is an

A

artificial neuron
fundamental unit in neural networks modelled after biological neuron
activity is weighted sum of its inputs + bias term passed through an activation function to produce output
adjusting weights allows neuron to learn
choice of activation function determines type of computation the neuron performs

31
Q

single neuron vs multiple computation ability wise

A

single neuron can only do simple computations but many connected in a large network can deliver any function mapping

32
Q

what is the activation function symbol

A

like a hook

33
Q

activation function

A

determines the output of a neuron based on the weighted sum of inputs
introduces non-linearity to make the model network capable of learning more complex patterns
e.g sigmoid, relu, softmax

34
Q

weights

A

coefficients that adjust the influence of certain input attributes on the output

35
Q

bias

A

threshold value added to the sum of weighted inputs to shift the activation function’s output

36
Q

why is the bias helpful?

A

shifts the decision boundary away from the origin, making the model more flexible; without it, the decision boundary would always pass through the origin

37
Q

half space

A

space divided by hyperplane which classifies the data points based on which side they fall on

38
Q

one hot encoding

A

converts categorical variables into a numerical format that machine learning models can use
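A minimal sketch (my own NumPy illustration; libraries such as scikit-learn's OneHotEncoder or pandas.get_dummies do the same job) of one-hot encoding a categorical column:

```python
import numpy as np

def one_hot(values):
    """Map each category to a binary indicator vector."""
    categories = sorted(set(values))                    # fixed column order
    index = {c: i for i, c in enumerate(categories)}
    encoded = np.zeros((len(values), len(categories)))
    for row, v in enumerate(values):
        encoded[row, index[v]] = 1.0
    return categories, encoded

cats, X = one_hot(["cat", "dog", "cat", "bird"])
print(cats)  # ['bird', 'cat', 'dog']
print(X)     # rows: [0,1,0], [0,0,1], [0,1,0], [1,0,0]
```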

39
Q

binary classification vs multi class classification

A

binary classifies data into 1 of 2 classes
multi classifies data into 1 of many classes

40
Q

how does multi class classification work?

A

uses K neurons and trains each to separate one class from all others (one-vs-rest)

41
Q

how is training done on a perceptron?

A

iteratively updating weights in a way that minimises error function (difference between actual and predicted output)

42
Q

how is a perceptron trained?

A

the goal is to learn a set of weights that allows the perceptron to correctly classify input data
trained through supervised learning
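A hedged sketch (my own illustration, assuming the classic perceptron update rule with a step activation) of supervised perceptron training: for each sample, the weights move by learning_rate × (target − prediction) × input.

```python
import numpy as np

def train_perceptron(X, y, lr=0.1, epochs=20):
    """Supervised training of a single perceptron with a step activation."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, target in zip(X, y):
            pred = 1 if xi @ w + b > 0 else 0   # weighted sum + bias, then step
            error = target - pred
            w += lr * error * xi                # update only when wrong
            b += lr * error
    return w, b

# Linearly separable toy data (logical AND), so convergence is guaranteed.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])
w, b = train_perceptron(X, y)
print([1 if x @ w + b > 0 else 0 for x in X])   # [0, 0, 0, 1]
```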

43
Q

a perceptron model is guaranteed to converge if

A

1) the learning parameter (learning rate) is small enough
2) the classification problem is linearly separable

44
Q

learning rate

A

controls magnitude of weight updates during training

45
Q

a too high learning rate can mean

A

convergence can be faster but it can lead to potentially unstable training

46
Q

epoch is a

A

one complete pass of the entire training data through the algorithm

47
Q

online training

A

one update per training sample
N updates per epoch

48
Q

batch learning

A

average updates from all training samples
one update per epoch

49
Q

mini batch learning

A

dividing the data into fixed-size batches, averaging updates over each mini-batch, and shuffling the data assigned to mini-batches between epochs
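A small sketch (my own NumPy illustration) of the mini-batch pattern: reshuffle each epoch, then slice into fixed-size batches and apply one averaged update per batch.

```python
import numpy as np

def minibatch_epochs(X, y, batch_size=32, epochs=5, seed=0):
    """Yield (X_batch, y_batch) pairs, reshuffling the data every epoch."""
    rng = np.random.default_rng(seed)
    n = len(X)
    for _ in range(epochs):
        order = rng.permutation(n)              # new shuffle each epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            yield X[idx], y[idx]                # one (averaged) update per batch

# Example: 100 samples, batch size 32 -> 4 batches per epoch (last one smaller).
X, y = np.arange(100).reshape(100, 1), np.arange(100)
for Xb, yb in minibatch_epochs(X, y, epochs=1):
    print(Xb.shape)   # (32, 1), (32, 1), (32, 1), (4, 1)
```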

50
Q

limit of perceptron model and how is it overcome?

A

linear separation: a simple single perceptron can only solve linearly separable problems, so it can't handle problems like XOR
overcome by more complex neural networks such as the MLP

51
Q

MLP is

A

a feed-forward artificial neural network that generates a set of outputs from a set of inputs, with multiple hidden layers that allow it to model more complex problems and capture non-linear patterns

52
Q

MLP structure

A

input layer
each neuron in the input layer corresponds to one feature in the input data

hidden layers
can have 1 or more
Each neuron in a hidden layer is connected to every node in the previous and next layer (fully connected)
Neurons in the hidden layers apply non-linear activation functions to the weighted sum of inputs and produce an output
learn abstract features from input data

output layer
provides the final prediction of the model

53
Q

number of neurons in the input layer is the number of

A

features

54
Q

how does a MLP learn

A

adjusting the weights between neurons to minimize error and is able to capture complex patterns in data due to its layered architecture

55
Q

SLP

A

one layer of neurons
can only solve linearly separable problems

56
Q

3 advantages to MLP

A

can solve non-linearly separable problems
can model more complex decision boundaries and patterns in the data by stacking multiple layers of neurons
can learn hierarchical features where each hidden layer captures different levels of abstraction from the data

57
Q

universal function approximation

A

a model that can approximate any continuous function given enough neurons and layers (technically even an MLP with a single hidden layer qualifies)

58
Q

ufa in classification and regression

A

can produce desired class labelling on any data

hypothesis that fits any data with arbitrarily small MSE

59
Q

3 non linear activation functions

A

relu
sigmoid
tanh

60
Q

relu (what, computation expense, pros and cons)

A

outputs the input directly if it is positive, and outputs 0 otherwise
no exponent, so computationally efficient
pros: most widely used, no vanishing/exploding gradient for positive inputs
cons: output is not 0-centred, incorrect mapping for negative values, dead ReLU

61
Q

what is dead relu and what’s an attempted solution?

A

large gradient updates can push the weights/bias to large negative values, so the neuron's input stays negative and its output is always 0
difficult to recover from, as the gradient in the 0 region is 0
leaky ReLU is an attempted fix: if the value is positive it outputs the value itself, otherwise it outputs a × value (a small slope a for negative inputs)

62
Q

sigmoid (what, computation expense, pros and cons) and what formula and graph roughly looks like

A

predicts a probability, as it squashes a real value to between 0 and 1
exponent, so computationally expensive
pros: guarantees the gradient cannot grow past a certain bound
cons: gradient is bounded so can get vanishing gradient; outputs are not 0-centred, so all weight updates for a neuron have the same sign during training
σ(v) = 1 / (1 + e^(−v))
S-shaped curve from bottom to top, with the bottom asymptote at 0

63
Q

tanh (what, computation expense, pros and cons) and what formula roughly looks like

A

squashes values to between -1 and 1
exponent, so computationally expensive
gradient is steeper than sigmoid
pros: outputs are 0-centred, which means faster learning
cons: gradient is bounded so vanishing gradient
formula has the most exponential terms: tanh(v) = (e^v − e^(−v)) / (e^v + e^(−v)); S-shaped curve like sigmoid but with the bottom asymptote at −1 instead of 0
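A compact NumPy sketch (my own illustration) of the three activation functions and their output ranges:

```python
import numpy as np

def relu(v):
    return np.maximum(0.0, v)           # [0, inf): passes positives, zeros negatives

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))     # (0, 1): not zero-centred

def tanh(v):
    return np.tanh(v)                   # (-1, 1): zero-centred, steeper than sigmoid

v = np.array([-2.0, 0.0, 2.0])
print(relu(v))     # [0.    0.    2.   ]
print(sigmoid(v))  # [0.119 0.5   0.881]  (approx.)
print(tanh(v))     # [-0.964 0.    0.964] (approx.)
```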

64
Q

which activation function to use?

A

use reLu
binary classification use sigmoid for output layer
multi class classification use softmax for output layer

65
Q

3 weight initialisation techniques

A

set all weights = 0: the neural network acts as a linear model

choose randomly: can lead to vanishing/exploding gradients, but is OK with ReLU

heuristic: multiply the random weights by some scaling value to avoid vanishing/exploding gradients

66
Q

softmax

A

converts vector of K real numbers into a probability distribution of K possible outcomes
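A small sketch (my own illustration) of a numerically stable softmax:

```python
import numpy as np

def softmax(z):
    """Turn a vector of K real scores into K probabilities that sum to 1."""
    z = z - np.max(z)              # shift for numerical stability (result unchanged)
    e = np.exp(z)
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))  # approx. [0.659, 0.242, 0.099], sums to 1
```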

67
Q

backpropogation

A

an algorithm used to train an MLP by computing the gradient of the loss function with respect to each weight in the network, systematically propagating the error backwards from the output to all preceding layers

68
Q

forward pass back propagation

A

input passes through the network layer by layer and output is computed

69
Q

loss calculation in back propagation

A

compares ŷ (predicted) with y (actual) and computes the error

70
Q

backward pass back propagation

A

the error flows backward through the network layer by layer, computing gradients (partial derivatives of the loss function with respect to each weight) using the chain rule
update the weights by moving in the opposite direction of the gradient
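A hedged sketch (my own illustration, not the lecture's notation) of one forward pass, backward pass and weight update for a tiny one-hidden-layer network with sigmoid activations and squared-error loss:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

rng = np.random.default_rng(0)
x, y = np.array([0.5, -1.0]), np.array([1.0])       # one training sample
W1, b1 = rng.normal(size=(3, 2)), np.zeros(3)       # hidden layer: 2 -> 3
W2, b2 = rng.normal(size=(1, 3)), np.zeros(1)       # output layer: 3 -> 1
lr = 0.1

# Forward pass: layer by layer.
h = sigmoid(W1 @ x + b1)
y_hat = sigmoid(W2 @ h + b2)

# Backward pass: chain rule, propagating the error from output to hidden layer.
delta_out = (y_hat - y) * y_hat * (1 - y_hat)       # dLoss/d(output pre-activation)
delta_hid = (W2.T @ delta_out) * h * (1 - h)        # dLoss/d(hidden pre-activation)

# Update: move each weight opposite to its gradient.
W2 -= lr * np.outer(delta_out, h); b2 -= lr * delta_out
W1 -= lr * np.outer(delta_hid, x); b1 -= lr * delta_hid

print(((y_hat - y) ** 2).sum())   # loss before the update
```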

71
Q

back propagation allows

A

network to learn from its mistakes by adjusting weights and biases based on error

72
Q

steepest gradient decent

A

optimization method that uses the gradients computed by backpropagation to update the weights, aiming to minimize the loss function over time

73
Q

choice of output layer act func, activation and loss function for MLP for classification

A

output layer: softmax for multi-class classification, sigmoid (or tanh) for binary (softmax takes too long for binary)
loss function: cross entropy
hidden-layer activation function: ReLU

74
Q

choice of output layer act func, activation and loss function for MLP for regression

A

output layer: linear
loss function: MSE (more sensitive to outliers) or MAE
activation function: ReLU or tanh in hidden layers

75
Q

impact of architecture on complexity and capability of MLP

A

more hidden layers allows MLP to model more complex relationships but also increase risk of overfitting

76
Q

5 steps of supervised learning and MLP

A

1) examine the problem
- how many inputs
- how many outputs
- what type of output is desired (classification or regression)
2) decide the number of hidden layers / neurons per layer
- usually related to the number of attributes
3) decide the activation functions of the hidden layers
4) for each hidden and output layer
- initialise the weight matrices (not all 0)
- initialise the bias matrices (all 0 is OK)
5) train the network on training data and test performance on test data

77
Q

example of MLP and requirements

A

MLP with one hidden layer is classified as a universal function approximator with enough neurons in the hidden layer and a sensible non-linear activation function

78
Q

filtering and expanding

A

filtering is finding only key features and expanding is feature representation

79
Q

convert pixels to

A

greyscale, as we don’t want the model to learn the colours

80
Q

logistic regression

A

has 2 modifications to the normal regression model to make it suitable for a binary classification problem where y ∈ {0, 1}
the output of the regression model is passed through a sigmoid function to convert it to a continuous value between 0 and 1, which is the probability that ŷ belongs to class 1

81
Q

how does the sigmoid function work with logistic regression

A

the output is passed through a sigmoid function which converts the number into a probability between 0 and 1: the probability of belonging to class 1
that probability is then passed through a hard limiting function: if P > 0.5 predict class 1, otherwise predict class 0
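A minimal sketch (my own illustration, with made-up weights just to show the call) of the sigmoid-plus-threshold step:

```python
import numpy as np

def predict_logistic(x, w, b):
    """Linear model -> sigmoid -> probability of class 1 -> hard threshold."""
    z = x @ w + b                      # ordinary regression output
    p = 1.0 / (1.0 + np.exp(-z))       # squash to (0, 1): P(y = 1 | x)
    return p, int(p > 0.5)             # class 1 if probability exceeds 0.5

# Hypothetical weights, just to show the call.
p, label = predict_logistic(np.array([1.0, 2.0]), np.array([0.8, -0.3]), b=0.1)
print(p, label)   # approx. 0.574, 1
```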

82
Q

cross entropy loss

A

measures difference between predicted probability distribution and actual class labels
penalises incorrect classifications more heavily when model is confident in wrong prediction
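A small sketch (my own illustration) of binary cross-entropy, showing how a confident wrong prediction is penalised far more heavily than a mildly wrong one:

```python
import numpy as np

def binary_cross_entropy(y, p, eps=1e-12):
    """-[y log p + (1 - y) log(1 - p)], averaged over samples."""
    p = np.clip(p, eps, 1 - eps)          # avoid log(0)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

y = np.array([1.0])
print(binary_cross_entropy(y, np.array([0.6])))   # mildly wrong: ~0.51
print(binary_cross_entropy(y, np.array([0.01])))  # confidently wrong: ~4.61
```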

83
Q

feature space

A

set of all possible values for a chosen set of features from chosen data
decision boundaries drawn in this feature space to separate classes based on the features

84
Q

degree of polynomial and model complexity

A

increasing the degree of the polynomial increases the flexibility of the model, but it could overfit

85
Q

steepest gradient descent

A

optimization technique that iteratively adjusts model parameters by following the steepest descent direction of the loss surface, aiming to minimize the error

86
Q

steepest gradient descent only works with

A

continuous loss function

87
Q

optimisation in ML

A

goal is to find set of parameters that minimises loss function
similar to parameter search through a space of possible parameter values
optimisation allows algorithm to learn and adapt

88
Q

loss surface

A

given a fixed dataset, the loss J evaluated on a hypothesis h(x, w) for each choice of parameters w defines a function J(w) over the space of all possible parameters; the manifold J(w) traces out in parameter space is the loss surface

89
Q

linear in parameters and not

A

the parameters (weights) have a linear relationship with the output, even if the inputs are transformed non-linearly
ŷ = w1·g1 + w2·g2 and also ŷ = w1·sin(x) are linear in parameters, but ŷ = sin(x1·w1) is not, as w1 has a non-linear (sine) relationship to ŷ

90
Q

convex loss function

A

has one minimum (global) and no local minima
guarantees gradient based optimization methods will always find best solution

91
Q

global minima

A

point on the loss surface where the loss is at its absolute minimum

92
Q

local minima

A

point on the loss surface where the loss is lower than at neighbouring points but not necessarily the lowest overall

93
Q

limitations to SGD

A

can get stuck in local minima for non-convex functions
if the gradient is near 0, the descent slows down significantly
finding the right learning rate is critical to performance

94
Q

GA for learning parameters

A

genetic algorithm is a search heuristic
reflects the process of natural selection
can be used to learn the parameters of a model, especially when the loss surface is non-convex or SGD struggles with local minima

95
Q

regression

A

predicting numerical values

96
Q

linear regression

A

models the relationship between a dependent variable (output) and 1 or more independent variables (inputs) using a straight line
uses a model hypothesis that is a weighted sum of the inputs

97
Q

goal of regression

A

goal is to find optimal weights that minimise differences between predicted values and actual values using loss function

98
Q

MSE and formula

A

average squared difference between predicted and actual values
J_MSE = (1/N) · Σ (yᵢ − ŷᵢ)²
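A one-line sketch (my own illustration) of the formula above:

```python
import numpy as np

def mse(y, y_hat):
    """Mean squared error: average of (y - y_hat)^2 over all samples."""
    y, y_hat = np.asarray(y), np.asarray(y_hat)
    return np.mean((y - y_hat) ** 2)

print(mse([3.0, 5.0, 2.0], [2.5, 5.0, 4.0]))  # (0.25 + 0 + 4) / 3 = 1.4166...
```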

99
Q

why do we care about MSE when using it for our model?

A

minimising MSE increases accuracy of regression model by decreasing prediction error

100
Q

least squares fit

A

maths procedure for finding the best-fitting curve to a given set of points by minimising the sum of squared residuals
weights are adjusted iteratively, e.g. using steepest gradient descent, to find the optimal values that minimise the loss