lecture 5 - supervised learning in predictive modeling Flashcards

1
Q

what do we mean when we say ‘machines can learn’ (Mitchell)

A
  • a computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.
  • i.e., we say that something learns as long as it gets better at performing the task with new experiences
  • E could be an instance (a single observed example)
2
Q

what do we need for a machine to learn

A
  1. task
  2. historical data
  3. improvement on task performance
3
Q

supervised learning: functional relationship

A
  • learning a functional relationship f: X → Y
  • we map an observation x ∈ X to the target y ∈ Y
  • X represents the space of all inputs, Y is the space of all possible targets
  • not likely to be deterministic due to measurement error and noisy targets
4
Q

supervised learning: unknown conditional target distribution p(y|x)

A
  • if we calculate the target function f(x), we add noise from a noise distribution to simulate noisy targets.
  • this makes it more difficult to learn the target function.
  • by accounting for noisy targets, we replace the deterministic target function with the conditional target distribution p(y|x)
5
Q

supervised learning: observed data O

A
  • O = {(x_j, y_j)}, j = 1, …, N
  • generated from p(x) (the input distribution) and p(y|x) (the conditional target distribution)
  • separate into training, validation, and test set
  • learn a function f-hat(x) that fits our observed data in the training set
  • evaluate the generalisability of f-hat(x) on the test set
  • when to stop the learning process is decided using the validation set; with a small dataset we use cross-validation
6
Q

supervised learning: unbalanced data

A

in case of unbalanced data, we should use stratified sampling when creating the training set, so that the class proportions are preserved in each subset (see the sketch below)
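  • a minimal sketch of a stratified split, assuming scikit-learn (the lecture does not prescribe a library) and a hypothetical toy dataset:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# hypothetical unbalanced dataset: 90 negatives, 10 positives
X = np.random.rand(100, 3)
y = np.array([0] * 90 + [1] * 10)

# stratify=y keeps the 90/10 class ratio in both the train and the test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(y_train.mean(), y_test.mean())  # both are approximately 0.10
```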

7
Q

supervised learning: hypothesis set H

A
  • the set of all candidate functions the learning algorithm can choose from (the model class)
  • you want to identify the right hypothesis from those options
  • each hypothesis h tries to approximate the unknown target function f; the hypothesis we select is called f-hat
8
Q

supervised learning: error measure E()

A
  • assume we have a hypothesis h for the target function f
  • loss (e): error per data point – e(f(x), h(x))
  • risk (E): overall difference between h and f across the input space – E(f, h)
  • from e to E: E(f, h) = ∫ e(f(x), h(x)) p(x) dx, i.e., the per-point loss weighted by the pdf of the input data x
  • since we don’t have f, we can’t compute E exactly
  • E is therefore approximated by the average loss over the observed targets: (1/N) · Σ_j e(y_j, h(x_j))
9
Q

definitions for e

A
  1. classification: classification rate, F1, AUC, etc.
  2. regression: MSE, MAE, etc.
10
Q

in-sample error

A
  • the error of a hypothesis on the training data
  • E_in(h) = (1/N) · Σ_j e(y_j, h(x_j))
  • using x and y from our training set
  • we select the hypothesis that minimizes the in-sample error
  • i.e., machines learn by minimizing the in-sample error
  • easy to compute (see the sketch below)
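  • a minimal sketch in Python (not from the lecture) of the in-sample error as the average per-point loss, using squared error as the loss e:

```python
import numpy as np

def in_sample_error(y_true, y_pred, loss=lambda y, h: (y - h) ** 2):
    """Average per-point loss e over the training data: (1/N) * sum e(y_j, h(x_j))."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.mean([loss(y, h) for y, h in zip(y_true, y_pred)])

# hypothetical training targets y_j and hypothesis outputs h(x_j)
print(in_sample_error([1.0, 2.0, 3.0], [1.1, 1.8, 3.5]))  # MSE = 0.10
```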
11
Q

out-sample error

A
  • the error of the hypothesis f-hat on unseen data
  • E_out(h) = ∫ e(f(x), h(x)) p(x) dx over the input space
  • the input space X for the out-sample error is not part of the training set (unseen data)
  • used to assess the generalizability of the model
  • hard to compute, since we do not know f or p(x)
12
Q

supervised learning: model selection

A
  • we select the hypothesis that has the lowest (empirical) error on the validation set
  • we should be careful about overfitting and avoid using too many features
13
Q

learning theory

A
  • though we can’t learn the perfect model, learning is possible if the in-sample error is a good estimator of the out-sample error
  • i.e., |E_out(f-hat) − E_in(f-hat)| should be small
  • with a higher N, this difference is most likely very small
14
Q

PAC learnable

A
  • a hypothesis set is PAC learnable if a learning algorithm exists that fulfills the following condition: for every error margin ε > 0 and confidence level δ ∈ [0, 1], there is an m, such that for a random training sample of length larger than m, the following inequality holds:
  • the absolute difference between the in- and out-sample error of f-hat is smaller than ε
  • this inequality states that the generalization error is close to the empirical error within ε, with high probability (at least 1-δ)
  • this is about understanding the conditions under which a learning algorithm can generalize well from training data to unseen data
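  • in symbols, the standard form of this condition (notation assumed here, not copied from the slides) is:

```latex
% PAC-learnability condition: for every eps > 0 and delta in [0,1]
% there is an m(eps, delta) such that for all sample sizes N >= m:
\Pr\Big[\,\big|E_{\text{out}}(\hat{f}) - E_{\text{in}}(\hat{f})\big| \le \varepsilon\,\Big] \ge 1 - \delta
```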
15
Q

PAC learnable: δ

A
  • δ indicates how certain we want to be that this difference is smaller than ε.
  • 0 = completely certain.
  • lower values = more strict in providing guarantees that the difference is smaller than epsilon
16
Q

every finite set of hypotheses is PAC learnable

A
  • This conclusion states that as long as we have a finite number of possible models (hypotheses), we can always find a model that is PAC learnable.
  • This means that with enough training samples, we can ensure that the model’s performance on new data will be close to its performance on the training data with high probability.
17
Q

VC dimension: shatter

A
  • in a binary classification problem with N data points, there are 2^N ways to label the data
  • a hypothesis set H (e.g., lines) shatters X when it can represent every possible labeling (i.e., all 2^N of them)
  • the VC dimension d_{vc} of a hypothesis set is defined as the maximum number of input vectors that can be shattered
  • i.e., it’s the largest set of points for which you can find a hypothesis in H to perfectly classify every possible labeling of those points
  • e.g., lines in the 2D plane can shatter 3 points but not 4, so d_{vc} = 3 for linear classifiers in 2D
18
Q

what does the VC dimension value represent

A
  • it tells us the largest number of points that can be shattered (i.e., classified in all possible ways) by the hypothesis set
  • the VC dimension is infinite if there are arbitrarily large sets of input vectors that can be shattered
  • result: all hypothesis sets with finite VC-dimensions are PAC learnable
19
Q

VC dimension: PAC learnable

A
  • Despite having potentially infinite hypotheses, if the hypothesis set has a finite VC dimension, it means the complexity of the model is bounded.
  • This bounded complexity implies that the set is PAC learnable, meaning with enough data, the algorithm will probably (with high probability) and approximately (within some error threshold) learn a hypothesis that generalizes well on unseen data.
20
Q

VC dimension: m_H(N)

A

the growth function: how the complexity of the hypothesis space grows with the dataset size

  • i.e., how many effectively different hypotheses (distinct labelings of the data) you get if you make the dataset bigger
  • if this growth is limited (finite, rather than the full 2^N), then you’re able to provide guarantees on the difference between the in- and out-sample error
21
Q

implications of PAC formula

A
  1. with few training examples, it is easy to obtain a low in-sample error, but the out-sample error will be high. as N increases, the two errors converge towards an asymptotic error level.
  2. with a fixed N, more complex hypothesis sets lead to a better representation of the data in the training set (i.e., lower in-sample error).
    - however, we can make the model too complex and overfit, increasing the out-sample error. we need to find the sweet spot of complexity.
  3. low complexity stabilizes at a higher error rate compared to high complexity
    - so, the more data you have, the more beneficial a high-complexity hypothesis set becomes
22
Q

training and test splits: models for individual level

A
  1. take the time component into account and split the test and training set at a specific time point within a qs
  2. ignore the time component and use random sampling over the instances of a qs
23
Q

training and test splits: models for person level

A
  1. unseen individuals: use a subset of the qs's as training and the rest as test
    –> here, we generalize across people, i.e., we test whether what we learn from the training set generalizes to a new person
  2. unseen data of a known person: divide test and training data at a certain percentage split per qs
24
Q

feedforward neural networks

A
  1. perceptron
  2. multi layer perceptron
  3. convolutional NNs
25
Q

perceptron

A
  1. contains just 1 neuron
  2. bias input is constant and typically set to 1
  3. inputs are assumed to be numerical or binary values
  4. the inputs and bias are connected to the neuron via arcs, each of which has a weight associated with it
  5. based on the value of the inputs and the weights, the network provides an output
26
Q

perceptron computation

A
  1. take the weighted sum of the inputs
  2. apply an activation function to determine the output of the neuron
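  • a minimal sketch in Python (not from the slides), using a step function as the activation and hypothetical inputs and weights:

```python
import numpy as np

def perceptron_output(x, w, b=1.0, w_b=0.0):
    """Weighted sum of the inputs (plus bias), followed by a step activation."""
    s = np.dot(w, x) + w_b * b          # 1. weighted sum of inputs and bias
    return 1 if s >= 0 else 0           # 2. threshold (step) activation

print(perceptron_output(x=np.array([1.0, 0.5]), w=np.array([0.4, -0.2]), w_b=-0.1))  # -> 1
```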
27
Q

perceptron limit

A

can only represent linearly separable cases for classification

  • i.e., only classes we can separate with a hyperplane
28
Q

multi layer perceptron

A
  • with back propagation we can treat cases that are not linearly separable
  • it is always possible to find a combination of hyperplanes that can completely separate the training data, as long as none of the input vectors are identical
29
Q

convolutional neural network

A
  • deep neural network
  • similar to multi-layer NNs, but there are preceding layers that identify features in the input space.
  • followed by a regular NN
  • mostly used for image and sound recognition
  • assume a 3D, 2D, or 1D input space
  • extract features using convolutional layers and pooling layers
30
Q

support vector machine

A
  • find hyperplane that maximizes the distance between classes (outer hyperplanes)
  • mainly targets classification problems
31
Q

SVM: linear separability problem

A
  • solves the linear separability problem by using kernel functions
  • these map inputs to a different (high dimensional) feature space in which the problem is linearly separable
32
Q

k-nearest neighbour

A
  1. find the k closest examples using a distance metric (e.g., the distance metrics discussed earlier)
  2. assign a class based on some function (e.g., majority class for classification or average value for regression)
  • can handle any type of attribute (numerical, binary, categorical) depending on the distance function used
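  • a minimal sketch in Python (not from the slides), using Euclidean distance, majority vote, and a hypothetical toy dataset:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Find the k closest training examples (Euclidean distance) and take a majority vote."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.95, 1.0])))  # -> 1
```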
33
Q

decision trees

A
  1. create tree to decide on target value
  2. use criterion to decide on which attributes are most important
34
Q

decision trees: components

A
  1. nodes: decision point associated with an attribute
  2. leaves: outcome value
  3. branches: attribute value
35
Q

decision trees: deciding importance of attributes

A
  1. categorical target: use the information gain (based on entropy)
  2. numerical target: standard deviation reduction (reduce the standard deviation/diversity in target values)
36
Q

entropy

A
  • the number of bits required to send a certain message.
  • the more information the message contains, the more bits are required
  • if all instances are of the same class, this gives minimal information, and the entropy is 0
  • if instances are evenly spread over the classes (e.g., 50/50 over two classes), we have maximum information, and the entropy is 1
  • we want leaves that cover a set of instances of the same class, i.e., with an entropy of 0
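  • a minimal sketch in Python (not from the slides) computing the entropy of a set of class labels:

```python
import numpy as np

def entropy(labels):
    """Entropy (in bits) of the class distribution in a set of instances."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

print(entropy([1, 1, 1, 1]))   # 0.0 -> all instances of the same class
print(entropy([0, 0, 1, 1]))   # 1.0 -> evenly spread over two classes
```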
37
Q

naive bayes

A
  • we use Bayes’ formula to express the probability of a target value g given our observations x
  • (ensembles) bagging: create multiple models on samples of the data and combine the outputs of the models using some aggregation function
  • bagging tackles the variance problem
38
Q

ensembles: boosting

A

build models sequentially/iteratively and focus models on where we made mistakes before

  • focuses on bias problem
39
Q

forward selection

A
  • iteratively add the most predictive feature
  • results in a simpler model compared to backward selection
40
Q

backward selection

A
  • iteratively remove the least predictive feature
  • more complex model than forward selection
41
Q

regularization

A
  • add a term to the error function to punish more complex models (i.e., to simplify the model)
  • as the regularization parameter goes up, the training-set accuracy goes down, while the test-set accuracy goes up (up to a point)
  • i.e., generalizability improves if we increase regularization
  • the higher the parameter value, the more complex models are punished, and hence, the more simple models (with low weights) are favored
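  • as a sketch, a ridge-style (L2) regularized error function — one common choice, not necessarily the one used in the slides:

```latex
% regularized error: in-sample error plus a penalty on large weights,
% with lambda the regularization parameter (larger lambda -> simpler models favored)
E_{\text{reg}}(h) = E_{\text{in}}(h) + \lambda \sum_{i} w_i^2
```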
42
Q

methods to avoid overfitting

A
  1. forward selection
  2. backward selection
  3. regularization
43
Q

NN parameters

A
  1. hidden layer composition
  2. maximum iterations
44
Q

SVM parameters

A
  1. maximum iterations
  2. C
  3. tolerance
  4. kernel function
45
Q

KNN parameters

A
  1. k
46
Q

decision tree parameters

A
  1. minimum samples per leaf
    - if we set this number too low, overfitting is likely to occur since leaves then represent only a few examples. hence, this parameter says something about the complexity of the tree.
    - performance on the training set increases when we decrease the minimum number of examples per leaf. the same holds for the test set, up to a certain point.
  2. splitting criterion
47
Q

random forest parameters

A
  1. minimum samples per leaf
  2. number of trees
  3. splitting criterion
48
Q

naive bayes parameters

A

none

49
Q

results on the test set

A
  1. compute 95% CIs
  2. compute confusion matrix
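  • a minimal sketch in Python (not from the slides), using a normal-approximation 95% CI for accuracy and a binary confusion matrix on hypothetical predictions:

```python
import numpy as np

def accuracy_ci(y_true, y_pred, z=1.96):
    """95% confidence interval for accuracy via the normal approximation."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    acc = np.mean(y_true == y_pred)
    half = z * np.sqrt(acc * (1 - acc) / len(y_true))
    return acc - half, acc + half

def confusion_matrix(y_true, y_pred):
    """2x2 confusion matrix [[TN, FP], [FN, TP]] for binary targets."""
    m = np.zeros((2, 2), dtype=int)
    for t, p in zip(y_true, y_pred):
        m[t, p] += 1
    return m

y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]
print(accuracy_ci(y_true, y_pred))        # (0.45, 1.05) around accuracy 0.75
print(confusion_matrix(y_true, y_pred))
```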
50
Q

supervised learning: noise distribution

A
  1. bernoulli or categorical distribution for discrete target
  2. normal distribution for continuous target
51
Q

are we able to learn the perfect model f?

A
  • no, even with infinite training samples
  • because we cannot guarantee that f is an element of the hypothesis set H
  • despite this, learning is possible if the in-sample error is a good estimator of the out-sample error
52
Q

PAC learnable: N

A

with a higher N, the absolute difference between the in- and out-sample error is most likely very small

53
Q

PAC learnable: finite hypothesis space

A

every finite set of hypotheses is PAC learnable

54
Q

predictive models without notion of time

A
  1. NNs: perceptron and CNN
  2. SVM
  3. k-NN
  4. decision trees
  5. naive bayes
  6. ensembles: random forest, gradient boosting
55
Q

why could model outcomes be extremely good

A
  1. dataset has limited variation
  2. the data spans only a short time period
  3. data split results in overlapping temporal windows: data leakage