lecture 5 - supervised learning in predictive modeling Flashcards
what do we mean when we say ‘machines can learn’ (Mitchell)
- a computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.
- i.e., we say that something learns as long as it gets better at performing the task with new experiences
- E could be, for example, a single instance (one observed data point)
what do we need for a machine to learn
- task
- historical data
- improvement on task performance
supervised learning: functional relationship
- learning a functional relationship f: X → Y
- we map an observation x ∈ X to the target y ∈ Y
- X represents the space of all inputs, Y is the space of all possible targets
- this relationship is not likely to be deterministic, due to measurement error and noisy targets
supervised learning: unknown conditional target distribution p(y|x)
- instead of a deterministic target y = f(x), we assume the observed target is f(x) plus noise drawn from a noise distribution (noisy targets)
- this makes it more difficult to learn the target function
- by accounting for noisy targets, we replace the deterministic target function with the conditional target distribution p(y|x) (see the sketch below)
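A minimal sketch of this idea, assuming a made-up target function, input distribution, and Gaussian noise (all illustrative assumptions, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical "true" target function f(x) -- unknown in practice
def f(x):
    return 2.0 * x + 1.0

# sample inputs from p(x) and noisy targets from p(y|x): y = f(x) + noise
x = rng.uniform(-1, 1, size=100)
y = f(x) + rng.normal(loc=0.0, scale=0.3, size=x.shape)
```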
supervised learning: observed data O
- O = {(x_j, y_j)}_{j=1}^{N}
- generated by p(x) (which inputs we observe) and p(y|x) (which targets we observe for them)
- separate the data into a training, validation, and test set
- learn a function f-hat(x) that fits the observed data in the training set
- evaluate the generalisability of f-hat(x) on the test set
- stopping the learning process is based on the validation set; in a small dataset we use cross-validation (see the split sketch below)
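A minimal sketch of such a split, assuming scikit-learn is available; the 60/20/20 proportions and the dummy arrays are arbitrary illustrations:

```python
import numpy as np
from sklearn.model_selection import train_test_split

x = np.arange(100).reshape(-1, 1)   # dummy inputs
y = np.arange(100)                  # dummy targets

# first split off the test set, then split the remainder into train/validation
x_rest, x_test, y_rest, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
x_train, x_val, y_train, y_val = train_test_split(
    x_rest, y_rest, test_size=0.25, random_state=0)  # 0.25 of 80% = 20% of all data
```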
supervised learning: unbalanced data
in case of unbalanced data (e.g., classification with rare classes), we should use stratified splits so that the class proportions are preserved in each set (see the sketch below)
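A minimal sketch, assuming scikit-learn; the 90/10 class ratio is a made-up example of unbalanced data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

x = np.arange(100).reshape(-1, 1)
y = np.array([0] * 90 + [1] * 10)   # unbalanced labels: 90% class 0, 10% class 1

# stratify=y keeps the 90/10 class ratio in both parts of the split
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, stratify=y, random_state=0)
```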
supervised learning: hypothesis set H
- all possible functions you can construct using the features you have
- you want to identify the right hypothesis from those options
- the selected hypothesis f-hat tries to approximate the unknown target function f
supervised learning: error measure E()
- assume we have a hypothesis h for the target function
- loss (e): the error (risk) per data point – e(f(x), h(x))
- risk (E): the overall difference between h and f across the input space – E(f, h)
- from e to E: E(f, h) = ∫ e(f(x), h(x)) p(x) dx, i.e., the loss at each point weighted by the pdf of the input data p(x)
- since we don't have f, we can't compute E exactly
- E is therefore approximated by the average loss over the observed data points: (1/N) Σ_j e(y_j, h(x_j))
definitions for e
- classification: classification rate, F1, AUC, etc.
- regression: MSE, MAE, etc. (see the sketch below)
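A minimal sketch of a few of these error measures, assuming scikit-learn's metrics module; the label and prediction arrays are made up for illustration:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, mean_squared_error, mean_absolute_error

# classification: made-up labels and predictions
y_true_cls = np.array([1, 0, 1, 1, 0])
y_pred_cls = np.array([1, 0, 0, 1, 0])
print(accuracy_score(y_true_cls, y_pred_cls))  # classification rate
print(f1_score(y_true_cls, y_pred_cls))        # F1

# regression: made-up targets and predictions
y_true_reg = np.array([1.0, 2.0, 3.0])
y_pred_reg = np.array([1.1, 1.9, 3.4])
print(mean_squared_error(y_true_reg, y_pred_reg))   # MSE
print(mean_absolute_error(y_true_reg, y_pred_reg))  # MAE
```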
in-sample error
- the error of the hypothesis f-hat on the training data
- E_{in}(h) = (1/N) Σ_j e(y_j, h(x_j))
- use x and y from our training set
- we select the hypothesis that minimizes the in-sample error
- i.e., machines learn by minimizing the in-sample error
- easy to compute, since it only requires the training data (see the sketch below)
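A minimal sketch of "learning = minimizing the in-sample error", assuming squared loss and a tiny hand-made hypothesis set of three linear functions (all made up for illustration):

```python
import numpy as np

# made-up training data
x_train = np.array([0.0, 1.0, 2.0, 3.0])
y_train = np.array([1.1, 2.9, 5.2, 6.8])

# a tiny hypothesis set H: three candidate linear functions
hypotheses = {
    "h1": lambda x: 1.0 * x + 1.0,
    "h2": lambda x: 2.0 * x + 1.0,
    "h3": lambda x: 3.0 * x + 0.0,
}

def in_sample_error(h, x, y):
    # E_in(h) = (1/N) * sum of squared losses e(y_j, h(x_j))
    return np.mean((y - h(x)) ** 2)

# "learning": select the hypothesis with the lowest in-sample error
errors = {name: in_sample_error(h, x_train, y_train) for name, h in hypotheses.items()}
print(errors, "-> selected:", min(errors, key=errors.get))
```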
out-sample error
- the error of the hypothesis f-hat on unseen data
- E_{out}(h) = ∫ e(f(x), h(x)) p(x) dx, integrated over the input space X
- the input space X for the out-sample error is not restricted to the training set (unseen data)
- used to assess the generalizability of the model
- hard to compute, since p(x) and f are unknown (see the sketch below)
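A minimal sketch of why E_out is hard: it is an expectation over p(x), which can only be computed here because the data-generating process is simulated (the target, hypothesis, and input distribution below are illustrative assumptions); in practice it is approximated by the error on a held-out test set:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):           # hypothetical true target function (unknown in practice)
    return 2.0 * x + 1.0

def h(x):           # some learned hypothesis f-hat
    return 1.8 * x + 1.2

# Monte Carlo estimate of E_out = integral of e(f(x), h(x)) * p(x) dx,
# with p(x) uniform on [-1, 1] and squared loss
x_fresh = rng.uniform(-1, 1, size=100_000)
print(np.mean((f(x_fresh) - h(x_fresh)) ** 2))
```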
supervised learning: model selection
- we select the hypothesis (model) that has the lowest error on the validation set
- we should be careful with overfitting and not use too many features (see the sketch below)
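A minimal sketch of model selection via cross-validation, assuming scikit-learn; the candidates (ridge regression with different regularization strengths) and the synthetic data are arbitrary illustrations:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, 0.5, 0.0, 0.0, 0.0]) + rng.normal(scale=0.3, size=200)

# pick the candidate model with the best (cross-)validation score
candidates = [Ridge(alpha=a) for a in (0.01, 0.1, 1.0, 10.0)]
scores = [cross_val_score(m, X, y, cv=5, scoring="neg_mean_squared_error").mean()
          for m in candidates]
print(candidates[int(np.argmax(scores))])
```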
learning theory
- though we can’t learn the perfect model, learning is possible if the in-sample error is a good estimator of the out-sample error
- i.e., the generalization gap |E_{out}(f-hat) - E_{in}(f-hat)| should be small
- with a larger N, this difference is most likely very small
PAC learnable
- a hypothesis set is PAC learnable if a learning algorithm exists that fulfills the following condition: for every error margin ε > 0 and confidence level δ ∈ (0, 1), there is a sample size m, so that for a random training sample of size larger than m, the following inequality holds:
- P(|E_{out}(f-hat) - E_{in}(f-hat)| ≤ ε) ≥ 1 - δ, i.e., the absolute difference between the in- and out-sample error of f-hat is smaller than ε with probability at least 1 - δ
- this inequality states that the generalization error is close to the empirical error (within ε), with high probability (at least 1 - δ)
- this is about understanding the conditions under which a learning algorithm can generalize well from training data to unseen data
PAC learnable: δ
- δ indicates how certain we want to be that this difference is smaller than ε.
- δ close to 0 = (almost) completely certain
- lower values = stricter guarantees that the difference is smaller than ε
every finite set of hypotheses is PAC learnable
- this states that as long as we have a finite number of possible hypotheses, the hypothesis set is PAC learnable
- this means that with enough training samples, we can ensure that the model's performance on new data will be close to its performance on the training data, with high probability (see the bound sketched below)
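A minimal sketch of the sample-size side of this claim, based on the standard Hoeffding + union-bound argument; the concrete bound N ≥ ln(2|H|/δ) / (2ε²) and the bounded-loss assumption are my illustration, not a formula quoted from the slides:

```python
import math

def sample_complexity(num_hypotheses: int, epsilon: float, delta: float) -> int:
    """Training-set size needed so that |E_in - E_out| <= epsilon holds for every
    hypothesis in a finite hypothesis set, with probability at least 1 - delta
    (Hoeffding inequality + union bound, loss assumed bounded in [0, 1])."""
    return math.ceil(math.log(2 * num_hypotheses / delta) / (2 * epsilon ** 2))

print(sample_complexity(num_hypotheses=1000, epsilon=0.05, delta=0.05))
```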
VC dimension: shatter
- in a binary classification problem with N data points, we have 2^N ways to label the data
- a hypothesis set H (e.g., all lines in the plane) shatters X when it can represent every possible labeling (i.e., all 2^N of them); see the sketch below
- VC dimension d_{vc} of a hypothesis set is defined as: the maximum number of input vectors that can be shattered.
- i.e., it’s the largest set of points for which you can find a hypothesis in H to perfectly classify every possible labeling of these points.
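A minimal brute-force sketch of the shattering check for lines (linear classifiers) in 2D, assuming scikit-learn; it enumerates all 2^N labelings, so it is only sensible for tiny N. Expected outcome: 3 points in general position can be shattered, 4 points cannot (the VC dimension of lines in 2D is 3):

```python
import itertools
import numpy as np
from sklearn.svm import SVC

def can_shatter(points: np.ndarray) -> bool:
    """Check whether linear classifiers can realize every labeling (dichotomy)
    of the given points, i.e. whether the point set is shattered."""
    for labels in itertools.product([0, 1], repeat=len(points)):
        if len(set(labels)) < 2:
            continue  # a single-class labeling is always realizable
        y = np.array(labels)
        clf = SVC(kernel="linear", C=1e6).fit(points, y)
        if clf.score(points, y) < 1.0:
            return False  # found a labeling no line can reproduce
    return True

three = np.array([[0, 0], [1, 0], [0, 1]])
four = np.array([[0, 0], [1, 0], [0, 1], [1, 1]])
print(can_shatter(three), can_shatter(four))  # expected: True False
```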
what does the VC dimension value represent
- it tells us the largest number of points that can be shattered (i.e., classified in all possible ways) by the hypothesis set
- the VC dimension is infinite if there are arbitrarily large sets of input vectors that can be shattered
- result: all hypothesis sets with a finite VC dimension are PAC learnable
VC dimension: PAC learnable
- Despite having potentially infinite hypotheses, if the hypothesis set has a finite VC dimension, it means the complexity of the model is bounded.
- This bounded complexity implies that the set is PAC learnable, meaning with enough data, the algorithm will probably (with high probability) and approximately (within some error threshold) learn a hypothesis that generalizes well on unseen data.
VC dimension: m_H(N)
- the growth function m_H(N) measures the growth of the complexity of the hypothesis set: the maximum number of distinct labelings (dichotomies) H can produce on N data points
- i.e., how many effectively different hypotheses you have as the dataset gets bigger
- if this growth is limited (it grows polynomially rather than like 2^N), then you're able to provide guarantees on the difference between the in- and out-sample error
implications of PAC formula
- with few training examples, it is easy to obtain a low in-sample error, but the out-sample error will be high; as N increases, the two converge (the in-sample error rises towards, and the out-sample error falls towards, the same level)
- with a fixed N, more complex hypothesis sets lead to a better representation of the data in the training set (i.e., a lower in-sample error)
- however, we can make it too complex and overfit, leading to an increase of the out-sample error; we need to find the sweet spot of complexity
- low complexity stabilizes at a higher error rate compared to high complexity
- so, the more data you have, the more beneficial a high-complexity hypothesis set will be (see the sketch below)
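A minimal sketch of searching for the complexity sweet spot, assuming scikit-learn and using the polynomial degree as the complexity knob; the data-generating function is a made-up example:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=100).reshape(-1, 1)
y = np.sin(x).ravel() + rng.normal(scale=0.2, size=100)

# compare hypothesis sets of increasing complexity (polynomial degree)
for degree in (1, 3, 5, 10, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    mse = -cross_val_score(model, x, y, cv=5, scoring="neg_mean_squared_error").mean()
    print(degree, round(mse, 3))  # error typically drops, then rises again when overfitting
```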
training and test splits: models for individual level
- take the time component into account and split train and test data at a specific point in time within a qs
- ignore the time component and use random sampling over the instances of a qs (see the sketch below)
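A minimal sketch of the two options, assuming the data of one qs is in a pandas DataFrame with a 'timestamp' column; the column names and the cut-off point are illustrative assumptions:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# hypothetical data of a single qs, one row per instance
df = pd.DataFrame({
    "timestamp": pd.date_range("2024-01-01", periods=100, freq="h"),
    "feature": np.arange(100),
    "target": np.arange(100) % 2,
})

# option 1: respect the time component -- everything before the cut-off is train
cutoff = pd.Timestamp("2024-01-03")
train_time, test_time = df[df["timestamp"] < cutoff], df[df["timestamp"] >= cutoff]

# option 2: ignore the time component -- random sampling over the instances
train_rand, test_rand = train_test_split(df, test_size=0.2, random_state=0)
```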