Week 4 Flashcards
What are the possible reasons for overfitting when creating a decision tree? How can overfitting be avoided?
1) too much variance in the training data: the data is not a representative sample of the instance space, so the tree splits on irrelevant features
2) too much noise in the training data: incorrect feature values or class labels
3) avoid by:
- pre-pruning: stop growing the tree once there is insufficient data to make reliable decisions
- post-pruning: grow the full decision tree, then remove nodes with insufficient evidence
- mechanism: prune the children of a node S if all of its children are leaves & accuracy on the validation set does not decrease when the most frequent class label is assigned to all items at S
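The post-pruning mechanism can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the dict-based tree layout, the "majority" field and the function names are all assumptions made for the example.

```python
# Hypothetical tree layout (an assumption for this sketch):
#   leaf:     {"label": c}
#   internal: {"feature": i, "children": {value: subtree}, "majority": c}
# "majority" is the most frequent training label among items reaching the node.

def predict(node, x):
    """Follow the splits until a leaf is reached."""
    while "label" not in node:
        child = node["children"].get(x[node["feature"]])
        if child is None:              # unseen feature value: use majority label
            return node["majority"]
        node = child
    return node["label"]

def accuracy(tree, data):
    """Fraction of (x, y) pairs the tree classifies correctly."""
    return sum(predict(tree, x) == y for x, y in data) / len(data)

def prune(tree, node, validation):
    """Bottom-up pruning of `node` inside `tree`, modifying the tree in place."""
    if "label" in node:
        return
    for child in node["children"].values():
        prune(tree, child, validation)
    # Only consider nodes whose children are all leaves.
    if all("label" in c for c in node["children"].values()):
        before = accuracy(tree, validation)
        saved = dict(node)
        node.clear()
        node["label"] = saved["majority"]   # tentatively collapse S to a leaf
        if accuracy(tree, validation) < before:
            node.clear()
            node.update(saved)              # accuracy dropped: undo the pruning
```

Calling prune(tree, tree, validation) starts at the root; because children are pruned first, a node whose subtrees have all collapsed to leaves becomes a pruning candidate in the same pass.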
Outline the causes of the different forms of generalisation error
1) approximation error:
- due to hypothesis space being smaller than target space ⇒ underlying function may lie outside hypothesis space.
- poor choice of model space ⇒ large approximation error, i.e., model mismatch
2) estimation error:
- due to learning procedure which selects non-optimal model from hypothesis space
3) ⇒ empirical risk $\hat{f}_{n,N}(x)$
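One way to write how the two errors combine, under assumed notation that is not in the card: $R$ is the expected risk, $f_0$ the target function, $f_n$ the best model in the hypothesis space, and $\hat{f}_{n,N}$ the model actually selected from that space using $N$ samples.

```latex
\underbrace{R(\hat{f}_{n,N}) - R(f_0)}_{\text{generalisation error}}
  = \underbrace{R(f_n) - R(f_0)}_{\text{approximation error}}
  + \underbrace{R(\hat{f}_{n,N}) - R(f_n)}_{\text{estimation error}}
```

The first term depends only on the choice of hypothesis space; the second depends on the learning procedure and the amount of data.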
Three types of statistical learning
1) Empirical modelling
2) Neural networks
3) Support Vector Machines (SVMs)
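Given a set of data samples, what is the aim of a support vector machine?
- embodies the Structural Risk Minimisation (SRM) principle, which has been shown to be superior to the Empirical Risk Minimisation (ERM) principle
- SRM minimises an upper bound on the expected risk, whereas ERM minimises the error on the training data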
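A rough sketch of what "minimising an upper bound" can look like, assuming Vapnik's VC bound (expected risk ≤ empirical risk + a confidence term that grows with the VC dimension h and shrinks with the sample size n, holding with probability 1 - eta). The function names and the candidate models are made up for illustration.

```python
import math

def vc_confidence(h, n, eta=0.05):
    """Confidence term of the VC bound for VC dimension h and n samples.

    Assumes 0 < h <= n and 0 < eta < 1; the bound holds with probability 1 - eta.
    """
    return math.sqrt((h * (math.log(2 * n / h) + 1) - math.log(eta / 4)) / n)

def risk_bound(empirical_risk, h, n, eta=0.05):
    """Upper bound on expected risk: training error plus the confidence term."""
    return empirical_risk + vc_confidence(h, n, eta)

# Hypothetical candidates: (VC dimension, empirical risk) for nested model classes.
candidates = [(2, 0.20), (10, 0.10), (200, 0.00)]
n = 1000

erm_choice = min(candidates, key=lambda c: c[1])                     # lowest training error
srm_choice = min(candidates, key=lambda c: risk_bound(c[1], c[0], n))  # lowest bound
```

With these made-up numbers ERM prefers the most complex class because its training error is zero, while the bound favours a simpler class because the confidence term penalises a large h.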
Given a set of data samples, what are the characteristics of an applied neural network?
- difficulties with generalisation ⇒ models can overfit the data due to the optimisation algorithms used for parameter selection & the statistical measures used to select the ‘best’ model
- uses the Empirical Risk Minimisation (ERM) principle, which minimises error on the training data
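A toy sketch of why minimising training error alone can overfit. The data, the two hypotheses and all names below are invented for illustration; a neural network is replaced here by a simple memorising classifier to keep the example self-contained.

```python
def empirical_risk(h, data):
    """Fraction of (x, y) pairs that hypothesis h misclassifies."""
    return sum(h(x) != y for x, y in data) / len(data)

def erm(hypotheses, train):
    """Empirical Risk Minimisation: pick the hypothesis with lowest training error."""
    return min(hypotheses, key=lambda h: empirical_risk(h, train))

# Assumed true concept: label is 1 when x > 3.5. The point (3, 1) is label noise.
train = [(1, 0), (2, 0), (3, 1), (5, 1), (6, 1), (7, 1)]
held_out = [(2.6, 0), (2.9, 0), (3.2, 0), (4.5, 1)]

def stump(x):
    """Simple hypothesis: a single threshold."""
    return int(x > 3.5)

def memoriser(x):
    """Flexible hypothesis: copy the label of the nearest training point."""
    nearest = min(train, key=lambda pair: abs(pair[0] - x))
    return nearest[1]

chosen = erm([stump, memoriser], train)
```

ERM selects the memoriser because it reproduces the noisy training label exactly, yet on the held-out points near the noisy sample it does worse than the simpler threshold.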
Given a set of data samples, what is the aim of empirical modelling?
- modelling is a process of induction ⇒ build a model from observations in order to deduce system responses that have yet to be observed
- quantity & quality of observations govern the performance of the model
- observed data is finite & sampled, typically non-uniformly
- consequently the problem is ill-posed