Week 4 Flashcards

Question 1

Q

What are the possible reasons for overfitting when creating a decision tree? How
can overfitting be avoided?

Answer

A

1) too much variance in the training data: the data is not a representative
sample of the instance space, so the tree splits on irrelevant features
2) too much noise in the training data: incorrect feature values or class labels
3) avoid by:
- pre-pruning: stop growing the tree once there is insufficient data to make reliable decisions
- post-pruning: grow the full decision tree, then remove nodes with insufficient evidence
- mechanism (post-pruning): prune the children of node S if all of them are leaves & accuracy on the validation set does not decrease when the most frequent class label is assigned to all items at S
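
The post-pruning mechanism can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the `Node` class, the feature names and the tiny validation set are all hypothetical, and each node is assumed to already store the most frequent training label seen at that node.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    label: str                      # most frequent training class at this node (assumed precomputed)
    feature: Optional[str] = None   # hypothetical: feature tested at this node
    children: dict = field(default_factory=dict)  # feature value -> Node

    def is_leaf(self):
        return not self.children


def predict(node, x):
    # Walk down the tree; fall back to the node's own label on an unseen value.
    while not node.is_leaf():
        child = node.children.get(x[node.feature])
        if child is None:
            break
        node = child
    return node.label


def accuracy(root, data):
    return sum(predict(root, x) == y for x, y in data) / len(data)


def prune(root, node, validation):
    # Bottom-up, so that nodes whose children have just become leaves are considered too.
    for child in node.children.values():
        prune(root, child, validation)
    if node.is_leaf() or not all(c.is_leaf() for c in node.children.values()):
        return
    before = accuracy(root, validation)
    saved = node.children
    node.children = {}              # tentatively make S a leaf with its majority label
    if accuracy(root, validation) < before:
        node.children = saved       # validation accuracy dropped: undo the prune


# Toy tree: the split on "noise" is irrelevant, the split on "outlook" is not.
s = Node("yes", "noise", {"a": Node("yes"), "b": Node("no")})
root = Node("yes", "outlook", {"sunny": s, "rain": Node("no")})
validation = [
    ({"outlook": "sunny", "noise": "a"}, "yes"),
    ({"outlook": "sunny", "noise": "b"}, "yes"),
    ({"outlook": "rain", "noise": "a"}, "no"),
]
prune(root, root, validation)
```

In this toy case the irrelevant split on "noise" is removed because validation accuracy does not decrease, while the split on "outlook" is kept because removing it would lower validation accuracy.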

Question 2

Q

Outline the causes of the different forms of generalisation error

Answer

A

1) approximation error:
- due to the hypothesis space being smaller than the target space ⇒ underlying function may lie outside the hypothesis space
- poor choice of model space ⇒ large approximation error, i.e., model mismatch
2) estimation error:
- due to the learning procedure selecting a non-optimal model from the hypothesis space
3) ⇒ empirical risk $\hat{f}_{n,N}(x)$
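
One way to see how the two error terms combine is the telescoping identity below. The notation is an assumption made for illustration: $R(\cdot)$ is the expected risk, $f_0$ the target function, $f_n$ the best model available in the hypothesis space, and $\hat{f}_{n,N}$ the model the learning procedure actually selects from $N$ samples.

```latex
\underbrace{R(\hat{f}_{n,N}) - R(f_0)}_{\text{generalisation error}}
  = \underbrace{R(f_n) - R(f_0)}_{\text{approximation error}}
  + \underbrace{R(\hat{f}_{n,N}) - R(f_n)}_{\text{estimation error}}
```

The first term depends only on the choice of hypothesis space; the second depends on the learning procedure and the data.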

Question 3

Q

Name three types of statistical learning

Answer

A

1) Empirical modelling
2) Neural networks
3) Support Vector Machines (SVMs)

Question 4

Q

Given a set of data samples what is the aim of a support vector machine?

Answer

A

to embody the Structural Risk Minimisation (SRM) principle, which is superior to Empirical Risk Minimisation (ERM)
SRM minimises an upper bound on the expected risk
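
A common form of that upper bound is Vapnik's VC bound: with probability 1 - eta, expected risk <= empirical risk + sqrt((h(ln(2N/h) + 1) - ln(eta/4)) / N), where h is the VC dimension and N the number of samples. The sketch below evaluates it for two made-up candidate models; the function names, empirical risks and VC dimensions are hypothetical values chosen only to show how SRM and ERM can disagree.

```python
import math


def vc_confidence(h, n, eta=0.05):
    """Capacity term of the VC bound for VC dimension h and n samples."""
    return math.sqrt((h * (math.log(2 * n / h) + 1) - math.log(eta / 4)) / n)


def risk_bound(empirical_risk, h, n, eta=0.05):
    """Upper bound on the expected risk, holding with probability 1 - eta."""
    return empirical_risk + vc_confidence(h, n, eta)


n = 1000
# Hypothetical candidates: name -> (empirical risk, VC dimension)
candidates = {"simple": (0.10, 5), "complex": (0.02, 200)}

# ERM looks only at the training error; SRM looks at the whole bound.
erm_choice = min(candidates, key=lambda k: candidates[k][0])
srm_choice = min(candidates, key=lambda k: risk_bound(*candidates[k], n))
```

With these numbers ERM prefers the complex model, because its training error is lower, whereas SRM prefers the simple one, because the capacity term dominates the bound for the complex model.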

Question 5

Q

Given a set of data samples what are the characteristics of an applied neural network?

Answer

A

difficulties with generalisation ⇒ models can overfit the data due to: the optimisation algorithms used for parameter selection,
the statistical measures used to select the ‘best’ model
based on the Empirical Risk Minimisation (ERM) principle, which minimises error on the training data
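
The link between ERM and overfitting can be shown with a toy example. This is a deliberately simple stand-in, not a neural network: the data, the two candidate models and the injected label noise are all invented for illustration.

```python
def error(model, data):
    # Fraction of items the model labels incorrectly.
    return sum(model(x) != y for x, y in data) / len(data)


# True rule: label is True when x >= 5. The training labels at x = 2 and x = 7 are flipped (noise).
test = [(x, x >= 5) for x in range(10)]
train = [(x, (x >= 5) != (x in (2, 7))) for x in range(10)]


def threshold(x):
    # Low-capacity model: a single threshold.
    return x >= 5


table = dict(train)


def memoriser(x):
    # High-capacity model: looks up the training label, noise included.
    return table.get(x, False)


candidates = {"threshold": threshold, "memoriser": memoriser}

# ERM: pick whichever candidate has the lowest error on the training data.
chosen = min(candidates, key=lambda name: error(candidates[name], train))
```

Here ERM selects the memoriser, which has zero training error but reproduces the two noisy labels on clean test data, while the threshold model makes two training errors and none on the test data.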

Question 6

Q

Given a set of data samples what is the aim of empirical modelling?

Answer

A

modelling uses a process of induction ⇒ to deduce system responses that have yet to be observed
quantity & quality of observations govern the performance of the model
observed data is finite & sampled (typically non-uniformly)
⇒ the problem is ill-posed
