Data Science MODULE 3 Flashcards

1
Q

What does it mean when a statistical model has been overfit?

A

It mirrors the training data extremely well, but doesn’t perform well with unseen data

2
Q

Why is overfitting such a big problem?

A

The goal of a data scientist is to predict responses using a learned model. If you overfit, it has a direct negative impact on your ability to predict on unseen data.

3
Q

In tree-based models, the algorithm is programmed to? (2)

A
  • reduce the impurity of partitions in classification problems
  • reduce the mean squared error in regression problems
4
Q

When fitting data we need to find the balance between?

A

Bias and variance

5
Q

What is variance?

A

An error component that indicates how robust the model is when it encounters unseen data

6
Q

What does it mean when a model has high variance?

A

It performs well on the training data but not on the test set, i.e. the model is overfitted

7
Q

The ideal model in terms of variance and bias

A

Low variance and low bias

8
Q

Models with high bias tend to be?

A

Too simple; they don’t capture the shape of the data

9
Q

Can you measure the variance and bias?

A

Nope, because we don’t know what the underlying data really looks like

10
Q

What is a validation error?

A

As I understand it: you train the model, but you also hold some data aside. Once training is done, you test the model on this unseen data and determine the error. This is known as the validation set approach

11
Q

If the validation error is high but the training error is small?

A

You have done it: you have overfitted the data

12
Q

So there are actually three sets of data that we work with?

A

The initial training data, validation data, and then test data

13
Q

So during fitting, how is the validation data actually used?

A

We try to find the best value for the hyperparameter (denoted as alpha)

14
Q

Different hyperparameters are assessed to determine the one that results in?

A

The lowest validation error

15
Q

So when do we start using the test data?

A

Once the validation error has been minimised through the hyperparameter adjustments

16
Q

Two of the most common types of validation used?

A

K-fold cross-validation
Standard (hold-out) validation

17
Q

What does the data split look like for standard validation?

A

70% for the training set, 20% for the validation set, 10% for the test set
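
A minimal sketch of this 70/20/10 split, assuming scikit-learn's train_test_split applied twice (the second split takes 2/9 of the remaining 90% so the overall shares come out at 70/20/10):

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split

    X, y = load_iris(return_X_y=True)

    # First carve off the 10% test set.
    X_rest, X_test, y_rest, y_test = train_test_split(
        X, y, test_size=0.10, random_state=0)

    # Split the remaining 90% into 70% train / 20% validation overall,
    # i.e. the validation share of the remainder is 20/90 = 2/9.
    X_train, X_val, y_train, y_val = train_test_split(
        X_rest, y_rest, test_size=2/9, random_state=0)

    print(len(X_train), len(X_val), len(X_test))  # roughly 70% / 20% / 10%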

18
Q

Broadly, how does k-fold cross-validation work?

A

Initially the data is just split into training data and a test set. The training data is then divided into k "folds", one of which is used for validation. The data is randomly assigned to each fold, so the overall distribution should be fairly even

19
Q

So if you have five folds, how does the training work?

A

Each fold gets a turn to be the validation set. Once a fold has been used for validation, it goes back into the training data. So the model is trained five times. The final validation error is then the average of the 5 errors.
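
A minimal sketch of this five-fold procedure, assuming scikit-learn's KFold with shuffling (the random assignment from the previous card) and a decision tree as an illustrative model:

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.model_selection import KFold
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    kf = KFold(n_splits=5, shuffle=True, random_state=0)

    errors = []
    for train_idx, val_idx in kf.split(X):
        model = DecisionTreeClassifier(random_state=0)
        model.fit(X[train_idx], y[train_idx])      # train on the other 4 folds
        acc = model.score(X[val_idx], y[val_idx])  # validate on the held-out fold
        errors.append(1 - acc)                     # classification error

    print(np.mean(errors))  # final validation error = average of the 5 errors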

20
Q

Two advantages of k-fold validation?

A
  1. You get a better idea of the final error
  2. It gives you a better idea of the variation in the validation error
21
Q

So how do we manage complexity with tree-based models?

A

Via pruning and validation sets

22
Q

What is the main goal of cost complexity pruning?

A

To reduce overfitting by penalising large trees

23
Q

How does the penalising work in cost complexity pruning?

A

A penalty term is added to the cost function. This penalty term includes the hyperparameter (alpha) as well as |T|, the number of terminal nodes in the tree, so the penalised cost is roughly error(T) + alpha·|T|

24
Q

Broadly, what are the three steps followed to prune a decision tree?

A
  1. Grow the large tree
  2. Use k-fold cross-validation to find the optimum hyperparameter
  3. Apply cost complexity pruning to the large tree with the optimum alpha value (see the sketch below)
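
A sketch of these three steps, assuming scikit-learn 0.22+ (which exposes cost complexity pruning through cost_complexity_pruning_path and the ccp_alpha parameter); the dataset and cv=5 are illustrative choices:

    import numpy as np
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)

    # Step 1: grow the large tree and collect the candidate alpha values.
    path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

    # Step 2: k-fold cross-validation to find the optimum alpha.
    cv_scores = [
        cross_val_score(DecisionTreeClassifier(ccp_alpha=a, random_state=0),
                        X, y, cv=5).mean()
        for a in path.ccp_alphas
    ]
    best_alpha = path.ccp_alphas[int(np.argmax(cv_scores))]

    # Step 3: prune by refitting the tree with the optimum alpha.
    final_tree = DecisionTreeClassifier(ccp_alpha=best_alpha, random_state=0).fit(X, y)
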
25
Q

As the hyperparameter is increased?

A

Branches are pruned from the tree

26
Q

So the tree that remains after pruning is known as the?

A

Subtree, and it depends on the alpha value

27
Q

For a given alpha,

A

The subtree that minimises the penalised cost function is found

28
Q

So how do we find the best alpha value?

A

By running k-fold validation for each alpha value and seeing where we get the lowest average error

29
Q

With k-fold validation, what do the models developed at each step look like?

A

They all differ, because we use a different subset of the data each time

30
Q

So once we have finished the k-fold validation to determine the optimum alpha, how do we get the final decision tree?

A

By retraining on the whole training set with the optimum alpha

31
Q

Can you do cost-complexity pruning in Python?

A

Historically scikit-learn did not support it, but since version 0.22 it is available through the ccp_alpha parameter. In Python we can also do pre-pruning, or early stopping, to get smaller trees

32
Q

You also get pruning by depth (by changing the max_depth parameter in Python with scikit-learn)?

A

Yep: the accuracy is determined as a function of depth, and the depth with the highest score during validation is specified as the maximum depth in the final model
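
A minimal sketch of that depth sweep, assuming cross-validated accuracy as the score; the range of candidate depths is an arbitrary choice:

    import numpy as np
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)

    depths = list(range(1, 11))  # candidate maximum depths (illustrative)
    scores = [cross_val_score(DecisionTreeClassifier(max_depth=d, random_state=0),
                              X, y, cv=5).mean()
              for d in depths]

    # The depth with the highest validation score becomes max_depth in the final model.
    best_depth = depths[int(np.argmax(scores))]
    final_model = DecisionTreeClassifier(max_depth=best_depth, random_state=0).fit(X, y)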

33
Q

What are some of the other pre-pruning methods?

A

Minimum samples per node, maximum number of terminal nodes, minimum purity increase, minimum samples per split
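
For reference, these pre-pruning knobs map onto DecisionTreeClassifier parameters in scikit-learn; the values below are arbitrary examples, not recommendations:

    from sklearn.tree import DecisionTreeClassifier

    tree = DecisionTreeClassifier(
        min_samples_leaf=5,          # minimum samples per terminal node
        max_leaf_nodes=20,           # maximum number of terminal nodes
        min_impurity_decrease=0.01,  # minimum purity increase required to split
        min_samples_split=10,        # minimum samples per split
        random_state=0,
    )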

34
Q

The resubstitution validation technique

A

Nothing fancy: the model is evaluated on the same dataset it was trained on, and the error computed on that dataset is the resubstitution error

35
Q

What is hold-out validation?

A

It uses stratification: a certain portion of the data is held out for validation purposes (this is the standard validation discussed earlier)

36
Q

With k-fold validation, how many folds are used for training?

A

k − 1; the remaining fold is then used for validation. The process is then repeated

37
Q

What is LOOCV?

A

Leave-one-out cross-validation
Basically k-fold on steroids: instead of using a fold, an iteration is done for every single record
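
A minimal sketch of LOOCV, assuming scikit-learn's LeaveOneOut splitter; note that it refits the model once per record, which gets expensive on large datasets:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import LeaveOneOut, cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)

    # One fit per record: 150 fits for the 150-row iris dataset.
    scores = cross_val_score(DecisionTreeClassifier(random_state=0),
                             X, y, cv=LeaveOneOut())
    print(1 - scores.mean())  # LOOCV error estimate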

38
Q

What is the random subsampling validation technique?

A

Similar to k-fold and LOOCV, but in this case you just run a set number of iterations: the training set is a random sample of the data, and the test set is whatever remains
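
Random subsampling corresponds to scikit-learn's ShuffleSplit: a fixed number of iterations, each with a fresh random train/test split; the 10 iterations and 20% test share here are arbitrary choices:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import ShuffleSplit, cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)

    ss = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
    scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=ss)
    print(1 - scores.mean())  # average error over the 10 random splits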

39
Q

Bootstrapping validation technique?

A

A bit chaotic: basically the same as random subsampling, but certain records can also be repeated, because the sampling is done with replacement
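
A minimal bootstrap sketch: sample the training rows with replacement (so records can repeat) and validate on the "out-of-bag" rows the sample missed; sklearn.utils.resample does the sampling, and 10 rounds is an arbitrary choice:

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.utils import resample

    X, y = load_iris(return_X_y=True)
    n = len(X)

    errors = []
    for i in range(10):  # number of bootstrap rounds (arbitrary)
        idx = resample(np.arange(n), replace=True, n_samples=n, random_state=i)
        oob = np.setdiff1d(np.arange(n), idx)  # rows never drawn this round
        model = DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx])
        errors.append(1 - model.score(X[oob], y[oob]))

    print(np.mean(errors))  # bootstrap (out-of-bag) error estimate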