Data Science MODULE 3 Flashcards

1
Q

What does it mean when a statistical model has been overfit?

A

It mirrors the training data extremely well, but doesn’t perform well with unseen data

2
Q

Why is overfitting such a big problem?

A

The goal of a data scientist is to predict responses with a learned model. If you overfit, it has a direct negative impact on your ability to predict.
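
A quick sketch of why (toy data, numpy only; the function, noise level, and names are illustrative assumptions): a very flexible model can drive the training error to nearly zero while its error on unseen data stays large.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: noisy samples from an assumed underlying function
x_train = np.linspace(0.0, 1.0, 12)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0.0, 0.2, x_train.size)
x_test = np.linspace(0.02, 0.98, 50)
y_test = np.sin(2 * np.pi * x_test)

def fit_errors(degree):
    # Fit a polynomial of the given degree to the TRAINING points only,
    # then report the mean squared error on seen and unseen data
    coefs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coefs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coefs, x_test) - y_test) ** 2)
    return train_mse, test_mse

simple_train, simple_test = fit_errors(3)     # a reasonably flexible model
overfit_train, overfit_test = fit_errors(11)  # interpolates the noise exactly
```

The degree-11 polynomial mirrors the 12 training points almost perfectly, but its test error is far worse than its training error.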

3
Q

In tree-based models, the algorithm is programmed to? (2)

A
  • reduce the impurity of partitions in a classification problem
  • reduce the mean squared error in regression problems
4
Q

When fitting data we need to find the balance between?

A

Bias and variance

5
Q

What is variance?

A

It indicates an error related to how robust the model is when unseen data is used

6
Q

What does it mean when a model has high variance?

A

It performs well on the training data but not on the test set, i.e. the model is overfitted

7
Q

The ideal model in terms of variance and bias

A

Low variance and low bias

8
Q

Models with a high bias tend to be?

A

Too simple; they don't capture the shape of the data

9
Q

Can you measure the variance and bias?

A

No, because we don't know what the underlying data actually looks like

10
Q

What is a validation error?

A

As I understand it, you train the model but also keep test data aside. Once training is done, you test the model on this unseen data and determine the error. This is known as the validation set approach.
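
A minimal sketch of the validation set approach, assuming scikit-learn and the iris toy dataset (both illustrative choices, not from the module):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Hold part of the data aside; the model never sees it during training
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0
)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# The error on the held-out (unseen) data is the validation error
train_error = 1 - model.score(X_train, y_train)
validation_error = 1 - model.score(X_val, y_val)
```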

11
Q

If the validation error is high but the training error is small?

A

You have done it, you have overfitted the data

12
Q

So there are actually three sets of data we work with

A

Initial training data, validation data, and then test data

13
Q

So during fitting, how is the validation data actually used?

A

We try to find the best value for the hyperparameter (denoted as alpha)

14
Q

Different hyperparameters are assessed to determine the one that results in?

A

The lowest validation error

15
Q

So when do we start using the test data?

A

Once the validation error has been minimised through the hyperparameter adjustments

16
Q

Two of the most common types of validation used?

A

K-fold cross-validation
Standard (hold-out) validation

17
Q

What does the data split look like for standard validation?

A

70% for training, 20% for validation, 10% for the test set
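
Assuming scikit-learn is available, this 70/20/10 split can be sketched with two calls to train_test_split (the data below is dummy):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(100, 1)
y = np.arange(100)

# First peel off the 10% test set ...
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=10, random_state=0
)

# ... then carve 20% of the ORIGINAL data out of the remainder as validation
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=20, random_state=0
)
```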

18
Q

Broadly, how does k-fold cross validation work?

A

Initially the data is only split into training data and a test set. The training data is then divided into k "folds", one of which is used for validation. Records are randomly assigned to the folds, so the overall distribution should be fairly even.

19
Q

So if you have five folds, how does the training work?

A

Each fold gets a turn to be the validation set. Once a fold has been used for validation, it is used again for training. The model is therefore trained five times, and the final validation error is the average of the 5 errors.
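
The five-fold procedure can be sketched with scikit-learn's cross_val_score (the dataset and model are illustrative assumptions):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Each of the 5 folds serves once as the validation set, so the
# model is trained five times
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)

# The final validation error is the average of the five fold errors
validation_error = 1 - scores.mean()
```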

20
Q

Two advantages of k-fold validation?

A
  1. You get a better idea of the final error
  2. It gives you a better idea of the variation in the validation error
21
Q

So how do we manage complexity with tree-based models?

A

Via pruning and validation sets

22
Q

What is the main goal of cost complexity pruning?

A

To reduce overfitting by penalising large trees

23
Q

How does the penalising work with cost complexity pruning

A

A penalty term is included in the cost function. This term involves the hyperparameter alpha as well as |T|, the number of terminal nodes (leaves), so the penalised cost is the tree's error plus alpha × |T|.

24
Q

Broadly, what are the three steps followed to prune a decision tree?

A
  1. Generate the large tree
  2. Use k-fold cross-validation to find the optimum hyperparameter
  3. Apply cost complexity pruning to the large tree with the optimum alpha value
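
These three steps can be sketched with scikit-learn, which supports cost complexity pruning through the ccp_alpha parameter of its tree estimators (the breast cancer dataset here is an illustrative assumption):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# 1. Generate the large tree; its pruning path gives candidate alphas
big_tree = DecisionTreeClassifier(random_state=0).fit(X, y)
alphas = big_tree.cost_complexity_pruning_path(X, y).ccp_alphas[:-1]

# 2. K-fold cross-validation to find the optimum hyperparameter
cv_scores = [
    cross_val_score(
        DecisionTreeClassifier(ccp_alpha=a, random_state=0), X, y, cv=5
    ).mean()
    for a in alphas
]
best_alpha = alphas[int(np.argmax(cv_scores))]

# 3. Apply cost complexity pruning with the optimum alpha by refitting
final_tree = DecisionTreeClassifier(ccp_alpha=best_alpha, random_state=0).fit(X, y)
```

The last candidate alpha is dropped because it would prune the tree down to its root.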
25
Q

As the hyperparameter is increased?

A

Branches are pruned from the tree

26
Q

So the tree that remains after pruning is known as the?

A

The subtree, and it depends on the alpha value

27
Q

For a given alpha,

A

The subtree is found that minimises the penalised cost function

28
Q

So how do we find the best alpha value?

A

By using k-fold validation for each alpha value and seeing where we get the lowest average error

29
Q

With k-fold validation, what do the models developed in each step look like?

A

They differ, because we use a different subset of the data each time

30
Q

So once the k-fold validation has been done to determine the optimum alpha, how do we get the final decision tree?

A

By retraining on the whole training set with the optimum alpha

31
Q

Can you do cost-complexity pruning in Python?

A

Yes: since scikit-learn 0.22 the tree estimators accept a ccp_alpha parameter. Python also offers pre-pruning (early stopping) as a way to grow smaller trees in the first place.

32
Q

You also get pruning by depth (by changing the max_depth parameter in Python with scikit-learn)?

A

Yes: the accuracy is determined as a function of depth, and the depth with the highest score during validation is specified as the maximum depth in the final model.

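A sketch of choosing max_depth by validation score with scikit-learn (the dataset and depth range are illustrative assumptions):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Score each candidate depth with cross-validation ...
depth_scores = {
    depth: cross_val_score(
        DecisionTreeClassifier(max_depth=depth, random_state=0), X, y, cv=5
    ).mean()
    for depth in range(1, 8)
}

# ... and keep the depth with the highest validation score
best_depth = max(depth_scores, key=depth_scores.get)
final_model = DecisionTreeClassifier(max_depth=best_depth, random_state=0).fit(X, y)
```
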
33
Q

What are some of the other pre-pruning methods?

A

Minimum samples per node, maximum number of terminal nodes, minimum purity increase, minimum samples per split

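Assuming scikit-learn, these criteria map onto DecisionTreeClassifier parameters; the specific values below are arbitrary examples:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

model = DecisionTreeClassifier(
    min_samples_leaf=5,          # minimum samples per terminal node
    max_leaf_nodes=10,           # maximum number of terminal nodes
    min_impurity_decrease=0.01,  # minimum purity increase to allow a split
    min_samples_split=10,        # minimum samples needed to split a node
    random_state=0,
).fit(X, y)
```
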
34
Q

The resubstitution validation technique?

A

Nothing fancy; basically, the error is computed on the same data set that was used for training.

35
Q

What is hold-out validation?

A

It uses stratification: a certain portion of the data is held out for validation purposes (this is the standard validation discussed earlier).

36
Q

With k-fold validation, how many folds are used for training?

A

K-1; the remaining fold is then the test data. The process is then repeated.

37
Q

What is LOOCV?

A

Leave-one-out cross-validation. Basically k-folds on steroids: instead of using a fold, an iteration is done for every record.

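A sketch with scikit-learn's LeaveOneOut, showing that the number of iterations equals the number of records (toy data):

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut

X = np.arange(20).reshape(10, 2)  # 10 dummy records

loo = LeaveOneOut()
n_iterations = loo.get_n_splits(X)  # one iteration per record

# In every iteration, exactly one record is held out for validation
test_sizes = [len(test_idx) for _, test_idx in loo.split(X)]
```
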
38
Q

What is the random subsampling validation technique?

A

Similar to k-folds and LOOCV, but here you just do a set number of iterations; the training set is drawn at random and the test set is what remains.

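A sketch of random subsampling with scikit-learn's ShuffleSplit: a fixed number of iterations, each with a fresh random train/test partition (toy data):

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

X = np.arange(100).reshape(100, 1)  # dummy records

# Four iterations; each draws a fresh random 75/25 train/test partition
splitter = ShuffleSplit(n_splits=4, test_size=0.25, random_state=0)
partitions = list(splitter.split(X))
```
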
39
Q

The bootstrapping validation technique?

A

A bit chaotic; basically the same as random subsampling, but certain records can also be repeated (sampling with replacement).
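
A numpy-only sketch of the bootstrap idea: sampling with replacement means some records appear more than once and others not at all (toy data, seeded for reproducibility):

```python
import numpy as np

rng = np.random.default_rng(0)
records = np.arange(50)  # dummy record indices

# Draw a bootstrap sample: same size as the data, WITH replacement,
# so some records repeat and others are left out
sample = rng.choice(records, size=records.size, replace=True)
n_unique = np.unique(sample).size
```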