Preprocess Flashcards

1
Q

The 6 steps of CRISP-DM

A
  1. Business understanding
  2. Data understanding
  3. Data preparation
  4. Modelling
  5. Evaluation
  6. Deployment
2
Q

EDA GOAL

A

Frame hypotheses, visualise the data, discover patterns, spot anomalies.

3
Q

3 steps of EDA

A
  1. Generate questions.
  2. Search for answers by visualising, transforming, and modelling the data.
  3. Use the answers to refine the questions or generate new ones.
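
A minimal sketch of this loop in Python, assuming pandas and matplotlib are installed; the file sales.csv and the columns order_size, discount, and region are invented for illustration:

```python
# Hypothetical EDA loop; the dataset and column names are made up.
import pandas as pd

df = pd.read_csv("sales.csv")  # invented file name

# 1. Generate a question: "Do larger orders get bigger discounts?"
# 2. Search for an answer by visualising and transforming the data.
print(df[["order_size", "discount"]].describe())
df.plot.scatter(x="order_size", y="discount")

# 3. Whatever pattern appears suggests a refined question,
#    e.g. "Does the relationship differ per region?"
print(df.groupby("region")["discount"].mean())
```
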
4
Q

One-hot encoding

A

For each distinct value of a categorical feature, a new binary feature is added; its value is 1 if the sample has that category and 0 otherwise.
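
A minimal sketch with pandas; the colour column and its values are invented for illustration:

```python
import pandas as pd

df = pd.DataFrame({"colour": ["red", "green", "blue", "green"]})

# Each distinct value becomes its own 0/1 feature.
encoded = pd.get_dummies(df, columns=["colour"], dtype=int)
print(encoded)
#    colour_blue  colour_green  colour_red
# 0            0             0           1
# 1            0             1           0
# 2            1             0           0
# 3            0             1           0
```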

5
Q

Normalisation

A

KNN, SVM, and gradient-descent-based methods are affected by the scale of the data. Normalisation (min-max scaling) rescales each feature to a common range such as [0, 1]; use it when the distribution is not assumed to be normal.
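
A minimal sketch of min-max normalisation, assuming scikit-learn; the toy data is invented:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[1.0], [5.0], [10.0]])  # toy single-feature data

# Rescales each feature to [0, 1]: x' = (x - min) / (max - min)
X_norm = MinMaxScaler().fit_transform(X)
print(X_norm.ravel())  # [0.         0.44444444 1.        ]
```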

6
Q

Standardisation

A

Use if the distribution is (approximately) normal. Standardisation (z-score scaling) centres each feature around 0 with unit variance, putting all features on a comparable scale.
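
A minimal sketch of z-score standardisation, assuming scikit-learn; the toy data is invented:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0], [5.0], [10.0]])  # toy single-feature data

# Centres each feature at mean 0 with unit variance:
# x' = (x - mean) / std
X_std = StandardScaler().fit_transform(X)
print(X_std.mean(), X_std.std())  # ~0.0 and 1.0
```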

7
Q

Discretisation

A

Convert continuous values into discrete bins; e.g. some decision-tree algorithms (such as classic ID3) require discrete values.
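
A minimal sketch of equal-width binning, assuming scikit-learn; the toy values are invented:

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

X = np.array([[0.2], [1.5], [4.7], [8.9], [9.9]])  # toy continuous values

# Three equal-width bins, encoded as ordinal integers 0, 1, 2.
disc = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")
print(disc.fit_transform(X).ravel())  # [0. 0. 1. 2. 2.]
```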

8
Q

MAR

A

Missing At Random: the missingness depends on other observed variables but not on the value of the missing data itself, e.g. whether income is missing depends on education level but not on the income itself.

9
Q

MNAR

A

Missing Not At Random: the missingness depends on the missing value itself, e.g. high-income individuals are less likely to disclose their income.

10
Q

DELETION

A
  • When less than 5% of the samples contain missing values
  • Or when the data is MCAR (Missing Completely At Random); see the sketch below
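
A minimal sketch of this rule of thumb with pandas; the age/income frame is invented, and the data is assumed to be MCAR:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, 31, np.nan, 40],
                   "income": [30, 45, 50, 38]})

# Fraction of rows with at least one missing value.
frac_missing = df.isna().any(axis=1).mean()

# Listwise deletion only when few rows are affected (the 5% rule).
if frac_missing < 0.05:
    df = df.dropna()
```
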
11
Q

The curse of dimensionality:

A

High-dimensional data tends to contain irrelevant and redundant features, with high correlation between features.

12
Q

Filter FS

A

Uses existing statistical measures (e.g. correlation, information gain) to score features; no learning algorithm is involved.
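
A minimal sketch of a filter method, assuming scikit-learn: score features with a simple statistical measure (here, variance) and keep those above a threshold, without training any model. The toy data is invented:

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Column 0 is near-constant, column 1 varies.
X = np.array([[0.0, 1.0], [0.0, 2.0], [0.1, 3.0], [0.0, 4.0]])

selector = VarianceThreshold(threshold=0.1)
X_reduced = selector.fit_transform(X)
print(X_reduced.shape)  # (4, 1): the near-constant column is dropped
```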

13
Q

Wrapper FS

A

Uses the learning performance of a model to evaluate candidate feature subsets, which requires training the ML model many times.

14
Q

Embedded FS

A

Train the ML model once, then select features based on the learned model (e.g. its coefficients or feature importances).
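
A minimal sketch of an embedded method, assuming scikit-learn: fit an L1-penalised model once and keep the features with non-zero coefficients. The synthetic data is for illustration only:

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=100, n_features=10,
                       n_informative=3, random_state=0)

# One training run; the L1 penalty drives useless coefficients to zero.
selector = SelectFromModel(Lasso(alpha=1.0)).fit(X, y)
print(selector.get_support())  # boolean mask of the selected features
```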

15
Q

Feature Ranking

A

Evaluate features individually.
Rank the features and select the top-ranked ones.
Simple and efficient.
But ignores feature interactions (it can select redundant features).
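
A minimal sketch of feature ranking, assuming scikit-learn and the iris dataset: each feature is scored on its own and the top k are kept:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# Score every feature individually (ANOVA F-test), keep the best 2.
selector = SelectKBest(score_func=f_classif, k=2).fit(X, y)
print(selector.scores_)        # one score per feature
print(selector.get_support())  # mask of the top-ranked features
```

Note that nothing here checks whether the selected features are redundant with each other, which is exactly the limitation this card describes.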

16
Q

Feature Subset Selection

A

Iteratively evaluates whole feature subsets. Considers feature interactions and generally achieves better performance, but is more expensive and complicated.

17
Q

SFS steps

A
  1. Start with an empty feature set.
  2. Sequentially add the feature X that results in the highest objective value when combined with the current set.
  3. Stop when the pre-defined number N of features is selected.
    (works best when the optimal subset has a small number N of features)
18
Q

SBFS steps

A
  1. Start from the full feature set.
  2. Sequentially remove the feature X whose removal results in the highest objective value.
  3. Stop when a pre-defined number of features is selected.
    (works best when the optimal subset has a large number N of features)
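
A minimal sketch of both directions, assuming scikit-learn's SequentialFeatureSelector and the iris dataset; as a wrapper method, the model is retrained for every candidate feature at every step:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=3)

# Forward (SFS): start empty, greedily add until 2 features are selected.
sfs = SequentialFeatureSelector(knn, n_features_to_select=2,
                                direction="forward")
print(sfs.fit(X, y).get_support())

# Backward (SBS): start full, greedily remove until 2 features remain.
sbs = SequentialFeatureSelector(knn, n_features_to_select=2,
                                direction="backward")
print(sbs.fit(X, y).get_support())
```
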
19
Q

Filter FC

A

Classification accuracy: low, computational cost: low, generality: high.

20
Q

Embedded FC

A

Classification accuracy: medium, computational cost: medium, generality: medium.

21
Q

Wrapper FC

A

Classification accuracy: high, computational cost: high, generality: low.

22
Q

PCA

A

A mathematical procedure that linearly transforms (possibly) correlated features into a (smaller) number of uncorrelated features called principal components. The goal is to retain as much of the data variance as possible.
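
A minimal sketch, assuming scikit-learn and the iris dataset: project the 4 correlated input features onto 2 uncorrelated principal components:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                # (150, 2)
print(pca.explained_variance_ratio_)  # share of total variance retained
```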

23
Q

What is data variance? Why higher variance?

A

Variance: the mean of the squared deviations from the mean; it measures the spread of your dataset. We prefer directions with higher variance because we want to retain as much information as possible.
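
The definition on this card, checked numerically with NumPy on invented values:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])

deviations = x - x.mean()
print((deviations ** 2).mean())  # 1.25, the mean squared deviation
print(np.var(x))                 # same: 1.25
```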
