Preprocessing Flashcards
The 6 steps of CRISP-DM
- Business understanding
- Data understanding
- Data preparation
- Modelling
- Evaluation
- Deployment
EDA GOAL
Frame hypotheses, visualise the data, discover patterns, spot anomalies
3 steps of EDA
- Generate questions
- Search for answers by visualising, transforming, and modelling the data.
- Refine existing questions and generate new ones based on the answers
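A minimal sketch of this loop in Python with pandas and matplotlib; the houses.csv file and its price/location columns are hypothetical stand-ins for any tabular dataset:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical dataset: any CSV with numeric "price" and categorical "location"
df = pd.read_csv("houses.csv")

# Step 1: generate a question, e.g. "how is price distributed?"
print(df["price"].describe())

# Step 2: search for the answer by visualising
df["price"].hist(bins=30)
plt.xlabel("price")
plt.show()

# Step 3: refine, e.g. "is the long tail explained by location?"
print(df.groupby("location")["price"].median())
```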
One-hot encoding
For each distinct value, a new feature is added, and the value is either 1 or 0 for that feature.
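A minimal sketch with pandas; the colour column is a hypothetical example:

```python
import pandas as pd

df = pd.DataFrame({"colour": ["red", "green", "red", "blue"]})

# One new column per distinct value; 1 where the row has that value, else 0
encoded = pd.get_dummies(df, columns=["colour"], dtype=int)
print(encoded)
```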
Normalisation
Algorithms such as KNN, SVM, and gradient-descent-based methods are affected by the scale of the data. Normalisation rescales features to a common range, typically [0, 1]; it does not change the shape of the distribution.
Standardisation
Use if the distribution is approximately normal. Rescales each feature to zero mean and unit variance (z-scores).
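A minimal sketch of both transforms using scikit-learn's MinMaxScaler and StandardScaler on a toy array:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [5.0], [10.0], [100.0]])

# Normalisation: rescale to a fixed range, here [0, 1]
print(MinMaxScaler().fit_transform(X).ravel())

# Standardisation: zero mean, unit variance (z-scores)
print(StandardScaler().fit_transform(X).ravel())
```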
Discretisation
Converts continuous values into discrete bins. E.g. the classic ID3 decision-tree algorithm requires discrete values.
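A minimal sketch using scikit-learn's KBinsDiscretizer; the age values and bin count are illustrative:

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

ages = np.array([[3], [17], [25], [42], [68]])

# Equal-width binning into 3 ordinal categories (0, 1, 2)
disc = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")
print(disc.fit_transform(ages).ravel())
```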
MAR (Missing At Random)
The probability that a value is missing depends on other observed variables but not on the missing value itself; e.g. whether income is missing depends on education level, but not on the income itself.
MNAR (Missing Not At Random)
The probability that a value is missing depends on the missing value itself; e.g. high-income individuals are less likely to disclose their income.
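A minimal simulation contrasting the two mechanisms, assuming hypothetical education_years and income columns:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "education_years": rng.integers(8, 20, 1000),
    "income": rng.normal(50_000, 15_000, 1000),
})

# MAR: the chance income is missing depends on education (observed),
# not on income itself
mar = df.copy()
mar.loc[rng.random(1000) < (mar["education_years"] < 12) * 0.5, "income"] = np.nan

# MNAR: the chance income is missing depends on income itself
mnar = df.copy()
mnar.loc[rng.random(1000) < (mnar["income"] > 70_000) * 0.5, "income"] = np.nan
```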
DELETION
- When less than 5% of the data contains missing values
- Or the data is MCAR (Missing Completely At Random)
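A minimal sketch of listwise deletion with pandas; the 5% threshold follows the rule of thumb above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2, np.nan, 4], "b": [5, 6, 7, 8]})

# Listwise deletion: only defensible for a small missing fraction (or MCAR)
missing_fraction = df["a"].isna().mean()
if missing_fraction < 0.05:
    df = df.dropna()
```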
The curse of dimensionality:
High-dimensional data tends to contain many irrelevant and redundant features, with high correlation between features.
Filter FS
Uses existing statistical measures of the data; no learning algorithm is involved.
Wrapper FS
Uses the learning performance of an ML model to evaluate features, so the model is trained many times.
Embedded FS
Train the ML model once; select features based on the learned model.
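A minimal sketch contrasting the three categories with scikit-learn; the breast-cancer dataset and the particular estimators are illustrative choices, not the only options:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif, RFE, SelectFromModel
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# Filter: scores each feature with a statistical test, no model training
filt = SelectKBest(f_classif, k=10).fit(X, y)

# Wrapper: repeatedly trains the model, dropping the weakest feature each round
wrap = RFE(LogisticRegression(max_iter=5000), n_features_to_select=10).fit(X, y)

# Embedded: trains the model once, keeps features with non-zero learned weights
emb = SelectFromModel(LogisticRegression(penalty="l1", solver="liblinear")).fit(X, y)

print(filt.get_support().sum(), wrap.get_support().sum(), emb.get_support().sum())
```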
Feature Ranking
Evaluate features individually
Rank features and select the top-ranked features
Simple, efficient
But ignores feature interactions (it can select redundant features)
Feature Subset Selection
Iteratively evaluates whole feature subsets. Considers feature interactions and has better performance, but is more expensive/complicated.
SFS (Sequential Forward Selection) steps
- Start with an empty feature set
- Sequentially add the feature X that yields the highest objective value when combined with the current set.
- Stop when a pre-defined number N of features is selected.
(Works best when the optimal subset has a small number of features.)
SBFS steps
- Start from the full feature set
- Sequentially remove the feature X whose removal yields the highest objective value
- Stop when a pre-defined number of features is selected.
(Works best when the optimal number N of features is large.)
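A minimal sketch of both procedures using scikit-learn's SequentialFeatureSelector; the wine dataset, KNN estimator, and N = 5 are illustrative:

```python
from sklearn.datasets import load_wine
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

X, y = load_wine(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=3)

# Forward selection: start empty, add the best feature each step
sfs = SequentialFeatureSelector(knn, n_features_to_select=5,
                                direction="forward").fit(X, y)

# Backward selection: start full, remove the worst feature each step
sbs = SequentialFeatureSelector(knn, n_features_to_select=5,
                                direction="backward").fit(X, y)

print(sfs.get_support())
print(sbs.get_support())
```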
Filter FS comparison
Classification accuracy: low; computational cost: low; generality: high
Embedded FS comparison
Classification accuracy: medium; computational cost: medium; generality: medium
Wrapper FS comparison
Classification accuracy: high; computational cost: high; generality: low
PCA
A mathematical procedure that linearly transforms (possibly) correlated features into a (smaller) number of uncorrelated features called principal components. The goal is to retain as much of the data variance as possible.
What is data variance? Why higher variance?
Variance: the average squared deviation from the mean; it measures the spread of the dataset. We want higher variance because we want to retain as much information as possible.
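A minimal PCA sketch with scikit-learn on the iris dataset; explained_variance_ratio_ shows how much of the total variance each component retains:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)

# Project the 4 correlated features onto 2 uncorrelated principal components
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

# Fraction of the total variance each component retains
print(pca.explained_variance_ratio_)
```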