week 1 Flashcards

1
Q

prosecutor’s fallacy

A

P(A|B) != P(B|A)
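A quick numeric sketch of why confusing the two conditionals is a fallacy, using made-up numbers: even when P(match | innocent) is one in a million, P(guilty | match) can be only 50%.

```python
# Prosecutor's fallacy with illustrative (made-up) numbers:
# P(evidence | innocent) is tiny, yet P(innocent | evidence) can be large.
p_match_given_innocent = 1e-6   # chance a random innocent person matches
n_innocent = 1_000_000          # pool of innocent suspects
n_guilty = 1                    # the actual culprit, who always matches

expected_innocent_matches = p_match_given_innocent * n_innocent  # = 1.0
# Among everyone who matches, what fraction is actually guilty?
p_guilty_given_match = n_guilty / (n_guilty + expected_innocent_matches)
print(p_guilty_given_match)  # 0.5, despite the one-in-a-million match rate
```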

2
Q

data dredging

A

performing many statistical tests and reporting only those with significant results / claiming that a model learned from data represents the truth
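A small simulation of the dredging effect: testing many null hypotheses on pure noise still yields "significant" results at alpha = 0.05, roughly 5% of the time.

```python
import random
random.seed(0)

# Data dredging in miniature: under the null hypothesis, p-values are
# uniform on [0, 1], so alpha = 0.05 flags ~5% of tests by chance alone.
def fake_p_value():
    return random.random()  # a p-value drawn under the null

n_tests = 1000
significant = sum(1 for _ in range(n_tests) if fake_p_value() < 0.05)
print(significant)  # roughly 50 "discoveries" from pure noise
```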

3
Q

Bonferroni’s principle

A

even in completely random datasets, you can expect particular events of interest to occur, and to occur in increasing numbers as the amount of data grows
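A toy calculation of the principle, with made-up numbers: assign n people random IDs out of a million and count the pairs expected to share an ID by pure chance; the count grows quadratically with n.

```python
# Bonferroni's principle, toy setting: n people each get a random ID out
# of n_ids possibilities; count pairs expected to collide by chance.
def expected_coincidences(n_people, n_ids):
    n_pairs = n_people * (n_people - 1) / 2
    return n_pairs / n_ids  # each pair collides with probability 1/n_ids

for n in (1_000, 10_000, 100_000):
    print(n, expected_coincidences(n, n_ids=1_000_000))
# at n = 100,000 we already expect ~5,000 purely coincidental "matches"
```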

4
Q

power laws

A

a linear relationship between the logs of 2 variables; a lot of real-world data follows this rule. to lessen the effect of very large values, take the log
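A minimal check of the log-log linearity, with made-up constants: for y = c·x^a, the slope between any two points in log space is exactly the exponent a.

```python
import math

# A power law y = c * x**a is linear in log space:
#   log y = log c + a * log x
c, a = 2.0, -1.5              # made-up constants for illustration
xs = [1, 10, 100, 1000]
ys = [c * x**a for x in xs]

log_x = [math.log(x) for x in xs]
log_y = [math.log(y) for y in ys]

# The slope between consecutive log-log points is constant and equals a.
slopes = [(log_y[i + 1] - log_y[i]) / (log_x[i + 1] - log_x[i])
          for i in range(len(xs) - 1)]
print(slopes)  # each slope is (up to float error) -1.5
```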

5
Q

data transformation

A

Box-Cox transform: a family of power transformations whose result depends on the parameter lambda
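A minimal sketch of the Box-Cox formula for positive inputs, showing how the result depends on lambda (lambda = 0 reduces to the log transform):

```python
import math

# Box-Cox transform (positive x only):
#   y = (x**lam - 1) / lam   if lam != 0
#   y = log(x)               if lam == 0
def box_cox(x, lam):
    if lam == 0:
        return math.log(x)
    return (x**lam - 1) / lam

print(box_cox(10.0, 1))    # 9.0  (linear, just shifted by 1)
print(box_cox(10.0, 0))    # natural log of 10
print(box_cox(10.0, 0.5))  # an intermediate power transform
```

(scipy.stats.boxcox offers the same transform and can also fit lambda from the data.)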

6
Q

feature extraction

A

embedding - mapping raw data into a (usually lower-dimensional) vector space

7
Q

feature transformation

A

categorical: label encoding, target encoding (sort by class probability), one-hot encoding (one column per class; binary encoding is also possible)
continuous: equal-range binning, equal-frequency binning, clustering, classification, reducing dimensionality and clustering the latent space
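Two of the categorical encodings above, sketched on made-up data:

```python
# Label and one-hot encoding sketches (data is made up for illustration).
colors = ["red", "green", "red", "blue"]

# Label encoding: map each class to an integer.
classes = sorted(set(colors))                 # ['blue', 'green', 'red']
label_encoded = [classes.index(c) for c in colors]
print(label_encoded)                          # [2, 1, 2, 0]

# One-hot encoding: one binary column per class.
one_hot = [[1 if c == cls else 0 for cls in classes] for c in colors]
print(one_hot)  # [[0,0,1], [0,1,0], [0,0,1], [1,0,0]]
```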

8
Q

feature addition

A

x^2 or x(t) - x(t - 1), but be careful to NEVER USE CLASS LABELS FROM THE TEST SET
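Both added features sketched on a made-up series:

```python
# Feature addition sketches: a squared feature and the first difference
# x(t) - x(t-1), computed on made-up series data.
xs = [1.0, 3.0, 2.0, 6.0]

squared = [x**2 for x in xs]                          # adds x^2
diffs = [xs[t] - xs[t - 1] for t in range(1, len(xs))]  # x(t) - x(t-1)
print(squared)  # [1.0, 9.0, 4.0, 36.0]
print(diffs)    # [2.0, -1.0, 4.0]
```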

9
Q

data sampling

A

if 99% of the data is benign and 1% is fraudulent, a classifier that always predicts "benign" already reaches 99% accuracy => need data sampling
techniques: oversample the minority class, undersample the majority class, reweighting, synthesizing
IMPORTANT: NEVER MODIFY THE TEST SET
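A sketch of the simplest technique, random oversampling of the minority class, on made-up data (applied to the training set only):

```python
import random
random.seed(0)

# Random oversampling sketch: duplicate minority examples (sampling with
# replacement) until the training classes are balanced.
majority = [("benign", i) for i in range(99)]
minority = [("fraud", i) for i in range(1)]

target = len(majority)  # grow the minority class to the majority's size
oversampled_minority = [random.choice(minority) for _ in range(target)]
train = majority + oversampled_minority
print(len(train))  # 198 rows, now 50/50 benign vs fraud
```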

10
Q

SMOTE

A

for every minority instance, take a random neighbour, construct the vector between them, and place a new artificial data point at a random position along that vector. before running this algorithm you can remove Tomek links (pairs of nearest neighbours belonging to different classes)
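A minimal 1-D sketch of the interpolation step, on made-up data; real SMOTE picks the neighbour from the k nearest minority neighbours rather than from all minority points:

```python
import random
random.seed(0)

# SMOTE's core step in 1-D: for each minority point, pick another
# minority point and interpolate a synthetic point between them.
minority = [1.0, 1.2, 1.5, 2.0]

synthetic = []
for x in minority:
    neighbour = random.choice([m for m in minority if m != x])
    t = random.random()                   # random position along the vector
    synthetic.append(x + t * (neighbour - x))

print(synthetic)  # new points, each lying between two real minority points
```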

11
Q

missing values

A

imputation (replacing missing values):
hot-deck - replace with values from similar rows
multiple imputation - impute several times and combine the results
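A hot-deck sketch on made-up rows: the missing value is copied from the most similar complete row, where similarity here is just distance on the hypothetical `age` column:

```python
# Hot-deck imputation sketch: fill a missing value with the value from
# the most similar complete row (column names and data are made up;
# similarity is measured only on 'age' for simplicity).
rows = [
    {"age": 25, "income": 30000},
    {"age": 40, "income": 60000},
    {"age": 27, "income": None},   # missing income
]

for row in rows:
    if row["income"] is None:
        donors = [r for r in rows if r["income"] is not None]
        donor = min(donors, key=lambda r: abs(r["age"] - row["age"]))
        row["income"] = donor["income"]

print(rows[2]["income"])  # 30000, copied from the closest row (age 25)
```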
