Week 1 Flashcards
prosecutor’s fallacy
mistaking P(A|B) for P(B|A) - in general, P(A|B) != P(B|A)
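a quick numeric sketch in Python (the probabilities are made up for illustration) of how different the two conditionals can be:

    # P(match | innocent) is tiny, but that is NOT P(innocent | match)
    p_match_given_innocent = 1e-6   # chance a random innocent person matches
    p_innocent = 0.9999             # prior: almost everyone is innocent
    p_match_given_guilty = 1.0      # the guilty person always matches

    # Bayes' rule: P(innocent | match)
    p_match = (p_match_given_innocent * p_innocent
               + p_match_given_guilty * (1 - p_innocent))
    print(p_match_given_innocent * p_innocent / p_match)  # ~0.0099, not 1e-6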
data dredging
performing many statistical tests and reporting only those with significant results / claiming that a model learned from the data represents the truth
bonferroni’s principle
even in completely random datasets, you can expect particular events of interest to occur, and to occur in increasing numbers as the amount of data grows
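a minimal simulation (arbitrary alpha and test count) showing "significant" findings emerging from pure noise:

    import numpy as np

    rng = np.random.default_rng(0)
    alpha, n_tests = 0.05, 1000
    data = rng.normal(size=(n_tests, 30))   # 1000 experiments of pure noise
    # one-sample t-like statistic per experiment
    t = data.mean(axis=1) / (data.std(axis=1, ddof=1) / np.sqrt(30))
    significant = np.abs(t) > 1.96          # two-sided test at ~alpha = 0.05
    # roughly alpha * n_tests false "discoveries"; the count grows with n_tests
    print(significant.sum())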
power laws
linear relationship between the logs of 2 variables (y = c * x^k, so log y = log c + k * log x); a lot of real-world data follows this rule. to lessen the effect of very large values, you NEED TO take the log (sketch below)
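a sketch of recovering a power law as a straight line in log-log space (c = 3 and k = -1.5 are made-up values):

    import numpy as np

    rng = np.random.default_rng(1)
    x = np.linspace(1, 1000, 500)
    y = 3.0 * x ** -1.5 * rng.lognormal(sigma=0.1, size=x.size)  # y = c * x^k plus noise

    # after taking logs, an ordinary straight-line fit recovers k and c
    k, log_c = np.polyfit(np.log(x), np.log(y), deg=1)
    print(k, np.exp(log_c))  # ~ -1.5 and ~ 3.0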
data transformation
box-cox transform - family of power functions - result depends on lambda: (x^lambda - 1) / lambda for lambda != 0, ln(x) for lambda = 0
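a sketch using scipy's boxcox, which also searches for the lambda that makes the data most normal (the lognormal input is made up):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)
    skewed = rng.lognormal(mean=0.0, sigma=1.0, size=1000)  # strictly positive input
    transformed, best_lambda = stats.boxcox(skewed)
    print(best_lambda)  # near 0 for lognormal data (lambda = 0 means plain log)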
feature extraction
embedding
feature transformation
categorical : label encoding, target encoding (encode categories by class probability), one-hot encoding (one binary column per class; a binary encoding is also possible)
continuous : equal-range binning, equal-frequency binning, clustering, classification, reduce dimensionality and cluster the latent space (see the encoding sketch below)
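an encoding sketch on made-up toy data, using pandas for one-hot encoding and for equal-range / equal-frequency binning:

    import pandas as pd

    df = pd.DataFrame({"color": ["red", "blue", "red", "green"],
                       "income": [20_000, 35_000, 90_000, 400_000]})

    one_hot = pd.get_dummies(df["color"])       # one binary column per class
    equal_range = pd.cut(df["income"], bins=3)  # bins of equal width
    equal_freq = pd.qcut(df["income"], q=2)     # bins with equal counts
    print(one_hot, equal_range, equal_freq, sep="\n")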
feature addition
e.g. x^2, or differences such as x(t) - x(t - 1) (see the lag-feature sketch below), but be careful to NEVER USE CLASS LABELS FROM THE TEST SET
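a lag-feature sketch on a hypothetical series column x:

    import pandas as pd

    df = pd.DataFrame({"x": [1.0, 2.0, 4.0, 7.0, 11.0]})
    df["x_squared"] = df["x"] ** 2
    df["x_diff"] = df["x"].diff()  # x(t) - x(t - 1); first row is NaN
    print(df)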
data sampling
99% of the data is benign, 1% is fraudulent => a classifier that always predicts "benign" scores 99% accuracy => motivates data sampling
techniques : oversample the minority class, undersample the majority class, reweighting, synthesizing (see the resampling sketch below)
IMPORTANT : NEVER MODIFY THE TEST SET (resample only the training data)
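a resampling sketch on a made-up 99/1 dataset, oversampling the minority class by duplicating its rows:

    import numpy as np
    from sklearn.utils import resample

    rng = np.random.default_rng(3)
    X = rng.normal(size=(1000, 2))
    y = np.zeros(1000, dtype=int)
    y[:10] = 1  # 1% minority class

    X_min, X_maj = X[y == 1], X[y == 0]
    # sample minority rows with replacement until the classes are balanced
    X_min_up = resample(X_min, replace=True, n_samples=len(X_maj), random_state=0)
    X_bal = np.vstack([X_maj, X_min_up])
    y_bal = np.array([0] * len(X_maj) + [1] * len(X_min_up))
    print(np.bincount(y_bal))  # [990 990]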
SMOTE
for every minority instance, take a random one of its k nearest minority-class neighbours, construct the vector between the two points, and place a new artificial data point at a random position along that vector (sketch below). before running this algorithm you can remove tomek links (pairs of nearest neighbours belonging to different classes)
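a minimal from-scratch sketch of the interpolation step (not a reference implementation; the imbalanced-learn library ships a production SMOTE):

    import numpy as np

    def smote_sketch(X_min, n_new, k=5, seed=4):
        """interpolate between a random minority point and one of its
        k nearest minority neighbours"""
        rng = np.random.default_rng(seed)
        new_points = []
        for _ in range(n_new):
            i = rng.integers(len(X_min))
            d = np.linalg.norm(X_min - X_min[i], axis=1)
            neighbours = np.argsort(d)[1:k + 1]  # skip index 0 (the point itself)
            j = rng.choice(neighbours)
            t = rng.random()                     # random position along the vector
            new_points.append(X_min[i] + t * (X_min[j] - X_min[i]))
        return np.array(new_points)

    X_min = np.random.default_rng(5).normal(size=(10, 2))
    print(smote_sketch(X_min, n_new=20).shape)  # (20, 2)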
missing values
imputation (replacing missing values) :
hot-deck - replace with values from similar rows (see the sketch below)
multiple imputation - impute several times and pool the results
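a hot-deck-style sketch using scikit-learn's KNNImputer, which borrows values from the most similar rows (toy data made up):

    import numpy as np
    from sklearn.impute import KNNImputer

    X = np.array([[1.0, 2.0],
                  [2.0, np.nan],
                  [1.1, 2.1],
                  [8.0, 9.0]])
    imputer = KNNImputer(n_neighbors=2)
    print(imputer.fit_transform(X))  # NaN becomes the mean of its 2 nearest rows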