Week 1 Flashcards
prosecutor’s fallacy
mistaking P(A|B) for P(B|A) - in general, P(A|B) != P(B|A)
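a quick numeric sketch in Python (the probabilities are made up for illustration) of how different the two conditionals can be:

    # P(match | innocent) is tiny, but that is NOT P(innocent | match)
    p_match_given_innocent = 1e-6   # chance a random innocent person matches
    p_innocent = 0.9999             # prior: almost everyone is innocent
    p_match_given_guilty = 1.0      # the guilty person always matches

    # Bayes' rule: P(innocent | match)
    p_match = (p_match_given_innocent * p_innocent
               + p_match_given_guilty * (1 - p_innocent))
    print(p_match_given_innocent * p_innocent / p_match)  # ~0.0099, not 1e-6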
data dredging
performing many statistical tests and reporting only those with significant results / claiming that a model learned from the data represents the truth
bonferroni’s principle
even in completely random datasets, you can expect particular events of interest to occur, and to occur in increasing numbers as the amount of data grows
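a minimal simulation (arbitrary alpha and test count) showing "significant" findings emerging from pure noise:

    import numpy as np

    rng = np.random.default_rng(0)
    alpha, n_tests = 0.05, 1000
    data = rng.normal(size=(n_tests, 30))   # 1000 experiments of pure noise
    # one-sample t-like statistic per experiment
    t = data.mean(axis=1) / (data.std(axis=1, ddof=1) / np.sqrt(30))
    significant = np.abs(t) > 1.96          # two-sided test at ~alpha = 0.05
    # roughly alpha * n_tests false "discoveries"; the count grows with n_tests
    print(significant.sum())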
power laws
linear relationship between the logs of 2 variables (y = c * x^k, so log y = log c + k * log x); a lot of real-world data follows this rule. to lessen the effect of very large values, you NEED TO take the log (sketch below)
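a sketch of recovering a power law as a straight line in log-log space (c = 3 and k = -1.5 are made-up values):

    import numpy as np

    rng = np.random.default_rng(1)
    x = np.linspace(1, 1000, 500)
    y = 3.0 * x ** -1.5 * rng.lognormal(sigma=0.1, size=x.size)  # y = c * x^k plus noise

    # after taking logs, an ordinary straight-line fit recovers k and c
    k, log_c = np.polyfit(np.log(x), np.log(y), deg=1)
    print(k, np.exp(log_c))  # ~ -1.5 and ~ 3.0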
data transformation
box-cox transform - family of power functions - result depends on lambda: (x^lambda - 1) / lambda for lambda != 0, ln(x) for lambda = 0
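a sketch using scipy's boxcox, which also searches for the lambda that makes the data most normal (the lognormal input is made up):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)
    skewed = rng.lognormal(mean=0.0, sigma=1.0, size=1000)  # strictly positive input
    transformed, best_lambda = stats.boxcox(skewed)
    print(best_lambda)  # near 0 for lognormal data (lambda = 0 means plain log)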
feature extraction
embedding
feature transformation
categorical : label encoding, target encoding (encode categories by class probability), one-hot encoding (one binary column per class; a binary encoding is also possible)
continuous : equal-range binning, equal-frequency binning, clustering, classification, reduce dimensionality and cluster the latent space (see the encoding sketch below)
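an encoding sketch on made-up toy data, using pandas for one-hot encoding and for equal-range / equal-frequency binning:

    import pandas as pd

    df = pd.DataFrame({"color": ["red", "blue", "red", "green"],
                       "income": [20_000, 35_000, 90_000, 400_000]})

    one_hot = pd.get_dummies(df["color"])       # one binary column per class
    equal_range = pd.cut(df["income"], bins=3)  # bins of equal width
    equal_freq = pd.qcut(df["income"], q=2)     # bins with equal counts
    print(one_hot, equal_range, equal_freq, sep="\n")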
feature addition
e.g. x^2, or differences such as x(t) - x(t - 1) (see the lag-feature sketch below), but be careful to NEVER USE CLASS LABELS FROM THE TEST SET
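a lag-feature sketch on a hypothetical series column x:

    import pandas as pd

    df = pd.DataFrame({"x": [1.0, 2.0, 4.0, 7.0, 11.0]})
    df["x_squared"] = df["x"] ** 2
    df["x_diff"] = df["x"].diff()  # x(t) - x(t - 1); first row is NaN
    print(df)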
data sampling
99% of the data is benign, 1% is fraudulent => a classifier that always predicts "benign" scores 99% accuracy => motivates data sampling
techniques : oversample the minority class, undersample the majority class, reweighting, synthesizing (see the resampling sketch below)
IMPORTANT : NEVER MODIFY THE TEST SET (resample only the training data)
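a resampling sketch on a made-up 99/1 dataset, oversampling the minority class by duplicating its rows:

    import numpy as np
    from sklearn.utils import resample

    rng = np.random.default_rng(3)
    X = rng.normal(size=(1000, 2))
    y = np.zeros(1000, dtype=int)
    y[:10] = 1  # 1% minority class

    X_min, X_maj = X[y == 1], X[y == 0]
    # sample minority rows with replacement until the classes are balanced
    X_min_up = resample(X_min, replace=True, n_samples=len(X_maj), random_state=0)
    X_bal = np.vstack([X_maj, X_min_up])
    y_bal = np.array([0] * len(X_maj) + [1] * len(X_min_up))
    print(np.bincount(y_bal))  # [990 990]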
SMOTE
for every minority instance, take a random one of its k nearest minority-class neighbours, construct the vector between the two points, and place a new artificial data point at a random position along that vector (sketch below). before running this algorithm you can remove tomek links (pairs of nearest neighbours belonging to different classes)
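a minimal from-scratch sketch of the interpolation step (not a reference implementation; the imbalanced-learn library ships a production SMOTE):

    import numpy as np

    def smote_sketch(X_min, n_new, k=5, seed=4):
        """interpolate between a random minority point and one of its
        k nearest minority neighbours"""
        rng = np.random.default_rng(seed)
        new_points = []
        for _ in range(n_new):
            i = rng.integers(len(X_min))
            d = np.linalg.norm(X_min - X_min[i], axis=1)
            neighbours = np.argsort(d)[1:k + 1]  # skip index 0 (the point itself)
            j = rng.choice(neighbours)
            t = rng.random()                     # random position along the vector
            new_points.append(X_min[i] + t * (X_min[j] - X_min[i]))
        return np.array(new_points)

    X_min = np.random.default_rng(5).normal(size=(10, 2))
    print(smote_sketch(X_min, n_new=20).shape)  # (20, 2)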
missing values
imputation (replacing missing values) :
hot-deck - replace with values from similar rows (see the sketch below)
multiple imputation - impute several times and pool the results
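a hot-deck-style sketch using scikit-learn's KNNImputer, which borrows values from the most similar rows (toy data made up):

    import numpy as np
    from sklearn.impute import KNNImputer

    X = np.array([[1.0, 2.0],
                  [2.0, np.nan],
                  [1.1, 2.1],
                  [8.0, 9.0]])
    imputer = KNNImputer(n_neighbors=2)
    print(imputer.fit_transform(X))  # NaN becomes the mean of its 2 nearest rows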