Distance & Scaling Flashcards
Euclidean Measure
Square root of ((x2 - x1)^2 + (y2 - y1)^2)
Manhattan Distance
Sum of the absolute differences between the coordinates of two points
Minkowski Distance is
A generalization of Euclidean and Manhattan distance, controlled by an order parameter r
Minkowski Distance r = 1
Manhattan
Minkowski Distance r = 2
Euclidean
Minkowski Distance r = inf
Supremum (Chebyshev) distance: the largest absolute coordinate difference
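The three cards above are special cases of one formula. A minimal sketch in plain Python (function name and sample points are illustrative, not from the cards):

```python
def minkowski(p, q, r):
    # Minkowski distance of order r between two points
    if r == float("inf"):
        # Supremum (Chebyshev) distance: largest coordinate difference
        return max(abs(a - b) for a, b in zip(p, q))
    return sum(abs(a - b) ** r for a, b in zip(p, q)) ** (1 / r)

p, q = (1, 2), (4, 6)
print(minkowski(p, q, 1))             # Manhattan: 7.0
print(minkowski(p, q, 2))             # Euclidean: 5.0
print(minkowski(p, q, float("inf")))  # Supremum: 4
```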
As r increases in Minkowski distance
Larger coordinate differences get more weight, emphasizing outliers
Mahalanobis Distance
Measures how close two points are while accounting for the variance and correlation of the data (via the inverse covariance matrix)
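A sketch of the Mahalanobis computation with NumPy, assuming a small made-up dataset whose covariance matrix is invertible:

```python
import numpy as np

def mahalanobis(x, y, data):
    # Scale the difference vector by the inverse covariance of the data,
    # so correlated / high-variance directions count for less
    cov_inv = np.linalg.inv(np.cov(data, rowvar=False))
    d = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.sqrt(d @ cov_inv @ d))

data = np.array([[1, 2], [2, 1], [3, 4], [4, 3], [5, 6]], dtype=float)
print(mahalanobis([1, 2], [3, 4], data))
```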
Standardization
Transforms data to mean = 0 and std. dev. = 1 (z-score)
Normalization
Scales a variable to have value between 0 and 1
When do you want to standardize before computing distances?
When the scales of the features differ significantly
One Hot Encoding
Technique to convert categorical data into a binary matrix (numerical representation)
What happens if we don't scale?
Some algorithms will be slow in converging
Features with high magnitudes will dominate the distance calculations
In one hot encoding, each category is represented as
a binary vector with a single 1 and all other positions as 0
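A minimal one-hot encoder in plain Python (the helper name and color categories are illustrative):

```python
def one_hot(values):
    # Map each category to a binary vector with a single 1;
    # sorting the categories fixes the column order
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]

# Columns are blue, green, red (alphabetical)
print(one_hot(["red", "green", "blue", "green"]))
# [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 0]]
```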
When we scale, do we lose geometric representation of each point with respect to its neighbors?
No; the points' relative positions with respect to their neighbors are preserved, only the scale of the axes changes
Should we scale with decision trees?
No. Trees split on individual feature thresholds, so scaling isn't needed, and it would obscure the original predictor values, making decisions less explainable
Scale then split or split then scale?
Split then scale, so that you don't leak information from the test set into training (fit the scaler on the training set only)
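The split-then-scale order can be sketched in plain Python with a hypothetical tiny dataset: the scaler's statistics come from the training portion only and are then reused on the test portion.

```python
data = [3.0, 1.0, 4.0, 1.5, 9.0, 2.6]
train, test = data[:4], data[4:]

# Fit the z-score statistics on the training split only
mean = sum(train) / len(train)
std = (sum((x - mean) ** 2 for x in train) / len(train)) ** 0.5

train_scaled = [(x - mean) / std for x in train]
test_scaled = [(x - mean) / std for x in test]  # test never influences mean/std
```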
Discretization
Sort the values, create split points, map values into discrete categorical bins
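One common way to carry out those steps is equal-width binning; a sketch in plain Python (the function name and bin count are illustrative):

```python
def discretize(values, n_bins):
    # Equal-width binning: split the range [min, max] into n_bins
    # intervals and map each value to its bin index
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    # Values exactly at the max fall into the last bin
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

print(discretize([1, 2, 5, 7, 9, 10], 3))  # [0, 0, 1, 2, 2, 2]
```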
Standardization is essentially
Feature Scaling
Standardization formula
Replace each value with its z-score, z = (x - mean) / std. dev., giving mean 0 and std. dev. 1
Mean Normalization
Redistributes values to the range [-1, 1] with mean = 0: x' = (x - mean) / (max - min)
Min-max scaling
Redistributes values to the range [0, 1]: x' = (x - min) / (max - min)
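The three scaling cards side by side, as a plain-Python sketch (function names and sample values are illustrative):

```python
def standardize(xs):
    # z-score: (x - mean) / std  ->  mean 0, std. dev. 1
    m = sum(xs) / len(xs)
    s = (sum((x - m) ** 2 for x in xs) / len(xs)) ** 0.5
    return [(x - m) / s for x in xs]

def mean_normalize(xs):
    # (x - mean) / (max - min)  ->  range within [-1, 1], mean 0
    m = sum(xs) / len(xs)
    return [(x - m) / (max(xs) - min(xs)) for x in xs]

def min_max_scale(xs):
    # (x - min) / (max - min)  ->  range [0, 1]
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

xs = [2.0, 4.0, 6.0, 8.0]
print(min_max_scale(xs))   # endpoints map to 0.0 and 1.0
print(mean_normalize(xs))  # sums to 0
print(standardize(xs))     # mean 0, std. dev. 1
```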