How to Scale Numerical Data Flashcards
WHICH TYPES OF ALGORITHMS BENEFIT FROM SCALING NUMERICAL VALUES TO A STANDARD RANGE? P230
Algorithms that use a weighted sum of the input (e.g. Linear Regression, ANN)
Algorithms that use distance measures (e.g. K-nearest neighbors, SVM)
WHAT ARE THE TWO MOST POPULAR TECHNIQUES FOR SCALING NUMERICAL DATA? P230
Normalization
Standardization
WHAT DOES NORMALIZATION DO? P230
Scales each input variable separately to the range 0-1
WHY DO WE CHOOSE THE RANGE 0-1 FOR NORMALIZING? P230
Floating-point values have the most precision in this range
WHAT DOES STANDARDIZATION DO? P230
Scales each input variable separately by subtracting the mean and dividing by the standard deviation (STD), so that the distribution has a mean of 0 and an STD of 1
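A minimal sketch of standardization done by hand, using only the Python standard library. The function name `standardize` and the sample values are illustrative assumptions, and the population standard deviation is used here (an assumption chosen to match the usual definition of a unit-variance rescaling).

```python
from statistics import mean, pstdev


def standardize(values):
    """Subtract the mean and divide by the population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("cannot standardize a constant variable")
    return [(v - mu) / sigma for v in values]


# Illustrative data with mean 5 and population STD 2.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
scaled = standardize(data)
print(scaled)
```

After the transform, the values in `scaled` have a mean of 0 and an STD of 1.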
FOR WHICH MODELS IS SCALING THE TARGET A CRITICAL STEP? P231
Models for regression predictive modeling problems, where scaling the target makes the problem easier to learn; most notably neural network models
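One possible way to scale a regression target in scikit-learn is to wrap the model in `TransformedTargetRegressor`, which scales the target before fitting and inverts the scaling on predictions. The sketch below uses `LinearRegression` as a simple stand-in for a neural network (an assumption made to keep the example small), and the data is made up for illustration.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import MinMaxScaler

# Hypothetical data: one input variable and a target with a large range.
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([1200.0, 1900.0, 3100.0, 3900.0, 5100.0])

# The target is normalized before fitting; predictions are mapped
# back to the original units automatically.
model = TransformedTargetRegressor(
    regressor=LinearRegression(),
    transformer=MinMaxScaler(),
)
model.fit(X, y)
predictions = model.predict(X)
print(predictions)
```

For a linear model the scaled and unscaled fits give the same predictions; the benefit described above is expected mainly for models such as neural networks that are sensitive to the target's scale.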
WHY DO WE NEED TO KNOW THE RANGE OF VALUES, BEFORE NORMALIZING? P232
Because all values must end up between 0 and 1, we need the minimum and maximum values: we subtract the minimum from each value and divide by the range (maximum value − minimum value). When the minimum is 0, this reduces to dividing by the maximum value.
Normalization formula: y = (x − min) / (max − min)
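A minimal sketch of min-max normalization, y = (x − min) / (max − min), in plain Python. The function name `normalize` and the sample temperatures are illustrative assumptions.

```python
def normalize(values, lo, hi):
    """Rescale values to the range 0-1 given a known minimum and maximum."""
    span = hi - lo
    if span == 0:
        raise ValueError("minimum and maximum must differ")
    return [(v - lo) / span for v in values]


# Illustrative data: the range is estimated from the data itself.
temps = [10.0, 15.0, 20.0, 30.0]
normalized = normalize(temps, min(temps), max(temps))
print(normalized)  # [0.0, 0.25, 0.5, 1.0]
```

Passing `lo` and `hi` explicitly makes it clear that the range must be known (or estimated from training data) before any value can be normalized.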
WHAT SHOULD WE DO IF THERE’S AN OBSERVATION ABOVE THE MAXIMUM OR BELOW THE MINIMUM VALUE? P232
Either remove them or limit them to the pre-defined maximum and minimum values
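Both options can be sketched in a few lines of plain Python. The helper names `clip_to_range` and `drop_out_of_range`, the bounds, and the sample values are hypothetical and chosen only for illustration.

```python
def clip_to_range(values, lo, hi):
    """Limit each value to the pre-defined minimum and maximum."""
    return [min(max(v, lo), hi) for v in values]


def drop_out_of_range(values, lo, hi):
    """Remove values that fall outside the pre-defined range."""
    return [v for v in values if lo <= v <= hi]


# Assumed bounds learned earlier, and new observations to check.
known_min, known_max = 0.0, 30.0
new_values = [-5.0, 12.0, 40.0]

clipped = clip_to_range(new_values, known_min, known_max)
kept = drop_out_of_range(new_values, known_min, known_max)
print(clipped)  # [0.0, 12.0, 30.0]
print(kept)     # [12.0]
```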
USING WHICH CLASS CAN WE NORMALIZE DATA IN SCIKIT-LEARN? P232
MinMaxScaler
USING WHICH METHOD OF MINMAXSCALER CAN WE REVERSE THE TRANSFORMATION? P233
inverse_transform()
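A short sketch of normalizing with `MinMaxScaler` and then reversing the transformation with `inverse_transform()`. The two-column array is made-up data, with one variable on a large scale and one on a small scale.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Illustrative data: rows are observations, columns are input variables.
data = np.array([
    [100.0, 0.001],
    [8.0, 0.05],
    [50.0, 0.005],
    [88.0, 0.07],
    [4.0, 0.1],
])

scaler = MinMaxScaler()              # default feature_range is (0, 1)
scaled = scaler.fit_transform(data)  # learn min/max per column, then scale
restored = scaler.inverse_transform(scaled)  # map back to original units

print(scaled)
print(restored)
```

Each column is scaled separately, so both end up spanning 0 to 1 regardless of their original ranges.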
WHAT IS THE ASSUMPTION OF STANDARDIZATION? P234
That the observations fit a Gaussian distribution, with a well-behaved mean and STD.
WHAT’S ANOTHER NAME FOR STANDARDIZATION? P234
Center scaling
WHAT IS SCIKIT-LEARN’S CLASS FOR STANDARDIZATION? P234
StandardScaler
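A short sketch of standardizing with `StandardScaler`. The array is made-up data used only to show the effect on each column.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative data: rows are observations, columns are input variables.
data = np.array([
    [100.0, 0.001],
    [8.0, 0.05],
    [50.0, 0.005],
    [88.0, 0.07],
    [4.0, 0.1],
])

scaler = StandardScaler()
standardized = scaler.fit_transform(data)  # per column: (x - mean) / STD

print(standardized.mean(axis=0))  # approximately 0 for each column
print(standardized.std(axis=0))   # approximately 1 for each column
```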
WHEN IS IT BETTER TO NORMALIZE AND WHEN IS IT BETTER TO STANDARDIZE? P244
Whether input variables require scaling depends on the specifics of your problem and of each variable. You may have a sequence of quantities as inputs, such as prices or temperatures. If the distribution of the quantity is normal, then it should be standardized; otherwise, the data should be normalized. This applies whether the range of quantity values is large (10s, 100s, etc.) or small (0.01, 0.0001).