How to Change Numerical Data Distributions Flashcards
WHAT ARE THE CAUSES OF A HIGHLY SKEWED OR NON-STANDARD DISTRIBUTION? P288
Outliers
Multi-modal distributions
Highly exponential distributions, etc…
DOES A STANDARD DISTRIBUTION FOR THE TARGET VALUES HELP PERFORMANCE? P288
Yes. Many ML algorithms prefer, or perform better with, numerical input variables (and, for regression, output variables) that follow a standard probability distribution, such as a Gaussian or uniform distribution.
WHAT DOES A QUANTILE TRANSFORM DO? P289
It maps a variable’s probability distribution onto another probability distribution.
WHAT IS A CUMULATIVE DISTRIBUTION FUNCTION? P289
The cumulative distribution function (CDF) is the probability that a random variable, say X, will take a value less than or equal to x.
WHAT IS A PERCENT-POINT FUNCTION (PPF)? P289
It’s also called the quantile function, and it’s the inverse of the cumulative distribution function (CDF). Given a probability p, it returns the value x such that the probability of observing a value at or below x equals p.
WWW: Picture a KDE plot of the distribution. The CDF answers: what is the probability that a value falls at or below a given x? The PPF answers the inverse: given a probability p, up to what value x must the area under the density curve be accumulated to reach p? So for the CDF, the input is a value of the variable and the output is the area under the curve (a probability). For the PPF, the input is the area under the curve (a probability) and the output is a value of the variable.
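As a small illustration of the CDF/PPF round trip, here is a sketch using the standard normal distribution from Python's standard library. The choice of distribution and the value 1.96 are only examples, and `NormalDist` names its PPF `inv_cdf`.

```python
from statistics import NormalDist

# Standard normal distribution (mean 0, standard deviation 1), chosen only as an example.
dist = NormalDist(mu=0.0, sigma=1.0)

# CDF: value -> probability of observing a value less than or equal to x.
p = dist.cdf(1.96)

# PPF (quantile function): probability -> value. NormalDist calls it inv_cdf.
x = dist.inv_cdf(p)

print(f"CDF(1.96) = {p:.4f}")    # roughly 0.975
print(f"PPF({p:.4f}) = {x:.4f}")  # recovers roughly 1.96
```

Feeding the CDF's output back into the PPF returns the original value, which is what "inverse" means here.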
WHAT DOES QUANTILE TRANSFORMER IN SCIKIT-LEARN DO? P289
This method transforms the features to follow a uniform or a normal distribution.
First, an estimate of the cumulative distribution function of a feature is used to map the original values to a uniform distribution. The obtained values are then mapped to the desired output distribution using the associated quantile function.
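The two steps can be sketched by hand with the standard library. This is a simplified, rank-based illustration and not scikit-learn's actual implementation (which interpolates between reference quantiles). The function name `quantile_transform_sketch`, the exponential sample, and the `+ 0.5` rank offset are all assumptions made for the example.

```python
import random
from statistics import NormalDist

random.seed(0)
# Hypothetical skewed sample: 1000 exponentially distributed values.
data = [random.expovariate(1.0) for _ in range(1000)]

def quantile_transform_sketch(values):
    """Map values to a standard normal via a rank-based empirical CDF."""
    n = len(values)
    # Rank each observation (0 .. n-1) by its position in sorted order.
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0] * n
    for rank, idx in enumerate(order):
        ranks[idx] = rank
    # Step 1: empirical CDF estimate; the + 0.5 offset keeps values strictly inside (0, 1).
    uniform = [(r + 0.5) / n for r in ranks]
    # Step 2: apply the quantile function (PPF) of the target distribution.
    normal = NormalDist()
    gaussian = [normal.inv_cdf(u) for u in uniform]
    return uniform, gaussian

uniform, gaussian = quantile_transform_sketch(data)
print(min(uniform), max(uniform))
print(min(gaussian), max(gaussian))
```

Because the mapping only depends on ranks, the ordering of the observations is preserved while the shape of the distribution changes.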
WHAT IS THE MEANING OF “N_QUANTILES” PARAMETER IN QUANTILE TRANSFORMER? WHAT RANGE OF VALUES CAN IT HAVE? P289
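The number of quantiles used to estimate the CDF, i.e. the resolution of the mapping or ranking of the observations in the dataset. It should be no larger than the number of observations in the dataset (in recent scikit-learn versions, a larger value is reduced to the number of samples with a warning) and defaults to 1000.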
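A minimal sketch of setting `n_quantiles`, assuming scikit-learn and NumPy are installed. The synthetic exponential data, the sample size of 500, and the value 100 are illustrative choices, not recommendations.

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

rng = np.random.default_rng(0)
# Hypothetical dataset: 500 observations of a single skewed feature.
X = rng.exponential(scale=1.0, size=(500, 1))

# n_quantiles is kept below the number of observations (500 here).
qt = QuantileTransformer(n_quantiles=100, output_distribution="normal", random_state=0)
X_trans = qt.fit_transform(X)

# The fitted reference quantiles: one row per quantile, one column per feature.
print(qt.n_quantiles_)
print(qt.quantiles_.shape)
```

A higher `n_quantiles` gives a finer estimate of the CDF at the cost of more computation. With small datasets, the default of 1000 usually has to be lowered explicitly.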
WHEN CAN IT BE BENEFICIAL TO USE OUTPUT_DISTRIBUTION='UNIFORM' FOR QUANTILE TRANSFORM? P296
Sometimes it can be beneficial to transform a highly exponential or multi-modal distribution to have a uniform distribution. This is especially useful for data with a large and sparse range of values, e.g. outliers that are common rather than rare.
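As an illustration, the sketch below applies a uniform quantile transform to a made-up bimodal feature. The two Gaussian clusters, the sample sizes, and `n_quantiles=200` are assumptions for the example only.

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

rng = np.random.default_rng(1)
# Hypothetical bimodal feature: two well-separated clusters of 500 values each.
X = np.concatenate([rng.normal(-5, 1, 500), rng.normal(20, 3, 500)]).reshape(-1, 1)

qt = QuantileTransformer(n_quantiles=200, output_distribution="uniform")
X_uniform = qt.fit_transform(X)

# After the transform the values lie in [0, 1] and are spread roughly evenly.
counts, _ = np.histogram(X_uniform, bins=10, range=(0.0, 1.0))
print(X_uniform.min(), X_uniform.max())
print(counts)
```

The gap between the two modes disappears in the transformed feature, since each histogram bin ends up holding a similar share of the observations.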