Lecture 3 - Normal Distribution & Outliers Flashcards
What is a distribution?
An arrangement of values of a variable showing their observed or theoretical frequency of occurrence
What is frequency distribution?
A graph plotting values of observations on the horizontal axis and the frequency with which each value occurs in the data set on the vertical axis
What is a discrete variable?
Variable that can take on only certain values (usually whole numbers)
What is a frequency distribution (discrete variable)?
A distribution from which we can calculate the probability of occurrence of specific values of a variable
What is a probability distribution (discrete variable)?
Probability of a specific outcome
A curve from which the probability of occurrence of specific values of a variable can be ascertained
What is a continuous variable?
Variable that can take on an infinite number of (or at least many) values between the lowest and highest values
What is a probability distribution (continuous variable)?
Probability of obtaining a value that falls within a specific interval
Probability = area under the curve
What are the 3 characteristics of a normal distribution/curve?
Bell-shaped curve
symmetrical
mean = median = mode
2 things that describe a normal distribution fully
mean: determines location of centre of the graph
SD: determines height and width of the graph
Probability and SD features of a normal distribution
Probability: total area under curve = 1 & probability that a variable = any particular value is 0
SD
- 1 SD of mean: 68.27% of area under curve
- 2 SD: 95.45%
- 3 SD: 99.73%
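A minimal sketch of "probability = area under the curve", using `NormalDist` from Python's standard `statistics` module. The helper name `area_within` is made up for illustration. The areas it prints should match the 1, 2 and 3 SD figures above to within rounding, whatever mean and SD are chosen.

```python
from statistics import NormalDist

def area_within(k, mean=0.0, sd=1.0):
    """Area under a normal curve within k SDs either side of the mean."""
    dist = NormalDist(mu=mean, sigma=sd)
    # Area in an interval = CDF at the upper bound minus CDF at the lower bound.
    return dist.cdf(mean + k * sd) - dist.cdf(mean - k * sd)

for k in (1, 2, 3):
    print(f"within {k} SD of the mean: {area_within(k):.2%}")

# The result does not depend on the particular mean and SD.
print(f"mean=100, SD=15, within 2 SD: {area_within(2, mean=100, sd=15):.2%}")
```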
What is a special case of normal distribution and its features?
Standard normal distribution and Z scores (different from skewness and kurtosis Z scores)
mean = 0, SD = 1
Z scores = (X - mean)/SD
95% of Z scores lie between -1.96 & 1.96
99%, -2.58 & 2.58
99.9%, -3.29 & 3.29
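The Z score formula and the cut-offs above can be checked with a short sketch. The function names `z_score` and `central_cutoff` and the example score (130 on a scale with mean 100, SD 15) are illustrative assumptions, not part of the lecture.

```python
from statistics import NormalDist

def z_score(x, mean, sd):
    """Z = (X - mean) / SD: distance from the mean in SD units."""
    return (x - mean) / sd

def central_cutoff(coverage):
    """Z value such that `coverage` of the standard normal lies between -Z and +Z."""
    # The leftover area is split equally between the two tails.
    return NormalDist().inv_cdf(0.5 + coverage / 2)

print(z_score(130, mean=100, sd=15))  # 2.0

for coverage in (0.95, 0.99, 0.999):
    print(f"{coverage:.1%} of Z scores lie within ±{central_cutoff(coverage):.2f}")
```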
What is the usefulness of a normal distribution? Give 3 reasons why the assumption of normality is useful (& thus many statistical procedures are based on the assumption of normality)
usefulness would definitely also be the benefits/pros/advantages
commonly observed distribution
assumption of normality central in inferential statistics (concerned with probability)
characteristics of normal curve well known –> if assumption is valid, characteristics of normality may be applied to infer information about the population parameter and perform hypothesis testing
What is a sampling distribution? Differentiate sample distribution and sampling distribution
sampling distribution: the distribution of a sample statistic obtained by infinite repeated sampling (or considering all possible outcomes)
sample distribution: frequency distribution that is obtained from the sample
assumption that the sampling distribution (of any statistic) is normal
every parameter that can be calculated in a sample i.e. every sample statistic can have a sampling distribution
What is Central Limit Theorem (CLT)?
Regardless of the shape of the population, parameter estimates of that population will have a normal distribution if the sample size is large enough (i.e. the sampling distribution of any statistic will be (nearly) normal if the sample size is large enough)
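A small simulation can make the CLT concrete. This is only a sketch under assumptions: the population is taken to be exponential (strongly +skewed, mean 1, SD 1), the statistic is the sample mean, and "infinite repeated sampling" is approximated by 5000 repeats. The helper names are made up for illustration.

```python
import random
from math import sqrt
from statistics import mean, pstdev

random.seed(0)  # fixed seed so the run is repeatable

def skewness(values):
    """Simple moment-based skewness (0 for a perfectly symmetric set)."""
    m, s = mean(values), pstdev(values)
    return mean(((v - m) / s) ** 3 for v in values)

def sample_means(n, repeats=5000):
    """Means of `repeats` random samples of size n from an exponential population."""
    return [mean(random.expovariate(1.0) for _ in range(n)) for _ in range(repeats)]

for n in (2, 50):
    means = sample_means(n)
    print(f"n={n:>2}: mean of means={mean(means):.3f}, "
          f"SD of means={pstdev(means):.3f} (theory {1 / sqrt(n):.3f}), "
          f"skewness={skewness(means):.2f}")
```

With the larger sample size the sampling distribution of the mean should be noticeably less skewed and narrower, even though the population itself is far from normal.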
What are the 4 guidelines/conditions in the application of CLT? What are the 3 general criteria to look for?
Sample size is large enough if
1.) population distribution is normal
2.) n = 15 or less + data distribution is symmetric, unimodal and without outliers –> sample size is large enough –> sampling distribution will be (nearly) normal
3.) n = 16 to 40 + data distribution is unimodal, without outliers and without extreme skewness/kurtosis
4.) n > 40 + data distribution is unimodal and without extreme outliers
unimodal
symmetry/extreme skewness/kurtosis
outliers
What are the 3 methods to test for normality? Why do we not use/Why is it not advisable to use normality tests?
CLT and sample size
Numerical methods (summary statistics) in EDA
Graphical Methods in EDA
normality tests e.g. Shapiro-Wilk are not reliable
- when n is small (30 and below): lack power, not sensitive enough –> false -ve (shows no deviation from normality when in reality there is deviation)
- when n is large (>30): too sensitive –> false +ve
What do we look out for in numerical methods in EDA?
Compare mean and median - similar? –> symmetrical
Calculate skewness and kurtosis |Z scores| - < 2? –> no extreme skewness/kurtosis
- (raw score or statistic)/standard error
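A sketch of the skewness and kurtosis Z scores (statistic divided by its standard error). The lecture does not say which formulas it uses, so this assumes the bias-adjusted sample skewness and excess kurtosis with the standard-error formulas commonly reported by statistics packages. The function name and data are made up, and the sample needs at least 4 values.

```python
from math import sqrt
from statistics import mean, stdev

def skew_kurt_z(data):
    """Return (skewness Z, kurtosis Z) = statistic / standard error."""
    n = len(data)
    m, s = mean(data), stdev(data)
    z = [(x - m) / s for x in data]
    # Bias-adjusted sample skewness and excess kurtosis (assumed definitions).
    skew = n / ((n - 1) * (n - 2)) * sum(v ** 3 for v in z)
    kurt = (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * sum(v ** 4 for v in z)
            - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))
    # Standard errors depend only on the sample size.
    se_skew = sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))
    se_kurt = 2 * se_skew * sqrt((n * n - 1) / ((n - 3) * (n + 5)))
    return skew / se_skew, kurt / se_kurt

symmetric = [1, 2, 3, 4, 5, 6, 7, 8, 9]
right_skewed = [1, 1, 2, 2, 3, 3, 4, 5, 40]

for name, data in (("symmetric", symmetric), ("right-skewed", right_skewed)):
    z_skew, z_kurt = skew_kurt_z(data)
    verdict = "no extreme skewness/kurtosis" if max(abs(z_skew), abs(z_kurt)) < 2 else "extreme"
    print(f"{name}: skewness Z={z_skew:.2f}, kurtosis Z={z_kurt:.2f} –> {verdict}")
```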
What are the 3 types of graphs that we look out for in graphical methods in EDA? What information does each graph show?
Histogram: shows all features but better especially for GAPS, SYMMETRY & NO. OF MODES
Normal probability (Q-Q) plot: graph plotting the quantiles of a variable against the quantiles of a normal distribution; great for NORMALITY/NORMAL DISTRIBUTION based on how close data points are to the best-fit line, SKEWNESS, KURTOSIS & OUTLIERS
Box plot: does not show gaps or no. of modes but great for OUTLIERS & SYMMETRY
What are the respective methods for identifying the following characteristics/criteria:
- outlier
- extreme outlier
- skewness/kurtosis
- Extreme skewness/kurtosis
- no. of modes (least important)
- box plot, maybe normal probability Q-Q plot
- box plot
- Z scores; histogram (sample size must be large enough to be reliable); normal probability Q-Q plot; box plot
- Z scores
- histogram
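How a box plot separates outliers from extreme outliers can be sketched as below. The cut-offs are an assumption, since the cards do not state them: the common convention flags points more than 1.5 × IQR beyond the box as outliers and more than 3 × IQR as extreme outliers. Quartile methods also differ between packages; this uses the default of `statistics.quantiles`. The data are made up.

```python
from statistics import quantiles

def classify_outliers(data):
    """Split data into (outliers, extreme outliers) using box-plot fences."""
    q1, _, q3 = quantiles(data, n=4)  # quartiles, default "exclusive" method
    iqr = q3 - q1
    mild, extreme = [], []
    for x in data:
        beyond_box = max(q1 - x, x - q3)  # distance outside the box (negative if inside)
        if beyond_box > 3 * iqr:          # assumed cut-off for extreme outliers
            extreme.append(x)
        elif beyond_box > 1.5 * iqr:      # assumed cut-off for outliers
            mild.append(x)
    return mild, extreme

scores = [10, 12, 12, 13, 14, 14, 15, 16, 17, 25, 60]
print(classify_outliers(scores))  # ([25], [60])
```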
What are the 2 things to do if the normality assumption is questionable?
1.) Check data: errors in data entry, measurement etc.
2.) Decide on appropriate actions on a case-by-case basis
What are the 4 appropriate actions that we can undertake when the normality assumption is questionable and we have already checked the data and found no errors? Describe them
Remember: only one of them can be chosen
1.) Leave the data as they are
- depends on what we are using the data for - normality might not matter
- some inferential tests requiring the assumption of normal distribution (i.e. parametric tests) may be robust to minor violation of normality e.g. independent t-test
2.) Data transformation
- use of a non-linear function to change the size of large values differently from the change in size of small values e.g. +skewed data distribution: square root; -skewed: square, cube
- always use the least powerful transformation (first) necessary to reduce the impact/presence of outliers or improve symmetry
- if transformation of -ve values is not possible, add a constant to all data values that will make them all positive prior to transformation
3.) non-parametric inferential tests
4.) robust or other modern inferential tests
e. g. independent t-test
What are the 2 benefits/pros/advantages and 3 problems/cons/disadvantages of data transformation?
Benefits:
an appropriate transformation may
- reduce the impact of outliers by turning extraordinary points into merely the largest or smallest values in the data set
- improve symmetry
Problems:
- not applicable if have outliers in both directions
- back transformation may be needed for meaningful interpretation
- introducing a bias when we change linear to a non-linear function –> either favour the more +ve no.s or -ve no.s
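A sketch of a square-root transformation on +skewed data, showing both the gain in symmetry and why back transformation needs care. The data are made up (squares of 1 to 20, chosen so that the square root is exactly symmetric), `shift` stands for the constant added when some values are negative, and `skewness` is a simple moment-based helper written for this example.

```python
from math import sqrt
from statistics import mean, pstdev

def skewness(values):
    """Simple moment-based skewness (0 for a perfectly symmetric set)."""
    m, s = mean(values), pstdev(values)
    return mean(((v - m) / s) ** 3 for v in values)

raw = [k * k for k in range(1, 21)]   # made-up, positively skewed scores
shift = 0                             # constant to add first if any value were negative
transformed = [sqrt(x + shift) for x in raw]

print(f"skewness before: {skewness(raw):.2f}")
print(f"skewness after:  {skewness(transformed):.2f}")

# Back-transform a summary so it is on the original scale again.
back_transformed_mean = mean(transformed) ** 2 - shift
print(f"raw mean: {mean(raw)}, back-transformed mean: {back_transformed_mean}")
```

The back-transformed mean is on the original scale but is not the same number as the raw mean, which is one reason results on transformed data need careful interpretation.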
What are parametric tests? Give some examples
Inferential tests that make assumptions (e.g. normality) about the parameters of the population distribution from which the samples are drawn.
e.g. t-test, Pearson’s Correlation
(Common assumptions made are that of normality and quantitative - interval and ratio - data)
What are some (3-4) benefits/advantages/pros of parametric tests?
usefulness would definitely be the benefit
More flexible
More options available
More powerful if data distribution is normal
- power: ability to find/detect the true/significant difference/relationship if there is actually one
Always start with parametric tests first i.e. always select a parametric test as the preliminary inferential test as there are lots of patterns and details (i.e. characteristics) well known about a normal distribution curve
[refer to question on the usefulness of normal distribution - card 12]
What are non-parametric tests?
Inferential tests that do not require the assumption of normality and can be used for nominal and ordinal data as well as quantitative data
What is the disadvantage of parametric tests?
Assumption of normality of distribution needed
What is the advantage of non-parametric tests?
No assumption of normality of distribution (needed)
What are the 3 disadvantages of non-parametric tests?
Generally have reduced statistical power when distributions are truly normal
Limited or more complex follow up tests in event of significant findings (i.e. not many options)
Limited software available to produce confidence intervals (i.e. not many options)
What are outliers? Definition
An observation/score that is very different from the rest of the data
An extreme point that stands out from the rest of the distribution
Present in both normal and non-normal distributions
Why do outliers matter? What are the 3 consequences of having an outlier?
Outliers affect any statistical test based on sample mean and variance (which are commonly used)
1.) Biased parameter estimates
2.) Biased confidence interval
3.) Faulty conclusions in hypothesis testing
What are the 2 ways of handling outliers? How do we handle them?
[see card 20, 21]
Depends on the source of the outlier: error or genuine case due to variability of data
1.) Check if outlier is due to error (Data errors: measurement, instrument, data entry errors) –>
If outlier is due to error: rectify or drop
2.) Outlier is a genuine case –>
Descriptive stats: use median and IQR to describe the data
Inferential stats: data transformation or non-parametric tests that are robust to outliers
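A sketch of why median and IQR are preferred for describing data with a genuine outlier: a single extreme score pulls the mean and SD a long way but barely moves the median and IQR. The scores are made up, and `iqr` is a small helper built on `statistics.quantiles` (other quartile methods give slightly different values).

```python
from statistics import mean, median, stdev, quantiles

def iqr(data):
    """Interquartile range using the default quartile method of statistics.quantiles."""
    q1, _, q3 = quantiles(data, n=4)
    return q3 - q1

clean = [21, 22, 23, 24, 25, 26, 27, 28, 29]
with_outlier = clean + [95]            # one genuine but extreme score

for name, data in (("without outlier", clean), ("with outlier", with_outlier)):
    print(f"{name}: mean={mean(data):.1f}, SD={stdev(data):.1f}, "
          f"median={median(data):.1f}, IQR={iqr(data):.2f}")
```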
Why can’t we leave outliers as they are? When can we leave them as they are?
Result in bias and other problems (e.g. cause distribution of data to be non-normal)
When using parametric/non-parametric tests that are robust to minor violation of normality/outliers
When can we delete outliers? 2 reasons for being able to do so
Only as a last resort
Only if they lie so far outside the range of the remainder of the data that they distort statistical inferences
Not good to delete them at will
What are the 2 things that we must take note of when/with regards to reporting the results and deletion of outliers?
Must report deletion and offer reasons/show why
Report model results both with and without outliers –> compare and make a conclusion (choose one)