Lecture 3 - Normal Distribution & Outliers Flashcards
What is a distribution?
An arrangement of values of a variable showing their observed or theoretical frequency of occurrence
What is a frequency distribution?
A graph plotting values of observations on the horizontal axis and the frequency with which each value occurs in the data set on the vertical axis
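As a rough illustration of a frequency distribution, the sketch below counts how often each value occurs in a small data set and prints a text histogram. The data (`observations`) are made up for the example; they are not from the lecture.

```python
from collections import Counter

# Hypothetical data set: number of siblings reported by 10 respondents.
observations = [0, 1, 1, 2, 2, 2, 3, 1, 0, 2]

# Count how often each value occurs in the data set.
frequency = Counter(observations)

# Text histogram: one row per observed value, one '#' per occurrence.
for value in sorted(frequency):
    print(value, "#" * frequency[value])
```

In a plotted version, the values would sit on the horizontal axis and the counts on the vertical axis.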
What is a discrete variable?
Variable that can take on only certain values (usually whole numbers)
What is a frequency distribution (discrete variable)?
A distribution from which we can calculate the probability of occurrence of specific values of a variable
What is a probability distribution (discrete variable)?
Probability of a specific outcome
A curve from which the probability of occurrence of specific values of a variable can be ascertained
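A minimal sketch of a discrete probability distribution, assuming the variable is the sum of two fair six-sided dice (an example chosen here, not taken from the lecture). Each specific outcome gets its own probability, and the probabilities add up to 1.

```python
from fractions import Fraction
from itertools import product

# Assumed example: the sum of two fair six-sided dice.
# Enumerate all 36 equally likely rolls and record each sum.
sums = [a + b for a, b in product(range(1, 7), repeat=2)]

# Probability of each specific outcome = (rolls giving that sum) / (all rolls).
distribution = {
    total: Fraction(sums.count(total), len(sums))
    for total in sorted(set(sums))
}

for total, probability in distribution.items():
    print(total, probability)
```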
What is a continuous variable?
Variable that can take on an infinite number of (or at least many) values between the lowest and highest values
What is a probability distribution (continuous variable)?
Probability of obtaining a value that falls within a specific interval
Probability = area under the curve
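The idea that probability is the area under the curve can be sketched as follows. The example assumes a normally distributed variable with mean 170 and SD 10 (hypothetical numbers) and compares two ways of getting the probability of falling between 165 and 185: the difference of two cumulative probabilities, and a numerical sum of thin rectangles under the density curve.

```python
from statistics import NormalDist

# Assumed example: a variable that is Normal(mean=170, SD=10).
variable = NormalDist(mu=170, sigma=10)
lower, upper = 165, 185

# Probability of the interval via the cumulative distribution function.
p_interval = variable.cdf(upper) - variable.cdf(lower)

# The same probability as an area: add up thin rectangles under the curve
# (midpoint rule with 1000 slices).
slices = 1000
width = (upper - lower) / slices
area = sum(
    variable.pdf(lower + (i + 0.5) * width) * width
    for i in range(slices)
)

print(round(p_interval, 4), round(area, 4))
```

Both numbers agree to several decimal places, which is the sense in which "probability = area under the curve".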
What are the 3 characteristics of a normal distribution/curve?
Bell-shaped curve
symmetrical
mean = median = mode
What 2 things fully describe a normal distribution?
mean: determines location of centre of the graph
SD: determines height and width of the graph
Probability and SD features of a normal distribution
Probability: total area under curve = 1 & probability that a variable = any particular value is 0
SD:
- within 1 SD of the mean: 68.27% of area under curve
- within 2 SD: 95.45%
- within 3 SD: 99.73%
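The areas within 1, 2 and 3 SD of the mean can be checked with a short sketch. It uses the standard normal distribution, which is enough because the proportions are the same for any mean and SD. The percentages are computed from the exact CDF, so they can differ in the last digit from values rounded off a four-decimal Z table.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

# Area under the curve within k SD of the mean, for k = 1, 2, 3.
coverage = {
    k: standard_normal.cdf(k) - standard_normal.cdf(-k)
    for k in (1, 2, 3)
}

for k, proportion in coverage.items():
    print(f"within {k} SD: {proportion * 100:.2f}%")
```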
What is a special case of normal distribution and its features?
Standard normal distribution and Z scores (different from skewness and kurtosis Z scores)
mean = 0, SD = 1
Z scores = (X - mean)/SD
95% of Z scores lie between -1.96 & 1.96
99% of Z scores lie between -2.58 & 2.58
99.9% of Z scores lie between -3.29 & 3.29
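A small sketch of the Z score formula and of where the cut-offs come from. The helper names `z_score` and `critical_z` are made up for this example, and the score of 130 on a scale with mean 100 and SD 15 is a hypothetical input.

```python
from statistics import NormalDist

def z_score(x, mean, sd):
    """Convert a raw score to a Z score: (X - mean) / SD."""
    return (x - mean) / sd

standard_normal = NormalDist(mu=0, sigma=1)

def critical_z(central_proportion):
    """Z value such that the given central proportion lies between -Z and +Z."""
    tail = (1 - central_proportion) / 2
    return standard_normal.inv_cdf(1 - tail)

# Hypothetical raw score: 130 on a scale with mean 100 and SD 15.
print(z_score(130, 100, 15))

for proportion in (0.95, 0.99, 0.999):
    print(proportion, round(critical_z(proportion), 2))
```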
What is the usefulness of a normal distribution? Give 3 reasons why the assumption of normality is useful (& thus why many statistical procedures are based on it)
(usefulness here also covers the benefits/pros/advantages)
commonly observed distribution
assumption of normality central in inferential statistics (concerned with probability)
characteristics of normal curve well known –> if assumption is valid, characteristics of normality may be applied to infer information about the population parameter and perform hypothesis testing
What is a sampling distribution? Differentiate sample distribution and sampling distribution
sampling distribution: the distribution of a sample statistic obtained by infinitely repeated sampling (or by considering all possible outcomes)
sample distribution: frequency distribution that is obtained from the sample
assumption that the sampling distribution (of any statistic) is normal
every quantity that can be calculated in a sample, i.e. every sample statistic, can have a sampling distribution
What is the Central Limit Theorem (CLT)?
Regardless of the shape of the population, parameter estimates of that population will have a normal distribution if the sample size is large enough (i.e. the sampling distribution of any statistic will be (nearly) normal if the sample size is large enough)
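The CLT can be illustrated with a small simulation. This is a sketch under assumptions made here, not part of the lecture: the population is exponential with mean 1 (strongly right-skewed), the statistic is the sample mean, the sample size is 50, and repeated sampling is approximated with 5000 samples rather than infinitely many.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

def skewness(values):
    """Simple moment-based skewness: mean of cubed standardized values."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)

SAMPLE_SIZE = 50      # assumed n
REPLICATIONS = 5000   # stands in for "infinitely repeated sampling"

# Skewed population: exponential with mean 1 and SD 1.
raw_draws = [random.expovariate(1.0) for _ in range(REPLICATIONS)]

# Sampling distribution of the mean: one mean per repeated sample.
sample_means = [
    statistics.fmean(random.expovariate(1.0) for _ in range(SAMPLE_SIZE))
    for _ in range(REPLICATIONS)
]

mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.pstdev(sample_means)
skew_raw = skewness(raw_draws)
skew_means = skewness(sample_means)

print("mean of sample means:", round(mean_of_means, 3))
print("SD of sample means:", round(sd_of_means, 3))
print("skewness of raw values:", round(skew_raw, 2))
print("skewness of sample means:", round(skew_means, 2))
```

The raw values stay clearly skewed, while the sample means cluster symmetrically around the population mean with an SD close to 1/sqrt(50), which is the pattern the CLT describes.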
What are the 4 guidelines/conditions in the application of CLT? What are the 3 general criteria to look for?
Sample size is large enough if:
1) population distribution is normal
2) n = 15 or less + data distribution is symmetric, unimodal and without outliers –> sample size is large enough –> sampling distribution will be (nearly) normal
3) n between 16 and 40 + data distribution is unimodal, without outliers and without extreme skewness/kurtosis
4) n > 40 + data distribution is unimodal and without extreme outliers
unimodal
symmetry/extreme skewness/kurtosis
outliers
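The four guidelines above can be written as a simple decision rule. The function name, its parameters and the three outlier labels ("none", "mild", "extreme") are hypothetical choices for this sketch; judging whether data are unimodal, symmetric or free of outliers still has to be done by inspecting the data.

```python
def sample_size_large_enough(
    n,
    *,
    population_normal=False,
    unimodal=True,
    symmetric=True,
    extreme_skew_or_kurtosis=False,
    outliers="none",  # assumed labels: "none", "mild" or "extreme"
):
    """Return True if the guidelines suggest a (nearly) normal sampling distribution."""
    # Guideline 1: a normal population is enough at any sample size.
    if population_normal:
        return True
    # Guideline 2: n of 15 or less needs symmetric, unimodal data with no outliers.
    if n <= 15:
        return unimodal and symmetric and outliers == "none"
    # Guideline 3: n from 16 to 40 needs unimodal data, no outliers,
    # and no extreme skewness/kurtosis.
    if n <= 40:
        return unimodal and outliers == "none" and not extreme_skew_or_kurtosis
    # Guideline 4: n above 40 needs unimodal data without extreme outliers.
    return unimodal and outliers != "extreme"

print(sample_size_large_enough(10, symmetric=False))
print(sample_size_large_enough(30, symmetric=False))
print(sample_size_large_enough(100, outliers="mild"))
```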