Lecture 3 - Normal Distribution & Outliers Flashcards by Michael Hommie Chan

What is distribution?

An arrangement of values of a variable showing their observed or theoretical frequency of occurrence

What is frequency distribution?

A graph plotting values of observations on the horizontal axis and the frequency with which each value occurs in the data set on the vertical axis

What is a discrete variable?

Variable that can take on only certain values (usually whole numbers)

What is a frequency distribution (discrete variable)?

A distribution from which we can calculate the probability of occurrence of specific values of a variable

What is a probability distribution (discrete variable)?

Probability of a specific outcome

A curve from which the probability of occurrence of specific values of a variable can be ascertained
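As a rough illustration of how a frequency distribution of a discrete variable leads to probabilities of specific values, here is a minimal sketch. The die rolls are made-up data, not from the lecture.

```python
from collections import Counter

# Made-up die rolls: a discrete variable that takes whole-number values only
rolls = [1, 3, 3, 4, 6, 6, 6, 2, 5, 3]

# Frequency distribution: each observed value -> how often it occurs
freq = Counter(rolls)

# Empirical probability of each specific value = frequency / number of observations
prob = {value: count / len(rolls) for value, count in sorted(freq.items())}

for value, p in prob.items():
    print(f"P(X = {value}) = {p:.1f}")
```

The probabilities of all the specific outcomes add up to 1.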

What is a continuous variable?

Variable that can take on an infinite number (or at least many) values between the lowest and highest values

What is a probability distribution (continuous variable)?

Probability of obtaining a value that falls within a specific interval

Probability = area under the curve
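A minimal sketch of "probability = area under the curve", using Python's standard-library `statistics.NormalDist`. The mean of 100 and SD of 15 are assumed illustration values, not from the lecture.

```python
from statistics import NormalDist

# Hypothetical continuous variable: mean 100, SD 15 (illustration values only)
dist = NormalDist(mu=100, sigma=15)

def interval_probability(d, low, high):
    """Area under the curve between low and high = P(low <= X <= high)."""
    return d.cdf(high) - d.cdf(low)

p_interval = interval_probability(dist, 100, 130)  # an interval has some area
p_point = interval_probability(dist, 100, 100)     # a single exact value has zero area

print(f"P(100 <= X <= 130) = {p_interval:.4f}")
print(f"P(X = 100)         = {p_point:.4f}")
```

This is why, for a continuous variable, we only ask about the probability of falling within an interval.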

What are the 3 characteristics of a normal distribution/curve?

Bell-shaped curve

symmetrical

mean = median = mode

What 2 things fully describe a normal distribution?

mean: determines location of centre of the graph

SD: determines height and width of the graph

Probability and SD features of a normal distribution

Probability: total area under the curve = 1, and the probability that the variable equals any one particular value is 0

Within 1 SD of the mean: 68.27% of the area under the curve
Within 2 SD: 95.45%
Within 3 SD: 99.73%
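The areas within 1, 2 and 3 SD can be checked with a short sketch, assuming Python's standard-library `statistics.NormalDist`. The same proportions hold for any normal distribution, so the standard normal is used here for convenience.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, SD 1

def area_within(k):
    """Proportion of the area under a normal curve within k SDs of the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

areas = {k: area_within(k) for k in (1, 2, 3)}
for k, area in areas.items():
    print(f"within {k} SD of the mean: {area:.2%}")
```

The printed values should come out at roughly 68%, 95% and 99.7%.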

What is a special case of normal distribution and its features?

Standard normal distribution and Z scores (different from skewness and kurtosis Z scores)

mean = 0, SD = 1

Z scores = (X - mean)/SD

95% of Z scores lie between -1.96 & 1.96
99%, -2.58 & 2.58
99.9%, -3.29 & 3.29
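A minimal sketch of the Z score formula and of where the cut-offs above come from, assuming Python's standard-library `statistics.NormalDist`. The raw score, mean and SD in the example are made up for illustration.

```python
from statistics import NormalDist

def z_score(x, mean, sd):
    """Standardise a raw score: Z = (X - mean) / SD."""
    return (x - mean) / sd

def two_sided_cutoff(coverage):
    """Z value such that `coverage` of the standard normal lies between -Z and +Z."""
    tail = (1 - coverage) / 2
    return NormalDist().inv_cdf(1 - tail)

z = z_score(130, mean=100, sd=15)  # hypothetical raw score of 130
cutoffs = {c: two_sided_cutoff(c) for c in (0.95, 0.99, 0.999)}

print(f"Z = {z:.2f}")
for c, z_crit in cutoffs.items():
    print(f"{c:.1%} of Z scores lie within +/-{z_crit:.2f}")
```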

What is the usefulness of a normal distribution? Give 3 reasons why the assumption of normality is useful (and thus why many statistical procedures are based on it)

(usefulness here also covers the benefits/pros/advantages)

commonly observed distribution

assumption of normality central in inferential statistics (concerned with probability)

characteristics of normal curve well known –> if assumption is valid, characteristics of normality may be applied to infer information about the population parameter and perform hypothesis testing

What is a sampling distribution? Differentiate sample distribution and sampling distribution

sampling distribution: the distribution of a sample statistic obtained by infinite repeated sampling (or considering all possible outcomes)

sample distribution: frequency distribution that is obtained from the sample

it is commonly assumed that the sampling distribution (of any statistic) is normal

every quantity that can be calculated from a sample, i.e. every sample statistic, has a sampling distribution

What is Central Limit Theorem (CLT)?

Regardless of the shape of the population distribution, parameter estimates of that population will have a (nearly) normal distribution if the sample size is large enough, i.e. the sampling distribution of a statistic such as the mean will be (nearly) normal if the sample size is large enough
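The theorem can be illustrated with a small simulation. This is a sketch under assumptions of my own choosing: an exponential population (strongly positively skewed), 5000 repeated samples per sample size, and a simple moment-based skewness as the measure of non-normality.

```python
import random
from statistics import fmean, pstdev

rng = random.Random(42)  # fixed seed so the run is repeatable

def skewness(values):
    """Simple moment-based skewness (about 0 for a symmetric distribution)."""
    m, s = fmean(values), pstdev(values)
    return fmean([((v - m) / s) ** 3 for v in values])

def sample_means(n, repeats=5000):
    """Means of `repeats` random samples of size n from an exponential population."""
    return [fmean([rng.expovariate(1.0) for _ in range(n)]) for _ in range(repeats)]

# Approximate the sampling distribution of the mean for several sample sizes
skew_by_n = {n: skewness(sample_means(n)) for n in (2, 10, 50)}
for n, sk in skew_by_n.items():
    print(f"n = {n:>2}: skewness of the sample means = {sk:.2f}")
```

The skewness of the sample means shrinks towards 0 as n grows, even though the population itself is far from normal.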

What are the 4 guidelines/conditions in the application of CLT? What are the 3 general criteria to look for?

Sample size is large enough (so the sampling distribution will be (nearly) normal) if

1.) population distribution is normal
2.) n is 15 or less + data distribution is symmetric, unimodal and without outliers
3.) n is between 16 and 40 + data distribution is unimodal, without outliers and without extreme skewness/kurtosis
4.) n > 40 + data distribution is unimodal and without extreme outliers

3 general criteria to look for:
unimodal
symmetry/extreme skewness/kurtosis
outliers

What are the 3 methods to test for normality? Why do we not use/Why is it not advisable to use normality tests?

1.) CLT and sample size

2.) Numerical methods (summary statistics) in EDA

3.) Graphical methods in EDA

Normality tests, e.g. Shapiro-Wilk, are not reliable:

- when n is small (30 and below): they lack power and are not sensitive enough –> false -ve (shows no deviation from normality when in reality there is deviation)
- when n is large (>30): they are too sensitive –> false +ve

What do we look out for in numerical methods in EDA?

Compare mean and median - similar? –> symmetrical

Calculate the skewness and kurtosis Z scores - are the absolute values |Z| < 2? –> no extreme skewness/kurtosis
- Z score = (raw score or statistic)/standard error
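A sketch of the statistic/standard error calculation for skewness and kurtosis. The lecture does not say which formulas to use; the adjusted sample skewness, excess kurtosis and their standard errors below are the ones commonly reported by statistics packages and are an assumption here. Both data sets are made up.

```python
import math
from statistics import fmean, stdev

def skew_kurtosis_z(data):
    """Return (Z for skewness, Z for kurtosis), each = statistic / standard error."""
    n = len(data)
    m, s = fmean(data), stdev(data)  # sample SD (n - 1 denominator)
    z = [(x - m) / s for x in data]

    # Adjusted sample skewness and excess kurtosis (assumed formulas, see lead-in)
    skew = n / ((n - 1) * (n - 2)) * sum(v ** 3 for v in z)
    kurt = (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * sum(v ** 4 for v in z)
            - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))

    # Standard errors depend only on the sample size
    se_skew = math.sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))
    se_kurt = 2 * se_skew * math.sqrt((n ** 2 - 1) / ((n - 3) * (n + 5)))
    return skew / se_skew, kurt / se_kurt

symmetric = [2, 4, 4, 5, 5, 5, 6, 6, 8]        # made-up, symmetric about 5
skewed = [1, 1, 2, 2, 3, 3, 4, 5, 9, 20]       # made-up, long right tail

z_sym = skew_kurtosis_z(symmetric)
z_skewed = skew_kurtosis_z(skewed)
print(f"symmetric data: Z skewness = {z_sym[0]:.2f}, Z kurtosis = {z_sym[1]:.2f}")
print(f"skewed data:    Z skewness = {z_skewed[0]:.2f}, Z kurtosis = {z_skewed[1]:.2f}")
```

With the |Z| < 2 guideline, the symmetric data pass the skewness check and the right-tailed data do not.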

What are the 3 types of graphs that we look out for in graphical methods in EDA? What information does each graph show?

Histogram: shows all features, but is especially good for GAPS, SYMMETRY & NO. OF MODES

Normal probability (Q-Q) plot: graph plotting the quantiles of a variable against the quantiles of a normal distribution; great for NORMALITY/NORMAL DISTRIBUTION (based on how close the data points are to the best-fit line), SKEWNESS, KURTOSIS & OUTLIERS

Box plot: does not show gaps or the no. of modes, but is great for OUTLIERS & SYMMETRY

What are the respective methods for identifying the following characteristics/criteria:

1.) outlier
2.) extreme outlier
3.) skewness/kurtosis
4.) extreme skewness/kurtosis
5.) no. of modes (least important)

1.) box plot, maybe normal probability (Q-Q) plot
2.) box plot
3.) Z scores; histogram (sample size must be large enough to be reliable); normal probability (Q-Q) plot; box plot
4.) Z scores
5.) histogram
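The cards do not say how a box plot separates outliers from extreme outliers. A common convention, assumed here, is to flag values more than 1.5 x IQR beyond the quartiles as outliers and values more than 3 x IQR beyond them as extreme outliers. The scores are made-up data.

```python
from statistics import quantiles

def classify_outliers(data):
    """Split values into (mild, extreme) outliers using box-plot fences."""
    q1, _, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)  # beyond these: outlier
    outer = (q1 - 3.0 * iqr, q3 + 3.0 * iqr)  # beyond these: extreme outlier
    extreme = [x for x in data if x < outer[0] or x > outer[1]]
    mild = [x for x in data
            if (x < inner[0] or x > inner[1]) and x not in extreme]
    return mild, extreme

scores = [12, 13, 14, 15, 15, 16, 17, 18, 26, 40]  # made-up data
mild, extreme = classify_outliers(scores)
print("outliers:", mild)
print("extreme outliers:", extreme)
```

Different software computes quartiles slightly differently, so borderline points may be classified differently elsewhere.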

What are the 2 things to do if the normality assumption is questionable?

1.) Check data: errors in data entry, measurement etc.

2.) Decide on appropriate actions on a case-by-case basis

What are the 4 appropriate actions that we can undertake when the normality assumption is questionable and we have already checked the data and found no errors? Describe them

(remember: we can only choose one of them)

1.) Leave the data as they are
- depends on what we are using the data for - normality might not matter
- some inferential tests requiring the assumption of normal distribution (i.e. parametric tests) may be robust to minor violations of normality e.g. independent t-test

2.) Data transformation
- use of a non-linear function to change the size of large values differently from the change in size of small values e.g. +vely skewed data distribution: square root; -vely skewed: square, cube
- always use the least powerful transformation (first) necessary to reduce the impact/presence of outliers or improve symmetry
- if transformation of -ve values is not possible, add a constant to all data values that will make them all positive prior to transformation

3.) Non-parametric inferential tests

4.) Robust or other modern inferential tests
e.g. independent t-test
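A minimal sketch of the data transformation option: a square-root transformation applied to positively skewed data, with a constant added first because one value is negative. The data and the moment-based skewness measure are assumptions for illustration. The shift used here only brings the minimum up to 0, which is enough for a square root; a log transformation would need strictly positive values.

```python
import math
from statistics import fmean, pstdev

def skewness(values):
    """Simple moment-based skewness (about 0 for a symmetric distribution)."""
    m, s = fmean(values), pstdev(values)
    return fmean([((v - m) / s) ** 3 for v in values])

def sqrt_transform(values):
    """Square-root transform, adding a constant first if any value is negative."""
    shift = -min(values) if min(values) < 0 else 0.0
    return [math.sqrt(v + shift) for v in values]

raw = [-2, 0, 1, 1, 2, 3, 4, 6, 14, 47]  # made-up, positively skewed data
transformed = sqrt_transform(raw)

print(f"skewness before: {skewness(raw):.2f}")
print(f"skewness after:  {skewness(transformed):.2f}")
```

The large values are pulled in more than the small ones, so the skewness drops, but any results then need to be interpreted on the square-root scale.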

What are the 2 benefits/pros/advantages and 3 problems/cons/disadvantages of data transformation?

Benefits:
an appropriate transformation may
- reduce the impact of outliers by turning extraordinary points into merely the largest or smallest values in the data set
- improve symmetry

Problems:

- not applicable if there are outliers in both directions
- back transformation may be needed for meaningful interpretation
- introduces a bias when we change a linear to a non-linear function –> favours either the more +ve no.s or the more -ve no.s

What are parametric tests? Give some examples

Inferential tests that make assumptions about the parameters of the population distribution from which the samples are drawn.

e.g. t-test, Pearson’s Correlation

(Common assumptions made are that of normality and quantitative - interval and ratio - data)

What are some (3-4) benefits/advantages/pros of parametric tests?

(usefulness also counts as a benefit)

More flexible

More options available

More powerful if data distribution is normal
- power: ability to find/detect the true/significant difference/relationship if there is actually one

Always start with parametric tests, i.e. always select a parametric test as the preliminary inferential test, because many patterns and details (i.e. characteristics) of the normal distribution curve are well known

[refer to question on the usefulness of normal distribution - card 12]

What are non-parametric tests?

Inferential tests that do not require the assumption of normality and can be used for nominal and ordinal data as well as quantitative data

What is the disadvantage of parametric tests?

Assumption of normality of distribution needed

What is the advantage of non-parametric tests?

No assumption of normality of distribution (needed)

What are the 3 disadvantages of non-parametric tests?

Generally have reduced statistical power when distributions are truly normal
Limited or more complex follow-up tests in the event of significant findings (i.e. not many options)
Limited software available to produce confidence intervals (i.e. not many options)

What are outliers? Definition

An observation/score that is very different from the rest of the data
An extreme point that stands out from the rest of the distribution
Present in both normal and non-normal distributions

Why do outliers matter? What are the 3 consequences of having an outlier?

Outliers affect any statistical test based on sample mean and variance (which are commonly used)
1.) Biased parameter estimates
2.) Biased confidence interval
3.) Faulty conclusions in hypothesis testing

What are the 2 ways of handling outliers? How do we handle them? [see card 20, 21]

Depends on the source of the outlier: error or genuine case due to variability of data

1.) Check if outlier is due to error (data errors: measurement, instrument, data entry errors) --> if outlier is due to error: rectify or drop
2.) Outlier is a genuine case -->
- Descriptive stats: use median and IQR to describe the data
- Inferential stats: data transformation or non-parametric tests that are robust to outliers
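A small sketch of why the median and IQR are preferred for describing data that contain a genuine outlier. The data are made up, and the quartile method (`inclusive`) is just one of several conventions.

```python
from statistics import fmean, median, quantiles, stdev

def summarise(data):
    """Mean/SD (sensitive to outliers) next to median/IQR (resistant to them)."""
    q1, _, q3 = quantiles(data, n=4, method="inclusive")
    return {"mean": fmean(data), "sd": stdev(data),
            "median": median(data), "iqr": q3 - q1}

clean = [10, 11, 12, 12, 13, 14, 15]   # made-up data
with_outlier = clean + [60]            # same data plus one extreme but genuine value

before = summarise(clean)
after = summarise(with_outlier)
for name in ("mean", "sd", "median", "iqr"):
    print(f"{name:>6}: {before[name]:6.2f} -> {after[name]:6.2f}")
```

The mean and SD move a lot when the single extreme value is added, while the median and IQR barely change.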

Why can't we leave outliers as they are? When can we leave them as they are?

We can't leave them because they result in bias and other problems (e.g. cause the distribution of data to be non-normal)
We can leave them when using parametric/non-parametric tests that are robust to minor violations of normality/outliers

When can we delete outliers? 2 reasons for being able to do so

Only as a last resort
Only if they lie so far outside the range of the remainder of the data that they distort statistical inferences
Not good to delete them at will

What are the 2 things that we must take note of when/with regards to reporting the results and deletion of outliers?

Must report deletion and offer reasons/show why
Report model results both with and without outliers --> compare and make a conclusion (choose one)

Lecture 3 - Normal Distribution & Outliers Flashcards

(34 cards)