Lecture 3 - Normal Distribution & Outliers Flashcards

1
Q

What is distribution?

A

An arrangement of values of a variable showing their observed or theoretical frequency of occurrence

2
Q

What is frequency distribution?

A

A graph plotting values of observations on the horizontal axis and the frequency with which each value occurs in the data set on the vertical axis

3
Q

What is a discrete variable?

A

Variable that can take on only certain values (usually whole numbers)

4
Q

What is a frequency distribution (discrete variable)?

A

A distribution from which we can calculate the probability of occurrence of specific values of a variable

5
Q

What is a probability distribution (discrete variable)?

A

Probability of a specific outcome

A curve from which the probability of occurrence of specific values of a variable can be ascertained

6
Q

What is a continuous variable?

A

Variable that can take on an infinite number of (or at least many) values between the lowest and highest values

7
Q

What is a probability distribution (continuous variable)?

A

Probability of obtaining a value that falls within a specific interval

Probability = area under the curve
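
To make "probability = area under the curve" concrete, here is a minimal sketch using Python's standard library. The variable (mean 100, SD 15) and the helper names are made up for illustration; the normal curve is just one convenient example of a continuous distribution.

```python
from statistics import NormalDist

def interval_probability(dist, low, high):
    """P(low < X < high) = area under the density curve between low and high."""
    return dist.cdf(high) - dist.cdf(low)

def trapezoid_area(dist, low, high, steps=10_000):
    """Approximate the same area by summing thin trapezoids under the curve."""
    width = (high - low) / steps
    heights = [dist.pdf(low + i * width) for i in range(steps + 1)]
    return width * (sum(heights) - 0.5 * (heights[0] + heights[-1]))

# Hypothetical continuous variable: mean 100, SD 15.
scores = NormalDist(mu=100, sigma=15)
p_exact = interval_probability(scores, 85, 115)
p_numeric = trapezoid_area(scores, 85, 115)
print(round(p_exact, 4), round(p_numeric, 4))
```

Both numbers agree, and an interval of zero width (a single value) has zero area, which is why only intervals carry probability for a continuous variable.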

8
Q

What are the 3 characteristics of a normal distribution/curve?

A

Bell-shaped curve

symmetrical

mean = median = mode

9
Q

What 2 things fully describe a normal distribution?

A

mean: determines location of centre of the graph

SD: determines height and width of the graph

10
Q

Probability and SD features of a normal distribution

A

Probability: total area under curve = 1 & probability that a variable = any particular value is 0

SD

  • 1 SD of mean: 68.26% of area under curve
  • 2 SD: 95.44%
  • 3 SD: 99.74%
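
The areas above can be checked with a short sketch (standard library only; `area_within` is a name made up for this example). Computed directly, the areas come out at about 68.27%, 95.45% and 99.73%; figures read from a rounded Z table, like those on this card, can differ in the second decimal place.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, SD 1

def area_within(k):
    """Proportion of a normal distribution lying within k SDs of the mean."""
    return z.cdf(k) - z.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} SD of the mean: {area_within(k):.2%}")
```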
11
Q

What is a special case of normal distribution and its features?

A

Standard normal distribution and Z scores (different from skewness and kurtosis Z scores)

mean = 0, SD = 1

Z scores = (X - mean)/SD

95% of Z scores lie between -1.96 & 1.96
99%, -2.58 & 2.58
99.9%, -3.29 & 3.29
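
A small sketch of the Z score formula and of where the cut-offs above come from. The example score (130 on a scale with mean 100 and SD 15) and the function names are hypothetical.

```python
from statistics import NormalDist

def z_score(x, mean, sd):
    """Z = (X - mean) / SD: distance from the mean in SD units."""
    return (x - mean) / sd

def central_cutoff(coverage):
    """Positive z such that `coverage` of Z scores lie between -z and +z."""
    return NormalDist().inv_cdf(0.5 + coverage / 2)

print(z_score(130, mean=100, sd=15))
for coverage in (0.95, 0.99, 0.999):
    print(coverage, round(central_cutoff(coverage), 2))
```

Rounded to two decimals, the three cut-offs reproduce 1.96, 2.58 and 3.29.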

12
Q

What is the usefulness of a normal distribution? Give 3 reasons why the assumption of normality is useful (& thus why many statistical procedures are based on the assumption of normality)

usefulness would definitely also be the benefits/pros/advantages

A

commonly observed distribution

assumption of normality central in inferential statistics (concerned with probability)

characteristics of normal curve well known –> if assumption is valid, characteristics of normality may be applied to infer information about the population parameter and perform hypothesis testing

13
Q

What is a sampling distribution? Differentiate sample distribution and sampling distribution

A

sampling distribution: the distribution of a sample statistic obtained by infinite repeated sampling (or considering all possible outcomes)

sample distribution: frequency distribution that is obtained from the sample

assumption that the sampling distribution (of any statistic) is normal

every parameter that can be calculated in a sample i.e. every sample statistic can have a sampling distribution

14
Q

What is Central Limit Theorem (CLT)?

A

Regardless of the shape of the population, parameter estimates of that population will have a normal distribution if the sample size is large enough (i.e. the sampling distribution of any statistic will be (nearly) normal if the sample size is large enough)
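
The CLT can be illustrated with a small simulation. This is only a sketch under stated assumptions: the population is exponential (right-skewed, mean 1, SD 1), the statistic is the sample mean, and the sample sizes, number of repeats and seed are arbitrary choices. Repeated sampling stands in for the "infinite repeated sampling" that defines a sampling distribution.

```python
import random
from statistics import mean, stdev

def sample_means(n, repeats, rng):
    """Means of `repeats` samples of size n drawn from a right-skewed population."""
    return [mean(rng.expovariate(1.0) for _ in range(n)) for _ in range(repeats)]

def skewness(values):
    """Moment-based skewness: 0 for a perfectly symmetric distribution."""
    m = mean(values)
    m2 = mean((v - m) ** 2 for v in values)
    m3 = mean((v - m) ** 3 for v in values)
    return m3 / m2 ** 1.5

rng = random.Random(42)  # fixed seed so the run is repeatable
small = sample_means(n=2, repeats=5000, rng=rng)
large = sample_means(n=50, repeats=5000, rng=rng)

print("skewness of sample means, n=2 :", round(skewness(small), 2))
print("skewness of sample means, n=50:", round(skewness(large), 2))
print("mean and SD of sample means, n=50:", round(mean(large), 2), round(stdev(large), 3))
```

With the larger sample size the sample means are far less skewed, they centre on the population mean, and their SD is close to the population SD divided by the square root of n.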

15
Q

What are the 4 guidelines/conditions in the application of CLT? What are the 3 general criteria to look for?

A

Sample size is large enough if

  1.) population distribution is normal
  2.) n = 15 or less + data distribution is symmetric, unimodal and without outliers –> sample size is large enough –> sampling distribution will be (nearly) normal
  3.) n between 16 and 40 + data distribution is unimodal, without outliers and without extreme skewness/kurtosis
  4.) n > 40 + data distribution is unimodal and without extreme outliers

unimodal
symmetry/extreme skewness/kurtosis
outliers

16
Q

What are the 3 methods to test for normality? Why do we not use/Why is it not advisable to use normality tests?

A

CLT and sample size

Numerical methods (summary statistics) in EDA

Graphical Methods in EDA

normality tests e.g. Shapiro-Wilk are not reliable

  • when n is small (30 and below): lack power, not sensitive enough –> false -ve (shows no deviation from normality when in reality there is deviation)
  • when n is large (>30): too sensitive –> false +ve
17
Q

What do we look out for in numerical methods in EDA?

A

Compare mean and median - similar? –> symmetrical

Calculate the skewness and kurtosis Z scores - both |Z| < 2? –> no extreme skewness/kurtosis
- Z score = statistic (raw score)/standard error
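
A minimal sketch of the skewness and kurtosis Z scores (statistic divided by its standard error). The cards do not say which formulas to use, so this assumes the bias-adjusted sample skewness and excess kurtosis, with the standard errors commonly reported by statistics packages. The function name and the data are made up.

```python
from math import sqrt
from statistics import mean

def skew_kurtosis_z(data):
    """Return (z_skewness, z_kurtosis), each as statistic / standard error.

    Assumed formulas: bias-adjusted skewness and excess kurtosis. Needs n >= 4.
    """
    n = len(data)
    m = mean(data)
    m2 = sum((x - m) ** 2 for x in data) / n
    m3 = sum((x - m) ** 3 for x in data) / n
    m4 = sum((x - m) ** 4 for x in data) / n

    skew = sqrt(n * (n - 1)) / (n - 2) * (m3 / m2 ** 1.5)
    kurt = (n - 1) / ((n - 2) * (n - 3)) * ((n + 1) * (m4 / m2 ** 2 - 3) + 6)

    se_skew = sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))
    se_kurt = 2 * se_skew * sqrt((n * n - 1) / ((n - 3) * (n + 5)))
    return skew / se_skew, kurt / se_kurt

symmetric = [1, 2, 3, 4, 5, 6, 7, 8, 9]
right_skewed = [1, 1, 2, 2, 2, 3, 3, 4, 5, 30]
print(skew_kurtosis_z(symmetric))
print(skew_kurtosis_z(right_skewed))
```

The symmetric data give a skewness Z score of 0, while the data with one very large value give a skewness Z score well above 2.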

18
Q

What are the 3 types of graphs that we look out for in graphical methods in EDA? What information does each graph show?

A

Histogram: shows all features but better especially for GAPS, SYMMETRY & NO. OF MODES

Normal probability (Q-Q) plot: graph plotting the quantiles of a variable against the quantiles of a normal distribution; great for NORMALITY/NORMAL DISTRIBUTION based on how close data points are to the best-fit line, SKEWNESS, KURTOSIS & OUTLIERS

Box plot: no gaps and no. of modes but great for OUTLIERS & SYMMETRY
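
The points behind a normal probability (Q-Q) plot can be computed by hand, as sketched below. The plotting position (i - 0.5)/n is one common convention, assumed here; software packages differ. The helper name and the data are illustrative only.

```python
from statistics import NormalDist, mean, stdev

def qq_points(data):
    """Pair each sorted observation with the matching quantile of a fitted normal."""
    n = len(data)
    fitted = NormalDist(mean(data), stdev(data))
    theoretical = [fitted.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, sorted(data)))

sample = [2, 4, 4, 5, 6, 7, 9]
for expected, observed in qq_points(sample):
    print(f"{expected:6.2f}  {observed}")
```

If the data are close to normal, the observed values track the expected ones and the points fall near a straight line. Points that peel away at the ends suggest skewness, kurtosis or outliers.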

19
Q

What are the respective methods for identifying the following characteristics/criteria:

  • outlier
  • extreme outlier
  • skewness/kurtosis
  • Extreme skewness/kurtosis
  • no. of modes (least important)
A
  • box plot, maybe normal probability Q-Q plot
  • box plot
  • Z scores; histogram (sample size must be large enough to be reliable); normal probability Q-Q plot; box plot
  • Z scores
  • histogram
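
How a box plot separates outliers from extreme outliers can be sketched as follows. The cards do not state the rule, so this assumes the usual box-plot convention: more than 1.5 x IQR beyond the box is an outlier, more than 3 x IQR is an extreme outlier. Quartile definitions also vary between packages, so borderline cases may be classified differently elsewhere.

```python
from statistics import quantiles

def classify_outliers(data):
    """Return (outliers, extreme_outliers) using the assumed 1.5 and 3 x IQR fences."""
    q1, _, q3 = quantiles(data, n=4)  # quartile method varies between packages
    iqr = q3 - q1
    mild, extreme = [], []
    for x in data:
        distance = max(q1 - x, x - q3)  # how far x lies beyond the box
        if distance > 3 * iqr:
            extreme.append(x)
        elif distance > 1.5 * iqr:
            mild.append(x)
    return mild, extreme

values = [10, 11, 12, 12, 13, 13, 14, 14, 15, 16, 25, 60]
print(classify_outliers(values))
```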
20
Q

What are the 2 things to do if the normality assumption is questionable?

A
  1.) Check data: errors in data entry, measurement etc.
  2.) Decide on appropriate actions on a case by case basis

21
Q

What are the 4 appropriate actions that we can undertake when the normality assumption is questionable and we have already checked the data and found no errors? Describe them

remember, only can choose one of them

A
  1.) Leave the data as they are
    - depends on what we are using the data for - normality might not matter
    - some inferential tests requiring the assumption of normal distribution (i.e. parametric tests) may be robust to minor violations of normality e.g. independent t-test
  2.) Data transformation
    - use of a non-linear function to change the size of large values differently from the change in size of small values e.g. +skewed data distribution: square root; -skewed: square, cube
    - always use the least powerful transformation (first) necessary to reduce the impact/presence of outliers or improve symmetry
    - if the transformation cannot be applied to -ve values, add a constant to all data values that makes them all positive prior to transformation
  3.) Non-parametric inferential tests
  4.) Robust or other modern inferential tests
    e.g. independent t-test
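
As an illustration of option 2 (data transformation), the sketch below applies a square-root transformation to made-up positively skewed data and compares the skewness before and after. The shift used for negative values is one simple choice of constant: it moves the minimum to 0, which is enough for a square root.

```python
from math import sqrt
from statistics import mean

def skewness(values):
    """Moment-based skewness: > 0 means a longer right tail."""
    m = mean(values)
    m2 = mean((v - m) ** 2 for v in values)
    m3 = mean((v - m) ** 3 for v in values)
    return m3 / m2 ** 1.5

def sqrt_transform(values):
    """Square-root transform; adds a constant first if any value is negative."""
    shift = -min(values) if min(values) < 0 else 0
    return [sqrt(v + shift) for v in values]

raw = [1, 4, 4, 9, 9, 9, 16, 25, 49, 100]  # hypothetical +skewed data
transformed = sqrt_transform(raw)
print(round(skewness(raw), 2), round(skewness(transformed), 2))
print(sqrt_transform([-4, 0, 5]))  # shifted by 4 before taking roots
```

The large values are pulled in more than the small ones, so the right tail shortens and the skewness drops.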
22
Q

What are the 2 benefits/pros/advantages and 3 problems/cons/disadvantages of data transformation?

A

Benefits:
an appropriate transformation may
- reduce the impact of outliers by turning extraordinary points into merely the largest or smallest values in the data set
- improve symmetry

Problems:

  • not applicable if have outliers in both directions
  • back transformation may be needed for meaningful interpretation
  • introducing a bias when we change a linear to a non-linear function –> favours either the more +ve or the more -ve numbers
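
The back-transformation and bias problems can be seen in a tiny worked example (made-up data, square-root transformation assumed). The mean computed on the transformed scale, once squared back, is not the ordinary mean of the original data.

```python
from math import sqrt
from statistics import mean

data = [1, 4, 9, 16, 100]

mean_on_sqrt_scale = mean(sqrt(x) for x in data)  # mean of 1, 2, 3, 4, 10
back_transformed = mean_on_sqrt_scale ** 2        # back on the original scale
arithmetic_mean = mean(data)

print(mean_on_sqrt_scale, back_transformed, arithmetic_mean)
```

The back-transformed value is smaller than the arithmetic mean because the square root gives the largest value less weight, so a result reported on the original scale needs to say which of the two it is.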
23
Q

What are parametric tests? Give some examples

A

Inferential tests that make assumptions (commonly normality) about the population distribution from which the samples are drawn.

e.g. t-test, Pearson’s Correlation

(Common assumptions made are that of normality and quantitative - interval and ratio - data)

24
Q

What are some (3-4) benefits/advantages/pros of parametric tests?

usefulness would definitely be the benefit

A

More flexible

More options available

More powerful if data distribution is normal
- power: ability to find/detect the true/significant difference/relationship if there is actually one

Always start with parametric tests first i.e. always select a parametric test as the preliminary inferential test as there are lots of patterns and details (i.e. characteristics) well known about a normal distribution curve

[refer to question on the usefulness of normal distribution - card 12]

25
Q

What are non-parametric tests?

A

Inferential tests that do not require the assumption of normality and can be used for nominal and ordinal data as well as quantitative data

26
Q

What is the disadvantage of parametric tests?

A

Assumption of normality of distribution needed

27
Q

What is the advantage of non-parametric tests?

A

No assumption of normality of distribution (needed)

28
Q

What are the 3 disadvantages of non-parametric tests?

A

Generally have reduced statistical power when distributions are truly normal

Limited or more complex follow up tests in event of significant findings (i.e. not many options)

Limited software available to produce confidence intervals (i.e. not many options)

29
Q

What are outliers? Definition

A

An observation/score that is very different from the rest of the data
An extreme point that stands out from the rest of the distribution
Present in both normal and non-normal distributions

30
Q

Why do outliers matter? What are the 3 consequences of having an outlier?

A

Outliers affect any statistical test based on sample mean and variance (which are commonly used)

  1. ) Biased parameter estimates
  2. ) Biased confidence interval
  3. ) Faulty conclusions in hypothesis testing
31
Q

What are the 2 ways of handling outliers? How do we handle them?

[see card 20, 21]

Depends on the source of the outlier: error or genuine case due to variability of data

A

1.) Check if outlier is due to error (Data errors: measurement, instrument, data entry errors) –>
If outlier is due to error: rectify or drop

2.) Outlier is a genuine case –>
Descriptive stats: use median and IQR to describe the data
Inferential stats: data transformation or non-parametric tests that are robust to outliers
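
Why the median and IQR are preferred for describing data with a genuine outlier can be shown with a short sketch. The data are invented and `describe` is a name chosen for this example.

```python
from statistics import mean, median, quantiles, stdev

def describe(data):
    """Mean/SD next to median/IQR for the same data."""
    q1, _, q3 = quantiles(data, n=4)
    return {"mean": mean(data), "sd": stdev(data),
            "median": median(data), "iqr": q3 - q1}

clean = [10, 11, 12, 12, 13, 13, 14, 14, 15]
with_outlier = clean + [60]

print(describe(clean))
print(describe(with_outlier))
```

Adding the single value 60 pulls the mean up and inflates the SD several times over, while the median and IQR stay where they were.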

32
Q

Why can’t we leave outliers as they are? When can we leave them as they are?

A

Result in bias and other problems (e.g. cause distribution of data to be non-normal)

When using parametric/non-parametric tests that are robust to minor violation of normality/outliers

33
Q

When can we delete outliers? 2 reasons for being able to do so

A

Only as a last resort

Only if they lie so far outside the range of the remainder of the data that they distort statistical inferences

Not good to delete them at will

34
Q

What are the 2 things that we must take note of when/with regards to reporting the results and deletion of outliers?

A

Must report deletion and offer reasons/show why

Report model results both with and without outliers –> compare and make a conclusion (choose one)
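
One way to follow the second point is to run the same analysis twice and keep both results next to the list of deleted values. This is only a sketch: the "model" is a simple difference in group means, and the data, group names and flagged value are hypothetical.

```python
from statistics import mean

def mean_difference(group_a, group_b):
    return mean(group_a) - mean(group_b)

def report_with_and_without(group_a, group_b, flagged):
    """Same result on all data and with the flagged values removed."""
    kept_a = [x for x in group_a if x not in flagged]
    kept_b = [x for x in group_b if x not in flagged]
    return {
        "removed": sorted(flagged),  # deletions are reported, not hidden
        "with_outliers": mean_difference(group_a, group_b),
        "without_outliers": mean_difference(kept_a, kept_b),
    }

# Hypothetical data; 60 is treated as the flagged outlier.
treatment = [12, 14, 15, 15, 16, 60]
control = [11, 12, 13, 13, 14, 15]
report = report_with_and_without(treatment, control, flagged={60})
print(report)
```

Here the difference shrinks from 9 to about 1.4 once the flagged value is removed, which is exactly the kind of change a reader needs to see before accepting either conclusion.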