Week 4 Flashcards

1
Q

Central Limit Theorem

A

● In probability theory, the central limit theorem (CLT) establishes that, in many situations, when independent random variables are summed up, their properly normalized sum tends toward a normal distribution (informally a bell curve) even if the original variables themselves are not normally distributed

● States that, for any variable (with finite variance), the distribution of sample means will approximate the normal curve as the number of samples increases.
● The fewer the samples, the worse the approximation of the normal curve will be, creating gaps between the bars of the histogram and the normal curve line.
● Even if the underlying variable is not normally distributed, the distribution of sample means will approach normality as sample size increases
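The convergence described above can be illustrated with a short simulation (a sketch using Python's standard library; the uniform distribution and the sample sizes are arbitrary choices):

```python
import random
import statistics

random.seed(42)

# Draw many samples from a distinctly non-normal (uniform) distribution
# and record the mean of each sample.
def sample_means(n_samples, sample_size):
    return [
        statistics.mean(random.random() for _ in range(sample_size))
        for _ in range(n_samples)
    ]

means = sample_means(n_samples=5000, sample_size=30)

# Per the CLT, the sample means cluster around the population mean
# (0.5 for uniform[0, 1]) and look bell-shaped even though the
# underlying variable is flat, not normal.
print(round(statistics.mean(means), 2))   # close to 0.5
print(round(statistics.stdev(means), 3))  # close to sqrt(1/12)/sqrt(30), about 0.053
```

Increasing `sample_size` tightens the spread of the sample means, which is the "better approximation with more samples" point above.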

2
Q

Measures of Central Tendency

A

● Mode: the most frequently occurring value of a variable of interest in your data set. The only one that works with nominal variables.
● Median: divides the sample in half, giving the exact middle value. If the set contains an even number of values, take the two middle values, add them together, then divide by 2.
● Mean: the sum of all values of a particular variable, where the level of measurement is interval/ratio, divided by the total number of observations in the sample.
○ “Average case”: you can get a mean where there is no actual value: ten people with 0 dollars and ten with 10 gives an average of 5, but no one actually has 5
○ Ecological fallacy: even though the average Canadian makes 50 thousand, be careful not to assume the same for the average person in Ontario or a particular suburb, because there may be important geographical variations
○ Outliers and small groups: compare a sample of incomes with Bill Gates in it and one without; Bill Gates walking into a bar makes the average income 20 million dollars
○ Using the mean for nominal or ordinal scales: inappropriate, since the mean requires interval/ratio measurement
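The three measures can be computed with Python's built-in statistics module (a minimal sketch; the income values are made up):

```python
import statistics

incomes = [20, 25, 25, 30, 40, 45, 60]  # hypothetical incomes (thousands)

print(statistics.mode(incomes))    # 25, the most frequent value
print(statistics.median(incomes))  # 30, the middle value of the sorted list
print(statistics.mean(incomes))    # 35, the sum divided by the number of observations

# With an even number of values, the median averages the two middle values:
print(statistics.median([1, 2, 3, 4]))  # 2.5
```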

3
Q

When is the mode, median, or mean most useful?

A

a) The mode is useful when:
i) Variables are measured at the nominal level because it gives us the most frequent score
ii) You want to know what most people think (e.g., in a survey)
b) The median is useful when:
i) Variables are measured at the ordinal level
ii) Variables measured at the interval-ratio level have highly skewed distributions; for example, when you have outliers, the median isn’t affected as much as the mean
iii) You want to find the central score
c) Mean is useful when:
i) Interval-ratio level (except for highly skewed distributions)
ii) When you want to report the typical score
iii) When you anticipate additional statistical analysis (central anchor for most statistical methods)

4
Q

Measures of Variability

A
● We want to know more than central tendency; we also want to know how the different values are spread and dispersed (variety, diversity, typical variation between scores, etc.)
● Knowing the average temperature of the month doesn’t tell you too much about the weather that month
● The greater the dispersion of a variable, the less useful or informative the measure of central tendency becomes
5
Q

Range, mean deviation, variance, standard deviation

A

1) The range: the spread between the lowest and highest value
a) Range = highest value - lowest value = H - L
b) Problem: heavily influenced by extreme values

2) The mean deviation: the average distance separating each value from the mean. It is calculated with absolute values, so there are no negative values.
a) Example: if the mean is 15, a value of 10 gives 10 − 15 = −5, while a value of 20 gives 20 − 15 = +5; without absolute values, these two would cancel each other out
b) MD = sum of the absolute distances to the mean for each case / total number of cases

3) Variance: measures dispersion again, but by computing the average squared distance from the mean value
a) Squaring, like using the absolute value, is a way of eliminating negative values
b) Variance = sum of squares / number of cases

4) Standard deviation: square root of the variance
a) Small variance (or std dev.)= cases are concentrated around the mean
b) As the term implies, gives us a standardized unit to measure how far a particular case is from the mean
c) It standardizes measurements and allows comparison
i) 5 steps:
(1) Subtract the mean from each score
(2) Square the deviations
(3) Sum the squared deviations
(4) Divide the sum of the squared deviations by N or N-1
(5) That’s the variance; take the square root of the result to get the standard deviation

The standard deviation is the average amount of variability in your dataset. It tells you, on average, how far each value lies from the mean.

A high standard deviation means that values are generally far from the mean, while a low standard deviation indicates that values are clustered close to the mean.
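The five steps above can be sketched directly in Python (using the population formula, dividing by N; the scores are made up):

```python
import math

scores = [4, 8, 6, 5, 7]
n = len(scores)

mean = sum(scores) / n                    # the mean of the scores
deviations = [x - mean for x in scores]   # 1) subtract the mean from each score
squared = [d ** 2 for d in deviations]    # 2) square the deviations
sum_sq = sum(squared)                     # 3) sum the squared deviations
variance = sum_sq / n                     # 4) divide by N (use n - 1 for a sample)
std_dev = math.sqrt(variance)             # 5) take the square root

print(variance, std_dev)  # 2.0 and about 1.41
```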

● The normal distribution is symmetrical, with mean = median = mode
○ So, 50% of observations below and 50% above the mean
● Any normal distribution, we know cutoff points:
○ 68.2% of values within 1 standard deviation from the mean
○ 95.44% of all values within 2
○ 99.74% of values within 3 standard deviations
● The main problem lies in differences in the scale of the variable used:
○ Sometimes we are measuring dollars in millions, age in years, deaths per 1000 births
○ So the units and the magnitude of the values differ
■ We can address this with the standard normal distribution
● Standardized to its equivalent with mean = 0 and standard deviation = 1
● Any normal distribution can be visualized as this one, by using the standard deviation as the unit for values (x-axis)
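The cutoff points listed above can be checked with Python's statistics.NormalDist, using the standard normal distribution (mean = 0, standard deviation = 1):

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, standard deviation 1

# Area within k standard deviations of the mean
for k in (1, 2, 3):
    area = z.cdf(k) - z.cdf(-k)
    print(k, round(area * 100, 2))  # roughly 68.3, 95.4, and 99.7 percent
```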

6
Q

Standard or z-scores

A

● While the standard deviation is useful, it doesn’t tell us where a specific value stands relative to the total population. We may know the SD for height, but we also want to know where specific values sit within the larger population: if I’m 6'4", how many people in the population am I taller than? You can answer this with the help of the z-score
● The equation for the standard score for a sample:
○ Standard score (z) = (an individual raw score − the sample mean) / the sample standard deviation

One-tailed: in a specific direction, when we want to find less than or more than the mean. In this case the z-score cutoff in Appendix A is different: capturing 95% of people means 50% on one side of the mean and 45% on the other
Two-tailed: no specific direction; capturing 95% of people means 47.5% on each side of the mean

While data points are referred to as x in a normal distribution, they are called z or z-scores in the z-distribution. A z-score is a standard score that tells you how many standard deviations away from the mean an individual value (x) lies:

A positive z-score means that your x-value is greater than the mean.
A negative z-score means that your x-value is less than the mean.
A z-score of zero means that your x-value is equal to the mean.

Example:
○ Question: find the IQ value below which 90% of the population lies
■ The entire left half of the normal distribution accounts for 50%
■ So the remaining 40% must lie between the mean and the cutoff on the right side
■ In the table, an area of 40% (0.40) corresponds roughly to z = 1.28
■ So we need the IQ value that would have given us a z of 1.28
■ On the chart this corresponds to an IQ of 119, so you can conclude that 10% of people have an IQ above 119
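The z-score formula and the IQ example can be verified with statistics.NormalDist (assuming the conventional IQ scale with mean 100 and SD 15, which matches the 119 answer above):

```python
from statistics import NormalDist

def z_score(x, mean, sd):
    # Standard score: how many standard deviations x lies from the mean
    return (x - mean) / sd

iq = NormalDist(mu=100, sigma=15)

# 90th percentile: the IQ below which 90% of the population falls
cutoff = iq.inv_cdf(0.90)
print(round(cutoff))                        # 119
print(round(z_score(cutoff, 100, 15), 2))   # 1.28
```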

7
Q

Confidence Intervals

A

A confidence interval represents a prediction (generalization) from a sample statistic to a population parameter. It expresses how much confidence we have that the mean taken from the sample accurately reflects the population mean. It provides:
1. A range of possible values as an estimate for where the real value falls
2. A level of confidence for that range
● E.g., in the previous example, we could calculate a confidence interval from our sample, and assert that the mean income for the Mongolian population is between $1,200 and $1,500 at a 95% confidence level.
○ this would mean that we are 95% confident that the mean of the population is between $1,200 and $1,500

Example of reporting:
“We found that both the US and Great Britain averaged 35 hours of television watched per week, although there was more variation in the estimate for Great Britain (95% CI = 33.04, 36.96) than for the US (95% CI = 34.02, 35.98).”
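A 95% confidence interval for a sample mean can be sketched as follows (using z = 1.96, appropriate for large samples; the hours-of-TV data are made up):

```python
import math
import statistics

data = [33, 35, 34, 36, 35, 37, 33, 36, 35, 34]  # hypothetical hours of TV per week

n = len(data)
mean = statistics.mean(data)
se = statistics.stdev(data) / math.sqrt(n)  # standard error of the mean

margin = 1.96 * se  # 95% confidence level
lower, upper = mean - margin, mean + margin

print(f"95% CI = ({lower:.2f}, {upper:.2f})")
```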

8
Q

Confidence Interval: Width of Interval when we want a higher level of confidence

A

● the width needs to increase to ensure that we include more possible values; as the range widens, there is a higher probability that the population value falls within that range
● the reverse is true for a lower level of confidence

What happens to the width of the interval if, for a given confidence level we increase the sample size?
● a larger sample size decreases the standard error, so it decreases the margin of error. For the same confidence level, the interval width will narrow
● the reverse is true if we decrease the sample size

Building confidence intervals thus involves a tradeoff:
● a narrower and more precise interval is more helpful, but it comes with a lower level of confidence unless you can increase the sample size, and vice versa
● A wider interval is not very useful even though you would be more confident in it; in practice, stick to 95% confidence, which corresponds to z = 1.96
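The tradeoff can be seen numerically: for a fixed standard deviation and confidence level, quadrupling the sample size halves the interval width, while raising the confidence level widens it (a sketch with assumed values):

```python
import math

def ci_width(sd, n, z=1.96):
    # Full width of the confidence interval: 2 * z * (sd / sqrt(n))
    return 2 * z * sd / math.sqrt(n)

sd = 10
print(round(ci_width(sd, 100), 2))  # width at n = 100
print(round(ci_width(sd, 400), 2))  # n = 400: half the width

# A higher confidence level (z roughly 2.58 for 99%) widens the interval instead
print(round(ci_width(sd, 100, z=2.58), 2))
```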

9
Q

T-distribution

A

● For a sample size under 120, we use different estimates:
○ the t-distribution table (Appendix B) approaches the normal distribution when sample size is large enough
○ for a sample size under 120, we should be using the t-table instead of the z-table

To look in the t-table, we need two things:
● the degrees of freedom = sample size minus one (N-1)
● a critical value (the cutoff corresponding to the level of confidence we are using, e.g. 90%, 95%, etc.)

As the degrees of freedom (total number of observations minus 1) increases, the t-distribution will get closer and closer to matching the standard normal distribution, a.k.a. the z-distribution, until they are almost identical.

Above 30 degrees of freedom, the t-distribution roughly matches the z-distribution. Therefore, the z-distribution can be used in place of the t-distribution with large sample sizes.
