Midterm Flashcards
research
disciplined inquiry into questions and theories
statistics
organizing numbers and data
quantitative research
stats (organizing numbers and data) + disseminating results
wheel of science
theory > hypothesis > observations > empirical generalizations
descriptive vs inferential statistics
descriptive: what is going on in the data? can be univariate, bivariate, or multivariate
inferential: generalizing from a sample to the population
independent & dependent variables
independent variables lead to dependent variables
x = what is doing the predicting, y = what is being predicted
discrete vs continuous variables
whole number measurements vs fractional measurements
nominal vs ordinal vs interval vs ratio
categories vs ranked variables vs numbers without true zero vs numbers with true zero
percentages and proportions
about conceptualizing data. proportions are (f/n), percentages are (f/n) x 100, where f = frequency (number of cases in the category) and n = total number of cases
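A minimal Python sketch of the two formulas, assuming f is the number of cases in a category and n is the total number of cases. The function name and the marital-status data are made up for illustration.

```python
from collections import Counter

def proportions_and_percentages(values):
    """Return {category: (proportion, percentage)} for a list of categories."""
    n = len(values)              # n = total number of cases
    counts = Counter(values)     # f = number of cases in each category
    return {cat: (f / n, f / n * 100) for cat, f in counts.items()}

# Hypothetical data: 10 respondents
status = ["married"] * 5 + ["single"] * 3 + ["divorced"] * 2
result = proportions_and_percentages(status)
print(result["married"])   # (0.5, 50.0)
```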
good graphs are….
theoretically motivated, easy to understand, useful
central tendency
the most typical/common/central score. describes data, makes certain characteristics easy to understand
mean and median and mode are all the same when…
the distribution is symmetrical and unimodal, e.g. a normal curve
dispersion
how much variation is in the scores? when there is less dispersion, the curve is taller and narrower, and when there is more dispersion, the curve is flatter and wider
variation ratio
simple measure of statistical dispersion in nominal distributions; it is the simplest measure of qualitative variation.
v = 1 - fm/n, where fm = the number of cases in the mode, and n = total number of cases
i.e. the proportion of cases not in a modal category
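The formula v = 1 - fm/n can be sketched in Python like this. The function name and the example categories are hypothetical.

```python
from collections import Counter

def variation_ratio(values):
    """Proportion of cases that are NOT in the modal category."""
    counts = Counter(values)
    fm = max(counts.values())    # fm = number of cases in the mode
    n = len(values)              # n = total number of cases
    return 1 - fm / n

# Hypothetical data: 6 cases in the modal category out of 10
majors = ["soc"] * 6 + ["psych"] * 3 + ["econ"] * 1
print(round(variation_ratio(majors), 2))   # 0.4
```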
determining median in even number of cases
average of the two middle scores
interquartile range
the distance between 3rd and 1st quartile i.e. middle 50%
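A short Python sketch of the median (even number of cases) and the interquartile range. Quartiles can be computed under several conventions; this sketch assumes the "inclusive" linear-interpolation method from Python's statistics module, so a textbook using a different rule may give slightly different quartiles. The scores are made up.

```python
import statistics

def median_and_iqr(scores):
    """Return (median, IQR) using linear interpolation between data points."""
    q1, _q2, q3 = statistics.quantiles(scores, n=4, method="inclusive")
    return statistics.median(scores), q3 - q1

scores = [2, 4, 6, 8, 10, 12, 14, 16]
median, iqr = median_and_iqr(scores)
print(median)   # 9.0, the average of the two middle scores (8 and 10)
print(iqr)      # 7.0 under this convention (Q1 = 5.5, Q3 = 12.5)
```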
all scores ________ the mean
all scores cancel out around the mean (the deviations from the mean sum to zero)
mean is the point of ________
mean is the point of minimized variation (the sum of squared deviations is smallest around the mean)
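Both properties of the mean can be checked numerically. This is a small sketch with made-up scores, reading "minimized variation" as the sum of squared deviations.

```python
scores = [2, 4, 6, 8, 10]
mean = sum(scores) / len(scores)   # 6.0

def sum_of_squares(center):
    """Sum of squared deviations of the scores around a chosen center."""
    return sum((x - center) ** 2 for x in scores)

# Deviations from the mean cancel out
print(sum(x - mean for x in scores))          # 0.0

# Squared deviations are smaller around the mean than around nearby points
print(sum_of_squares(mean))                   # 40.0
print(sum_of_squares(5), sum_of_squares(7))   # 45 45
```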
when there is positive skew, x-bar is ____ relative to the median
when there is positive skew, x-bar is greater than the median
when there is negative skew, x-bar is ____ relative to the median
when there is negative skew, x-bar is less than the median
when there is no skew, x-bar is ____ relative to the median
when there is no skew, x-bar is equal to the median
when there is a positive skew, the shape of the curve is…
when there is positive skew, the shape of the curve is stretched out towards the right, with the “lump” being further to the left.
when there is a negative skew, the shape of the curve is..
when there is a negative skew, the shape of the curve is stretched out towards the left, with the “lump” being further to the right
standard deviation
the average distance from the mean
square root of the average squared difference from the mean
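The definition translates almost word for word into Python. This sketch divides by n (the "average" in the card); the sample version many courses use divides by n - 1 instead. The scores are made up.

```python
import math
import statistics

def population_sd(scores):
    """Square root of the average squared difference from the mean."""
    mean = sum(scores) / len(scores)
    squared_diffs = [(x - mean) ** 2 for x in scores]
    return math.sqrt(sum(squared_diffs) / len(scores))

scores = [2, 4, 4, 4, 5, 5, 7, 9]
print(population_sd(scores))        # 2.0
print(statistics.pstdev(scores))    # 2.0, same result from the standard library
```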
box plots
the box indicates the middle 50%. the lower boundary of the box represents the first quartile (i.e. the point that 25% of the sample lies under) and the upper boundary of the box represents the third quartile (i.e. the point that 75% of the sample lies under). The line through the box indicates the median. The whiskers extend up to 1.5 x IQR beyond the box. Outliers (points beyond the whiskers) are often plotted individually.
normal curve
theoretical, bell shaped, unimodal, symmetrical, mode/mean/median is equal
+/- 1 standard deviation captures __% of the sample
+/- 1 standard deviation captures 68.26% of the sample
+/- 2 standard deviations captures __% of the sample
+/- 2 standard deviations captures 95.44% of the sample
+/- 3 standard deviations captures __% of the sample
+/- 3 standard deviations captures 99.72% of the sample
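These areas can be recomputed from the standard normal curve. A sketch using Python's statistics.NormalDist; the exact areas differ from the card values in the second decimal, presumably because the cards come from a rounded z-table.

```python
from statistics import NormalDist

def area_within(k):
    """Area under the standard normal curve within +/- k standard deviations."""
    z = NormalDist()   # mean 0, standard deviation 1
    return z.cdf(k) - z.cdf(-k)

for k in (1, 2, 3):
    print(k, round(area_within(k) * 100, 2))
# 1 68.27
# 2 95.45
# 3 99.73
```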
z-score
z-score is a position along the normal curve; it indicates the number of standard deviations a score falls above or below the mean. e.g. a z-score of 1 means that the data point is 1 standard deviation above the mean
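As a formula, z = (score - mean) / standard deviation. A one-function Python sketch with made-up exam numbers:

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

# Hypothetical exam: mean 70, standard deviation 10
print(z_score(85, 70, 10))   # 1.5
print(z_score(60, 70, 10))   # -1.0
```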
population and parameter are analogous with…
population and parameter are analogous with sample and statistic.
in other words, statistics are characteristics of the sample, and parameters are characteristics of the population
EPSEM
equal probability of selection method
sampling distribution
theoretical concept that links the sample to the population. The sampling distribution is normal in shape, its mean is equal to the population mean, and its standard deviation (the standard error) is equal to the population standard deviation/sqrt(N).
The sampling distribution represents the distribution of the point estimates based on samples of a fixed size from a certain population.
law of large numbers
the larger the sample, the closer the sample statistic gets to the population value.
The law of large numbers is a principle of probability according to which the frequencies of events with the same likelihood of occurrence even out, given enough trials or instances.
So if you flip 10 coins, you may get 90% heads and 10% tails, but if you flip 100 coins, you’re more likely to get closer to 50% heads and 50% tails. The proportion of heads after n flips will almost surely converge to 1/2 as n approaches infinity.
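The coin example can be simulated. A rough sketch with a seeded pseudo-random generator; the function name and seed are arbitrary choices, and the exact proportions depend on the seed, so no specific outputs are claimed.

```python
import random

def proportion_heads(n_flips, seed=0):
    """Simulate n_flips fair coin flips and return the proportion of heads."""
    rng = random.Random(seed)
    heads = sum(rng.random() < 0.5 for _ in range(n_flips))
    return heads / n_flips

# The proportion tends to settle near 0.5 as the number of flips grows
for n in (10, 100, 10_000, 100_000):
    print(n, proportion_heads(n))
```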
central limit theorem
The central limit theorem states that if you have a population with mean μ and standard deviation σ and take sufficiently large random samples from the population with replacement, then the distribution of the sample means will be approximately normally distributed. This will hold true regardless of whether the source population is normal or skewed, provided the sample size is sufficiently large (usually n > 30).
the average of your sample means will be the population mean
standard error
standard deviation of the sampling distribution
e.g. plotting the means of 50 samples of 10 would give you a normal curve with a standard deviation
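The sampling distribution, central limit theorem, and standard error can be illustrated with one simulation. This sketch assumes a skewed population (exponential, with mean 1 and standard deviation 1) and draws many samples of size 50; all names, sizes, and the seed are illustrative choices.

```python
import math
import random
import statistics

def simulate_sample_means(n, n_samples=2000, seed=1):
    """Draw n_samples samples of size n from a skewed population; return their means."""
    rng = random.Random(seed)
    return [
        statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
        for _ in range(n_samples)
    ]

means = simulate_sample_means(50)

# The average of the sample means should land close to the population mean (1.0)
print(statistics.fmean(means))

# The standard deviation of the sample means (the standard error)
# should land close to sigma / sqrt(n)
print(statistics.stdev(means), 1 / math.sqrt(50))
```

A histogram of `means` would look roughly bell shaped even though the population itself is strongly skewed.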
point estimate
a single statistic used to infer info about the population
e.g. taking the mean of the heights of a sample of students and inferring the mean of the heights of all students from the sample mean
criteria for choosing estimators
- bias: an estimator is unbiased if the mean of its sampling distribution is equal to the population value of interest.
- efficiency: the extent to which the sampling distribution is clustered around its mean
z-score for a 95% confidence interval
1.96
alpha
how certain do you want to be?
e.g. alpha = 0.05 means a confidence level of 95%
every alpha has a z-score associated with it
e.g. alpha = 0.05 has a z-score of 1.96
constructing confidence intervals for means
(1) set the alpha
(2) find the z-score associated with that alpha
(3) use formula for confidence intervals with sample means
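The three steps above can be sketched in Python. This assumes the common form CI = x-bar +/- z * (s / sqrt(n)); some textbooks use sqrt(n - 1) in the denominator, so check which formula your course uses. The function name and the numbers are made up.

```python
import math
from statistics import NormalDist

def confidence_interval(sample_mean, sample_sd, n, alpha=0.05):
    """Confidence interval for a mean using the normal (z) distribution."""
    z = NormalDist().inv_cdf(1 - alpha / 2)    # step 2: z-score for this alpha
    margin = z * sample_sd / math.sqrt(n)      # z times the standard error
    return sample_mean - margin, sample_mean + margin

# Step 1: alpha = 0.05 (95% confidence)
low, high = confidence_interval(100, 15, 225)
print(round(low, 2), round(high, 2))           # 98.04 101.96

# A smaller alpha (99% confidence) gives a wider interval
low99, high99 = confidence_interval(100, 15, 225, alpha=0.01)
print(round(low99, 2), round(high99, 2))       # 97.42 102.58
```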
the bigger the sample the _____ the width of the confidence interval because _______.
the bigger the sample the smaller the width of the confidence interval because standard error is smaller.
what would you do to increase the confidence interval?
decrease the alpha, e.g. instead of alpha = 0.05 (CI 95%), set alpha to 0.01 (CI 99%).
confidence interval _____ as confidence level _____.
confidence interval widens as confidence level increases.
Null hypothesis vs alternative hypothesis
null hypothesis (H0) always says there is no significant difference. alternative hypothesis (HA) says there is a significant difference. We always assume that the null is true.
what is a hypothesis test?
- make a hypothesis
- use z-score formula to determine probability of getting the observed difference: “this difference is statistically significant at the alpha = 0.05 level.”
- trying to identify statistically significant differences that didn’t occur by chance
5 step model of hypothesis testing: one sample case
(1) make assumptions - level of measurement is interval-ratio, sampling distribution is normal (basically n > 120)
(2) state null hypothesis
(3) select sampling distribution and establish a critical region
(4) compute the test statistic
(5) make decision and interpret the results, either rejecting the null or failing to reject the null
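A sketch of steps 3 to 5 for a two-tailed, one-sample z test. It assumes the test statistic z = (x-bar - mu) / (s / sqrt(n)); a textbook may use sqrt(n - 1) instead. The function name and the numbers are hypothetical.

```python
import math
from statistics import NormalDist

def one_sample_z_test(sample_mean, pop_mean, sample_sd, n, alpha=0.05):
    """Two-tailed one-sample z test; returns (z obtained, z critical, reject null?)."""
    z_critical = NormalDist().inv_cdf(1 - alpha / 2)                     # step 3
    z_obtained = (sample_mean - pop_mean) / (sample_sd / math.sqrt(n))   # step 4
    reject_null = abs(z_obtained) > z_critical                           # step 5
    return z_obtained, z_critical, reject_null

# Hypothetical: sample mean 103, population mean 100, s = 15, n = 225
z, crit, reject = one_sample_z_test(103, 100, 15, 225)
print(round(z, 2), round(crit, 2), reject)   # 3.0 1.96 True
```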
one-tailed vs two-tailed test
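one-tailed = “significantly less/more”, critical region in one tail only: +1.65 or -1.65 at alpha = 0.05
two-tailed = “significantly different”, critical region split between both tails: +/- 1.96 at alpha = 0.05
one-tailed is stronger (easier to reject the null when the direction is predicted correctly).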
alpha levels affect what in hypothesis testing?
critical region
larger alpha = larger critical region (smaller critical z-score); smaller alpha = smaller critical region (larger critical z-score)
e.g. alpha = 0.05, critical region +/- 1.96, alpha = 0.10, critical region +/- 1.65
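Critical z-scores for any alpha can be looked up from the standard normal curve. A sketch using statistics.NormalDist; the helper name and its `tails` parameter are made up. Note that 1.645 is usually rounded to 1.65 in printed tables.

```python
from statistics import NormalDist

def critical_z(alpha, tails=2):
    """Critical z-score that cuts off alpha in one tail or split across two tails."""
    return NormalDist().inv_cdf(1 - alpha / tails)

print(round(critical_z(0.05), 3))            # 1.96  (two-tailed)
print(round(critical_z(0.10), 3))            # 1.645 (two-tailed)
print(round(critical_z(0.01), 3))            # 2.576 (two-tailed)
print(round(critical_z(0.05, tails=1), 3))   # 1.645 (one-tailed)
```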
type I error
rejecting true null hypothesis. aka alpha error. this happens when the thing occurred by random chance but you claimed that it was significantly different. you can avoid type I error by decreasing the alpha, e.g. saying you want to be 99% sure instead of 95% sure that something is statistically significant.
type II error
failing to reject false null hypothesis. aka beta error. this happens when the thing was actually significantly different but you claimed that it was not statistically different and happened by random chance. you can avoid type II error by increasing the alpha, e.g. saying you want to be 95% sure instead of 99% sure.
degrees of freedom
(n-1)
student’s t distribution
used for smaller samples (n < 120) when the population standard deviation is unknown. the student's t distribution is flatter and wider than the z-distribution, and gets closer to it as degrees of freedom increase.
two sample test of means for large samples
(1) make assumptions - the samples must be independent random samples i.e. mutually exclusive; interval-ratio measurements; sampling distribution is normal (basically n > 120)
(2) State the null hypothesis
(3) select sampling distribution and establish critical region
(4) compute test statistic
(5) make decision and interpret results
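A sketch of the test statistic for two large independent samples. It assumes z = (x-bar1 - x-bar2) / sqrt(s1^2/n1 + s2^2/n2); a textbook may use n - 1 in the denominators. The function name and the numbers are hypothetical.

```python
import math

def two_sample_z(mean1, sd1, n1, mean2, sd2, n2):
    """z statistic for the difference between two independent sample means."""
    se_diff = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)   # standard error of the difference
    return (mean1 - mean2) / se_diff

# Hypothetical: two groups of 200, means 72 and 70, both s = 10
z = two_sample_z(72, 10, 200, 70, 10, 200)
print(round(z, 2))   # 2.0, beyond +/- 1.96, so reject the null at alpha = 0.05
```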
two sample test of means for small samples
(1) make assumptions - the samples must be independent random samples i.e. mutually exclusive; interval-ratio measurements; population variances are equal (as long as the 2 samples are approximately the same size, we can make this assumption); sampling distribution is normal (because we’re using small samples, we have to add the previous assumption in order to make this one)
(2) State the null hypothesis
(3) select sampling distribution and establish critical region
(4) compute test statistic
(5) make decision and interpret results
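For small samples the test statistic is a t score with n1 + n2 - 2 degrees of freedom. This sketch assumes the standard pooled-variance formula, which may differ slightly from the version in a given textbook. The function name and the numbers are made up, and the critical t value still has to come from a t table.

```python
import math

def pooled_t(mean1, sd1, n1, mean2, sd2, n2):
    """Pooled-variance t statistic and degrees of freedom for two small samples."""
    df = n1 + n2 - 2
    # Pooled variance: weighted average of the two sample variances
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    se_diff = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / se_diff, df

# Hypothetical: two groups of 10, means 25 and 21, both s = 4
t, df = pooled_t(25, 4, 10, 21, 4, 10)
print(round(t, 2), df)   # 2.24 18
```

With df = 18 and alpha = 0.05 (two-tailed), the table value is +/- 2.101, so this hypothetical difference would be significant.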
two sample test for proportions
(1) make assumptions - the samples must be independent random samples i.e. mutually exclusive; nominal measurements; sampling distribution is normal (basically n > 120)
(2) State the null hypothesis
(3) select sampling distribution and establish critical region
(4) compute test statistic
(5) make decision and interpret results
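A sketch of the test statistic for two sample proportions. It assumes the usual pooled-proportion standard error under the null hypothesis; the function name and the numbers are hypothetical.

```python
import math

def two_proportion_z(p1, n1, p2, n2):
    """z statistic for the difference between two independent sample proportions."""
    pooled = (n1 * p1 + n2 * p2) / (n1 + n2)   # pooled estimate of the population proportion
    se_diff = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se_diff

# Hypothetical: 60% of 200 cases vs 50% of 200 cases
print(round(two_proportion_z(0.60, 200, 0.50, 200), 2))   # 2.01
```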
significance vs importance
differences that are otherwise trivial or uninteresting may be significant. Significance just states whether a difference is real (does the difference in our sample also exist in the population?), but it doesn’t say if it is an important difference. The substantive importance is up for interpretation.
test statistics get ____ as n gets ____.
test statistics (like z or t) get larger as n gets larger (so p-values get smaller).
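A quick illustration of why sample size matters: the same 1-point difference produces very different z scores depending on n. This sketch assumes a one-sample z statistic with s = 15; the function name and the numbers are made up.

```python
import math

def z_for_difference(diff, sd, n):
    """z statistic for a fixed difference between sample mean and population mean."""
    return diff / (sd / math.sqrt(n))

for n in (100, 900, 10_000):
    print(n, round(z_for_difference(1, 15, n), 2))
# 100 0.67
# 900 2.0
# 10000 6.67
```

The difference itself never changes, which is why a significant result is not automatically an important one.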
confidence interval vs two sample test
when you’re using the two-sample test, you’re taking both estimates of the means and both standard deviations into account. So there is still a possibility of the error bars overlapping but the difference still being statistically significant.
what is the variance of a standard normal curve?
1
population values can be estimated with…
sample values
what is a point estimate?
the use of sample data to calculate a single value (known as a statistic) which is to serve as a “best guess” or “best estimate” of an unknown (fixed or random) population parameter
which sample statistics are unbiased?
means and proportions
what is efficiency?
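Basically sample size: efficiency is how tightly the sampling distribution is clustered around its mean, and larger samples make an estimator more efficient.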
The (larger/smaller) the sample size, the (higher/lower) the value of the standard deviation of the sampling distribution.
larger, lower
The (larger or smaller) the sample size, the more tightly clustered the sample outcomes will be around the mean of the sampling distribution.
larger
difference between point estimates and interval estimates
point estimate: we estimate the population value is the same as the sample statistic
interval estimate: we construct a confidence interval, a range of values into which we estimate the population value