Final Exam Flashcards
Descriptive statistics
Statistical tools to organize and summarize data
- information about a collection of observations (their central tendency)
- information about the variability in a set of observations
- information about the shape of a distribution of observations
Inferential statistics
Statistical tools to generalize beyond collections (samples) of actual observations in order to make predictions and test hypotheses about the general population
Population
Any complete collection of observations or potential observations (ENTIRE group of interest)
- population characteristics are called parameters
- μ, σ
Real population
All potential observations are available at the time of sampling
- ex. anxiety scores of current participants in a meditation program
Hypothetical population
One in which not all potential observations are available at the time of sampling
Sample
Any smaller collection of actual observations drawn from a population
- sample characteristics are called statistics
- x̅, s
Level of measurement
Specifies the extent to which a number, word, letter, etc. represents something in the world
Nominal
- Words, letters, or numerical codes
- Observations are sorted into categories, no order
Ordinal
- Values have an inherent, logical order
- No equal intervals
Interval
The distance between consecutive points on the scale is the same all the way along the scale
Ratio
Amounts or counts of quantitative data that reflect differences in degree based on equal intervals and a true zero
Qualitative data
Consists of words, letters, or numerical codes that represent a class or category
Quantitative data
Consists of numbers that represent an amount or a count
Why is data type important?
We use different statistical tests depending on the type of data we have collected
Frequency distribution
A collection of observations produced by sorting observations into classes and showing their frequency (f) of occurrence
Ungrouped frequency distribution
- Frequencies are tallied for each and every value
- Each class has a single value
- Only use these for data sets that have ≤ 20 values
Grouped frequency distribution
- Observations are sorted into classes of multiple values
- Use for data sets with > 20 values
Relative frequency
Shows the frequency of each class as a proportion (fraction) of the total frequency for the entire distribution
- frequency per class/total
Cumulative frequency
Shows the total number of observations in a class and all lower-ranking classes
- add up from the bottom
Cumulative relative frequency
Shows the cumulative frequency of each class as a proportion of the total
- divide the cumulative frequency by the total
Percentile rank
Percentage of scores in the entire distribution with values equal to or smaller than a given score
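A minimal Python sketch of how the frequency columns above fit together, using made-up scores. The function name `frequency_table` and the data are illustrative assumptions, not part of the course material.

```python
from collections import Counter

def frequency_table(scores):
    """Return rows of (value, f, relative f, cumulative f, cumulative relative f)."""
    counts = Counter(scores)
    total = len(scores)
    rows = []
    cumulative = 0
    for value in sorted(counts):      # start with the lowest-ranking class
        f = counts[value]
        cumulative += f               # add up from the bottom
        rows.append((value, f, f / total, cumulative, cumulative / total))
    return rows

scores = [1, 2, 2, 3, 3, 3, 4, 4, 5, 5]   # made-up data
table = frequency_table(scores)
for row in table:
    print(row)
```

Multiplying a cumulative relative frequency by 100 gives the percentile rank of that value.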
Measures of central tendency
Means, medians, and modes
Mean
The average
- sum of all scores/number of scores
Median
The middle value when observations are ordered from smallest to largest (or vice versa)
Mode
The most frequent score in a distribution
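As a quick illustration, the three measures of central tendency can be computed with Python's standard `statistics` module. The scores are made up.

```python
import statistics

scores = [2, 3, 3, 5, 7]   # made-up data, already in order

mean = statistics.mean(scores)       # sum of all scores / number of scores
median = statistics.median(scores)   # middle value of the ordered scores
mode = statistics.mode(scores)       # most frequent score

print(mean, median, mode)
```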
Variability
The degree to which scores are spread out across a distribution
- range
- variance
- standard deviation
Range
Highest value – lowest value
Variance
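A measure of how far data points fall from the mean, computed as the mean of the squared deviations from the mean
- population: σ^2 = sum of squares/N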
Standard deviation
A measure of how dispersed the data is in relation to the mean
- population: σ = sqrt(sum of squares/N)
- sample: s = sqrt(sum of squares/(n - 1))
Sum of squares
A statistical measure of deviation from the mean
- population: SS = Σ(x - μ)^2
- sample: SS = Σ(x - x̅)^2
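A small sketch of how range, sum of squares, variance, and standard deviation relate, assuming the usual definitions: variance is SS divided by N for a population or by n - 1 for a sample, and the standard deviation is the square root of the variance. The scores and the helper name `sum_of_squares` are made up for illustration.

```python
import math

def sum_of_squares(scores):
    """SS = sum of squared deviations from the mean."""
    mean = sum(scores) / len(scores)
    return sum((x - mean) ** 2 for x in scores)

scores = [2, 4, 4, 4, 5, 5, 7, 9]          # made-up data, mean = 5

value_range = max(scores) - min(scores)     # highest value - lowest value
ss = sum_of_squares(scores)

pop_variance = ss / len(scores)             # treats the scores as a population
pop_sd = math.sqrt(pop_variance)

sample_variance = ss / (len(scores) - 1)    # treats the scores as a sample
sample_sd = math.sqrt(sample_variance)

print(value_range, ss, pop_variance, pop_sd, sample_variance, sample_sd)
```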
Negatively skewed distribution
The majority of observations are at the high end of the distribution, with few low scores (the tail points toward the low end)
- ex. retirement ages, scores on an easy test
Positively skewed distribution
Most scores are at the low end of the distribution, with few high scores
- ex. U.S. incomes, scores on a very difficult test
The normal distribution
- Most of the area under the curve falls in the middle
- No skew, a bell curve
- Symmetrical
- Mean = median = mode
- Half of scores fall on either side of the mean
- Total area under the curve = 1.00 or 100%
- X-axis is in the units of the original measurement (lbs, inches, mph)
- ex. IQ, height, weight
Standard normal distribution
- X-axis is in standard deviation units (x-axis can be turned into Z-scores)
- Mean is always 0
- Standard deviation is always 1
Z-score
A unit-free, standardized score that indicates how many standard deviations a score is above or below the mean
- can be positive or negative, unlike a standard deviation, which is never negative (scores above the mean have positive z-scores, scores below the mean have negative z-scores)
- population: z = (x - μ)/σ
- sample: z = (x - x̅)/s
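The z-score formula translates directly into a one-line function. The numbers below (a score of 130 or 85 on a scale with mean 100 and standard deviation 15) are assumed for illustration only.

```python
def z_score(x, mean, sd):
    """How many standard deviations x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

z_high = z_score(130, 100, 15)   # score above the mean
z_low = z_score(85, 100, 15)     # score below the mean

print(z_high, z_low)
```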
Table A / Z Table
Provides z-scores and their associated areas under the curve
How to use table A/the Z table
- Sketch the problem, know what you’re looking for, and plan the solution
- Calculate the necessary z-scores
- Find the appropriate areas under the standard normal curve in table A
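The areas printed in a z table can also be looked up in code. This sketch uses `statistics.NormalDist` from the Python standard library (Python 3.8+), whose `cdf` method returns the area under the curve below a given z-score. The z values chosen are only examples.

```python
from statistics import NormalDist

standard_normal = NormalDist()   # mean 0, standard deviation 1

# Area below z = 1.00 (proportion of scores at or below that z-score)
area_below = standard_normal.cdf(1.0)

# Area above z = 1.00 is the rest of the curve
area_above = 1 - area_below

# Area between z = -1.96 and z = +1.96
area_between = standard_normal.cdf(1.96) - standard_normal.cdf(-1.96)

print(round(area_below, 4), round(area_above, 4), round(area_between, 4))
```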
Correlation
The relationship between variables, and how paired values of two variables change together (ex. height and weight, years of education and annual income, medication and anxiety)
- described as positive or negative, strong, moderate, or weak
Positive correlations
As one variable increases, the other increases (as one decreases, the other also decreases)
Negative correlations
- As one variable increases, the other decreases
- As one variable decreases, the other increases
Scatterplot
Graphs showing individual data points plotted as combinations of two variables
- useful for determining the direction of a relationship (negative or positive)
- useful for determining the strength of a relationship (strong, moderate, weak)
Pearson’s r
Describes the strength of correlation and direction of the relationship
- r = (Σ ZxZy)/(n-1)
- ranges from -1 to +1
- direction indicated by sign (+ or -)
- strength indicated by value (0 = no relationship, ±1 = perfect relationship)
- 0 < |r| < .3 = weak correlation
- .3 ≤ |r| ≤ .7 = moderate correlation
- |r| > .7 = strong correlation
- correlation coefficient
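A minimal sketch of the z-score formula for Pearson's r, r = (Σ ZxZy)/(n-1), using sample standard deviations. The function name `pearson_r` and the paired data are made up for illustration.

```python
import statistics

def pearson_r(xs, ys):
    """Pearson's r as the sum of z-score products divided by n - 1."""
    n = len(xs)
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)   # sample SDs (n - 1)
    z_products = [((x - mean_x) / s_x) * ((y - mean_y) / s_y)
                  for x, y in zip(xs, ys)]
    return sum(z_products) / (n - 1)

x = [1, 2, 3, 4, 5]          # made-up paired observations
y = [2, 4, 5, 4, 5]
r = pearson_r(x, y)

perfect = pearson_r(x, [2 * value for value in x])   # y rises in lockstep with x

print(round(r, 4), round(perfect, 4))
```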
Coefficient of determination (r^2)
The proportion of variance in one variable that is explained/predicted by its relationship with the other variable (often reported as a percentage)
- ex. r^2 = (.94)^2 = .88
- 88% of the variation in psych GRE score is explained by the relationship between grades on a cognition final and psych GRE scores
- 1 - r^2 = (1 - .88) = .12 tells me that 12% of the variation in psych GRE scores is NOT explained by the relationship between grades on a cognition final and psych GRE scores
Linear regression
Plots a straight line through a cluster of dots on a scatterplot, and uses that line to predict the value of one variable from the value of another
Least squares regression line
Best-fitting line for a set of data; it minimizes the sum of the squared vertical distances (prediction errors) from each data point to the line
- Y’ = bx + a
- Y’ = predicted value
- x = value for which we are predicting y
- b = slope of regression line = r(sqrt((SSy)/(SSx)))
- a = y-intercept of the regression line = ȳ - bx̄
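The slope and intercept formulas above can be sketched as follows. Here r is computed from sums of squares and cross-products, an algebraically equivalent form of Pearson's r that is assumed for convenience. The function name and data are made up.

```python
import math

def least_squares_line(xs, ys):
    """Return (b, a) for the line Y' = bx + a."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    sp = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    r = sp / math.sqrt(ss_x * ss_y)    # Pearson's r
    b = r * math.sqrt(ss_y / ss_x)     # slope
    a = mean_y - b * mean_x            # y-intercept
    return b, a

x = [1, 2, 3, 4, 5]          # made-up paired observations
y = [2, 4, 5, 4, 5]
b, a = least_squares_line(x, y)

predicted = b * 6 + a        # predicted y for x = 6

print(round(b, 2), round(a, 2), round(predicted, 2))
```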
Standard error of the estimate
A rough measure of the average amount of prediction error; it estimates how accurate predictions made from the regression line are
- sy|x = sqrt((Σ (y - y’)^2)/(n-2))
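A short sketch of the standard error of the estimate, computed as sqrt(Σ(y - y')^2/(n - 2)). The data are made up, and the slope (0.6) and intercept (2.2) are assumed to be the least squares values already fitted to these data.

```python
import math

def standard_error_of_estimate(xs, ys, b, a):
    """sqrt of the summed squared prediction errors divided by n - 2."""
    squared_errors = [(y - (b * x + a)) ** 2 for x, y in zip(xs, ys)]
    return math.sqrt(sum(squared_errors) / (len(xs) - 2))

x = [1, 2, 3, 4, 5]          # made-up paired observations
y = [2, 4, 5, 4, 5]
see = standard_error_of_estimate(x, y, b=0.6, a=2.2)

print(round(see, 4))
```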
Independent variable
A variable (or treatment) manipulated by the investigator in an experiment
Dependent variable
The variable believed to be influenced (changed) by the IV
Sampling distribution of the mean
Refers to the probability distribution of means for all possible random samples of a given size from some population
- mean = same as population mean
- shape will approximate a normal curve if sample size is sufficiently large (central limit theorem)
Standard error of the mean
The sampling distribution’s standard deviation
- σx̅ = σ / √n
- measures variability in the sampling distribution
- extent to which sample means vary around their mean
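For example, the standard error of the mean shrinks as the sample size grows. The values σ = 15 and n = 25 or 100 are assumed for illustration.

```python
import math

def standard_error_of_mean(sigma, n):
    """Standard deviation of the sampling distribution of the mean."""
    return sigma / math.sqrt(n)

se_25 = standard_error_of_mean(15, 25)     # 15 / 5
se_100 = standard_error_of_mean(15, 100)   # 15 / 10; larger n, smaller error

print(se_25, se_100)
```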
Null hypothesis
A statistical hypothesis that nothing special is going on in the sample with respect to a specific characteristic of the underlying population; the hypothesis of no difference
Alternative hypothesis
Opposite of null; states that the sample is special or different from the population
Significance level
Indicates how rare a sample mean must be to reject the null hypothesis
- α (alpha)
Type I error (α)
Rejecting a null hypothesis when it is in fact true
Type II Error (β)
Retaining (failing to reject) a null hypothesis when it is in fact false; β is the probability of making this error
Confidence interval
A range of values that, with a known degree of certainty, includes an unknown population characteristic
- x̅ ± (Zconf)(σx̅)
- Zconf is the critical z value used in the decision rule
- a 95% CI is a range of values that in the long run would contain the parameter of interest 95% of the time
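A minimal sketch of the interval x̅ ± (Zconf)(σx̅), assuming the population standard deviation is known. Zconf is obtained from `statistics.NormalDist().inv_cdf` (Python 3.8+) rather than a printed table. The function name and the numbers are made up.

```python
import math
from statistics import NormalDist

def z_confidence_interval(sample_mean, sigma, n, confidence=0.95):
    """Return (lower, upper) limits of a z-based confidence interval."""
    # Critical z that leaves (1 - confidence) / 2 in each tail
    z_conf = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    standard_error = sigma / math.sqrt(n)
    margin = z_conf * standard_error
    return sample_mean - margin, sample_mean + margin

low, high = z_confidence_interval(sample_mean=105, sigma=15, n=25)

print(round(low, 2), round(high, 2))
```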
Cohen’s d
Tells you about the observed mean difference in terms of SD units
- (mean 1 – mean 2)/standard deviation
- .2 = small
- .5 = medium
- .8 = large
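A small sketch of Cohen's d as defined above. Which standard deviation to use is left to the caller, since the card does not specify one. The numbers are made up.

```python
def cohens_d(mean_1, mean_2, sd):
    """Observed mean difference expressed in standard deviation units."""
    return (mean_1 - mean_2) / sd

d = cohens_d(105, 100, 15)   # a 5-point difference with SD = 15

print(round(d, 2))
```

By the guidelines above, this d falls between small (.2) and medium (.5).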
T-test
Used when we don’t know the population standard deviation and must estimate it from the sample
- t = (x̄ - μx̄)/Sx̄
- Sx̄ = estimated standard error = s/sqrt(n)
- x̄ = sample mean
- μx̄ = hypothesized population mean
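The t formula above can be sketched for a single sample as follows. The scores, the hypothesized mean of 4, and the function name `one_sample_t` are made up. The degrees of freedom (n - 1) are returned because they are needed to look up a critical t value.

```python
import math
import statistics

def one_sample_t(scores, hypothesized_mean):
    """Return (t, degrees of freedom) for a one-sample t-test."""
    n = len(scores)
    sample_mean = statistics.mean(scores)
    s = statistics.stdev(scores)              # sample SD (divides by n - 1)
    estimated_se = s / math.sqrt(n)           # estimated standard error
    t = (sample_mean - hypothesized_mean) / estimated_se
    return t, n - 1

scores = [2, 4, 4, 4, 5, 5, 7, 9]             # made-up sample, mean = 5
t, df = one_sample_t(scores, hypothesized_mean=4)

print(round(t, 4), df)
```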