Weeks 1 & 2 (Descriptive/Foundations/Experimental Designs/Comparing 2 Means/Inferential Tables/Statistical Software) Flashcards
What are the types of biostatistics?
descriptive statistics
probability
estimate population parameters
hypothesis testing
Types of population
target and accessible
target population definition
the LARGER population to which results of a study will be generalized
accessible population definition
the ACTUAL population of subjects available to be chosen for a study
Sample definition
a subgroup of the population of interest
parameter
a statistical characteristic of a population
statistic
a statistical characteristic of a sample
descriptive statistic
used to describe a sample's shape, central tendency, and variability
inferential statistic
used to make inferences about a population (t-test, ANOVA, Pearson's r)
measures of central tendency
mean, median, and mode
what is central tendency?
the central value; the BEST single representative value of the target population
what is variability?
the “spread” of the data
small variability: narrow, spike-like distribution
large variability: wide, wave-like distribution
frequency definition
the number of times a value appears in a data set
frequency distribution
the pattern of frequencies of a variable
methods of displaying frequency distributions
histograms & stem-and-leaf plots
skewed to the left (image)
skewed to the right (image)
normal (no skew) (image)
different shapes of distributions
normal (B)
skewed to right (A)
skewed to left (C)
Skewed to right (words)
the “tail” points to the right, away from where the bulk of the curve lies
AKA “positive skew”
mean > median/mode
Skewed to left (words)
“tail” faces left
AKA “negative skew”
mean < median/mode
Measures of Central Tendency: best choice for MEAN
best choice for numeric data
(not good for skewed data)
Measures of Central Tendency: best choice for MEDIAN
best for non-symmetrical data
Measures of Central Tendency: best choice for MODE
limited utility; nominal or ordinal data
common in surveys
Mean: Advantages
easy to calculate; data do not need to be arranged in order; can be used in further calculations
Mean: Disadvantages
can’t be used with categorical data, affected by extreme values
Median: advantages
easy, can be used with “ranked” data
Median: disadvantages
tedious in a large data set
should be used with ordinal
mode: advantages
easy to understand and calculate
Mode: disadvantages
not based on all values
unstable when the data consist of a small number of values
sometimes the data have 2+ modes or no mode at all
common measures of variability
range, interquartile range, standard deviation, variance, coefficient of variation
range
difference between highest and lowest score
percentiles
a score’s position within the distribution (divides the distribution into 100 equal parts)
quartiles
divide the distribution into 4 equal parts
interquartile range (IQR)
difference between 25th and 75th percentile
often used with median
What is a box plot?
five-number summary of data set
(minimum, 1st quartile, median, 3rd quartile, maximum)
box = interquartile range
horizontal line at median
“whiskers” = minimum and maximum scores
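A minimal Python/NumPy sketch (not part of the original cards; the scores are made up) of the five-number summary and IQR that a box plot displays:

```python
import numpy as np

# Hypothetical scores for illustration only
scores = np.array([12, 15, 15, 18, 20, 22, 25, 30, 41])

minimum, maximum = scores.min(), scores.max()
q1, median, q3 = np.percentile(scores, [25, 50, 75])  # quartiles
iqr = q3 - q1  # interquartile range = 75th percentile - 25th percentile

print(minimum, q1, median, q3, maximum, iqr)
```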
coefficient of variation
CV = (SD / mean) × 100%
used for interval and ratio data only
unitless
helpful for comparing variability between two distributions measured on different scales
what shape is normal distribution?
bell-shaped
constant and predictable characteristics of normal distribution
68% of scores fall within 1 SD of the mean
95% of scores fall within 2 SD of the mean
~99.7% of scores fall within 3 SD of the mean
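A quick hedged check of the 68/95/99.7 rule using SciPy's standard normal CDF (assumes SciPy is available; not part of the original cards):

```python
from scipy.stats import norm

for k in (1, 2, 3):
    pct = norm.cdf(k) - norm.cdf(-k)    # area within k SD of the mean
    print(f"within {k} SD: {pct:.1%}")  # ~68.3%, ~95.4%, ~99.7%
```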
z-scores
a standardized score based on the normal distribution: z = (score - mean) / SD
allows for the interpretation of a single score in relation to the distribution of scores
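A small sketch of a z-score with hypothetical values (score, mean, and sd are made up for illustration):

```python
from scipy.stats import norm

score, mean, sd = 85, 75, 5
z = (score - mean) / sd       # 2.0 -> the score sits 2 SD above the mean
percentile = norm.cdf(z)      # ~0.977 -> roughly the 98th percentile
print(z, percentile)
```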
probability definition
the likelihood that any one event will occur, given all the possible outcomes
“what is likely to happen”
sampling error
difference between sample mean and population mean
what is sampling error measured by
standard error of the mean (SEM)
standard error of the mean equation
SEM = SD / square root of sample size (n)
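A sketch of the SEM formula on made-up data; scipy.stats.sem should agree with the manual SD / sqrt(n) calculation (both use the sample SD, ddof = 1):

```python
import numpy as np
from scipy import stats

sample = np.array([4.1, 5.0, 5.2, 6.3, 5.8, 4.9, 5.5, 6.0])

sem_manual = sample.std(ddof=1) / np.sqrt(len(sample))
print(sem_manual, stats.sem(sample))  # the two values should match
```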
what happens to the SEM if we increase our sample size?
decrease in error
What happens to the SEM if we increase our standard deviation?
increase in error
what is the standard error of the mean
the standard deviation of the distribution of sample means; allows us to estimate population parameters
90% confidence = z-score of what?
1.65
95% confidence = z-score of what?
1.96
99% confidence = z-score of what?
2.58
point estimate
a single value that represents the best estimate of the population value
confidence interval
a range of values that we are confident contains the population parameter
increased precision (narrowed) by…
larger sample size
less variance (lower s)
lower selected level of confidence (90% vs 95%)
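A hedged sketch (made-up sample) tying the point estimate, SEM, and the z-scores above (1.65, 1.96, 2.58) into confidence intervals; note that the 90% interval comes out narrowest, matching the precision card:

```python
import numpy as np
from scipy import stats

sample = np.array([4.1, 5.0, 5.2, 6.3, 5.8, 4.9, 5.5, 6.0])
mean, sem = sample.mean(), stats.sem(sample)   # point estimate and SEM

for conf in (0.90, 0.95, 0.99):
    z = stats.norm.ppf(1 - (1 - conf) / 2)     # ~1.65, 1.96, 2.58
    print(f"{conf:.0%} CI: {mean - z * sem:.2f} to {mean + z * sem:.2f}")
```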
null hypothesis means
there is no difference
type I error
alpha
“liar”
we say there is a difference but there is no difference
reject the null but the null is true
type II error
beta
“blind”
we say there is no difference but there is a difference
do not reject the null but the null is false
normal value of alpha
.05
p-value
the probability of obtaining a result at least this extreme, if the null hypothesis is true
if p-value < alpha
reject the null
if p-value > alpha
fail to reject the null
if we “fail to reject” the null, we attribute any observed difference to
sampling error only
if a confidence interval does not contain 0 it means
there is a real (statistically significant) difference
if a confidence interval does contain 0 it means
there is no statistically significant difference
mistakenly finding a difference
false-positive
mistakenly finding no difference
false-negative
statistical power formula
1 - beta
critical values for a two-tailed test (alpha = .05)
±1.96
one-tailed test is for
directional hypothesis
two-tailed test is for
nondirectional hypothesis
statistical power
the probability of finding a statistically significant difference if such a difference exists in the real world
the probability that the test correctly rejects the null hypothesis
four pillars of power
alpha, effect size, variance, sample size
to increase power
higher alpha, large effect size, LOW variance, large sample size
decreased power
lower alpha, small effect size, HIGHER variance, smaller sample size
determinants of statistical power
Power (1 - beta), Alpha level of significance, N (sample size), Effect size
PANE
A priori
before data collection
before study
Post hoc
after data collection
after study
A priori analysis standard effect sizes: small
d = .20
A priori analysis standard effect sizes: medium
d = .50
A priori analysis standard effect sizes: large
d = .80
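A hedged a priori sample-size sketch using statsmodels (not part of the course materials): per-group n for an independent-samples t-test at alpha = .05 and power = .80 across Cohen's small/medium/large benchmarks:

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
for d in (0.20, 0.50, 0.80):
    n = analysis.solve_power(effect_size=d, alpha=0.05, power=0.80)
    print(f"d = {d:.2f}: ~{n:.0f} subjects per group")
```

Smaller effect sizes demand much larger samples, which is the point of running the analysis before the study.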
True experimental design
RCT = gold standard
IV manipulated by researcher
at least 2 groups
randomly assigned
Quasi-experimental designs
may lack randomization
may lack comparison group
may lack both
does a posttest-only control group give us all the information we need?
no; without a pretest we cannot check baseline equivalence or measure change over time
same people in each level of the IV
within-subject design
single-factor (one-way) repeated measures design
no control group, subjects act as their own controls
examples of parametric statistics tests
t-tests, ANOVA, ANCOVA, Correlation, Regression
Assumptions of Parametric Test
scale data (ratio or interval), random sampling, equal variance, normality
t-test
comparing 2 means
2 different groups
variance (differences) comes from 2 sources:
IV and everything else (error variance)
comparing means for INDEPENDENT groups
t = difference between means / variability within groups
comparing means for REPEATED measures
t = mean of the difference scores / standard error of the difference scores
if t > 1 then
you have a greater difference between groups
if t < 1 then
you have more variability within groups
comparing means formula:
t = (treatment effect + error) / error
degrees of freedom definition
the number of independent pieces of information that went into calculating the estimate
degrees of freedom equation
df = n - 1
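A SciPy sketch of both t-tests on made-up groups (group_a and group_b are hypothetical); reject the null when the printed p-value is below alpha (.05):

```python
import numpy as np
from scipy import stats

group_a = np.array([23, 25, 28, 30, 27, 26])   # e.g., treatment
group_b = np.array([20, 22, 25, 24, 21, 23])   # e.g., control

t_ind, p_ind = stats.ttest_ind(group_a, group_b)  # unpaired: df = n1 + n2 - 2
t_rel, p_rel = stats.ttest_rel(group_a, group_b)  # paired: df = number of pairs - 1

print("independent:", t_ind, p_ind)
print("paired:     ", t_rel, p_rel)
```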
assumptions of unpaired t-tests
data from ratio or interval scales
samples are randomly drawn from populations
homogeneity of variance
population is normally distributed
Effect size for t-test
the measure of the effect the IV has on the DV
effect size: small
d = .20
effect size: medium
d = .50
effect size: large
d = .80
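A sketch of Cohen's d for two independent groups (difference in means divided by the pooled SD), reusing the hypothetical groups from the t-test sketch above:

```python
import numpy as np

group_a = np.array([23, 25, 28, 30, 27, 26])
group_b = np.array([20, 22, 25, 24, 21, 23])

n1, n2 = len(group_a), len(group_b)
pooled_sd = np.sqrt(((n1 - 1) * group_a.var(ddof=1) +
                     (n2 - 1) * group_b.var(ddof=1)) / (n1 + n2 - 2))
d = (group_a.mean() - group_b.mean()) / pooled_sd
print(d)   # compare against .20 (small), .50 (medium), .80 (large)
```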
assumptions of paired t-tests
data from ratio or interval scales
samples are randomly drawn from populations
population is normally distributed
what is the best way to DECREASE the width of the CI?
decrease the percentage associated with the confidence interval