extras Flashcards
what to remember when describing a distribution
- centre - median need to SAY median
- Spread - IQR - such that the middle 50% of scores are situated btw x and y + max and min
- Shape - peaks and distribution of scores + skewness
- any outliers?
Outliers can occur because of?
sampling error
participant error
researcher error
random chance
probability density functions
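hypothetical population distributions are defined using mathematical formulas known as probability density functions (pdfs) - for a continuous variable, the pdf gives the relative likelihood (density) of each value; probabilities correspond to areas under the curve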
total area under the curve defined by a probability density function always equals 1
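Quick numerical check of the two pdf cards above (a minimal Python sketch, not from the original notes; the population with mean 100 and sd 15 is a made-up example):

```python
from statistics import NormalDist

def area_under_pdf(dist, lo, hi, steps=100_000):
    """Approximate the area under dist.pdf between lo and hi (midpoint rule)."""
    width = (hi - lo) / steps
    return sum(dist.pdf(lo + (i + 0.5) * width) for i in range(steps)) * width

# Hypothetical population: mean 100, sd 15 (illustration only).
population = NormalDist(mu=100, sigma=15)

# Integrating over +/- 10 sd covers essentially the whole curve.
total_area = area_under_pdf(population, -50, 250)

# A probability is an area between two values, not the height of the curve.
p_85_to_115 = population.cdf(115) - population.cdf(85)

print(round(total_area, 6), round(p_85_to_115, 4))
```

The total area should come out at essentially 1, and the probability of a score within one sd of the mean at roughly .68.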
normal distribution is a…
hypothetical population distribution
should you describe a sample as normal?
No, it approximates a normal distribution
standard normal distribution
Normal distribution with μ = 0 and σ = 1
z score if x is an observation from a normal distribution - z-score of x is
z = (x - μ)/σ
Z scores follow what kind of distribution…
follow a normal distribution with μ = 0 and σ = 1 (when x itself is normally distributed)
sampling distribution
We can imagine collecting an infinite number of samples of N = 40 Peabody scores, leading to an infinite number of sample means and standard deviations.
If each of these samples came from the same population, then each sample mean is an estimate of the same population mean, μ, and each sample standard deviation is an estimate of the same population standard deviation, σ.
Because of sampling error (not “bias”!), very few, if any, of these mean and standard deviation estimates will exactly equal the true population mean and standard deviation.
Imagine creating a frequency distribution table or graph for the collection of sample means obtained from repeatedly collecting different samples of size N = 40 from the same population. This collection of sample means would form the sampling distribution of the mean.
A sampling distribution is the distribution of a …
statistic
Sampling distributions are blank blank distributions
theoretical population
Central limit theorem
Describes the sampling distribution of the mean
also applies to sample regression slope estimates
Central limit theorem - for means calculated from samples drawn from any parent population with mean μ and sd σ (finite variance), the sampling distribution of the mean will approach a normal distribution with mean μ and sd σ/sqrt(N) as N approaches infinity.
standard error is what
standard error of a statistic is the standard deviation of that statistic's sampling distribution
for the mean it is σ/sqrt(N), often written σ_xbar
average amount that a sample mean xbar is expected to differ from the population mean μ
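Simulation sketch for the sampling distribution, central limit theorem and standard error cards (Python, standard library only; the exponential parent population, the seed and the 5000 repetitions are assumptions chosen for illustration, standing in for "an infinite number of samples"):

```python
import math
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

N = 40          # sample size
REPS = 5000     # stand-in for "infinitely many" samples

# Hypothetical skewed parent population: exponential with mu = 1 and sigma = 1.
MU, SIGMA = 1.0, 1.0

sample_means = []
for _ in range(REPS):
    sample = [random.expovariate(1.0) for _ in range(N)]
    sample_means.append(statistics.fmean(sample))

mean_of_means = statistics.fmean(sample_means)   # should sit close to mu
sd_of_means = statistics.stdev(sample_means)     # empirical standard error
theoretical_se = SIGMA / math.sqrt(N)            # sigma / sqrt(N)

print(mean_of_means, sd_of_means, theoretical_se)
```

Even though the parent population is skewed, the collection of sample means centres on the population mean and its sd lands close to σ/sqrt(N).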
Z score for individual
z = (x - μ)/σ
z score for a sample mean
z = (xbar - μ)/(σ/sqrt(N))
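The two z formulas side by side, as a small Python sketch (the population values 100 and 15, and the scores plugged in, are made up for illustration):

```python
import math

def z_individual(x, mu, sigma):
    """z score for one observation: distance from mu in sd units."""
    return (x - mu) / sigma

def z_sample_mean(xbar, mu, sigma, n):
    """z score for a sample mean: distance from mu in standard-error units."""
    standard_error = sigma / math.sqrt(n)
    return (xbar - mu) / standard_error

# Hypothetical population with mu = 100 and sigma = 15.
z_x = z_individual(115, mu=100, sigma=15)            # one person scoring 115
z_xbar = z_sample_mean(103, mu=100, sigma=15, n=25)  # mean of 25 people is 103

print(z_x, z_xbar)
```

A single score 15 points above the mean and a 25-person mean only 3 points above it are equally unusual, because the sample mean is judged against the standard error (15/sqrt(25) = 3) rather than the sd.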
point estimate
single value used as an estimate of a population parameter
what are point estimates influenced by?
point estimates are calculated using data from random samples drawn from a much larger population so they are influenced by sampling error
variation of a point estimate from one sample to another represents the extent of sampling error
Sampling error and sample size
smaller samples have more sampling error than larger samples
point estimates from small samples, more sampling error
standard error of the mean formula- bigger N gets, smaller standard error gets - less sampling error with larger N
CIs from small samples reflect more sampling error than CIs from larger samples, so they are wider
Confidence interval does what?
Conveys the degree of sampling error around a point estimate by presenting a range of plausible or reasonable values for the population parameter of interest.
CI is a range of values or an interval that is expected to capture a population parameter of interest with some prespecified level of confidence.
gives the precision of a point estimate
What does the Central Limit Theorem tell us about sample means?
Sample means can be treated as observations from a normal distribution.
Interpretation of a confidence interval
This interval captures μ with 95% confidence
Factors affecting the width of a confidence interval that are under the researcher’s direct control:
level of confidence
sample size
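Sketch of how the two researcher-controlled factors change CI width (Python; this simplified version assumes σ is known so a z critical value can be used, and the numbers are invented for illustration):

```python
import math
from statistics import NormalDist

def z_confidence_interval(xbar, sigma, n, confidence=0.95):
    """CI for a population mean when sigma is known (z-based, simplifying assumption)."""
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z_crit * sigma / math.sqrt(n)
    return xbar - margin, xbar + margin

def width(interval):
    return interval[1] - interval[0]

# Hypothetical sample mean of 100 with sigma = 15.
ci_n25 = z_confidence_interval(100, 15, n=25)
ci_n100 = z_confidence_interval(100, 15, n=100)                # larger N
ci_n25_99 = z_confidence_interval(100, 15, n=25, confidence=0.99)  # higher confidence

print(width(ci_n25), width(ci_n100), width(ci_n25_99))
```

Quadrupling N halves the width; raising the confidence level from 95% to 99% widens the interval.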
Type I error
is the rejection of a true null hypothesis. The probability of a Type I Error is alpha (α), given that the correct statistical model has been used to test H0.
Type II error
is the failed rejection of a false null hypothesis. The probability of a Type II error is beta (β).
Power
Power is the probability of rejecting a false null hypothesis. Power is the complement of the probability of Type II error (power = 1 - β).
What is power greater for?
larger sample sizes and for larger effect sizes
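Sketch of why power grows with N and with effect size (Python; this assumes the simplest case, an upper-tailed z-test with known σ, which is a stand-in and not the t-based tests covered later in these cards; the effect sizes are made up):

```python
import math
from statistics import NormalDist

def power_one_tailed_z(effect_size, n, alpha=0.05):
    """Power of an upper-tailed z-test with known sigma.

    effect_size is (true mean - null mean) / sigma, assumed non-negative.
    """
    z_crit = NormalDist().inv_cdf(1 - alpha)
    # Under the alternative, the z statistic is centred on effect_size * sqrt(n).
    beta = NormalDist().cdf(z_crit - effect_size * math.sqrt(n))
    return 1 - beta

for n in (10, 25, 50):
    print("n =", n, "power =", round(power_one_tailed_z(0.5, n), 3))

for d in (0.2, 0.5, 0.8):
    print("d =", d, "power =", round(power_one_tailed_z(d, 25), 3))
```

With an effect size of zero the null is true, and the function returns alpha, the Type I error rate.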
statistical model
represents the value of a dependent variable (often symbolized with the letter y) as a function of one or more parameters plus an error term.
General Linear Model
expresses the dependent variable as a linear function of the parameter(s); all models we examine take this form.
error variance
represents the extent to which individual scores differ from the mean (e.g., how much professor salaries differ from the mean salary)
In an intercept-only model, the error variance is equivalent to the variance of the dependent variable
t distribution is used when
using sample estimate of the standard error of the mean
t distribution has higher kurtosis (heavier tails) than the normal distribution; this results from the added uncertainty due to estimating the standard error
The particular T distribution used depends on what?
the degrees of freedom
When df = infinity, t distribution =
standard normal distribution
t stat formula
t = (ybar - μ0)/s_ybar, where s_ybar = s/sqrt(N) is the estimated standard error of the mean
μ0 = population mean value given by the null hypothesis
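The t formula as a Python sketch (the five scores and the null value of 6 are invented for illustration; the p-value would still need a t table or statistics software):

```python
import math
import statistics

def one_sample_t(sample, mu0):
    """Return (t, df) for a one-sample t-test of H0: mu = mu0."""
    n = len(sample)
    ybar = statistics.fmean(sample)
    s_ybar = statistics.stdev(sample) / math.sqrt(n)  # estimated standard error
    return (ybar - mu0) / s_ybar, n - 1

# Hypothetical scores; the null hypothesis says the population mean is 6.
scores = [4, 6, 8, 10, 12]
t_stat, df = one_sample_t(scores, mu0=6)

print(round(t_stat, 3), df)
```

Here ybar = 8, s = sqrt(10), so s_ybar = sqrt(10)/sqrt(5) = sqrt(2) and t = 2/sqrt(2), evaluated against a t distribution with df = N - 1 = 4.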
One sample T test report
The mean nine-month salary for professors was M = $113,706.46 (SD = 30,289.04), with 95% CI [110,717.90, 116,695.10]. A one-sample t-test confirmed that this mean significantly differs from the U.S. population median salary, t (396) = 41.76, p < .001
Effect size
magnitude of the association
difference between two means
Assumptions for a one-sample t-test
- independent observations
- sample data come from normal pop distribution
general linear model
represents the dependent variable as a function of population means
Describe confidence interval of a slope estimate
The interval from blank to blank captures the population mean difference with 95% confidence
CI formula for slope parameter
Bhat1 ± tcrit × s_Bhat1
s_Bhat1 = standard error of the slope parameter
Df in binary independent variable for determining t crit
df = n - 2, because two coefficients (intercept and slope) are estimated in the model
Standard error estimate of Bhat1
for a binary independent variable
s_Bhat1 = sqrt(s^2pooled/n1 + s^2pooled/n2)
pooled variance estimate assumes what?
Population variance of the dependent variable is equal across the two groups - homogeneity of variance
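Sketch tying together the slope CI, its df and the pooled standard error for a binary independent variable (Python; the two tiny groups are made up, and the critical value 2.776 is the two-tailed .05 value for df = 4 taken from a t table, since the standard library has no t distribution):

```python
import math
from statistics import fmean, variance

def binary_slope_summary(group0, group1, t_crit):
    """Slope (mean difference), pooled SE, t, and CI for a 0/1 independent variable."""
    n0, n1 = len(group0), len(group1)
    b1 = fmean(group1) - fmean(group0)  # slope = difference between group means
    # Pooled variance: assumes both groups share one population variance.
    s2_pooled = ((n0 - 1) * variance(group0) + (n1 - 1) * variance(group1)) / (n0 + n1 - 2)
    se_b1 = math.sqrt(s2_pooled / n0 + s2_pooled / n1)
    ci = (b1 - t_crit * se_b1, b1 + t_crit * se_b1)
    return b1, se_b1, b1 / se_b1, ci

# Hypothetical data: group coded 0 and group coded 1.
control = [2, 4, 6]
treated = [5, 7, 9]
T_CRIT_DF4 = 2.776  # table value for df = n - 2 = 4, 95% confidence

b1, se_b1, t_stat, ci = binary_slope_summary(control, treated, T_CRIT_DF4)
print(b1, round(se_b1, 3), round(t_stat, 3), [round(v, 2) for v in ci])
```

Because the interval includes 0 in this made-up example, H0: B1 = 0 would not be rejected at the .05 level.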
Two sample t test or binary categorical anova null hypothesis
H0: μ1 = μ2
equivalently, H0: B1 = 0
Error term formula for predicted errors from linear model
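ehati = yi - μhati, where μhati is the predicted value for observation i (the mean of its group)
one predicted mean for each group, one error for each observation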
error term formula for errors from null model if null is true
ehati = yi - ybar
what is the purpose of a statistical model?
describe or explain individual differences or variation in a dependent variable
If a model does a good job of accounting for individual differences, what should the variance of errors be like?
variance of the errors should be small relative to the overall variance of the variable
ie. full model has accounted for or explained a portion of the dependent variable variance
Proportion reduction in error
R^2 - represents the proportion of dependent variable variance explained by the model
In the context of a single binary independent variable R^2 =
eta squared
ANOVA
involves partitioning the total sample variation of the dependent variable into variation explained by the model and error variation -residual variation
Relation btw sd and variance
sd is the square root of the variance
Variance
average squared deviation from the mean: the sum of squared deviations from the mean divided by N - 1 (for a sample)
Numerator and denominator of F statistic
MS model/MS error
Variability explained by the model/residual variability
SS Total
Sum of squared deviations of observed values of y from the mean of y
SUM (yi-ybar)^2
Model SS for Y
SUM (μhati - ybar)^2
Model SS for Y is called variability explained by the model because
it summarizes the predicted variation due to group membership relative to the overall mean
Residual SS for Y
the sum of squared residuals across all observations
SUM (yi - μhati)^2
Write out the ANOVA table
Formula for R^2
SS model/SS total
1-(SSresid/SStotal)
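One way to write out the ANOVA table and check both R^2 formulas (a Python sketch on three made-up groups of three scores; the layout Source / SS / df / MS / F is the conventional one, not copied from the course materials):

```python
from statistics import fmean

def anova_table(groups):
    """Partition total SS into model SS and residual SS for a one-way design."""
    all_y = [y for g in groups for y in g]
    grand_mean = fmean(all_y)
    ss_total = sum((y - grand_mean) ** 2 for y in all_y)
    # Each observation's predicted value is its own group mean.
    ss_model = sum(len(g) * (fmean(g) - grand_mean) ** 2 for g in groups)
    ss_resid = sum((y - fmean(g)) ** 2 for g in groups for y in g)
    df_model = len(groups) - 1
    df_resid = len(all_y) - len(groups)
    ms_model = ss_model / df_model
    ms_resid = ss_resid / df_resid
    return {
        "ss_total": ss_total, "ss_model": ss_model, "ss_resid": ss_resid,
        "df_model": df_model, "df_resid": df_resid,
        "ms_model": ms_model, "ms_resid": ms_resid,
        "F": ms_model / ms_resid,
        "R2": ss_model / ss_total,
    }

# Hypothetical scores for three groups.
table = anova_table([[2, 4, 6], [5, 7, 9], [8, 10, 12]])

print("Source    SS     df   MS     F")
print(f"Model     {table['ss_model']:<6g} {table['df_model']:<4} {table['ms_model']:<6g} {table['F']:g}")
print(f"Residual  {table['ss_resid']:<6g} {table['df_resid']:<4} {table['ms_resid']:<6g}")
print(f"Total     {table['ss_total']:<6g} {table['df_model'] + table['df_resid']}")
print("R^2 =", round(table["R2"], 4))
```

Model SS and residual SS add up to total SS, which is why SS model/SS total and 1-(SSresid/SStotal) give the same R^2.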
Range of F stat
0 to infinity
Distribution of F stat
One tailed
Positively skewed
0 to infinity
varies by DF
Formula for T for the difference between two sample means
t = (ybar2 - ybar1)/s_(ybar2 - ybar1)
When numerator df =1 then F =
t^2
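Numerical check that F = t^2 when there are only two groups (Python sketch; the scores are made up):

```python
import math
from statistics import fmean, variance

group1 = [2, 4, 6]  # hypothetical scores
group2 = [5, 7, 9]
n1, n2 = len(group1), len(group2)

# t for the difference between two sample means (pooled variance).
s2_pooled = ((n1 - 1) * variance(group1) + (n2 - 1) * variance(group2)) / (n1 + n2 - 2)
se_diff = math.sqrt(s2_pooled / n1 + s2_pooled / n2)
t_stat = (fmean(group2) - fmean(group1)) / se_diff

# F from the ANOVA partition of the same data (numerator df = 1).
grand_mean = fmean(group1 + group2)
ss_model = (n1 * (fmean(group1) - grand_mean) ** 2
            + n2 * (fmean(group2) - grand_mean) ** 2)
ss_resid = (sum((y - fmean(group1)) ** 2 for y in group1)
            + sum((y - fmean(group2)) ** 2 for y in group2))
f_stat = (ss_model / 1) / (ss_resid / (n1 + n2 - 2))

print(t_stat ** 2, f_stat)
```

Both routes use the same pooled error variance, so the squared t and the F agree up to floating-point rounding.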
Independent samples T test report
“The mean reaction time was significantly greater for those with a reading disorder diagnosis (M = 2039.76ms, SD = 1128.36) than the control group (M = 1374.68ms, SD = 625.35), t (36) = 2.28, p = .03. The 95% CI for the mean difference was [72.14, 1258.02].”
Assumptions for an independent-samples t-test
- The observations are independent
- The dependent variable is normally distributed within each group
- Homogeneity of variance: The use of the pooled variance estimate in the formula for the standard error of the regression slope (i.e., standard error of the sample mean difference) is based on the assumption that the sample variances of the two groups are both estimates of a single population variance.
The t-test is robust against violations of normality and homogeneity of variance when
sample sizes are large
sample sizes are equal across groups
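1st dummy variable step
create J-1 separate binary dummy variables for an independent variable with J categories (the remaining category is the reference group)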
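Sketch of the dummy-coding step (Python; the helper name `dummy_code`, the six participants and the choice of three categories are hypothetical, picked only to keep the example short):

```python
def dummy_code(labels, reference):
    """Return J - 1 binary (0/1) dummy variables, one per non-reference category."""
    categories = [c for c in dict.fromkeys(labels) if c != reference]
    return {c: [1 if label == c else 0 for label in labels] for c in categories}

# Hypothetical condition labels for six participants (J = 3 categories).
conditions = ["counting", "rhyming", "adjective", "counting", "adjective", "rhyming"]
dummies = dummy_code(conditions, reference="counting")

for name, column in dummies.items():
    print(name, column)
```

Participants in the reference category score 0 on every dummy variable, so each slope estimate is the difference between one category's mean and the reference mean.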
Null hypothesis of one way anova
H0: B1=B2=B3=B4=0
APA report one way anova
“The overall proportion of variance explained by the linear model, R^2 = .45, was significant, F (4, 45) = 9.09, p < .001, indicating that the number of words recalled significantly varied across the five conditions representing different levels of depth of processing.”
What does the result of an anova indicate
a significant result indicates that at least one population mean is unlikely to be equal to the other population means.
T formula for each slope coefficient estimate
t = Bhat/s_Bhat
When are anova t-tests valid
as planned comparisons
if a researcher explicitly planned to compare the mean of the reference with the other categories
When to do post hoc
When comparisons not planned a priori or you want to compare group means that do not include the reference group
APA report for a priori t tests
“Because the dummy variables in the linear model were defined a priori, the corresponding t-tests represent planned comparisons. The rhyming mean (M = 6.90) did not significantly differ from the counting mean (M = 7.00), t (45) = 0.07, p = .94. But the adjective mean (M = 11.00) was significantly different from the counting mean, t (45) = 2.88, p = .006.”
Etc. for the t-tests for the remaining dummy variables.
Assumptions for one way ANOVA
- independent observations
- normally distributed errors
- homogeneity of variance
What happens if one performs multiple significance tests on the same data without proper adjustments?
Probability that at least one of the tests produces a Type I error is greater than the per-test alpha level (e.g., greater than .05)
Formula for type 1 error accumulation
1 - (1 - α)^c, where c is the number of independent tests, each performed at level α
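The accumulation formula as a Python sketch (it assumes the c tests are independent, which real pairwise comparisons on the same data usually are not, so treat the numbers as approximate):

```python
def familywise_error(alpha, c):
    """Probability of at least one Type I error across c independent tests."""
    return 1 - (1 - alpha) ** c

for c in (1, 3, 9):
    print(c, round(familywise_error(0.05, c), 4))
```

With a single test the rate is just alpha; it climbs quickly as the number of unadjusted tests grows.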
Tukeys HSD
experiment-wise Type I error rate is maintained at the α-level used to test the omnibus null hypothesis, regardless of whether the pairwise comparisons were planned a priori.
Bonferroni adjustment
the experiment-wise alpha level is simply divided by the number of specific hypothesis tests to be performed.
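Sketch of the Bonferroni adjustment (Python; the three p-values are made up, and the check of the resulting overall error rate assumes independent tests):

```python
def bonferroni_alpha(alpha, n_tests):
    """Per-test alpha after dividing the experiment-wise alpha by the number of tests."""
    return alpha / n_tests

def bonferroni_decisions(p_values, alpha=0.05):
    """True where a test is significant at the adjusted per-test alpha."""
    per_test = bonferroni_alpha(alpha, len(p_values))
    return [p < per_test for p in p_values]

# Hypothetical p-values from three pairwise comparisons.
p_values = [0.010, 0.030, 0.200]
per_test_alpha = bonferroni_alpha(0.05, len(p_values))
decisions = bonferroni_decisions(p_values)

# Overall Type I error rate if the three tests were independent.
overall_rate = 1 - (1 - per_test_alpha) ** len(p_values)

print(round(per_test_alpha, 4), decisions, round(overall_rate, 4))
```

The comparison with p = .030 would be significant at an unadjusted .05 but not at the adjusted level of .05/3, and the overall rate stays just under .05.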
Moderation
the second independent variable may moderate the effect of the primary independent variable; for this reason, the second independent variable is often called a moderator
What does a population model represent
how these two independent variables combine to explain individual differences in the dependent variable.
what does it mean that the main-effects model is likely misspecified
it is an incorrect model in the sense that it cannot adequately account for the major regularities of the data.
interaction effects
allow the effect of a smoking-group dummy variable to be moderated by the effect of a task-type dummy variable.
Null for two way anova
all interaction terms = 0
Questions asked by comparing full model with main-effects model
Does smoking group significantly interact with task type? Do the smoking group mean differences significantly vary across the task types? Is the effect of smoking group significantly moderated by task type?
MS effect
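compares the main-effects model with the full model
Because the two models differ only by the inclusion of the interaction terms, the difference between their RSS values (14857 – 13587) gives the overall interaction sum-of-squares term, SS effect = 1269.5 (the rounded RSS values shown give 1270). MS effect = SS effect divided by the interaction degrees of freedom.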
interaction degrees of freedom
(J – 1)*(K – 1)
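Sketch of the interaction test as a model comparison (Python; the RSS values are the rounded ones from the MS effect card, J = 3 smoking groups and K = 3 task types are assumptions read off the family-wise card, and the residual df of 126 is a purely hypothetical placeholder because the sample size is not given in these notes):

```python
def interaction_summary(rss_main, rss_full, j, k):
    """SS, df and MS for the interaction, from main-effects vs full model RSS."""
    ss_effect = rss_main - rss_full   # extra variation explained by interaction terms
    df_effect = (j - 1) * (k - 1)     # number of interaction terms added
    ms_effect = ss_effect / df_effect
    return ss_effect, df_effect, ms_effect

def f_statistic(ms_effect, rss_full, df_resid_full):
    """F = MS effect / MS error, with MS error taken from the full model."""
    return ms_effect / (rss_full / df_resid_full)

ss_effect, df_effect, ms_effect = interaction_summary(14857, 13587, j=3, k=3)

HYPOTHETICAL_DF_RESID = 126  # placeholder only; use the full model's actual residual df
f_interaction = f_statistic(ms_effect, 13587, HYPOTHETICAL_DF_RESID)

print(ss_effect, df_effect, ms_effect, round(f_interaction, 2))
```

The F value printed here depends entirely on the placeholder residual df, so only the SS, df and MS steps should be read as matching the cards.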
family-wise error rate
controls the overall probability of at least one Type I error within each level of the moderator variable. There are three levels of the task moderator, thus there are three families, and three pairwise comparisons within each family. Thus, the p-values are adjusted based on three comparisons. A correction for experiment-wise error rate, on the other hand, would be based on nine comparisons.
simple main effects
refers to the separate omnibus effects of a focal independent variable within different levels of a moderator variable.
e.g. the simple main effect of smoking group within the driving task
How to report on simple main effects
simple main-effect of smoking group within the reading task is significant, F
Assumptions for t tests and anovas
- Independent observations
- Dependent variable is normally distributed within each cell of the study design
- Homogeneity of variance: The variance of the dependent variable is constant across the cells of the study design.