Research Design, Statistics, & Test Construction Flashcards
Quasi-experimental design
at least one IV is manipulated, but there is no random assignment of participants (typically because participants are already in pre-existing groups)
Within-subjects design
groups compared are correlated or related; three conditions lead to this: repeated measures of same participants, subjects matched prior to assignment to groups, subjects have an inherent relationship (e.g., twins)
Latin square
most sophisticated form of counterbalancing subjects in a repeated measures design
Mixed design
includes groups that are both independent and correlated (e.g., patients randomly assigned to two different treatment groups and measured before and after treatment)
Idiographic
refers to single subject approaches (single or few participants studied intensely); AB, ABAB, multiple baseline, simultaneous treatment, and the changing criterion
Nomothetic
group approaches to research design (as opposed to single subject)
Autocorrelation
effect of measuring same person repeatedly; results in highly correlated data; problem of single subject design
AB design
baseline condition (A) followed by treatment condition (B); most significant problem is threat of history (difficult to determine whether intervention or other event caused change)
ABAB design
baseline (A) and treatment (B) alternated in ABAB sequence; protects against threat of history; two potential problems: failure of behavior to return to baseline, issues of ethics with removing effective treatment
Multiple baseline design
treatment is applied sequentially or consecutively across subjects, situations, or behaviors
Simultaneous (alternating) treatment design
two or more interventions implemented concurrently during the treatment phase, balanced and varied across conditions such as time of day
Changing criterion design
attempt is made to change behavior in increments to match a changing criterion (e.g., slowly reducing number of cups of coffee)
Momentary time sampling
simply recording whether target behavior is present or absent at moment that time interval ends
Whole-interval sampling
scoring target behavior positively only if exhibited for full duration of time interval
Analogue research
evaluates treatment under conditions that only resemble or approximate clinical situations; typically for less severe conditions; tight experimental control but limited generalizability (e.g., grad student clinicians using manual)
Clinical trials
outcome investigations conducted in clinical settings; often involve methodological compromises and sacrifices
Cross-sequential research
also called cohort-sequential research; takes several cross sections and follows them over briefer periods of time
Stratified random sampling
population is first divided into strata (e.g., age levels, income levels, ethnic groups), and then a random sample of equal size from each stratum is selected
Proportional sampling
individuals are randomly selected in proportion to their representation in the general population
Systematic sampling
selecting every kth element after a random start (e.g., if 100 out of 1,000 persons are needed, every tenth person is selected); the list must be arranged so that its ordering does not introduce bias
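The selection rule above can be sketched in Python; the function name, the population of 1,000, and the sample of 100 are illustrative:

```python
# Systematic sampling sketch: pick every kth element after a random start.
import random

def systematic_sample(population, sample_size):
    k = len(population) // sample_size   # sampling interval
    start = random.randrange(k)          # random start within the first interval
    return population[start::k][:sample_size]

people = list(range(1000))               # hypothetical population of 1,000
sample = systematic_sample(people, 100)  # every 10th person after a random start
```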
Cluster sampling
identifying naturally occurring groups of subjects (clusters) and randomly selecting certain clusters (e.g., classes or departments at a university, or schools within a particular school district)
History
threat to internal validity; incidents that intervene between measuring points, either in or outside of the experimental situation; best control is a control group
Maturation
threat to internal validity; factors that affect the subjects’ performance because of the passing of time (fatigue, maturing); best control is a control group
Testing or test practice
threat to internal validity; occurs when familiarity with testing affects scores on repeated testing; best control is Solomon Four-Group design
Solomon Four-Group design
controls for testing threats to internal validity; subjects are divided into four groups: (1) pretest, intervention, posttest; (2) pretest, no intervention, posttest; (3) intervention and posttest only; (4) posttest only
Instrumentation
threat to internal validity; changes in observers or the calibration of equipment; control group corrects for this
Statistical regression
threat to internal validity; tendency for extreme scores (scores very much above or below the mean ) to become less extreme (closer to the mean) on retesting, even without any type of intervention; control group controls for this
Selection bias
threat to internal validity; caused by non-random assignment; best avoided with random sampling
Attrition or experimental mortality
threat to internal validity; differential loss of subjects from the groups; to assess for this, compare subjects who drop out with those who remain using t-tests on relevant variables
Diffusion
threat to internal validity; occurs when the no-treatment (control) group receives some of the treatment; difficult to eliminate completely, but tighter control over the experimental situation can help
Construct validity
refers to factors other than the desired specifics of our intervention that result in differences; often lumped under threats to external validity; not measuring what you think you are measuring
Attention and contact with clients
threat to construct validity; difficult to tell whether changes are due to treatment or attention
Experimenter expectancies
threat to construct validity; cues or clues transmitted to the subjects by the experimenter; Rosenthal effect; can be controlled by masking experimenter to conditions
Rosenthal effect
refers to experimenter expectancies
Demand characteristics
threat to construct validity; factors in the procedures that suggest how the subject should behave; control by masking subjects to their condition
John Henry effect
threat to construct validity; occurs when persons in a control group try harder than usual in the spirit of competition with the experimental group; control by making sure experimental and control groups do not know about each other and, if not possible, do not give groups any sense of competition
Threats to external validity
interfere with generalizability of effects
Sample characteristics
threat to external validity; difference between sample and population
Stimulus characteristics
threat to external validity; features of the study with which the intervention is associated (e.g., research assessing memory functioning in the laboratory may not be generalizable to memory functioning in naturalistic settings)
Contextual characteristics
threat to external validity; conditions in which intervention is embedded; e.g., reactivity
Reactivity
subjects behave in a certain way just because they are participating in research and being observed
Low power
threat to statistical conclusion validity; diminished ability to find significant results; small sample size and inadequate interventions can contribute
Unreliability of measures
threat to statistical conclusion validity; unreliable outcome measure
Variability in procedures
threat to statistical conclusion validity; inconsistency in treatment procedures; especially of concern in psychotherapy outcome research
Subject heterogeneity
threat to statistical conclusion validity; subject heterogeneity makes it more difficult to find significant differences between groups
Varies directly with
as one variable increases so does the other (e.g., a varies directly with b in a = b/c)
Varies indirectly with
as one variable increases the other decreases (e.g., a varies indirectly with c in a = b/c)
Ordinal data
involve tallying people to see which ordered category a person falls into (e.g., Likert scale, SES, percentile rank); group means cannot be calculated
Interval data
involve obtaining numerical scores for each person, where the score values have equal intervals; there is no absolute zero (any zero point is arbitrary) (e.g., IQ test, t-score, temperature); group means can be calculated
Ratio data
involve obtaining numerical scores for each person, where the score values have equal intervals and an absolute zero (e.g., score on EPPP, money in bank, weight, number of children)
Standard deviation
average deviation (or spread) from the mean in a given set of scores
Variance
standard deviation squared
Positive skew
higher proportion of scores in the lower range of values (mode has lowest value, mean highest)
Negative skew
higher proportion of scores in the higher ranges of values (mean has lowest value, mode highest)
Kurtosis
refers to how peaked a distribution is
Leptokurtic
distribution with a very sharp peak
Platykurtic
distribution that is very flat
Criterion-referenced or domain-referenced score
example is percentage correct
Norm-referenced score
provides information on how person performed relative to group
Standard scores
based on standard deviation from the sample
Z-scores
standard scores that correspond directly to standard deviation units; transforming into Z-scores does not normalize a distribution (exact same distribution shape); z score = (score - mean)/SD
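The z-score formula can be sketched in Python; the IQ-like scores below are hypothetical, and note that the transformation is linear, so the shape of the distribution is unchanged:

```python
# z-score sketch: (score - mean) / SD for each score in a set.
def z_scores(scores):
    n = len(scores)
    mean = sum(scores) / n
    sd = (sum((x - mean) ** 2 for x in scores) / n) ** 0.5
    return [(x - mean) / sd for x in scores]

zs = z_scores([70, 85, 100, 115, 130])  # hypothetical scores, mean 100
```

After the transformation the z-scores have a mean of 0 and an SD of 1, but relative standing and distribution shape are exactly what they were before.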
Z-scores and percentile ranks
-3 = .1, -2 = 2.5, -1 = 16, 0 = 50, +1 = 84, +2 = 97.5, +3 = 99.9
Parameters
population values
Statistics
sample values
Standard error of the mean
average amount of deviation of sample means from the population mean; equal to population SD divided by square root of sample size
Central limit theorem
states that if an infinite number of equal-sized samples (of large enough size) are drawn from the population and the means of these samples are plotted, a normally distributed distribution of means will result; the mean of the means (the grand mean) will equal the population mean, and the standard deviation of the means will equal the standard deviation of the population divided by the square root of sample size (the standard error of the mean); allows the researcher to calculate whether an obtained mean is most likely due to treatment or experimental effects, or to chance
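The theorem can be illustrated with a small simulation; the skewed population, sample size, and number of samples below are arbitrary choices for the sketch:

```python
# Central limit theorem sketch: means of many equal-sized samples drawn from a
# skewed (non-normal) population cluster normally around the population mean,
# with spread approximated by the standard error of the mean (SD / sqrt(n)).
import random
import statistics

random.seed(0)
population = [random.expovariate(1.0) for _ in range(100_000)]  # skewed
n = 50
sample_means = [statistics.mean(random.sample(population, n))
                for _ in range(1000)]

grand_mean = statistics.mean(sample_means)        # approximates population mean
se_observed = statistics.stdev(sample_means)      # approximates SD / sqrt(n)
se_theoretical = statistics.pstdev(population) / n ** 0.5
```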
Rejection region
also called region of unlikely values; unlikely researcher will obtain values by chance; size corresponds to alpha level
Two factors that contribute to conclusions about statistical significance
treatment effects and chance (sampling error)
Type I error
incorrectly reject null hypothesis; likelihood directly corresponds to size of alpha
Type II error
incorrectly retain (fail to reject) null hypothesis; corresponds to beta
Beta
provides probability of making Type II error
Power
ability to correctly reject null hypothesis; increased when sample size is large, magnitude of intervention is large, random error is small, statistical test is parametric, test is one-tailed; inversely related to beta (power = 1- beta); direct relationship with alpha
Parametric test
three assumptions must be met: data are interval or ratio, homoscedasticity, normally distributed
Nonparametric test
used for nominal or ordinal DV
Statistic for testing differences, more than one DV
MANOVA
Statistics for testing differences, interval or ratio DV
t-test, ANOVA
Statistics for testing differences, nominal or ordinal DV
Chi-Square, Mann-Whitney, Wilcoxon
Homoscedasticity
similar variability or standard deviations in the different groups
Assumption for Chi-Square test
independence of observations
Degrees of freedom
number of possible variations in outcomes that can be obtained
Degrees of freedom for single sample chi-square
df = # of columns - 1
Degrees of freedom for multiple sample chi-square
df = (# rows - 1)(# columns - 1)
Degrees of freedom for single sample t-test
df = N - 1
Degrees of freedom for matched or correlated samples t-test
df = # of pairs - 1
Degrees of freedom for independent samples t-test
df = N - 2
Degrees of freedom total for ANOVA
df = N - 1
Degrees of freedom within for ANOVA
df within = df total - df between (equivalently, N - # of groups)
Degrees of freedom between for ANOVA
df between = # of groups - 1
Expected frequency for Chi-Square when data are given in each cell
expected frequency for any cell = (sum of row * sum of column)/N
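The expected-frequency rule, together with the multiple-sample df formula above, can be sketched in Python; the 2x3 table of observed counts is hypothetical:

```python
# Chi-square sketch: expected cell = (row total * column total) / N,
# df = (# rows - 1)(# columns - 1).
observed = [[10, 20, 30],
            [20, 30, 40]]   # hypothetical 2x3 table of observed counts

row_totals = [sum(row) for row in observed]            # [60, 90]
col_totals = [sum(col) for col in zip(*observed)]      # [30, 50, 70]
N = sum(row_totals)                                    # 150

expected = [[r * c / N for c in col_totals] for r in row_totals]
df = (len(observed) - 1) * (len(observed[0]) - 1)      # (2-1)(3-1) = 2
```

The expected frequencies always sum to the same N as the observed counts, which is a quick sanity check on the computation.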
F ratio
F ratio = Mean Square between groups/Mean Square within groups; values above roughly 2.0 often reach significance, though the critical value depends on degrees of freedom
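A one-way ANOVA F ratio can be built from sums of squares and the df rules above; the three small groups of scores are hypothetical:

```python
# F-ratio sketch: MS between / MS within for a one-way ANOVA.
groups = [[4, 5, 6], [6, 7, 8], [9, 10, 11]]   # hypothetical data

all_scores = [x for g in groups for x in g]
N, k = len(all_scores), len(groups)
grand = sum(all_scores) / N
means = [sum(g) / len(g) for g in groups]

ss_between = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)

ms_between = ss_between / (k - 1)   # df between = # of groups - 1
ms_within = ss_within / (N - k)     # df within = N - # of groups
F = ms_between / ms_within
```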
Mean Square
measure of average variability
ANOVA post-hoc tests in order of most to least protection against Type I error
in order of most to least conservative: Scheffe, Tukey, Duncan/Dunette/Neuman-Kuels, Fisher’s LSD
Two-way ANOVA
when groups are being compared on two IVs; permits analysis of main effects and interaction effects; when interaction is significant, main effects must be interpreted in context of interaction effect
Examining for interaction effects in an ANOVA table
sum the diagonals of each individual 2x2 set of cell means; unequal diagonal sums suggest an interaction effect
MANOVA
used when there are multiple DVs
Coefficient of determination
calculated by squaring correlation coefficient; represents amount of variability in Y that is shared with or explained by X
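Squaring a Pearson r can be sketched in Python; the helper function and the small X/Y data are hypothetical:

```python
# Coefficient of determination sketch: square the Pearson r between X and Y;
# r^2 is the proportion of Y's variability shared with (explained by) X.
def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

r = pearson_r([1, 2, 3, 4], [2, 4, 5, 9])   # hypothetical scores
r_squared = r ** 2                           # coefficient of determination
```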
Assumptions of bivariate correlations
linear relationship between X and Y, homoscedasticity, unrestricted range of scores on X and Y
Bivariate correlation coefficient for two interval/ratio variables
Pearson r
Bivariate correlation coefficient for two ordinal variables
Spearman’s rho, Kendall’s tau
Bivariate correlation coefficient for interval/ratio and true dichotomy
point-biserial
Bivariate correlation coefficient for interval/ratio and artificial dichotomy
biserial
Bivariate correlation coefficient for two true dichotomies
Phi
Bivariate correlation coefficient for two artificial dichotomies
tetrachoric
Coefficient for curvilinear relationship between X and Y
Eta
Zero-order correlation
examines relationship between X and Y when it is believed there are no extraneous variables affecting the relationship
Partial correlation
also called first-order correlation; examines the relationship between the predictor and the criterion with the effect of a third variable removed that is thought to be affecting both variables
Part correlation
also called a semi-partial correlation; examines the relationship between the predictor and the criterion with the influence of a third variable removed from only one of the original variables
Multivariate tests
involve several predictors and one or more criteria (DVs)
Multiple R
multiple correlation; correlation between two or more IVs and one DV, where Y is always interval or ratio data, and at least one X is interval or ratio data
Coefficient of multiple determination
obtained by squaring multiple R; index of the amount of variability in the criterion (Y) that is accounted for by the combination of all the predictors (Xs)
Multiple regression
Has multiple predictors
Multicollinearity
problem that occurs in a multiple regression equation when the predictors are highly correlated with one another, and therefore essentially redundant
Stepwise regression
computer-generated; in forward regression, the computer adds predictor variables one at a time, starting with the predictor that has the highest correlation with criterion outcome; in backward regression, predictor variables are removed one at a time, starting with the variable that contributes the least to criterion outcome; allows for fewest possible predictors
Hierarchical regression
researcher controls regression analysis, adding variables in order consistent with theory
Canonical R
correlation between two or more IVs and two or more DVs; evaluate relationship between two sets of variables
Discriminant function analysis
special case of multiple regression; two or more predictors and one criterion that is nominal (rather than interval or ratio); allows to predict membership in group
Loglinear analysis
sometimes referred to as logit analysis; used to predict a categorical criterion based on categorical predictors
Approaches for causal modeling
not correlations and regressions; path analysis and SEM
Path analysis
applies multiple regression techniques to testing a model that specifies causal links among variables; relies on researcher having developed theoretically-based causal model; straight arrows denote causal relationships, curved denote correlations; path coefficients are analyzed to see if the pattern predicted by the model has emerged
Factor analysis
test of structure that extracts as many significant factors from set of data as possible
Characteristic root
another name for eigenvalues for factors (indicate strength of factors); less than 1.0 usually not interpreted
Factor loadings interpreted
equal to or exceed 0.30
Orthogonal rotation
factor rotation in which axes remain perpendicular; results in factors with no correlation
Communality
how much of a test’s variability is explained by the combination of all the factors; can be calculated in orthogonal rotation by squaring the test’s factor loadings and adding them together
Oblique rotation
factor rotation in which angle between axes is non-perpendicular and factors are correlated
Principal components analysis
subtype of factor analysis; trying to extract factors and there is no empirical or theoretical guidance on the values of the communalities; results in a few uncorrelated factors called components; no prior hypotheses
(Principal) factor analysis
communality values ascertained before analysis
Classical test theory
also called true score model; total variability = true score variability + error variability
Reliability
proportion of true score variability; often symbolized as rxx or rtt; minimum acceptable is 0.80
Content sampling error
occurs when a test, by chance, has items that do or do not tap into a test-taker’s knowledge base
Time sampling error
occurs when a test is given at different points in time and scores differ because of factors related to passage of time
Test heterogeneity error
occurs when a test has heterogeneous items tapping more than one domain
Factors affecting reliability
number of items, homogeneity of items, range of scores, ability to guess
Four estimates of reliability
test-retest reliability, parallel forms reliability, internal consistency reliability, interrater reliability
Coefficient of stability
expression of test-retest reliability
Coefficient of equivalence
expression of parallel forms reliability
Spearman-Brown prophecy formula
used when calculating split-half reliability; tells us how much more reliable the test would be if it were longer
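The prophecy formula (n·r / (1 + (n − 1)·r), where n is the factor by which the test is lengthened) can be sketched in Python; the .60 split-half value is hypothetical:

```python
# Spearman-Brown sketch: predicted reliability when a test is lengthened by
# factor n; n = 2 corrects a split-half correlation up to full test length.
def spearman_brown(r, n=2):
    return n * r / (1 + (n - 1) * r)

full_length = spearman_brown(0.60)   # split-half r of .60 -> .75 at full length
```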
Split-half reliability and speeded tests
split-half reliability inappropriate for speeded tests because only easy items included; preferred test of reliability is alternate forms
Power tests
have items that are of varying difficulty level, and subjects are provided sufficient time to complete them all
Kuder-Richardson (KR-20 and KR-21) and Cronbach’s coefficient alpha
sophisticated estimates of internal consistency reliability; involve analysis of the correlation of each item with every other item in the test; equivalent to taking the mean of the correlation coefficients for every possible split-half; KR-20 (items vary in difficulty) and KR-21 (items of consistent difficulty) are used when items are scored dichotomously, coefficient alpha when they are not
Standard error of measurement
standard deviation of a theoretically normal distribution of test scores obtained by one individual on equivalent tests; assumed to be consistent across all persons; SEM = SDx * sqrt(1 - rxx); ranges from 0 (perfectly reliable test) to the standard deviation of the test (completely unreliable test)
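The standard error of measurement formula (SD × √(1 − rxx)) can be sketched in Python; the IQ-like SD of 15 and reliability of .89 are hypothetical:

```python
# Standard error of measurement sketch: SEM = SD * sqrt(1 - reliability).
def sem(sd, rxx):
    return sd * (1 - rxx) ** 0.5

error_band = sem(15, 0.89)   # hypothetical test: SD 15, reliability .89
```

The boundary cases follow the definition: a perfectly reliable test (rxx = 1.0) has SEM 0, and a completely unreliable test (rxx = 0) has SEM equal to the test's SD.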
Content validity
how adequately a test samples a particular content area; quantified by asking a panel of experts if each item is essential, useful/not essential, or not necessary, yet no numerical validity coefficient is derived
Criterion-related validity
how adequately a test score can be used to infer, predict, or estimate criterion outcome; calculated by using a Pearson r to correlate the test scores (also known as predictor scores) with criterion scores (also known as outcome scores)
Concurrent validity
subtype of criterion-related validity; predictor and criterion measured and correlated at about the same time
Predictive validity
subtype of criterion-related validity; delay between the measurement of the predictor and the criterion
Standard error of the estimate
average amount of error in estimating each person’s criterion score; standard deviation of a theoretically normal distribution of criterion scores obtained by one person measured repeatedly; Sest = SDy * sqrt(1 - rxy^2); ranges from 0 to the standard deviation of the criterion
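The standard error of estimate formula (SDy × √(1 − rxy²)) can be sketched the same way; the criterion SD of 10 and validity coefficient of .60 are hypothetical:

```python
# Standard error of estimate sketch: Sest = SDy * sqrt(1 - rxy^2).
def see(sd_y, r_xy):
    return sd_y * (1 - r_xy ** 2) ** 0.5

est_error = see(10, 0.60)   # 10 * sqrt(1 - .36) = 8.0
```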
Applications of criterion-related validity coefficient for prediction
expectancy tables, Taylor-Russell tables, decision-making theory
Expectancy tables
list the probability that a person’s criterion score will fall in a specified range based on the range in which that person’s predictor score fell
Taylor-Russell tables
numerically describe the amount of improvement that occurs in selection decisions when a predictor test is introduced
Selection ratio
proportion of available openings to number of applicants
Incremental validity
amount of improvement in success rate that results from using predictor test (e.g., if proportion of successful improves from base rate of .4 to .65, incremental validity is .25 or 25%)
Three variables that affect incremental validity
criterion-related validity coefficient of the predictor test, the company’s base rate, and the selection ratio
Decision-making theory
takes the predictions of performance that were based on the predictor tests and compares them with the actual criterion outcome
Item difficulty setting formula
(1.0 + probability of getting item by chance)/2.0
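The optimal-difficulty formula can be sketched in Python; the four-option multiple-choice and true/false chance probabilities are the standard worked examples:

```python
# Optimal item-difficulty sketch: (1.0 + probability correct by chance) / 2.0.
def optimal_difficulty(chance):
    return (1.0 + chance) / 2.0

mc4 = optimal_difficulty(0.25)         # 4-option multiple choice -> 0.625
true_false = optimal_difficulty(0.5)   # true/false item -> 0.75
```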
Item validity
correlation between item score and criterion score
Item-characteristic curve
plot of the relationship between item performance and total score
Item response theory
used to calculate to what extent a specific item on a test correlates with an underlying construct; subject’s performance on a test item as representing the degree to which the subject has a latent trait
Factors affecting criterion-related validity
range of scores, reliability of the predictor and the criterion, criterion contamination
Relationship of reliability of predictor and criterion-related validity
a test must have some reliability to be valid, but a reliable test is not necessarily valid; the validity coefficient cannot exceed the square root of the reliability coefficient, so reliability sets the ceiling for validity even though validity can numerically exceed reliability itself
Correction for attenuation
calculates how much higher validity would be if predictor and criterion were both perfectly reliable
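The correction-for-attenuation formula (rxy / √(rxx · ryy)) can be sketched in Python; the coefficients of .40 validity and .80 reliabilities are hypothetical:

```python
# Correction for attenuation sketch: estimated validity if predictor and
# criterion were both perfectly reliable: rxy / sqrt(rxx * ryy).
def corrected_validity(r_xy, r_xx, r_yy):
    return r_xy / (r_xx * r_yy) ** 0.5

corrected = corrected_validity(0.40, 0.80, 0.80)   # .40 / .80 = .50
```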
Criterion contamination
occurs with subjectively-scored criterion outcomes when the rater is informed of subjects’ predictor scores before assigning them criterion ratings
Ways evidence of construct validity can be obtained
factor analysis or multi-trait, multi-method matrix
Multi-trait, multi-method matrix
table with information about convergent and divergent validity, both of which are necessary for construct validity