Stats, Research Design, Test Construction Flashcards

1
Q

Most effective form of counterbalancing?

A

Latin Square

2
Q

idiographic and nomothetic

A

idiographic = focuses on the individual case (single-subject research designs)
nomothetic = focuses on general principles across multiple subjects

3
Q

Group Designs/Single Subject Designs/Behavioural Measurement

A

Group Designs
-between groups
-within subjects
-mixed designs

Single Subject Designs
-AB
-ABAB
-Multiple Baseline Design: across subjects, situations, and behaviours
-Simultaneous (alternating) treatment design
-changing criterion design

Behavioural Measurement
-time sampling: momentary time sampling, whole-interval sampling
-event recording

4
Q

Conditions of Experimentation

A

-analogue research
-clinical trials

5
Q

Time Frame

A

-cross-sectional
-longitudinal
-cross-sequential

6
Q

Sampling Procedures

A

-simple random sampling
-stratified random sampling
-proportional sampling
-systematic sampling
-cluster sampling

7
Q

Threats to Internal Validity (8)
-factors other than the IV that may have caused change in the DV

A

-history - best control = control group
-maturation - best control = control group
-testing or test practice - best control = Solomon Four-Group Design
-instrumentation - best control = control group
-statistical regression - best control = control group
-selection bias - best avoided by random assignment
-attrition or experimental mortality - to assess its impact, those who drop out should be compared with those who remain on relevant variables via t-tests
-diffusion - best control = tight control of experimental situation

8
Q

Threats to Construct Validity (4)
-factors other than the desired specifics of our intervention that result in differences (intervention-related)

A

-attention and contact with clients
-experimenter expectancies (aka Rosenthal effect) - best control = keep the experimenter blind
-demand characteristics - best control = keep subjects blind to treatment condition
-John Henry effect (aka compensatory rivalry) - best control = groups should not know about each other or be given any sense of competition

9
Q

Threats to External Validity (3)
-factors that interfere with generalizability

A

-sample characteristics
-stimulus characteristics
-contextual characteristics - reactivity –> Hawthorne effect

10
Q

Threats to Statistical Conclusion Validity (4)

A

-low power
-unreliability of measures
-variability in procedures
-subject heterogeneity

11
Q

Descriptive Stats

A
  1. Group Data
    A. Measures of Central Tendency
    -mean, median, mode
    B. Measures of Variability
    -SD, variance, range
    C. Graphs
  2. Individual Scores
    A. Raw Scores
    -percentage correct is a criterion-referenced or domain-referenced score
    B. Percentile Ranks
    -norm-referenced score
    C. Standard Scores
    -Z-score formula: z = (raw score - mean) / SD
    -raw-score formula: raw score = mean + z(SD)
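
A quick sketch of these two conversions (plain Python; the mean of 100 and SD of 15 are made-up example values):

```python
# Convert between raw scores and z-scores.

def z_score(raw, mean, sd):
    # z = (X - M) / SD
    return (raw - mean) / sd

def raw_score(z, mean, sd):
    # X = M + z * SD
    return mean + z * sd

# Example with an assumed mean of 100 and SD of 15
print(z_score(115, 100, 15))     # 1.0
print(raw_score(-2.0, 100, 15))  # 70.0
```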
13
Q

Inferential Stats

A

Parameters = population values: mu (μ) is the population mean and sigma (σ) is the population SD

Standard Error of the Mean
-average amount of deviation in sample means across many samples
-standard error of the mean = population SD / square root of N, where N is the sample size (see the sketch after this card)

Hypothesis Testing
A. Key Concepts
-null hypothesis
-alternative hypothesis
-rejection region aka rejection of unlikely values (tail end of curve) - size of rejection region = alpha
-acceptance or retention region

Correct and Incorrect Decisions
-type 1 error - size of alpha corresponds to this (incorrectly reject null)
-type 2 error - probability of making type 2 error corresponds to beta (incorrectly accepting the null)
-power - ability to correctly reject a false null - increased when sample size is large, the magnitude of the intervention effect is large, random error is small, the statistical test is parametric, and the test is one-tailed; power = 1 - beta; as alpha increases, beta decreases (and power increases)
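
A minimal sketch of the standard error of the mean and the power/beta relationship (example numbers are invented):

```python
import math

def standard_error_of_mean(sd, n):
    # SEM = SD / sqrt(N), where N is the sample size
    return sd / math.sqrt(n)

# Example: SD of 15, sample of 25 people
print(standard_error_of_mean(15, 25))  # 3.0

# Power = 1 - beta (probability of correctly rejecting a false null)
beta = 0.20
power = 1 - beta
print(power)  # 0.8
```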

14
Q

Selecting Statistical Tests

A

Three questions commonly asked:
-questions of difference –> analyzed with Chi-square, Mann-Whitney, t-test, ANOVA etc
-questions of relationship and prediction –> analyzed with Pearson r, Biserial, multiple regression, etc.
-questions of structure or fit –> analyzed with principal components or factor analysis, cluster analysis

15
Q

Tests of Difference

A

Type of Data of the DV
-if Nominal or Ordinal: non-parametric like chi-square, mann-whitney, wilcoxon
-if Interval or Ratio: parametric like t-test and ANOVA
-if more than one DV: MANOVA

-# and levels of the IV
-sample independence or correlation
-assumptions for parametric tests: interval or ratio data; homoscedasticity; normal distribution of data
-assumption for chi-square: independence of observations

Nominal Data = chi square or multiple sample chi square (more than 1 IV); McNemar if groups correlated

Interval/Ratio and Ordinal
-more than one DV ALWAYS equals MANOVA
-one group = single-sample t-test (I/R) or Kolmogorov-Smirnov (ordinal)
-one IV, two groups = t-test (independent or matched/correlated samples)
-more than two groups = ANOVA
-one way ANOVA = one IV; two-way = 2 IV etc - independent data
-2-way ANOVA AKA factorial ANOVA
-mixed or split plot ANOVA = one independent groups IV and one correlated groups IV
-2 IVs, both correlated = repeated-measures factorial ANOVA
-2 IVs, one is blocked = blocked ANOVA
-covariate - a variable that you weren’t interested in that is affecting the outcome - ANCOVA helps to take out that variable

16
Q

Tests of Difference: Degrees of Freedom

A

Single Sample Chi-Square
-nominal data collected for one IV
-df = #groups - 1

Multiple Sample Chi Square
-nominal data collected for two IVs
-df = (#rows - 1) x (#columns - 1)

T-Test for Single Sample
-df = N-1

T-Test for Matched or Correlated Samples
-df = #pairs - 1

T-Test for Independent Samples
-df = N-2

One-Way ANOVAs
-three possible DF: DFtotal; DFbetween; DFwithin
-DF total = N - 1
-DF between = #groups - 1
-DF within = DFtotal - DFbetween
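
The degrees-of-freedom rules above as small helper functions (a sketch; the example numbers are made up):

```python
def df_single_sample_chi_square(n_groups):
    # df = number of groups - 1
    return n_groups - 1

def df_multiple_sample_chi_square(n_rows, n_cols):
    # df = (rows - 1) * (columns - 1)
    return (n_rows - 1) * (n_cols - 1)

def df_one_way_anova(n_subjects, n_groups):
    # df_total = N - 1; df_between = groups - 1; df_within = df_total - df_between
    df_total = n_subjects - 1
    df_between = n_groups - 1
    df_within = df_total - df_between
    return df_total, df_between, df_within

# Example: 30 subjects split into 3 groups
print(df_one_way_anova(30, 3))  # (29, 2, 27)
```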

17
Q

Tests of Difference: Calculating Expected Frequencies in a Chi Square

A

-in a chi-square (e.g., Democrat/Republican by male/female), the count obtained in each category or cell = the obtained (observed) frequencies
-example question: are there significant differences between men and women in voting preference for Democrats vs. Republicans? Use a chi-square
-expected frequencies must be calculated; there are 2 scenarios: 1) only the total N is given (e.g., survey 200 people as to voting preference and gender): expected frequency = total number of people / number of cells = 50 per cell
2) data are given in the cells: sum the row and the column for that cell separately, multiply the totals by one another, and divide by the sample size (see the sketch below)
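
A sketch of the cell-by-cell calculation for scenario 2 (the table counts are invented):

```python
# Expected frequency for one cell of a contingency table:
# E = (row total * column total) / grand total

def expected_frequency(row_total, col_total, grand_total):
    return row_total * col_total / grand_total

# Hypothetical 2x2 table: 120 men / 80 women; 110 Democrats / 90 Republicans; N = 200
print(expected_frequency(120, 110, 200))  # 66.0 expected male Democrats
```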

18
Q

Tests of Difference: Interpreting ANOVAs

A

One-Way ANOVAS
-one IV three groups or more
-ANOVA over multiple individual t-tests because the latter increases likelihood of type 1 error with every test run
-F-ratio = MSbg / MSwg
-MS = mean square (a measure of variability)
-BG = between groups
-WG = within groups
-variability WITHIN groups = error
-variability BETWEEN groups = treatment effect ("good")
-want BG variability high and WG variability low
-when the F-ratio is about 1, the result is not significant - differences between groups are about the same as differences within groups, so all we have is error. As the F-ratio climbs to about 2 or greater, significance becomes likely (see the sketch after this card)
-Post hoc tests: pair-wise comparisons. The most conservative are Scheffe and Tukey - they protect against Type 1 errors (but increase the chance of Type 2 errors). Fisher's LSD gives the least protection against Type 1 error

Two-Way ANOVAs
-allow for main effects and interaction effects
-3 F ratios: one for each IV, one for the interaction
-if multiple effects are significant, first interpret the interaction effects, then interpret main effects in light of the interaction effects
-calculating main effects and interaction see p.35

MANOVAs
-more than one DV
-advantage = protects from type 1 error

Trend Analysis
-extension of an ANOVA
-when ANOVA significant may want to run trend analysis if IV has some kind of quantity e.g., dose of drug - tells you the patterns - e.g., curvilinear?
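
A minimal one-way ANOVA sketch using scipy.stats.f_oneway (the three groups of scores are invented):

```python
from scipy import stats

# Hypothetical scores for three treatment groups
group_a = [4, 5, 6, 5, 4]
group_b = [7, 8, 6, 7, 8]
group_c = [5, 6, 5, 6, 5]

# F = MS_between / MS_within; an F well above 1 suggests real group differences
f_ratio, p_value = stats.f_oneway(group_a, group_b, group_c)
print(f_ratio, p_value)
```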

19
Q

Tests of Relationship and Prediction

A

Bivariate Tests
-look at relationship between two variables only- X and Y
-correlation coefficient - X predictor Y criterion - ranges from -1 to +1
-coefficient of determination - always associated with the correlation; the square of the correlation; represents the amount of variability in Y that is accounted for or explained by X
Simple Linear Regression Equation
-when there is a correlation, prediction is implied - the regression equation is the line of best fit through the scatter plot, determined by the least-squares criterion
-regression equation: Y = a + b(X), where a = intercept and b = slope (see the sketch after this card)

Assumption of Bivariate Correlations
1) linear relationship between X and Y
2) homoscedasticity
3) unrestricted range of scores on X and Y - if you restrict the range, you reduce variability (and thus the correlation)
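
A sketch of fitting the regression line from a correlation, using the standard least-squares relations b = r(SDy/SDx) and a = mean(Y) - b·mean(X) (the paired scores are invented; statistics.correlation needs Python 3.10+):

```python
import statistics as st

# Hypothetical paired scores: predictor X, criterion Y
x = [2, 4, 5, 7, 9]
y = [3, 5, 6, 8, 9]

r = st.correlation(x, y)            # Pearson r
b = r * st.stdev(y) / st.stdev(x)   # slope
a = st.mean(y) - b * st.mean(x)     # intercept
r_squared = r ** 2                  # coefficient of determination

print(f"Y = {a:.2f} + {b:.2f}X, r^2 = {r_squared:.2f}")
```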

20
Q

Bivariate Correlation Coefficients

A

I/R + I/R = Pearson R
Ordinal + Ordinal = Spearman’s Rho or Kendall’s Tau
I/R + nominal dichotomous = biserial (artificial dichotomy) or point-biserial (true dichotomy, i.e., naturally occurring, e.g., male/female)
true dichotomy + true dichotomy = phi
artificial dichotomy + artificial dichotomy = tetrachoric
Curvilinear relationship = Eta

21
Q

Bivariate: Types of Correlations and Variables

A

Zero-Order Correlation
-most basic correlation
-X and Y - believed that there are no extraneous variables

Partial Correlation (First Order Correlation)
-effect of a third variable (Z) is removed
-third variable is removed because it is thought to be influencing or confounding both the predictor and the criterion

Part (Semipartial) Correlation
-remove effect of Z from only X or only Y but not both

Moderator Variable
-a third variable that influences the strength of the relationship between the predictor and the criterion
-sometimes a correlation is stronger at certain points in the scatter plot and weaker at others

Mediator Variable
-explains why there is a relationship between predictor and criterion
-when you take out the mediator, you often no longer have a significant relationship

22
Q

Multivariate Tests

A

-involve several Xs and one or more Ys

Multiple R
-big cousin of little Pearson r
-correlation between two or more Xs and a single Y
-Y is always I/R; at least one X is I/R
-squaring multiple R gives the coefficient of multiple determination

Multiple Regression
-uses multiple R - allows the prediction of the criterion (Y) based on the values of the predictors (Xs) (see the sketch after this card)
-multiple regression equation: Y = a + b1x1 + b2x2 + b3x3 etc
-to optimize ability to predict- desirable to have low correlation between predictors and moderate-high correlation between each predictor and criterion
-multicollinearity: when predictors are highly correlated with each other and are thus essentially redundant
-multiple regression is a compensatory technique (a high score on one predictor can offset a low score on another)
-stepwise regression: computer-generated re: ordering of variables based on how strongly they are related to criterion. Can be backward or forward. forward is adding one at a time starting with strongest. Backward is removing one at a time starting with weakest. allows researcher to come up with fewest possible predictors
-hierarchical regression: researcher controls the analysis, adding variables to the regression analysis in the order that is most consistent with proposed theory

Canonical R and Canonical Analyses
-2 or more Xs AND 2 or more Ys
-allows you to evaluate the relationship between two sets of variables- a predictor set and a criterion set
-Canonical R = the relationship; Canonical Analyses = the prediction

Discriminant Function Analysis
-no correlation- just predicting
-special case of multiple regression equation
-Y is nominal data NOT interval/ratio
-allows a researcher to predict membership in a group based on knowledge of a set of predictor variables

Loglinear Analysis
-used to predict a categorical criterion (Y) but our Xs are also nominal

Path Analysis
-applies multiple regression techniques to testing a model that specifies CAUSAL links among variables (versus correlation can’t speak to causation)
-depends on a researcher having already developed a clearly articulated causal model that rests on a strong theoretical or empirical base
-straight arrows denote causal relationships and are called paths
-variables are described as exogenous or endogenous
-models can be recursive or non-recursive
-estimates of the causal relationships among variables are called path coefficients and are determined by multiple regression equations
-path coefficients are analyzed to see if the pattern predicted by the model has emerged

Structural Equation Modelling
-enables researchers to make inferences about causation
-can be used to test many different causal pathways that involve multiple predictors and criterion variables
-LISREL = linear structural relations - makes distinctions between independent and dependent variables in addition to latent and manifest variables. looks at direct and indirect effects and unidirectional and bi-directional paths
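
An illustrative multiple-regression fit by least squares (numpy; the two-predictor data set is invented):

```python
import numpy as np

# Hypothetical data: two predictors (X1, X2) and one criterion (Y)
X = np.array([[1, 2], [2, 1], [3, 4], [4, 3], [5, 5]], dtype=float)
y = np.array([3, 3, 8, 8, 11], dtype=float)

# Add an intercept column, then solve Y = a + b1*X1 + b2*X2 by least squares
X_design = np.column_stack([np.ones(len(X)), X])
coefs, *_ = np.linalg.lstsq(X_design, y, rcond=None)
a, b1, b2 = coefs
print(f"Y = {a:.2f} + {b1:.2f}*X1 + {b2:.2f}*X2")
```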

23
Q

Tests of Structure

A

-used when the researcher is interested in discovering which variables in the set fit best together or form coherent subsets that are relatively independent of one another
-eg., factor analysis of WAIS subtests

Factor Analysis
-used to reduce a large number of variables into a smaller number of factors
-extracts as many significant factors from the data as possible
-a factor = a dimension that consists of any number of variables
-first factor is always the strongest
-Eigenvalue (aka characteristic root) = tells you the strength of a factor. Eigenvalues less than one are usually not interpreted or considered significant (see the sketch after this card)
-Correlation Matrix = table of intercorrelations among tests or items
-Factor Loadings = determine which variables constitute a common factor. correlations between a variable and the underlying factor. interpreted if they are equal to or exceed plus or minus .30
-Factor rotation - makes the factors distinct and easier to interpret
-Orthogonal Rotations and Communality - results in factors that have no correlation with one another. Communality = how much of a test’s variability is explained by the combo of all the factors
-Oblique Rotations - factors are correlated
-Principal Components Analysis - no empirical or theoretical guidance on the values of communalities. components = uncorrelated factors. Factors are empirically derived - no prior hypotheses
-Principal Factor Analysis - communality values ascertained before the analysis

Cluster Analysis
-involves gathering data on a variety of dependent variables and statistically looking for naturally occurring subgroups in the data
-no a priori hypotheses
-e.g., MMPI-2 for police officers into 3 profile groups
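
A sketch of pulling eigenvalues from a correlation matrix (numpy; the 4-test correlation matrix is invented; per the rule above, factors with eigenvalues above 1 would typically be retained):

```python
import numpy as np

# Hypothetical correlation matrix for four tests
R = np.array([
    [1.0, 0.6, 0.5, 0.1],
    [0.6, 1.0, 0.4, 0.2],
    [0.5, 0.4, 1.0, 0.1],
    [0.1, 0.2, 0.1, 1.0],
])

# Eigenvalues of a symmetric matrix, sorted from strongest factor to weakest
eigenvalues = np.linalg.eigvalsh(R)[::-1]
print(eigenvalues)  # keep factors with eigenvalue > 1
```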

24
Q

Test Construction: Reliability

A

Reliability = consistency in measurement

Classical Test Theory (The True Score Model)
-any obtained score is a combination of truth and error: X = T + E
-Total variability = true score variability + error variability
-Reliability is the proportion of true score variability
-Reliability coefficient is either rxx or rtt
-Minimum acceptable reliability = .80
-common sources of error: content sampling, time sampling, test heterogeneity

25
Q

Test Construction: Factors Affecting Reliability

A

-Number of items: more items = more reliable
-Homogeneity of items: more homogeneous = more reliable
-Range of scores: a full spread of scores (unrestricted range) maximizes reliability - the range of scores obtained is related to the heterogeneity of the subjects - more heterogeneous subjects = greater range of scores
-Ability to guess: easier to guess = lower reliability (e.g., true/false is less reliable than multiple choice)

26
Q

Test Construction: Four Estimates of Reliability

A

1) Test-Retest Reliability (coefficient of stability)
-test and re-test with identical instrument
-correlate people’s scores at two points in time
-are the scores stable when you measure people and measure again using the same instrument
-main source of error in test-retest is time - longer time between test and retest - lower reliability

2) Parallel Forms (Alternate Forms, Equivalent Forms) Reliability
-calculated by correlating the scores obtained by the same group of people on two roughly equivalent but not identical forms of the same test administered at two different points in time
-can be costly and time-consuming
-minimizes practice effects
-major sources of error: time and content sampling

3) Internal Consistency Reliability
-consistency of the scores within the test
-test is administered only once to one group of people
-two ways to assess internal consistency

a) Split-Half Reliability
-split the test in half and correlate the scores obtained on each half by each person
-because the test is split in half, the correlation is based on half the number of items = lower reliability, so it underestimates the true reliability of the full-length test
-so the Spearman-Brown prophecy formula is used (mnemonic: Split-half goes with Spearman-Brown); it tells us how much more reliable the test would be if it were longer (see the sketch after this card)
-source of error is item or content sampling

a1) Speeded Tests - split-half reliability is inappropriate for speeded tests, like Coding on the WAIS
-because subjects typically get all completed items correct, there would be a nearly perfect correlation between the halves of the test, so reliability would be artificially inflated

a2) Power Tests - items that are of varying difficulty level
-subjects provided ample time
-scores expressed in terms of percentage correct

b) Kuder-Richardson & Cronbach’s Coefficient Alpha
-involves analysis of the correlation of each item with every other item in the test
-calculated by taking the mean of the correlation coefficients for every possible split-half
-average of every possible way of splitting the test in half
-Kuder-Richardson = dichotomous
-Cronbach’s Alpha = non-dichotomous

4) Inter-Rater Reliability
-when what you are rating is subjectively scored
-degree of agreement between the scores assigned by two or more raters
-best way to improve = group discussion, practice exercises, feedback
-measured by: percent agreement, pearson r, Kappa statistic, Yule’s Y
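
A sketch of the Spearman-Brown prophecy formula referenced under split-half reliability (standard form: r_new = k·r / (1 + (k − 1)·r), where k is the factor by which the test is lengthened; the .70 half-test correlation is an invented example):

```python
def spearman_brown(r_half, k=2):
    # Projected reliability when a test is lengthened by a factor of k.
    # For split-half reliability, k = 2 (the full test is twice as long as each half).
    return k * r_half / (1 + (k - 1) * r_half)

# Example: the two halves correlate .70, so the full test's reliability is estimated at:
print(round(spearman_brown(0.70), 2))  # 0.82
```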

27
Q

Test Construction: Reliability- Standard Error of Measurement

A

-one way to get an idea of the average amount of measurement error in each person's score would be to construct a theoretical distribution consisting of one person's scores if they were tested hundreds of times with alternate or equivalent forms of the test. This parallels what is done with the standard error of the mean
-the standard deviation of this distribution would indicate the average amount of measurement error, i.e., the standard error of measurement
-the standard error of measurement is the average amount of measurement error expected whenever we use the test to measure anybody
-the assumption is that, for a given test, the amount of measurement error is consistent across all persons
-formula on page 50

Standard Error of Measurement
-range of measurement error = 0.00 to a maximum value of the SD of the test
-if a test is totally unreliable, the standard error of measurement would be equal to the standard deviation of the test
-because there is error in measurement, a subject’s score on a test can never be reported as the true score. Rather, we report scores using confidence bands or confidence intervals
-to calculate a confidence interval, take the obtained score as the centre of a bell curve and add and subtract the standard error of measurement to get the band on either side; you need the obtained score and the standard error (see the sketch after this card)
-possible confidence bands are 68%, 95%, 99%
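
A sketch of the standard error of measurement and a confidence band around an obtained score (standard formula SEM = SD·√(1 − rxx); the SD of 15, reliability of .91, and score of 110 are invented, and 1.96/2.58 are the usual z-values for 95%/99% bands):

```python
import math

def sem(sd, reliability):
    # Standard error of measurement = SD * sqrt(1 - rxx)
    return sd * math.sqrt(1 - reliability)

error = sem(15, 0.91)   # 4.5
score = 110

# 68% band = score +/- 1 SEM; 95% = +/- 1.96 SEM; 99% = +/- 2.58 SEM
print(score - error, score + error)                # 105.5 114.5
print(score - 1.96 * error, score + 1.96 * error)  # 101.18 118.82
```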

28
Q

Validity

A

-defined as the meaningfulness, usefulness, or accuracy of a measure
-can tell us how well a test is measuring what it is supposed to be measuring or how well a test can be used to infer criterion performance

29
Q

Content Validity

A

-addresses how adequately a test samples a particular content area
-is our test measuring the content, knowledge, info that it’s supposed to?
-quantified by asking a panel of experts
-applies to tests that require knowledge of a particular domain or skills

30
Q

Criterion-Related Validity

A

-can we use our test to predict something?
-use pearson r to correlate test scores (predictor scores) with criterion scores (outcome scores)
-test scores = x; outcome scores = y
-rxy ranges from -1.0 to 1.0
-validity as low as .20 considered acceptable
-higher coefficient = more valid predictor
-negative validity coefficient indicates inverse relationship
-when there is a correlation, can construct a regression equation
-to determine how much of the outcome can be accounted for by the predictor, the criterion-related validity coefficient must be squared

Two Subtypes

1) Concurrent (same time e.g., few weeks apart)
-predictor and criterion are measured about the same point in time

2) Predictive Validity (e.g., few years apart)
-delay between the measurement of the predictor and criterion

31
Q

Criterion-Related Validity: Standard Error of Estimate

A

-error having to do with estimating (predicting)
-how much error there is in estimating what we are trying to estimate
-range of standard error is from a minimum value of 0.0 to a maximum value of the standard deviation of the criterion
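
A sketch of the standard error of estimate (standard formula SEest = SDy·√(1 − rxy²); the criterion SD of 10 and the validity values are invented, and the endpoints illustrate the 0-to-SD range noted above):

```python
import math

def se_estimate(sd_criterion, validity):
    # Standard error of estimate = SDy * sqrt(1 - rxy^2)
    return sd_criterion * math.sqrt(1 - validity ** 2)

print(se_estimate(10, 0.0))  # 10.0 - no validity: error equals the criterion SD
print(se_estimate(10, 0.6))  # 8.0
print(se_estimate(10, 1.0))  # 0.0  - perfect validity: no estimation error
```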

32
Q

Criterion-Related Validity: Applications of the Criterion-Related Validity Coefficient

A

-relevant for purposes of prediction
-three applications:

1) Expectancy Tables
-likelihood that criterion scores will fall in a range given the range of scores that a predictor falls

2) Taylor-Russell Tables
-tells you how much of an improvement you will make in your selection or hiring decisions when you use a predictor test
-base rate: rate of successful employees without using a predictor test at all
-selection ratio: proportion of openings to applicants -openings divided by applicants
-incremental validity: the amount of improvement in the success rate when you use a predictor test compared to no test at all
-variables that affect incremental validity: criterion-related validity of the instrument (rxy); base rate (!); selection ratio (!)
-incremental validity is optimized when the base rate is moderate (.5) and the selection ratio is low (.1)

3) Decision-Making Theory
-looks at our predictions based on using our predictor tests and compares them with actual results
-four possible outcomes: 1) true positives; 2) false positives; 3) true negatives; 4) false negatives
-classification of positive and negative is based on whether the person falls above or below the predictor cutoff (to the right hand side is positive)
-false positives are more problematic because those people have been hired and money has been wasted on them
-changing the cutoff affects the number of people in each category; it is less desirable to change the criterion cutoff than the predictor cutoff

33
Q

Criterion-Related Validity: Development of a Predictor Test

A

1) conceptualization - decide on test’s objective, administration, overall format
2) test construction - decide item format and write items- always create many more items than what you end up with
3) test tryout - trying out the items on the available sample
4) Item analysis - analyze items regarding:
a) item difficulty - how many people get it right? proportion - .9 = easy item. want range to be .3-.8 with an average of .5
b) item discrimination - discriminates between high and low scorers
c) item validity - correlation between item score and test score - higher correlation = higher validity
d) item-characteristic curve - plot of the relationship between item performance and total score - basis of IRT - item response theory aka latent trait theory
-IRT - calculate the extent to which the specific item correlates with the underlying construct we think we are measuring. Used to develop individually tailored adaptive tests - answer to one question in a domain area determines whether another question in that area will be asked next
5) Test Revision - subset of items is kept. Criterion-related validity coefficient is calculated. Then, the test is cross-validated - administered to a new sample. When we cross-validate, we end up with shrinkage - validity coefficient shrinks

34
Q

Criterion-Related Validity: Factors Affecting

A

Range of Scores
-unrestricted range of scores - we want a broad range of scores

Reliability of the Predictor
-reliability of the predictor affects the validity
-reliability (xx) puts a ceiling on validity - gives the upper limit of how valid it can be

Reliability of the Predictor and Criterion
-imperfect reliability of predictor and criterion creates measurement error when predictor and criterion are correlated
-correlation is less than what it would have been if the predictor and criterion had been perfectly reliable
-correction for attenuation - calculates how much higher the validity would be if the predictor and the criterion were perfectly reliable (see the sketch after this card)

Criterion Contamination
-occurs with subjectively-scored criterion outcomes when the rater is informed of the predictor scores before assigning rating
-typically results in an inflated or spuriously high criterion-related validity coefficient
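
A sketch of the correction for attenuation (standard formula: corrected rxy = rxy / √(rxx·ryy); the observed validity and reliabilities are invented):

```python
import math

def correct_for_attenuation(r_xy, r_xx, r_yy):
    # Estimated predictor-criterion correlation if both were perfectly reliable
    return r_xy / math.sqrt(r_xx * r_yy)

# Example: observed validity .40, predictor reliability .80, criterion reliability .50
print(round(correct_for_attenuation(0.40, 0.80, 0.50), 2))  # 0.63
```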

35
Q

Construct Validity

A

-asks if my instrument is measuring the trait that I think it’s measuring

Multi-Trait, Multi-Method Matrix
-has information about convergent and divergent (discriminant) validity
-need both for construct validity

Convergent Validity
-correlation of scores on the new test with other available measures of the same trait
-correlation should be moderate-high e.g., .56; monotrait-heteromethod
-always about measuring the same trait and getting a decent correlation

Divergent Validity
-idea that if you are trying to create a test to measure aggression, you want that to diverge from measurement of a different construct e.g., hyperactivity
-correlation should thus be low, e.g., .07
-heterotrait-monomethod
