Test Construction and Interpretation Flashcards

1
Q

Define Psychological Test

A

An objective and standardized measure of a sample of behaviour

2
Q

Norm-Referenced Scores: Pros and Cons

A

Pros:
* allows for comparison of an individual's performance on different tests
* E.g. one score may look better, but we can only tell by comparing it to similar others

Cons:
* don’t provide an absolute or universal standard of good or bad performance

3
Q

What is meant by a ‘Sample of Behaviour’ in tests?

A

A measure can’t test ALL of a behaviour; it tests only a sample that should be representative of the entire concept being measured

4
Q

Reliability: define

A

Consistency of results between testings

5
Q

Validity: define

A

The degree to which a test measures what it is designed to measure

6
Q

Test Characteristics

Maximum vs Typical Performance

A

Maximum: examinee's best possible performance

Typical: what an examinee typically does or feels

7
Q

Test Characteristics

Speed
Power
Mastery

A

Speed: response rate is measured
Power: assesses the level of difficulty a person can attain; no time limit
Mastery: determines whether a person can attain a pre-established level of acceptable performance (e.g. the EPPP)

8
Q

Ceiling Effects

A

If a test doesn’t include an adequate range of items at the difficult end, it limits the information the test can provide

E.g. if there aren’t enough challenging questions, everyone may get the max score

Threatens internal validity

9
Q

Floor Effects

A

Not enough items on the easy end, so all low-achieving test-takers are likely to score similarly

Threatens internal validity

10
Q

Ipsative Measure

Define

A

The individual is the frame of reference in score reporting, not a norm group

Questions involve expressing preference for one thing over another

e.g. a personal preference inventory

11
Q

Normative Measure

Define

A

Measures the strength of each attribute assessed by a test

Every item is answered, rather than chosen from amongst other options

12
Q

Classical Test Theory

Reliability

A

People’s test scores consist of 2 things:
1. Truth
2. Error

True Score: the score that reflects the person's actual level of whatever is being measured

Error: factors irrelevant to what is being measured that impact score (e.g. noise, luck, mood)
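
In standard CTT notation (the general model, not tied to any particular test), the decomposition is:

```latex
X = T + E, \qquad \sigma^2_X = \sigma^2_T + \sigma^2_E
```

where X is the observed score, T the true score, and E the error component.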

13
Q

Reliability Coefficient

A

Correlation: 0.0 to +1.0
0.0 = entirely unreliable
0.90 = 90% of observed variability is due to true score differences; 10% due to measurement error
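
As a worked reading (standard CTT interpretation), the reliability coefficient is the proportion of observed-score variance that is true-score variance:

```latex
r_{xx} = \frac{\sigma^2_T}{\sigma^2_X}, \qquad r_{xx} = 0.90 \;\Rightarrow\; 90\%\ \text{true-score variance},\ 10\%\ \text{error variance}
```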

14
Q

Test-Retest Reliability

A

AKA coefficient of stability
Need to get the retest interval right: too soon (practice effects, memory), too far apart (more chance of random error)

Not good for unstable attributes (e.g. mood)

15
Q

Alternate Forms Reliability

A

AKA coefficient of equivalence
Give 2 different forms of a test to the same group
Error comes from content differences between the two forms, or from the time between testings; time error is reduced by giving the forms in immediate succession
Don’t use w/ unstable traits

16
Q

How to measure Internal Consistency Reliability?

A
  1. Split-half reliability
  2. Cronbach’s coefficient alpha
  3. Kuder-Richardson Formula 20
17
Q

Split-Half Reliability

Internal Consistency

A

Divide the test in two and correlate scores on the two halves

Shorter tests are inherently less reliable; the Spearman-Brown formula (see below) can mitigate this by estimating the effect of test length on reliability

Not the most recommended method
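
For reference, the Spearman-Brown correction in its split-half form (standard formula; the .70 value is only an illustration):

```latex
r_{\text{full}} = \frac{2\,r_{\text{half}}}{1 + r_{\text{half}}}, \qquad r_{\text{half}} = 0.70 \;\Rightarrow\; r_{\text{full}} = \frac{1.40}{1.70} \approx 0.82
```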

18
Q

Coefficient Alpha

Internal Consistency

A

Single administration; measures the average degree of inter-item consistency

Used for tests w/ multiple scored items
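
A minimal Python sketch of the alpha calculation; the function name and the 5-examinee, 4-item dataset are hypothetical, purely for illustration:

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's coefficient alpha.

    items: 2-D array, rows = examinees, columns = scored items.
    alpha = (k / (k - 1)) * (1 - sum(item variances) / variance of total scores)
    """
    k = items.shape[1]                          # number of items
    item_vars = items.var(axis=0, ddof=1)       # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)   # variance of examinees' total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical response data (illustrative only)
scores = np.array([
    [3, 4, 3, 4],
    [2, 2, 3, 2],
    [4, 5, 4, 5],
    [1, 2, 1, 2],
    [3, 3, 4, 3],
])
print(round(cronbach_alpha(scores), 2))
```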

19
Q

Kuder-Richardson Formula 20

Internal Consistency

A

Single administration, inter-item consistency

Used on dichotomously scored tests
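
For reference, the standard KR-20 formula (algebraically the same idea as alpha, restricted to 0/1 items):

```latex
KR_{20} = \frac{k}{k-1}\left(1 - \frac{\sum_{i} p_i q_i}{\sigma^2_X}\right)
```

where k = number of items, p_i = proportion answering item i correctly, q_i = 1 - p_i, and σ²_X = variance of total scores.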

20
Q

How to measure Internal Consistency of speed tests?

A

Test-retest or alternate forms
Inter-item methods would yield spuriously perfect reliability estimates

21
Q

Interscorer Reliability

What increases it?

A
  • Raters well trained
  • Raters know they are being observed
  • Scoring categories should be mutually exclusive and exhaustive
22
Q

What does Mutually Exclusive mean?

A

A behaviour belongs to one and only one category

23
Q

Duration Recording

Interscorer Reliability

A

Rater records elapsed time during which target behaviour occurs

24
Q

Frequency Recording

Interscorer Reliability

A

Observer keeps count of no. of times the target behaviour occurs

25
Interval Recording | Interscorer Reliability
Observing the subject at given intervals and noting whether the target behaviour occurs. Good for behaviours with no fixed beginning or end
26
Continuous Recording | Interscorer Reliability
Record all behaviour of the subject during the observation session
27
Standard Error of Measurement
How much error an individual test score can be expected to contain. Used to construct a **confidence interval**, which is the range within which someone's true score is likely to fall
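The standard SEM formula, with a worked example using made-up numbers:

```latex
SEM = SD_X\sqrt{1 - r_{xx}}, \qquad SD_X = 15,\ r_{xx} = 0.89 \;\Rightarrow\; SEM = 15\sqrt{0.11} \approx 5
```

So an observed score of 100 would have an approximate 95% confidence interval of 100 ± 1.96 × 5, i.e. roughly 90 to 110.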
28
What factors affect reliability?
1. Length of test
2. Homogeneity of testing group
3. Floor/ceiling effects
4. Guessing correct answers
29
Content Validity
The extent to which the test items adequately and representatively sample the content area to be measured. Shown through correlation w/ other tests that assess the same content
30
Criterion Related Validity: Define
Is the test useful for predicting an individual's behaviour in specified situations?
Criterion = 'that which is being predicted'
E.g. the SAT is correlated with university GPA to establish the relationship and determine criterion validity
Used in applied situations (selecting employees, college admissions, special classes)
31
Criterion-Related Validity Coefficient
r_xy (x = predictor; y = criterion)
Range: -1.0 to +1.0
Few exceed .60
32
What is the Coefficient of Determination? | Criterion Validity
The square of a correlation coefficient, which shows the proportion of variability in the criterion that is explained by variability in the predictor
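Worked example (the validity coefficient is hypothetical):

```latex
r_{xy} = 0.60 \;\Rightarrow\; r_{xy}^2 = 0.36
```

i.e. about 36% of the variability in the criterion is explained by variability in the predictor.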
33
Concurrent Validation | Criterion Validity
The predictor and criterion data are collected at the same time
It predicts a current behaviour
E.g. a job selection test for therapists is given to current therapists and correlated with their current performance ratings from supervisors
34
When would you use Concurrent Validation? | Criterion Validity
When you need the current status on a criterion
May be used instead of predictive validation for reasons of cost and convenience
35
Predictive Validity | Criterion Validity
Predictor scores are collected first, criterion data are collected later
E.g. does the GRE predict grad school performance?
36
Standard Error of Estimate | Criterion Validity
Used to interpret an individual's predicted score on a criterion measure. There will be a difference between the predicted criterion score and the actual score; the expected size of that difference is the **standard error of estimate**
E.g. using an SAT score to predict GPA via a regression equation
37
Equation for the Standard Error of Estimate | Criterion Validity
SE_est = SD_y √(1 - r_xy²)
* SE_est = standard error of estimate
* SD_y = standard deviation of criterion scores
* r_xy = validity coefficient
This can be used to make a confidence interval
*Likely won't need to remember the equation for the exam, but do need to for the SEM*
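A worked example with assumed numbers (the SD of GPA and the validity coefficient are made up for illustration):

```latex
SD_y = 0.50,\ r_{xy} = 0.60 \;\Rightarrow\; SE_{est} = 0.50\sqrt{1 - 0.36} = 0.50 \times 0.80 = 0.40
```

So a predicted GPA of 3.2 would have a ~68% confidence interval of roughly 2.8 to 3.6 (± 1 SE_est).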
38
How can you use Criterion Validity to make decisions?
**Criterion Cut-off Point:** predict whether someone is likely to make it above the cut-off and be selected (e.g. all students w/ a GPA of 3.0+)
39
What is a predictor's Functional Utility? | Criterion Validity
The increase in correct decision-making that would result from using the predictor as a selection tool. Calculated once the predictor and criterion cut-off points are set
40
4 possibilities for Criterion Cut Off Point Scores
1. **True Positives:** scored above the cut-off, and were successful
2. **False Positives:** scored above the cut-off, not successful
3. **True Negatives:** scored below the cut-off, unsuccessful
4. **False Negatives:** scored below the cut-off, successful
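A minimal Python sketch of tallying the four outcomes from predictor and criterion cut-offs; all scores, cut-offs, and variable names are hypothetical:

```python
# Hypothetical predictor scores, criterion scores, and cut-offs (illustrative only)
predictor = [85, 72, 90, 60, 78, 95, 55, 70]
criterion = [3.4, 2.6, 3.8, 2.9, 3.1, 3.6, 2.2, 3.2]
predictor_cutoff = 75     # selected if predictor score is at or above this
criterion_cutoff = 3.0    # "successful" if criterion score is at or above this

tp = fp = tn = fn = 0
for p, c in zip(predictor, criterion):
    selected = p >= predictor_cutoff
    successful = c >= criterion_cutoff
    if selected and successful:
        tp += 1   # true positive
    elif selected and not successful:
        fp += 1   # false positive
    elif not selected and not successful:
        tn += 1   # true negative
    else:
        fn += 1   # false negative

# Proportion of correct decisions made using the predictor (functional utility idea)
print(tp, fp, tn, fn, (tp + tn) / len(predictor))
```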
41
Heterogeneity of Examinees | Factors that Affect Validity Coefficient
A restricted range of scores will lower the validity coefficient
Homogeneous groups = lower validity coefficient
42
Reliability of Predictor and Criterion | Factors that Affect Validity Coefficient
They must both be reliable for a predictor to be valid
High reliability does not guarantee good validity
43
Moderator Variables | Factors that Affect Validity Coefficient (Differential Validity)
**What are they?** An unrelated variable that affects the validity of the predictor
**Differential Validity:** a test has this if there are different validity coefficients for different groups
44
Cross-Validation | Factors that Affect Validity Coefficient (Shrinkage)
After a test is validated, it is re-validated with a different group of people
**Shrinkage:** when the validity coefficient drops after cross-validation, because the predictor ended up being 'tailor-made' for the original sample
45
Criterion Contamination | Factors that Affect Validity Coefficient
**What is it?** Knowledge of someone's predictor score impacts their criterion score
**Prevention:** people involved in assigning criterion ratings should not know the person's predictor score
46
Construct Validity
**What is it?** The degree to which a test measures the construct it is intended to measure
**How Measured?** Over time, based on an accumulation of evidence
47
Convergent Validity | Construct Validity
**What is it?** different ways of measuring the same trait yield similar results
48
Discriminant Validity | Construct Validity
**What is it?** when a test does NOT correlate with another test that measures something different
49
What is a Multi-Trait Multi-Method Matrix? | Construct Validity
Assessment of 2 or more traits by 2 or more methods.
**Convergent Validity:** tests that measure the same trait have a high correlation, even when different methods are used
**Discriminant Validity:** tests that measure different traits have a low correlation, even when they use the same method
50
4 types of correlation coefficients in the Multitrait-Multimethod Matrix | Construct Validity
1. **Monotrait-monomethod:** correlation between a measure and itself. RELIABILITY
2. **Monotrait-heteromethod:** correlation between two measures of the same trait using different methods
3. **Heterotrait-monomethod:** correlation between two measures of different traits using the same method
4. **Heterotrait-heteromethod:** correlation between two measures of different traits using different methods
51
What is Factor Analysis? | Construct Validity
A stats procedure that reduces a set of many variables to fewer 'themed' variables (underlying constructs/latent variables)
52
Factor Analysis: Factor Loading | Construct Validity
Correlation between a given test and a given factor
Range: -1 to +1
Can be squared to determine the proportion of variability in the test accounted for by the factor
53
Factor Analysis: Communality | Common Variance, Unique Variance (Construct Validity)
**Measures:** the proportion of a test's variance that is attributable to the factors
**How Measured?** Factor loadings are squared and added
**Symbol:** h²
**Common Variance:** variance due to the factors, which affect all parts of the test
**Unique Variance:** variance specific to the test, unrelated to the factors
* Subtract communality from 1.00 to get unique variance
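Worked example with made-up loadings: a test loading .60 on Factor I and .50 on Factor II (orthogonal factors):

```latex
h^2 = 0.60^2 + 0.50^2 = 0.36 + 0.25 = 0.61, \qquad \text{unique variance} = 1.00 - 0.61 = 0.39
```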
54
Explained Variance (Eigenvalues) | Construct Validity
**What are they?** A measure of the amount of variance in all the tests accounted for by a factor
Convert to a percentage: (eigenvalue × 100) / (# of tests)
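Worked example (hypothetical numbers): a factor with an eigenvalue of 2.5 in a battery of 10 tests explains

```latex
\frac{2.5 \times 100}{10} = 25\%\ \text{of the total variance}
```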
55
Interpreting & Naming the Factors | Rotation (Construct Validity)
You must make inferences, based on theory, about what the factors are measuring (e.g. based on the contents of items that load highly on that factor)
**Rotation:** a procedure that places the factors in a new position relative to the tests. Aids in interpretation
56
2 Types of Rotation | Interpreting & Naming Factors (Construct Validity)
1. **Orthogonal:** factors are independent of each other
2. **Oblique:** factors are correlated w/ each other to some degree
*Notes:* communality only exists for orthogonal rotation. Post-rotation, eigenvalues may have changed; eigenvalues are only used for unrotated factors.
57
Factorial Validity | Construct Validity
**What is it?** When a test correlates highly with a factor it would be expected to correlate with
58
Principal Components Analysis | Construct Validity
**Similar to Factor Analysis:**
* reduces a large set of variables to underlying constructs
* factor matrix
* eigenvalues: square & sum factor loadings
* underlying factors are ordered in terms of explanatory power
**Differences from Factor Analysis:**
* factor = principal component/eigenvector
* no distinction between communality and specificity (variance is divided only into explained and error variance)
* factors are always uncorrelated, i.e. no such thing as oblique rotation
59
Cluster Analysis | Construct Validity
**Purpose:** develop a taxonomy/classification
Used to divide a group into similar subtypes (e.g. types of criminals)
**Differences from Factor Analysis:**
* any type of data can be used for CA, whereas only interval or ratio data for FA
* clusters are just categories, not latent variables
* CA is not used when there is a pre-existing hypothesis, whereas FA has one
60
Relationship between Reliability and Validity
A test can be reliable but not valid
For a test to be valid, it must be reliable (if it doesn't have consistent results, it's only measuring random error)
The validity coefficient is less than or equal to the square root of the reliability coefficient
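As a worked illustration of that last point (the reliability value is made up):

```latex
r_{xy} \leq \sqrt{r_{xx}}, \qquad r_{xx} = 0.81 \;\Rightarrow\; \text{maximum possible validity} = \sqrt{0.81} = 0.90
```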
61
Correction for Attenuation | Validity
This equation shows what the validity of a test would be if both the predictor and the criterion were perfectly reliable (i.e. had no measurement error)
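The standard form of the correction, with a worked example using made-up coefficients:

```latex
r_{\text{corrected}} = \frac{r_{xy}}{\sqrt{r_{xx}\, r_{yy}}}, \qquad \frac{0.50}{\sqrt{0.80 \times 0.70}} \approx 0.67
```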
62
How can Item Analysis help Reliability and Validity?
It allows reliability and validity to be built into the test, item by item, as the test is constructed
63
Item Difficulty
* The percentage of examinees who answer the item correctly (item difficulty index; p)
* Moderate-difficulty items are most common; they increase score variability, which increases reliability & validity
* The target difficulty changes based on the purpose of the test
* Average difficulty should be halfway between 1.0 and the level of success expected by chance (see the worked example below)
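Worked example of the 'halfway between 1.0 and chance' rule for a four-option multiple-choice item (chance = .25):

```latex
p_{\text{optimal}} = \frac{1.0 + 0.25}{2} \approx 0.63
```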
64
What scale is associated with the p level according to Anne Anastasi? | Item Difficulty
Ordinal scale
**Why?** Equivalent differences in p values do not indicate equivalent differences in difficulty
e.g. we can conclude which items are easier than others, but the difference in difficulty between one pair of items need not equal the difference between another pair
65
Item Discrimination
The degree to which an item differentiates among examinees in terms of the behaviour it is designed to measure
e.g. depressed people consistently answer the item differently than non-depressed people
66
How to measure Item Discrimination?
**Correlate Item Response with Total Score:** items w/ the highest correlations are kept. Useful when the test measures only one thing
**Correlate Item with Criterion Measure:** choose items that correlate with the criterion but not w/ each other
**Item Discrimination Index (D):** divide the group into the top and bottom 27%. For each item, subtract the % of examinees in the low-scoring group who answered correctly from the % in the high-scoring group (D = U - L); worked example below
*Range:* -100 to +100
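Worked example of D with made-up percentages: if 80% of the upper 27% and 30% of the lower 27% answer an item correctly,

```latex
D = U - L = 80 - 30 = 50
```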
67
Relationship between Item Difficulty and Item Discrimination
Difficulty level places a ceiling on the discrimination index (if everybody or nobody answers an item correctly, there is no discrimination)
Moderate-difficulty items have the best discrimination
68
Item Response Theory: Define | How is it displayed? What does it show?
Based on Item Characteristic Curves, which plot, for individuals at different ability levels, the probability of answering the item correctly
The slope of the curve shows discrimination (steeper curve = greater discrimination)
Shows difficulty, discrimination, and the probability of answering correctly
69
2 Assumptions of Item Response Theory
1. Performance on an item is related to the estimated amount of the latent trait being measured by that item
2. Results of testing are sample-free (*invariance of item parameters*): an item should have the same difficulty & discrimination across all random samples of a population
70
Why do we need to compare peoples test scores to a norm? | Test Interpretation
Because without a reference point, test results mean nothing
71
2 types of Developmental Norms | Test Interpretation
**Mental Age**
* Compare the score to the avg performance of others at different age levels
* Used to calculate the ratio IQ score: (Mental Age / Chronological Age) x 100 (worked example below)
**Grade Equivalent**
* Primarily used for interpretation of educational achievement tests
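Worked ratio-IQ example (hypothetical child): mental age 10, chronological age 8:

```latex
\text{ratio IQ} = \frac{MA}{CA} \times 100 = \frac{10}{8} \times 100 = 125
```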
72
Disadvantages of Developmental Norms | Test Interpretation
* Don't allow for comparison of individuals at different age levels, because the standard deviation is not taken into account
73
Within-Group Norms | Test Interpretation
* Compare the score to those of the most similar standardization sample
* E.g. percentile ranks, standard scores
74
Percentile Ranks | Test Interpretation
* Indicates the percentage of people in the standardization sample who fall below a given raw score
* E.g. 90th percentile = you scored better than 90% of others
* *Disadvantage:* ordinal data, so you can't quantify the difference in scores between someone at the 90th and someone at the 80th percentile rank
75
Standard Scores: define | Test Interpretation
* Shows a raw score's distance from the mean in standard deviation units
* Allows comparison of an individual at different ages
76
4 types of Standard Scores | Test Interpretation
**Z-Score**
* shows how many SDs above/below the mean, e.g. +1.0 = one SD above the mean
**T-Score**
* mean of 50, SD of 10
* T score of 60 = score falls 1 SD above the mean
**Stanine Score**
* literally means 'standard nine'; scores range 1-9
* mean of 5, SD of 2
**Deviation IQ Score**
* mean of 100, SD of 15
* e.g. IQ tests
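
A minimal Python sketch converting one raw score to these scales; the function name, raw score, mean, and SD are hypothetical, and the stanine conversion by rounding z is an approximation of this sketch (real stanines are assigned from percentile bands):

```python
def standard_scores(raw: float, mean: float, sd: float) -> dict:
    """Convert a raw score to common standard-score scales (illustrative sketch)."""
    z = (raw - mean) / sd                       # z-score: mean 0, SD 1
    t = 50 + 10 * z                             # T-score: mean 50, SD 10
    stanine = min(9, max(1, round(5 + 2 * z)))  # stanine: mean 5, SD 2, clipped to 1-9
    deviation_iq = 100 + 15 * z                 # deviation IQ: mean 100, SD 15
    return {"z": z, "T": t, "stanine": stanine, "deviation_IQ": deviation_iq}

# Hypothetical: raw score of 65 on a test with mean 50 and SD 10
print(standard_scores(65, mean=50, sd=10))
# z = +1.5, T = 65, stanine = 8, deviation IQ = 122.5
```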