Test Construction and Interpretation Flashcards

1
Q

Define Psychological Test

A

An objective and standardized measure of a sample of behaviour

2
Q

Norm-Referenced Scores: Pros and Cons

A

Pros:
* allow comparison of an individual's performance across different tests
* E.g. one score may look better than another, but we can only tell by comparing each to the scores of similar others

Cons:
* don’t provide an absolute or universal standard of good or bad performance

3
Q

What is meant by a ‘Sample of Behaviour’ in tests?

A

A measure can't test ALL of a behaviour; it tests only a sample, which should be representative of the entire concept being measured

4
Q

Reliability: define

A

Consistency of results between testings

5
Q

Validity: define

A

The degree to which a test measures what it is designed to measure

6
Q

Test Characteristics

Maximum vs Typical Performance

A

Maximum: an examinee's best possible performance

Typical: what an examinee typically does or feels

7
Q

Test Characteristics

Speed
Power
Mastery

A

Speed: measures how quickly an examinee responds (many easy items, strict time limit)
Power: assesses the level of difficulty a person can attain; no time limit
Mastery: determine if a person can attain pre-established level of acceptable performance (e.g. the EPPP)

8
Q

Ceiling Effects

A

If a test doesn't include an adequate range of items at the difficult end, it limits the information the test can provide

E.g. if there aren’t enough challenging questions, everyone may get the max score

Threatens internal validity

9
Q

Floor Effects

A

Not enough items at the easy end, so all low-achieving test takers are likely to score similarly

Threatens internal validity

10
Q

Ipsative Measure

Define

A

The individual is the frame of reference in score reporting, not a norm group

Questions involve expressing preference for one thing over another

e.g. a personal preference inventory

11
Q

Normative Measure

Define

A

Measures the strength of each attribute assessed by the test

Every item is answered on its own, rather than chosen from among other options

12
Q

Classical Test Theory

Reliability

A

People’s test scores consist of 2 things:
1. Truth
2. Error

True Score: the score that reflects the person's actual level of whatever is being measured

Error: factors irrelevant to what is being measured that impact score (e.g. noise, luck, mood)

13
Q

Reliability Coefficient

A

Correlation: 0.0 to +1.0
0.0 = entirely unreliable
0.90 = 90% of observed variability is due to true score differences; 10% due to measurement error

14
Q

Test-Retest Reliability

A

AKA coefficient of stability
Need to get the retest interval right; too soon (practice effects, memory), too long (more chance of random error)

Not good for unstable attributes (e.g. mood)

15
Q

Alternate Forms Reliability

A

AKA coefficient of equivalence
Give 2 different forms of a test to the same group
Error comes from content differences between the two forms and from time error; time error is reduced by giving the forms in immediate succession
Don’t use w/ unstable traits

16
Q

How to measure Internal Consistency Reliability?

A
  1. Split-half reliability
  2. Cronbach’s coefficient alpha
  3. Kuder-Richardson Formula 20
17
Q

Split-Half Reliability

Internal Consistency

A

Divide the test in two and correlate scores on the two halves

Shorter tests are inherently less reliable; the Spearman-Brown formula can mitigate this by estimating the effect of test length on reliability

Not the most recommended
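A minimal Python sketch of the split-half + Spearman-Brown idea, using hypothetical item responses and an odd/even split (both are illustrative assumptions, not part of the original card):

```python
import numpy as np

# Hypothetical item responses: rows = examinees, columns = items (toy data)
items = np.array([
    [1, 1, 0, 1, 1, 0],
    [1, 0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0, 0],
    [1, 1, 0, 1, 1, 1],
])

# Split the test in two (odd vs. even items) and correlate the half scores
odd_half = items[:, 0::2].sum(axis=1)
even_half = items[:, 1::2].sum(axis=1)
r_half = np.corrcoef(odd_half, even_half)[0, 1]

# Spearman-Brown correction: estimates full-length reliability from the
# half-test correlation (general form: n*r / (1 + (n-1)*r); here n = 2)
r_full = (2 * r_half) / (1 + r_half)
print(round(r_half, 2), round(r_full, 2))
```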

18
Q

Coefficient Alpha

Internal Consistency

A

Single administration, measure average degree of inter-item consistency

Used for tests w/ multiple scored items
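A minimal sketch of the standard coefficient alpha computation, alpha = (k / (k − 1)) × (1 − Σ item variances / total-score variance); the Likert-style responses are hypothetical:

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """items: 2-D array, rows = examinees, columns = scored items."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)        # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)    # variance of total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical Likert-type responses (higher inter-item consistency -> higher alpha)
scores = np.array([
    [4, 5, 4, 4],
    [2, 2, 3, 2],
    [5, 5, 4, 5],
    [3, 3, 3, 2],
    [1, 2, 1, 2],
])
print(round(cronbach_alpha(scores), 2))
```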

19
Q

Kuder-Richardson Formula 20

Internal Consistency

A

Single administration, inter-item consistency

Used on dichotomously scored tests
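For comparison, a sketch of the KR-20 formula, (k / (k − 1)) × (1 − Σpq / total-score variance); the 0/1 responses are hypothetical:

```python
import numpy as np

def kr20(items: np.ndarray) -> float:
    """KR-20 for dichotomously (0/1) scored items."""
    k = items.shape[1]
    p = items.mean(axis=0)                 # proportion answering each item correctly
    q = 1 - p
    total_var = items.sum(axis=1).var()    # variance of total test scores
    return (k / (k - 1)) * (1 - (p * q).sum() / total_var)

# Hypothetical right/wrong responses (rows = examinees, columns = items)
responses = np.array([
    [1, 1, 1, 0, 1],
    [1, 0, 0, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 1, 0, 0, 0],
    [1, 1, 1, 1, 0],
])
print(round(kr20(responses), 2))
```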

20
Q

How to measure Internal Consistency of speed tests?

A

Test-retest or alternate forms
Inter-item methods would yield spuriously (near-)perfect coefficients

21
Q

Interscorer Reliability

What increases it?

A
  • Raters well trained
  • Raters know they are being observed
  • Scoring categories should be mutually exclusive and exhaustive
22
Q

What does Mutually Exclusive mean?

A

A behaviour belongs to one and only one category

23
Q

Duration Recording

Interscorer Reliability

A

Rater records elapsed time during which target behaviour occurs

24
Q

Frequency Recording

Interscorer Reliability

A

Observer keeps count of no. of times the target behaviour occurs

25
Q

Interval Recording

Interscorer Reliability

A

Observing subject at given intervals and noting whether the target behaviour occurs

Good for behaviours with no fixed beginning or end

26
Q

Continuous Recording

Interscorer Reliability

A

Record all behaviour of the subject during the observation session

27
Q

Standard Error of Measurement

A

How much error an individual test score can be expected to have

Used to construct a confidence interval: the range within which someone's true score is likely to fall
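A worked sketch using the standard formula SEM = SD × √(1 − reliability); the SD, reliability, and observed score below are hypothetical:

```python
import math

# SEM = SD * sqrt(1 - reliability); values below are hypothetical
sd, reliability, observed = 15, 0.91, 110
sem = sd * math.sqrt(1 - reliability)          # 15 * 0.3 = 4.5

# ~95% confidence interval for the true score
lower, upper = observed - 1.96 * sem, observed + 1.96 * sem
print(round(sem, 1), (round(lower, 1), round(upper, 1)))   # 4.5, (101.2, 118.8)
```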

28
Q

What factors affect reliability?

A
  1. Length of test
  2. Homogeneity of testing group
  3. Floor/ceiling effects
  4. Guessing correct answers
29
Q

Content Validity

A

The extent to which the test items adequately and representatively sample the content area to be measured

Shown through correlation w/ other tests that assess same content

30
Q

Criterion Related Validity: Define

A

Is it useful for predicting an individual's behaviour in specified situations?

Criterion = 'that which is being predicted'

E.g. the SAT is correlated with university GPA to establish the relationship and determine criterion validity

Used in applied situations (selecting employees, college admissions, special classes)

31
Q

Criterion-Related Validity Coefficient

A

r_xy (x = predictor; y = criterion)
Ranges from -1.0 to +1.0
Few exceed .60

32
Q

What is the Coefficient of Determination?

Criterion Validity

A

The square of the correlation coefficient; it shows the proportion of variability in the criterion that is explained by variability in the predictor (e.g. r_xy = .60 → r² = .36, so 36% of criterion variance is explained)

33
Q

Concurrent Validation

Criterion Validity

A

The predictor and criterion data are collected at the same time

It predicts a current behaviour

E.g. job selection test for therapists given to current therapists, and it is correlated with their current performance ratings from supervisors

34
Q

When would you use Concurrent Validation?

Criterion Validity

A

When you need the current status of a criterion

May be used over predictive for cost and convenience

35
Q

Predictive Validity

Criterion Validity

A

Predictor scores are collected first, criterion data collected later

E.g. does the GRE predict grad school performance?

36
Q

Standard Error of Estimate

Criterion Validity

A

Used to interpret an individual's predicted score on a criterion measure

There will be a difference between the predicted criterion score and the actual score; the standard error of estimate quantifies the expected size of that error

E.g. using SAT score to predict GPA via a regression equation

37
Q

Equation for the Standard Error of Estimate

Criterion Validity

A

SE_est = SD_y × √(1 − r_xy²)
* SE_est = standard error of estimate
* SD_y = standard deviation of criterion scores
* r_xy = validity coefficient

This can be used to make a confidence interval

*Likely won’t need to remember equation for exam. But do need to for SEM
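A worked sketch of the formula above; the SD of the criterion, the validity coefficient, and the predicted GPA are hypothetical:

```python
import math

# SE_est = SD_y * sqrt(1 - r_xy**2); values below are hypothetical
sd_y, r_xy = 0.40, 0.60            # SD of GPA (criterion), SAT-GPA validity coefficient
predicted_gpa = 3.2                # predicted from the regression equation

se_est = sd_y * math.sqrt(1 - r_xy ** 2)       # 0.40 * 0.8 = 0.32

# ~95% confidence interval around the predicted criterion score
print(round(se_est, 2), (round(predicted_gpa - 1.96 * se_est, 2),
                         round(predicted_gpa + 1.96 * se_est, 2)))   # 0.32, (2.57, 3.83)
```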

38
Q

How can you use Criterion Validity to make decisions?

A

Criterion Cut off Point

Predict if someone is likely to make it above the cut off and be selected (e.g. all students w/ GPA of 3.0+)

39
Q

What is a predictor's Functional Utility?

Criterion Validity

A

Determine the increase in correct decision making that would result from using the predictor as a selection tool

Calculated once predictor and criterion cut off points are made

40
Q

4 possibilities for Criterion Cut Off Point Scores

A
  1. True Positives: scored above cut off, and were successful
  2. False Positives: scored above cut off, not successful
  3. True Negatives: scored below cutoff, unsuccessful
  4. False Negatives: scored below cutoff, successful
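A minimal sketch of tallying the four outcomes (and the resulting proportion of correct decisions, which relates to functional utility); the cut-offs and score pairs are hypothetical:

```python
# Hypothetical (predictor score, criterion score) pairs, e.g. (SAT, university GPA)
pairs = [(1400, 3.6), (1350, 2.4), (1100, 2.1), (1050, 3.3), (1500, 3.9), (980, 1.8)]
predictor_cutoff, criterion_cutoff = 1200, 3.0

outcomes = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
for predictor, criterion in pairs:
    selected = predictor >= predictor_cutoff
    successful = criterion >= criterion_cutoff
    if selected and successful:
        outcomes["TP"] += 1       # scored above cut off and were successful
    elif selected:
        outcomes["FP"] += 1       # scored above cut off, not successful
    elif successful:
        outcomes["FN"] += 1       # scored below cut off, but would have succeeded
    else:
        outcomes["TN"] += 1       # scored below cut off and were unsuccessful

correct = (outcomes["TP"] + outcomes["TN"]) / len(pairs)
print(outcomes, round(correct, 2))
```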
41
Q

Heterogeneity of Examinees

Factors that Affect Validity Coefficient

A

A restricted range of scores will lower the validity coefficient

Homogeneous groups = lower validity coefficient

42
Q

Reliability of Predictor and Criterion

Factors that Affect Validity Coefficient

A

They must both be reliable for a predictor to be valid

High reliability does not guarantee good validity

43
Q

Moderator Variables

Factors that Affect Validity Coefficient

Differential Validity

A

What are they? A variable that affects the strength of the relationship between the predictor and the criterion (and therefore the validity coefficient)

Differential Validity: a test has this if there are different validity coefficients for different groups

44
Q

Cross-Validation

Factors that Affect Validity Coefficient

Shrinkage

A

After a test is validated, it is re-validated with a different group of people

Shrinkage: when the validity coefficient drops after cross-validation, because the predictor ended up being 'tailor-made' to the original sample

45
Q

Criterion Contamination

Factors that Affect Validity Coefficient

A

What is it? Knowledge of someone's predictor score impacts their criterion score

Prevention: people involved in assigning criterion ratings should not know the person's predictor score

46
Q

Construct Validity

A

What is it? the degree to which a test measures the construct it is intended to

How Measured? over time, based on accumulation of evidence

47
Q

Convergent Validity

Construct Validity

A

What is it? different ways of measuring the same trait yield similar results

48
Q

Discriminant Validity

Construct Validity

A

What is it? when a test does NOT correlate with another test that measures something different

49
Q

What is a Multi-Trait Multi-Method Matrix?

Construct Validity

A

Assessment of 2 or more traits by 2 or more methods.

Convergent validity is shown if tests that measure the same trait have a high correlation, even when different methods are used

Discriminant validity is shown when two tests that measure different traits have a low correlation, even when they use the same method

50
Q

4 types of correlation coefficients in the Multitrait-Multimethod Matrix

Construct Validity

A
  1. Monotrait-monomethod: correlation between a measure and itself. RELIABILITY
  2. Monotrait-heteromethod: correlation between two measures of same trait w/ different methods
  3. Heterotrait-monomethod: correlation between two measures of different traits using same method
  4. Heterotrait-heteromethod: correlation between two measures of different traits using different methods
51
Q

What is Factor Analysis?

Construct Validity

A

A stats procedure that reduces a set of many variables to fewer ‘themed’ variables (underlying constructs/latent variables)

52
Q

Factor Analysis: Factor Loading

Construct Validity

A

Correlation between a given test and a given factor
+1 to -1
Can be squared to determine the proportion of variability in the test accounted for by the factor

53
Q

Factor Analysis: Communality

Common Variance
Unique Variance

Construct Validity

A

Measures: The proportion of variance of a test that is attributable to the factors

How Measured? factor loadings are squared and added

Symbol: h²

Common Variance: variance the test shares with the other tests through the common factors

Unique Variance: variance specific to test, unrelated to factors
* Subtract communality from 1.00
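A minimal sketch of the square-and-add computation; the two factor loadings are hypothetical:

```python
# Hypothetical loadings of one test on two orthogonal factors
loadings = [0.60, 0.50]

communality = sum(l ** 2 for l in loadings)    # h^2 = 0.36 + 0.25 = 0.61
unique_variance = 1.00 - communality           # 0.39 (variance specific to the test)
print(communality, unique_variance)
```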

54
Q

Explained Variance (Eigenvalues)

Construct Validity

A

What are they? measure of the amount of variance in all the tests accounted for by the factor

Convert to percentage: (eigenvalue × 100) / (number of tests), e.g. an eigenvalue of 2.5 across 10 tests accounts for 25% of the total variance

55
Q

Interpreting & Naming the Factors

Rotation

Construct Validity

A

You must make inferences based on theory about what the factors are measuring (e.g. based on the contents of items that load highly on that factor)

Rotation: a procedure that places factors in a new position relative to the tests. Aids in interpretation

56
Q

2 Types of Rotation

Interpreting & Naming Factors

Construct Validity

A
  1. Orthogonal: factors are independent of each other
  2. Oblique: factors that are correlated w/ each other to some degree

Notes: communality only applies to orthogonal rotations

Post-rotation, eigenvalues may have changed; eigenvalues are only used for unrotated factors.

57
Q

Factorial Validity

Construct Validity

A

What is it? when a test correlates highly with a factor it would be expected to correlate with

58
Q

Principal Components Analysis

Construct Validity

A

Similar to Factor Analysis:
* reduce large set of variables to underlying constructs
* Factor matrix
* Eigenvalues: square & sum factor loadings
* Underlying factors ordered in terms of explanatory power

Differences to Factor Analysis:
* Factor = principal component/eigenvector
* no distinction between communality and specificity (variance is divided only into explained and error variance)
* Factors are always uncorrelated, i.e. no such thing as oblique rotation

59
Q

Cluster Analysis

Construct Validity

A

Purpose: develop a taxonomy/classification
Used to divide a group into similar subtypes (e.g. types of criminals)

Differences to Factor Analysis:
* Any type of data can be used for CA, whereas only interval or ratio for FA
* Clusters are just categories, not latent variables
* Not used when there is a pre-existing hypothesis, whereas FA has one

60
Q

Relationship between Reliability and Validity

A

A test can be reliable but not valid

For a test to be valid, it must be reliable (if it doesn’t have consistent results, it’s only measuring random error)

The validity coefficient is less than or equal to the square root of the reliability coefficient

61
Q

Correction for Attenuation

Validity

A

This equation shows what would happen to the validity coefficient if the predictor and criterion had higher (e.g. perfect) reliability

62
Q

How can Item Analysis help Reliability and Validity?

A

It allows reliability and validity to be built into the test, item by item

63
Q

Item Difficulty

A
  • The percentage of examinees who answer an item correctly (item difficulty index, p)
  • Items of moderate difficulty are generally preferred; they increase score variability, which increases reliability & validity
  • Optimal difficulty changes based on the purpose of the test
  • Avg difficulty should be halfway between 1.0 and the level of success expected by chance (e.g. for 4-option multiple choice, chance = .25, so the optimal average p ≈ .625)
64
Q

What scale is associated with the p level according to Anne Anastasi?

Item Difficulty

A

Ordinal scale
Why? equivalent differences in p value do not indicate equivalent differences in difficulty

e.g. we can conclude which items are easier than others, but equal differences in p values do not imply equal differences in difficulty

65
Q

Item Discrimination

A

Degree to which an item differentiates among examinees in terms of the behaviour it is designed to measure

e.g. depressed people consistently answer an item differently from non-depressed people

66
Q

How to measure Item Discrimination?

A

Correlate Item Response with Total Score: those w/ highest correlation are kept. Useful when test only measures one thing
Correlate Item with Criterion Measure: choose items that correlate with criterion but not w/ each other

Item Discrimination Index: D
Divide the group into the top and bottom 27% of scorers. For each item, subtract the % of examinees in the low-scoring group who answered it correctly from the % in the high-scoring group (D = U - L)
Range: -100 to +100
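A minimal sketch of computing p and D; the responses, total scores, and the rounding of the 27% groups are illustrative assumptions:

```python
import numpy as np

# Hypothetical data: each row is (total test score, 0/1 response to the item of interest)
data = np.array([
    [48, 1], [45, 1], [44, 1], [40, 1], [38, 0], [35, 1],
    [33, 0], [30, 1], [27, 0], [25, 1], [22, 0],
])

p = data[:, 1].mean()                       # item difficulty index

order = data[:, 0].argsort()[::-1]          # rank examinees by total score
n_group = max(1, int(round(0.27 * len(data))))   # size of upper and lower 27% groups
upper = data[order[:n_group], 1]
lower = data[order[-n_group:], 1]

U = 100 * upper.mean()                      # % of high scorers answering correctly
L = 100 * lower.mean()                      # % of low scorers answering correctly
D = U - L                                   # item discrimination index
print(round(p, 2), U, round(L, 1), round(D, 1))
```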

67
Q

Relationship between Item Difficulty and Item Discrimination

A

Difficulty level places a ceiling on discrimination index (if everybody or nobody answers it correctly, there is no discrimination)

Moderate difficulty items have best discrimination

68
Q

Item Response Theory: Define

How is it displayed?
What does it show?

A

Based on Item Characteristic Curves, which plot the probability of answering an item correctly against examinee ability level

Slope on graph shows discrimination (steeper curve = greater discrimination)

Difficulty, discrimination, and probability of answering correctly

69
Q

2 Assumptions of Item Response Theory

A
  1. Performance on item is related to estimated amount of a latent trait being measured by item
  2. Results of testing are sample free (invariance of item parameters)
    An item should have same difficulty & discrimination across all random samples of a population
70
Q

Why do we need to compare peoples test scores to a norm?

Test Interpretation

A

Because without a reference point, test results mean nothing

71
Q

2 types of Developmental Norms

Test Interpretation

A

Mental Age
* Compare score to the avg performance of others at different age levels
* Used to calculate the ratio IQ score: (Mental Age / Chronological Age) × 100, e.g. MA 12 and CA 10 → IQ 120

Grade Equivalent
* Primarily used for interpretation of educational achievement tests

72
Q

Disadvantages of Developmental Norms

Test Interpretation

A
  • Don't allow for comparison of individuals at different age levels, because the standard deviation is not taken into account
73
Q

Within-Group Norms

Test Interpretation

A
  • Compare score to those of most similar standardization sample
  • E.g. percentile ranks, standard scores
74
Q

Percentile Ranks

Test Interpretation

A
  • Indicates the percentage of people in standardization sample who fall below a given raw score
  • E.g. 90th percentile = you scored better than 90% of others
  • Disadvantage: ordinal data, so we can't quantify the difference in scores between someone at the 90th and someone at the 80th percentile rank
75
Q

Standard Scores: define

Test Interpretation

A
  • Show a raw score’s distance from the mean in standard deviation units
  • Can compare an individual at different ages
76
Q

4 types of Standard Scores

Test Interpretation

A

Z-Score
* shows how many SDs above/below the mean. E.g. +1.0 = one SD above the mean
T-Score
* have a mean of 50, SD of 10
* T score 60 = score falls 1 SD above mean
Stanine Score
* Literally means ‘standard 9’, scores range 1-9
* Mean of 5, SD of 2
Deviation IQ Score
* mean 100, SD 15
* E.g. IQ tests
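A minimal sketch of the conversions implied above (raw → z → T, stanine, deviation IQ); the raw score, mean, and SD are hypothetical, and the stanine line uses an approximate linear conversion rather than exact percentile bands:

```python
# Hypothetical raw score from a distribution with mean 80 and SD 8
raw, mean, sd = 88, 80, 8

z = (raw - mean) / sd                          # z-score: SDs above/below the mean -> 1.0
t = 50 + 10 * z                                # T-score: mean 50, SD 10 -> 60
stanine = max(1, min(9, round(5 + 2 * z)))     # stanine: mean 5, SD 2, clipped to 1-9 -> 7
deviation_iq = 100 + 15 * z                    # deviation IQ: mean 100, SD 15 -> 115
print(z, t, stanine, deviation_iq)
```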