Measurement Theory and Assessment 1 Flashcards
Psychometrician
Specialist in psychology or education who develops and evaluates psychological tests
Test
A standardised measure for sampling behaviour which describes it using categories or scores
Characteristics of a test
- standardised procedure
- for a specific sample of behaviour
- uses scores or categories
- uses norms or standards
- makes a prediction of non-test behaviour
Norm-referenced test
Performance of the examinee is referenced to a standardisation sample
Criterion-referenced test
Determines where the examinee stands with respect to tightly defined educational objectives
Assessment
Appraising/estimating the magnitude of one or more attributes in a person
Group tests
Suitable to the testing of large groups of individuals simultaneously (e.g. pen-and-paper tests)
Individual tests
Designed to be administered one-on-one
Types of psychological tests
- intelligence tests
- aptitude tests
- achievement tests
- creativity tests
- personality tests
- interest inventories
- behavioural procedures
- neuropsychological tests
Responsibilities of test publishers
- publication and marketing issues
- competence of test purchasers
Responsibilities of test users
- best interests of the client
- confidentiality and the duty to warn
- expertise of the test user
- informed consent
- obsolete tests and the standard of care
- responsible report writing
- communication of test results
- consideration of individual differences
Diagnostics
Getting to know a situation in order to be able to make a decision
Psychodiagnostics
Getting to know an individual’s psychosocial functioning
- reliable and valid description of their psychosocial reality
- find possible explanations for problems
- test possible explanations
Scientific diagnostics
- ideally repeatable
- ideally approach reality
Uses of tests
- problem analysis
- classification and diagnosis
- treatment planning
- program/treatment evaluation
- self-knowledge
- scientific research
Committee on tests and testing in The Netherlands (COTAN)
Criteria
- principles of test construction
- goal
- group
- function
- standardisation
- quality of test material
- quality of test manual
- norms: representative reference group
- reliability: consistency, repeatability
- validity: does the test assess what it aims to?
Tests need to:
- be relevant
- be performed by qualified individuals
- have role integrity
- be confidential
- have informed consent
- be independent and objective
Classical test theory
Test scores are influenced by two factors: consistency factors and inconsistency factors
X = T + e (observed score = true score + measurement error)
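The additive model can be illustrated with a small Python simulation (a sketch; the error SD of 3 is an arbitrary choice):

```python
import random

random.seed(0)

def observed_score(true_score, error_sd=3.0):
    # Classical test theory: observed score X = true score T + random error e
    return true_score + random.gauss(0, error_sd)

# Over many administrations the random error averages out to roughly 0,
# so the mean observed score approaches the true score T.
scores = [observed_score(100) for _ in range(10_000)]
mean_observed = sum(scores) / len(scores)
```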
Sources of measurement error
- item selection: choosing an instrument/parts of an instrument
- test administration: general environment aspects, demeanour of the examiner
- test scoring: subjectively scored tests are vulnerable to mistakes/bias from the scorer
- systematic measurement error: consistent error where something unwanted is measured
Correlation coefficient (r)
Degree of linear relationship between two sets of scores obtained from the same people
Range: -1.00 to 1.00
Positive correlation (r > 0) or negative correlation (r < 0)
The closer r is to 1 (as an absolute value), the stronger the relationship
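A minimal Pearson correlation in plain Python (a sketch; assumes both score lists come from the same people, in the same order):

```python
from math import sqrt

def pearson_r(x, y):
    # Degree of linear relationship between two sets of scores; range -1 to 1
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)
```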
Test-retest reliability
Administering an identical test to the same sample group
Alternate forms reliability
Two tests are independently created to measure the same thing; they typically have the same (or similar) means and standard deviations; the correlation between scores on the two tests (from the same sample group) should be strong and positive
Split-half reliability
Correlate scores from the 1st and 2nd half of a test to each other (instead of administering 2 tests)
Spearman-Brown formula
Corrects for the underestimation of reliability when using split-half reliability
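The prophecy formula is compact; a sketch with n as the lengthening factor (n = 2 corrects a split-half coefficient to full-test length):

```python
def spearman_brown(r, n=2.0):
    # Predicted reliability when a test is lengthened by a factor n;
    # with n = 2 this corrects a split-half reliability coefficient
    return n * r / (1 + (n - 1) * r)
```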
Coefficient alpha (Cronbach’s alpha)
Mean of all possible split-half coefficients, corrected by the Spearman-Brown formula
Range: 0.00 to 1.00
Index of internal consistency of the items; tendency for items to correlate positively
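Alpha can also be computed directly from an items-by-examinees score matrix; a minimal sketch using population variances:

```python
def variance(xs):
    # Population variance (divide by N)
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def cronbach_alpha(item_scores):
    # item_scores: one list of scores per item, same examinees in the same order
    # alpha = k/(k-1) * (1 - sum of item variances / variance of total scores)
    k = len(item_scores)
    totals = [sum(person) for person in zip(*item_scores)]
    item_var_sum = sum(variance(item) for item in item_scores)
    return k / (k - 1) * (1 - item_var_sum / variance(totals))
```

Perfectly parallel items (every item gives identical scores) yield alpha = 1.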
Kuder-Richardson formula
Similar to Cronbach's alpha; used for tests whose items have only two answer options (dichotomous items)
Interscorer reliability
A sample of tests is independently scored by two or more examiners; scores for the tests from each examiner are correlated (should have a strong, positive correlation). Used for subjective scoring tests
Systematic errors
- either positive or negative
- average measurement error is not 0
- can be due to test construction/an inconsistency in the assessed construct
- serve as a measure of validity - how well is the test measuring what it is supposed to
Unsystematic errors
- are random and unpredictable
- are both positive and negative
- average measurement error is 0
- are not related to the true score
- are a measure of reliability - affect the consistency of scores
Raw score
Most basic information provided by a psychological test
e.g. how many questions were answered correctly
Norm group
Sample of examinees, representative of the population for whom the test is intended
Norm-referenced test
Results of an examinee are interpreted using the instrument’s corresponding norms
Measurements of central tendency
Mean: average, good for normally distributed data
Median: middle number/score, better than mean when distribution of data is skewed, used for percentiles
Mode: most common score, shows the peak on a skewed distribution
Percentile
Percentage of people who scored below a specific raw score (e.g. score of 25 > 94th percentile, 94% of participants scored below 25)
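The percentile-rank definition above in code (a sketch; ties are counted as "not below"):

```python
def percentile_rank(raw_score, all_scores):
    # Percentage of examinees who scored below the given raw score
    below = sum(1 for s in all_scores if s < raw_score)
    return 100 * below / len(all_scores)
```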
Standard score
Distance from the mean in standard deviation units, aka a z-score
T-scores
Transformation of z-scores to avoid negative and decimal numbers
M = 50, SD = 10
T = 10z + 50
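The T = 10z + 50 transformation in code:

```python
def t_score(raw, mean, sd):
    # z-score: distance from the mean in SD units; T rescales to M = 50, SD = 10
    z = (raw - mean) / sd
    return 10 * z + 50
```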
Stanine
Raw scores converted to a system using 1 to 9
M = 5, SD ≈ 2
Scores are ranked lowest to highest then put into numbers by percentage:
1 > bottom 4%
2 > next 7%
3 > next 12%
4 > next 17%
5 > next 20%
6 > next 17%
7 > next 12%
8 > next 7%
9 > next 4%
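The percentage bands above accumulate to the cutoffs 4, 11, 23, 40, 60, 77, 89, 96; a sketch mapping a percentile rank to a stanine:

```python
# Cumulative percentage cutoffs from the band table: bottom 4% -> stanine 1,
# next 7% (up to 11%) -> stanine 2, and so on; above 96% -> stanine 9
STANINE_CUTOFFS = [4, 11, 23, 40, 60, 77, 89, 96]

def stanine(percentile):
    for value, cutoff in enumerate(STANINE_CUTOFFS, start=1):
        if percentile <= cutoff:
            return value
    return 9
```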
Random sampling
Each member of the population (or subset thereof) has an equal chance of getting selected
Stratified random sampling
Create strata (groups) from the population based on certain demographics then selecting the sample randomly (can be proportional)
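A sketch of the stratified procedure (drawing a fixed count per stratum for simplicity; proportional allocation is the common variant):

```python
import random

def stratified_sample(population, key, per_stratum, seed=0):
    # Group the population into strata by a demographic key,
    # then draw randomly within each stratum
    rng = random.Random(seed)
    strata = {}
    for person in population:
        strata.setdefault(key(person), []).append(person)
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, min(per_stratum, len(members))))
    return sample
```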
Expectancy table
Shows the relationship between test scores and the expected outcomes on a different, relevant task
e.g. scores on a scholastic aptitude test and subsequent college grade point average
Criterion-referenced test
Compare the examinee’s score to a predefined performance standard
Used often for education purposes
Most commonly used confidence intervals (CI)
68% CI > X ± 1SD
90% CI > X ± 1.65SD
95% CI > X ± 1.96SD
99% CI > X ± 2.58SD
X = score
SD = standard deviation
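The CI formula in code (a sketch; in practice the standard error of measurement is often used in place of SD when building a band around an individual score):

```python
def confidence_interval(score, sd, z):
    # CI = X ± z * SD; z ≈ 1.00 (68%), 1.65 (90%), 1.96 (95%), 2.58 (99%)
    return (score - z * sd, score + z * sd)
```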
Difference score (SEdiff)
Used to determine whether the difference between pre- and post-treatment scores is meaningful or merely due to the unreliability of the tests (measurement error)
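One standard classical formulation is SEdiff = SD × √(2 − r₁ − r₂), combining the shared SD with each test's reliability coefficient; a sketch:

```python
from math import sqrt

def se_diff(sd, r1, r2):
    # Standard error of the difference between two scores, given the shared
    # SD and the reliability coefficients of the two tests
    return sd * sqrt(2 - r1 - r2)
```

With SD = 15 and both reliabilities at .90, SEdiff is about 6.7 points.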
Relative norms
Goal: classify on a continuum
Interpretation: control group
Items should maximally discriminate
Absolute norms
Goal: determine if criterion has been reached
Interpretation: previously determined criterion
Items should be relevant to the criterion
Especially used in education
Norms
Summary of distribution of characteristics in a representative sample
Need to be up-to-date:
- 15 years > outdated
- 20 years > unusable
Summative assessment
- used for selection, qualification, or prognosis
- assessment of learning
e.g. course exam
Formative assessment
- strengths & weaknesses
- aimed at instruction (compare with own scores or those of peers)
- assessment for learning
e.g. polls & feedback on a report
Flynn effect
IQ increases by 3 points every 10 years
Validity
A test is valid to the extent that inferences made from it are appropriate, meaningful, and useful
Types of validity:
- content validity
- criterion-related validity
- construct validity
Content validity
The degree to which the content of a test is representative of the sample of behaviour/construct the test is designed to assess
- affected by a proper selection of items and thorough assessment of the construct
- can be evaluated using an expert panel
Face validity
Does the test look valid to test users, examiners, and examinees?
- more a matter of social acceptability than a technical form of validity
Criterion validity
Correlation between an examinee’s test score and the behaviour/construct you want to predict
- concurrent validity
- predictive validity
Concurrent validity
Assess the behaviour at approximately the same time (usually the same day), using both the predictor and criterion tests
Predictive validity
Assess the behaviour at a separate time (usually with a long period in-between) in order to predict future behaviour; administer the predictor test first, then the criterion test later
Construct validity
The extent to which the test accurately measures the theoretical construct it is supposed to assess; evaluated by correlating the test with other tests
- convergent validity
- discriminant validity
Convergent validity
Assess the relationship between the main test scores to those of a test which assesses the same construct
- ideally > good, high correlation
Discriminant validity
Assess the relationship between the main test scores and test scores on another unrelated test (one which does not assess the same construct)
- ideally > low/no correlation
Test construction process
1. Defining the test
2. Selecting a scaling method
3. Constructing the items & item analysis
4. Revising the test
5. Publishing the test
If the test is found to be inadequate after step 4, return to step 3
Representative scaling methods
- Expert rankings
- Likert scales
- Guttman scales
- Thurstone scales
- Absolute scales
- Empirical scales
Item-difficulty index
Method for testing items
Proportion of examinees who get the item correct in a tryout; identifies the items which should be altered or discarded from the test
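The index is simply the proportion correct in the tryout sample; a sketch:

```python
def item_difficulty(responses):
    # responses: 1 = correct, 0 = incorrect, one entry per tryout examinee;
    # the index is the proportion of examinees who got the item correct
    return sum(responses) / len(responses)
```

Items with extreme values (near 0 or 1) discriminate poorly and are candidates for alteration or removal.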
Item-reliability index
Method for testing items
Items should display internal consistency and good correlation to total test scores
Item-validity index
Method for testing items
Used to identify predictively useful test items; how well does each item contribute to the overall predictive validity
Item-characteristic curves
Method for testing items
Graphical display of the relationship between the probability of a correct response and the examinee’s position on the underlying trait being measured by the test
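A common way to model such a curve is the two-parameter logistic form from item response theory (the parameters here are illustrative, not from the source):

```python
from math import exp

def icc(theta, a=1.0, b=0.0):
    # Two-parameter logistic item-characteristic curve: probability of a
    # correct response given trait level theta, with discrimination a and
    # difficulty b (an examinee at theta = b has a 50% chance of success)
    return 1 / (1 + exp(-a * (theta - b)))
```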
Item-discrimination index
Method for testing items
Statistical index of how efficiently an item discriminates between people who obtain high and low scores on the entire test
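A sketch of the upper-lower version of the index, D = p(upper) − p(lower); comparing the top and bottom 27% of total scorers is a common convention:

```python
def discrimination_index(item_correct, total_scores, fraction=0.27):
    # D = proportion correct among high total scorers minus proportion
    # correct among low total scorers (top/bottom `fraction` of the sample)
    n = max(1, int(len(total_scores) * fraction))
    ranked = sorted(range(len(total_scores)), key=lambda i: total_scores[i])
    low, high = ranked[:n], ranked[-n:]
    p_high = sum(item_correct[i] for i in high) / n
    p_low = sum(item_correct[i] for i in low) / n
    return p_high - p_low
```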
Cross validation
Method for revising a test
Using the original regression equation in a new sample to determine whether the test still predicts the criterion well
Validity shrinkage
Method for revising a test
Often, a test predicts the relevant criterion less accurately with a new sample
Feedback from examinees
Method for revising a test
Receive feedback from the examinees in the try-out sample on the:
- behaviour of examiners
- testing conditions
- clarity of exam instructions
- convenience in using the answer sheet
- perceived suitability of the test
- perceived cultural fairness of the test
- perceived sufficiency of time
- perceived difficulty of the test
- emotional response to the test
- level of guessing
- level/method of cheating by the examinee or others
Factor analysis
Summarises the interrelationships among a large number of variables in a concise and accurate manner as an aid in conceptualisation
CHC Theory broad ability factors
- fluid intelligence/reasoning (Gf)
- crystallised intelligence/knowledge (Gc)
- domain-specific knowledge (Gkn)
- visual-spatial abilities (Gv)
- auditory processing (Ga)
- broad retrieval (Gr)
- cognitive processing speed (Gs)
- decision/reaction time or speed (Gt)
Sternberg’s Triarchic Theory of Intelligence
- componential (analytic) intelligence > executive processes
- experiential (creative) intelligence > dealing with novelty
- contextual (practical) intelligence > adaptation
IQ tests measure…
- problem-solving abilities
- verbal abilities
- global capacity vs specific mental functions
- speed of response and thinking
IQ tests do not measure…
- learning competence
- social competence
IQ experts to know…
- Galton: IQ as sensory keenness (speed)
- Spearman: IQ as a global capacity (g) and specific factors (s)
- Thurstone: IQ as 7 primary mental abilities
- Luria: IQ as simultaneous vs successive processing
- Guilford: IQ as the SOI model; added creativity; model consists of: operations, contents, and products
Cattell-Horn-Carroll (CHC) Theory (1968)
- hierarchical structure of intelligence
- stratum 3: overall capacity (g)
- stratum 2: broad cognitive abilities
- stratum 1: narrow cognitive abilities
Gardner Multiple Intelligences (1983)
- critique on g > no underlying general factor exists
- introduced multiple intelligences: people smart, music smart, etc.
- found evidence in brain studies (localisation)
- evolutionary plausible
Wechsler IQ: broad cognitive skill indexes
- verbal comprehension index (VCI)
- visual spatial index (VSI)
- fluid reasoning index (FRI)
- working memory index (WMI)
- processing speed index (PSI)
Wechsler IQ test: psychometric properties
Full-scale IQ: M = 100, SD = 15, 55 - 145
Indexes IQ: M = 100, SD = 15, 55 - 145
Individual subtests: M = 10, SD = 3, 1 - 19
FSIQ alpha: 0.96 (SEM = 3)
FSIQ test-retest: 0.95