Lecture 3: Essentials of Test score interpretation Flashcards
Frames of Reference
specific standard used to judge an individual test score
Raw scores
- summarizes a person’s test performance (dichotomous items, polytomous items)
- conveys no meaning beyond expressing how many items were solved
Two sources of information from which frames of reference for test scores are derived
1. Norms
2. Performance criteria
Norms
- norm-referenced test interpretation uses standards based on the performance of specific groups
- test performance or typical behavior of one or more reference groups
- Typical question: how does the test taker's performance compare to that of others?
Performance Criteria
- test scores are interpreted against the demonstration of “performance”
- assess whether and to what extent the desired levels of mastery or performance criteria have been met (educational and clinical settings)
Norm-referenced test interpretation
developmental norms
- ordinal scales based on behavioral sequences
- still in use today with young children
- age vs. theory
- mental age scores
- grade equivalent scores
Within-group norms
provide a way of evaluating a person’s performance against the performance of a reference group (place the test taker’s performance within a normal distribution)
The normative sample
- should be representative of the intended population
- needs to be sufficiently large
- important: recency (re-norming)
Standardization sample
group of individuals on whom the test is originally standardized in terms of administration and scoring procedures, and on whom the test’s norms are developed
The normative sample
often used as synonymous with the standardization sample, but can refer to any group from which norms are gathered
Reference group
any group of people against which test scores are compared
Information needed to evaluate the application of a normative sample
- how large is the normative sample?
- when and where was the sample gathered?
- how were individuals identified and selected in this sample?
- who tested the sample? And how did the examiners qualify to test the sample?
- what was the composition of this normative sample? (age, sex, ethnicity, etc.)
Variants (of norms)
- subgroup norms
- local norms
- convenience norms
Subgroup norms
- when large samples are gathered, norms can be separated into subgroup norms
- formed in terms of any variable that may have a significant impact on test scores or yield comparisons of interest
Local norms
- based on respondents from a specific geographical or institutional setting
Convenience norms
test developers use norms based on a group of people who simply happen to be available at the time (possibly because of financial constraints)
Scores used for expressing within-group norms
1. Percentiles
2. Standard scores
Percentiles
indicate the percentage of persons in the reference group who scored at or below a given raw score
- 50th percentile (the median)
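The "at or below" definition above can be sketched in a few lines. This is only an illustration: the function name `percentile_rank` and the reference-group scores are made up for the example, not taken from the lecture.

```python
def percentile_rank(raw_score, reference_scores):
    """Percentage of reference-group scores at or below raw_score."""
    at_or_below = sum(1 for s in reference_scores if s <= raw_score)
    return 100.0 * at_or_below / len(reference_scores)

# Hypothetical reference group of 10 raw scores (illustrative data only).
reference_group = [12, 15, 15, 18, 20, 21, 23, 25, 27, 30]

print(percentile_rank(20, reference_group))  # 50.0 (5 of 10 scores are <= 20)
```

Note that the result depends entirely on the reference group: the same raw score of 20 would get a different percentile rank against a different group.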
Percentiles vs. Percentages
- percentiles reflect the rank or position of an individual’s performance on a test in comparison to a reference group
- percentage scores reflect the number of correct responses that an individual obtains out of the total number of items
Units are unequal across the range?
- when expressed as percentiles, raw score differences are exaggerated in the middle of the distribution and compressed at the extremes (the percentage of people who score near the middle is greater)
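A small sketch of why percentile units are unequal, assuming a normally distributed reference group. The mean of 100 and SD of 15 are hypothetical values chosen for illustration.

```python
from statistics import NormalDist

# Hypothetical reference distribution (mean 100, SD 15); illustrative only.
reference = NormalDist(mu=100, sigma=15)

def percentile_rank(score):
    """Percentage of the assumed normal reference group at or below score."""
    return 100 * reference.cdf(score)

# The same 5-point raw score difference, once near the mean and once in the tail.
middle_gap = percentile_rank(105) - percentile_rank(100)
extreme_gap = percentile_rank(135) - percentile_rank(130)

print(round(middle_gap, 1), round(extreme_gap, 1))
```

Under these assumptions the 5-point difference near the mean spans many more percentile points than the same difference in the tail, because more people score near the middle.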
Test ceiling
test taker reaches the highest score attainable on an already standardized test; the test's maximum difficulty level is insufficient
Test floor
person fails all the items presented in a test or scores lower than any of the people in the normative sample (insufficient test floor)
Standard scores
- linear transformation of raw scores
- then transformation of z scores into other derived standard scores
Linear transformation of raw scores
changes the units in which scores are expressed while leaving the interrelationships among them unaltered
- normally distributed scores of tests with different means (etc) can be fully compared (as long as the same reference group is used)
Transformation into z-scores
z = (X - reference group mean) / (standard deviation of the reference group)
- indicate the direction in which the original score deviates from the mean of a group, and by how many standard deviation units
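A minimal sketch of the z-score transformation. It assumes the reference group is treated as the whole population, so the population standard deviation (`pstdev`) is used; the scores are invented for the example.

```python
from statistics import mean, pstdev

def z_score(x, reference_scores):
    """z = (X - reference group mean) / reference group SD."""
    # Assumption: population SD (divide by N), not the sample SD.
    return (x - mean(reference_scores)) / pstdev(reference_scores)

# Hypothetical reference group with mean 5 and SD 2 (illustrative data only).
reference_group = [2, 4, 4, 4, 5, 5, 7, 9]

print(z_score(9, reference_group))  # 2.0  (two SDs above the mean)
print(z_score(3, reference_group))  # -1.0 (one SD below the mean)
```

The sign gives the direction of the deviation from the mean, and the magnitude gives its size in SD units.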
Pearson's product-moment correlation
correlation between variable x and y
- mean of the cross-products of z-scores of the correlated variables
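The "mean of cross-products of z-scores" definition can be checked with a short sketch. It assumes z-scores are computed with the population SD (dividing by N), which is what makes the mean of the cross-products equal r; the data are made up.

```python
from statistics import mean, pstdev

def pearson_r(xs, ys):
    """Pearson r as the mean of the cross-products of paired z-scores."""
    mx, my = mean(xs), mean(ys)
    sx, sy = pstdev(xs), pstdev(ys)  # assumption: population SDs
    zx = [(x - mx) / sx for x in xs]
    zy = [(y - my) / sy for y in ys]
    cross_products = [a * b for a, b in zip(zx, zy)]
    return mean(cross_products)

# Hypothetical paired scores (illustrative data only).
x = [1, 2, 3, 4, 5]
y_up = [2, 4, 6, 8, 10]    # perfectly linear, increasing
y_down = [10, 8, 6, 4, 2]  # perfectly linear, decreasing

print(pearson_r(x, y_up), pearson_r(x, y_down))
```

For perfectly linear data the result should be very close to +1 or -1, up to floating-point rounding.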
Deriving standard scores
New standard score = z score x New SD + New Mean
- z scores can take negative values, so they usually undergo additional linear transformations
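The formula above, sketched as code. The two target scales shown (mean 50 with SD 10, as in T scores, and mean 100 with SD 15, as in deviation IQs) are common conventions used here for illustration; the lecture itself does not name a specific scale.

```python
def derived_standard_score(z, new_mean, new_sd):
    """New standard score = z score x New SD + New Mean."""
    return z * new_sd + new_mean

# A negative z score becomes a positive whole-number-like score.
print(derived_standard_score(-1.0, new_mean=50, new_sd=10))   # 40.0
print(derived_standard_score(1.5, new_mean=100, new_sd=15))   # 122.5
```

Because this is a linear transformation, the relative positions of scores stay exactly as they were in z-score units.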
non-linear transformations
convert a raw score into a distribution that has a different shape than the original
- normalized standard scores
- stanines
Normalized test scores
used when a score distribution approximates but does not quite match the normal distribution
raw score -> cumulative percent -> cumulative proportion -> normalized z score
Raw score -> cumulative percent (CP) -> cumulative proportion (cP) -> normalized z score
- raw score and CP are located in the score distribution
- cP is CP/100
- normalized z score is obtained from the tables of areas under the normal curve
-> can then be transformed into various other standard score systems
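The CP -> cP -> normalized z chain can be sketched with Python's standard library, where `NormalDist().inv_cdf` plays the role of the table of areas under the normal curve. The function name `normalized_z` and the example CP values are assumptions for illustration.

```python
from statistics import NormalDist

def normalized_z(cumulative_percent):
    """Convert a cumulative percent (CP) to a normalized z score."""
    cumulative_proportion = cumulative_percent / 100  # cP = CP / 100
    # inv_cdf needs 0 < cP < 1; a CP of exactly 0 or 100 has no finite z.
    return NormalDist().inv_cdf(cumulative_proportion)

print(round(normalized_z(50.0), 2))   # 0.0
print(round(normalized_z(84.13), 2))  # 1.0
```

The resulting z scores can then be passed through a linear transformation (z x New SD + New Mean) to reach other standard score systems.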
Stanine scores
-> transforms all scores in a distribution into single-digit numbers from 1 to 9
-> makes use of the cumulative frequency and cumulative percentage distributions
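A rough sketch of stanine assignment from a cumulative percentage. It uses the conventional stanine bands (4, 7, 12, 17, 20, 17, 12, 7, 4 percent of cases), which give the cumulative upper bounds below; treating a CP that falls exactly on a boundary as belonging to the lower stanine is an assumption of this example.

```python
from bisect import bisect_left

# Cumulative percent at the upper edge of stanines 1 through 8
# (from the conventional 4-7-12-17-20-17-12-7-4 percentage bands).
STANINE_UPPER_BOUNDS = [4, 11, 23, 40, 60, 77, 89, 96]

def stanine(cumulative_percent):
    """Map a cumulative percent (0-100) to a stanine from 1 to 9."""
    # Assumption: a CP exactly on a boundary goes to the lower stanine.
    return bisect_left(STANINE_UPPER_BOUNDS, cumulative_percent) + 1

print(stanine(2))   # 1
print(stanine(50))  # 5
print(stanine(99))  # 9
```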
Intertest comparisons
- comparability is also compromised by…
- difference in scale units
- construct meaning
- additional factors affecting the test results of individuals
Test scores cannot be meaningfully compared when…
- the test or test versions are different
- the reference groups are different
- the score scales differ
Test equating
designed to achieve (some) comparability of scores across tests
(place test scores in a common frame of reference)
Different forms of test equating
- Alternate forms (Parallel tests)
- Anchor tests
- Fixed reference groups
- Simultaneous norming
Alternate forms
consist of two (or more) versions of a test meant to be used interchangeably, intended for the same purpose, and administered in identical fashion
Parallel tests
forms are equated not only in content coverage and procedures, but also in some of their statistical characteristics
Anchor tests
consist of a common set of items administered to different groups of examinees in the context of two or more tests (different skill areas - uniform test)
Fixed reference group
achieve some comparability and continuity of test scores over time
Simultaneous norming
two or more tests are normed on the same standardization sample
(compare the performance of individuals on more than one test, using the same standard)
Item Response Theory (IRT)
models seek to estimate the level of unobservable abilities, traits or psychological constructs (latent variables) and to produce item parameter estimates that are invariant across populations
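To make "latent variable" and "item parameters" concrete, here is a sketch of one common IRT model, the two-parameter logistic (2PL) item response function. The lecture does not name a specific model, so the choice of 2PL and the parameter names `theta`, `a`, and `b` are assumptions for illustration.

```python
import math

def p_correct(theta, a, b):
    """2PL item response function (an assumed model, for illustration).

    theta: latent ability of the test taker
    a: item discrimination parameter
    b: item difficulty parameter
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# When ability equals item difficulty, the probability of success is 0.5.
print(p_correct(theta=0.0, a=1.2, b=0.0))  # 0.5
```

In this model the probability of a correct response rises with ability, and the item parameters describe the item itself rather than any particular group of test takers.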
computerized adaptive testing
an individual’s ability level can be estimated based on their responses to test items during the testing process
Small test revisions
small changes in the wording of items
- comparability of the old and revised versions can be established easily
Major test revisions
reorganization in content, scoring and administrative procedures
- revisions of norm-referenced tests require restandardization of the test with a new normative sample
Need for updated norms
- score norms tend to drift in one direction over time because the population changes across time periods
- Lynn-Flynn effect: long-term upward trend in the level of performance required to obtain any given IQ score (increased pace of restandardization of the Wechsler scales)
Criterion-referenced test interpretation
- specific test purpose
- needs a determined criterion or standard against which performance/response behavior is to be measured
- use of cutoff scores
Specific test purposes (two types)
- to ascertain whether a person has reached a certain level of competence
- to assess to what degree a person meets certain requirements
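The two purposes above can be illustrated with a tiny sketch: a yes/no decision against a cutoff score, and a degree-of-attainment figure. The cutoff of 40, the maximum score of 50, and the test takers are hypothetical values invented for the example.

```python
def reached_competence(score, cutoff):
    """Purpose 1: has the person reached the required level? (yes/no)"""
    return score >= cutoff

def percent_of_criterion(score, max_score):
    """Purpose 2: to what degree does the person meet the requirements?"""
    return 100.0 * score / max_score

CUTOFF = 40     # hypothetical cutoff score
MAX_SCORE = 50  # hypothetical maximum attainable score
scores = {"A": 44, "B": 38}  # illustrative test takers

for name, score in scores.items():
    print(name, reached_competence(score, CUTOFF), percent_of_criterion(score, MAX_SCORE))
```

Each result depends only on the person's own score and the criterion, not on how anyone else performed.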
Validity of criteria? (common characteristics)
- are meant to assess the extent to which test takers are proficient in certain skills or knowledge domains
- are scored in such a way that one person’s performance does not affect the relative standing of others
Norm-referenced test interpretation
seeks to locate performance with regard to the construct the test assesses (frame of reference: other people)
Criterion-referenced test interpretation
seeks to evaluate performance in relation to standards related to the construct itself
- frame of reference: knowledge of a domain or level of competence displayed