Ch3 - Test Score Interpretation Flashcards
Raw score
A number that summarizes an aspect of a person’s performance on a test
• No meaning by itself - it's impossible to interpret a score without a frame of reference (is a high score a good or bad result?) - and even then we can be misled
Norms
test performance of 1+ reference groups
○ Norm-referenced test interpretation uses standards based on the performance of specific groups
○ Useful to compare individuals with one another
Normative sample
the groups we use to establish norms
• Performance criteria
○ Criterion-referenced interpretation: makes use of procedures designed to assess whether and to what extent the desired performance criteria have been met
Norm-Referenced Test Interpretation
The score is used to place the test taker's performance within a pre-existing distribution of scores and compare it with the performance of the reference group
Developmental norms
Ordinal Scales Based on Behavioural Sequences
• The sequence of development can be used as an ordinal scale
• Frame comes from observing/noting uniformities in the order/timing of behavioural attainments across many individuals
Ex:
• Provence Birth-to-Three Developmental Profile: Example of developmental norm using ordinal scale
○ Information about the timing with which a child attains developmental milestones in relation to their age, across 8 domains and various age categories
○ Scores are added to create a performance age, compared with the chronological age
Theory-Based Ordinal Scales
The ordinal scales are based on factors other than age
Example: Ordinal Scales of Psych Development
○ Based on Piaget’s delineation of the order in which cognitive competencies are acquired during infancy / childhood
age equivalent scores (AKA test ages or test-age equivalents)
○ A way of comparing the test taker’s performance on a test with the average performance of the normative age group with which it corresponds
§ Ex: a child's raw score = the average raw score of 9-year-olds in the normative group
○ Problematic because development varies within age groups
○ Has LOTS of limitations - little used in psych for that reason
How does it work?
• Ex: a test with items ranging from easy to harder
○ The same test is administered to children in a range of grades (grade 2 to 6)
○ Expectation: younger kids will get less far than older ones
○ *ONLY the means are recorded, not the SDs
§ Does not take the range of the grade distributions into account - a major flaw
○ The means increase for each grade
○ In other words, all those who score 15 have a grade-equivalent score of 2.0 (because that is the mean for grade 2 students at the start of the year)
○ Grade-equivalent scores between the grade means are established through interpolation (as computed in the sketch below)
§ Between raw scores 15 and 25, there are 10 raw-score points
§ Between grades 2.0 and 3.0, there are 10 grade units
§ If someone scores 17, their grade-equivalent score will be 2.2
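A minimal sketch of that interpolation step, in Python (the grade means of 15 and 25 are the illustrative numbers from the example, not real norms):

```python
def grade_equivalent(raw, lower_mean, upper_mean, lower_grade, upper_grade):
    """Linearly interpolate a grade-equivalent score between two grade means."""
    fraction = (raw - lower_mean) / (upper_mean - lower_mean)
    return lower_grade + fraction * (upper_grade - lower_grade)

# Grade 2 mean = 15, grade 3 mean = 25 (from the example above)
print(grade_equivalent(17, 15, 25, 2.0, 3.0))  # -> 2.2
```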
Grade Equivalent Scores
Another way of interpreting developmental norms - made possible by the uniformity of the school curriculum
derived by locating the performance of test takers within the norms of the students at each grade level in the standardization sample
○ Ex: a child scores at the 5th-grade level in English (which does NOT mean that he knows 5th-grade English) and at the 3rd-grade level in maths
• Can also be misleading:
○ Curricula still vary
○ The advance expected between grades varies
○ Not all children will attain their grade scores, and that's okay
Within-Group norms
Compare one’s score to the performance of one or more reference groups
The Normative Sample Requirements
• Should be representative of the kinds of individuals for whom the tests are intended
• Needs to be sufficiently large, to ensure the stability of the values obtained
○ Tests that require specialized samples may have smaller samples
• Needs to be recent
Standardization sample
group on whom the test is originally standardized in terms of administration /scoring procedures, and establishment of norms
Reference group
Any group of people against which test scores are compared
Subgroup Norms
A large sample can be further divided into smaller subgroups (age, gender, etc) for which norms can be established
Local Norms
• Reference groups drawn from a specific geographic/institutional setting
Convenience Norms
• Norms based on people who were available at the time of testing
Percentile score (disadvantages)
relative position of a test-taker compared to the reference group
• Advantages:
○ Most test takers understand them easily
○ Raw scores can easily be compared with percentile ranks
• Disadvantages:
○ In a normative sample there is a lowest and a highest score, which can be called the 0th and 100th percentiles - but those limits cannot be pinned down when we interpret the scores of a larger population
○ Scores cluster in the middle of the distribution and spread out at the extremes, so equal percentile differences do not represent equal raw-score differences
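A small demonstration of that clustering effect, assuming normally distributed scores with a mean of 100 and an SD of 15 (a made-up scale): the same 5-point raw-score gain covers roughly ten times more percentile points near the mean than in the upper tail.

```python
from statistics import NormalDist

scores = NormalDist(mu=100, sigma=15)

# The same 5-point raw-score difference, near the mean vs. in the tail:
print(round((scores.cdf(105) - scores.cdf(100)) * 100, 1))  # -> 13.1 percentile points
print(round((scores.cdf(135) - scores.cdf(130)) * 100, 1))  # -> 1.3 percentile points
```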
Test Ceiling and Test Floor
- Test ceiling: highest score attainable on an already standardized test - someone reaching it means that the test might be too easy (insufficient ceiling)
- Test floor: if a person fails all the items or scores lower than anyone in the normative sample, the test might be too hard (insufficient floor)
Linear transformation
changes the units in which scores are expressed while leaving the interrelationships among them unaltered
○ The shape of a linearly derived scale score distribution is the same as that of the original score distribution
1. Convert raw scores into z scores: a z score indicates the relative position of a score within a distribution
*The value of a z score represents the original score's distance from the mean in SD units, as sketched below
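A minimal sketch of that conversion (the sample values are made up):

```python
def z_score(raw, mean, sd):
    """Distance of a raw score from the mean, in standard deviation units."""
    return (raw - mean) / sd

print(z_score(65, 50, 10))  # a raw score of 65 on a mean-50, SD-10 scale -> +1.5
```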
Additional Systems for Deriving Standard Scores (besides z)
• Z scores are usually further transformed because they include +/- signs and decimals
Some score formats became associated with specific tests, though their parameters (mean and SD) were chosen arbitrarily
ex: T scores: many personality inventories (MMPI and others)
CEEB: used for the SATs and GREs
Wechsler scale subtest scores: all subtests of Wechsler scales and others
Wechsler scale deviation IQs: summary scores of all Wechsler scales and other tests
Otis-Lennon School Ability Indices: Used in the Otis Group Intelligence Scale
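A sketch of how a z score is rescaled into these systems (new score = mean + z x SD); the means and SDs below are the conventional values for each format:

```python
# Conventional (mean, SD) pairs for common standard-score systems
SYSTEMS = {
    "T score": (50, 10),
    "CEEB": (500, 100),
    "Wechsler subtest": (10, 3),
    "Deviation IQ": (100, 15),
    "Otis-Lennon SAI": (100, 16),
}

def rescale(z, mean, sd):
    """Express a z score in another standard-score system."""
    return mean + z * sd

for name, (mean, sd) in SYSTEMS.items():
    print(f"z = +1.00 -> {name}: {rescale(1.0, mean, sd):g}")
```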
What difference does it make if the SD is 12 or 15?
• Ex: 2 tests have a mean of 100 and SD of respectively 12 and 15
○ Score of 112 on test 1 (SD = 12) = Z score of +1.00 (84th percentile)
○ Score of 112 on test 2 (SD = 15) = Z score of +0.80 (79th percentile)
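A quick check of that arithmetic, using Python's built-in normal distribution in place of the printed Table of Areas:

```python
from statistics import NormalDist

def percentile_from_z(z):
    """Percent of the normal curve falling at or below a given z score."""
    return NormalDist().cdf(z) * 100

print(round(percentile_from_z((112 - 100) / 12)))  # SD = 12 -> z = +1.00 -> 84
print(round(percentile_from_z((112 - 100) / 15)))  # SD = 15 -> z = +0.80 -> 79
```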
Deviation IQs
- 1st introduced by Wechsler for the WAIS
- Different from original ratio IQs
- Now simply called IQ
- Derived by converting raw scores into Wechsler scale subtest scores, adding them, and locating their sum in a normative table
Nonlinear Transformations
• Those that convert a raw score distribution into a distribution that has a different shape than the original
ex:
• Transforming normally distributed raw scores into percentile rank scores - nonlinear conversion
○ Transforming raw into z
○ Locate Z in the Table of Areas of the Normal Curve (Appendix C)
○ Derive the proportion/% of the area of the normal curve that is under that point
ex2:
• Normalized standard scores - another type of nonlinear conversion
○ Used when a score distribution approximates but does not quite match the normal distribution
○ Find the % of persons in the reference sample that fall at or below each raw score (Cumulative Percent column)
○ % are converted into proportions
○ Proportions are located in the Table of Areas of the Normal Curve
○ Obtain the Z scores corresponding
○ *The rescaling is the SAME as for linearly derived scores, BUT they should be labelled as normalized standard scores to indicate that they come from distributions that were not normal
• Can then be transformed into other scores using the same procedure as for linear conversions
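A minimal sketch of those steps (the cumulative percent used here is invented):

```python
from statistics import NormalDist

def normalized_z(cumulative_percent):
    """z score whose area under the normal curve matches the observed
    cumulative percent (% of the sample at or below the raw score)."""
    return NormalDist().inv_cdf(cumulative_percent / 100)

z = normalized_z(69)       # a raw score at the 69th cumulative percent
print(round(z, 2))         # -> ~0.5
print(round(50 + z * 10))  # rescaled as a normalized T score -> 55
```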
Stanines
- Transforms all the scores into digits from 1 to 9
- Reduces the time/effort needed to enter scores into a computer
- Use nonlinear conversion of raw scores
- Mean = 5, SD = 2
- Loss of precision
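A sketch using the conventional stanine bands, which split the distribution into 4, 7, 12, 17, 20, 17, 12, 7, and 4 percent of cases from stanine 1 to 9:

```python
# Cumulative upper bounds (in percentile ranks) for stanines 1 through 8;
# anything above the last bound falls in stanine 9.
CUM_BOUNDS = [4, 11, 23, 40, 60, 77, 89, 96]

def stanine(percentile_rank):
    """Map a percentile rank (0-100) onto a stanine from 1 to 9."""
    for value, bound in enumerate(CUM_BOUNDS, start=1):
        if percentile_rank <= bound:
            return value
    return 9

print(stanine(50))  # -> 5 (the middle stanine)
print(stanine(95))  # -> 8
```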
Why can't two norm-referenced scores be compared?
- Norm-referenced scores can’t be compared unless they come from the same normative distribution
- Even when the tests, the norms, and the scale units are the same, test scores don’t necessarily have the same meaning
Equating Procedures
Comparing scores of individuals/groups across time or in various psychological functions against a uniform norm
Ex: comparing college admission test scores over time
Saves money and time on standardization procedures
Goal: make scores from different tests more comparable
Alternate forms
Creating alternate forms that are alike in the content they cover but vary in their specific items
Useful when someone has to take the same test on separate occasions
Practice effects (score increases attributable to practice) still occur, but to a lesser degree
Parallel forms
equated in content coverage, procedures AND some statistical characteristics (raw score means and SD, indexes of variability/reliability)
Anchor tests
when one part of a test (a set of items) is the same in 2 different tests, so both tests are comparable even though their normative samples might not be the same at all. The purpose of the anchor test is to provide a baseline for an equating analysis between different forms of a test
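A sketch of one classical approach, linear (mean-sigma) equating, which places a form X score on the form Y scale by matching z scores; the anchor-score distributions are invented, and real equating designs are considerably more involved:

```python
from statistics import mean, stdev

def linear_equate(x, scores_on_x, scores_on_y):
    """Place a form X score on the form Y scale by setting
    (x - mean_x) / sd_x = (y - mean_y) / sd_y and solving for y."""
    mx, sx = mean(scores_on_x), stdev(scores_on_x)
    my, sy = mean(scores_on_y), stdev(scores_on_y)
    return my + sy * (x - mx) / sx

# Invented anchor-item score distributions from the two forms:
anchor_on_x = [10, 12, 14, 16, 18]
anchor_on_y = [12, 15, 18, 21, 24]
print(linear_equate(15, anchor_on_x, anchor_on_y))  # -> 19.5
```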
Fixed reference groups
Anchor tests embedded in each successive form of a test to provide a linkage to one or more earlier forms of the same test
○ SATs: best example of fixed reference groups use
§ Until 1995, the reference group was the test takers of 1941: mean of 500, SD of 100
§ Then the scale was recentered on a more recent reference group
Simultaneous norming (AKA co-norming)
norming 2+ tests on the same sample, makes for easier comparison of the performances
Absolute standard (in criterion-referenced analysis)
a. Score for each respondent is compared with an absolute standard, which is:
i. External to the test
ii. Established by content experts of that particular area
iii. Some type of threshold / minimum score that the examinee has to meet or exceed
1) A pass-fail system (must be above xyz)
These tests are typically used to establish mastery in someone who already has some level of skill
Ex: licensing exams to do a certain profession
What is mastery?
The minimum level of skill required to say that a person has basic competence in the area
What is that threshold score for pass-fail?
Ex: driver’s license exam
2 parts - theoretical (threshold might be 20/25 questions, for ex) and skill (threshold might be something like 75% of maneuvers, for ex)
What constitutes basic mastery, what should the cutoff be?
Usually established by experts (ex: in public safety, transportation, etc)
What is wrong with grade equivalents (or age equivalent scores, or grade age equivalent)
• Grade equivalents are:
○ Simple to understand
○ Parent friendly
• What’s wrong? - 2 reasons
1. It relies on interpolation to assign most GE scores, not actual data
2. The SDs of the score distributions are ignored; the GEs are based on means only
This is a problem because GEs do not mean the same thing as children move into higher grades
The SD increases as grades get higher (grade 10 students show a wider spread than grade 2 students) - young children don't know that much yet, so their ceilings are limited, but with maturation we see more individual differences - so the meaning of a GE unit changes as children age
The units are only at an ordinal level of measurement - another problem
Therefore, not good for research purposes
What’s wrong with percentiles?
Percentiles are:
• Simple
• Descriptive - meaning is understood easily, gives us some info
What’s wrong?
• Almost never suitable for statistical analysis as test scores
• Percentile units are NOT equal, they are only at the ordinal level of measurement - units are not constant
• Original raw score would be better, or another type of score
Item Response Theory (IRT) (AKA Latent Trait Models)
• Procedures that replace the older equating procedures above (fixed reference, anchor tests, alternate and parallel forms)
• Latent trait: the models seek to measure the unobservable qualities (latent traits) underlying behaviour
• IRT applies models to item-level data, not whole-test data
○ *Can produce item parameter estimates that are invariant across populations
• Can be used to:
1. Estimate the probability that people with specified levels of the ability/trait in question will answer an item correctly or in a certain way (see the sketch below)
2. Estimate the trait levels needed to have a specified probability of responding in a certain way
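As an illustration, a sketch of one common IRT model, the two-parameter logistic (2PL), which gives the probability of a correct answer as a function of trait level and two item parameters:

```python
import math

def p_correct(theta, a, b):
    """2PL model: probability that a person at trait level theta answers
    an item with discrimination a and difficulty b correctly."""
    return 1 / (1 + math.exp(-a * (theta - b)))

print(round(p_correct(theta=0.0, a=1.0, b=0.0), 2))  # -> 0.5
print(round(p_correct(theta=1.0, a=1.0, b=0.0), 2))  # -> 0.73
```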
Computerized Adaptive Testing (CAT) + advantages/disadvantages
Analyzing the test taker's ability as they respond to items, and selecting the next items to present based on those results
• Shortens test length
• Reduces the frustration test takers experience when a test is not matched to their ability level
• Problems with security, cost, inability to change answers
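A toy sketch of the selection idea: pick the unanswered item whose difficulty sits closest to the current ability estimate (real CAT systems use information-maximizing rules and re-estimate ability after every response):

```python
def next_item(theta_estimate, remaining_items):
    """Select the unanswered item whose difficulty (b) is nearest
    to the current ability estimate (a deliberately simple rule)."""
    return min(remaining_items, key=lambda item: abs(item["b"] - theta_estimate))

items = [{"id": 1, "b": -1.0}, {"id": 2, "b": 0.0}, {"id": 3, "b": 1.5}]
print(next_item(0.3, items))  # -> {'id': 2, 'b': 0.0}
```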
Why do we conduct test revisions, and what are they used for?
- A test’s name may not always indicate the test’s content
- When a test is revised, an edition number can be added, or its name can change
- Giving the two versions of a test to the same group and comparing results indicate if the versions are interchangeable
- Major revisions require re-standardization
The Flynn Effect
Increase in the level of performance required to obtain the same score across 2 different versions of a test (meaning the test is getting harder, to adjust for the population's improving performance)
○ Does not mean that the people are becoming more intelligent - other factors may influence this
○ Creates debate: execution of convicts on the verge of mental retardation
Criterion-Referenced Test Interpretation (2 types of standards for those tests)
When a person’s performance has to be determined to have reached a certain level or not
• Performance will be compared to pre-established criteria, and not the performance of others
• Criterion: may refer to either knowledge of a specific domain or competence
• Often, but not always, uses cutoff scores or score ranges
2 underlying sets of standards for those tests: 1. The amount of knowledge of a domain 2. The level of competence in a skill
The criteria for competency or knowledge can be quantitative (a certain %) or more qualitative, or even on an all-or-none basis
What type of test are school exams considered to be? Define this type
Content- or domain-referenced tests
There needs to be a clearly defined field of content from which to assess knowledge
The selection of items and the definition of that field should be chosen by experts
Requires a table of specifications: with cells that state the number of items/tasks to be included in the test for each learning objective
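As an illustration only (the objectives, cognitive levels, and item counts are all invented), a table of specifications can be sketched as a grid of item counts:

```python
# Rows are learning objectives, columns are cognitive levels,
# and each cell gives the number of items to include (invented numbers).
table_of_specifications = {
    "Fractions": {"Recall": 4, "Application": 6},
    "Decimals": {"Recall": 3, "Application": 5},
    "Percentages": {"Recall": 3, "Application": 4},
}

total_items = sum(sum(row.values()) for row in table_of_specifications.values())
print(total_items)  # -> 25
```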
Define performance assessment
What is the scoring/evaluation like?
• Assess competence in tasks that are more realistic/complex/time-consuming than in content or domain-referenced tests
• Assessing performance through displays of behaviours (work samples, etc)
○ Criterion = quality of the performance itself or of its product
○ Evaluation and Scoring in the Assessment of Performance
§ Relies more on subjective judgement than assessments of competence do
§ Can also be objective (when quality = speed, or similar)
§ Most assessments involve:
□ Identifying/describing qualitative criteria for evaluating performance
□ Developing a method for applying the criteria (rating scales, scoring rubrics)
Define mastery testing (+ expectancy tables/charts)
When a test score is used to predict the future performance of the individual on a certain criterion
• Expectancy tables: show the distribution of test scores for one or more groups of individuals, cross-tabulated against their criterion performance
• Expectancy charts: used when criterion performance in a job/program/else can be classified as either successful or unsuccessful
○ Present the distribution of scores along with the % of people at each score interval who succeeded/failed in terms of the criterion
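A minimal sketch of building an expectancy chart from (score, succeeded) pairs; the data and the 10-point score intervals are invented:

```python
from collections import defaultdict

def expectancy_chart(records, interval=10):
    """Percent succeeding at each score interval, from (score, succeeded) pairs."""
    buckets = defaultdict(lambda: [0, 0])  # interval start -> [successes, total]
    for score, succeeded in records:
        start = (score // interval) * interval
        buckets[start][0] += int(succeeded)
        buckets[start][1] += 1
    return {f"{s}-{s + interval - 1}": round(100 * wins / total)
            for s, (wins, total) in sorted(buckets.items())}

# Invented (test score, success on the job criterion) pairs:
data = [(63, False), (72, True), (75, False), (81, True), (84, True), (88, True)]
print(expectancy_chart(data))  # -> {'60-69': 0, '70-79': 50, '80-89': 100}
```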
Name the 2 fundamental differences between norm-referenced test interpretation and criterion-referenced interpretation
- In norm-referenced-testing, the primary objective is to make distinctions among individuals/groups in terms of the ability/trait assessed
- In criterion-referenced testing, the primary objective is to evaluate a person/group’s degree of competence or mastery of a skill or knowledge domain in terms of a preestablished standard of performance
Sometimes the same instrument can be used for both - but one type of interpretation usually takes precedence, because the two approaches require the tests to be constructed differently
Criterion-Referenced Test Interpretation in Clinical Assessment
• Term not used for personality assessments, since those can’t be assessed with criteria
• Cut-off scores can be used to establish if clinical criteria have been met for some disorders
○ Same use of criterion-referenced interpretation as when test scores are used to place someone in an educational/employment setting
○ Ex: Beck depression inventory
Which methods are best suited for tests whose scores can be interpreted with normative AND criterion-referenced bases? Why?
Item Response Theory methods
○ Why - their goal is to estimate a test taker’s position on a latent trait or ability dimension
Which default of norm-referenced testing contributes to lowering standards?
○ No matter how poorly a student population scores, half of them will be above average
What is the issue that equating tries to resolve?
When 2 different tests are administered to the same person
• Interpreting and comparing the scores from those 2 tests = a problem
Can we compare the scores of 2 different tests together?
Depends if the normative samples of each test are comparable
Describe co-norming
2 separate tests whose normative samples overlap
ex: SB5 and BG VM II - the normative samples overlapped by about 75-80%