Ch3 - Test Score Interpretation Flashcards

1
Q

Raw score

A

A number that summarizes an aspect of a person’s performance on a test
• No meaning by itself: a score cannot be interpreted without a frame of reference (is a high score good or bad?), and even with one we can be misled

2
Q

Norms

A

The test performance of one or more reference groups
○ Norm-referenced test interpretation uses standards based on the performance of specific groups
○ Useful for comparing individuals with one another

3
Q

Normative sample

A

The group(s) of people used to establish norms

4
Q

Performance criteria

A

○ Criterion-referenced interpretation: makes use of procedures designed to assess whether, and to what extent, the desired performance criteria have been met

5
Q

Norm-Referenced Test Interpretation

A

The score is used to place the test taker's performance within a pre-existing distribution and compare it with that of others

6
Q

Developmental norms

A

Ordinal Scales Based on Behavioural Sequences
• The sequence of development can be used as an ordinal scale
• Frame comes from observing/noting uniformities in the order/timing of behavioural attainments across many individuals

Ex:
• Provence Birth-to-Three Developmental Profile: an example of a developmental norm using an ordinal scale
○ Records the timeliness with which a child attains developmental milestones relative to their age, across 8 domains and various age ranges
○ Scores are added to create a performance age, which is compared with the chronological age

7
Q

Theory-Based Ordinal Scales

A

The ordinal scales are based on factors other than age

Example: Ordinal Scales of Psychological Development
○ Based on Piaget's delineation of the order in which cognitive competencies are acquired during infancy and childhood

8
Q

Age-equivalent scores (AKA test ages or test-age equivalents)

A

○ A way of comparing the test taker's performance on a test with the average performance of the normative age group to which it corresponds
§ Ex: a child's raw score = the average raw score of 9-year-olds in the normative group
○ Problematic because development varies within age groups

○ Has many limitations - rarely used in psychology for that reason
How does it work?
• Ex: a test with items ranging from easy to harder
	○ The same test is administered to children in a range of grades (grade 2 to 6)
	○ Expectation: younger kids will not get as far as older ones
	○ *ONLY the means are recorded, not the SDs
		§ Does not take the range of the grade distributions into account - a major flaw
	○ The means increase with each grade

	○ A raw score of 15 corresponds to a grade equivalent of 2.0 (because that is the mean score of grade 2 students at the start of the year)

	○ Grade-equivalent scores between the tabled means are established through interpolation
		§ Between raw scores 15 and 25, there are 10 raw score points
		§ Between grades 2.0 and 3.0, there are 10 one-tenth grade units
		§ So if someone scores 17, their grade-equivalent score will be 2.2
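
The interpolation above can be sketched in a few lines of Python. This is a minimal illustration, not an actual scoring procedure; the grade-level mean scores are the hypothetical values from the example (grade 2 mean = 15, grade 3 mean = 25).

```python
def grade_equivalent(raw_score, grade_means):
    """Interpolate a grade-equivalent score from mean raw scores per grade.

    grade_means: dict mapping grade level -> mean raw score for that grade.
    Only the means are used - which is exactly the flaw noted above.
    """
    grades = sorted(grade_means)
    for lo, hi in zip(grades, grades[1:]):
        lo_mean, hi_mean = grade_means[lo], grade_means[hi]
        if lo_mean <= raw_score <= hi_mean:
            # linear interpolation between the two adjacent grade means
            fraction = (raw_score - lo_mean) / (hi_mean - lo_mean)
            return round(lo + fraction * (hi - lo), 1)
    return None  # score falls outside the tabled means

# Hypothetical means from the example: grade 2 -> 15, grade 3 -> 25
means = {2: 15, 3: 25}
print(grade_equivalent(17, means))  # -> 2.2
```

A raw score of 17 is 2 of the 10 points between the grade 2 and grade 3 means, hence grade equivalent 2.2.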
9
Q

Grade Equivalent Scores

A

Another way of interpreting developmental norms - made possible by the uniformity of the school curriculum

Derived by locating the performance of test takers within the norms of the students at each grade level in the standardization sample
○ Ex: a child scores at the 5th-grade level in English (this does NOT mean that he knows 5th-grade English) and at the 3rd-grade level in math

• Can also be misleading
	○ Curricula still vary
	○ The advance expected between grades varies
	○ Not all children will attain their grade scores, and that is okay
10
Q

Within-Group norms

A

Compare one’s score to the performance of one or more reference groups

11
Q

The Normative Sample Requirements

A

• Should be representative of the kinds of individuals for whom the test is intended
• Needs to be sufficiently large to ensure the stability of the values obtained
○ Tests that require specialized samples may have smaller ones
• Needs to be recent

12
Q

Standardization sample

A

The group on whom the test is originally standardized, in terms of administration and scoring procedures and the establishment of norms

13
Q

Reference group

A

Any group of people against which test scores are compared

14
Q

Subgroup Norms

A

A large sample can be further divided into smaller subgroups (age, gender, etc) for which norms can be established

15
Q

Local Norms

A

• Reference groups drawn from a specific geographic/institutional setting

16
Q

Convenience Norms

A

• Norms based on people who were available at the time of testing

17
Q

Percentile score (disadvantages)

A

The relative position of a test taker compared with the reference group
○ Most test takers understand them easily
○ Raw scores can easily be compared with percentile ranks

• Disadvantages:
	○ In a normative sample there is a lowest and a highest score, which can be labeled the 0th and 100th percentiles - but such limits cannot be fixed when interpreting scores from the larger population
	○ Because scores cluster in the middle of the distribution and thin out at the extremes, percentile units are unequal: small raw-score differences near the median produce large percentile differences, while large differences at the tails produce small ones
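
One common convention computes a percentile rank as the percentage of the reference group scoring below a given score, plus half of those scoring exactly at it. A minimal sketch (the normative sample here is hypothetical):

```python
def percentile_rank(score, reference_scores):
    """Percentage of the reference group below the given score,
    counting half of those who score exactly at it (a common convention)."""
    below = sum(1 for s in reference_scores if s < score)
    at = sum(1 for s in reference_scores if s == score)
    return 100 * (below + 0.5 * at) / len(reference_scores)

# Hypothetical normative sample: scores cluster in the middle
sample = [10, 12, 13, 14, 14, 15, 15, 15, 16, 16, 17, 18, 20]
print(percentile_rank(15, sample))  # -> 50.0
```

Note that a one-point raw-score change near the middle of this sample shifts the percentile rank far more than the same change near either tail, illustrating the unequal-units problem above.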
18
Q

Test Ceiling and Test Floor

A
• Test ceiling: the highest score attainable on an already standardized test - someone reaching it means that the test might be too easy for them (insufficient ceiling)
• Test floor: if a person fails all the items or scores lower than anyone in the normative sample, the test might be too hard for them (insufficient floor)
19
Q

Linear transformation

A

Changes the units in which scores are expressed while leaving the interrelationships among them unaltered
○ The shape of a linearly derived scale-score distribution is the same as that of the original score distribution

	1. Convert raw scores into z scores: a z score indicates the relative position of a score within a distribution
		*The value of a z score represents the original score's distance from the mean in standard deviation units: z = (X - M) / SD
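
The z-score formula is a one-liner; the distribution example below (mean 100, SD 12) is hypothetical and matches the SD comparison used later in these cards.

```python
def z_score(raw, mean, sd):
    """Distance of a raw score from the mean, in SD units.
    A linear transformation: relative positions are preserved,
    so the shape of the distribution does not change."""
    return (raw - mean) / sd

print(z_score(112, 100, 12))  # -> 1.0 (one SD above the mean)
print(z_score(94, 100, 12))   # -> -0.5 (half an SD below the mean)
```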
20
Q

Additional Systems for Deriving Standard Scores (besides z)

A

• Z scores are usually further transformed because they include +/- signs and decimals
Some score formats have become familiar through specific tests, though their parameters (mean and SD) are set arbitrarily

Ex: T scores: many personality inventories (MMPI and others)
CEEB scores: used for the SAT and GRE
Wechsler scale subtest scores: all subtests of the Wechsler scales and others
Wechsler scale deviation IQs: summary scores of all Wechsler scales and other tests
Otis-Lennon School Ability Indices: used in the Otis-Lennon School Ability Test

What difference does it make if the SD is 12 or 15?
• Ex: 2 tests have a mean of 100 and SDs of 12 and 15 respectively
○ Score of 112 on test 1 (SD = 12) = z score of +1.00 (84th percentile)
○ Score of 112 on test 2 (SD = 15) = z score of +0.80 (79th percentile)
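
Every system in this card is just a linear rescaling of z: new score = new mean + new SD × z. A minimal sketch using the widely published parameters of these formats (T: 50/10, CEEB: 500/100, Wechsler subtest: 10/3, deviation IQ: 100/15):

```python
def from_z(z, new_mean, new_sd):
    """Rescale a z score into any linearly derived standard-score system."""
    return new_mean + new_sd * z

SYSTEMS = {  # (mean, SD) for some familiar formats
    "T score": (50, 10),           # MMPI and many personality inventories
    "CEEB": (500, 100),            # SAT / GRE
    "Wechsler subtest": (10, 3),
    "Deviation IQ": (100, 15),
}

z = 1.0  # one SD above the mean
for name, (m, sd) in SYSTEMS.items():
    print(f"{name}: {from_z(z, m, sd)}")
# T score: 60.0, CEEB: 600.0, Wechsler subtest: 13.0, Deviation IQ: 115.0
```

The same person (z = +1.00) gets very different-looking numbers in each system, which is why the underlying mean and SD must be known before a score can be interpreted.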

21
Q

Deviation IQs

A
• First introduced by Wechsler for the WAIS
• Different from the original ratio IQs
• Now simply called IQ
• Computed by converting raw scores into Wechsler scale scores, adding them, and locating the sum in a normative table
22
Q

Nonlinear Transformations

A

• Those that convert a raw score distribution into a distribution with a different shape than the original

Ex:
• Transforming normally distributed raw scores into percentile rank scores - a nonlinear conversion
○ Transform raw scores into z scores
○ Locate each z in the Table of Areas of the Normal Curve (Appendix C)
○ Derive the proportion/% of the area of the normal curve that falls below that point

Ex 2:
• Normalized standard scores - another type of nonlinear conversion
○ Used when a score distribution approximates but does not quite match the normal distribution
○ Find the % of persons in the reference sample who fall at or below each raw score (Cumulative Percent column)
○ Convert the percentages into proportions
○ Locate the proportions in the Table of Areas of the Normal Curve
○ Obtain the corresponding z scores
○ *The SAME process as for linear transformations can then follow, BUT the results should be labeled as normalized standard scores to indicate that they were not derived from a normal distribution
• Can then be transformed into other scores using the same procedure as for linear conversions

23
Q

Stanines

A
• Transform all the scores into single digits from 1 to 9
• Reduces the time/effort needed to enter the scores on a computer
• Uses a nonlinear conversion of raw scores
• Mean = 5, SD = 2
• Some loss of precision
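
A stanine assignment can be sketched from percentile ranks using the standard 4-7-12-17-20-17-12-7-4 percentage split (a common convention; actual tests publish their own tables):

```python
# Cumulative percentage cut-points for stanines 1..8
# (from the standard 4-7-12-17-20-17-12-7-4 split)
CUTS = [4, 11, 23, 40, 60, 77, 89, 96]

def stanine(percentile_rank):
    """Map a percentile rank (0-100) to a stanine from 1 to 9."""
    for i, cut in enumerate(CUTS, start=1):
        if percentile_rank <= cut:
            return i
    return 9

print(stanine(50))  # -> 5 (the median falls in the middle stanine)
print(stanine(99))  # -> 9
```

The loss of precision is visible here: every percentile rank from 41 to 60 collapses into the single value 5.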
24
Q

Why can't two norm-referenced scores be compared?

A
• Norm-referenced scores can't be compared unless they come from the same normative distribution
• Even when the tests, the norms, and the scale units are the same, test scores don't necessarily have the same meaning
25
Q

Equating Procedures

A

Comparing the scores of individuals/groups across time, or across various psychological functions, against a uniform norm
Ex: comparing college admission test scores over time
Saves money and time on standardization procedures

Goal: make scores from different tests more comparable

26
Q

Alternate forms

A
Creating alternate forms that are alike in the content they cover but vary in their specific items
Useful when someone has to take the same test on separate occasions
Practice effects (score increases attributable to practice) still occur, but to a lesser degree
27
Q

Parallel forms

A

equated in content coverage, procedures AND some statistical characteristics (raw score means and SD, indexes of variability/reliability)

28
Q

Anchor tests

A

When one part of a test (a set of items) is the same in 2 different tests, the tests become comparable even if their normative samples differ. The purpose of the anchor test is to provide a baseline for an equating analysis between different forms of a test

29
Q

Fixed reference groups

A

Anchor tests embedded in each successive form of a test to provide a linkage to one or more earlier forms of the same test
○ The SAT is the best example of fixed-reference-group use
§ Until 1995, the reference group was the test takers of 1941: mean of 500, SD of 100
§ Then the reference group was changed

30
Q

Simultaneous norming (AKA co-norming)

A

Norming 2+ tests on the same sample, which makes it easier to compare performances across them

31
Q

Absolute standard (in criterion-referenced analysis)

A

a. The score of each respondent is compared with an absolute standard, which is:
i. External to the test
ii. Established by content experts in that particular area
iii. Some type of threshold / minimum score that the examinee has to reach
1) A pass-fail system (must be above xyz)
Those tests are more typical for establishing mastery in someone who already has some level of skill
Ex: licensing exams for a certain profession

What is mastery?
The minimum level of skill needed to say that a person has basic competence in the area
What is the threshold score for pass-fail?
Ex: driver's license exam
2 parts - theoretical (threshold might be 20/25 questions, for example) and skill (threshold might be something like 75% of maneuvers)
What constitutes basic mastery, and what should the cutoff be?
Usually established by experts (ex: in public safety, transportation, etc.)

32
Q

What is wrong with grade-equivalent (or age-equivalent) scores?

A

• Grade equivalents are:
○ Simple to understand
○ Parent-friendly
• What's wrong? - 2 reasons
1. Most GE scores are assigned through interpolation, not actual data
2. The standard deviations of the scores are ignored; GEs are based on means only
A problem because GEs don't keep the same meaning as children move into higher grades
The SD increases in higher grades (grade 10 students have a wider SD than grade 2 students) - little kids don't know that much, so their ceilings are limited, but with maturation we see more individual differences - so GEs change as children age
The units are only at an ordinal level of measurement - another problem
Therefore, not good for research purposes

33
Q

What’s wrong with percentiles?

A

Percentiles are:
• Simple
• Descriptive - their meaning is understood easily and gives us some info
What's wrong?
• Almost never suitable for analysis as test scores
• Percentile units are NOT equal; they are only at the ordinal level of measurement - the units are not constant
• The original raw score, or another type of score, would be better

34
Q

Item Response Theory (IRT) (AKA Latent Trait Models)

A

• Procedures that replace the older equating procedures above (fixed reference groups, anchor tests, alternate and parallel forms)
• Latent trait: the models seek to estimate the unobservable qualities underlying behaviour
• IRT applies models to test item data, not whole-test data
○ *Can produce item parameter estimates that are invariant across populations

• Can be used to:

	1. Estimate the probability that people with specified levels of the ability/trait in question will answer an item correctly or in a certain way
	2. Estimate the trait levels needed to have a specified probability of responding in a certain way
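
The first use above can be sketched with the two-parameter logistic (2PL) IRT model, one of the standard latent trait models (this sketch omits the scaling constant D ≈ 1.7 that some formulations include):

```python
import math

def p_correct(theta, a, b):
    """2PL IRT model: probability that a person with trait level theta
    answers correctly an item with discrimination a and difficulty b."""
    return 1 / (1 + math.exp(-a * (theta - b)))

# When ability equals item difficulty, the probability is exactly 0.5
print(p_correct(theta=0.0, a=1.0, b=0.0))  # -> 0.5

# Higher ability relative to difficulty -> higher probability
print(p_correct(theta=2.0, a=1.0, b=0.0))
```

The second use runs the same curve in reverse: solving for the theta that yields a specified probability on a given item.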
35
Q

Computerized Adaptive Testing (CAT) + advantages/disadvantages

A

Estimating the test taker's ability as they respond to items, and selecting the next items to be shown based on those estimates
• Shortens test length
• Reduces the test taker's frustration when the test would otherwise not be adapted to their ability
• Problems with security, cost, and the inability to change answers
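
A toy sketch of the item-selection step: real CAT systems pick the unused item that is most informative at the current ability estimate, which for equally discriminating items reduces to picking the item whose difficulty is closest to that estimate. The item bank and ability value here are hypothetical.

```python
def next_item(theta, item_bank):
    """Pick the item whose difficulty is closest to the current ability
    estimate - a toy stand-in for maximum-information item selection."""
    return min(item_bank, key=lambda b: abs(b - theta))

bank = [-2.0, -1.0, 0.0, 1.0, 2.0]  # hypothetical item difficulties
theta = 0.5                          # current ability estimate
print(next_item(theta, bank))  # -> 0.0 (first of the two equally close items)
```

After each response, theta is re-estimated and the loop repeats, which is how the test adapts and shortens.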

36
Q

Why do we conduct test revisions, and what are they used for?

A
• A test's name may not always indicate the test's content
• When a test is revised, an edition number can be added, or its name can change
• Giving the two versions of a test to the same group and comparing results indicates whether the versions are interchangeable
• Major revisions require re-standardization
37
Q

The Flynn Effect

A

The increase in the level of performance required to obtain the same score across 2 successive versions of a test (meaning the test gets harder, to adjust for the population's improving performance)
○ Does not mean that people are becoming more intelligent - other factors may explain it
○ Creates debate: e.g., the execution of convicts whose scores are on the verge of the cutoff for mental retardation

38
Q

Criterion-Referenced Test Interpretation (2 types of standards for those tests)

A

When a person's performance has to be determined to have reached a certain level or not
• Performance is compared with pre-established criteria, not with the performance of others
• Criterion: may refer to either knowledge of a specific domain or competence in a skill
• Often, but not always, uses cutoff scores or score ranges

2 underlying sets of standards for those tests:

	1. The amount of knowledge of a domain
	2. The level of competence in a skill

The criteria for competence or knowledge can be quantitative (a certain %), more qualitative, or even on an all-or-none basis

39
Q

What type of test are school exams considered to be? Define this type

A

Content- or domain-referenced tests

There needs to be a clearly defined domain of content from which to assess knowledge
The selection of items and the definition of that domain should be made by experts
Requires a table of specifications: with cells that state the number of items/tasks to be included in the test for each learning objective

40
Q

Define performance assessment

How is the scoring/evaluation like?

A

• Assesses competence in tasks that are more realistic, complex, and time-consuming than those in content- or domain-referenced tests
• Assesses performance through displays of behaviour (work samples, etc.)
○ Criterion = the quality of the performance itself or of its product

	○ Evaluation and Scoring in the Assessment of Performance
		§ Relies more on subjective judgment than content- or domain-referenced assessments do
		§ Can also be objective (e.g., when quality is defined by speed)
		§ Most assessments involve:
			□ Identifying/describing qualitative criteria for evaluation
			□ Developing a method for applying the criteria (rating scales, scoring rubrics)
41
Q

Define mastery testing (+ expectancy tables/charts)

A

When a test score is used to predict an individual's future performance on a certain criterion
• Expectancy tables: show the distribution of test scores for one or more groups of individuals, cross-tabulated against their criterion performance
• Expectancy charts: used when criterion performance in a job/program/etc. can be classified as either successful or unsuccessful
○ Present the distribution of scores along with the % of people at each score interval who succeeded/failed on the criterion

42
Q

Name the 2 fundamental differences between norm-referenced test interpretation and criterion-referenced interpretation

A
1. In norm-referenced testing, the primary objective is to make distinctions among individuals/groups in terms of the ability/trait assessed
2. In criterion-referenced testing, the primary objective is to evaluate a person's or group's degree of competence or mastery of a skill or knowledge domain in terms of a pre-established standard of performance

Sometimes the same instrument can serve both purposes - but one purpose usually takes precedence, because the tests need to be constructed differently

43
Q

Criterion-Referenced Test Interpretation in Clinical Assessment

A

• The term is not used for personality assessments, since those can't be assessed against performance criteria
• Cut-off scores can be used to establish whether clinical criteria have been met for some disorders
○ The same use of criterion-referenced interpretation as when test scores are used to place someone in an educational/employment setting
○ Ex: Beck Depression Inventory

44
Q

Which methods are best suited for tests whose scores can be interpreted with normative AND criterion-referenced bases? Why?

A

Item Response Theory methods

○ Why - their goal is to estimate a test taker’s position on a latent trait or ability dimension

45
Q

Which default of norm-referenced testing contributes to lowering standards?

A

○ No matter how poorly a student population scores, half of them will be above average

46
Q

What is the issue that equating tries to resolve?

A

When 2 different tests are administered to the same person

• Interpreting and comparing the scores from those 2 tests = a problem

47
Q

Can we compare the scores of 2 different tests together?

A

It depends on whether the normative samples of each test are comparable

48
Q

Describe co-norming

A

2 separate tests whose normative samples overlap

Ex: the SB5 and BG VM II - the normative samples overlapped by about 75-80%
ex: SB5 and BG VM II - the normative samples overlapped by about 75-80%