Test Worthiness, Pt I Flashcards
Test Worthiness
Four Cornerstones
Validity
Reliability
Cross-Cultural Fairness
Practicality
Test Worthiness
Correlation Coefficient
Correlation
Statistical expression of the Relationship between two sets of scores (or variables)
Test Worthiness
Correlation Coefficient
Positive Correlation
Increase in one variable accompanied by an increase in the other variable
“Direct” relationship
Test Worthiness
Correlation Coefficient
Negative Correlation
Increase in one variable accompanied by a decrease in the other variable
“Inverse” relationship
Test Worthiness
Correlation Coefficient
Correlation coefficient (r)
A number between -1 and +1 that indicates Direction and Strength of the relationship
As “r” approaches +1, strength increases in a direct and positive way
As “r” approaches -1, strength increases in an inverse and negative way
As “r” approaches 0, the relationship is weak or nonexistent (at zero)
Test Worthiness
Correlation, cont’d
The closer to 1 or -1 the stronger the correlation
Graph, Class 1, slide 5
1.0 is a PERFECT positive correlation
-1.0 is a PERFECT negative correlation
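For a concrete feel, here is a minimal Python sketch that computes r for two sets of scores; the score values are hypothetical:

```python
# Minimal sketch: computing the correlation coefficient (r) between two
# sets of scores. The scores below are hypothetical.
from statistics import correlation  # Python 3.10+

test_scores = [70, 75, 80, 85, 90]
gpas = [2.1, 2.4, 3.0, 3.2, 3.8]

r = correlation(test_scores, gpas)  # Pearson's r
print(f"r = {r:.2f}")  # near +1: a strong, direct (positive) relationship
```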
Reliability
Accuracy or Consistency of test scores
Would one score the same if they took the test over, and over, and over again?
Classical Test Theory
Assumes a priori that any measurement of a human personality characteristic will be inaccurate to some degree
Charles Spearman (1904)
Observation = True Score + Error
X = T + E
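A minimal Python sketch of this model, with a hypothetical true score and error spread:

```python
# Spearman's classical model: each observed score X is a stable true
# score T plus random error E. Values below are hypothetical.
import random

random.seed(1)
T = 100                                   # true amount of the attribute
observations = [T + random.gauss(0, 5) for _ in range(5)]  # X = T + E
print([round(x, 1) for x in observations])  # scores scatter around T
```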
Sources of Measurement Error
Item Selection
Test Administration
Test Scoring
Systematic and unsystematic measurement error
Systematic and Random Error
Systematic Error
Impacts All People Who Complete an Instrument (such as misspelled words or a poorly conceived sampling of the behavior represented by the instrument’s questions).
Systematic and Random Error
Unsystematic Errors
Involve factors that affect Individual Expression of a Trait
Item Response Theory
Item Response Function
Relationship between Latent Trait and Probability of Correct Response
Usual standard score range -3 to +3
Item difficulty parameter
Item discrimination parameter
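A minimal sketch of one common item response function, the two-parameter logistic model, with hypothetical parameter values:

```python
# Two-parameter logistic (2PL) item response function: probability of a
# correct response given the latent trait (theta), item difficulty (b),
# and item discrimination (a). Parameter values are hypothetical.
import math

def irf(theta: float, a: float, b: float) -> float:
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# Evaluate across the usual standard-score range of -3 to +3
for theta in range(-3, 4):
    print(theta, round(irf(theta, a=1.2, b=0.5), 2))
```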
Item Response Theory
Invariance in IRT
Individual trait level can be estimated from any set of items
IRFs do not depend on the population of examinees
Rasch Scale
Based on Item Response Theory, the relationship between the Test Taker’s Probability of success on an item and the latent trait (e.g., the ability)
Test taker’s ability vs. item difficulty (both will vary)
The items are used to define the measure’s scale
Goal: model success from the difference between the Person’s ability and the Item difficulty
Test-taker receives multiple items that match their ability
Protects against the ceiling effect
See graph on Class 1, slide 18
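A minimal sketch of the Rasch model’s core idea, that success depends only on ability minus difficulty; the values are hypothetical:

```python
# Rasch model: P(success) depends only on the difference between the
# test taker's ability and the item's difficulty (same logit scale).
import math

def rasch_p(ability: float, difficulty: float) -> float:
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

print(round(rasch_p(1.0, 1.0), 2))  # 0.5 when ability equals difficulty
print(round(rasch_p(2.0, 1.0), 2))  # higher ability -> higher probability
print(round(rasch_p(0.0, 2.0), 2))  # harder item -> lower probability
```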
Rasch Scale & Discriminatory Power
When an item measures a construct (has a good fit), the levels of the item will co-vary with the trait
When an item does not measure a construct, the levels of the item will not co-vary with the trait
Four ways to determine Reliability
Internal Consistency
A. Split-Half or Odd-Even
B. Coefficient Alpha
C. Kuder-Richardson
Test-Retest
Alternate, Parallel, or Equivalent Forms
Inter-Rater Reliability
Internal Consistency
Reliability within the test, rather than using multiple administrations
Internal Consistency
3 Types
Split-Half or Odd-Even
Cronbach’s Coefficient Alpha
Kuder-Richardson
Internal Consistency
Split-Half or Odd-Even Reliability
Correlate one half of the test with the other half for all who took the test
The correlation = the split half reliability estimate
The Spearman-Brown formula corrects for the shortened length of the half-tests
Internal Consistency
Spearman-Brown Formula
See Class 1, Slide 23
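The standard form of the correction for a test split in half is r_full = 2r_half / (1 + r_half); a minimal sketch with hypothetical half-test scores:

```python
# Split-half reliability with the Spearman-Brown correction: the
# half-test correlation is stepped up to estimate full-length
# reliability. Half-test scores below are hypothetical.
from statistics import correlation  # Python 3.10+

odd_half = [10, 12, 15, 18, 20]
even_half = [11, 11, 16, 17, 21]

r_half = correlation(odd_half, even_half)
r_full = (2 * r_half) / (1 + r_half)  # Spearman-Brown
print(f"split-half r = {r_half:.2f}, corrected = {r_full:.2f}")
```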
Internal Consistency
Cronbach’s Coefficient Alpha
Developed by Lee Cronbach in 1951
A formula for estimating the mean of all possible Split-Half Coefficients using items that have Three or more response possibilities or anchor definitions
Report reliability coefficient for total and/or each scale or subtest
Basics of Cronbach’s Coefficient Alpha
Cronbach’s alpha reliability coefficient normally ranges between 0 and 1
The closer the alpha coefficient is to 1.0, the greater the internal consistency of the scale items
Standardized Item Alpha: Alpha coefficient when all scale items have been standardized (made into z scores).
This coefficient is used only when the individual scale items are not scaled the same
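A minimal sketch of the alpha computation itself (alpha = k/(k−1) × (1 − Σ item variances / total variance)), using a hypothetical 5-respondent, 4-item scale:

```python
# Cronbach's alpha: alpha = (k/(k-1)) * (1 - sum(item vars) / total var)
# Rows are respondents, columns are scale items; data are hypothetical.
from statistics import pvariance

scores = [
    [3, 4, 3, 5],
    [2, 2, 3, 2],
    [4, 5, 4, 5],
    [1, 2, 1, 2],
    [3, 3, 4, 4],
]
k = len(scores[0])                                    # number of items
item_vars = [pvariance(col) for col in zip(*scores)]  # variance per item
total_var = pvariance([sum(row) for row in scores])   # variance of totals
alpha = (k / (k - 1)) * (1 - sum(item_vars) / total_var)
print(f"alpha = {alpha:.2f}")
```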
Internal Consistency
Kuder-Richardson
(KR-20) (KR-21)
Variation on alpha formula used with dichotomous data
An estimate of the mean of all possible split-half coefficients
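A minimal sketch of KR-20 (KR-20 = k/(k−1) × (1 − Σpq / total variance), where p is the proportion passing each item and q = 1 − p), with hypothetical right/wrong data:

```python
# KR-20: the dichotomous-data version of coefficient alpha.
# Rows are examinees, columns are items scored 1 (correct) / 0 (wrong);
# the response data are hypothetical.
from statistics import pvariance

responses = [
    [1, 1, 0, 1, 0],
    [1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0],
    [1, 1, 0, 1, 1],
]
k = len(responses[0])
p = [sum(col) / len(col) for col in zip(*responses)]  # proportion correct
pq_sum = sum(pi * (1 - pi) for pi in p)               # sum of p*q
total_var = pvariance([sum(row) for row in responses])
kr20 = (k / (k - 1)) * (1 - pq_sum / total_var)
print(f"KR-20 = {kr20:.2f}")  # 0.75 for this data
```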
Test-Retest Reliability
Give the same test Two or More Times to the Same Group of People, then correlate the scores.
Alternate, Parallel or Equivalent Forms of Reliability
Have Two or More forms or versions of the same test
Administer the two forms of the same test to the respondents (e.g., Group 1 gets Version A first and Group 2 gets Version B first)
Correlate scores on first form with scores on second form
Inter-Rater Reliability
The degree of agreement between two or more separate raters
Qualitative applications
Consensus coding
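One widely used agreement index is Cohen’s kappa, which corrects observed agreement for agreement expected by chance; a minimal sketch for two raters with hypothetical category codes:

```python
# Cohen's kappa for two raters: kappa = (p_o - p_e) / (1 - p_e), where
# p_o is observed agreement and p_e is chance agreement.
# The assigned category codes are hypothetical.
from collections import Counter

rater1 = ["A", "A", "B", "B", "A", "C", "B", "A"]
rater2 = ["A", "A", "B", "A", "A", "C", "B", "B"]

n = len(rater1)
p_o = sum(a == b for a, b in zip(rater1, rater2)) / n  # observed agreement
c1, c2 = Counter(rater1), Counter(rater2)
p_e = sum(c1[cat] * c2[cat] for cat in c1) / n ** 2    # chance agreement
kappa = (p_o - p_e) / (1 - p_e)
print(f"kappa = {kappa:.2f}")
```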
Standardized Scores
A collective term for variations on standard scores devised by test specialists
They eliminate fractions and negative signs by producing values other than zero for the mean and 1.00 for the SD of the transformed scores
Important Point: we can transform any distribution to a preferred scale with predetermined mean and SD
T-Score
Has a mean of 50 and a SD of 10
Common with personality tests
See pg 53, para 6 for formula
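The usual transformation is T = 50 + 10z; a minimal sketch with hypothetical raw scores:

```python
# T-scores: standardize raw scores to z, then rescale to mean 50, SD 10.
# Raw score values are hypothetical.
from statistics import mean, pstdev

raw = [12, 15, 18, 21, 24]
m, sd = mean(raw), pstdev(raw)
t_scores = [50 + 10 * (x - m) / sd for x in raw]
print([round(t, 1) for t in t_scores])  # mean 50, SD 10
```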
Age Norm
Depicts the level of test performance for each separate age group in the normative sample
Purpose is to facilitate same-aged comparisons
Grade Norms
Depicts the level of test performance for each separate grade in the normative sample
Rarely used with ability tests
Local norms
Derived from representative local examinees, as opposed to a national sample
Subgroup Norms
Consist of the scores obtained from an identified subgroup as opposed to a diversified national sample
Expectancy Table
Portrays the established relationship between test scores and expected outcome on a relevant task.
Useful with predictor tests used to forecast well-defined criteria
Always based on the previous predictor and criterion results for large samples of examinees…so, if conditions or policies change, an expectancy table can become obsolete or misleading
Criterion-Referenced Tests
Are used to compare examinees’ accomplishments to a predefined performance standard
The focus is on what the test taker can do rather than on comparisons to the performance levels of others
Identify an examinee’s relative mastery (or nonmastery) of specific, predetermined competencies
Content of test is selected on the basis of its relevance in the curriculum
Best suited to the testing of basic academic skills in educational settings
Norm-Referenced Tests
Purpose is to classify examinees, from low to high, across a continuum of ability or achievement
Uses a representative sample of individuals (norm group or standardization sample) as its interpretive framework
Items are chosen so that they provide maximal discrimination among respondents along the dimension being measured
Characteristics of Criterion and Norm-Referenced Tests
Pg 57
Reliability
Refers to the attribute of consistency in measurement
Best viewed as a continuum ranging from minimal consistency of measurement to near-perfect repeatability of results
Classical Theory of Measurement
The idea that test scores result from the influence of two factors:
~Factors that contribute to consistency. These consist entirely of the stable attributes of the individual that the examiner is trying to measure. (This factor is desirable because it represents the true amount of the attribute in question; the second factor is the unavoidable nuisance of error that introduces inaccuracies into measurement.)
~Factors that contribute to inconsistency. These include characteristics of the individual, test, or situation that have nothing to do with the attribute being measured, but that nonetheless affect test scores.
The true score is never known! We can obtain a probability that the true score resides within a certain interval, and we can also derive a best estimate of the true score.
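One common way to build that interval uses the standard error of measurement, SEM = SD × √(1 − reliability); a minimal sketch with hypothetical values:

```python
# Interval estimate for the true score via the standard error of
# measurement: SEM = SD * sqrt(1 - reliability). All values hypothetical.
import math

observed, sd, reliability = 104, 15, 0.90
sem = sd * math.sqrt(1 - reliability)

# Roughly 95% of the time the true score falls within about 2 SEM of
# the observed score (taking the observed score as the best estimate).
print(f"SEM = {sem:.1f}; 95% interval ≈ "
      f"{observed - 2 * sem:.1f} to {observed + 2 * sem:.1f}")
```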
Sources of Measurement Error
Item Selection
Test Administration
Test Scoring
Systematic and Unsystematic Errors of Measurement
Unsystematic Measurement Error
Their effects are unpredictable and inconsistent
Systematic Measurement Error
Arises when, unknown to the test developer, a test consistently measures something other than the trait for which it was intended
This is a problem for test validity
Results in inaccuracies of measurement
Measurement Error and Reliability
Measurement error reduces the reliability or repeatability of psychological test results
Main Features of Classical Theory
~Measurement errors are random
~Mean Error of measurement = 0
~True scores and errors are uncorrelated: rTe = 0
~Errors on different tests are uncorrelated: r12=0
Implications for Reliability and Measurement
TBD
The Reliability Coefficient (rXX)
The ratio of true score variance to the total variance of test scores (pg 61)
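A minimal sketch that illustrates the ratio by simulation under the classical model X = T + E; the distribution values are hypothetical:

```python
# Reliability coefficient rXX = true-score variance / total variance,
# simulated under X = T + E. All distribution values are hypothetical.
import random
from statistics import pvariance

random.seed(1)
true_scores = [random.gauss(100, 15) for _ in range(1000)]
observed = [t + random.gauss(0, 5) for t in true_scores]  # X = T + E

r_xx = pvariance(true_scores) / pvariance(observed)
print(f"rXX ≈ {r_xx:.2f}")  # near 225 / (225 + 25) = 0.90
```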