Reliability of a test Flashcards
dependability or consistency of the instrument or scores obtained by the same person when re-examined with the same test on different occasions, or with different sets of equivalent items
Reliability
index of reliability, a proportion that indicates the ratio between the true score variance on a test and the total variance
Reliability Coefficient
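The ratio defined above can be sketched numerically. A minimal Python illustration, using made-up variance figures (80 and 20 are hypothetical, not from the cards):

```python
# Reliability coefficient = true-score variance / total variance.
# The variance values below are hypothetical, chosen for illustration.
true_variance = 80.0    # variance due to real differences among testtakers
error_variance = 20.0   # variance due to random measurement error
total_variance = true_variance + error_variance

reliability_coefficient = true_variance / total_variance
print(reliability_coefficient)  # 0.8
```

With these numbers, 80% of the observed score variance reflects true differences, so the reliability coefficient is .80.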
score on an ability test is presumed to reflect not only the testtaker’s true score on the ability being measured but also the error
Classical Test Theory (True Score Theory)
refers to the component of the observed test score that does not have to do with the testtaker’s ability
Error
Factors that contribute to consistency
stable attributes
Factors that contribute to inconsistency
characteristics of the individual, test, or situation, which have nothing to do with the attribute being measured, but still affect the scores
Goals of Reliability:
EEDT
✓ Estimate errors
✓ Devise techniques to improve testing and reduce errors
useful in describing sources of test score variability
Variance
variance from true differences
True Variance
variance from irrelevant random sources
Error Variance
all of the factors associated with the process of measuring some variable, other than the variable being measured
Measurement Error
- difference between the observed score and the true score
Measurement Error
Source of error variance that refers to variation among items within a test as well as to variation among items between tests
- The extent to which testtaker’s score is affected by the content sampled on a test and by the way the content is sampled is a source of error variance
Item Sampling/Content Sampling
Source of error variance involving the testtaker’s motivation or attention, the testing environment, etc.
Test Administration
Source of error variance that can be minimized by employing objective-type items amenable to computer scoring of well-documented reliability
Test Scoring and Interpretation
source of error in measuring a targeted variable caused by unpredictable fluctuations and inconsistencies of other variables in measurement process (e.g., noise, temperature, weather)
Random Error
source of error in measuring a variable that is typically constant or proportionate to what is presumed to be the true value of the variable being measured
- has a consistent effect on the true score
- the SD does not change, but the mean does
Systematic Error
________ refers to the proportion of total variance attributed to true variance
Reliability
The _____ the proportion of the total variance attributed to true variance, the ________ the test
greater - more reliable
___________ may increase or decrease a test score by varying amounts; consequently, the consistency of the test score, and thus the reliability, can be affected
Error variance
Error: Time Sampling
Test-Retest Reliability
an estimate of reliability obtained by correlating pairs of scores from the same people on two different administrations of the test
Test-Retest Reliability
appropriate when evaluating the reliability of a test that purports to measure an enduring and stable attribute such as personality trait
- established by comparing the scores obtained from two successive measurements of the same individuals and calculating a correlation between the two sets of scores
Test-Retest Reliability
the longer the time that passes between administrations, the greater the likelihood that the reliability coefficient will be low
Test-Retest Reliability
happens when the test-retest interval is short: the second administration is influenced by the first because testtakers remember or practiced the previous test = inflated correlation/overestimation of reliability
Carryover Effects
scores on the second session are higher due to their experience of the first session of testing
Practice Effect
test-retest with a ______ interval might be affected by other extraneous factors, thus resulting in a _____ correlation
longer - low
problem of absences in the second session (solution: remove the first-session tests of the absent testtakers)
Mortality
statistical tools for Test-Retest Reliability
Pearson R, Spearman Rho
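As a sketch of how a test-retest coefficient would be computed with Pearson R, assuming two hypothetical score lists for the same five people (the numbers are illustrative, not from the cards):

```python
import statistics

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mx, my = statistics.mean(x), statistics.mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

first = [10, 12, 9, 15, 14]    # hypothetical first-administration scores
second = [11, 13, 9, 14, 15]   # same people retested later
print(round(pearson_r(first, second), 3))
```

A high positive correlation between the two administrations indicates good test-retest reliability; here the two orderings agree closely, so r is near 1.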
Error: Item Sampling (immediate), Item Sampling changes over time (delayed)
Parallel Forms/Alternate Forms Reliability
established when at least two different versions of the test yield almost the same scores
- has the most universal applicability
Parallel Forms/Alternate Forms Reliability
each form of the test, the means, and the variances, are EQUAL; same items, different positionings/numberings
Parallel Forms
simply a different version of a test that has been constructed so as to be parallel
Alternate Forms
- technique to avoid carryover effects for parallel forms, by using different sequence for groups
- can be administered on the same day or different time
Counterbalancing
most rigorous and burdensome, since test developers create two forms of the test
- main problem: difference between the two tests
- test scores may be affected by motivation, fatigue, or intervening events
- means and variances of the observed scores must be equal for the two forms
- Statistical Tools: Pearson R or Spearman Rho
Parallel Forms/Alternate Forms Reliability
used when tests are administered once - consistency among items within the test - measures the internal consistency of the test which is the degree to which each item measures the same construct
Internal Consistency (Inter-Item Reliability)
Error: Item Sampling Homogeneity
Internal Consistency (Inter-Item Reliability)
measurement for unstable traits
- if all items measure the same construct, then it has a good internal consistency
Internal Consistency (Inter-Item Reliability)
if a test contains items that measure a single trait (unifactorial)
Homogeneity
degree to which a test measures different factors (more than one factor/trait)
- Heterogeneity
______ homogenous = _____ inter-item consistency
more - higher
used for inter-item consistency of dichotomous items (intelligence tests, personality tests with yes or no options, multiple choice); unequal variances, dichotomously scored
KR-20
used if all the items have the same degree of difficulty (speed tests); equal variances, dichotomously scored
KR-21
used when two halves of the test have unequal variances and on tests containing non-dichotomous items, unequal variances
Cronbach’s Coefficient Alpha
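A minimal sketch of the coefficient alpha computation, using a hypothetical 5-person, 4-item response matrix. With 0/1 (dichotomous) data like this, the same formula reduces to KR-20:

```python
def cronbach_alpha(scores):
    """Coefficient alpha: (k / (k-1)) * (1 - sum(item variances) / total variance).
    `scores` is a list of examinees, each a list of item scores."""
    k = len(scores[0])          # number of items

    def var(values):            # population variance
        m = sum(values) / len(values)
        return sum((v - m) ** 2 for v in values) / len(values)

    item_vars = [var([person[i] for person in scores]) for i in range(k)]
    total_var = var([sum(person) for person in scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# Hypothetical dichotomous (0 = wrong, 1 = right) responses
data = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
]
print(round(cronbach_alpha(data), 3))  # 0.8
```

Higher alpha means the items vary together (measure the same construct); values near zero mean the items behave independently.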
measure used to evaluate the internal consistency of a test that focuses on the degree of difference that exists between item scores
Average Proportional Distance
obtained by correlating two pairs of scores obtained from equivalent halves of a single test administered ONCE
Split Half Reliability
Error: Item sample: Nature of Split
Split-Half Reliability
useful when it is impractical or undesirable to assess reliability with two tests or to administer a test twice
- cannot just divide the items in the middle because it might spuriously raise or lower the reliability coefficient, so just randomly assign items or assign odd-numbered items to one half and even-numbered items to the other half
Split-Half Reliability
allows a test developer or user to estimate internal consistency reliability from the correlation of two halves of a test, as if each half had been the length of the whole test (assuming the halves have equal variances)
Spearman-Brown Formula
estimates how many more items are needed in order to achieve the target reliability
Spearman-Brown Prophecy Formula
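Both uses of the formula can be sketched in a few lines. The r values below are hypothetical:

```python
def spearman_brown(r, n):
    """Predicted reliability when the test length is changed by factor n."""
    return n * r / (1 + (n - 1) * r)

def length_factor_needed(r, r_target):
    """Prophecy form: how many times longer the test must be to reach r_target."""
    return r_target * (1 - r) / (r * (1 - r_target))

# Correcting a hypothetical split-half correlation of .70
# (each half is only half the full length, so n = 2):
print(round(spearman_brown(0.70, 2), 3))           # 0.824
# How many times longer must a test with r = .70 be to reach .90?
print(round(length_factor_needed(0.70, 0.90), 2))  # 3.86
```

Note the diminishing returns: doubling a .70-reliable test only lifts it to about .82, while reaching .90 would require nearly quadrupling its length.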
counterpart of the Spearman-Brown formula; based on the ratio of the variance of the differences between the odd and even splits to the variance of the total, combined odd-even score
- if the reliability of the original test is relatively low, the developer could create new items, clarify test instructions, or simplify the scoring rules
- equal variances, dichotomously scored
Rulon’s Formula
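A sketch of that ratio in code, with hypothetical odd-half and even-half scores for five examinees:

```python
def rulon(odd_scores, even_scores):
    """Rulon's split-half reliability: 1 - Var(odd - even) / Var(total)."""
    def var(values):            # population variance
        m = sum(values) / len(values)
        return sum((v - m) ** 2 for v in values) / len(values)

    diffs = [o - e for o, e in zip(odd_scores, even_scores)]
    totals = [o + e for o, e in zip(odd_scores, even_scores)]
    return 1 - var(diffs) / var(totals)

# Hypothetical half-test scores
odd = [5, 7, 4, 8, 6]
even = [6, 7, 5, 7, 5]
print(round(rulon(odd, even), 3))  # 0.833
```

The intuition: if the two halves agree, the odd-even difference scores barely vary, the ratio is small, and the reliability estimate approaches 1.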
Error: Scorer Difference
Inter-Scorer Reliability
the degree of agreement or consistency between two or more scorers with regard to a particular measure
- used for coding nonverbal behavior
- observer differences
Inter-Scorer Reliability
determines the level of agreement between TWO or MORE raters when the method of assessment is measured on a CATEGORICAL SCALE
Fleiss Kappa
two raters only
Cohen’s Kappa
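A minimal sketch of Cohen's kappa for two raters, with hypothetical categorical ratings; kappa corrects the observed agreement for the agreement expected by chance:

```python
def cohens_kappa(r1, r2):
    """Cohen's kappa for two raters assigning categorical labels."""
    n = len(r1)
    labels = set(r1) | set(r2)
    p_observed = sum(a == b for a, b in zip(r1, r2)) / n
    p_chance = sum((r1.count(c) / n) * (r2.count(c) / n) for c in labels)
    return (p_observed - p_chance) / (1 - p_chance)

# Hypothetical ratings from two scorers
rater1 = ["yes", "yes", "no", "yes", "no", "no"]
rater2 = ["yes", "no", "no", "yes", "no", "yes"]
print(round(cohens_kappa(rater1, rater2), 3))  # 0.333
```

Here the raters agree on 4 of 6 cases (.67 observed), but chance alone would produce .50 agreement, so kappa is only .33.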
two or more raters; based on observed disagreement corrected for disagreement expected by chance
Krippendorff’s Alpha
Tests designed to measure one factor _____ are expected to have _____ of internal consistency and vice versa
(Homogenous) - high degree
trait, state, or ability presumed to be ever-changing as a function of situational and cognitive experience
Dynamic
barely changing or relatively unchanging
Static
– if the variance of either variable in a correlational analysis is restricted by the sampling procedure used, then the resulting correlation coefficient tends to be lower
Restriction of range or Restriction of variance
when the time limit is long enough to allow testtakers to attempt all items
Power Tests
generally contains items of a uniform level of difficulty with a time limit
Speed Tests
Reliability should be based on performance from two independent testing periods using _______ and _________ or split-half-reliability
test-retest - alternate-forms
designed to provide an indication of where a testtaker stands with respect to some variable or criterion
Criterion-Referenced Tests
As individual differences ______, a traditional measure of reliability would also_______, regardless of the stability of individual performance
decrease - decrease
everyone has a “true score” on test
Classical Test Theory
genuinely reflects an individual’s ability level as measured by a particular test
True Score
estimate the extent to which specific sources of variation under defined conditions are contributing to the test scores
Domain Sampling Theory
_______ is conceived of as an objective measure of how precisely the test score assesses the domain from which the test draws a sample
Test reliability
based on the idea that a person’s test scores vary from testing to testing because of the variables in the testing situations
Generalizability Theory
test situation
Universe
number of items in the test, amount of review, and the purpose of test administration
Facets
According to ____________, given the exact same conditions of all the facets in the universe, the exact same test score should be obtained (Universe score)
Generalizability Theory
developers examine the usefulness of test scores in helping the test user make decisions
Decision Study
the probability that a person with X ability will be able to perform at a level of Y in a test
Focus: item difficulty
Item Response Theory
a system of assumption about measurement and the extent to which item measures the trait
Latent-Trait Theory
The ______ is used to focus on the range of item difficulty that helps assess an individual’s ability level
computer
attribute of not being easily accomplished, solved, or comprehended
Difficulty
degree to which an item differentiates among people with higher or lower levels of the trait, ability etc.
Discrimination
can be answered with only one of two alternative responses
Dichotomous
3 or more alternative responses
Polytomous
provides a measure of the precision of an observed test score
Standard Error of Measurement
Standard deviation of errors as the ________ of error
basic measure
Index of the amount of inconsistency or the amount of the ______ error in an individual’s score
expected
The higher the reliability, the ______
lower Standard Error of Measurement
a range or band of test scores that is likely to contain true scores
Confidence Interval
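The SEM and the confidence interval built from it can be sketched together. The scale below is hypothetical (an IQ-style score with SD = 15 and reliability = .91):

```python
import math

def sem(sd, reliability):
    """Standard error of measurement: SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1 - reliability)

s = sem(15, 0.91)          # hypothetical SD and reliability
observed = 100             # hypothetical observed score

# 95% confidence interval around the observed score (z = 1.96)
low, high = observed - 1.96 * s, observed + 1.96 * s
print(round(s, 2), round(low, 1), round(high, 1))  # 4.5 91.2 108.8
```

This shows the inverse relationship on the card above: with reliability .91 the SEM is only 4.5 points, so the 95% band around a score of 100 spans roughly 91 to 109; a less reliable test would widen that band.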
can aid a test user in determining how large a difference should be before it is considered statistically significant
Standard Error of the Difference
refers to the standard error of the difference between the predicted and observed values
Standard Error of Estimate
a range or band of test scores that is likely to contain the true score
- tells us the relative position of the true score within the specified range and confidence level
- the larger the range, the higher the confidence
Confidence Interval
If the reliability is low, you can increase the number of _____ or use factor analysis and item analysis to increase internal consistency
items
nature of the test will often determine the reliability metric
Reliability Estimates
detects true positive
Test Sensitivity
detects true negative
Test Specificity
proportion of the population that actually possess the characteristic of interest
Base Rate
no. of available positions compared to the no. of applicants
Selection ratio
one of the Four Possible Hit and Miss Outcomes– predict success that does occur
True Positives (Sensitivity)
one of the Four Possible Hit and Miss Outcomes – predict failure that does occur
True Negatives (Specificity)
one of the Four Possible Hit and Miss Outcomes – predicted success that does not occur
False Positive (Type 1)
one of the Four Possible Hit and Miss Outcomes – predicted failure but succeed
False Negative (Type 2)
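The four outcomes above form a 2x2 table from which sensitivity, specificity, and the base rate are computed. A sketch with hypothetical counts for a selection test:

```python
# Hypothetical counts for 100 applicants
true_positive = 40    # predicted success, and success occurred
false_negative = 10   # predicted failure, but succeeded (Type 2)
true_negative = 35    # predicted failure, and failure occurred
false_positive = 15   # predicted success, but failed (Type 1)

total = true_positive + false_negative + true_negative + false_positive

sensitivity = true_positive / (true_positive + false_negative)   # detects true positives
specificity = true_negative / (true_negative + false_positive)   # detects true negatives
base_rate = (true_positive + false_negative) / total             # proportion who actually succeed

print(sensitivity, specificity, base_rate)  # 0.8 0.7 0.5
```

Sensitivity is computed over all who actually succeeded, specificity over all who actually failed; the base rate depends only on the population, not on the test's predictions.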