Module 2: Reliability Flashcards
Reliability
+ dependability or consistency of the instrument or scores obtained by the same person when re-examined with the same test on different occasions, or with different sets of equivalent items
+ Free from errors
+ Minimizing error
+ True score cannot be found
If tests are reliable, are they automatically reliable in all contexts?
No. Test may be reliable in one context, but unreliable in another
How can reliability be computed?
Estimate the range of possible random fluctuations that can be expected in an individual’s score
How many items should there be to have higher reliability?
The higher/greater the number of items, the higher the reliability will be.
What kind of sample should be used to obtain an observed score?
Use only a representative sample to obtain an observed score
Reliability Coefficient
index of reliability, a proportion that indicates the ratio between the true score variance on a test and the total variance
Classical Test Theory (True Score Theory)
score on an ability test is presumed to reflect not only the testtaker’s true score on the ability being measured but also error
Error
+ refers to the component of the observed test score that does not have to do with the testtaker’s ability
+ Errors of measurement are random
What is the formula of the classical test theory?
X = T + E
X - observed score
T - true score
E - error
How can the true score be computed?
When you average all the observed scores obtained over a period of time, then the result would be closest to the true score
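A minimal sketch of this idea in code, using a hypothetical true score and normally distributed random error; averaging many observed scores cancels the error and approaches the true score:
```python
# Classical test theory sketch: X = T + E, with hypothetical values.
import random

random.seed(0)
true_score = 100                                                     # T: the stable level
observed = [true_score + random.gauss(0, 5) for _ in range(1000)]    # X = T + E

# Averaging many observed scores cancels random error,
# so the mean approaches the true score.
print(sum(observed) / len(observed))  # ~100
```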
What is a factor that contributes to consistency?
stable attributes
What are factors that contribute to inconsistency?
characteristics of the individual, test, or situation, which have nothing to do with the attribute being measured, but still affect the scores
What are the goals of reliability?
- To estimate errors
- Devise techniques to improve testing and reduce errors
Variance
useful in describing sources of test score variability
What are the two types of variance?
- True Variance
- Error Variance
True Variance
variance from true differences
Error Variance
variance from irrelevant random sources
Measurement Error
+ all of the factors associated with the process of measuring some variable, other than the variable being measured
+ difference between the observed score and the true score
Positive Variance
can increase one’s score
Negative Variance
decrease one’s score
What are the sources of error variance?
- Item Sampling/Content Sampling
- Test Administration
- Test Scoring and Interpretation
Item Sampling/Content Sampling
+ refers to variation among items within a test as well as to variation among items between tests
+ the extent to which a testtaker’s score is affected by the content sampled on a test, and by the way the content is sampled, is a source of error variance
Test Administration
testtaker’s motivation or attention, environment, etc.
Test Scoring and Interpretation
scorers and scoring systems are potential sources of error variance; tests may employ objective-type items amenable to computer scoring of well-documented reliability
Random Error
source of error in measuring a targeted variable caused by unpredictable fluctuations and inconsistencies of other variables in measurement process (e.g., noise, temperature, weather)
Systematic Error
+ source of error in measuring a variable that is typically constant or proportionate to what is presumed to be the true value of the variable being measured
+ has consistent effect on the true score
+ SD does not change, the mean does
What is the relationship between reliability and variance?
+ Reliability refers to the proportion of total variance attributed to true variance
+ The greater the proportion of the total variance attributed to true variance, the more reliable the test
What can error variance do to a test score?
Error variance may increase or decrease a test score by varying amounts; consequently, the consistency of the test score, and thus the reliability, can be affected
True Score Formula
T′ = Rxx (X − X̄) + X̄
wherein
T′ - estimated true score
Rxx - reliability coefficient
X - obtained score
X̄ - mean score
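A hypothetical worked example of the estimated true score formula, with assumed reliability, obtained score, and mean:
```python
# Estimated true score: T' = Rxx * (X - mean) + mean (hypothetical values).
r_xx = 0.90       # reliability coefficient
x = 120           # obtained score
mean = 100        # mean of the test

estimated_true = r_xx * (x - mean) + mean
print(estimated_true)  # 118.0: the obtained score is regressed toward the mean
```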
What is an error in test-retest reliability?
time sampling
Test-Retest Reliability
+ an estimate of reliability obtained by correlating pairs of scores from the same people on two different administrations of the test
What is test-retest reliability appropriate for?
appropriate when evaluating the reliability of a test that purports to measure an enduring and stable attribute such as a personality trait
How is test-restest reliability established?
established by comparing the scores obtained from two successive measurements of the same individuals and calculating a correlation between the two sets of scores
When does the reliability coefficient of test-retest reliability become insignificant?
the longer the time that passes, the greater the likelihood that the reliability coefficient will be insignificant
Carryover Effects
occur when the test-retest interval is short, wherein the second test is influenced by the first because testtakers remember or have practiced the previous test = inflated correlation/overestimation of reliability
Practice Effect
scores on the second session are higher due to their experience of the first session of testing
Test Sophistication
items are remembered by the testtakers, especially the difficult ones or items that were highly confusing
Test Wiseness
test-taking skill that might inflate testtakers’ scores beyond their true ability
When does test-retest reliability have lower correlation?
test-retest with a longer interval might be affected by other extraneous factors, thus resulting in a low correlation
What does low correlation in test-retest reliability mean?
lower correlation = poor reliability
Mortality
problem of absences in the second session (just remove the first-session tests of those who were absent)
What does test-retest reliability measure?
coefficient of stability
What are the statistical tools that should be used for test-retest reliability?
Pearson R, Spearman Rho
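A minimal sketch of a test-retest estimate, assuming hypothetical score pairs: correlate the two administrations with Pearson r (here via scipy’s pearsonr):
```python
# Test-retest reliability: correlate scores from two administrations
# of the same test (hypothetical data).
from scipy.stats import pearsonr

first_admin  = [12, 15, 9, 20, 18, 14, 11, 17]
second_admin = [13, 14, 10, 19, 17, 15, 10, 18]

r, p = pearsonr(first_admin, second_admin)
print(f"test-retest reliability (coefficient of stability): {r:.2f}")
```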
What are the errors in Parallel Forms/Alternate Forms Reliability?
Item sampling (immediate administration); item sampling plus changes over time (delayed administration)
Parallel Forms/Alternate Forms Reliability
+ established when at least two different versions of the test yield almost the same scores
+ has the most universal applicability
+ true scores must be the same for two tests
+ means and the variances of the observed scores must be equal for the two forms
Parallel Forms
for each form of the test, the means and the error variances are EQUAL; same items, different positioning/numbering
Alternate Forms
simply a different version of a test that has been constructed so as to be parallel
What is required of parallel forms/alternate forms reliability?
The test should contain the same number of items and the items should be expressed in the same form and should cover the same type of content; range and difficulty must also be equal
What should be done if there is a test leakage during parallel/alternate forms reliability?
If there is test leakage, use the form that has not been widely administered.
Counterbalancing
technique to avoid carryover effects for parallel forms, by using different sequence for groups (e.g. G1 - listen to song before counseling, G2 - counseling first, before listening to the song)
When can the two different tests for parallel forms/alternate forms reliability be administered?
It can be administered on the same day or at a different time.
What is the most rigorous and burdensome form of reliability?
Parallel forms/alternate forms, because test developers have to create two forms of the test.
What is the main problem for parallel form/alternate form reliability?
There is a difference between the two tests
What are the factors that may affect parallel form/alternate form reliability test scores?
It may be affected by motivation, fatigue, or intervening events.
What are the statistical tools for parallel form/alternate form reliability?
Pearson R or Spearman Rho
What is Internal Consistency also known as?
Inter-Item Reliability
What is an error of Internal Consistency?
Item Sampling Homogeneity
Internal Consistency (Inter-Item Reliability)
+ used when tests are administered once
+ consistency among items within the test
+ measures the internal consistency of the test which is the degree to which each item measures the same construct
+ measurement for unstable traits
When can a test be said to have good internal consistency?
If all items measure the same construct, then the test has good internal consistency
What is internal consistency most useful for?
useful in assessing Homogeneity
Homogeneity
if a test contains items that measure a single trait (unifactorial)
Heterogeneity
degree to which a test measures different factors (more than one factor/trait)
When will a test have higher inter-item consistency?
more homogenous items = higher inter-item consistency
What are the different statistical tools that may be used for computing Internal Consistency?
+ KR-20
+ KR-21
+ Cronbach’s Coefficient Alpha
KR-20
used for inter-item consistency of dichotomous items (intelligence tests, personality tests with yes or no options, multiple choice), unequal variances, dichotomous scored
KR-21
used if all the items have the same degree of difficulty (speed tests), equal variances, dichotomous scored
Cronbach’s Coefficient Alpha
used when two halves of the test have unequal variances and on tests containing non-dichotomous items; unequal variances
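A minimal sketch of Cronbach’s coefficient alpha computed from a hypothetical item-score matrix; with dichotomous (0/1) items like these, the same computation corresponds to KR-20:
```python
# Cronbach's alpha from an item-by-person score matrix (hypothetical data).
import numpy as np

scores = np.array([          # rows = testtakers, columns = items
    [1, 0, 1, 1],
    [1, 1, 1, 0],
    [0, 0, 1, 0],
    [1, 1, 1, 1],
    [0, 1, 0, 0],
])

k = scores.shape[1]                              # number of items
item_variances = scores.var(axis=0, ddof=1)      # variance of each item
total_variance = scores.sum(axis=1).var(ddof=1)  # variance of total scores

alpha = (k / (k - 1)) * (1 - item_variances.sum() / total_variance)
print(f"Cronbach's alpha: {alpha:.2f}")
```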
Average Proportional Distance
measure used to evaluate the internal consistency of a test that focuses on the degree of difference that exists between item scores
What is an error of Split-Half Reliability?
Item Sampling; Nature of Split
Split-Half Reliability
obtained by correlating two pairs of scores obtained from equivalent halves of a single test administered ONCE
What is split-half reliability useful for?
it is useful when it is impractical or undesirable to assess reliability with two tests or to administer a test twice
How can split-half reliability be done?
One cannot just divide the items in the middle because it might spuriously raise or lower the reliability coefficient, so just randomly assign items or assign odd-numbered items to one half and even-numbered items to the other half
What are the different statistical formulas that may be used for computing Split-Half Reliability?
+ Spearman-Brown Formula
+ Spearman-Brown Prophecy Formula
+ Rulon’s Formula
Spearman-Brown Formula
allows a test developer or user to estimate internal consistency reliability from a correlation of two halves of a test, as if each half had been the length of the whole test; assumes the halves have equal variances
Spearman-Brown Prophecy Formula
estimates how many more items are needed in order to achieve the target reliability
How is Spearman-Brown Prophecy Formula computed?
multiply the estimate by the original number of items
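A minimal sketch of both Spearman-Brown steps, using assumed values: first correct a half-test correlation up to full-test length, then estimate the lengthening factor needed to reach a target reliability and multiply it by the original number of items:
```python
# Spearman-Brown correction and prophecy formula (hypothetical values).
half_r = 0.70                          # correlation between the two halves
full_r = (2 * half_r) / (1 + half_r)   # reliability of the full-length test
print(f"Spearman-Brown corrected reliability: {full_r:.2f}")

# Prophecy formula: how much longer must the test be for a target reliability?
target = 0.90
n = (target * (1 - full_r)) / (full_r * (1 - target))   # lengthening factor
original_items = 20
print(f"items needed: {round(n * original_items)}")      # multiply by original length
```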
Rulon’s Formula
counterpart of the Spearman-Brown formula; based on the ratio of the variance of the differences between the odd and even splits to the variance of the total (combined odd-even) score
What should the developer do if the split-half reliability is relatively low?
If the reliability of the original test is relatively low, the developer could create new items, clarify the test instructions, or simplify the scoring rules
What are the statistical tools that may be used to compute split-half reliability?
Pearson R or Spearman Rho
What is the error of Inter-Scorer Reliability?
Scorer Differences
Inter-Scorer Reliability
+ the degree of agreement or consistency between two or more scorers with regard to a particular measure
+ evaluated by calculating the percentage of times that two individuals assign the same scores to the performance of the examinees
Variation of Inter-Scorer Reliability
a variation is to have two different examiners test the same client using the same test and then to determine how close their scores or ratings of the person are
What is Inter-Scorer Reliability most used for?
used for coding nonverbal behavior
What are statistical measures that may be used for Inter-Scorer Reliability?
+ Fleiss Kappa
+ Cohen’s Kappa
+ Krippendorff’s Alpha
Fleiss Kappa
determines the level of agreement between TWO or MORE raters when the method of assessment is measured on a CATEGORICAL SCALE
Cohen’s Kappa
two raters only
Krippendorff’s Alpha
two or more raters; based on observed disagreement corrected for disagreement expected by chance
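A minimal sketch of a two-rater agreement check with Cohen’s kappa, using hypothetical categorical ratings and scikit-learn’s cohen_kappa_score:
```python
# Inter-scorer agreement for two raters (hypothetical categorical ratings).
from sklearn.metrics import cohen_kappa_score

rater_1 = ["anxious", "calm", "calm", "anxious", "calm", "anxious"]
rater_2 = ["anxious", "calm", "anxious", "anxious", "calm", "anxious"]

kappa = cohen_kappa_score(rater_1, rater_2)
print(f"Cohen's kappa: {kappa:.2f}")   # agreement corrected for chance
```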
Dynamic
trait, state, or ability presumed to be ever-changing as a function of situational and cognitive experience
Static
barely changing or relatively unchanging
Restriction of Range or Restriction of Variance
if the variance of either variable in a correlational analysis is restricted by the sampling procedure used, then the resulting correlation coefficient tends to be lower
Power Tests
when the time limit is long enough to allow test takers to attempt all items
Speed Tests
generally contains items of a uniform level of difficulty, administered with a time limit
What kind of reliability should be used for speed tests?
Reliability should be based on performance from two independent testing periods, using test-retest, alternate-forms, or split-half reliability
Criterion-Referenced Tests
designed to provide an indication of where a testtaker stands with respect to some variable or criterion
What will happen to the traditional measure of reliability when individual differences decrease?
As individual differences decrease, a traditional measure of reliability would also decrease, regardless of the stability of individual performance
Classical Test Theory
+ states that everyone has a “true score” on a test
+ made up of “true score” and random error
True Score
genuinely reflects an individual’s ability level as measured by a particular test
Domain Sampling Theory
+ estimates the extent to which specific sources of variation under defined conditions are contributing to the test scores
+ considers problem created by using a limited number of items to represent a larger and more complicated construct
+ test reliability is conceived of as an objective measure of how precisely the test score assesses the domain from which the test draws a sample
+ Systematic Error
Generalizability Theory
Domain Sampling Theory
+ based on the idea that a person’s test scores vary from testing to testing because of the variables in the testing situations
+ according to generalizability theory, given the exact same conditions of all the facets in the universe, the exact same test score should be obtained (universe score)
Universe
the test situation
Facet
number of items in the test, amount of review, and the purpose of test administration
Decision Study
developers examine the usefulness of test scores in helping the test user make decisions
Item Response Theory
+ the probability that a person with X ability will be able to perform at a level of Y in a test
+ a system of assumptions about measurement and the extent to which each item measures the trait
What is the focus of Item Response Theory?
item difficulty
What is Item Response Theory also known as?
Latent-Trait Theory
Computer using IRT
+ The computer is used to focus on the range of item difficulty that helps assess an individual’s ability level
+ If you got several easy items correct, the computer will then move to more difficult items
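A minimal sketch of a one-parameter (Rasch) item response function, with hypothetical ability (theta) and difficulty (b) values; the Rasch model is just one common IRT model, used here only as an illustration:
```python
# Probability of a correct response under the Rasch (1PL) model.
import math

def p_correct(theta, b):
    """P(correct) for a person of ability theta on an item of difficulty b."""
    return 1 / (1 + math.exp(-(theta - b)))

print(p_correct(theta=1.0, b=0.0))   # able person, easy item -> high probability
print(p_correct(theta=0.0, b=1.5))   # harder item -> lower probability
```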
Difficulty
attribute of not being easily accomplished, solved, or comprehended
Discrimination
degree to which an item differentiates among people with higher or lower levels of the trait, ability, etc.
Dichotomous
can be answered with only one of two alternative responses
Polytomous
3 or more alternative responses
Standard Error of Measurement
+ provide a measure of the precision of an observed test score
+ index of the amount of inconsistency, or the amount of expected error, in an individual’s score
+ allows one to quantify the extent to which a test provides accurate scores
+ used to estimate or infer the extent to which an observed score deviates from a true score
+ Standard Error of a Score
What is the basic measure of error (SEM)?
Standard deviation of error
What does the SEM provide?
provides an estimate of the amount of error inherent in an observed score or measurement
What does it mean when a test has lower SEM?
Higher reliability
What is SEM used for?
Used to estimate or infer the extent to which an observed score deviates from a true score
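A minimal sketch of the usual SEM formula, SEM = SD × √(1 − r), with assumed test statistics:
```python
# Standard error of measurement from hypothetical test statistics.
import math

sd = 15            # standard deviation of the test scores
reliability = 0.91

sem = sd * math.sqrt(1 - reliability)
print(f"SEM: {sem:.2f}")   # lower SEM = higher reliability
```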
Confidence Interval
Standard Error of Measurement
+ a range or band of test scores that is likely to contain true scores
+ tells us the likelihood that the true score falls within the specified range at the given confidence level
What does it mean when the range is larger?
The larger the range, the higher the confidence
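A minimal sketch of a 95% confidence interval built around a hypothetical observed score using the SEM:
```python
# Confidence interval around an observed score (hypothetical values).
import math

observed = 110
sem = 15 * math.sqrt(1 - 0.91)   # SEM as in the sketch above (~4.5)
z = 1.96                          # z-value for 95% confidence

lower, upper = observed - z * sem, observed + z * sem
print(f"95% CI: {lower:.1f} to {upper:.1f}")  # wider range = higher confidence
```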
Standard Error of the Difference
Standard Error of Measurement
can aid a test user in determining how large a difference should be before it is considered statistically significant
Standard Error of Estimate
Standard Error of Measurement
refers to the standard error of the difference between the predicted and observed values
What can one do if the reliability is low?
If the reliability is low, you can increase the number of items or use factor analysis and item analysis to increase internal consistency
Reliability Estimates
nature of the test will often determine the reliability metric
Types of Reliability Estimates
a) Homogenous (unifactor) or heterogeneous (multifactor)
b) Dynamic (unstable) or static (stable)
c) Range of scores is restricted or not
d) Speed Test or Power Test
e) Criterion or non-Criterion
Test Sensitivity
detects true positive
Test Specificity
detects true negative
Base Rate
proportion of the population that actually possess the characteristic of interest
Selection ratio
no. of hired candidates compared to the no. of applicants
Formula for Selection Ratio
number of hired candidates / total number of candidates
/ = divided by
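A tiny worked example with hypothetical counts:
```python
# Selection ratio: hired candidates divided by total applicants.
hired = 12
applicants = 80

selection_ratio = hired / applicants
print(f"selection ratio: {selection_ratio:.2f}")   # 0.15
```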
Four Possible Hit and Miss Outcomes
- True Positives (Sensitivity)
- True Negatives (Specificity)
- False Positive (Type 1)
- False Negative (Type 2)
True Positives (Sensitivity)
predict success that does occur
True Negatives (Specificity)
predict failure that does occur
False Positive (Type 1)
predicted success that does not occur
False Negative (Type 2)
predicted failure but succeed
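A minimal sketch computing sensitivity and specificity from hypothetical hit-and-miss counts:
```python
# Test sensitivity and specificity from hypothetical outcome counts.
true_pos, false_neg = 40, 10     # predicted success: occurred / did not occur
true_neg, false_pos = 35, 15     # predicted failure: occurred / did not occur

sensitivity = true_pos / (true_pos + false_neg)   # detects true positives
specificity = true_neg / (true_neg + false_pos)   # detects true negatives
print(f"sensitivity: {sensitivity:.2f}, specificity: {specificity:.2f}")
```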
Quartile 1
scored poorly, performed well
Quartile 2
scored well, performed well
Quartile 3
scored well, performed poorly
Quartile 4
scored poorly, performed poorly