Psychometric Tests For Reliability Flashcards
Estimating Reliability
➢ Classical Test Theory proposes different ways to estimate reliability of measures
❖ At least one method would be used every time a test is used with a new sample
➢ Essentially, these methods are based on administering two tests to the respondents
❖ Checking for consistency across the tests
1. Administer two different versions of a test (that measure the same thing)
❖ e.g., Give one depression measure at Time 1 and a different depression measure at Time 2
o Then check to see if the scores are related (i.e., consistent)
2. Giving people the same test twice (i.e., identical tests)
❖ e.g., Give a depression measure at Time 1 and the exact same depression measure at Time 2
o Then check to see if the scores are related (i.e., consistent)
3. View the different items on a single test as essentially “separate” tests
❖ e.g., Exploring how the respective items/ questions on a test were answered
o Have items that measure the same thing been answered “consistently”?
Estimating Reliability Pt.2
➢ These methods can be distinguished by specific tests for reliability
❖ These methods differ in the kind of data they produce (and underlying assumptions)
1. Two different versions of tests = Alternate-Forms Reliability (uses overall test scores)
❖ Reliability estimated by the consistency of scores between two different versions of a test
o Two different tests that measure the same construct (at two different times)
o Check how the sets of scores correlate with one another (i.e., are they consistent)
2. Giving the same test twice = Test-Retest Reliability (uses overall test scores)
❖ Reliability estimated by the consistency of scores on the same measure at different times
o Giving a measure to a group and then again at a later point in time
o Check the test-retest correlation between the two sets of scores
3. Explore different items on a test = Internal Consistency (inter-item relations)
❖ Reliability estimated by the consistency of scores on “parts” of the same measure
o Based on correlations between different items on the same test
o Indicates whether items measuring the same construct produce similar scores
Considerations For Estimating Reliability
There are some important things to consider before detailing each method:
➢ There is no single method that provides complete accurate estimates of reliability
❖ No measure is ever 100% reliable (it is a question of “degree of reliability”)
❖ If data violates relevant assumptions, then certain estimates may not be accurate
➢ Each method requires two “testings” to generate reliability estimates
❖ These “testings” look different depending on the method
o Alternate-Form: requires individuals to take two highly similar tests
o Test-Retest: requires individuals to take the exact same test at two timepoints
o Internal Consistency: individuals respond to different parts of a test (2 items or more)
➢ Each method is estimating “consistency” between relevant scores
❖ If consistency is high = high degree of reliability
➢ Reliability should be checked for every component of a measure
❖ If the construct is unidimensional, then we need to check the reliability of that one dimension
❖ If the construct is multidimensional, then reliability should be checked for all components
“Parallel” Test Forms
➢ To accurately compare two test scores - the two tests must be “parallel”
❖ Remember the parallel model of reliability (most restrictive model)
➢ Parallel means:
❖ Both tests must measure the exact same construct
❖ Both tests (& all items) use the same unit of measurement
❖ The true scores for both tests are assumed to be equal (as are the error variances)
➢ When two tests meet the criteria for “parallel forms”, we can perform:
❖ Alternate Forms Reliability
❖ Test-Retest Reliability
➢ In reality, these methods are normally used when creating a measure
❖ When you do your own study, you don’t need to conduct test-retest reliability
o Unless you question the reliability of a specific measure in a sample
- Alternate Forms Reliability
➢ Alternate Forms Method estimates reliability of overall test scores
❖ Sometimes called Parallel Forms Reliability
➢ Obtaining scores from two different tests that assess the same construct
❖ Compute a correlation between the scores to estimate reliability
❖ High reliability if the two observed scores are strongly consistent with one another
➢ Only applicable if the two test forms are parallel
❖ All items must measure the same construct & use the same unit of measurement
❖ With the same precision and the same amount of error
Example: Education
Educators may want to administer a test to students. If some students fail, a different
test may be administered that assesses the same concept but avoids students already
being familiar with the questions.
❖ This principle also applies with questionnaires for psychological constructs
Correlation Coefficients = Reliability
➢ The correlation between two parallel test scores equates to the reliability
➢ Pearson correlation coefficient (r) used to measure the strength of relationship
❖ If correlation is statistically significant and above .70 = acceptable reliability
o For practical contexts, often .90 or above is needed
❑ r = 1.00 (perfect reliability)
❑ r between 0.80 - 0.99 (good reliability – i.e., scores highly consistent)
❑ r between 0.70 - 0.80 (acceptable reliability – i.e., relatively consistent)
❑ r between 0.60 - 0.70 (questionable reliability – i.e., moderately consistent)
❑ below 0.50 (poor reliability – i.e., inconsistency between scores)
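A minimal sketch of how this might be computed, assuming hypothetical scores (form_a, form_b) from the same respondents and the scipy library:

```python
# Sketch: alternate-forms reliability estimated as the Pearson correlation
# between total scores on two parallel forms (hypothetical data).
from scipy.stats import pearsonr

form_a = [21, 18, 25, 30, 14, 22, 27, 19, 24, 16]  # Form A total scores
form_b = [20, 17, 26, 29, 15, 21, 28, 18, 23, 17]  # Form B total scores (same people)

r, p = pearsonr(form_a, form_b)
print(f"Alternate-forms reliability: r = {r:.2f}, p = {p:.3f}")

# Rough interpretation against the cut-offs listed above
if p < .05 and r >= .70:
    print("Acceptable reliability (scores are consistent across forms)")
else:
    print("Below the conventional .70 cut-off (or not statistically significant)")
```

The same pearsonr call also returns the p value discussed in the next card.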
Statistical Significance
➢ It is important to not only consider the direction and strength of a correlation
❖ But whether it is statistically significant
➢ A result that is statistically significant is unlikely to have been caused by chance
❖ Statistical significance is indicated using a p value (i.e. probability)
➢ A cut-off for significance is normally P < 0.05
❖ i.e., less than a 5% probability of observing such a result if there were no real effect
❖ Less than a 5% chance of concluding there is an association (when there actually isn’t)
➢ If p is greater than 0.05 - we cannot be confident the result is not by chance
❖ e.g., If p = 0.45, then there is a 45% probability of seeing such a result simply by chance
❖ Thus, if p > 0.05, this means no meaningful effect/ association is observed
Limitations: Alternate Forms
Practical Issue:
➢ It is difficult to be entirely confident that alternate tests are truly parallel
❖ Do different tests actually measure the same psychological attribute?
❖ By definition, they will include different content
❖ Thus, the true scores from both tests may actually vary
Example: We may give two different versions of a self-esteem scale
➢ Test 1 may mainly include items relating to personal self-esteem (self-worth)
➢ Test 2 may include some items relating to social self-esteem (interactions with others)
❖ i.e., the tests may actually reflect (slightly) different concepts
Subtle Issue:
➢ Order-effects (i.e., carryover) may occur as a result of repeated testing
❖ Taking a second test may impact responses (e.g., prior memory, attitude, mood)
❖ Error scores on Test 1 could potentially correlate with observed scores on Test 2
o Violates a fundamental assumption of CTT – that error should be random
- Test-Retest Reliability
➢ Test-Retest Method also estimates reliability of overall test scores
❖ Sometimes referred to as “Retest Reliability”
➢ This relates to consistency of scores on the same test at two timepoints
➢ This is a common method and avoids issues with using different (alternate) tests
❖ The content will be identical across both tests (i.e., same items and wording)
➢ Correlation coefficient (r) indicates strength of association between the scores
❖ Correlation above .70 = acceptable reliability
❖ Strong correlation = scores at Time 1 are representative of scores at Time 2
Examples: Pearson correlation between T1 and T2: (r = .85, p < .001)
Examples: Pearson correlation between T1 and T2: (r = .38, p = .03)
Considerations: Test-Retest Reliability
➢ Test-retest reliability still needs to meet the assumptions of parallel tests
➢ We need to be confident respondents’ true scores are stable across the timepoints
❖ Typically conducted over two time-points (Time1; Time2)
❖ Should be a relatively short period of time
o To reduce other influencing factors of true scores
o e.g., Age-related changes or situational changes
➢ Another assumption is that the error variance is equal between the two scores
❖ Testing procedures/ settings should be similar for both timepoints
❖ To reduce measurement error
➢ If the above assumptions are met, we can be confident:
❖ The correlation between Test 1 and Test 2 scores is an estimate of reliability
Limitations of Test-Retest Reliability
➢ Important to consider the context of the test-retest method
1. Immediate testing-situation (i.e., noise, distractions, other people, events)
❖ Important to create testing situations that are comparable with each other
2. Some psychological attributes are less stable than others (i.e., influences reliability)
❖ Mood states are much more changeable than personality traits
❖ Thus, test-retest may produce less reliable results for mood than personality
3. Length of retest interval
❖ Lengthier gaps between tests risk allowing greater variation in scores
❖ True (real) scores are more likely to change over years (than say weeks)
❖ Test-retest is usually conducted between 1-8 weeks
4. Period at which the interval occurs
❖ Changes in scores may be more likely at certain life stages
o e.g., more changes in cognitive abilities in childhood than adulthood
o e.g., anxiety may change drastically during an exam period in-between tests
Test–Retest Reliability
(Multiple Timepoints - ICCs)
➢ If there are more than two test timepoints (T1, T2, & T3)
❖ Use Intraclass Correlation Coefficient (ICC)
❖ Tests whether the scores at the three time points are consistent with one another
(ICC often used when there are multiple independent raters/ observers ….)
➢ ICC values that are less than 0.50 = poor reliability
➢ ICC values between 0.50 and 0.70 = moderate reliability
➢ ICC values above 0.70 = good/ excellent reliability
❖ If ICC = 1.00 then scores are identical at all time points
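A minimal sketch of how an ICC might be computed, assuming hypothetical scores from five people at three timepoints and the pingouin library (the timepoints are treated as the “raters”):

```python
# Sketch: ICC for scores collected at three timepoints (hypothetical data).
import pandas as pd
import pingouin as pg

data = pd.DataFrame({
    "person": [1, 2, 3, 4, 5] * 3,
    "time":   ["T1"] * 5 + ["T2"] * 5 + ["T3"] * 5,
    "score":  [12, 18, 25, 9, 20,    # Time 1
               13, 17, 24, 10, 21,   # Time 2
               12, 19, 26, 9, 22],   # Time 3
})

# Each row is one person's score at one timepoint (long format)
icc = pg.intraclass_corr(data=data, targets="person", raters="time", ratings="score")
print(icc[["Type", "ICC", "CI95%"]])   # e.g., ICC above .70 = good/excellent
```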
Alternate Forms & Test-Retest (Summary)
➢ Both the Alternate Forms and Test-Retest methods have sound theoretical foundations
❖ They provide sound estimates of reliability
➢ But only for parallel tests (i.e., not suitable when the two tests measure different constructs)
➢ In addition, practical considerations that need to be accounted for:
❖ They require multiple testing (at least two occasions)
o Such testing can be expensive
o Is more time-consuming
o Often unappealing for respondents
➢ Thus, these methods may not always be practically feasible
Internal Consistency
➢ Test based on the correlations between different items within a measure
❖ Considers how the items have been scored
❖ Checks that items measuring the same construct have been answered in a similar manner
➢ This does NOT require respondents to take the test twice
❖ They complete a test/ measure once
❖ But the measure needs to include different items (at least 2 or more)
➢ In this sense, items reflect different “parts” of the test
➢ Helps simplify the evaluation of reliability
❖ Most measures include multiple items, so internal consistency can be computed
❖ Internal consistency is a widely used method
o Psychological research often requires internal consistency for every study
Internal Consistency
➢ There are two main factors that influence internal consistency:
1. Consistency between the items (parts) of the test
❖ High consistency between observed item scores = high reliability
❖ Poor consistency between the items = lower reliability
o If they are poorly related to one another (may not be relevant to the current sample)
Example: If all items are supposed to reliably measure self-esteem….
❖ We would expect people high in self-esteem to rate all the items highly
❖ We would expect people low in self-esteem to score all the items low
o If they are answering items differently (randomly) = poor internal consistency
2. The length of the test
❖ A longer test (more items) is more likely to produce high internal consistency
❖ A shorter test (fewer items) is less likely to do so (a stricter test of consistency)
o This arises from the nature of measurement error (random errors tend to cancel out across more items); see the sketch below
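One way to see the effect of test length is the Spearman–Brown “prophecy” formula (a standard CTT result, not specific to this material), sketched below with illustrative numbers:

```python
# Sketch: predicted reliability when a test is lengthened (or shortened)
# by a given factor, holding item quality constant.
def spearman_brown(reliability: float, length_factor: float) -> float:
    """Spearman-Brown prophecy: reliability of a test changed in length."""
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)

print(round(spearman_brown(0.60, 2.0), 2))   # doubling a .60 test -> 0.75
print(round(spearman_brown(0.60, 0.5), 2))   # halving it          -> 0.43
```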
Reverse-Scored Items
➢ Before testing internal consistency (or any reliability test)…
❖ Consider the direction that items are worded
➢ Some measures include positively & negatively worded items
❖ To try and prevent people answering randomly
❖ To identify if they are just circling the “highest options”
➢ Some items may need reverse scoring
❖ So that a high raw score is recoded as a low score and vice versa
❖ Thus, all items must be scored in the same direction
❖ If this is not done, then item consistency will be poor
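A minimal sketch of reverse scoring, assuming hypothetical items on a 1–5 Likert scale and the pandas library (the rule is reversed = scale minimum + scale maximum − raw score):

```python
# Sketch: reverse-scoring a negatively worded item before reliability analysis.
import pandas as pd

responses = pd.DataFrame({
    "item1": [5, 4, 2, 1],   # positively worded
    "item2": [1, 2, 4, 5],   # negatively worded -> needs reversing
})

SCALE_MIN, SCALE_MAX = 1, 5
responses["item2_r"] = SCALE_MIN + SCALE_MAX - responses["item2"]
print(responses)   # item2_r now runs in the same direction as item1
```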
Methods Of Internal Consistency
➢ There are three methods to estimating internal consistency:
➢ Split-Half Approach
❖ Split the items in half (e.g., separate a total of 5 items into 3 items & 2 items)
❖ Estimate the extent to which both parts of the test contribute equally
➢ “Raw Alpha” Approach
❖ “Cronbach’s alpha” (α)
❖ Needs to be greater than .70 to be acceptable
➢ “Standardised Alpha” Approach
❖ Appropriate if test scores are created using standardised item scores (z-scores)
❖ Scores may be standardized if their variances are dramatically different
- Split-Half Approach
➢ Split items in half (e.g., 4 items = 2 items & 2 items)
❖ This is quite subjective & arbitrary (i.e., the two halves need to be equivalent)
❖ Then add up the scores for each half (i.e., so you now have two subset scores)
➢ Then compute a correlation between the two subset scores
❖ That will produce an r correlation value
❖ If highly correlated indicates consistency between the two halves of the test
o To interpret the correlation as reliability, the two halves should be parallel (which is unlikely)
➢ Add correlation value into a formula to get a reliability estimate
❖ Spearman-Brown Split-Half Formula
Reliability = 2r / (1 + r)
= 2(.276) / (1 + .276) = .43
Limitation
➢ It is difficult to know whether the two halves are parallel (same true scores & error variances)
❖ It is hard to know how to accurately split items
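A minimal sketch of the split-half approach, assuming a hypothetical 4-item test (rows = respondents) and the numpy/scipy libraries:

```python
# Sketch: split-half reliability with the Spearman-Brown split-half correction.
import numpy as np
from scipy.stats import pearsonr

items = np.array([
    [4, 3, 4, 5],
    [2, 2, 3, 2],
    [5, 4, 5, 4],
    [3, 3, 2, 3],
    [1, 2, 1, 2],
])

half1 = items[:, [0, 1]].sum(axis=1)   # total of items 1 + 2
half2 = items[:, [2, 3]].sum(axis=1)   # total of items 3 + 4

r, _ = pearsonr(half1, half2)          # correlation between the two halves
reliability = (2 * r) / (1 + r)        # Spearman-Brown split-half formula
print(round(reliability, 2))
```

Note that a different (arbitrary) split of the items would give a different estimate, which is exactly the limitation described above.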
- “Raw” Alpha Approach
➢ This method takes an “item level” approach (i.e., not just two halves of items)
❖ Evaluates the consistency across ALL the items on a test
❖ Each item is now considered its own subset of the test
➢ “Raw” Alpha indicates the consistency of scores across the items
❖ Checks if items have been answered in a similar manner
o The items should/ need to measure the same construct
➢ “Raw Alpha” is indicated by a Cronbach’s alpha coefficient (the symbol is α)
❖ A Cronbach Alpha will be a number from 0 to 1
❑ Above .90 = Excellent
❑ Above .80 = Very Good
❑ Above .70 = Good
❑ Above .60 = Questionable??
❑ Below .50 = Poor Internal Consistency
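A minimal sketch of how a raw alpha can be computed directly from the item variances and the total-score variance, assuming a hypothetical item-score matrix and numpy:

```python
# Sketch: raw Cronbach's alpha = (k / (k - 1)) * (1 - sum(item variances) / variance(total)).
import numpy as np

items = np.array([
    [4, 3, 4, 5],
    [2, 2, 3, 2],
    [5, 4, 5, 4],
    [3, 3, 2, 3],
    [1, 2, 1, 2],
], dtype=float)           # rows = respondents, columns = items

k = items.shape[1]                          # number of items
item_vars = items.var(axis=0, ddof=1)       # variance of each item
total_var = items.sum(axis=1).var(ddof=1)   # variance of the total score
alpha = (k / (k - 1)) * (1 - item_vars.sum() / total_var)
print(round(alpha, 2))                      # compare against the cut-offs above
```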
- Standardised Alpha
➢ Very similar to the “raw” alpha….
❖ However, it is calculated after converting the item scores into z-scores
❖ Z-scores = standardised scores (mean = 0, SD = 1)
➢ This method is uncommon, but some test-users may want to standardise scores
❖ e.g., Educators may want to use multiple indicators to reflect academic ability
o May want to use grades and SAT scores together (which are scored on different scales)
o Standardising the scores means they are then comparable
➢ Standardised item scores are then used to compute an overall “composite” score
❖ Standardised Alpha is the relevant estimate of reliability for standardised scores
❖ The cut-off scores remain the same (i.e., 0.70 is acceptable)
➢ “Raw” and “Standardised Alpha” scores don’t often vary drastically
❖ If based on the exact same items
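A minimal sketch of a standardised alpha, assuming the same kind of hypothetical item matrix and numpy; it is computed here from the average inter-item correlation, which is equivalent to running the raw alpha on z-scored items:

```python
# Sketch: standardised alpha = (k * mean_r) / (1 + (k - 1) * mean_r),
# where mean_r is the average inter-item correlation.
import numpy as np

items = np.array([
    [4, 3, 4, 5],
    [2, 2, 3, 2],
    [5, 4, 5, 4],
    [3, 3, 2, 3],
    [1, 2, 1, 2],
], dtype=float)

k = items.shape[1]
corr = np.corrcoef(items, rowvar=False)             # inter-item correlation matrix
mean_r = corr[np.triu_indices(k, k=1)].mean()       # average off-diagonal correlation
std_alpha = (k * mean_r) / (1 + (k - 1) * mean_r)
print(round(std_alpha, 2))                          # same .70 cut-off applies
```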
Example: Raw Alpha
➢ The most important number to report is the Cronbach Alpha
❖ Above 0.70 = Acceptable internal consistency
o For practical/ individual contexts, .90 may be required
➢ Internal consistency can also indicate item discrimination
❖ How much an item score relates to the overall test score
➢ These are known as corrected item-total correlations
❖ Item-total correlations above 0.30 suggest the item is consistent
➢ Internal consistency can identify if any items are problematic
❖ Deleting an item may improve internal consistency
➢ Report:
❖ State Cronbach Alpha
❖ State that all corrected item-total correlations were above .30
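A minimal sketch of corrected item-total correlations, assuming the same kind of hypothetical item matrix and numpy/scipy:

```python
# Sketch: each item correlated with the total score *excluding* that item.
import numpy as np
from scipy.stats import pearsonr

items = np.array([
    [4, 3, 4, 5],
    [2, 2, 3, 2],
    [5, 4, 5, 4],
    [3, 3, 2, 3],
    [1, 2, 1, 2],
], dtype=float)

total = items.sum(axis=1)
for i in range(items.shape[1]):
    corrected_total = total - items[:, i]        # total with item i removed
    r, _ = pearsonr(items[:, i], corrected_total)
    flag = "" if r >= .30 else "  <- possibly problematic item"
    print(f"Item {i + 1}: corrected item-total r = {r:.2f}{flag}")
```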
Item Discrimination
➢ Item discrimination is the degree to which:
❖ “an item differentiates people who score high and low on the overall test”
➢ For high reliability, we want all items to have high discrimination
❖ i.e., differences in an item score will reflect differences in overall test score
There are two main indexes of item discrimination
1. Item–Total Correlation
❖ High item discrimination = the item is consistent with the overall test score
❖ Low item discrimination = the item is not consistent with the overall test score
o Corrected Item-Total Correlation = the correlation between an item & a “corrected” total score (the total with that item removed)
2. Discrimination Index (D)
❖ Proportion of high overall scorers that get an item correct (compared to low overall scorers)
❖ A high D value = high & low overall scorers differ in how they answered an item
o D values range between 0 and 1 (closer to 1 = higher item discrimination)
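A minimal sketch of a discrimination index for a dichotomously scored (correct/incorrect) item, assuming hypothetical data and contrasting the top and bottom thirds of overall scorers (the exact grouping rule varies):

```python
# Sketch: D = proportion of high overall scorers answering the item correctly
#             minus the proportion of low overall scorers doing so.
import numpy as np

item_correct = np.array([1, 1, 0, 1, 0, 1, 0, 0, 1, 0])          # 1 = correct
total_score  = np.array([18, 17, 9, 16, 8, 15, 10, 7, 19, 11])   # overall test score

order = np.argsort(total_score)
n_group = len(total_score) // 3          # bottom and top thirds
low, high = order[:n_group], order[-n_group:]

D = item_correct[high].mean() - item_correct[low].mean()
print(round(D, 2))                       # closer to 1 = better discrimination
```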
Confidence Interval
➢ Cronbach Alpha reflects the point estimate of internal consistency
❖ i.e., it is our best guess at the consistency between the items
➢ Point estimates are usually combined with confidence intervals
❖ Confidence intervals indicate how precise a point estimate is
➢ Confidence intervals are presented by two values (a low & high boundary value)
❖ As well as a ‘degree of confidence’ (90% or 95%)
❖ The higher the degree of confidence, the more confident we can be that the interval contains the true value
➢ A large range between the interval boundaries = we are less confident in the estimate
❖ If we used this measure again, we may get an alpha value within this range
➢ A narrower range = more confident
❖ The values will be closer to our point of estimate
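A minimal sketch of one common way to put a confidence interval around alpha: a percentile bootstrap over respondents (hypothetical data, numpy only; other analytic methods exist):

```python
# Sketch: percentile bootstrap 95% CI for Cronbach's alpha.
import numpy as np

rng = np.random.default_rng(0)

def cronbach_alpha(items: np.ndarray) -> float:
    k = items.shape[1]
    return (k / (k - 1)) * (1 - items.var(axis=0, ddof=1).sum()
                            / items.sum(axis=1).var(ddof=1))

# Hypothetical data: 100 respondents x 5 items, with a shared component
items = rng.integers(1, 6, size=(100, 5)).astype(float)
items += items.mean(axis=1, keepdims=True)

point_estimate = cronbach_alpha(items)
boot = [cronbach_alpha(items[rng.integers(0, len(items), len(items))])  # resample rows
        for _ in range(2000)]
low, high = np.percentile(boot, [2.5, 97.5])
print(f"alpha = {point_estimate:.2f}, 95% CI [{low:.2f}, {high:.2f}]")
```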
Final Considerations With Internal Consistency
➢ Internal consistency across items does not indicate a unidimensional construct
❖ Just because items are consistent, they still could measure different concepts
❖ Some concepts may be made up of different components
o e.g., empathy has cognitive & emotional components
➢ Identifying consistency among different items (parts of a test) can help…
❖ Modify and improve a test/ measure
❖ If two items do not correlate with the other items… they could be replaced by better items
➢ Longer tests will generally be more reliable
❖ As test length increases, true score variance increases more than error score variance
➢ Sample heterogeneity (diversity)
❖ The more the sample varies in observed scores = the greater the reliability
❖ To some degree, reliability depends on the nature of the sample used
Reliability: Research Implications
➢ Researchers use different measures to explore relationships between concepts
➢ Regression is a way to explore relationships between two numerical scores
❖ This could be between two mean scores on the same test
❖ Or mean scores on different tests measuring different concepts
➢ If a test is unreliable…
then observed mean scores may be inflated/ deflated (compared to the true scores)
➢ We may end up finding inaccurate/ untrue relationships in the regression
o i.e., finding significant relationships when there is none in reality
o Or finding non-significant relationships when there is one in reality
➢ We may also end up with over- (or under-) estimated effect sizes
❖ Effect sizes are often indicated by R² values, η² values or Cohen’s d
❖ We may come to the wrong conclusion about the strength of a relationship
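A small simulation sketch (hypothetical data, numpy/scipy) of the point above: adding measurement error to a predictor attenuates the observed relationship, which is one way effect sizes and conclusions end up distorted:

```python
# Sketch: the same true relationship, measured with a reliable vs. unreliable test.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(1)
n = 500

true_x = rng.normal(size=n)
y = 0.6 * true_x + rng.normal(scale=0.8, size=n)        # real association with X

reliable_x   = true_x + rng.normal(scale=0.2, size=n)   # little measurement error
unreliable_x = true_x + rng.normal(scale=1.5, size=n)   # lots of measurement error

print("r with reliable measure:  ", round(pearsonr(reliable_x, y)[0], 2))
print("r with unreliable measure:", round(pearsonr(unreliable_x, y)[0], 2))  # attenuated
```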