Reliability (Classical Test Theory) Flashcards

1
Q

Reliability

A
  • refers to the consistency and replicability of a measure
  • reliability concerns the properties of the actual responses: either specific items within a test or the overall score on a test (the ‘composite score’)
  • measures need to reflect “real” psychological attributes (and differences)
  • psychological attributes are invisible aspects of people (reliability is a theoretical & unobservable aspect of test scores)
  • reliability should be checked in every study with a new sample
2
Q

Observed Vs True Scores

A
  • reliability essentially reflects the extent to which respondents’ observed scores are consistent with differences in their true scores
  • observed score: the values obtained from measuring a characteristic in a sample of individuals i.e., an actual score on a test
  • true score: the ‘real’ amount of a characteristic that an individual has (i.e., the score obtained if a measure is 100% perfectly precise)
    -> we never really know someone’s true score on an attribute
3
Q

No Measure is Perfectly Reliable

A
  • reliability relies on observed scores being an accurate reflection of true scores (no measure can ever be perfectly reliable)
  • reliability is actually on a continuum (that varies in degree)
    -Describe reliability using terms like:
    ❖ “Demonstrated acceptable levels of reliability”
    ❖ “Demonstrated poor degree of reliability”
4
Q

Classical Test Theory (CTT)

A
  • defines the conceptual basis of reliability
  • outlines procedures for estimating reliability of psychological measures
  • according to CTT, reliability is the extent to which: “differences in observed test scores are a function of true psychological differences, as opposed to measurement error”
5
Q

(Measurement Error)

A
  • reliability for a measurement procedure depends on the extent to which differences in observed scores can be attributed to differences in their true scores rather than other random factors (referred to as measurement error or just error)
  • no measure is perfectly reliable; error is often a consequence of the measurement process
  • important to consider how measurement error may impact observed scores
6
Q

Sources of Error

A
  1. Subtle Error
    o Respondents may not fully understand questions and answer randomly
    o Some tests require observers to score test-takers, different observers may score differently
  2. Obvious Error
    o Tests may be given in different contexts (i.e., loud or quiet settings)
    o Tests may be given at different times of the day
    o Different temperatures for test-takers (i.e., hot versus cold)
7
Q

Reliability = Signal & Noise

A
  • signal = true psychological differences (we want to detect)
  • noise = measurement error (obscuring the signal and making it hard to detect)
    -reliability is strongest when there is a strong signal and minimal noise (very little measurement error)
8
Q

Understanding Reliability

A
  • reliability depends on two things:
    ❖ the extent to which differences in test scores can be attributed to real individual differences
    ❖ the extent to which differences in test scores are a function of measurement error
    ➢ Classical Test Theory assumes observed scores are a function of true score + error
    Observed Score = True Psychological Score + (Measurement) Error
    ➢ CTT also assumes measurement error occurs randomly
    ❖ Circumstances we cannot really control
    o e.g., someone feeling tired, answering randomly, etc…
    ❖ Measurement error is thus equally likely to inflate some scores + deflate others
9
Q

Impact of Measurement Error

A

➢ Measurement error can “shift/ reorder” the individual differences
❖ Away from the true individual differences
Two further points on error:
❖ Error tends to cancel itself out across respondents (i.e., mean error = 0)
❖ Error scores are uncorrelated with true scores (i.e., error not related to real scores)
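The two points above can be illustrated with a small simulation. This is only an illustrative sketch: the distributions, means, and standard deviations below are arbitrary assumptions, not values from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# CTT: Observed Score = True Score + (random) Measurement Error
true = rng.normal(loc=50, scale=10, size=n)   # respondents' true scores (assumed distribution)
error = rng.normal(loc=0, scale=5, size=n)    # random measurement error
observed = true + error

# Error tends to cancel itself out across respondents (mean error ≈ 0)
print(round(error.mean(), 2))

# Error scores are uncorrelated with true scores (correlation ≈ 0)
print(round(np.corrcoef(true, error)[0, 1], 3))
```

With a large sample, both printed values sit close to zero, matching the two CTT assumptions about error.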

10
Q

Error Variance

A

➢ A high degree of error variance = a poor measurement
❖ i.e., a low degree of reliability
➢ Is the variability in observed scores similar to the variability in true scores?
➢ Distribution of scores
❖ Observed scores: spread more widely around the mean (error adds variance)
❖ True scores: clustered more closely around the mean
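Because error is random and uncorrelated with true scores, observed-score variance is the sum of true-score variance and error variance, which is why observed scores spread more widely than true scores. A minimal sketch, with assumed standard deviations:

```python
import numpy as np

rng = np.random.default_rng(1)
true = rng.normal(50, 10, size=100_000)   # true scores, SD = 10 (assumed)
error = rng.normal(0, 5, size=100_000)    # error scores, SD = 5 (assumed)
observed = true + error

# Error adds variance: var(observed) ≈ var(true) + var(error),
# so observed scores are spread more widely than true scores
print(round(true.var(), 1), round(error.var(), 1), round(observed.var(), 1))
```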

11
Q

4 Ways to Think About Reliability

A

➢ According to CTT, there are 4 ways to consider reliability (pp. 156–163; Furr textbook)
❖ All involve observed scores, true scores, and measurement/ total error
➢ Two ways consider differences in the variances in observed, true, and error scores
1. Reliability = Proportion of true score variance compared to observed score variance
❖ Reliability is high when most of the observed score variance is attributable to true score variance (rather than error variance)
2. Reliability = A lack of measurement error variance
❖ Reliability is high when the error variance is small (compared to observed score variance)
➢ Two ways consider correlations between observed, true, and error scores
3. Reliability = Strong correlation between observed scores and true scores
❖ Reliability is high when the observed scores are correlated with the true scores
4. Reliability = A lack of correlation between observed scores and error scores
❖ Reliability is high when observed scores are not associated with measurement error
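In a simulation where true and error scores are known, all four definitions yield (approximately) the same number. The standard deviations below are assumed values chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
true = rng.normal(0, 1.0, size=200_000)    # true scores (assumed SD = 1.0)
error = rng.normal(0, 0.6, size=200_000)   # error scores (assumed SD = 0.6)
observed = true + error

r1 = true.var() / observed.var()                  # 1. true variance / observed variance
r2 = 1 - error.var() / observed.var()             # 2. lack of error variance
r3 = np.corrcoef(observed, true)[0, 1] ** 2       # 3. squared observed-true correlation
r4 = 1 - np.corrcoef(observed, error)[0, 1] ** 2  # 4. lack of observed-error correlation

# All four agree, near the theoretical value 1 / (1 + 0.6²) ≈ .74
print([round(r, 2) for r in (r1, r2, r3, r4)])
```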

12
Q

Reliability Coefficient (R)

A

➢ Each of the previous 4 methods provides a “coefficient of reliability” (R)
➢ The size of the reliability coefficient (R) indicates the degree of reliability
❖ A reliability coefficient ranges between 0 and 1
o If R = 0 then no reliability
o If R = 1 then the observed and true scores will be perfectly aligned
❖ Measurement error always occurs to some degree (very unlikely scores perfectly align)
➢ Larger R values indicate greater reliability
➢ As a rule of thumb, R values greater than .70 are deemed satisfactory
❖ Although it is often expected these are higher in applied settings
❖ Tests that may have a practical implication for people (i.e., interventions)
(Remember this for the next lecture on tests for reliability)

13
Q

Standard Error of Measurement

A

➢ Although the reliability coefficient (R) is extremely important….
❖ It does not directly reflect the size of any measurement error
o R indicates the degree of reliability but not the average size of error scores
➢ The size of measurement error has implications for interpreting the accuracy of tests
➢ Standard Error Of Measurement (sem)
❖ The standard deviation of the error scores
❖ Represents the average size of error scores
➢ Greater standard error = greater difference between observed & true scores
❖ Higher standard error values = less reliability
➢ Closely linked to the reliability coefficient
❖ If standard error of measurement = 0 then reliability will be perfect
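The link between the two quantities is the standard CTT formula sem = SD(observed) × √(1 − R). A short sketch with hypothetical values (an IQ-style test with observed-score SD of 15 and R = .90):

```python
import math

# Hypothetical values, not from the lecture
sd_observed = 15.0
R = 0.90

# Standard CTT formula: sem = SD(observed) * sqrt(1 - R)
sem = sd_observed * math.sqrt(1 - R)
print(round(sem, 2))   # 4.74 — the average size of the error scores

# Consistent with the last point: perfect reliability means zero error
print(sd_observed * math.sqrt(1 - 1.0))   # 0.0
```

Note the two directions: higher R shrinks the sem, and R = 1 forces the sem to exactly zero.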

14
Q

Reliability in Practice

A

➢ Reliability is framed in terms of observed, true, and error scores
➢ In reality, the only scores we actually obtain are the observed scores
❖ We know individuals will have a “true” score (but we won’t know what this is)
❖ We also know that measurement error exists (but this is random)
➢ Whilst we cannot know the true reliability or standard error in tests….
❖ There are established methods to estimate this

15
Q

Estimating Reliability

A

➢ Classical Test Theory proposes different ways to estimate reliability of measures
❖ At least one method should be used every time a test is used with a new sample
➢ Essentially, these methods are based on administering two tests to the respondents
❖ Checking for consistency across the tests
1. Administer two different versions of a test (that measure the same thing)
❖ e.g., Give one depression measure at Time 1 and a different depression measure at Time 2
o Then check to see if the scores are related (i.e., consistent)
2. Giving people the same test twice (i.e., identical tests)
❖ e.g., Give a depression measure at Time 1 and the exact same depression measure at Time 2
o Then check to see if the scores are related (i.e., consistent)
3. View the different items on a single test as essentially “separate” tests
❖ e.g., Exploring how the respective items/ questions on a test were answered
o Have items that measure the same thing been answered “consistently”

16
Q

Estimating Reliability Pt.2

A

➢ Which of these methods of estimating reliability is appropriate depends on:
❖ Some key assumptions
❖ And what we are trying to test
➢ CTT indicates some essential assumptions for all estimates of reliability:
o Each test’s error scores are random
o The two tests reflect a single (same) construct
o The true scores on the tests are linearly related (people’s real experience does not change)
➢ Remember, these methods rely on administering and comparing two “tests”…
❖ We need to think about how these “test” scores are related to one another
➢ 4 models are proposed that help decide the most applicable method of estimating reliability
❖ A method of estimating reliability is only valid if it fits one of these models
❖ Each model varies in strictness about how to estimate reliability

17
Q

Models of Reliability

A

➢ Parallel Model (most restrictive)
❖ To accurately compare two test scores - the two tests must be “parallel”
o All items must measure the same construct and use the same unit of measurement
o All item true and error scores are assumed to be equal (can compare all scores)
➢ Tau-Equivalent Model
❖ Individual items measure the same construct and use the same measurement units
o Possibly different amounts of error
o Tests are not truly parallel – you can only evaluate the item scores on each test
➢ Essentially Tau-Equivalent
❖ Individual items measure the same construct and use the same measurement units
o But possibly with different degrees of precision and error variances
o Tests are not truly parallel – you can only evaluate the item scores on each test
➢ Congeneric Model (least restrictive)
❖ Individual items measure the same construct
o But may use different scales
o Possibly different degrees of precision and amounts of error

18
Q

Models = Methods

A

Each model will allow different methods to estimate reliability:
➢ Parallel Model
1. Administer two different versions of a test (that measure the same concept)
2. Giving people the same test twice (i.e., identical tests)
3. Explore different items on a single test as essentially “separate” tests
➢ Tau-Equivalent Model
3. Explore different items on a single test as essentially “separate” tests
➢ Essentially Tau-Equivalent
3. Explore different items on a single test as essentially “separate” tests
➢ Congeneric Model
❖ None of the three methods apply, as the tests have different scales, precision, and error
o Can instead use the omega (ω) coefficient of reliability (see p. 205, Furr textbook)

19
Q

Estimating Reliability

A

➢ These methods can be distinguished by specific tests for reliability
❖ These methods differ in the kind of data they produce (and underlying assumptions)
1. Two different versions of tests = Alternate-Forms Reliability (uses overall test scores)
❖ Reliability estimated by the consistency of scores between two different versions of a test
o Two different tests that measure the same construct (at two different times)
o Check how the sets of scores correlate with one another (i.e., are they consistent)
2. Giving the same test twice = Test-Retest Reliability (uses overall test scores)
❖ Reliability estimated by the consistency of scores on the same measure at different times
o Giving a measure to a group and then again at a later point in time
o Check the test-retest correlation between the two sets of scores
3. Explore different items on a test = Internal Consistency (inter-item relations)
❖ Reliability estimated by the consistency of scores on “parts” of the same measure
o Based on correlations between different items on the same test
o Indicates whether items measuring the same construct produce similar scores
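Method 3 (internal consistency) is most often computed with coefficient alpha (Cronbach’s alpha), which combines the inter-item relations into a single reliability estimate. The formula below is the standard one; the simulated respondent data are hypothetical assumptions for illustration:

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Coefficient alpha for a (respondents x items) score matrix."""
    k = items.shape[1]
    sum_item_vars = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - sum_item_vars / total_var)

# Hypothetical data: 4 items, each = shared trait + independent noise
rng = np.random.default_rng(3)
trait = rng.normal(0, 1, size=(5_000, 1))           # shared construct
items = trait + rng.normal(0, 1, size=(5_000, 4))   # item scores

alpha = cronbach_alpha(items)
print(round(alpha, 2))   # high alpha: items are answered "consistently"
```

Because every item here reflects the same underlying trait, alpha comes out well above the .70 rule of thumb; items measuring unrelated things would drive it toward zero.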

20
Q

Practical Importance

A

➢ Reliability of test scores is important for the quality of decision making
➢ Test scores impact decisions made about ordinary people
❖ Children can be removed from standard education based on intelligence/ achievement
o Or specific psychological diagnosis
❖ People may be given specific diagnoses based on test scores (e.g., dyslexia, anxiety)
❖ Decisions about employment may be based on personality tests
➢ Test scores usually lead into making decisions about individuals and situations
❖ Some have direct consequences for individuals (social & psychological)
❖ Others may affect individual’s motivation & goals
❖ Others may impact how individuals/ groups are viewed in society
➢ Given the importance of test scores,
any mistake could have serious consequences

21
Q

Implications

A

➢ It is essential we can be confident that observed test scores reflect true scores
❖ i.e., scores/ differences are not heavily impacted by measurement error
➢ We never actually know the “true” level of individuals’ psychological attributes
❖ i.e., we can never truly know someone’s intelligence, motivation, self-esteem etc..
➢ But reliability tests help us estimate an individual’s true level on an attribute
❖ At least, attempt to do so to the best of our ability
➢ Because we can only estimate reliability…
we need to evaluate the precision of individuals’ test scores
➢ Without considering reliability of test scores…
the scores can be scientifically questionable/ meaningless