Reliability Flashcards

1
Q

Who was Francis Galton?

A

A childhood genius and cousin of Charles Darwin who studied human differences; the first person to apply statistical methods to the study of differences between humans. Found that the mean of the ox weight guesses came closest to the ox's actual weight, even though the median might seem better because it discards extreme scores; in fact the extreme guesses canceled each other out in the mean. Plotted the guesses and showed they formed a bell curve. Came up with the idea that "error cancels out."

2
Q

What is Galton’s Ox?

A

a contest in which a crowd guessed the weight of an ox; illustrates the law of errors: the average of many independent guesses comes out very nearly correct

3
Q

What is reliability?

A

reliability refers to the degree to which test scores are free from errors of measurement; the accuracy of a mean score from an unreliable test depends on how the errors are distributed, e.g., if they are unevenly distributed, it is harder to tell what the true score is

4
Q

What is the goal of psychological measurement?

A

To detect psychological differences, and to do so reliably.

Test scores are used to indicate levels of psychological attributes; differences among people's test scores are used to indicate true psychological differences among people. The central question: to what degree are differences in observed (test) scores consistent with differences in true levels of psychological attributes?

5
Q

What is Classical Test Theory?

A

X (observed score) = T (true ability) + E (random error)

6
Q

What are the assumptions of classical test theory?

A
  1. Observed scores on a psychological measure are determined by a respondent’s true score and by measurement error
  2. Error is random (consequences: 1. Errors tend to cancel out across respondents 2. Error scores are uncorrelated with true scores)

classical test theory says X = T + E, so E(X) = T: if a person were to take a test repeatedly, averaging would cancel out any error (each testing is independent of the others AND error is random and independent), e.g., the people guessing the ox's weight didn't confer beforehand
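The "error cancels out" idea can be sketched with a small simulation (the true score of 100 and error SD of 15 below are made-up illustration values):

```python
import random

random.seed(42)

T = 100          # hypothetical true score (assumed value)
error_sd = 15    # hypothetical random-error standard deviation

# Simulate one person taking the test many independent times:
# each observed score is X = T + E, with E random and mean zero.
observed = [T + random.gauss(0, error_sd) for _ in range(10_000)]

mean_observed = sum(observed) / len(observed)

# Across many testings the random errors cancel, so E(X) approaches T.
print(round(mean_observed, 1))  # close to 100
```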

7
Q

What is signal and noise referring to?

A

true score is the "signal"; measurement error is "noise" obscuring the signal; the observed score is affected by both. Reliability = signal / (signal + noise). If a person takes a test on two different occasions under the same conditions, we would expect the same result, assuming: 1. that the conditions under which the test was taken were exactly the same, and 2. that the person's underlying true score did not in fact change. Reliability is about how closely observed scores (X) approach true scores (T); with no noise, X = T

8
Q

How do you estimate reliability?

A

Repeated measures. Logic: if the test were not reliable, you would expect to see differences in a person's observed scores, such that the score from Time 1 would be slightly different from the score from Time 2. If the test conditions were not the same, then any unreliability might be due to the difference in conditions and not to unreliability of the test itself; it is also possible that the scores from both T1 and T2 are very poor estimates of the person's true score for reasons other than the test. Taking the mean of repeated measures is believed to give a better estimate of the person's true ability: sometimes the error increases the score, sometimes it decreases it, but over many testings the errors average out.

9
Q

What is the relationship between error on different test scores?

A

Errors (E) are uncorrelated with true scores (T): error does not depend on whether a person's true level is high or low; corr(E, T) = 0 (this holds if error is random and independent). Errors from two different tests are also uncorrelated (on parallel tests, error from one test is uncorrelated with the true score on the other). The conditions of test administration have to be exactly the same (very unlikely in practice, however).

10
Q

What is the goal of reliability theory?

A

to estimate errors in measurement and to suggest ways of improving tests so that errors are minimized; because errors are random across a large number of individuals, the variance of obtained scores decomposes as Sx^2 = ST^2 + Se^2 (observed-score variance = true-score variance + error-score variance)
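A quick numerical sketch of the variance decomposition (all means and SDs below are invented for illustration):

```python
import random

random.seed(0)

n = 100_000
# Hypothetical group: true scores plus independent random errors.
true_scores = [random.gauss(50, 10) for _ in range(n)]
errors = [random.gauss(0, 5) for _ in range(n)]
observed = [t + e for t, e in zip(true_scores, errors)]

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

s_t2, s_e2, s_x2 = variance(true_scores), variance(errors), variance(observed)

# T and E are uncorrelated, so Sx^2 = ST^2 + Se^2 (up to sampling error).
print(round(s_x2, 1), round(s_t2 + s_e2, 1))

# Reliability coefficient: proportion of observed variance due to true scores;
# here the population value is 100 / (100 + 25) = 0.80.
reliability = s_t2 / s_x2
print(round(reliability, 2))
```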

11
Q

What is variance?

A

reflects the extent to which individuals differ from one another (relative to the test mean); it is determined by the degree to which the scores in the group actually differ

12
Q

How does random error impact distribution?

A

distributions with random error have higher variance than distributions of error-free scores; the mean, however, stays the same, because random error cancels out

13
Q

What is the reliability coefficient?

A

the ratio of true-score variance to the total variance of test scores; the proportion of all the variability in test scores that is due to (accounted for by) true scores: variance of true scores over variance of observed scores (true-score variance plus error variance), i.e., rxx = ST^2 / Sx^2 = ST^2 / (ST^2 + Se^2)

14
Q

What are the two types of error?

A
  1. Random error
  2. Systematic error
    X = T + er + es (where er is random error and es is systematic error)
15
Q

What is systematic error?

A

Any bias; systematic error shifts the average (unlike random error, it does not cancel out), which is why we call it a bias

16
Q

What are some ways you can reduce measurement error?

A

pilot testing (making sure items are clear, etc.), thorough training (for administrators), double-checking the data, statistical correction (an adjustment made to the data because a bias is known and can be accounted for), multiple measures (give multiple measures of the same construct; e.g., if you create a new measure of depression and want to make sure it isn't biased, you could also give older, reliable depression tests to confirm that a participant is depressed)

17
Q

Can we be sure about the reliability of a set of test scores?

A

NO!
Reliability is defined in terms of quantities we can't truly know, so we never truly know the reliability of a set of test scores. However, we can estimate reliability based on the information we do have. There are various methods of estimating reliability; all involve administering two (or more) testings to a sample of respondents (two versions of the same test, one test administered at two times, or one test divided into two or more parts).

18
Q

What are four ways to estimate reliability?

A
  1. Test-retest reliability
  2. Parallel-forms reliability
  3. Internal consistency reliability
  4. Inter-rater reliability
19
Q

What is test-retest reliability?

A

Give the same test twice to one sample of respondents; compute the correlation between scores from the two testing occasions; that correlation is an estimate of the test's reliability. It can be influenced by reactivity (participants change simply from having taken the test before), by history between the two testings, and by participants being in different moods; for these reasons, test-retest is really more an estimate of stability than of reliability.
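A minimal sketch of the test-retest logic, using invented true scores and error SDs: the same true score gets fresh random error at each occasion, and the correlation between the two occasions recovers the reliability.

```python
import math
import random

random.seed(1)

# Hypothetical sample: each person's true score plus fresh random error
# at each of two testing occasions (all numbers are made up).
true_scores = [random.gauss(50, 10) for _ in range(5_000)]
time1 = [t + random.gauss(0, 5) for t in true_scores]
time2 = [t + random.gauss(0, 5) for t in true_scores]

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / n)
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / n)
    return cov / (sx * sy)

# The test-retest correlation estimates reliability;
# here the population value is 100 / (100 + 25) = 0.80.
print(round(pearson_r(time1, time2), 2))
```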

20
Q

What is parallel-forms reliability?

A

Create two versions of the test, so there is no need to worry about history/learning effects; both forms must measure the construct equally well (same as "alternate forms"). Administer both forms to a sample of respondents (the forms must fit the parallel-tests CTT model); compute the correlation between scores from the two forms; that correlation is an estimate of the test's reliability. Across the forms, errors should be uncorrelated while the true score should be the same. In practice it is hard to write enough items and to divide them so the forms are equivalent, since some items tap the construct better than others.

21
Q

What is internal consistency reliability?

A

There are many varieties of internal-consistency estimates of reliability; the most well-known is alpha (a, aka Cronbach's alpha). Others: average inter-item correlation (how well a set of slightly different items measures the same underlying construct equally well); split-half reliability (how well one randomly chosen half of the test correlates with the other half). Cronbach's alpha takes every possible split-half into account, i.e., all the ways all of the items are related to each other.

22
Q

What is inter-rater reliability?

A

Degree of agreement among raters; how much homogeneity or consensus exists in the ratings given by various judges. Cohen's kappa was developed to account for the possibility that raters actually guess on at least some items due to uncertainty; it ranges from -1 to +1.

23
Q

What is Cohen’s kappa?

A

developed to account for the possibility that raters actually guess on at least some items due to uncertainty; ranges from -1 to +1; a test of inter-rater reliability
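A minimal worked example of kappa on made-up ratings from two hypothetical raters: kappa is observed agreement corrected for the agreement expected by chance.

```python
from collections import Counter

# Hypothetical ratings from two raters on the same 10 cases.
rater1 = ["yes", "yes", "no", "yes", "no", "no", "yes", "yes", "no", "yes"]
rater2 = ["yes", "no", "no", "yes", "no", "yes", "yes", "yes", "no", "yes"]

n = len(rater1)

# Observed agreement: proportion of cases where the raters match.
p_o = sum(a == b for a, b in zip(rater1, rater2)) / n

# Expected chance agreement, from each rater's category proportions.
c1, c2 = Counter(rater1), Counter(rater2)
p_e = sum((c1[cat] / n) * (c2[cat] / n) for cat in set(rater1) | set(rater2))

# Kappa corrects observed agreement for chance agreement.
kappa = (p_o - p_e) / (1 - p_e)
print(round(kappa, 2))  # -> 0.58
```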

24
Q

What impacts the accuracy of each reliability estimate method?

A

the accuracy of each method depends on the validity of key assumptions; each method depends on different sets of assumptions; sets of assumptions are referred to as measurement models

25
Q

What is Cronbach's alpha? What are the steps for computing it?

A

imagine a 6-item test; are the test's parts consistent with each other? Do people with high scores on one item tend to have high scores on the others? Steps: 1. Compute the inter-item covariances and sum them 2. Compute the variance of the total scores 3. Compute alpha
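The three steps can be sketched on simulated data (the 6-item test below is hypothetical; each item is modeled as a common true score plus item-specific noise):

```python
import random

random.seed(2)

n_people, k = 2_000, 6
true_scores = [random.gauss(0, 1) for _ in range(n_people)]
# items[i][p] = person p's score on item i (true score + item noise).
items = [[t + random.gauss(0, 1) for t in true_scores] for _ in range(k)]

def covariance(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n

# Step 1: sum the inter-item covariances (all i != j pairs).
cov_sum = sum(covariance(items[i], items[j])
              for i in range(k) for j in range(k) if i != j)

# Step 2: variance of the total scores.
totals = [sum(item[p] for item in items) for p in range(n_people)]
total_var = covariance(totals, totals)

# Step 3: compute alpha.
alpha = (k / (k - 1)) * cov_sum / total_var
print(round(alpha, 2))
```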

26
Q

What is the problem with alpha?

A

the accuracy of any estimation method depends on its assumptions, and alpha is appropriate only for tests meeting certain assumptions; the concern is that alpha is too often used with data that do not meet those criteria, and thus many estimates based on alpha may be inaccurate

27
Q

How do you decide which reliability estimate to use? What is each one best for?

A

Inter-rater reliability is best for observation; parallel-forms reliability is used when you intend to use the two forms to measure the same thing (constraint: you must have multiple forms); test-retest reliability is best in most experimental and quasi-experimental designs that use a control group

28
Q

What is reliability of difference scores? Why would you calculate it?

A

pretest-posttest differences from an intervention; when estimating the reliability of difference scores: a high correlation between the two tests leads to lower reliability of the difference score, while high reliability of each test leads to higher reliability of the difference score; the meaningfulness of the difference score also depends on the degree to which the two tests have similar variabilities
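Both claims follow from the standard CTT formula for the reliability of a difference score D = X - Y (assuming the two tests have equal variances and uncorrelated errors), which can be sketched as:

```python
def diff_score_reliability(r_xx, r_yy, r_xy):
    """Reliability of D = X - Y under equal test variances and
    uncorrelated errors (standard CTT result)."""
    return (r_xx + r_yy - 2 * r_xy) / (2 - 2 * r_xy)

# Highly reliable tests, low correlation between them -> decent r_D:
print(round(diff_score_reliability(0.90, 0.90, 0.50), 2))  # -> 0.8

# Same reliabilities, but highly correlated tests -> poor r_D:
print(round(diff_score_reliability(0.90, 0.90, 0.85), 2))  # -> 0.33
```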

29
Q

What are two ways to improve reliability?

A
  1. Longer tests are better (diminishing returns)
  2. Tests with stronger internal consistency are better (use clear, coherent items; identify and eliminate/replace bad items [items that aren't consistent with the other items])
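Point 1 is usually quantified with the Spearman-Brown prophecy formula; a sketch showing the diminishing returns of lengthening a test:

```python
def spearman_brown(r, factor):
    """Projected reliability when a test is lengthened by `factor`
    (Spearman-Brown prophecy formula)."""
    return factor * r / (1 + (factor - 1) * r)

# Doubling a test with reliability .60 helps a lot;
# each further doubling helps less.
r = 0.60
for factor in (1, 2, 4, 8):
    print(factor, round(spearman_brown(r, factor), 2))
# -> 1 0.6 / 2 0.75 / 4 0.86 / 8 0.92
```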
30
Q

Is it easier to make reliable inferences about variable or stable characteristics?

A

Stable

31
Q

What is a good alpha value (internal consistency)?

A

a > .9 Excellent; .8-.9 Good; .7-.8 Acceptable; .6-.7 Questionable; .5-.6 Poor; below .5 Unacceptable

32
Q

When are high levels of test reliability necessary?

A
  1. Tests are used to make final decisions about people
  2. Individuals are sorted into many different categories based on relatively small individual differences

33
Q

When are low levels of reliability okay?

A
  1. Tests are used for preliminary rather than final decisions
  2. Tests are used to sort people into a small number of categories based on gross individual differences
34
Q

What is the relationship between reliability and validity?

A

Reliability places an upper bound on validity; tests that are reliable are not necessarily valid, but a test cannot be valid if it isn't reliable