Wk 4 - Reliability Flashcards

1
Q

What is classical test theory?

What equation does it give us for the relationship between scores? (x3)

A

It’s the conceptual basis for psychometrics
The observed score = True score + Error of measurement
X = T + E

2
Q

What is true score theory? (x1)

A

Another name for Classical Test theory

3
Q

What is reliability in terms of the relationship between true and total variance? (x4)

A

According to Classical Test Theory,
r = true variance (the hypothetical variation of test scores in a sample if there were no measurement error) divided by total variance (the actual variation in the data, including error)
r = sigma²(T) over sigma²(X)
Therefore measurement error is inversely related to reliability – lower measurement error = higher reliability

4
Q

Why do we describe classical test theory in terms of variance rather than standard deviations? (x2)

A

Because variance is additive and can be broken up into its components
Whereas SD can’t

5
Q

What are four sources of measurement error?

A

Test construction
Test administration
Test scoring
Other influences

6
Q

What is item sampling/content sampling? (x2 plus egs)

A

Content sampling would be testing every aspect of the trait/skill
eg all the content of the course over 24 hours
Item sampling is testing a representative proportion of it
eg the 2-hour exam

7
Q

Why can we only estimate the reliability of a test and not measure it directly? (x2)

A

Because true variance is a hypothetical/theoretical construct -
We can’t measure everyone on the planet and work it out

8
Q

Name and describe four methods available to us to help estimate the reliability of a test.

A

Test-retest – how do scores correlate if people sit the same test twice?
Alternate-forms – how do scores correlate if people do two different versions of same test
Internal consistency – how much do the items in a test correlate with each other, on average? (Cronbach’s alpha, KR-20)
Inter-rater reliability – check the correlation on two/more different examiner ratings

9
Q

Describe the steps involved in calculating Cronbach’s alpha by hand.

A

Split questionnaire in half
Calculate total score from items in each half
Work out correlation between those totals (the two halves)
Repeat steps 1-3 for all possible two-way splits of the total number of items
Work out the average of all the possible split-half correlations
Adjust the correlation to account for the fact that you’ve shortened (halved) the test – a special version of the Spearman-Brown formula (the fewer the items, the lower the reliability correlation; so when you cut the test in half you’re artificially lowering the reliability)
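The steps above can be sketched in Python (a minimal sketch using NumPy; the function name is my own, and it assumes an even number of items so the test splits into equal halves):

```python
from itertools import combinations

import numpy as np

def splithalf_alpha(scores):
    """Estimate Cronbach's alpha as the Spearman-Brown-corrected
    mean of all possible split-half correlations.

    scores: (n_people, n_items) array-like of item scores."""
    scores = np.asarray(scores, dtype=float)
    n_items = scores.shape[1]
    half = n_items // 2                          # assumes an even item count
    rs = []
    for first in combinations(range(n_items), half):
        second = [i for i in range(n_items) if i not in first]
        a = scores[:, list(first)].sum(axis=1)   # total score, half 1
        b = scores[:, second].sum(axis=1)        # total score, half 2
        rs.append(np.corrcoef(a, b)[0, 1])       # correlate the two halves
    mean_r = np.mean(rs)                         # average over all splits
    return 2 * mean_r / (1 + mean_r)             # Spearman-Brown correction
```

Perfectly consistent items (every person scoring the same pattern on all items) give an alpha of 1.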

10
Q

How is the KR-20 calculated?

A

As with Cronbach’s alpha, this formula gives you the same estimate as if you’d worked out the mean of the correlations between all possible halves of your questionnaire (then corrected for halving)
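As a sketch, the standard direct KR-20 formula – (k/(k-1)) x (1 - sum of item variances p x q over total-score variance) – can be written as follows (the function name is my own):

```python
import numpy as np

def kr20(scores):
    """KR-20 for dichotomous (0/1) items.

    scores: (n_people, n_items) array-like of 0s and 1s."""
    X = np.asarray(scores, dtype=float)
    k = X.shape[1]                       # number of items
    p = X.mean(axis=0)                   # proportion scoring 1 on each item
    q = 1 - p                            # proportion scoring 0
    var_total = X.sum(axis=1).var()      # variance of total scores
    return (k / (k - 1)) * (1 - (p * q).sum() / var_total)
```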

11
Q

What’s the difference between parallel forms and alternate forms? (x3)

A

Both give correlation between scores on 2 versions of same test by same people at same time, but
Alternate forms just needs high Coefficient of Equivalence, whereas
Parallel also requires that Mean, SD and correlations with other tests must be the same

12
Q

What is the coefficient of equivalence in the context of a test with parallel forms? (x3)

A

The correlation between two versions of the same test
Applies to parallel and alternate
Also another term for the reliability coefficient used in these methods

13
Q

List five considerations that might affect which reliability estimate you can use

A
  1. Homogeneity/heterogeneity of the test
  2. Static vs dynamic characteristics
  3. Restriction of range/variance
  4. Speed tests versus power tests
  5. Criterion-referenced tests
14
Q

What is a homogeneous test? (x1)

A

One in which the test items all measure the same thing.

15
Q

What is a heterogeneous test? (x2)

A

One in which more than one independent thing is being measured

i.e. there are subscales that don’t intercorrelate highly

16
Q

Describe in detail exactly what the standard error of measurement is supposed to represent (x4)

A

The reliability of a test will never be 100%
So knowing the margin of error is critical in interpreting the meaning of an individual’s scores
Assuming a normal distribution of those test scores, with the person’s true score at the centre, the SEM is the SD of that distribution
Mostly hypothetical – we use the one time they did the test to estimate what the spread of scores would be if one person took the same test many times

17
Q

What is the CI? (x1)
Why do we have to add and subtract DOUBLE the SEM from an individual’s score in order to get the 95% confidence interval? (x4)
Plus eg calculation

A

The range of scores that is likely to contain a person’s true score
Because under a normal distribution (assumed to be the case):
68% of scores are within +/- 1 SD/SEM of the mean, while
95% are within +/- 1.96
(99.7% are within 3)
WAIS IQ score of 105, SD of all IQ tests is 15, reliability is .98
SEM = 15 x sqrt(1-.98) = 2.12
CI = [105 +/- (2*2.12)] from 101 to 109
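The worked example above, checked in Python:

```python
import math

sd, rxx, score = 15, 0.98, 105              # WAIS example from this card
sem = sd * math.sqrt(1 - rxx)               # standard error of measurement
lo, hi = score - 2 * sem, score + 2 * sem   # ~95% CI: 2 SEMs either side
print(round(sem, 2), round(lo), round(hi))  # 2.12 101 109
```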

18
Q

What is the reliable change index? (x1)
How do you calculate? (x1)
And how do you apply? (x1)

A

In clinical practice, a variation on the SEdiff approach which is mathematically equivalent to it
Work out the diff between two scores (eg change during intervention), and divide by the SEdiff
If the RCI is greater than 1.96 (ie 2 standard errors of the difference) you have statistically significant change
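A minimal sketch of the calculation (assuming the same test at both time points, so SEdiff = sqrt(2 x SEM²); the function name is my own):

```python
import math

def rci(score_before, score_after, sem):
    """Reliable Change Index: the score difference divided by the
    standard error of the difference (same test twice, so
    SEdiff = sqrt(SEM^2 + SEM^2))."""
    se_diff = math.sqrt(2 * sem ** 2)
    return (score_after - score_before) / se_diff

# |RCI| > 1.96 means statistically significant change (95% level)
print(abs(rci(40, 35, 3)) > 1.96)  # False: a 5-point drop is not enough
```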

19
Q

True or false?
When calculating the Cronbach’s alpha, you have to multiply the correlations derived from all the possible split-half correlations of the items
And why? (x1)

A

False, you average them

20
Q

True or false?
As part of the process of calculating Cronbach’s alpha, you need to adjust for the halving of the number of items by applying a special version of the Spearman Brown formula.
And why? (x2)

A

True, because the reliability correlation will be lower the fewer items you have
So halving the test is artificially lowering it

21
Q

True or false?
Imagine we revised a test and as a result the true variance became a greater proportion of the total variance. This would mean that the standard deviation of scores of a single person taking the test multiple times would become smaller
And why? (x2)

A

True
Because:
Total variation = True (systematic) variation + Error (unsystematic) variation
So if True becomes a greater proportion of the same Total, Error must be shrinking (and with it the SD – the root of the variance)

22
Q

True or false?
The fact that I cannot ask students everything about the course in the exam will decrease the proportion of the observed score that can be accounted for by the true score (assuming the exam mark is supposed to reflect students’ PSYC3020 knowledge)
And why? (x2)

A

True
Because:
Score obtained = True measurement + Error
So, the inability to ask about ‘everything’ will increase the amount of error in the test; therefore, to retain the same observed score, the true score must go down.

23
Q

How can we use variance to explain the relationship between true scores and measurement error, according to Classical Test theory? (x2)
How would this relate to SD of scores? (x1)

A

X = T + E
Total Variance = True variance + Error variance
SD is the root of variance, so the SD of score obtained would be influenced by SD of True and Error too

24
Q

How does test construction affect measurement error? (x2)

Give an example (x1)

A

Whether you can access all of the trait, or only some of it (content vs item sampling)
ie error increases the fewer items you have that tap into what you’re measuring
A two-hour exam vs a 24-hour one that asked about absolutely everything

25
Q
How could test administration affect measurement error? (x1)
Give examples (x3)
A

Administering the test under different conditions could affect performance and therefore error
Distractions, fatigue, experimenter attitude

26
Q
How could test scoring affect measurement error?
Give examples (x3)
A

If scoring methods aren’t standardised, error would increase

Biased examiners, ambiguous scoring methods, technical errors

27
Q

Give examples of Other Influences on measurement error (x2)

A

Self-efficacy

Motivational factors

28
Q

What is the relationship of variance to reliability? (x2)

A

Higher error variance means a wider spread of observed scores,

Indicating lower reliability of the test (more measurement error)

29
Q

What are two other names for internal consistency?

What does it tell us? (x2)

A

Inter-item consistency, internal coherence
The average correlation between items on your scale
Are responses consistent to items that supposedly measure the same thing?

30
Q

What are two measures of internal consistency?

When should you apply each one?

A

Cronbach’s alpha, when there are more than two possible outcomes for an item (eg likert scales)
Kuder-Richardson-20 (KR-20), when there are only two possible outcomes (yes/no, correct/incorrect, true/false)

31
Q

If you had four possible answers on an exam question, would you use Cronbach’s or KR-20 to calculate internal consistency? (x1)
And why? (x1)

A

KR-20

Because there are still only two possible outcomes - wrong or right

32
Q

What are the steps to using SPSS to calculate Cronbach’s alpha for internal consistency? (x6)
How does this differ for KR-20?

A

Select ANALYZE; SCALE; RELIABILITY ANALYSIS; select model ‘Alpha’.
Select all the items in your scale.
Click OK.
It doesn’t - if you have dichotomous variables, it automatically chooses the right formula

33
Q

How do you calculate test-retest reliability? (x1)

What are some considerations to using this reliability measure? (x3)

A

Work out the correlation between scores on the same test by the same people done at two different times
Assumes stable trait, no practice effects/developmental changes

34
Q

How can you get around practice effects in order to use test-retest type reliability? (x3)

A

Use alternate forms – equivalent difficulty and content, or
Parallel forms - Mean, SD of each test must be the same, but both versions also need equal correlation with entirely different tests
Both rely on high Coefficient of Equivalence - the correlation between two versions of the same test

35
Q

How do you calculate inter-rater reliability? (x1)

Give an example of this in practice

A

The correlation between scores on the same test by the same person provided by 2 different examiners
As part of tutor training, one essay given to all markers, then argue over the different marks until they reach agreement/consistency

36
Q

Describe how test hetero-/homogeneity might influence your choice of reliability test (x3)
Give an example

A

Homogeneous – test items all measure the same thing, use internal consistency
Heterogeneous – more than one independent thing being measured (i.e. there are subscales that don’t intercorrelate highly), use test-retest instead (though you could instead look at the internal consistency of each subscale separately).
Big Five scale – Cronbach’s alpha would be very low, because you’re measuring five traits that aren’t supposed to relate to each other

37
Q
Describe how static vs dynamic characteristics might influence your choice of reliability test (x2)
Give examples (x3)
A

Are you measuring something that is supposed to be stable, or change, over time? State or trait?
If it’s dynamic, test-retest reliability would be problematic
Eg, data shows that intelligence doesn’t vary much over time, while fatigue does
And, trait anxiety is a background level, longer term trait, whereas state is moment by moment – which you would expect to change

38
Q
Describe how restriction of range/variance might influence your choice of reliability test (x2)
Give examples (x2)
A

Inappropriate restriction of the amount that the scores in our sample can vary will affect (likely reduce) the correlation (and therefore ALL our reliability estimates – as they are all based on correlation)
Sensation-seeking increases with age – testing people who are all the same age potentially restricts the range and artificially reduces the correlation
Testing the reliability of an IQ test using only clever people will also reduce the range

39
Q

Describe how using speed vs power tests might influence your choice of reliability test (x3)
Give an example

A

Speed tests measure how quickly you respond; power tests measure the level of difficulty you can handle
Internal consistency is inappropriate for speed tests – it assumes a fair go at all items, which you don’t get (you just stop when the clock runs out)
Use alternate-forms or test-retest reliability instead, or split-half (where you administer the two halves of the test, timed separately)
Eg: for each row, determine whether the first number is the same as or different from the second number; write D if the numbers are different. The final score is how many rows you get through in the given time

40
Q

Describe how using criterion referenced tests might influence your choice of reliability test (x2)
Give an example

A

May be very little variation in responses
This is restriction of range - No variation? No correlation, so reliability tests don’t apply
Eg pass/fail tests where virtually everyone passes, like a diving certification exam or a first-aid test

41
Q

What is the relationship between the number of items in a test, and the reliability coefficient? (x2)

A

Cronbach’s alpha gets bigger the more items are in the test – a more reliable measure with more items measuring the same thing
So much so, that you can estimate/predict by how much, using the Spearman-Brown Adjusted Reliability formula
BUT, diminishing returns - doubling test doesn’t double reliability

42
Q

How do you calculate the Spearman Brown Adjusted Reliability (rSB)?
Plus eg

A

rSB = (n x rxx) / (1 + (n - 1) x rxx), where
n = number of items in new test divided by number of items in old test, and
rxx = reliability of the original test (correlation) before adjusting
Eg: reliability of original test: alpha = .67
Increase the number of items in the test from 10 to 15
• n = 15/10 = 1.5
• New estimated r = (1.5 x .67)/(1 + (0.5 x .67))
• = 1.005/1.335
• = 0.75
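The formula and worked example above, in Python (the function name is my own):

```python
def spearman_brown(rxx, n):
    """Predicted reliability after changing test length by a factor of n."""
    return (n * rxx) / (1 + (n - 1) * rxx)

# From above: a 10-item test with alpha = .67 extended to 15 items
print(round(spearman_brown(0.67, 15 / 10), 2))  # 0.75
```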

43
Q

What does the Neale Analysis of Reading Ability measure? (x1)
How is it scored/referenced? (x4)

A

Oral reading, comprehension, and fluency of students aged 6 to 12 years.
Children read stories out loud, then complete a comprehension test.
Administrator notes down errors and time taken, to give measures of reading accuracy, rate, and comprehension.
Norm-referenced aptitude test - because scores are standardised against other children of the same age.
Raw score is converted into percentiles and stanines using tables that they provide with the test.

44
Q

What is the standard error of the difference used for? (x5)

Plus eg application (x2)

A

To work out whether someone’s score is significantly different from:
Their own score on the same test at a different time
Their score on another test of the same thing
Someone else’s score on the same test
Someone else’s score on another test
Eg a client with depression: test at the beginning, then post-intervention – significance demonstrates the change is unlikely to be due to chance/natural variation

45
Q

How do you calculate the Standard Error of Measurement? (x3)

Plus eg

A

sx = standard deviation of test-takers’ (lots of people not one individual) scores
rxx = reliability of test
SEM = sx times sqrt(1 - rxx)
WAIS IQ score of 105, SD of all IQ tests is 15, reliability is .98
• 15 x root(1-.98) = 2.12

46
Q

Give an example of RL situation where the CI is used (x4)

A

IQ scores:
Less than 70 counts as an intellectual disability
But if they score 71, the CI for their real IQ spans the range 67 to 75
So they may still be eligible for the extra support that’s offered

47
Q

What are two formula methods for calculating the SEdiff?

A

SEdiff = sqrt(sq[SEM1] + sq[SEM2])
Where SEM1 and 2 are the standard error of measurement for tests 1 and 2
SEdiff = sd times sqrt(2 - r1 - r2)
Where sd = standard deviation of test 1 (= standard deviation of test 2 because they’ve been standardized); and r1 & r2 = reliability of tests 1 & 2
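A sketch showing that the two formulas agree when both tests have the same (standardized) SD – function names are my own:

```python
import math

def se_diff_from_sems(sem1, sem2):
    """SEdiff from the two tests' standard errors of measurement."""
    return math.sqrt(sem1 ** 2 + sem2 ** 2)

def se_diff_from_reliabilities(sd, r1, r2):
    """SEdiff from the shared SD and the two reliabilities."""
    return sd * math.sqrt(2 - r1 - r2)

sd, r1, r2 = 15, 0.98, 0.98
sem = sd * math.sqrt(1 - r1)     # SEM is the same for both tests here
print(round(se_diff_from_sems(sem, sem), 2))             # 3.0
print(round(se_diff_from_reliabilities(sd, r1, r2), 2))  # 3.0
```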

48
Q
How do you apply the SEdiff? (x3)
Plus eg (x1)
A

95% CI means there need to be at least 2 x SEdiff between the two individual scores (because of normal distribution)
If so, you can say they are significantly different at 95% CI (95% not likely due to measurement error)
ie change in client’s happiness score needs to be more than 2 SEdiffs for it to be significant

49
Q

If you had a dataset of test scores where the range of scores was substantially restricted, which estimate of reliability would be affected (assuming the data could in principle yield statistics on all reliability estimates)? (x4)

A

All of them:
Internal consistency
Alternate forms
Test-retest
Inter-rater reliability.

50
Q

A man has a motorcycle crash that involves a closed-head injury.
Before his crash, he completed an intelligence test where he scored 40 (mean 50, standard deviation 10).
After his crash, he completed the same intelligence test and scored 35.
If the standard error of the difference is 3 points for the test, then what is the best way to describe the data?
Why? (x3)

A

His intelligence has not significantly changed after the crash (95% confidence)
If the SEdiff is 3, then the man’s intelligence score needs to decrease by at least twice this to represent a statistically significant difference (95% confidence)
i.e. it needs to have decreased by 6 or more.
In fact, it’s only decreased by 5 (40 to 35) - so this change is not significant.

51
Q

True or false, and why? (x1)
As part of the process of calculating Cronbach’s alpha, you have to split the questionnaire into two halves, calculate the total score for each half, and then multiply the total scores together.

A

False

You average the correlations – you don’t multiply them

52
Q

True or false, and why? (x3)
As part of the process of calculating Cronbach’s alpha, you have to adjust for the homogeneity of the test by applying a special version of the Spearman-Brown formula.

A

False
You do indeed have to apply a version of the Spearman Brown formula,
But not because of adjusting for the homogeneity of the test. It’s because a half length test is likely to be less reliable than a full length test.

53
Q

According to Classical Test Theory, if a test has very high reliability then… (x1)
Because…(x2)

A

Virtually all of the total variance must be accounted for by true variance
If virtually all of the total variance is accounted for by true variance, this means measurement error must be very low,
So reliability is high.

54
Q

Two students score 89 and 94 in a multiple-choice IQ test, which has been shown to have a standard error of measurement of 3. The mean of the test is 85 and the standard deviation is 15. Are their scores significantly different (95% level of confidence)? (x1)
Why? (x4)

A

No
Using the formula for SEdiff, we put in the SEM of 3: square root of (3 squared + 3 squared) = square root of 18 = 4.24.
To be significant, the difference between the students’ scores must be more than twice the SEdiff (i.e. 2 x 4.24 = 8.48).
The actual difference (94 - 89 = 5) is less than this.
Therefore their scores are not significantly different

55
Q

If you had a heterogeneous test, which estimate of reliability should you AVOID?
Why? (x2)

A

Internal consistency
If your measure is heterogeneous then internal consistency might be an inappropriate estimate of reliability
Because it assumes that all the items in your test are measuring the same thing.

56
Q

What are alternate forms tests? (x2)

A

When there are two or more versions of a test that are equivalent in content and difficulty,
But may not have exactly the same means and standard deviation.

57
Q

What are parallel forms tests? (x2)

A

When there are two or more versions of a test that are equivalent in content and difficulty,
And also have exactly the same means and standard deviation.

58
Q

True or false and why? (x4)
When we calculate the confidence interval of an individual’s test score, we are assuming that their observed score must be their true score.

A

True
Key words that solve this are “individual’s score” and “calculation” -
To CALCULATE the confidence interval of an INDIVIDUAL’S test score (when we add and subtract twice the standard error of measurement),
We have to assume the true score is in the middle of the distribution
(we’re doing this whether we like it or not when we assume the confidence interval is symmetrical around their observed score)

59
Q

True or false and why? (x3)
If a revised version of a test is found to be more unreliable, then this will increase the Standard Error of Measurement (assuming that the standard deviation of the revised test is the same as the original).

A

True
Increased test unreliability will be associated with an increased Standard Error of Measurement
(remembering that the formula for estimating SEM uses test reliability and test standard deviation - where the latter remains unchanged in this question).

60
Q

True or false and why? (x2)
The Neale Analysis of Reading involves children being told a word and then pointing to the picture (in an array of four pictures) that corresponds to that word.

A

False
Because the statement describes the Peabody Picture Vocabulary Test,
Not the Neale Analysis of Reading (where the description doesn’t actually involve testing reading).

61
Q

True or false?

The Neale Analysis of Reading includes scores based on accuracy and speed (amongst other things).

A

True

62
Q

Imagine you had a questionnaire with 10 items and you were disappointed that its internal consistency was .69. What effect would adding another 10 items be predicted to have on the reliability? (x3)

A

Use the Spearman-Brown prediction formula shown in Lecture 3,
Where n = 20/10 = 2 (the 20-question-long new test is double the length of the 10-question-long existing test) and
rxx = .69
So the new reliability will be: (2 x .69)/(1 + ((2-1) x .69)) = .82