Reliability Flashcards

1
Q

Reliability

A

Red is always hitting the same spot, and it is the correct spot. This qualifies as both reliable and valid.

Green is hitting the same spot every time, but not the correct spot. This is reliable but not valid: consistency, but in the wrong place.

The other two targets are neither reliable nor valid.
Shots scattered all over the place are neither valid nor reliable.

2
Q

Reliability

A

Refers to the degree to which observed scores are “free of error of measurement for a given group” (Standards). It refers to the consistency of scores obtained by the same person taking the same test.
Reliability reflects the precision of measurement.

3
Q

Error

A

Measurement free of error is an ideal that is essentially never achieved, especially in the social sciences; a measurement free of error equals the true score (T).
Realistically, there will always be some amount of error (e) in the observed score (X):
X = T + e
The two main sources of error (e):
1. Unsystematic (random) error
2. Systematic error
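The X = T + e decomposition can be sketched with a short simulation (the true score of 100 and the error SD of 3 are arbitrary, illustrative choices):

```python
import random

random.seed(0)

def observe(true_score, error_sd=3.0):
    """One observed score X = T + e, with random (unsystematic) error e."""
    return true_score + random.gauss(0, error_sd)

# Because random error has no pattern, it averages out: the mean of many
# observed scores from the same person approaches the true score T.
T = 100
scores = [observe(T) for _ in range(10_000)]
mean_observed = sum(scores) / len(scores)
print(round(mean_observed, 1))  # close to 100
```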

4
Q

Unsystematic (Random) Error

A

= the fluctuations in test scores that occur when the same person takes the test several times, due to:
Administration of the test (standard procedures not followed)
Recording or reading (computation errors)
Instrumentation
Personal variation (fatigue, mood, etc.)
Environmental fluctuations (temperature, setting, noise, etc.)
The less random error, the more precise the measurement.

Noise, temperature: things in life that you are not in control of.

There is no pattern to these fluctuations.

5
Q

Systematic Error

Random vs Systematic Error
What is a confound? An extraneous variable in an experimental design that correlates with both the dependent and independent variables.
In the classic example, the weather is a variable that confounds the relationship between ice cream sales and murder rates.

A

Systematic error: there is a pattern. Something is fundamentally wrong with the test itself (e.g., one of the blocks was missing from the test kit, or ethnic and cultural differences are not accounted for). The error comes from the system itself; it is a fault of the test. It is consistent, so you can see it happening every time. Once you figure it out you can predict the bias, and you can make changes to control it.

It comes from:

Instrument bias (different groups test differently)
Tester bias
Co-variation (you are measuring more than one thing at once). Two constructs are tied together, and the test measures both at the same time. Anxiety and depression are the classic example of co-variation: they share many physical similarities and are hard to separate, which creates confounds for the test.
6
Q

Reliability Coefficient

A
Reliability coefficient (rxx) = the ratio of true variance to observed variance
Question: How large is the error variance in relation to the true variance?
Rule: the larger the reliability coefficient, the smaller the error.
Reminder: the standard deviation (S) is the common statistic for describing spread; variance is its square.
7
Q

Reliability under Classical Test Theory

A

X = T + E
Assumption: random error is the same for everyone who takes the test.
This random error is called the Standard Error of Measurement.
In statistics, the true score is the mean score a person would get if they took the same test, measuring the same thing, over and over again.
The closer together the observed scores are across multiple administrations of a test, the less error there is in the test.

8
Q

How reliable is a measure?

A

Reliability can be defined as the ratio of true score variance to observed score variance:

True Score Variance
_________________
Observed Score Variance

Error: variance occurring by random chance.

In practice we estimate the reliability coefficient. We never have one value that is exactly true; there is always uncertainty, so we work with ranges of probabilities. It is a useful tool.
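This ratio can be simulated directly under classical test theory (the true score SD of 10 and error SD of 5 are assumed for illustration, giving an expected ratio of 100 / 125 = 0.80):

```python
import random
import statistics

random.seed(2)

# Simulate observed = true + random error for 5,000 test takers.
true_scores = [random.gauss(100, 10) for _ in range(5000)]
observed = [t + random.gauss(0, 5) for t in true_scores]

# Reliability = true score variance / observed score variance.
reliability = statistics.pvariance(true_scores) / statistics.pvariance(observed)
print(round(reliability, 2))  # near 0.80
```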

9
Q

Sampling Distribution

A

Sampling Distribution: not the same as the sample distribution (all of the scores in one sample plotted on a graph). Sampling is different. To build a sampling distribution of the mean, take a sample of 26 people (the size of our class), then take 1,000 such samples, calculate the mean of each one, and plot those means: that is the sampling distribution. It is the distribution of the means of many samples of a given size. It underlies the norms of a test.
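The procedure described above (many samples of 26, one mean per sample) can be sketched as follows; the population parameters are made up for illustration:

```python
import random
import statistics

random.seed(1)

# A hypothetical population of trait scores.
population = [random.gauss(100, 15) for _ in range(100_000)]

# Draw 1,000 samples of n = 26 and record each sample's mean: the
# distribution of these 1,000 means is the sampling distribution of the mean.
sample_means = [statistics.mean(random.sample(population, 26)) for _ in range(1000)]

print(round(statistics.mean(sample_means), 1))   # centres near the population mean
print(round(statistics.stdev(sample_means), 2))  # near 15 / sqrt(26), the standard error
```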

10
Q

Standard Error of the Mean

Standard Error of the Mean is the standard deviation of the sampling distribution (= the distribution of all possible sample means)

A

Standard deviation of the original distribution = 9.3
Sample size = 16

SEM = 9.3 / √16 = 9.3 / 4 = 2.325
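The same calculation as a small helper:

```python
import math

def standard_error_of_mean(sd, n):
    """Standard error of the mean = SD of the original distribution / sqrt(n)."""
    return sd / math.sqrt(n)

print(standard_error_of_mean(9.3, 16))  # 9.3 / 4 = 2.325
```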

11
Q

Standard Error of Measurement

A

Standard Error of Measurement: imagine one person being tested many times, and plot the distribution of their scores. The standard deviation of that distribution is called the Standard Error of Measurement.

It is the standard deviation of the scores obtained from one person tested repeatedly.

(Reminder: the SD is the square root of the average squared distance from the mean of a distribution.)

We need the standard error of measurement so we can get a better idea of what the true score is, because a true score can never be observed directly.

It gives a range of possible scores around the true score; it takes into consideration error within tests.

Real scores are affected by error, so your score falls in a range of scores around the true score.

There is always a range of scores: you do not look for one score, you look for a range of scores. Many things affect your score (how much sleep you got, anxiety levels, the temperature of the room). Your true score is likely to fall within this range.

12
Q

Standard Error of Measurement

A

Take the observed score.

You can compute the standard error of measurement, then build an interval to the left and right of the observed score using it:

SEM = SD of observed scores × √(1 − reliability coefficient)

Properties of the standard error of measurement:

Smallest possible value = 0 (when reliability is 1)

Reliability can range from 0 to 1

The largest possible value is the standard deviation of the test scores (when reliability is 0)

SEM and reliability are negatively related:

When error goes up, reliability goes down.
When reliability goes up, the standard error goes down.

13
Q

SEM and Individual Scores

A

SEM is important in interpreting individual scores: use the SEM to estimate the person’s true score and its range (“you’re a band, not a point”)
SD of an IQ test is 15
Reliability coefficient is .90
SEM = 15 × √(1 − .90) ≈ 4.74, rounded to 5
SEM = 5
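The card's calculation, written out with the formula SEM = SD × √(1 − rxx):

```python
import math

def standard_error_of_measurement(sd, reliability):
    """SEM = SD of the test scores * sqrt(1 - reliability coefficient)."""
    return sd * math.sqrt(1 - reliability)

print(round(standard_error_of_measurement(15, 0.90), 2))  # 4.74, often rounded to 5
```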

14
Q

Normal Distribution

A

The normal curve is marked off in standard deviations from the mean: −3, −2, −1, 0, +1, +2, +3.

15
Q

Z Scores and the Normal Distribution

A

68 – 95 – 99.7% of scores fall within ±1, ±2, and ±3 standard deviations of the mean, respectively.

16
Q

Creating Confidence Intervals

A

Suppose Michael’s mean IQ score on a test is 100, SEM = 5

At a 95% confidence level (19 out of 20 times), Michael’s true IQ score will be within ±1.96 SEM of the observed score (why 1.96?)

SEM = 5

100 + (1.96 × 5) = 100 + 9.8 = 109.8

100 − (1.96 × 5) = 100 − 9.8 = 90.2

Range is 90.2 – 109.8 (about 90 – 110)

At the 95% confidence level the interval is ±9.8.
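The interval construction as a helper (z = 1.96 for a 95% level; 2.576 would give 99%):

```python
def confidence_interval(observed, sem, z=1.96):
    """Band around an observed score: observed +/- z * SEM."""
    half_width = z * sem
    return observed - half_width, observed + half_width

low, high = confidence_interval(100, 5)  # Michael's example: score 100, SEM 5
print(round(low, 1), round(high, 1))     # 90.2 109.8
```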

17
Q

Confidence Intervals cont’d

A

Since 1.96 × 5 = 9.8 points, Michael’s true IQ will lie between 90.2 and 109.8. So if Michael’s IQ is measured again, we expect the next score to lie within the range of 90 to 110 (with 95% confidence).

What about a 99% confidence level? (Use z = 2.576 instead of 1.96.)

18
Q

Reliability Estimation Procedures

A

Reliability cannot be measured directly because we never know the true score (if we did, this would be a non-issue!)

Thus, we compute reliability estimates based on what we do know: the observed score (X)

19
Q

Reliability Estimation Procedures

A

Reliability estimates are approximations; they are not the exact true value, because you do not have the true score.

You estimate based on what you have, using the Standard Error of Measurement and the standard deviation.

One of the estimates you are going to use is Cronbach's alpha, which is an estimate of reliability. It is a lower-bound estimate of reliability. A lower bound means a more conservative estimate: it underestimates reliability, so your test's reliability is probably higher than its Cronbach's alpha.

Cronbach's alpha is an estimate of the internal consistency of the test. Internal consistency looks at each question and what it seeks to measure: each question should measure the same thing. All of the items on the test should measure the same construct (e.g., they all measure depression). If all the items on the test are highly correlated, the questions hang together.

Cronbach's alpha is a measure of reliability. Reliability is connected with internal consistency because both are about precision. If your questions all measure the same thing, you are hitting the target in the same area; all of the items together are hitting the same spot. An item that does not hit the same mark is not a good item.

For each scale you would compute a Cronbach's alpha.

Other internal consistency estimates: split-half (correlating the two halves of the test) and the Kuder-Richardson formulas.

Test-retest reliability: correlate scores between time 1 and time 2 to see how stable the test is over time; it gives you a coefficient of stability. Over time, does the test still hit the bullseye? If the person does not come back for the second testing, you can use a parallel form or an alternate form instead. A parallel form has matched items, and the second test must have the same mean and SD as the first; this is very difficult to achieve. With an alternate form, the second test does not have to have the same mean and SD.

With repeated testing you can get a practice effect when the time between tests is too short. Confidence levels can also affect the scores (unsystematic error), and an event such as a brain injury between testings will affect the results as well. The interval has to be selected in a meaningful way: for a mood test, within a day or two or even the same day; for a personality test, you could expect high reliability over longer intervals.

Inter-rater reliability: Cohen's kappa gives you a value from 0 to 1 (.8 or higher is good, .5 is moderate, .2 is low). Inter-rater reliability asks: if two people watch the same thing, do they see and score it the same way?
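Cronbach's alpha has a standard formula, alpha = k/(k − 1) × (1 − Σ item variances / total score variance). A minimal sketch with made-up data (three hypothetical items scored by five respondents):

```python
import statistics

def cronbach_alpha(items):
    """Cronbach's alpha for a list of item-score columns (one list per item)."""
    k = len(items)
    item_vars = [statistics.pvariance(col) for col in items]
    totals = [sum(scores) for scores in zip(*items)]  # each respondent's total score
    total_var = statistics.pvariance(totals)
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# Three hypothetical items, five respondents; the items are highly
# correlated, so alpha comes out high (near 1).
items = [
    [1, 2, 3, 4, 5],
    [1, 2, 3, 5, 5],
    [2, 2, 3, 4, 5],
]
print(round(cronbach_alpha(items), 2))
```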

20
Q

Test-Retest Reliability

A

Estimates test stability over time
Measurement errors are assumed to be random fluctuations in observed scores around the true score from one testing session to another.

21
Q

Problems with Test Retest

A

Will the reliability be underestimated or overestimated if:
1. The time between testings is too long, so some people learn and others forget?
Underestimated
Here, changes in scores are not actually measurement error.
2. The time between testings is too short?
Overestimated
Memory of earlier responses makes the scores more similar than they should be.

22
Q

Test Retest

A

The time between retestings needs to be long enough to prevent memory from affecting the scores, and short enough that maturation and historical changes do not affect them.
For some tests that can be several years.
A low stability coefficient may imply that the construct itself is unstable.

23
Q

Parallel Forms/Alternate Form Method

A
Coefficient of Equivalence 
Here you do not take the same test twice but rather two forms of the test are used. 
They are called parallel tests 
For a parallel test to work 
They both need to sample the same content universe in some way 
They must have equal: 
Means
Variances
Inter-item covariances
24
Q

Parallel Forms and Test Retest

A

Administer Form A – wait – administer Form B

The resulting coefficient (of stability and equivalence) is usually lower than either a coefficient of equivalence or a coefficient of stability alone.

25
Q

Internal Consistency: Split-Half

A

Single test administration
Split the test in half and correlate the scores on the two halves, much like the test-retest procedure.
A difficulty with this is that there are many different ways of splitting the test in half.

26
Q

Spearman-Brown Prophecy Formula

A

The reliability of the test is low.

You need to add questions to increase the reliability.

The more items you have, the better chance you have of higher reliability.

You modify the number of items, and you use the formula to figure out how many items to add to make the test more reliable.

The more items, the more reliable the test. But beyond a certain point there is no benefit in adding more questions.

27
Q

Spearman-Brown Prophecy Formula

A

You have a very expensive test that you have designed, but its reliability is really low. Should you throw out the test? Or are there just not enough questions to make the test reliable? This formula helps you estimate how many more questions you would need to raise reliability to a point where the test becomes useful.
Rxx = k·rxx / (1 + (k − 1)·rxx)
Rxx = predicted reliability of the lengthened test
rxx = current reliability estimate (e.g., the odd-even correlation of scores)
k = the ratio of the desired test length to the current length
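The formula in code, with a worked example (the current reliability of .60 is a made-up illustration):

```python
def spearman_brown(r, k):
    """Predicted reliability when test length is multiplied by k:
    Rxx = k*r / (1 + (k - 1)*r), where r is the current reliability."""
    return (k * r) / (1 + (k - 1) * r)

# Doubling (k = 2) a test whose current reliability is .60:
print(round(spearman_brown(0.60, 2), 2))  # 0.75
```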

28
Q

Reliability Estimates Based on Item Covariances

A

Overcomes the difficulty with split-half correlations by, in effect, considering all possible split-half correlations
Kuder-Richardson reliability
For dichotomous data