Ch 4 - Reliability Flashcards

1
Q

Reliability

A

Is based on the consistency and precision of the results of the measurement process

= trustworthiness

2
Q

Measurement error

A

• Any fluctuation in scores that results from factors related to the measurement process that are irrelevant to what is being measured
*Measurements are ALWAYS subject to some fluctuation/error, but we want to limit it as much as possible

3
Q

True scores

A

the hypothetical entities that would result from error-free measurement (do not actually exist)
Not calculated the same way in individual and group scores

T is the value you would obtain if you were to administer the test to an individual an infinite number of times (without practice effect) and average all those scores

4
Q

Individual true score

A

average score in the hypothetical distribution that would result if the person took a test an infinite number of times

5
Q

Observed scores

A

the scores that individuals actually obtain when taking a test
Composed of:
True Score + Error Score

6
Q

Sample (or pop) variance is composed of…

A

true variance + error variance

The reliability of scores increases as the error component decreases

7
Q

How can we calculate a reliability coefficient (rxx) using the variance of a sample

A

rxx = true variance / total variance (the remainder, 1 - rxx, is the proportion of error variance)
If all the test score variance were true variance, score reliability would be perfect (1.0)
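
A minimal numeric sketch of this decomposition (the variance values are made up for illustration, not from the chapter):

```python
# Hypothetical variance components (illustrative values only)
true_variance = 80.0
error_variance = 20.0
total_variance = true_variance + error_variance  # observed-score variance

r_xx = true_variance / total_variance            # reliability coefficient
print(r_xx)                # 0.8 -> 80% of the score variance is true variance
print(round(1 - r_xx, 2))  # 0.2 -> the remaining 20% is error variance
```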

8
Q

Why is it an error to say that a test is reliable? Which factors can influence reliability?

A

It’s equivalent to saying that it will be reliable for every use, at every time, in every respect (which is not true)

This is why we consider that reliability is about scores, not about tests

Many factors can influence the reliability of a score
• Test taker (fatigue, unmotivated, mood, drugs, etc)
• Environment of test (room, temperature, noise, etc)
• Others

9
Q

Why is the reliability of scores variable and not fixed?

A

When score data are obtained from a large sample under standardized conditions, the resulting measurement errors are considered relatively small and tend to cancel each other out across individual scores (even then, reliability will still vary from sample to sample)

But the extent to which possible sources of error enter into any specific use of a test must be considered each time the test is used, because some factors may vary

*Judgements about what counts as a source of error must be made in relation to what the test is trying to assess - the same condition can be interpreted differently (ex: noise used intentionally to distract the test takers vs. noise that happens in the lab by mistake and distracts them) - so the sources of error will vary across different uses of the test

10
Q

3 sources of error in test scores

A
1. The context in which the testing takes place (administrator, test scorer, environment, etc)
2. The test taker (carelessness, etc) (can be difficult to eliminate)
3. Specific characteristics of the test itself

11
Q

Consistent error

A

Error that affects scores in the same way every time the measurement is made (for example, a scale that systematically weighs everyone 2 kilos too heavy)

Estimates of reliability may fail to detect this kind of error - which also affects the validity of the measurement

12
Q

Interscorer/Interrater Differences

A

Label assigned to the errors that may enter into scores when the element of subjectivity plays a role in scoring a test
Can happen even if:
• The scoring guidelines are clear and well-explained
• The scorers are conscientious in applying the guidelines
It does not imply carelessness from the scorers

Scorer Reliability

13
Q

Scorer reliability (AKA inter-rater reliability, AKA interscorer reliability)

A

Method for estimating error due to interscorer differences
Having at least 2 individuals score the same set of tests
The correlations between the sets of scores obtained are indications of scorer reliability
• Measures the degree to which score positions stay the same over 2 raters, NOT whether raters give the same score
High and positive = the error of scorer differences is <10%
• Symbol for inter-rater reliability - r (established by the prof, there is none specified in the book)
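
A minimal sketch of this computation (the rater scores below are made up for illustration; numpy is assumed to be available):

```python
import numpy as np

# Hypothetical scores assigned by two raters to the same 6 tests (made-up data)
rater_1 = np.array([12, 15, 9, 20, 17, 11])
rater_2 = np.array([13, 14, 10, 19, 18, 12])

# The Pearson correlation between the two sets of scores is the scorer reliability estimate
scorer_reliability = np.corrcoef(rater_1, rater_2)[0, 1]
print(round(scorer_reliability, 2))  # close to 1.0 -> the raters rank examinees very similarly
```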

14
Q

Time Sampling error

A

Refers to the variability inherent in test scores as a function of the fact that they are obtained at one point in time rather than at another
ANY construct/behaviour is subject to fluctuate from time to time
Some constructs/behaviours are less subject to change than others

In the realm of personality:
Traits - more enduring
States - fluctuating/temporary
Some cognitive abilities (like attention) may also be more vulnerable to change

Test-Retest Reliability

15
Q

Test-retest reliability

A

Giving the same test on 2 occasions to account for time sampling errors
• The correlation between the scores is the test-retest reliability, or stability, coefficient (rtt)
○ Index of how much scores are likely to fluctuate due to time sampling error
• The time interval between the administrations has to be specified too - no specific interval can be suggested because it can change based on various factors

* That interval should be selected with purpose - should be consistent with the theory and the intentions of the test (what it's supposed to measure)
* Attrition, practice, mood, etc could all influence the scores between time 1 and 2
16
Q

Content sampling error

A

Trait-irrelevant variability that can enter into test scores as a result of fortuitous factors related to the content of the specific items included in a test
• When the content of a test either favors or disadvantages some test takers, for reasons outside the test developer’s control

Ex: an exam covering only 2 of the 3 chapters - if some students focused on the chapter that is not on the exam, it is unfair to them

Alternate-Form Reliability
Split-Half Reliability

17
Q

Alternate-Form Reliability

A

Intended to estimate the amount of error in test scores that is attributable to content sampling error
• Two forms of the test (same purpose but different content) are administered to the same subjects
• Alternate-form reliability (r11) coefficients are then obtained (the Pearson correlation between the 2 scores that each examinee obtains)
• Chance/random factors are unlikely to affect participants in this case

In the book, the coefficient is designated by r11

18
Q

Split-Half Reliability - what does it estimate?

A

Administer the test to a group and create two scores by splitting the test in half

* Estimates content sampling error
* Interitem inconsistency - only up to a certain point, since it evaluates reliability between 2 halves of the same test, not between individual items

• This method is a way to estimate content sampling error in tests for which NO alternate form is available
	○ Which is true for most tests - few tests have alternate forms
19
Q

How can we split the test for split-half reliability? What does it depend on?

A

• How to split? Depends on whether:
○ There are systematic differences across test items
§ Ex: increasing difficulty, spiral omnibus format (items pertaining to certain variables alternate in the same order for the whole test)
○ Test performance depends primarily on speed
§ Ex: clerical tests where you need to find the mistake in a series of characters as fast as possible; the time limit is set so that most won’t finish the test

There are various ways to split the test in half (even-odd, first half vs. second half, quarters, etc)

20
Q

Spiral omnibus format

A

(items pertaining to certain variables alternate in the same order for the whole test)

21
Q

How is the split-half reliability coefficient calculated? What adjustment do we need to make to it, and why?

A

The split-half reliability coefficient (rhh) is calculated as the correlation between the 2 halves of the test

Then, the Spearman-Brown formula is applied to rhh to obtain an estimate for the full-length test, which will INCREASE the value of the coefficient (to account for both halves)
rS-B = (2rhh) / (1 + rhh)
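
A quick worked example of the correction (the half-test correlation is an illustrative value, not one from the chapter):

```python
# Hypothetical correlation between the two halves of a test
r_hh = 0.70

# Spearman-Brown correction to estimate reliability for the full-length test
r_sb = (2 * r_hh) / (1 + r_hh)
print(round(r_sb, 3))  # 0.824 -> higher than r_hh, reflecting the doubled length
```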

22
Q

Interitem inconsistency

A

Refers to error in scores that results from fluctuations in items across an entire test, as opposed to the content sampling error emanating from the particular configuration of items included in the test as a whole

• Can result from:
	○ Many sources possible:
	○ Content sampling errors
	○ Content Heterogeneity

• Statistically, it shows up as a low inter-item correlation - the degree to which responses to individual items maintain their positions across the whole item set - this approach works at the level of individual item scores, NOT at the level of scores for the whole test
• Ex: those who get item 17 correct also generally get item 92 correct (idem for those who fail)
23
Q

Content Heterogeneity

A

Results from the inclusion of items/sets of items that tap content knowledge or psychological functions that differ from those tapped by other items in the same test
Cannot be considered a source of error if the test was intended to be heterogeneous

Heterogeneity of item content across one scale within a particular test
• If responses across the individual item are not consistent, it’s hard to argue that they come from the same domain

24
Q

Internal Consistency Measures - why can’t we use the split half reliability coefficient to measure this?

A

Are statistical procedures designed to assess the extent of inconsistency across test items

• Split-half coefficients can do that to some extent, BUT a test can be divided in so many ways that the coefficients will vary each time

	○ Solution 1: an odd-even split
	○ Solution 2: formulas that take into account interitem correlation
		§ Kuder-Richardson formula 20 (K-R 20)
		§ Coefficient alpha (AKA Cronbach's alpha)
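
A minimal sketch of K-R 20 on made-up dichotomous (0/1) responses, using the formula as it is usually written in classical test theory; the data and variable names are purely illustrative:

```python
import numpy as np

# Hypothetical right/wrong responses: 5 examinees x 4 items (made-up data)
scores = np.array([
    [1, 1, 1, 0],
    [1, 0, 1, 1],
    [0, 0, 1, 0],
    [1, 1, 1, 1],
    [0, 1, 0, 0],
])

k = scores.shape[1]                   # number of items
p = scores.mean(axis=0)               # proportion passing each item
q = 1 - p                             # proportion failing each item
total_var = scores.sum(axis=1).var()  # variance of the total test scores

kr20 = (k / (k - 1)) * (1 - (p * q).sum() / total_var)
print(round(kr20, 3))  # about 0.52 for this tiny made-up data set
```
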
25
Q

Name the 2 factors that make the magnitude of the Kuder-Richardson formula 20 (K-R 20) and Coefficient alpha (AKA Cronbach's alpha) vary

A
• The number of items in the test
• The ratio of variability in test takers' performance across all the items in the test to total test score variance

Indeed, their magnitude will be higher as:
• The number of items increases
• The ratio of item score variance to total test score variance decreases

• BOTH formulas require only a single administration of the test to a group
26
Q

Conceptually, the estimates of reliability that the Kuder-Richardson formula 20 (K-R 20) and the Coefficient alpha (AKA Cronbach's alpha) produce are similar to…?

A

Both formulas produce estimates of reliability that are equivalent (conceptually) to the average of ALL the possible split-half reliability coefficients we could obtain if we split the test in all its possible ways

27
Q

What type of error do the Kuder-Richardson formula 20 (K-R 20) and Coefficient alpha (AKA Cronbach's alpha) represent?

A

• They represent an estimate of content sampling error and content heterogeneity

28
Q

What other techniques can be used to evaluate test homogeneity

A

Factor analytic techniques

29
Q

Time sampling and content sampling error combined can also be evaluated with..

A

Delayed Alternate-Form Reliability

30
Q

Delayed Alternate-Form Reliability

A

Good for estimating time sampling and content sampling error in a single coefficient
Can be calculated when 2+ alternate forms of the same test are administered on 2 different occasions (separated by a time interval), to 1+ groups of people
If the time interval is small: mostly assessing content sampling
If the interval is larger: assessing both content and time sampling

31
Q

Practice effects + myth about rtt

A

Practice effects: increase in performance due to repeated exposure to the test items (ex: taking the test twice)
• More significant with small intervals
• Procedures that require more than 1 trial are susceptible to practice effects
• Must be taken into account when relevant

Common myth about practice effects and the test-retest method: practice effects lower test-retest reliability coefficients
• Test-retest involves the EXACT SAME test being given twice - if the interval is short, there might very well be practice effects
• The myth is NOT true: if practice effects are constant (ex: everyone improves by approximately the same amount) throughout the sample, they will tend not to lower the rtt coefficient
○ BUT if practice effects are differential (vary), then yes it may lower rtt

32
Q

What are the 2 times at which test reliability matters most to a test user?

A

○ The stage of test selection

○ Test score interpretation

33
Q

Name the 4 steps to consider reliability in test selection

A
  1. Determine the potential sources of error that may affect the scores
  2. Examine the reliability data available on the instruments of choice, as well as the types of normative samples used for this data
  3. Evaluate the reliability data in light of other factors (time, cost, validity, etc)
  4. All other things being equal, choose the test that promises the most reliable scores

There are no fixed rules that apply when selecting a test - it ALWAYS depends on the circumstances

34
Q

Name other aspects relating to reliability that must be taken into account when choosing a test (4)

A
• Scoring involving subjective judgement (scorer reliability)
• Possible time sampling error and practice effects (when evaluating scores over time)
• High delayed alternate-form score reliability, when the test involves people being tested more than once
• The desire for homogeneity across the entire test (K-R 20 or alpha coefficient)

Which one matters most will depend on the intended use of the test

35
Q

Why should we look at the composition of the sample used to calculate the reliability of a test?

A

It must be kept in mind that the reliability coefficients shown in test manuals and the like apply ONLY to the samples used by the test authors
• Therefore, small differences in coefficients between different tests do not matter as much as other considerations
○ For tests intended to be used in individual assessment, the sample composition IS very important
○ Overall, the higher the coefficient the better (usually over .80 is best)

36
Q

What are the 2 purposes of reliability data once a test has been scored?

A
• Acknowledge/quantify the margin of error in obtained test scores
• Evaluate the statistical significance of differences between obtained scores to help determine the importance of those differences in terms of what the scores represent
37
Q

What is the SEM

A

The Standard Error of Measurement - represents the SD of the hypothetical distribution we would have if a respondent took the test an infinite number of times

(SEM cannot be taken as an indicator of reliability since it's a function of both the reliability coefficient and the SD of the test - it will be much larger for tests with larger SDs)
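
A small sketch of the usual classical-test-theory formula, SEM = SD x sqrt(1 - rxx); the SD and reliability values are illustrative:

```python
import math

sd_x = 15.0   # hypothetical test SD
r_xx = 0.90   # hypothetical score reliability

sem = sd_x * math.sqrt(1 - r_xx)
print(round(sem, 2))  # 4.74 -> a larger SD or lower reliability gives a larger SEM
```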

38
Q

To calculate the SEM, we need to find a “middle” value for the hypothetical distribution of scores. What are our 2 options?

A
• The obtained score (X0)
• An estimate of the person's true score (T), using this formula:
T' = rxx (X0 - M) + M
(M = mean of the test score distribution)
(T' = individual's estimated true score)
*If an obtained score is close to the mean, it is not useful to calculate the estimated true score (since it will be similar). We only calculate it when the score is far from the mean, to take regression toward the mean into account
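
A quick worked example of this estimate (often attributed to Kelley); the observed score, mean, and reliability below are made-up values:

```python
x_obs = 130.0  # hypothetical obtained score
mean = 100.0   # hypothetical test mean
r_xx = 0.90    # hypothetical score reliability

# Estimated true score: the obtained score is regressed toward the mean
t_est = r_xx * (x_obs - mean) + mean
print(t_est)  # 127.0 -> pulled toward the mean; the pull grows as r_xx decreases
```
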
39
Q

What are the 2 main comparisons we want to make with test scores?

A

Assessment using tests usually asks for the comparison:
• Between 2 or + scores obtained by the same person
• Between the scores obtained by 2 or + people

40
Q

What is SEdiff? What is it used for?

A

the Standard error of the Difference Between Scores

can be used to determine the likelihood that the obtained differences between scores (and what they represent) could have been due to chance

41
Q

What are the 2 formulas we can use to calculate SEdiff?

A

1 - Using the SD of tests 1 and 2 (they need to be the same) and the reliability coefficients of both tests: SEdiff = SD x sqrt(2 - r11 - r22)

2 - If the SDs of the tests are not the same, we can calculate the SEM of each test and then combine them: SEdiff = sqrt(SEM1^2 + SEM2^2)

42
Q

Once we have the SEdiff, what do we do to estimate the likelihood that the obtained differences between scores (and what they represent) could have been due to chance?

A

We need to divide the difference between the two scores by the SEdiff, which gives us a Z score. We then determine the area under the curve for that Z score, which is the probability that the obtained difference between scores could have been due to chance.
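
A minimal worked example of these steps (SEdiff here is built from two hypothetical SEMs as sqrt(SEM1^2 + SEM2^2); a two-tailed area is shown, but whether one or two tails is appropriate depends on the question asked):

```python
import math
from statistics import NormalDist

# Hypothetical values: two scores from the same person and each test's SEM
score_1, score_2 = 112.0, 100.0
sem_1, sem_2 = 4.0, 5.0

se_diff = math.sqrt(sem_1**2 + sem_2**2)   # about 6.40
z = (score_1 - score_2) / se_diff          # about 1.87

# Probability that a difference this large could arise by chance alone
p_chance = 2 * (1 - NormalDist().cdf(abs(z)))
print(round(z, 2), round(p_chance, 3))  # roughly 1.87 and 0.06
```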

43
Q

2 reasons why CI around scores are important?

A
• Confidence intervals for obtained scores remind us that test scores are not as precise as their numerical nature would suggest - we need to take that into account when making decisions
• Confidence intervals prevent us from attaching undue meaning to score differences that may be insignificant in light of measurement error
44
Q

What are the 2 things we need in order to have content validity?

A
• Items that sample the whole domain of interest
• No items that sample other domains

45
Q

According to Streiner, what is a true score composed of?

A

True score = construct of interest + systematic error (AND random error is added to make the observed score)

46
Q

What are the 2 problems with split half reliability, according to Streiner? What are the solutions?

A
• The reliability of a scale is proportional to its length (problematic with split-half, which halves the test) - solved by using S-B
• There are a lot of ways to split a test - K-R 20 solves this (BUT only for dichotomous items)

Cronbach's alpha solves the fact that K-R 20 is only for dichotomous items

47
Q

In simple terms, what do the coefficients of KR-20 and alpha represent?

A

The mean of all possible split-half reliabilities

KR-20: only for dichotomous items (true/false answers)
alpha: for any response format

48
Q

What did Cortina demonstrate about alpha?

A

Cortina demonstrated that even with items that do not come from the same dimension (aka construct), alpha can be quite high if there are enough items

Alpha is greatly influenced by the length of the test - therefore a high value of alpha is an indication of high internal consistency, but it’s not a guarantee of it

49
Q

In which cases can we reasonably expect alpha to be “not that high”?

A

Some constructs (like anxiety) are not homogeneous in and of themselves; therefore we can’t expect alpha to be high for tests measuring those constructs

50
Q

In which situations should alpha NOT be used?

A
• Tests that measure how many items one can answer in a given amount of time (speeded tests)
• Tests where the items are in order of difficulty
• Tests where the answer to 1 item depends on the answer to a previous one
• Tests where the scale is multifaceted (measures more than one construct)
51
Q

What is the JARS-Quant? Give examples

A

The APA's Journal Article Reporting Standards for quantitative studies (JARS-Quant)
Guidelines for reporting results - established by a task force

Best practice for reporting: report values of reliability coefficients for the scores actually analyzed in that sample
• Authors are welcome to report reliability coefficients from other sources, but they are obliged to report them for their own data
The right kind of score reliability coefficient should be reported

52
Q

Reliability is to ______ as validity is to _______

A

Precision, accuracy

Reliability: hits the target consistently
Validity: hits the right spot on the target

Remember: precise does not guarantee accurate
Reliable does not mean valid
ALSO: accuracy requires precision, validity requires reliability
Not precise, can’t be accurate

Reliability is a prerequisite for validity, but it doesn't guarantee it

53
Q

Exact definition of reliability

A

Scores from the same cases remain stable - they maintain their position over variations in time, over forms of the test, over raters or scorers (where scoring is not objective), and over sets of items drawn from the same domain

If a reliability coefficient is 1.0, it means that the scores have perfectly maintained their position
If rxx is 0, it means that there is no preservation of position whatsoever - the scores are completely random
Random scores measure nothing - this is the case when their coefficient is close to 0

54
Q

Why do we say that reliability is a property of scores and not of tests? - Related, what is blackbox mentality?

A

If the same test were administered in a different sample, the reliability of those scores would not be the same
• Values of reliability coefficients vary over samples, because samples have different scores - this is why researchers should always report reliability for their own data
Reliability is NOT an unchanging property of a test

Blackbox mentality: thinking that once the reliability is established in the normative sample, we can apply it to any sample with the same reliability

55
Q

Sampling error

A

Sampling errors are statistical errors that arise when a sample does not represent the whole population

56
Q

What are the expected values of a reliability coefficient?

A

Around 0.7-0.8 is ok, above .9 is best
Those guidelines do not apply to the alpha coefficient

Generally speaking, if the value of rxx is less than 0.5, most of the observed variation in the scores is due to random error - and randomness measures nothing

57
Q

In a validity perspective, the true score would correspond to the exact construct we are trying to measure in that person, free from error. In a reliability perspective, what does the true score represent?

A

score that is free from error, regardless if it measures the intended construct

58
Q

2 ways to find the true score

A
1. T is the value you would obtain if you were to administer the test to an individual an infinite number of times (without practice effects) and average all those scores
2. In a population, find all the individuals who have the same true score and record each person's x score. The distribution of those x scores around the t score would be a normal distribution
59
Q

What is the E component, and what is its value?

A

Random Error Component: The value of e, on average in the pop, is 0
• Some values of e will be positive or negative, but on average over all cases of the pop it will be 0
• For individual scores, it can be ANYTHING. You can never know its value for an individual score

If e > 0 (a positive value), then X > t because of random measurement error
If e < 0 (a negative value), then X < t because of the random error term
If e = 0, then X = t because there is no error accompanying the true score

60
Q

Why would we want to calculate a CI around the observed score? What is the link between the width of the CI and rxx?

A

to account for the error component

It’s best practice to evaluate that CI, especially in high-stake testing
As rxx increases, confidence intervals get narrower because there is less margin of error
The opposite is true

61
Q

In terms of the variation in test scores, what indication does the rxx coefficient give us about sources of variation?

A

The rxx estimates what proportion of the total observed variation in x is systematic (i.e., seems to be due to the true component of the scores)

rxx = s^2t / s^2x

• If the value of rxx is 0.8, it means that 80% of total variability seems to be systematic
62
Q

In terms of the variation in test scores, what indication does the 1 - rxx value give us about sources of variation?

A

This is estimating the proportion of total observed variation (s^2x) attributable to random error (s^2e)

• If rxx is 0.8, then we can say that 0.2 (20%) is due to random error
	○ Means that 80% of observed variability is systematic
63
Q

What does the square root of rxx represent?

A

Taking the square root of an rxx approximates the correlation between the x (AKA t + e) and the true (t) part of x

* When a score reliability coefficient is 0.8
* The square root of 0.8 ≈ 0.89, meaning the observed scores correlate with their true component at about 0.89
64
Q

What type of measurement error does rxx estimate?

A

An rxx estimates only time sampling error

Other kinds of measurement error like scoring error or item sampling error also exist, but are not measured by rxx

65
Q

Rater drift

A

inter-rater reliability tends to go down over time unless raters are re-trained about scoring methods
• If they are not, they get into their old habits
• In studies based on repeated measures / longitudinal, addressing rater drift is important

66
Q

Features of the parallel forms used in alternate-form reliability

A
• Items from the same domain
• At least 2 forms of the same length, difficulty (if there is a skill component), and domain
• Both forms would be co-normed (based on the same normative sample)
67
Q

What are the 2 versions of the alternate form reliability method? What type of error is each version prone to?

A
1. Immediate alternate-form reliability (most common and most straightforward to interpret)
○ Administered within the same sample, on the same occasion or the same day (not more than 1 day apart; the timeframe is essentially immediate)
○ Scores retain the same place between the 2 forms
○ Items are from the same domain, but are not exactly the same
○ Content sampling error probable
2. Delayed administration
○ There is an important delay between the administrations of the 2 forms (but STILL within the same sample)
○ Over the 2 occasions, there is variation in both time and test content
○ Content sampling and time sampling error probable

For example - immediate administration
• r11 = 0.85 (15% of the observed variability is attributable to content sampling error)
• (since 1 - 0.85 = 0.15)

For example - delayed administration
• r11 = 0.2 (80% of the observed variability is attributable to content sampling error AND time sampling error together)
• (since 1 - 0.2 = 0.8)
• For that reason, the immediate administration is preferred because there is only 1 error component

68
Q

Ideal types of test for the split-half reliability method

A

• The items are presented in order of difficulty (first are easy, last are hard)
• Stopping rule/criteria: if an examinee fails a certain number of items in a row, then the administration is stopped
○ Not all examinees are going to be administered ALL the items in the test (those less skilled will meet the stopping rule faster)
If a test has these characteristics, the split-half technique is well suited (internal consistency methods such as alpha require that all items be given to all examinees, but split-half does not)

69
Q

What is the KEY differentiating factor between the split-half reliability coefficient and alpha?

A

The KEY differentiator between alpha and split-half: alpha requires that all participants are given all items

70
Q

What does the internal consistency method measure?

A

The Internal consistency reliability coefficient estimates:
• Interitem inconsistency
• Content heterogeneity
○ Content heterogeneity can explain interitem inconsistency (AKA the possibility that the items on a test don’t come from a single domain but maybe from 2+ domains)
• The alpha coefficient also measures a third component, that we will add later

71
Q

How do we calculate the internal consistency coefficient

A

In the internal consistency method, 1 test (a single version) is administered on one occasion to one sample
• The responses/scores for each individual item on the test are recorded
Conceptually, it's like splitting the test up into n items (however many items there are)
• Each item is treated as a mini-test of its own and analyzed as such
When analyzing the items, Pearson correlations among all pairs of items are computed
• Average interitem correlation (r̄jj) (part of what determines the alpha coefficient)

Correlations are computed between the items (ex: scores on item 1 with scores on item 2)
• The average of ALL these Pearson correlations = the average interitem correlation (VERY tedious to calculate by hand)
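
A minimal sketch, with made-up item scores, of how the average inter-item correlation and the alpha coefficient could be computed (numpy assumed; the data are purely illustrative):

```python
import numpy as np

# Hypothetical item scores: 6 examinees x 4 items (made-up Likert-type data)
items = np.array([
    [3, 4, 3, 5],
    [2, 2, 3, 2],
    [4, 5, 4, 4],
    [1, 2, 1, 2],
    [5, 4, 5, 5],
    [3, 3, 2, 3],
])

n_items = items.shape[1]

# Average inter-item Pearson correlation: mean of the off-diagonal correlations
corr = np.corrcoef(items, rowvar=False)
avg_r = corr[np.triu_indices(n_items, k=1)].mean()

# Cronbach's alpha from item variances and total-score variance
item_vars = items.var(axis=0, ddof=1)
total_var = items.sum(axis=1).var(ddof=1)
alpha = (n_items / (n_items - 1)) * (1 - item_vars.sum() / total_var)

print(round(avg_r, 2), round(alpha, 2))
```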

72
Q

How can we interpret the r̄jj coefficient when it's positive, 0, or negative?

A

If r̄jj is positive (above 0) - though in practice it will never be 1
• Means that, overall, higher scores on 1 item predict higher scores on another item
• Scores maintain their positions
• There is consistency in responding over the set of items
If r̄jj is equal to 0
• Means that higher scores on 1 item do not predict higher scores on another item
• There is no consistency in responses over the items
• The value of alpha would also be 0, meaning the same thing (no predictability)
If r̄jj is negative
• Means that higher scores on 1 item predict lower scores on another item
• The scores are reversing - they are switching positions
• There is something wrong with the items!
The alpha coefficient will be negative, BUT that's an invalid result (it's impossible to have a proportion of variance that is negative)

73
Q

Conceptually, the alpha coefficient represents

A

• N x r̄jj (conceptually)
Test length x average interitem consistency

From a single value of the alpha coefficient, it's impossible to separate the 2 (it can be done with other techniques that we will not learn in undergrad)
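
One way to make this concrete is the standardized alpha formula, which is built only from the number of items and the average inter-item correlation (the values below are illustrative; treating this formula as what the "test length x consistency" idea boils down to is an assumption, not something stated in the chapter):

```python
n_items = 20   # hypothetical number of items
avg_r = 0.25   # hypothetical average inter-item correlation

# Standardized alpha: test length and average consistency jointly drive the value
std_alpha = (n_items * avg_r) / (1 + (n_items - 1) * avg_r)
print(round(std_alpha, 2))  # 0.87 -> even a modest avg_r yields a high alpha on a long test
```

This is also why a single alpha value cannot tell you whether it reflects strong inter-item consistency or simply a long test.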

74
Q

What is the main difference between the KR-20 and the alpha coefficient?

A

The formula for Cronbach's alpha is general and can accommodate any response format
The formula for KR20 is only for items with a binary response format (true/false)

75
Q

If the alpha coefficient = 0.70, how do we interpret it?

A
• Then 1 - alpha = 0.30
• Meaning that 30% of the variation in scores is due to the combination (the product) of inconsistency and test length
• Inconsistency x test length
• NOT just inconsistency

We can't say that alpha = 0.7 is "just enough" because alpha also reflects test length
• Alpha is not adjusted for test length
• Therefore there is no universal value of alpha that is good enough for all tests (the value will differ depending on test length)
In the same way, we can't say that alpha = 0.9 is "perfect"; it also depends on the test

Example: a test includes several subtests evaluating different skills
• If we compute an alpha coefficient (knowing in advance that the items are heterogeneous), it won't be a surprise that the coefficient is low
• We could instead calculate alpha JUST for the items from the same domain, which would increase alpha
• This is why the value of alpha must be interpreted differently depending on the test on which it's used

76
Q

What is the influence of item wording on the alpha coefficient, and how can we address this problem?

A

Having a set of items that vary in wording (some items are positively worded and others are negatively worded) - you should not calculate the alpha coefficient until this feature is taken care of

• This discrepancy will affect the alpha coefficient, because the same response value (ex: a 0 or a 2) does not mean the same thing across positively and negatively worded items (ex: in terms of agreement and feelings about one's health)

Reverse coding: addresses this problem

77
Q

Aside from adjusting the rhh coefficient, how can we use the SB formula?

A

Example: there is a test with n items
• We want to estimate the effect of changing the test length on the reliability of that test
• The current score reliability of the test (ex: 20 items) is already known (rxx)
○ Adding items will increase rxx
○ Removing items will decrease it
○ rS-B is the estimated new rxx once the length of the test is modified (see the sketch after the next card)

• *Note* We are using this formula to ESTIMATE rxx - it's ONLY an estimate and nothing more
78
Q

2 things that determine the predicted score reliability coefficient (RS-B)

A

• rxx (value of the score reliability coefficient for the test in its current form)
• k: the factor by which the test length will be changed (theoretically)
○ NOT a number of items to be added to/deleted from the test - it's a factor

k is calculated by dividing: the number of items in the proposed new version of the test / the number of items in the current version of the test
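
A quick worked sketch of the general Spearman-Brown prophecy formula with k defined this way (the reliability and item counts are illustrative values):

```python
r_xx = 0.75      # hypothetical reliability of the current 20-item version
k = 40 / 20      # factor of length change: proposed items / current items = 2.0

# Predicted reliability after the change in length
r_sb = (k * r_xx) / (1 + (k - 1) * r_xx)
print(round(r_sb, 3))  # 0.857 -> doubling the length is predicted to raise reliability
```

Note that with k = 2 this reduces to the split-half version of the formula from the earlier card.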

79
Q

What are the assumptions of S-B when we use it to estimate the reliability of adding/removing test items?

A
• All the psychometric characteristics of the test and its items stay the same after we change the number of items (may not be true in real life)
• When deleting items from a test, it's irrelevant WHICH items are deleted (since all items are assumed to have the same psychometric characteristics)
• In the case where k > 1, the formula assumes that any items added have the same psychometric characteristics as those already in the test

In reality:
• You may not get the predicted reliability coefficient if you add bad items (ones without the same psychometric characteristics) to your test

80
Q

What are the 2 ways to create a sampling distribution of x scores, which would have the true score as its mean?

A
1. Retest to infinity (we already talked about this perspective: if we administer the same test repeatedly, the observed scores will not be the same, but their average will equal the true score)
2. Take everyone in the same population who has the same true score (NOT observed score) and record, for all those individuals, their observed scores (x) on the test. The x scores will have an error component (e):
   If e is positive, x is bigger than t; it's smaller if e is negative
   All the x scores cluster around the true score
   If you average all the x scores, you get the true score
81
Q

What is the St DEV of the sampling distribution of x scores? What does a large St DEV mean?

A

The standard deviation of this sampling distribution is the Standard Error of Measurement (SEM)
The larger the SEM, the greater the difference between the observed scores and the true score
Smaller SEM = smaller difference between the x and t scores

SDe = St deviation for the error component (AKA SEM - it’s the same thing)
• As it goes up, there is more and more error (difference between the x and t scores)

Go check formula for SEM

82
Q

If rxx = 1, then SEM=?

A

0

If the reliability is perfect, then x=t (the observed score is the true score and there is no distribution of the error component)

83
Q

If rxx=0, then SEM=?

A

SEM = SDt (the standard deviation of the total observed scores on the test)
• When score reliability is 0, then x scores (observed) are only random numbers and mean nothing (their variation is 100% due to error) - this also never happens in real life

84
Q

Difference between SDt and SEM (SDe)

A

SDt: standard deviation for total scores on the test (observed scores)
SDe (SEM): St deviation for the error component only

85
Q

To create a confidence interval around an x score, which type of standard deviation do we use?

A

The SEM; we want an error margin around the observed score that reflects only measurement error, while "ignoring" the part of the variation that is systematic/true (AKA the other part of SDt)

86
Q

How do we calculate a CI around a x score?

A

X +/- (z x SEM), where z is the critical value for the chosen confidence level (ex: 1.96 for a 95% interval)
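
A small worked example; the SEM of 6.71 is an assumed value chosen so that the numbers reproduce the interval used in the next card:

```python
x_obs = 92.0   # hypothetical obtained score
sem = 6.71     # assumed SEM for this test
z_95 = 1.96    # critical z value for a 95% interval

lower = x_obs - z_95 * sem
upper = x_obs + z_95 * sem
print(round(lower, 2), round(upper, 2))  # 78.85 and 105.15
```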

87
Q

If the interval of a score is [78.85, 105.15], what are the correct and incorrect ways to interpret it?

A

CORRECT:
• This means that the TRUE SCORE underlying this OBSERVED score may lie anywhere between 78.85 and 105.15
• If the person was to be re-tested, the range of their possible observed scores is in this interval
95% of the CIs constructed around the x scores in the population will include the true score, BUT 5% will not

INCORRECT:

"the interval from 79 to 105 has a 95% chance of containing the true score for this examinee" - FALSE
	Why is it false?
		1 - This interval either contains the true score of that person (with observed score of 92) or it does not - it's a hit or a miss - there is NO way to know which it is for an individual score, because you never know the value of the true component or of the error component for a single case
		2 - There is no way to know whether this particular interval contains the true score or not
88
Q

What is SEdiff? Additive error?

A

Standard error of the difference between 2 test scores for the same case
• We have 2 scores (can be from the same test, different parts of it, or from different tests, from the same person, etc)
• Issue: if we want to interpret the meaning of both test scores and of the difference between them, the error in each score complicates that interpretation

When interpreting the difference between 2 test scores (x1 and x2), we need to understand that the error in both those scores is combined to contribute to the error component in the difference score
• The difference score is less precise than either of the 2 scores taken separately because of the increased error - called additive error

89
Q

How can we calculate the SEdiff? What can we expect in terms of results?

A

If the SDs of the two tests are different: SEdiff = sqrt(SEM1^2 + SEM2^2) - calculate the SEM of each test, square them, sum, and take the square root

If the SDs of the two tests are the same: SEdiff = SD x sqrt(2 - r11 - r22), using the reliability estimates of the scores on both tests (only mentioned in the book)

The SEdiff is higher than BOTH the tests’ SEMs individually