W2 - Chapter 5 - Reliability (DN) Flashcards

1
Q

alternate forms

A
  • are simply DIFFERENT VERSIONS of a TEST that have been constructed to be as similar as possible to the original
    e. g., hard copy - online - oral etc.
  • a measure of reliability across time
  • does not have same mean & variance as original test so not as good as parallel forms
    p. 151
2
Q

alternate-forms reliability

A
  • an estimate of the extent to which the ALTERNATE (different) FORMS of a test have been affected by ITEM SAMPLING ERROR, or OTHER ERROR
  • a degree of a test’s reliability across time
    p. 151-152, 161
3
Q

average proportional distance (APD)

A

a measure used to evaluate the INTERNAL CONSISTENCY of a test

  • focuses on the DEGREE of DIFFERENCE that exists between ITEM SCORES
  • typically calculated for a GROUP of TESTTAKERS
    p. 157-158
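As a rough illustration of the APD idea, the sketch below follows one common description of the calculation: take the absolute difference between every pair of item scores, average those differences, then divide by the number of response options minus one. The function name, the 7-point scale, and the sample scores are assumptions for illustration, not taken from the text.

```python
from itertools import combinations

def average_proportional_distance(item_scores, n_options):
    """APD for one testtaker's item scores (hypothetical helper)."""
    # Absolute difference between every pair of item scores
    pairs = list(combinations(item_scores, 2))
    mean_abs_diff = sum(abs(a - b) for a, b in pairs) / len(pairs)
    # Scale by the largest possible distance on the response scale
    return mean_abs_diff / (n_options - 1)

# Hypothetical responses to three items on a 7-point scale
apd = average_proportional_distance([4, 5, 6], n_options=7)
print(round(apd, 3))
```

For a group of testtakers, one simple option (an assumption here) is to average the per-testtaker values; smaller values suggest more internally consistent items.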
4
Q

classical test theory (CTT)

A
  • also known as ‘true score theory’ & ‘true score model’
  • system of assumptions about measurement
  • the composition of a TEST SCORE is made up of a relatively stable component which is what the test/individual item is designed to measure PLUS a component that is ERROR.
    p. 123 (164-166, 280-281)
5
Q

coefficient α (alpha)

A
  • developed by Cronbach (1951); elaborated on by others.
  • also referred to as CRONBACH’S ALPHA and ALPHA
  • a statistic widely employed in TEST CONSTRUCTION
  • the preferred statistic for obtaining INTERNAL CONSISTENCY RELIABILITY
  • only requires ONE administration of the test
  • assists in deriving an ESTIMATE of RELIABILITY; more technically, it can be thought of as the MEAN of ALL POSSIBLE SPLIT-HALF CORRELATIONS, corrected by the SPEARMAN-BROWN FORMULA
  • suitable for use on tests with NON-DICHOTOMOUS ITEMS
  • unlike Pearson r (-1 to +1), COEFFICIENT ALPHA typically ranges from 0 to 1 because it is used to gauge the SIMILARITY of data sets:
    0 = absolutely NO SIMILARITY
    1 = PERFECTLY IDENTICAL
    p.157
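A minimal sketch of how coefficient alpha might be computed from a single administration, assuming the standard formula alpha = (k / (k - 1)) * (1 - sum of item variances / variance of total scores). The function name and the sample ratings are hypothetical.

```python
from statistics import pvariance

def coefficient_alpha(scores):
    """scores: one row per testtaker, one column per item."""
    k = len(scores[0])                       # number of items
    items = list(zip(*scores))               # regroup scores by item
    item_var_sum = sum(pvariance(col) for col in items)
    total_var = pvariance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - item_var_sum / total_var)

# Hypothetical data: 4 testtakers x 3 non-dichotomous items (1-5 ratings)
data = [
    [2, 3, 3],
    [4, 4, 5],
    [1, 2, 2],
    [5, 4, 5],
]
alpha = coefficient_alpha(data)
print(round(alpha, 3))
```

Population variance is used for both the items and the totals; using sample variance for both would give the same ratio.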
6
Q

coefficient of equivalence

A

the estimate of the degree of relationship that exists BETWEEN various FORMS of a TEST

  • can be evaluated with an alternate-forms or parallel-forms COEFFICIENT OF RELIABILITY (both are often termed the COEFFICIENT OF EQUIVALENCE)
    p.151
7
Q

coefficient of generalisability

A

represents an estimate of the INFLUENCE of particular FACETS on the test score

e. g., - Is the score affected by group as opposed to one on one administration? or
- Is the score affected by the time of day the test is administered?
p. 168

8
Q

coefficient of inter-scorer reliability

A

the estimate of the degree of CONSISTENCY AMONG SCORERS in the scoring of a test

  • this is the COEFFICIENT of CORRELATION for inter-scorer consistency (reliability)
    p. 159
9
Q

coefficient of stability

A

the estimate of a test-retest reliability taken when the interval between tests is GREATER than SIX MONTHS

  • this is a significant estimate as the passage of time can be a source of ERROR VARIANCE i.e., the more time passed, the greater likelihood of a lower reliability coefficient p.151
10
Q

confidence interval

A

a RANGE or BAND of test scores that is likely to contain the ‘TRUE SCORE’
p.177
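One way to picture the band: the sketch below builds an interval around an observed score from a standard error of measurement, assuming normally distributed error and a z value of 1.96 for roughly 95% confidence. The function name and the numbers are illustrative assumptions.

```python
def confidence_interval(observed, sem, z=1.96):
    """Band around an observed score: observed +/- z * SEM."""
    margin = z * sem
    return observed - margin, observed + margin

# Hypothetical observed score of 50 with an SEM of 4
low, high = confidence_interval(observed=50, sem=4)
print(round(low, 2), round(high, 2))
```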

11
Q

content sampling

A
  • the VARIETY of SUBJECT MATTER contained in the test ITEMS.
  • one source of variance in the measurement process is the VARIATION among items WITHIN a test or BETWEEN tests
    i. e., the way in which a test is CONSTRUCTED is a source of ERROR VARIANCE
  • also referred to as ITEM SAMPLING p.147
12
Q

criterion-referenced test

A
  • way of DERIVING MEANING from test scores by evaluating an individual’s score with reference to a SET STANDARD (CRITERION)
  • also referred to as “domain-referenced testing” & “content-referenced testing and assessment”

DISTINCTION:
CONTENT-REFERENCED interpretations are those where the score is directly interpreted in terms of performance AT EACH POINT on the achievement continuum being measured
- while CRITERION-REFERENCED interpretations are those where the score is DIRECTLY INTERPRETED in terms of performance at ANY GIVEN POINT on the continuum of an EXTERNAL VARIABLE.
p.139-141 (163-164, 243)

13
Q

decision study

A
  • conducted on the conclusion of a generalizability study
  • designed to EXPLORE the UTILITY & VALUE of TEST SCORES in making DECISIONS.
    p. 168
14
Q

dichotomous test item

A
  • a TEST ITEM or QUESTION that can be answered with ONLY one of two responses e.g., true/false or yes/no
    p. 169
15
Q

discrimination

A
  • In IRT
  • the DEGREE to which an ITEM DIFFERENTIATES among people with HIGHER or LOWER levels of the TRAIT, ABILITY or whatever is being measured by a test
    p. 169
16
Q

domain sampling theory

A
  • while Classical Test Theory seeks to estimate the proportion of a test score due to ERROR
  • Domain Sampling Theory seeks to estimate the proportion of a test score that is due to specific sources of variation under defined conditions (i.e., context/domain)
  • in DST, the test’s RELIABILITY is looked upon as an OBJECTIVE MEASURE of how precisely the test score assesses the DOMAIN from which the test DRAWS a SAMPLE
  • of the three TYPES of ESTIMATES of RELIABILITY; measures of INTERNAL CONSISTENCY are the most compatible with DST
    p. 166 & 167
17
Q

dynamic characteristic

A
  • a TRAIT, STATE, or ABILITY presumed to be EVER-CHANGING as a function of SITUATIONAL and COGNITIVE EXPERIENCES; contrast with static characteristic
    p. 162
18
Q

error variance

A

error from IRRELEVANT, RANDOM sources - ERROR VARIANCE plus TRUE VARIANCE = TOTAL VARIANCE p.126,146

19
Q

estimate of inter-item consistency

A
  • the degree of correlation among ALL items on a scale
  • the CONSISTENCY or HOMOGENEITY of ALL items on a test
  • estimated by techniques such as the SPLIT-HALF RELIABILITY method
  • p.152 - 154
20
Q

facet

A
  • include things like the number of items on a test, the amount of training the test scorers have had & the purpose of the test administration
    p. 167
21
Q

generalizability study

A
  • examines how GENERALIZABLE SCORES from a PARTICULAR test are if the test is administered in DIFFERENT SITUATIONS i.e., it examines how much of an IMPACT DIFFERENT FACETS of the UNIVERSE have on a test score p.167, 168
22
Q

generalizability theory

A
  • based on the idea that a person’s test scores VARY from testing to testing because of variables in the TESTING SITUATION
  • test score in its context - DN
  • encourages test users to describe details of a particular test situation or (UNIVERSE) leading to a particular test score
  • a ‘UNIVERSE SCORE’ replaces a ‘TRUE SCORE’
  • Cronbach (1970) & colleagues
    p. 167
23
Q

heterogeneity

A

the degree to which a test measures DIFFERENT FACTORS, i.e., the test contains items that measure MORE THAN ONE TRAIT (FACTOR) (also NONHOMOGENEOUS) p.154

24
Q

homogeneity

A
  • When a test contains ITEMS that MEASURE a SINGLE TRAIT i.e., the DEGREE to which a test measures a SINGLE FACTOR - i.e., the extent to which items in a scale are UNIFACTORIAL
  • the more HOMOGENEOUS a test, the more INTER-ITEM CONSISTENCY
  • it is expected to have higher Internal Consistency than a HETEROGENEOUS TEST
  • homogeneity is desirable as it provides straightforward INTERPRETATION (i.e., similar scores = similar abilities on the variable of interest)
    p. 154-155
25
inflation of range/variance
  • SAMPLING PROCEDURES may affect the variance of either variable in a correlation analysis
  • if the variance of EITHER variable is INFLATED by the sampling procedure, the resulting CORRELATION COEFFICIENT tends to be HIGHER (i.e., giving a false indication of correlation)
    (thought to self - is this also a validity issue, e.g., a false positive?)
  • the converse is RESTRICTION OF RANGE/VARIANCE: if the variance of EITHER variable is RESTRICTED by the sampling procedure, the resulting CORRELATION COEFFICIENT tends to be LOWER (i.e., masking the true correlation)
    (thought to self - is this also a validity issue, e.g., failing to detect - a miss?)
    p.162
26
information function
- an IRT TOOL - helps test users to determine the RANGE OVER THETA for which an item is most useful in DISCRIMINATING among groups of testtakers p. 171
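To make the idea of "range over theta" concrete, here is a small sketch that assumes a two-parameter logistic (2PL) item, for which item information is a^2 * P * (1 - P). The card does not name a specific model, so the 2PL choice, the function names, and the parameter values are all assumptions.

```python
import math

def p_correct(theta, a, b):
    """2PL probability of the keyed response (assumed model)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """Item information for a 2PL item: a^2 * P * (1 - P)."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)

# Hypothetical item: discrimination a = 2.0, difficulty b = 0.0
thetas = [-3, -2, -1, 0, 1, 2, 3]
info = [item_information(t, a=2.0, b=0.0) for t in thetas]
peak_theta = thetas[info.index(max(info))]
print(peak_theta, [round(i, 3) for i in info])
```

Under these assumptions the item is most informative near its difficulty and tells us little about testtakers far above or below it.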
27
inter-item consistency
- the CONSISTENCY or HOMOGENEITY of ALL items on a test - ESTIMATED by techniques such as the SPLIT-HALF RELIABILITY method - the DEGREE of CORRELATION among ALL ITEMS on a scale - p.154
28
internal consistency estimate of reliability
an ESTIMATE of the RELIABILITY of a test - obtained from a MEASURE of INTER-ITEM CONSISTENCY p.152
29
inter-scorer reliability
- An ESTIMATE of the DEGREE of agreement or CONSISTENCY between TWO or more SCORERS on a test. - also referred to as INTER-RATER reliability; OBSERVER reliability; JUDGE reliability; SCORER reliability. - p.159, 161
30
item characteristic curve (ICC)
- graphic representation of the PROBABILISTIC RELATIONSHIP between a person's LEVEL of TRAIT (ability, characteristic) being measured and the PROBABILITY for responding to an item in a PREDICTED way; - also known as a CATEGORY RESPONSE CURVE, or, an ITEM TRACE LINE p. 177, 281
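A minimal sketch of what an ICC traces out, assuming a logistic curve with a discrimination parameter `a` and a difficulty parameter `b`. The card does not commit to a particular model, so the functional form, names, and values here are illustrative assumptions.

```python
import math

def icc(theta, a=1.0, b=0.0):
    """Probability of the predicted response at trait level theta (logistic form assumed)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# Hypothetical item with a = 1.5 and b = 0.5, evaluated at a few trait levels
levels = [-2, -1, 0, 0.5, 1, 2]
curve = [(theta, round(icc(theta, a=1.5, b=0.5), 3)) for theta in levels]
for theta, prob in curve:
    print(theta, prob)
```

Plotting probability against theta would give the S-shaped curve; the probability is 0.5 where theta equals `b`.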
31
item response theory (IRT)
- another alternative to the true score model - a family of theories/methods (well over 100 varieties of IRT models) - each model is designed to HANDLE data with CERTAIN ASSUMPTIONS - a way of modelling (predicting?) the PROBABILITY that a person with X ability will be able to perform at a LEVEL OF Y - also referred to as LATENT-TRAIT MODEL p. 166, 168-173
32
item sampling
- one source of VARIANCE in the measurement process is the VARIATION among items WITHIN a test, or BETWEEN tests i.e., the way in which a test is CONSTRUCTED is a source of ERROR VARIANCE - also CONTENT SAMPLING p. 147
33
Kuder-Richardson formula 20 (KR-20)
one of a series of EQUATIONS developed by G. F. Kuder & M. W. Richardson - designed to ESTIMATE the INTER-ITEM CONSISTENCY of tests - only appropriate for use on tests with DICHOTOMOUS ITEMS (e.g., true/false) p. 155-156, 163
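A short sketch of KR-20 for dichotomous (0/1) items, assuming the usual form KR-20 = (k / (k - 1)) * (1 - sum(p * q) / variance of total scores), where p is the proportion passing an item and q = 1 - p. The function name and data are hypothetical.

```python
from statistics import pvariance

def kr20(scores):
    """scores: rows of 0/1 item scores, one row per testtaker."""
    n = len(scores)
    k = len(scores[0])
    # p = proportion scoring 1 on each item, q = 1 - p
    pq_sum = 0.0
    for col in zip(*scores):
        p = sum(col) / n
        pq_sum += p * (1 - p)
    total_var = pvariance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - pq_sum / total_var)

# Hypothetical data: 5 testtakers x 4 true/false items
data = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
kr = kr20(data)
print(round(kr, 3))
```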
34
latent-trait theory
- a synonym for IRT (Item Response Theory) in the academic literature - a system of ASSUMPTIONS about measurement - includes ASSUMPTION that a TRAIT being measured is UNIDIMENSIONAL - go back and check this pg 168 - the extent to which each test item measures the targeted trait - also referred to as LATENT-TRAIT MODEL p. 168
35
measurement error
all factors associated with the PROCESS of measuring some variable OTHER than the actual variable being measured p.146
36
odd-even reliability
- an ESTIMATE of the SPLIT-HALF RELIABILITY of a test - obtained by splitting a test, assigning odd-numbered items to one half & even-numbered items to the other half of the test p.153
37
parallel forms
when on each FORM of the test, the MEANS & VARIANCES of OBSERVED TEST SCORES are EQUAL p.151
38
parallel-forms reliability
- an estimate of the consistency of two versions of a test across time - an ESTIMATE of the extent to which ITEM SAMPLING & OTHER ERRORS have affected test scores on versions of the SAME test, for which MEANS & VARIANCES of OBSERVED TEST SCORES are EQUAL. (contrast with alternate forms reliability & also coefficient of equivalence) p.151-152
39
polytomous test item
a test item or question with THREE OR MORE ALTERNATIVE RESPONSES - where ONLY ONE is scored CORRECT or is CONSISTENT with a TARGETED TRAIT or other CONSTRUCT p. 169
40
power test
- a test, usually of achievement or ability, that has 1) either NO TIME LIMIT or such a long time limit that ALL TESTTAKERS can attempt ALL ITEMS and 2) some items that are SO DIFFICULT that NO TESTTAKER can obtain a PERFECT SCORE (so it's isolating the 'power' or 'ability' variable) (contrast with speed test) p.163
41
random error
a source of ERROR when measuring a target variable due to UNPREDICTABLE FLUCTUATIONS & INCONSISTENCIES of OTHER VARIABLES in the measurement process - sometimes referred to as "NOISE" - contrast with systematic error p.146
42
Rasch model
a reference to an IRT MODEL with VERY SPECIFIC ASSUMPTIONS about the UNDERLYING DISTRIBUTION p.169
43
reliability
the proportion of the total variance attributable to TRUE VARIANCE - the GREATER the proportion of TRUE VARIANCE = the GREATER the RELIABILITY of a test - p.157-158
44
reliability coefficient
- general term - an INDEX of RELIABILITY - or the RATIO of TRUE SCORE VARIANCE to TOTAL SCORE VARIANCE on a test p. 145
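The ratio in this definition can be written out directly. The sketch below assumes the true score model's decomposition (total variance = true variance + error variance); the function name and variance values are hypothetical, and in practice true variance is not observed directly but estimated.

```python
def reliability_coefficient(true_variance, error_variance):
    """Ratio of true score variance to total score variance."""
    total_variance = true_variance + error_variance
    return true_variance / total_variance

# Hypothetical variances: 80 units of true variance, 20 of error variance
r_xx = reliability_coefficient(true_variance=80.0, error_variance=20.0)
print(r_xx)
```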
45
restriction of range/variance
  • SAMPLING PROCEDURES may affect the variance of either variable in a correlation analysis
  • if the variance of EITHER variable is RESTRICTED by the sampling procedure, the resulting CORRELATION COEFFICIENT tends to be LOWER (i.e., masking the true correlation)
    (thought to self - is this also a validity issue, e.g., failing to detect - a miss?)
  • the converse is INFLATION OF RANGE/VARIANCE: if the variance of EITHER variable is INFLATED by the sampling procedure, the resulting CORRELATION COEFFICIENT tends to be HIGHER (i.e., giving a false indication of correlation)
    (thought to self - is this also a validity issue, e.g., a false positive?)
    p.162
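A small made-up demonstration of restriction of range: the same paired scores are correlated once over the full range of x and once after keeping only the middle of the range. The data and the cut-offs (4 to 7) are invented for illustration, and the Pearson helper is written out by hand so the example is self-contained.

```python
def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

# Hypothetical paired scores over the full range of x
x = list(range(1, 11))
y = [2, 1, 4, 3, 6, 5, 8, 7, 10, 9]
r_full = pearson_r(x, y)

# Restrict the sample to the middle of the x range (4 through 7)
kept = [(a, b) for a, b in zip(x, y) if 4 <= a <= 7]
r_restricted = pearson_r([a for a, _ in kept], [b for _, b in kept])

print(round(r_full, 3), round(r_restricted, 3))
```

With these numbers the restricted sample yields the lower coefficient, which is the masking effect the card describes.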
46
Spearman-Brown formula
allows a test developer/user to estimate the INTERNAL consistency reliability from a correlation of TWO HALVES of a test that has been LENGTHENED or SHORTENED. - inappropriate for use with HETEROGENEOUS tests or SPEED tests p. 153-154
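A brief sketch of the general Spearman-Brown formula, r_SB = n * r / (1 + (n - 1) * r), where n is the factor by which the test is lengthened (n > 1) or shortened (n < 1). The function name and the example correlation are assumptions for illustration.

```python
def spearman_brown(r, n=2.0):
    """Estimated reliability after changing test length by a factor of n."""
    return (n * r) / (1 + (n - 1) * r)

# Hypothetical half-test correlation of .60, stepped up to full length (n = 2)
full_length = spearman_brown(0.60, n=2)
# The same formula with n = 0.5 estimates the effect of halving a test
halved = spearman_brown(0.75, n=0.5)
print(round(full_length, 3), round(halved, 3))
```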
47
speed test
- a test, usually of achievement or ability, which has a TIME LIMIT - usually contains ITEMS of UNIFORM difficulty (usually uniformly low) - so that when given GENEROUS TIME ALL TESTTAKERS should be able to complete ALL ITEMS CORRECTLY (so it's isolating the SPEED variable) (contrast with 'power test') p.163, 272
48
split-half reliability
an ESTIMATE of the INTERNAL CONSISTENCY of a test - obtained by CORRELATING two PAIRS of SCORES taken from EQUIVALENT HALVES of a SINGLE TEST administered ONCE - p.152- 154
49
standard error of a score
- in TRUE SCORE THEORY - a STATISTIC designed to ESTIMATE how far an OBSERVED SCORE DEVIATES from a TRUE SCORE (also called STANDARD ERROR OF MEASUREMENT (SEM)) p.175
50
standard error of measurement (SEM)
- in TRUE SCORE THEORY - a STATISTIC designed to ESTIMATE how far an OBSERVED SCORE DEVIATES from a TRUE SCORE (also called STANDARD ERROR OF A SCORE) p.132, 175-178
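A minimal sketch of the usual SEM calculation, assuming the formula SEM = SD * sqrt(1 - r_xx), where SD is the standard deviation of test scores and r_xx is the test's reliability coefficient. The function name and values are hypothetical.

```python
import math

def standard_error_of_measurement(sd, reliability):
    """SEM = SD * sqrt(1 - reliability coefficient)."""
    return sd * math.sqrt(1 - reliability)

# Hypothetical test: standard deviation 10, reliability .84
sem = standard_error_of_measurement(sd=10, reliability=0.84)
print(round(sem, 2))
```

Note how the SEM shrinks as reliability rises: a perfectly reliable test would have an SEM of zero.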
51
standard error of the difference
- a STATISTIC designed to aid in determining HOW LARGE a DIFFERENCE between two scores should be BEFORE it is considered STATISTICALLY SIGNIFICANT p. 132, 178
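A sketch of one standard way to obtain this statistic, assuming the formula sigma_diff = sqrt(SEM1^2 + SEM2^2), with each SEM taken from the score being compared. The function name and the SEM values are illustrative assumptions.

```python
import math

def standard_error_of_difference(sem_1, sem_2):
    """Combine the SEMs of two scores: sqrt(sem_1^2 + sem_2^2)."""
    return math.sqrt(sem_1 ** 2 + sem_2 ** 2)

# Hypothetical SEMs of 3 and 4 for the two scores being compared
se_diff = standard_error_of_difference(3, 4)
# A difference is often judged against a multiple of this value (e.g., about 2 for ~95%)
print(se_diff, 1.96 * se_diff)
```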
52
static characteristic
a TRAIT, STATE or ABILITY presumed to be relatively STATIC OVER TIME (contrast with dynamic characteristic) p.162
53
systematic error
- a source of ERROR in the measurement process - typically CONSTANT or PROPORTIONATE to what is presumed to be the TRUE VALUE of the target variable being measured - once known, it is predictable & FIXABLE - relative standings remain unchanged - may not be VALID but is RELIABLE - p. 146
54
test battery
typically composed of TESTS designed to measure DIFFERENT VARIABLES. - quite often psychologists rely on a BATTERY of tests in the process of EVALUATION. p. 155n5, 502-504 see also specific batteries
55
test-retest reliability
an estimate of reliability obtained by CORRELATING pairs of scores from the SAME PEOPLE on TWO DIFFERENT administrations of the test - appropriate when EVALUATING the RELIABILITY of a test purporting to measure something relatively STABLE over TIME e.g., a personality trait p.150-151, 161
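A small sketch of the correlation step, using invented scores for five people tested twice. The Pearson helper is written out by hand so the example is self-contained; names and data are hypothetical.

```python
def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

# Hypothetical scores for the SAME five people on two administrations
first_administration = [1, 2, 3, 4, 5]
second_administration = [2, 1, 4, 3, 5]
test_retest_r = pearson_r(first_administration, second_administration)
print(round(test_retest_r, 2))
```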
56
theta level (in IRT)
- a reference to the DEGREE of the underlying ability or trait that a TESTTAKER is presumed to BRING TO the test - also referred to as THETA p. 170
57
transient error
a source of error attributable to the testtaker's FEELINGS, MOODS, or MENTAL STATE OVER TIME p.160
58
true score
- according to CLASSICAL TEST THEORY - a value that GENUINELY reflects an individual's ABILITY or TRAIT level as measured by a particular test p.164
59
true variance
- in the TRUE SCORE MODEL - the COMPONENT of a score attributable to TRUE DIFFERENCES in the ability or trait being measured - can be in an OBSERVED SCORE or a DISTRIBUTION of SCORES p.146
60
universe
- in GENERALIZABILITY THEORY - the TOTAL CONTEXT of a particular test situation - including ALL the FACTORS that lead to an individual testtaker's score - p.167
61
universe score
- in GENERALIZABILITY THEORY - a test score corresponding to the PARTICULAR UNIVERSE being assessed or evaluated p. 167
62
variance
a statistic useful in describing SOURCES of test score variability - equal to the MEAN of the SQUARES of the DIFFERENCES between SCORES in a distribution and THEIR MEAN - calculated by SQUARING & SUMMING all the DEVIATION SCORES then DIVIDING by the total number of scores p.95, 146
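The calculation described above can be sketched step by step as follows; the function name and the score list are made up for illustration.

```python
def variance(scores):
    """Mean of the squared deviations of scores from their mean."""
    n = len(scores)
    mean = sum(scores) / n
    squared_deviations = [(s - mean) ** 2 for s in scores]
    return sum(squared_deviations) / n   # divide by the total number of scores

# Hypothetical distribution of eight test scores
var = variance([2, 4, 4, 4, 5, 5, 7, 9])
print(var)
```

This divides by the number of scores, as the card states; sample-based formulas that divide by n - 1 would give a slightly larger value.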
63
What is the main challenge of a test creator?
to MAXIMIZE the proportion of TOTAL VARIANCE that is TRUE VARIANCE and to MINIMIZE the proportion that is ERROR VARIANCE - p.147
64
What are four main SOURCES of ERROR VARIANCE?
1) TEST CONSTRUCTION - item sampling/content sampling
2) TEST ADMINISTRATION - test environment; testtaker variables; examiner-related variables
3) TEST SCORING and INTERPRETATION - scorers; scoring systems
4) OTHER SOURCES OF ERROR - sampling error; methodological error (researchers not trained, ambiguous wording, item biases)
p. 147 - 149
65
What are some methods of measuring INTERNAL CONSISTENCY of a test's items?
1) SPEARMAN-BROWN FORMULA p.153-4 2) KUDER-RICHARDSON FORMULAS p.155-6 3) COEFFICIENT ALPHA p.157 4) AVERAGE PROPORTIONAL DISTANCE (APD) p.157
66
How is obtaining estimates of ALTERNATE-FORMS reliability & PARALLEL FORMS reliability SIMILAR?
1) Two test administrations with the SAME GROUP are required 2) Test scores between tests may be AFFECTED by factors such as MOTIVATION, FATIGUE, or INTERVENING EVENTS (practise, learning or therapy) - although not as much as if the EXACT SAME test had been administered twice
67
What is an INHERENT source of ERROR-VARIANCE when computing an ALTERNATE or PARALLEL-FORMS reliability coefficient?
ITEM SAMPLING ERROR p. 152
68
What are the THREE steps of computation of a COEFFICIENT of SPLIT-HALF RELIABILITY?
Step 1 - divide the test into EQUIVALENT HALVES
Step 2 - calculate a Pearson r between scores on the TWO HALVES of the test
Step 3 - adjust the HALF-TEST reliability using the SPEARMAN-BROWN FORMULA
(p.152-153)
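The three steps might be sketched as below, assuming an odd-even split as the way of forming halves (other splits are possible) and the Spearman-Brown adjustment for doubling length, 2r / (1 + r). Function names and data are hypothetical.

```python
def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def split_half_reliability(scores):
    # Step 1: split into halves (odd-numbered vs even-numbered items)
    odd_half = [sum(row[0::2]) for row in scores]
    even_half = [sum(row[1::2]) for row in scores]
    # Step 2: Pearson r between the two half-test scores
    r_half = pearson_r(odd_half, even_half)
    # Step 3: Spearman-Brown adjustment to full test length
    return 2 * r_half / (1 + r_half)

# Hypothetical data: 4 testtakers x 6 items scored 0/1
data = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
reliability = split_half_reliability(data)
print(round(reliability, 3))
```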
69
Contrast the Coefficient alpha & Pearson r
COEFFICIENT ALPHA - typically ranges from 0 to 1; gauges how SIMILAR data sets are
PEARSON r - ranges from -1 to +1; deals with both SIMILARITY & DISSIMILARITY
70
What is the DIFFERENCE between the FOCUS of Average proportional distance (APD) and SPLIT-HALF methods & CRONBACH's ALPHA?
APD - focus is on the DEGREE of DIFFERENCE between item scores
SPLIT-HALF & CRONBACH'S ALPHA - focus is on SIMILARITIES between item scores
p.157
71
What are the 3 approaches to ESTIMATING RELIABILITY?
1) test-retest
2) alternate or parallel forms
3) internal or inter-item consistency
The method chosen will depend on a number of factors, e.g., the PURPOSE & NATURE of the measure p. 160
72
How do we decide which RELIABILITY COEFFICIENT to CHOOSE (use)?
- the method chosen will depend on a number of factors, e.g., the PURPOSE & NATURE of the measure
NOTE: the various RELIABILITY COEFFICIENTS DO NOT all reflect the same SOURCES of ERROR VARIANCE - see p. 161 (important to understand why each test is selected; also refer to Table 5-4)
73
What are the 3 ASSUMPTIONS made when using IRT?
1) Unidimensionality 2) Local Independence 3) Monotonicity p. 170