Ch. 5 for Unit 3 Flashcards

1
Q

reliability coefficient/Test-retest reliability

A

A reliability coefficient is a statistic that quantifies reliability, ranging from 0 (not at all reliable) to 1 (perfectly reliable). Test-retest reliability is an estimate of reliability obtained by correlating pairs of scores from the same people on two different administrations of the same test. The test-retest method is appropriate when evaluating the reliability of a test that purports to measure something that is relatively stable over time, such as a personality trait. If the characteristic being measured is assumed to fluctuate over time, there would be little sense in assessing the reliability of the test with the test-retest method.
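As a minimal illustration (with made-up scores), a test-retest reliability estimate is simply the correlation between the two administrations:

```python
# Hypothetical illustration: test-retest reliability as the Pearson correlation
# between scores from two administrations of the same test to the same people.
import numpy as np

first_administration = np.array([12, 15, 9, 20, 17, 11, 14, 18])    # made-up scores
second_administration = np.array([13, 14, 10, 19, 18, 10, 15, 17])  # same people, retested later

# The test-retest reliability estimate is the correlation between the two sets of scores.
r_test_retest = np.corrcoef(first_administration, second_administration)[0, 1]
print(f"Test-retest reliability estimate: {r_test_retest:.2f}")
```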

2
Q

measurement error

A

refers to the inherent uncertainty associated with any measurement, even after care has been taken to minimize preventable mistakes.

3
Q

true score

A

the measurement of a quantity if there were no measurement error at all.

4
Q

carryover effects/ practice effects/ fatigue effects

A

Measurement processes that alter what is measured are termed carryover effects.
In ability tests, practice effects are carryover effects in which the test itself provides an opportunity to learn and practice the ability being measured. Fatigue effects are carryover effects in which repeated testing reduces overall mental energy or motivation to perform on a test.

5
Q

Why is the true score not necessarily the “truth”?

A

By definition, a true score is tied to the particular measurement instrument used; a different instrument would yield a different true score, and even a flawed instrument produces a “true score” in this technical sense.

6
Q

construct score.

A

If you are interested in the truth independent of measurement, you are not looking for the so-called true score, but for what psychologists call the construct score. A construct is a theoretical variable we believe exists, such as depression, agreeableness, or reading ability. A construct score is a person’s standing on a theoretical variable independent of any particular measurement. If we could create tests that perfectly measured theoretical constructs, the true score and the construct score would be identical. Unfortunately, all tests are flawed. The long-term average of many measurements using a flawed measurement procedure is still called a true score, flaws and all.

7
Q

Why bother with true scores when construct scores are clearly more important?

A

Because true scores help us understand and calculate reliability, and without reliability a test cannot be valid. In Chapter 6, we will discuss test validity in greater detail. Whenever we evaluate a test’s validity, we first check that its reliability is sufficient. The lower the test’s reliability, the lower the test’s validity. Yet high reliability does not guarantee high validity. A deeply flawed test that gives consistent measurements is reliable but not valid.

8
Q

true variance vs. error variance

A

Variance from true differences is true variance, and variance from irrelevant, random sources is error variance. If σ² represents the total observed variance, σ²_tr the true variance, and σ²_e the error variance, the relationship can be expressed as

σ² = σ²_tr + σ²_e

In this equation, the total variance in an observed distribution of test scores (σ²) equals the sum of the true variance (σ²_tr) and the error variance (σ²_e).
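A small worked sketch with hypothetical numbers; the final line uses the standard CTT idea (not stated on this card) that the reliability coefficient is the proportion of total variance that is true variance:

```python
# Hypothetical illustration of the variance decomposition: total = true + error.
sigma2_true = 90.0    # made-up true variance
sigma2_error = 10.0   # made-up error variance
sigma2_total = sigma2_true + sigma2_error   # total observed variance = 100

# In CTT, the reliability coefficient equals the proportion of total variance that is true variance.
reliability = sigma2_true / sigma2_total
print(reliability)  # 0.9
```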

9
Q

Random error vs systematic error

A

Random error consists of unpredictable fluctuations and inconsistencies of other variables in the measurement process. Sometimes referred to as “noise,” this source of error fluctuates from one testing situation to another with no discernible pattern that would systematically raise or lower scores.
Random errors increase or decrease test scores unpredictably, and on average, in the long run, they tend to cancel each other out. In contrast to random errors, systematic errors do not cancel each other out because they influence test scores in a consistent direction: they either consistently inflate scores or consistently deflate them.

10
Q

bias

A

bias refers to the degree to which systematic error influences the measurement.

11
Q

item sampling or content sampling,

A

terms that refer to variation among items within a test as well as to variation among items between tests.

12
Q

testtaker variables.

A

Pressing emotional problems, physical discomfort, lack of sleep, and the effects of drugs or medication can all be sources of error variance. Formal learning experiences, casual life experiences, therapy, illness, and changes in mood or mental state are other potential sources of testtaker-related error variance.

13
Q

Examiner-related variables

A

These are potential sources of error variance. The examiner’s physical appearance and demeanor (even the presence or absence of an examiner) are factors for consideration here. Some examiners in some testing situations might knowingly or unwittingly depart from the procedure prescribed for a particular test. On an oral examination, some examiners may unwittingly provide clues by emphasizing key words as they pose questions.

14
Q

methodological error

A

Error variance may also result from methodological error: for example, interviewers may not have been trained properly, the wording in the questionnaire may have been ambiguous, or the items may have somehow been biased to favor one or another of the candidates.

15
Q

coefficient of stability.

A

When the interval between testings is greater than six months, the estimate of test-retest reliability is often referred to as the coefficient of stability.

16
Q

replicability crisis.

A

The replicability crisis refers to the widespread failure of published research findings to be reproduced when studies are repeated. The major causal factors are argued to be (1) a general lack of published replication attempts in the professional literature, (2) editorial preferences for positive over negative findings, and (3) questionable research practices on the part of authors of published studies.

17
Q

Preregistration

A

involves publicly committing to a set of procedures prior to carrying out a study. Using such a procedure, there can be no doubt as to the number of observations planned, and the number of measures anticipated. In fact, there are now several websites that allow researchers to preregister their research plans.

18
Q

coefficient of equivalence.

A

The degree of the relationship between various forms of a test can be evaluated by means of an alternate-forms or parallel-forms coefficient of reliability, also known as the coefficient of equivalence.

19
Q

Parallel forms/ parallel forms reliability

A

exist when, for each form of the test, the means and the variances of observed test scores are equal
parallel forms reliability refers to an estimate of the extent to which item sampling and other errors have affected test scores on versions of the same test when, for each form of the test, the means and variances of observed test scores are equal.

20
Q

Alternate forms/ alternate forms reliability

A

Alternate forms are simply different versions of a test that have been constructed so as to be parallel. Although they do not meet the requirements for the legitimate designation “parallel,” alternate forms of a test are typically designed to be equivalent with respect to variables such as content and level of difficulty.
alternate forms reliability refers to an estimate of the extent to which these different forms of the same test have been affected by item sampling error, or other error. Estimating alternate forms reliability is straightforward: Calculate the correlation between scores from a representative sample of individuals who have taken both tests.

21
Q

How is obtaining estimates of alternate-forms reliability similar to obtaining estimates of parallel-forms reliability?

A

(1) Two test administrations with the same group are required, and (2) test scores may be affected by factors such as motivation, fatigue, or intervening events such as practice, learning, or therapy (although not as much as when the same test is administered twice).

22
Q

internal consistency estimate of reliability / estimate of inter-item consistency

A

An estimate of the reliability of a test can be obtained without developing an alternate form of the test and without having to administer the test twice to the same people. Deriving this type of estimate entails an evaluation of the internal consistency of the test items.

23
Q

split-half reliability

A

is obtained by correlating the scores from two equivalent halves of a single test administered once. It is a useful measure of reliability when it is impractical or undesirable to assess reliability with two tests or to administer a test twice (because of factors such as time or expense). The computation of a coefficient of split-half reliability generally entails three steps:

Step 1. Divide the test into equivalent halves.

Step 2. Calculate a Pearson r between scores on the two halves of the test.

Step 3. Adjust the half-test reliability using the Spearman–Brown formula (discussed shortly).

GOAL = a primary objective in splitting a test in half for the purpose of obtaining a split-half reliability estimate is to create what might be called “mini-parallel-forms,” with each half equal to the other—or as nearly equal as humanly possible—in format, stylistic, statistical, and related aspects.
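A brief sketch of the three steps with a made-up item-score matrix, using an odd-even split (see card 25) and the Spearman–Brown adjustment (see card 26):

```python
# Sketch of the three split-half steps using a hypothetical score matrix
# (rows = testtakers, columns = items scored 0/1), with an odd-even split.
import numpy as np

scores = np.array([
    [1, 1, 0, 1, 1, 0, 1, 1],
    [0, 1, 0, 0, 1, 0, 1, 0],
    [1, 1, 1, 1, 1, 1, 1, 0],
    [0, 0, 0, 1, 0, 0, 1, 0],
    [1, 0, 1, 1, 1, 1, 0, 1],
])

# Step 1: divide the test into equivalent halves (here, odd- vs. even-numbered items).
odd_half_totals = scores[:, 0::2].sum(axis=1)
even_half_totals = scores[:, 1::2].sum(axis=1)

# Step 2: calculate a Pearson r between the two half-test scores.
r_half = np.corrcoef(odd_half_totals, even_half_totals)[0, 1]

# Step 3: adjust the half-test correlation to full-test length with the Spearman-Brown formula.
r_split_half = (2 * r_half) / (1 + r_half)
print(f"half-test r = {r_half:.2f}, Spearman-Brown adjusted = {r_split_half:.2f}")
```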

24
Q

Why should a test not be split in half when calculating split-half reliability?

A

Simply dividing the test in the middle is not recommended because this procedure would likely spuriously raise or lower the reliability coefficient. Different amounts of fatigue for the first as opposed to the second part of the test, different amounts of test anxiety, and differences in item difficulty as a function of placement in the test are all factors to consider.

25
odd-even reliability
One acceptable way to split a test is to randomly assign items to one or the other half of the test. Another acceptable way to split a test is to assign odd-numbered items to one half of the test and even-numbered items to the other half. This method yields an estimate of split-half reliability that is also referred to as odd-even reliability
26
Spearman–Brown formula
allows a test developer or user to estimate internal consistency reliability from a correlation between two halves of a test. The Spearman–Brown prediction formula can be used to see how the sum of many parallel tests becomes more reliable as the number of tests increases. When a single test has a low reliability, many parallel tests must be combined to achieve high levels of reliability. The Spearman-Brown formula is used to estimate the reliability of a test after changing the number of items, while Cronbach's alpha assesses internal consistency within a single test
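A sketch of the prediction formula itself, with hypothetical reliability values; n is the factor by which the test is lengthened or shortened:

```python
# Sketch of the Spearman-Brown prediction formula: estimated reliability of a test
# whose length is changed by a factor n (n = 2 doubles the test, n = 0.5 halves it).
def spearman_brown(r_original: float, n: float) -> float:
    """Predicted reliability after the test is lengthened (or shortened) by factor n."""
    return (n * r_original) / (1 + (n - 1) * r_original)

# Hypothetical values: a half-test correlation of .70 adjusted to full length (n = 2).
print(spearman_brown(0.70, 2))    # ~0.82
# Shortening a test to half its length (n = 0.5) lowers the predicted reliability.
print(spearman_brown(0.82, 0.5))  # ~0.69 (hypothetical)
```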
27
shortening a test and the Spearman–Brown formula
Usually, but not always, reliability increases as test length increases. Ideally, the additional test items are equivalent with respect to the content and the range of difficulty of the original items. Estimates of reliability based on consideration of the entire test therefore tend to be higher than those based on half of a test. If test developers or users wish to shorten a test, the Spearman–Brown formula may be used to estimate the effect of the shortening on the test’s reliability. Reduction in test size for the purpose of reducing test administration time is a common practice in certain situations. For example, the test administrator may have only limited time with a particular testtaker or group of testtakers. Reduction in test size may be indicated in situations where boredom or fatigue could produce responses of questionable meaningfulness.
28
number of test items and the Spearman–Brown formula
A Spearman–Brown formula could also be used to determine the number of items needed to attain a desired level of reliability
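A sketch of the rearranged formula for estimating how much longer a test must be to reach a target reliability; the item count and reliability values are hypothetical:

```python
# Sketch: rearranging the Spearman-Brown formula to estimate the lengthening factor n
# (and hence the number of items) needed to reach a desired reliability.
def lengthening_factor(r_current: float, r_desired: float) -> float:
    """How many times longer the test must be to reach r_desired, given r_current."""
    return (r_desired * (1 - r_current)) / (r_current * (1 - r_desired))

# Hypothetical example: a 20-item test with reliability .70, target reliability .90.
n = lengthening_factor(0.70, 0.90)
items_needed = 20 * n
print(f"lengthening factor = {n:.2f}, items needed = about {items_needed:.0f}")
```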
29
Inter-item consistency
refers to the degree of correlation among all the items on a scale. A measure of inter-item consistency is calculated from a single administration of a single form of a test.
30
coefficient alpha
a measure of the average strength of association between all possible pairs of items contained within a set of items. It is a commonly used index of the internal consistency of a test and ranges in value from 0, indicating no internal consistency, to 1, indicating perfect internal consistency. Also called Cronbach’s alpha or the alpha coefficient. [Lee J. Cronbach]
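A sketch of how coefficient alpha could be computed from a single administration, using the common variance-based formula and a made-up data set:

```python
# Sketch of coefficient (Cronbach's) alpha from a single administration:
# alpha = (k / (k - 1)) * (1 - sum of item variances / variance of total scores).
import numpy as np

def cronbach_alpha(item_scores: np.ndarray) -> float:
    """item_scores: 2-D array, rows = testtakers, columns = items."""
    k = item_scores.shape[1]
    item_variances = item_scores.var(axis=0, ddof=1)
    total_variance = item_scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Hypothetical 5-person, 4-item data set.
data = np.array([
    [3, 4, 3, 4],
    [2, 2, 3, 2],
    [4, 5, 4, 5],
    [1, 2, 1, 2],
    [3, 3, 4, 3],
])
print(round(cronbach_alpha(data), 2))
```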
31
difference between Pearson’s r and coefficient alpha
Unlike a Pearson r, which may range in value from −1 to +1, coefficient alpha typically ranges in value from 0 to 1. The reason for this range is that, conceptually, coefficient alpha (much like other coefficients of reliability) is calculated to help answer questions about how similar sets of data are.
32
Why is a higher Cronbach’s alpha not always better?
There is no value in higher internal consistency if it is achieved by items that are so similar that they yield no additional information.
33
limitations of Cronbach’s alpha
Cronbach’s alpha is the most frequently used measure of internal consistency, but it has several well-known limitations. It accurately measures internal consistency only under highly specific conditions that are rarely met in real measures. In the underlying model, the paths from the true score (T) to the observed scores X1 to X4 have coefficients labeled with the Greek letter lambda (λ). These coefficients are called loadings, and they represent the strength of the relationship between the true score and the observed scores. Coefficient alpha is accurate when these loadings are equal. If they are nearly equal, Cronbach’s alpha is still quite accurate, but when the loadings are quite unequal, Cronbach’s alpha underestimates reliability. Cronbach’s alpha assumes that all the test loadings (λ) are equal; McDonald’s omega relaxes this assumption.
34
McDonald’s (1978) omega.
Many statisticians use a measure of reliability called McDonald’s (1978) omega. It accurately estimates internal consistency even when the test loadings are unequal.
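A simplified sketch of omega for a single-factor model, assuming standardized items; the loadings are made up and would ordinarily come from a factor analysis:

```python
# Simplified sketch of McDonald's omega for a single-factor model,
# given standardized factor loadings (lambda) and item error (unique) variances.
# In practice the loadings come from a factor analysis; these are made-up values.
import numpy as np

loadings = np.array([0.8, 0.7, 0.6, 0.5])  # hypothetical, deliberately unequal loadings
error_variances = 1 - loadings**2           # assumes standardized items

omega = loadings.sum()**2 / (loadings.sum()**2 + error_variances.sum())
print(round(omega, 2))
```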
35
coefficient of inter-scorer reliability.
Perhaps the simplest way of determining the degree of consistency among scorers in the scoring of a test is to calculate a coefficient of correlation. This correlation coefficient is referred to as a coefficient of inter-scorer reliability.
36
Why might the audio-recording method inflate estimates of diagnostic reliability?
This method was commonly used prior to DSM-5. First, if the interviewing clinician decides the patient they are interviewing does not meet diagnostic criteria for a disorder, they typically do not ask about any remaining symptoms of the disorder (a feature of semistructured interviews designed to reduce administration times). This also means that the clinician listening to the audio recording, even if they believe the patient might meet diagnostic criteria for a disorder, does not have all the information necessary to assign a diagnosis and is therefore forced to agree that no diagnosis is present. Second, only the interviewing clinician can follow up patient responses with further questions or obtain clarification regarding symptoms to help them make a decision. Third, even when semistructured interviews are used, it is possible that two highly trained clinicians might obtain different responses from a patient if they had each conducted their own interview.
37
test-retest method of diagnosis
separate independent interviews are conducted by two different clinicians, with neither clinician knowing what occurred during the other interview. These interviews are conducted over a time frame short enough that true change in diagnostic status is highly unlikely, making this method similar to the dependability method of assessing reliability (Chmielewski & Watson, 2009). Because diagnostic reliability is intended to assess the extent to which a patient would receive the same diagnosis at different hospitals or clinics—or, alternatively, the extent to which different studies are recruiting similar patients—the test-retest method provides a more meaningful, realistic, and ecologically valid estimate of diagnostic reliability.
38
three approaches to the estimation of reliability
(1) test-retest, (2) alternate or parallel forms, and (3) internal or inter-item consistency.
39
five considerations regarding the nature of the test
(1) whether the test items are homogeneous or heterogeneous in nature; (2) whether the characteristic, ability, or trait being measured is presumed to be dynamic or static; (3) whether the range of test scores is or is not restricted; (4) whether the test is a speed or a power test; and (5) whether the test is or is not criterion-referenced.
40
homogeneity vs. heterogeneity of test items
A test is said to be homogeneous in items if it is functionally uniform throughout. Tests designed to measure one factor, such as one ability or one trait, are expected to be homogeneous in items. For such tests, it is reasonable to expect a high degree of internal consistency. By contrast, if the test is heterogeneous in items, an estimate of internal consistency might be low relative to a more appropriate estimate of test-retest reliability. It is important to note that high internal consistency does not guarantee item homogeneity: as long as the items are positively correlated, adding many items eventually results in high internal consistency coefficients, homogeneous or not.
41
Dynamic versus static characteristics
A dynamic characteristic is a trait, state, or ability presumed to be ever-changing as a function of situational and cognitive experiences. A static characteristic, by contrast, is one presumed to be relatively unchanging, such as intelligence.
42
restriction of range or restriction of variance (or, conversely, inflation of range or inflation of variance)
Restriction of range is the limitation, via sampling, measurement procedures, or other aspects of experimental design, of the full range of total possible scores to only a narrow portion of that range. For example, in a study of the grade point averages of university students, restriction of range occurs if only students from the dean’s list are included. Range restriction on a particular variable may lead to negative effects such as failing to observe, or improperly characterizing, a relationship between the variables of interest. If the variance of either variable in a correlational analysis is restricted by the sampling procedure used, then the resulting correlation coefficient tends to be lower; if the variance of either variable is inflated by the sampling procedure, then the resulting correlation coefficient tends to be higher.
43
Speed tests versus power tests
When a time limit is long enough to allow testtakers to attempt all items, and if some items are so difficult that no testtaker is able to obtain a perfect score, then the test is a power test. By contrast, a speed test generally contains items of uniform level of difficulty (typically uniformly low) so that, when given generous time limits, all testtakers should be able to complete all the test items correctly
44
what should a reliability estimate of a speed test be based on?
(1) test-retest reliability, (2) alternate-forms reliability, or (3) split-half reliability from two separately timed half tests. Because a measure of the reliability of a speed test should reflect the consistency of response speed, the reliability of a speed test should not be calculated from a single administration of the test with a single time limit. If a speed test is administered once and some measure of internal consistency, such as a split-half correlation, is calculated, the result will be a spuriously high reliability coefficient.
45
Why should the reliability of a speed test not be calculated from a single administration of the test with a single time limit?
When a group of testtakers completes a speed test, almost all the items completed will be correct. If reliability is examined using an odd-even split, and if the testtakers completed the items in order, then testtakers will get close to the same number of odd as even items correct. A testtaker completing 82 items can be expected to get approximately 41 odd and 41 even items correct. A testtaker completing 61 items may get 31 odd and 30 even items correct. When the numbers of odd and even items correct are correlated across a group of testtakers, the correlation will be close to 1.00. Yet this impressive correlation coefficient actually tells us nothing about response consistency. Under the same scenario, a split-half reliability coefficient would yield a similar coefficient that would also be, well, equally useless.
46
criterion-referenced test
is designed to provide an indication of where a testtaker stands with respect to some variable or criterion, such as an educational or a vocational objective. Unlike norm-referenced tests, criterion-referenced tests tend to contain material that has been mastered in hierarchical fashion.
47
When can traditional measures of reliability be used with criterion-referenced tests?
As individual differences (and the variability) decrease, a traditional measure of reliability would also decrease, regardless of the stability of individual performance. Therefore, traditional ways of estimating reliability are not always appropriate for criterion-referenced tests, though there may be instances in which traditional estimates can be adopted. An example might be a situation in which the same test is being used at different stages in some program (training, therapy, or the like), and so variability in scores could reasonably be expected.
47
Why can the other methods of reliability (split-half, alternate-forms, test-retest) not be used with criterion-referenced tests?
A measure of reliability, therefore, depends on the variability of the test scores: how different the scores are from one another. In criterion-referenced testing, and particularly in mastery testing, how different the scores are from one another is seldom a focus of interest. In fact, individual differences between examinees on total test scores may be minimal. The critical issue for the user of a mastery test is whether a certain criterion score has been achieved. As individual differences (and the variability) decrease, a traditional measure of reliability would also decrease, regardless of the stability of individual performance. Therefore, traditional ways of estimating reliability are not always appropriate for criterion-referenced tests, though there may be instances in which traditional estimates can be adopted.
48
classical test theory (CTT)
- CTT is the most widely used and accepted model in the psychometric literature today.
- One reason it has remained the most widely used model is its simplicity, especially when one considers the complexity of other proposed models of measurement.
- The CTT notion that everyone has a “true score” on a test has had, and continues to have, great intuitive appeal.
- Its assumptions are rather easily met and therefore applicable to most measurement situations, which can be advantageous, especially for the test developer in search of an appropriate model of measurement for a particular application. Still, in psychometric parlance, CTT assumptions are characterized as “weak” precisely because they are so readily met.
- A final advantage of CTT over other models of measurement is its compatibility and ease of use with widely used statistical techniques.
49
CTT definition of true score
A true score is a value that, according to CTT, genuinely reflects an individual’s ability (or trait) level as measured by a particular test.
50
problems with CTT
- Its assumption concerning the equivalence of all items on a test (that is, all items are presumed to contribute equally to the total score) is particularly questionable when doubt exists as to whether the scaling of the instrument in question is genuinely interval level in nature.
- The length of tests developed using a CTT model: whereas test developers (and most testtakers) favor shorter rather than longer tests, the assumptions inherent in CTT favor the development of longer rather than shorter tests.
51
domain sampling theory (better known today, in one of its many modified forms, as generalizability theory)
Whereas those who subscribe to CTT seek to estimate the portion of a test score that is attributable to error, proponents of domain sampling theory seek to estimate the extent to which specific sources of variation under defined conditions are contributing to the test score. In domain sampling theory, a test’s reliability is conceived of as an objective measure of how precisely the test score assesses the domain from which the test draws a sample. A domain of behavior, or the universe of items that could conceivably measure that behavior, can be thought of as a hypothetical construct: one that shares certain characteristics with (and is measured by) the sample of items that make up the test. In theory, the items in the domain are thought to have the same means and variances as those in the test that samples from the domain.
52
reliability measures most compatible with domain sampling theory
measures of internal consistency are perhaps the most compatible with domain sampling theory.
53
generalizability theory and universe score
In generalizability theory, the concept of a “universe score” replaces that of a “true score” (Shavelson et al., 1989). Developed by Lee J. Cronbach (1970) and his colleagues (Cronbach et al., 1972), generalizability theory is based on the idea that a person’s test scores vary from testing to testing because of variables in the testing situation. Instead of conceiving of all variability in a person’s scores as error, Cronbach encouraged test developers and researchers to describe the details of the particular test situation or universe leading to a specific test score. This universe is described in terms of its facets, which include considerations such as the number of items in the test, the amount of training the test scorers have had, and the purpose of the test administration. According to generalizability theory, given the exact same conditions of all the facets in the universe, the exact same test score should be obtained. This test score is the universe score, and it is, as Cronbach noted, analogous to a true score in the true score model.
54
generalizability study
examines how generalizable scores from a particular test are if the test is administered in different situations. Stated in the language of generalizability theory, a generalizability study examines how much of an impact different facets of the universe have on the test score.
55
coefficients of generalizability.
The influence of particular facets on the test score is represented by coefficients of generalizability. These coefficients are similar to reliability coefficients in the true score model.
56
decision study,
developers examine the usefulness of test scores in helping the test user make decisions. In practice, test scores are used to guide a variety of decisions, from placing a child in special education to hiring new employees to discharging mental patients from the hospital. The decision study is designed to tell the test user how test scores should be used and how dependable those scores are as a basis for decisions, depending on the context of their use
57
Item response theory (IRT)/latent-trait theory.
The procedures of IRT provide a way to model the probability that a person with X ability will be able to perform at a level of Y. Because the psychological or educational construct being measured is so often physically unobservable (stated another way, latent), and because the construct being measured may be a trait (though it could also be something else, such as an ability), a synonym for IRT in the academic literature is latent-trait theory. IRT is not a term used to refer to a single theory or method; rather, it refers to a family of theories and methods (and quite a large family at that), with many other names used to distinguish specific approaches. There are well over a hundred varieties of IRT models, each designed to handle data with certain assumptions and data characteristics.
58
difficulty level of an item and the item’s level of discrimination
Examples of two characteristics of items within an IRT framework are the difficulty level of an item and the item’s level of discrimination; items may be viewed as varying in terms of these, as well as other, characteristics. “Difficulty” in this sense refers to the attribute of not being easily accomplished, solved, or comprehended
59
discrimination in IRT
discrimination signifies the degree to which an item differentiates among people with higher or lower levels of the trait, ability, or whatever it is that is being measured.
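A sketch of how difficulty (b) and discrimination (a) enter one common IRT model, the two-parameter logistic; the parameter values are hypothetical:

```python
# Sketch of a two-parameter logistic (2PL) IRT item response function:
# the probability of a correct response given ability theta, item difficulty b,
# and item discrimination a. Parameter values below are hypothetical.
import math

def p_correct(theta: float, a: float, b: float) -> float:
    """Probability of a correct response under the 2PL model."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# A highly discriminating item (a = 2.0) separates abilities near its difficulty (b = 0)
# more sharply than a weakly discriminating item (a = 0.5).
for theta in (-1.0, 0.0, 1.0):
    print(theta, round(p_correct(theta, a=2.0, b=0.0), 2), round(p_correct(theta, a=0.5, b=0.0), 2))
```

In a Rasch-type model (see the Rasch model card below), the discrimination parameter is constrained to be equal across items.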
60
dichotomous test items vs polytomous test items
Dichotomous test items are items that can be answered with only one of two alternative responses (for example, true/false or correct/incorrect). There are IRT models designed to handle data from tests with dichotomous items, and other models designed to handle data resulting from the administration of tests with polytomous test items (test items or questions with three or more alternative responses, where only one is scored correct or scored as being consistent with a targeted trait or other construct). Other IRT models exist to handle other types of data.
61
How does IRT differ from CTT?
Latent-trait models differ in some important ways from CTT. For example, in CTT, no assumptions are made about the frequency distribution of test scores. By contrast, such assumptions are inherent in latent-trait models. Some IRT models have specific and stringent assumptions about the underlying distribution.
62
Rasch model
In the Rasch model, developed by the Danish mathematician Georg Rasch, each item on the test is assumed to have an equivalent relationship with the construct being measured by the test. A shorthand reference to these types of models is “Rasch,” so a reference to the Rasch model is a reference to an IRT model with specific assumptions about the underlying distribution.
63
standard error of measurement,
often abbreviated as SEM, provides a measure of the precision of an observed test score. Stated another way, it provides an estimate of the amount of error inherent in an observed score or measurement. In general, the relationship between the SEM and the reliability of a test is inverse: the higher the reliability of a test (or individual subtest within a test), the lower the SEM.
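A sketch using the standard formula SEM = SD × √(1 − reliability), with hypothetical values:

```python
# Sketch of the standard error of measurement: SEM = SD * sqrt(1 - reliability).
# The standard deviation and reliability below are hypothetical.
import math

sd = 15.0           # e.g., the standard deviation of a score scale
reliability = 0.91  # hypothetical reliability coefficient

sem = sd * math.sqrt(1 - reliability)
print(round(sem, 2))  # higher reliability -> lower SEM
```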
64
standard error of a score
denoted by the symbol σmeas, the standard error of measurement is an index of the extent to which one individual’s scores vary over tests presumed to be parallel.
65
confidence interval:
a range or band of test scores that is likely to contain the true score.
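A sketch of a 95% confidence interval built around an observed score using the SEM from the sketch above (values hypothetical):

```python
# Sketch: building a 95% confidence interval around an observed score using the SEM.
# Observed score and SEM values are hypothetical.
observed_score = 100.0
sem = 4.5
z_95 = 1.96  # z value for a 95% confidence level

lower, upper = observed_score - z_95 * sem, observed_score + z_95 * sem
print(f"95% confidence interval: {lower:.1f} to {upper:.1f}")
```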
66
standard error of the difference,
a statistical measure that can aid a test user in determining how large a difference should be before it is considered statistically significant.
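A sketch using the common formula for the standard error of the difference, σ_diff = SD × √(2 − r1 − r2), assuming both scores are on the same scale; the values are hypothetical:

```python
# Sketch of the standard error of the difference between two scores:
# sigma_diff = SD * sqrt(2 - r1 - r2), assuming both tests use the same score scale.
# The values below are hypothetical.
import math

sd = 15.0
r_test1 = 0.90  # reliability of the first test
r_test2 = 0.85  # reliability of the second test

se_diff = sd * math.sqrt(2 - r_test1 - r_test2)
# A difference between two scores larger than about 1.96 * se_diff would be
# considered statistically significant at the .05 level.
print(round(se_diff, 2), round(1.96 * se_diff, 2))
```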