Chapter 3: Reliability and Validity Flashcards

1
Q

What is reliability?

A

The stability or consistency of a test.
Tells us whether the test provides good measurement.
Important because tests are used in decision-making.

2
Q

Why is consistency in a test important?

A

Because an inconsistent test means:
Our test doesn't provide a good measure of stable traits or attributes.
As a result, we could end up making bad decisions.

3
Q

What is classical test theory?

A

A psychometric theory of measurement
Most commonly used approach to measurement in psychology.

x = T + e
x is observed score on the test
T is the true score
e is the error of measurement

4
Q

Classical Test Theory's equation for error

A

e = x - T

5
Q

Assumptions of Classical Test Theory

A
  1. The mean error of measurement is 0
  2. True scores and errors are uncorrelated
  3. Errors on different measures are uncorrelated
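
To make these assumptions concrete, here is a minimal simulation sketch in Python (the normal distributions, sample size, and variable names are my own illustrative assumptions, not part of the theory itself):

```python
import numpy as np

rng = np.random.default_rng(0)
n_people = 10_000

T = rng.normal(loc=100, scale=15, size=n_people)  # true scores
e = rng.normal(loc=0, scale=5, size=n_people)     # measurement error, mean 0
x = T + e                                         # observed scores: x = T + e

print(round(e.mean(), 2))                 # ~0: mean error of measurement is 0
print(round(np.corrcoef(T, e)[0, 1], 2))  # ~0: true scores and errors uncorrelated
```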
6
Q

Test-retest method

A

Administer the same test to the same group of people at two different points in time, then correlate the two sets of scores (see the fuller Test-Retest Reliability card below).
7
Q

Methods of Estimating Reliability

A
  1. Test-retest method
  2. Parallel forms
  3. Split-half methods
  4. Internal consistency methods
8
Q

Test Re-test Reliability

A

Give the same group of people the same test at two different points in time, then correlate the scores by computing a correlation coefficient (this reliability coefficient is better thought of as a stability coefficient).
Measures the stability of scores over time.

9
Q

Pearson Product Moment Correlation (r)

A

The most common correlation coefficient.
Used when two sets of scores are continuous and normally distributed.
Correlation coefficients can vary from 0 (no relationship) to +1 or −1 (perfect positive or negative relationship).

By convention, we need a coefficient of .70 or above for adequate reliability.
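
A minimal sketch of computing a test-retest (Pearson) correlation in Python; the score data are hypothetical:

```python
import numpy as np

time1 = np.array([12, 15, 9, 20, 18, 14, 11, 17])   # scores at time 1
time2 = np.array([13, 14, 10, 19, 17, 15, 10, 18])  # same people at time 2

r = np.corrcoef(time1, time2)[0, 1]
print(f"test-retest r = {r:.2f}")  # compare against the .70 rule of thumb
```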

10
Q

Test-retest methods: Error & Issues

A

Error is due solely to measurement error.
Some issues:
– Carryover effects (depend on the interval between tests)
– Memory
– Stability of the construct
– Fatigue
– Reactivity (people may learn about the topic between tests)
– Motivation (people may not be motivated when taking the test a second time)
– It is difficult to determine a suitable interval between tests (wait too long and the person may have changed; test too soon and there will be carryover effects)

Problems with the method:
– Time-consuming
– Expensive

11
Q

Alternate Forms Reliability (aka equivalent forms)

A

Give a test to a group of people, then after a suitable amount of time give them a different form of the test, and correlate the two sets of scores.
The forms are administered either at different times or in immediate succession.
To counterbalance order effects, half the group takes form A then B, and half takes B then A.

12
Q

Alternate forms methods: Error & Issues

A

Error is due to differences in test content, and perhaps to the passage of time (if the forms are not given back to back).
Some issues:
– Need the same number and type of items on each form
– Item difficulty must be the same on each form
– Variability of scores must be the same on each form
– Item sampling
– Temporal aspects

Developing an equivalent alternate form can be extremely time-consuming and sometimes impossible.
Example: we can easily come up with equivalent tests of math knowledge, but it is nearly impossible to come up with two equal tests of depression, because only a limited number of items relate to depression while there is an almost unlimited pool of math questions.

13
Q

Alternate forms methods: Bonuses

A

Bonuses:
– Shorter interval between administrations
– Carryover effects are lessened
– Reactivity is partially controlled

14
Q

Split Half Methods

A

Give the test to a group of people, split it in half (usually odd vs. even items), then correlate the scores on the two halves.
Concerned with internal consistency.
Determines to what extent the test is composed of homogeneous items.
Some psychologists think tests should be homogeneous, while others don't care whether they are homogeneous or heterogeneous; they only care how well the test works.

15
Q

The reliability of the split half method

A

From the viewpoint of item sampling (not temporal stability), the longer the test, the higher its reliability will be.

The Spearman-Brown formula allows us to estimate the reliability of the entire test from a split-half administration:

estimated r = [k × (obtained r)] / [1 + (k − 1) × (obtained r)]

k is the number of times the test is lengthened or shortened

For split-half administration, k is 2
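
A small sketch of the Spearman-Brown formula as given above (the obtained r of .60 is a hypothetical value):

```python
def spearman_brown(obtained_r: float, k: float) -> float:
    """Estimated reliability of a test lengthened (or shortened) by a factor of k."""
    return (k * obtained_r) / (1 + (k - 1) * obtained_r)

# Correlating the two halves of a split-half administration gave r = .60;
# the full test is twice as long as each half, so k = 2.
print(round(spearman_brown(0.60, k=2), 2))  # 0.75
```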

16
Q

Split-half methods: Error & Issues

A

Error is due to differences in item content between the halves of the test.
Some issues:
– Deciding which split-half reliability estimate to use

17
Q

Split-half methods: bonuses

A

Bonus:
– Carryover, reactivity, and time are minimized

18
Q

The Rulon Formula

A

An alternative to the Spearman-Brown formula:
estimated r = 1 − (variance of differences / variance of total scores)

Four scores are generated for each person: odd items, even items, difference (odd − even), and total (odd + even).

If scores were perfectly consistent, the halves would never disagree, so the "variance of differences" would be 0 and r would equal 1.
The ratio of the two variances reflects the proportion of error variance; subtracting it from 1 gives the proportion of "true" variance, i.e., the reliability.
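
A minimal sketch of the Rulon formula, using hypothetical odd/even half scores:

```python
import numpy as np

odd = np.array([10.0, 8.0, 12.0, 9.0, 11.0])   # hypothetical odd-item scores
even = np.array([9.0, 8.0, 11.0, 10.0, 12.0])  # hypothetical even-item scores

diff = odd - even    # difference scores (odd - even)
total = odd + even   # total scores (odd + even)

# Reliability = 1 - (variance of differences / variance of totals)
r_rulon = 1 - diff.var(ddof=1) / total.var(ddof=1)
print(round(r_rulon, 2))  # perfectly consistent halves would give exactly 1
```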

19
Q

Why do we want Variability and how do we increase it?

A

Variability of scores among individuals (that is, individual differences) is what makes statistical calculations such as the correlation coefficient possible.

For greater variability, increase the range of responses and create a test that is neither too easy nor too difficult.
Also increase the number of items: a 10-item true-false scale can theoretically yield scores from 0 to 10, but a 25-item scale can yield scores from 0 to 25 – which is precisely the message of the Spearman-Brown formula.

20
Q

Internal Consistency Methods

A

Examines the items themselves.
Give the test to a group, compute the correlations among all items, take the average of these intercorrelations, and use a formula such as coefficient alpha to estimate the reliability.

21
Q

Two assumptions of Internal Consistency Method

A

First, interitem reliability, like split-half reliability, is meaningful only if the test is made up of homogeneous items that all assess the same thing.

Second, if each item were perfectly reliable, we would obtain only two test scores.
Example: on a 100-item test you would score either 0 or 100.

In the real world, items are not perfectly reliable or consistent with each other, which results in individual differences and variability.

22
Q

Types and measures of internal consistency

A

Estimates the reliability of a test based on the number of items in the test (k) and the average intercorrelation among test items.

– Coefficient alpha: calculates the mean reliability coefficient one would obtain from all possible split halves.
The most widely used method of internal consistency.
Requires only one test administration.
Included in most statistical packages.
Used with multipoint items – for example, the response "never" is given 5 points and "occasionally" is given 4.
A value of .80 has been suggested as the threshold for reliability (sometimes too harsh for short tests, because reliability increases as the number of items increases).

– Kuder-Richardson Formula 20 (K-R 20): used with dichotomous items (right/wrong, true/false, yes/no).
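
A minimal sketch of coefficient alpha computed from a hypothetical person-by-item score matrix (rows are people, columns are items; all numbers are made up):

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Coefficient alpha for a matrix with rows = people, columns = items."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars / total_var)

likert = np.array([[5, 4, 5], [3, 3, 4], [4, 4, 4], [2, 3, 2], [5, 5, 4]])
binary = np.array([[1, 1, 0, 1], [0, 0, 0, 1], [1, 1, 1, 1], [0, 1, 0, 0]])

print(round(cronbach_alpha(likert), 2))  # multipoint items -> coefficient alpha
# For dichotomous 0/1 items each item's variance is p*q, so the same
# computation yields (essentially) the K-R 20.
print(round(cronbach_alpha(binary), 2))
```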

23
Q

Takeaway points on reliability

A

– No such thing as "the" reliability; different methods assess consistency from different perspectives
– Reliability coefficients apply to data, NOT to the instrument
– Any reliability coefficient is only an estimate of consistency
– Depends more on what one is trying to do with the test scores than on the scores themselves

24
Q

What do all the methods of reliability stem from?

A

All stem from the notion that a test score is composed of a "true" score plus an "error" component, and that reliability reflects the relative ratio of true-score variance to total (observed) score variance; if reliability were perfect, the error component would be zero.

25
Q

Generalizability Theory (G theory)

A

– Developed by Cronbach (1972)
– CTT is often referred to as a "parent" of G theory
– Error can come from a variety of sources
– Systematically vary the sources of error and study them experimentally

A second approach to reliability.
Does not assume that a person has a single "true" score on intelligence, or that error is basically of one kind; instead, it argues that different conditions may result in different scores, and that error may reflect a variety of sources.

26
Q

G-Theory: Lyman's (1978) five major sources of error

A

The individual taking the test
The influences of the examiner
The test items
Temporal consistency (intelligence is stable, mood is not)
Situational aspects (noise)

27
Q

CTT vs. G Theory: Main Advantage of G Theory

A

– G theory allows us to disentangle sources of error
– Separates error into systematic error and random error
– Focuses on the types of conditions to which we can expect results to generalize – that is, on our ability to generalize from one set of measures to a set of other plausible measures

– CTT, by contrast, has only a single random error component

28
Q

Scorer Reliability

A

– Compute a correlation coefficient to indicate the percentage of agreement between scorers
– An objectively scored test can have very high scorer reliability
– If your test can't be objectively scored, you need to train your scorers
– A subjectively scored test is limited by its scorer reliability
– To improve reliability, use test items that can be objectively scored

29
Q

Rater Reliability

A

– Same as scorer reliability, but now dealing with ratings
– Want to make sure that raters agree above chance
For example, suppose that two faculty members independently read 80 applications to their graduate program and rate each application as “accept,” “deny,” or “get more information.”

30
Q

Interobserver Reliability

A

Determine the level of agreement between the two observers.
Percentage agreement = [(A + D) / (A + B + C + D)] × 100, where A and D are the cells in which the two observers agree.

Coefficient kappa = (Po − Pe) / (1 − Pe)
Po is the observed proportion of agreement
Pe is the expected (chance) proportion of agreement
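
A minimal sketch of percentage agreement and coefficient kappa for two hypothetical raters (the rating categories are borrowed from the card above; the data are made up):

```python
from collections import Counter

rater1 = ["accept", "deny", "accept", "more info", "deny", "accept", "deny", "deny"]
rater2 = ["accept", "deny", "more info", "more info", "deny", "accept", "accept", "deny"]

n = len(rater1)
p_o = sum(a == b for a, b in zip(rater1, rater2)) / n  # Po: observed agreement

# Pe: chance agreement, from the product of each rater's marginal proportions
c1, c2 = Counter(rater1), Counter(rater2)
p_e = sum((c1[cat] / n) * (c2[cat] / n) for cat in set(rater1) | set(rater2))

kappa = (p_o - p_e) / (1 - p_e)
print(f"percentage agreement = {p_o:.0%}, kappa = {kappa:.2f}")
```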

31
Q

Correction for attenuation

A

– Uses statistical means to estimate the correlation we would obtain if both measures were perfectly reliable

When reliability is less than perfect, we say there is "noise in the system."
There is a statistical way to remove the "static," called the correction for attenuation:
estimated r = r12 / √(r11 × r22)

estimated r is the "true" correlation between the two measures if both the test and the second measure were perfectly reliable
r12 is the observed correlation between the test and the second measure
r11 is the reliability of the test
r22 is the reliability of the second measure
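
A small sketch of the correction for attenuation with hypothetical values:

```python
import math

def corrected_r(r12: float, r11: float, r22: float) -> float:
    """Estimated 'true' correlation if both measures were perfectly reliable."""
    return r12 / math.sqrt(r11 * r22)

# Hypothetical: observed correlation .40, test reliability .70, criterion .80
print(round(corrected_r(0.40, 0.70, 0.80), 2))  # ~0.53
```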

32
Q

Standard error of measurement

A

– Index of the amount of inconsistency or error expected in an individual’s test score
– Used to assess reliability from the individual point of view
– By calculating this value, we can estimate the probability of an individual’s score falling within a certain interval
– As the standard deviation decreases and the reliability coefficient increases, the SEM is smaller

Knowing the reliability coefficient for a particular test tells us the stability of the test

If we knew the test-retest reliability was .92 over a 6 month period then we could conclude that the measure is fairly stable over that period of time

The psychometrician is more interested in the test than in the subjects who took the test

The person that uses the test (teacher, clinical psychologist, etc.) cares more about the individual

These people assess reliability from the individual point of view by computing the standard error of measurement (SEM)

If you test someone many times and take the mean of all their test scores, that mean will approximate the "true" score, because error deviations are assumed to cancel each other out (for every lucky guess there is an unlucky one).
It is usually not possible to have someone take the same test enough times to determine the SEM directly, so we use a formula to estimate it:

SEM = SD × √(1 − r11)
SD is the standard deviation of the test scores
r11 is the reliability coefficient

The smaller the SEM, the narrower the interval within which the person's "true" score probably falls.
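
A minimal sketch of the SEM formula with hypothetical SD and reliability values, showing the kind of interval it lets us put around an observed score:

```python
import math

def sem(sd: float, r11: float) -> float:
    return sd * math.sqrt(1 - r11)

sd, r11, observed = 15, 0.91, 110  # hypothetical IQ-style scale
s = sem(sd, r11)                   # 15 * sqrt(0.09) = 4.5
# ~95% of the time the "true" score falls within about 2 SEMs of the observed score
print(f"SEM = {s:.1f}; ~95% interval: {observed - 2 * s:.0f} to {observed + 2 * s:.0f}")
```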

33
Q

SE of differences

A

– Compare test scores from two different measures
– Uses the SEM for both measures
– Calculate how different the scores need to be to judge one score as better than the other

If a student received a 108 on a math test and a 112 on a spelling test, we cannot conclude that she did better on the spelling test, because there is "noise" (i.e., unreliability) in both tests: her "true" spelling score could really be, say, 105, and her "true" math score 113.
To compare her two scores within a reliability framework, we use the standard error of differences (SED) – how much difference scores deviate on average.

SED = √(SEM₁² + SEM₂²)

which, when both tests are on the same scale with the same SD, is equal to

SED = SD × √(2 − r11 − r22)

SEM₁ and r11 refer to the first test; SEM₂ and r22 refer to the second test.
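
A small sketch of both forms of the SED, using hypothetical reliabilities and tying back to the 108-vs-112 example:

```python
import math

def sed_from_sems(sem1: float, sem2: float) -> float:
    return math.sqrt(sem1 ** 2 + sem2 ** 2)

def sed_from_reliabilities(sd: float, r11: float, r22: float) -> float:
    # Equivalent form, assuming both tests share the same SD
    return sd * math.sqrt(2 - r11 - r22)

sd, r11, r22 = 10, 0.90, 0.85
print(round(sed_from_reliabilities(sd, r11, r22), 2))  # 10 * sqrt(0.25) = 5.0
# A 4-point gap (108 math vs. 112 spelling) is smaller than this SED,
# so we cannot conclude she truly did better on the spelling test.
```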

34
Q

Reliability of Difference Scores

A

If we are more interested in the relationship between pairs of scores than in individual scores, we must inquire into the reliability of difference scores:

r difference = [½ (r11 + r22) − r12] / [1 − r12]
r11 is the reliability of the first test
r22 is the reliability of the second test
r12 is the correlation between the two tests

As the correlation between the two tests (r12) approaches the average of their two reliabilities, the reliability of the difference scores drops rapidly.
The point here is that we need to be very careful when making decisions based on difference scores. To compare the difference between two scores from two different tests, we also need to make sure the two scores are on the same scale of measurement; if they are not, we can convert them to z scores, T scores, or some other common scale.
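
A minimal sketch of the difference-score reliability formula, with hypothetical values showing how quickly it drops:

```python
def diff_score_reliability(r11: float, r22: float, r12: float) -> float:
    return (0.5 * (r11 + r22) - r12) / (1 - r12)

# Two individually reliable tests (.90 and .80) that correlate .70 with each other:
print(round(diff_score_reliability(0.90, 0.80, 0.70), 2))  # 0.50 -- quite low
```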

35
Q

What is Validity?

A

– Does the test measure what it is intended to measure?
– Is it valid for the purpose we intend to use it for?
– We discuss different types of validity, but they all fall under one umbrella.
– We want to make an overall evaluation of the interpretations of the test.
Validity is best thought of as a unitary process with somewhat different but related facets.

36
Q

What are all the types of validity?

A

Content validity
– face validity
Criterion validity
– concurrent validity
– predictive validity
Construct validity
– convergent validity
– divergent validity

37
Q

Content Validity

A

Content: How representatively does the test measure every element of the attribute?
Content validity refers to the question of whether the test adequately covers what is being measured; it is particularly relevant to achievement tests.
We need to make sure the test is truly representative of, and relevant to, what we are testing.
– Two ways to assess
– Subjective – ask experts (SMEs) to judge relevance and representativeness of the items
– Empirical – use statistical methods

Consider a test in this class that will cover the first five chapters. Should there be an equal number of questions from each chapter, or should certain chapters be given greater preeminence? Certainly, some aspects are easier to test, particularly in a multiple-choice format. But would such an emphasis reflect “laziness” on the part of the instructor, rather than a well thought out plan designed to help build a valid test?

Taxonomies
Help achieve content validity by careful planning of the test's construction.
Mostly used in educational tests.

Face Validity:
When a test appears valid to the people taking it
Concerned with how test takers perceive the attractiveness and appropriateness of a test
A test could “appear” valid, but may not be valid

38
Q

Criterion Validity

A

Criterion: How well does a test predict some external criterion measure?
– Match test scores with an independent criterion
– Types of criteria: Contrasted groups, GPA, Worker performance ratings
If a test is said to measure intelligence, we must show that scores on the test parallel or are highly correlated to intelligence as measured in some other way – that is, a criterion of intelligence
A test can never be better than the criterion it is matched against, and the world simply does not provide us with clear, unambiguous criteria.

Criteria:
Contrasted groups (one type of criterion) are groups that differ significantly on the particular domain.
For example, in validating an academic achievement test we could administer the test to two groups of college students, matched on relevant variables such as age and gender, but differing on grade point average, such as honors students vs. those on academic probation.

Two Types
– Predictive – test given first and criterion scores measured at a later time.
Example: we want the SAT to predict college GPA. We would need to administer the test to an unselected sample, wait for them all to finish college, and then correlate scores with GPA. It is unlikely we could obtain an unselected group (the sample will probably be relatively homogeneous), and most researchers don't want to wait that long.
– Concurrent – test given at the same time that the criterion scores are collected.
The main purpose of such concurrent validation is to develop a test as a substitute for a more time-consuming or expensive assessment procedure.

39
Q

Construct Validity

A

Construct: How well does the test measure the attribute it claims to measure?
– What makes it different is that it is a process that takes place within a theoretical framework
If we wish to validate a test of intelligence, we must be able to specify in a theoretical manner what intelligence is, and we must be able to hypothesize specific outcomes.
– Look for the correspondence between the theory and the observed data
Construct validity is an umbrella term that encompasses any information about a particular test; both content and criterion validity can be subsumed under this broad term.

Test scores are a function of three aspects:
1. Test items
2. Person responding
3. Context
The inferences made from the scores are very important.

Although we speak of validity as a property of a test, validity actually refers to the inference that is made from the test scores
we infer how well the person will perform on a future task (predictive or criterion validity)
whether the person possesses certain knowledge (content validity)
or a psychological construct or characteristic related to an outcome, such as spatial intelligence related to being an engineer (construct validity)

Two types:
Convergent validity – high correlation with another test measuring the same construct.
D. T. Campbell and Fiske (1959) and D. T. Campbell (1960) proposed that to show construct validity, one must show that a particular test correlates highly with the variables that, on the basis of theory, it ought to correlate with.
Divergent validity – low correlation with another test measuring a different construct.
They also argued that a test should not correlate significantly with variables that it ought not to correlate with.
The multitrait-multimethod matrix assesses both convergent and discriminant validity.

40
Q

Methods for Assessing Construct Validity

A

Group differences

Correlation and its derivative, factor analysis – a statistical procedure designed to elucidate the basic dimensions of a data set

Internal consistency of the test
Here we typically try to determine whether all of the items in a test are indeed assessing the particular variable, or whether performance on the test might be affected by some other variable.

Studies of change over occasions
Is there a change in test scores over time or with different examiners?

Studies of process
Focuses on observing how subjects perform on a test, rather than just what score they obtain.

41
Q

What is the relationship between validity and reliability?

A

Another way that reliability and validity are related is that a test cannot be valid if it is not reliable. In fact, the maximum validity coefficient between two variables is equal to:

maximum validity = √(r11 × r22)

where r11 again represents the reliability coefficient of the first variable (for example, a test)
r22 the reliability coefficient of the second variable (for example, a criterion).
For example, if r11 = .81 and r22 = .64, the validity coefficient can be at most √(.81 × .64) = .72.

42
Q

Interpreting a validity coefficient

A

– There is no standard value to surpass
– Determine if the validity is statistically and/or practically significant
– Squaring the validity coefficient gives an estimate of how much overlap there is between the test and the criterion
– Use the correlation of the test scores to make predictions of the criterion
The purpose of administering a test such as the SAT is to make an informed judgment about whether a high-school senior can do college work, and to predict what that person’s GPA will be. Such a prediction can be made by realizing that a correlation coefficient is simply an index of the relationship between two variables, a relationship that can be expressed by the equation
Y = bX + a
Y might be the GPA we wish to predict
X is the person’s SAT score
b (the slope) and a (the intercept) are estimated from our data
– Use an expectancy table
Expectancy tables can be more complex and include more than two variables – for example, if gender or type of high school attended were related to SAT scores and GPA, we could include these variables into our table, or create separate tables.
– Break the test and the criterion down into categories
– Use the standard error of estimate to find the margin of error
In talking about reliability, we talked about "noise in the system," that is, lack of perfect reliability. Similarly, a test ordinarily has less than perfect validity, so when we use its score to predict a criterion score, the predicted score has a margin of error. That margin of error is the SE of estimate:
SE of estimate = SD × √(1 − r12²)
SD is the standard deviation of the criterion scores
r12 is the validity coefficient
If the test had perfect validity (r12 = 1.00), the SE of estimate would be zero; there would be no error, and the predicted criterion score would be exactly correct.
If the test were not valid at all (r12 = 0), the SE of estimate would equal the SD.

In general, validity coefficients are substantially lower than reliability coefficients, because we do not expect strong correlations between tests and complex real-life criteria.
For instance, many factors other than intelligence determine your grade (student-teacher relationship, motivation, social life, etc.).
A test may correlate significantly with a criterion, but the significance may reflect a very large sample, rather than practical validity.
Even though an r of .40 looks rather large, and is indeed quite acceptable as a validity coefficient, its explanatory power (16%) is rather low – but this is a reflection of the complexity of the world, rather than a limitation of our tests.
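
A minimal sketch of using a validity coefficient for prediction; the regression weights, SD, and validity value below are all hypothetical placeholders:

```python
import math

# Prediction from the regression line Y = b*X + a
b, a = 0.0012, 0.5             # hypothetical slope and intercept
sat = 1300
predicted_gpa = b * sat + a
print(f"predicted GPA = {predicted_gpa:.2f}")

# Margin of error around that prediction: the SE of estimate
sd_criterion, r12 = 0.6, 0.40  # SD of GPAs; hypothetical validity coefficient
se_est = sd_criterion * math.sqrt(1 - r12 ** 2)
print(f"SE of estimate = {se_est:.2f}")     # 0 only if validity were perfect
print(f"shared variance = {r12 ** 2:.0%}")  # squared validity: 16% overlap
```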

43
Q

Validity: Bandwidth vs. Fidelity

A

Cronbach and Gleser (1965) used the term bandwidth to refer to the range of applicability of a test – tests that cover a wide area of functioning such as the MMPI are broad-band tests; tests that cover a narrower area, such as a measure of depression, are narrow-band tests.

These authors also used the term fidelity to refer to the thoroughness of the test. The two aspects trade off against each other: given a fixed amount of testing (such as a set number of items), as bandwidth increases, fidelity decreases.

44
Q

Decision Theory

A

Once a test has been shown to be valid for a measure, we can use it to predict a criterion. Since no test is 100% valid, our predictions will contain errors.

Comparing the test results with reality produces four categories:

Category A consists of individuals who on the test are positive for TB and indeed do have TB. These individuals, from a psychometric point of view, are considered “hits” – the decision based on the test matches the real world.
Category B consists of individuals for whom the test results indicate that the person does not have (is negative for) TB, and indeed they do not have TB – another category that represents “hits.”
Category C consists of individuals for whom the test results suggest that they are positive for TB, but they do not have TB; these are called false positives.
Category D consists of individuals for whom the test results are negative. They do not appear to have TB but in fact they do; thus they are false negatives.

45
Q

Sensitivity

A

Proportion of correctly identified positives
(i.e., how accurately does a test classify a person who has a particular disorder?)
Sensitivity = true positives (A) / [true positives (A) + false negatives (D)]

46
Q

Specificity

A

Proportion of correctly identified negatives
(i.e., how accurately does a test classify those who do NOT have the particular condition?)
Specificity = true negatives (B) / [true negatives (B) + false positives (C)]

47
Q

Predictive Value (Efficiency)

A

The ratio of true positives to all positive test results
Predictive value = true positives (A) / [true positives (A) + false positives (C)]
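
A minimal sketch computing all three indices from the four categories A–D defined in the TB example; the counts are hypothetical, and note how a low base rate pulls predictive value down even when sensitivity and specificity are high:

```python
# A = true positives, B = true negatives, C = false positives, D = false negatives
A, B, C, D = 40, 900, 50, 10

sensitivity = A / (A + D)       # 40/50   = .80
specificity = B / (B + C)       # 900/950 ~ .95
predictive_value = A / (A + C)  # 40/90   ~ .44 despite good sensitivity/specificity

print(f"sensitivity = {sensitivity:.2f}, specificity = {specificity:.2f}, "
      f"predictive value = {predictive_value:.2f}")
```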

48
Q

What would the ideal test have in terms of sensitivity, specificity, and predictive value?

A

An ideal test would have a high degree of sensitivity and specificity, as well as high predictive value, with a low number of false positives and false negative decisions.

49
Q

Reducing Errors

A

– The more valid the measure or procedure on which decisions are based, the fewer the errors.
– The more comprehensive (larger sample size) the database available on which to make decisions, the fewer the errors.
– Use sequential strategies: use a cheap test first (which will likely yield many false positives), then use a more expensive test to identify the true positives.
– Change the decision rules: admissions officers would rather admit someone who might fail than reject someone who would have succeeded. To do this they lower or waive standards.
The problem is that more people will fail: false negatives are lowered, but false positives are increased.
If you do the opposite and raise the standards, false positives decrease and false negatives increase.
– Type of error we are willing to tolerate
Do you want to let people in who might fail or keep people out who might have succeeded?
It would be fine to let everyone into college and have some not graduate but you wouldn’t want to let just anyone be an astronaut
– Selection ratio
One of the issues that affects our decisions and the kinds of error we tolerate.
Refers to the number of individuals we need to select from the pool of applicants.
If you have 100 scholarships and 100 applicants, you don't even need to look at the applications; but if you have only 2 scholarships, you must be very demanding (which will probably result in a high number of false negatives – rejecting people who would have succeeded).
– Base rate
That is, the naturally occurring frequency of a particular behavior.
When the base rate of the criterion deviates significantly from a 50% split, using a test or procedure with only slight or moderate validity can actually increase errors.
– Sample size
Another aspect that influences validity is the size of the sample studied when a test is validated.
You will recall that whether a correlation coefficient is statistically significant (i.e., different from zero) is a function of the sample size:
with a small sample of N = 10, we would need a correlation of at least .63 to conclude that the two variables are significantly correlated, but with a large sample of N = 150, the correlation would need to be only .16 or larger to reach the same conclusion.
– Validity considered in terms of generalizability
What we have discussed above might be termed the "classical" view of validity.
Currently, validity focuses on validating a test for a specific application, with a specific sample, in a specific setting; it is largely theory-based, and construct validity is rapidly gaining ground as the central method.
If we correlated GPA and SAT scores at one school, we would not expect exactly the same correlation at another school. On the one hand, we expect a certain amount of stability of results across studies; on the other, when we don't obtain such stability, we need to identify the various sources of the differing results.

50
Q

Validity from an individual point of view: Primary Validity

A

Primary validity is basically similar to criterion validity.

If someone publishes a new academic achievement test, we would want to see how well the test correlates with GPA, whether the test can in fact separate honors students from nonhonors students, and so on.

This is called primary because if a test does not have this kind of basic validity, we must look elsewhere for a useful measure.

Similar to criterion validity: how well the test correlates with the criterion.

For example, how well the test predicts how well you will do on a job.

51
Q

Validity from an individual point of view: Secondary Validity

A

If the evidence indicates that a test has primary validity, we move on to secondary validity, which addresses the psychological basis of measurement of the scale.

To obtain information on secondary validity – on the underlying psychological dimension being measured – Gough (1965) suggested four steps:
(1) reviewing the theory behind the test and the procedures and samples used to develop it
(2) analyzing the item content from a logical-clinical point of view (e.g., is a measure of depression made up primarily of items that reflect low self-esteem?)
(3) relating scores on the measure to variables considered important, such as gender, intelligence, and socioeconomic status
(4) obtaining information about what high scorers and low scorers on the scale are like psychologically.

In this sense, secondary validity resembles construct and content validity.