Test Construction Flashcards

1
Q

A test's reliability, or true score variability, cannot be measured directly and must be estimated. In order to estimate a test's reliability, what must be assessed?

A

The consistency of scores over time, across different content samples, and across different scorers. Estimation is based on the assumption that variability that is consistent is true score variability, while variability that is inconsistent reflects measurement (random) error.

2
Q

Most methods that measure reliability produce a reliability coefficient (rxx, a test correlated with itself), which is a correlation coefficient. What range of values can it take? When it is 0, scores are due to? When test-retest reliability is 1, this means that?
If a reliability coefficient is .84, what % of variability in scores is due to true score differences, and what % is due to measurement error?

A

0.0 to 1.0 (0 = all measurement error; 1 = all true score differences)
Measurement error
All variability in scores reflects true score variability
Reliability coefficient = .84: 84% of variability reflects true score differences, while the remaining 16% (1.0 − .84 = .16) reflects measurement error

3
Q

What are the methods for estimating reliability, how does each work, what does each indicate, and what is measurement error due to in each case?

A

Test-retest: administer the test to the same group on two different occasions and correlate the two sets of scores. Indicates the degree of stability (consistency) of examinees' scores over time (the Coefficient of Stability). Measurement error is due to any factors that occur over time: random fluctuations in examinees and variations in the testing situation.

  • Alternate (Equivalent, Parallel) Forms Reliability: two equivalent forms of the test are administered to the same group and the two sets of scores are correlated. Indicates consistency of responding to two test forms (different item samples); when the forms are administered at different times, it also reflects consistency over time. Administered at the same time: Coefficient of Equivalence. Administered at different times: Coefficient of Equivalence and Stability. Measurement error is due to content sampling (the interaction between the examinee's knowledge and the different content sampled by each form). Considered the best estimate of reliability, but it is difficult to create truly equivalent forms.
  • Internal Consistency Reliability: Split-Half Reliability and Coefficient Alpha are the two methods. Both involve administering the test once to a single group, and both yield a coefficient of internal consistency.
  • Split-Half: the half-test correlation is corrected with the Spearman-Brown prophecy formula.
  • Cronbach's Coefficient Alpha: a formula that determines the average degree of inter-item consistency (conceptually, the average of the coefficients obtained from all possible splits; a conservative, lower-bound estimate). When items are scored dichotomously (right or wrong answers), the Kuder-Richardson Formula 20 (KR-20) can be used instead.
  • Error is due to content sampling. For split-half: the examinee's knowledge better matches one half of the test than the other. For coefficient alpha: heterogeneity of the content domain (content error). Internal consistency is most appropriate when the test measures a single characteristic, and it cannot be used for speeded tests.
  • Inter-Rater Reliability (Inter-Scorer, Inter-Observer): used when scores depend on a rater's judgment. Evaluated with a correlation coefficient or percent agreement. For correlation coefficients: the kappa statistic (kappa coefficient, Cohen's kappa) is used with nominal or ordinal scales, and the coefficient of concordance (Kendall's coefficient of concordance) is used when there are three or more raters. Percent agreement: divide the number of items the raters agreed on by the total number of items; a limitation is that agreement may be due to chance alone. Error is due to factors related to the raters, such as consensual observer drift.
4
Q

What factors affect the reliability coefficient?

A

Test length: the longer the test, the larger the reliability coefficient. In addition to its split-half use, the Spearman-Brown formula can be used to estimate the effect of lengthening or shortening a test (see the sketch below).
Range of test scores: reliability is maximized when examinees are heterogeneous and item difficulty is moderate.
Guessing: the easier items are to answer correctly by guessing, the lower the reliability.
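
A minimal sketch of the Spearman-Brown prophecy formula, assuming the standard form r_new = n·r / (1 + (n − 1)·r); the function name and example numbers are illustrative, not from the source:

```python
def spearman_brown(r_xx: float, length_factor: float) -> float:
    """Estimate the reliability of a test whose length is multiplied by
    `length_factor` (2.0 = doubled, 0.5 = halved), given reliability r_xx."""
    return (length_factor * r_xx) / (1 + (length_factor - 1) * r_xx)

# Doubling a test whose reliability is .60 raises the estimate to .75;
# halving it lowers the estimate to about .43.
print(round(spearman_brown(0.60, 2.0), 2))  # 0.75
print(round(spearman_brown(0.60, 0.5), 2))  # 0.43
```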

5
Q

When a predictor's reliability coefficient is .75, its criterion-related validity can be:

A

No greater than the square root of .75.

rxy ≤ √rxx
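
A quick numeric check of this ceiling rule (a sketch; the function name and values are mine):

```python
import math

def validity_ceiling(r_xx: float) -> float:
    """Upper bound on criterion-related validity given reliability r_xx."""
    return math.sqrt(r_xx)

# With a reliability of .75, validity can be no greater than about .87.
print(round(validity_ceiling(0.75), 2))  # 0.87
```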

6
Q

A screening test for a disorder with a low base rate has an overall accuracy rate of 98%. When using this test to identify people in the general population, what is it good to keep in mind?

A

Base rate: the proportion of people who are selected without the predictor (test) and are successful on the criterion (behavior). Positive hit rate: the proportion of people who were selected based on their predictor scores and are successful on the criterion.
Ex.: a psychologist wants to substitute a short screening tool for a lengthier one and wants to determine whether it has adequate incremental validity.
The predictor determines whether a person is a positive or a negative (on the screening tool), and the criterion determines whether he/she is a "true" or a "false" (on the lengthier tool); use a scatterplot.
When the predictor has a high accuracy rate (98%) and the base rate is less than 50%, there will always be a larger number of false positives than false negatives.

7
Q

What are the different methods for establishing validity?

A

Content validity: the degree to which the test adequately samples the content domain or behavior that it measures (e.g., achievement tests); established by experts in the field.
Construct validity: the degree to which the test measures a hypothetical trait (construct).
Criterion-related validity: the ability of the test to predict an examinee's performance on an external criterion.

8
Q

What are some ways that you can test construct validity?

A

Convergent and discriminant validity: correlate test scores with scores on other measures that do and do not assess the same trait (convergent = high correlations with related measures; discriminant = low correlations with unrelated measures).
Factor analysis: another way to obtain information on a test's convergent and discriminant validity.

9
Q

What is a Multitrait-Multimethod Matrix, and what does it assess?

A

A table of correlations that provides information about the degree of association between two or more traits that have been assessed using two or more methods.
It tests a test's convergent and discriminant validity.

10
Q

A supervisor measures two unrelated traits, assertiveness and aggression, using two different methods: a test and supervisor ratings. She calculates the scores and puts them in a ____.
What are the four types of correlation coefficients that will be created, and what do they mean?

A

Multitrait-multimethod matrix
Monotrait-monomethod: a reliability coefficient (a measure correlated with itself). Not a measure of convergent/discriminant validity, but it needs to be high in order to use the matrix.
Monotrait-heteromethod: a large correlation indicates convergent validity.
Heterotrait-monomethod: a small correlation indicates discriminant validity.
Heterotrait-heteromethod: a small correlation indicates discriminant validity.

11
Q

To ensure a work sample has adequate content validity, you would:

A

Make sure that the skills (behaviors) required by the work sample represent the skill domain required by the job.

12
Q

What do you need to construct a confidence interval around an obtained test score?

A

The examinee's test score and the standard error of measurement (which is calculated from the test's standard deviation and reliability coefficient).

13
Q

What is the difference between the standard error of measurement, the standard error of estimate, and the standard deviation?

A

Standard error of measurement: used to construct a confidence interval around an examinee's obtained score; you add and subtract one or more standard errors of measurement to and from the obtained score.
Standard error of estimate: used to construct a confidence interval around an estimated (predicted) criterion score. The 68% confidence interval is constructed by adding/subtracting 1 standard error from the predicted criterion score; the 95% interval, 2 standard errors; the 99.7% interval, 3 standard errors.
Standard deviation: describes the variability of scores in the distribution itself and is used in calculating both standard errors.

14
Q

What is internal and external validity? Is this used in the validity of tests?

A

Internal and external validity refer to the validity of research studies, not tests. A research study has adequate internal validity when its results allow a researcher to conclude that there is a cause-effect relationship between the independent and dependent variables. A research study has adequate external validity when it allows the researcher to generalize conclusions about the cause-effect relationship to other people and conditions.

15
Q

Tests of statistical significance assess?

A

How likely it is that differences between groups are due to sampling error.

16
Q

What are the steps for a factor analysis?

A

Factor analysis tests construct validity. The steps:
administer several tests to a group of examinees;
correlate the scores to convert the data into a correlation matrix;
convert the correlation matrix into a factor matrix;
simplify the interpretation and naming of the factors by rotation (two types of rotation: orthogonal, with factors at a 90-degree angle; oblique, with factors not at a 90-degree angle);
interpret and name the factors.

17
Q

When a test has high sensitivity, it means that:

What is specificity?

A

When the sensitivity is high, this means that most of the people with the disorder will be identified as having the disorder by the test (i.e., there will be few false negatives) but that there will be some people without the disorder who will also be identified as having the disorder (i.e., there will be some false positives).

Specificity is the percent of people who do NOT have the disorder who are accurately identified by the test as not having it.

18
Q

Criterion-related validity is used when?

What is the predictor and what is the criterion?

What are the two forms of criterion related validity?

A

Test scores are to be used to draw conclusions about an examinee's likely status on another measure; e.g., to ensure that an employee selection test can actually predict how well an applicant will do on a measure of job performance after she is hired.
The test is the predictor, and the other measure is the criterion.
Concurrent: criterion data are collected at the same time as, or prior to, the predictor (estimates current status).
Predictive: the criterion is measured some time after the predictor (predicts future performance).

19
Q

What is shared variability and when can you square a correlation coefficient?

A

Shared variability is the proportion of variability that two measures have in common; it is obtained by squaring the correlation between them.
You square a criterion-related validity coefficient to obtain a measure of shared variability.

20
Q

When an exam question gives you the correlation coefficient for variable X and variable Y and asks how much variability in variable Y is explained or accounted for by variable X, what will you do to answer the question correctly?

A

Square the correlation coefficient

21
Q

A psychologist conducts a criterion-related validity study by administering an assertiveness test (predictor) to 100 salespersons and determining their average monthly sales over 3 months (criterion). She correlates their test scores with sales and obtains a validity coefficient of .60. This means that __ of the variability in sales is accounted for by differences in assertiveness, while the remaining __ is due to other factors.

A

36% (.60 × .60 = .36)

64% (1.00 − .36 = .64)

22
Q

Convergent and discriminant (divergent) validity is associated with ___, and concurrent and predictive validity are associated with ___. Concurrent and predictive studies can also assess a predictor's ___, which is ____ and is evaluated by looking at the data on a _____.

A

Construct validity
Criterion-related validity
Incremental validity: the increase in correct decisions that can be expected if the predictor is used as a decision-making tool
Scatterplot

23
Q

What are these?

Communality
Factor loading
principal component
eigenvalue

A

Communality: each test included in a factor analysis has a communality, which indicates the total amount of variability in its scores that has been explained by the factor analysis (i.e., by all of the identified factors).

The factor loading is the correlation between a single test and an identified factor.

In principal components analysis (which is similar to factor analysis), the principal component is equivalent to a factor.

In principal components analysis, the eigenvalue indicates the total amount of variability accounted for by each component (factor).

24
Q

Criterion-related validity refers to the relationship between _______ and a __________ measure.

A

test scores; criterion

25
Q

______________ is a type of construct validity evidence. Construct validity refers to the degree to which a test measures a theoretical construct. Establishing this type of validity may include correlating scores on the test with scores on another test that does not purport to measure the same or a related construct. If the correlations are low, this provides evidence that the test measures the construct it purports to measure (and not the construct the other test purports to measure). (What is this referred to as?)

A

Discriminant validity

26
Q

Content validity refers to the degree to which a test samples the content domain it purports to sample. For example, an algebra test with many questions about calculus has low content validity. Content validity, a concern with most academic tests as well as with the Psychology Licensing Exam, is primarily determined subjectively by whom in the given content domain?

What is construct validity?

A

experts

Construct validity refers to the degree to which a test measures a theoretical construct.

27
Q

An item characteristic curve (ICC) indicates:

A

the relationship between the likelihood that an examinee will endorse the item and the examinee’s level on the attribute measured by the test.
As its name suggests, an item characteristic curve (ICC) provides information about an item's characteristics. It is part of item response theory and is constructed for each item. The curve shows the relationship between an examinee's level on the ability or trait and the probability that she will respond correctly to that item.

28
Q

The internal consistency of test items can be assessed with what? What are the possible sources of error? When is it useful?

The kappa statistic is used to measure what?

Alternate forms reliability provides information on what? And it is used when ___________?

What provides information on the consistency of a test over time?

A

Split-half reliability (with the Spearman-Brown correction), Cronbach's coefficient alpha (a conservative, lower-bound estimate), or the Kuder-Richardson Formula 20 (when test items are scored dichotomously). Error results from content sampling (for split-half, differences in content between the two halves of the test; for coefficient alpha, differences between individual test items, i.e., heterogeneity of test items). Useful when measuring a single characteristic.

Inter-rater reliability, when scores or ratings represent a nominal or ordinal scale of measurement.

The equivalence of the items contained in two different forms of a test; used when two equivalent forms are available.

Test-retest reliability (the coefficient of stability).

29
Q

When would you use the correction for attenuation formula?

A

To estimate what a predictor's criterion-related validity coefficient would be if the predictor and/or the criterion had a reliability coefficient of 1.0.
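
A short sketch of the correction for attenuation, assuming the standard formula r_corrected = r_xy / sqrt(r_xx · r_yy); the example values are hypothetical:

```python
import math

def correct_for_attenuation(r_xy: float, r_xx: float, r_yy: float) -> float:
    """Estimated validity if the predictor (reliability r_xx) and the
    criterion (reliability r_yy) were both perfectly reliable."""
    return r_xy / math.sqrt(r_xx * r_yy)

# Observed validity .40, predictor reliability .75, criterion reliability .80:
print(round(correct_for_attenuation(0.40, 0.75, 0.80), 2))  # 0.52
```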

30
Q

Multiple regression analysis is a

A

A metric statistical technique used to analyze the relationship between a single dependent variable (DV) and several independent variables (IVs).

31
Q

In the context of factor analysis, “specificity” refers to:

A

the proportion of variability in a test that has not been explained by the factor analysis.

32
Q

When points are added to examinees' raw scores, what is important to know about the effect on their percentile ranks?

A

Remember that in a normal distribution, more scores are close to the middle of the distribution (e.g., the 52nd percentile) than at the tails (e.g., the 14th percentile). So adding five points to the raw score of an examinee near the middle of the distribution (e.g., Nancy) lets her surpass more people, because many scores cluster there. In contrast, adding five points to the raw score of an examinee at a tail (e.g., Pete) has less impact on his relative standing, because there are fewer scores at the tail. As a result, Nancy's percentile rank would increase more. (A good mnemonic for remembering this is "more movement in the middle.")

33
Q

What type of test would have the lowest reliability: true-false, multiple-choice, or free recall?

A

A test’s reliability is an index of the degree to which scores on the test are free from random error (chance factors) and indicative of examinees’ true scores. The greater the probability that an examinee can answer an item correctly by guessing (i.e., by chance), the lower a test’s reliability.

On a true-false test, the probability that the correct answer can be selected by chance is 50% (1/2). In other words, there is a higher probability that an examinee can guess a correct answer on a true-false test than on the other tests, and thus, an examinee’s score on a true-false test will reflect, to a greater degree, error rather than his/her true score.

34
Q

One way to assess a test's construct validity is to correlate test scores with scores on measures that do and do not purport to assess the same trait. High correlations with measures of the same or related traits provide evidence of what?
Low correlations with measures of unrelated characteristics provide evidence of what?

A researcher who wants to create a measure of self-esteem might also use measures of self-worth, confidence, social skills, and self-appraisal to assess what type of validity?
To do this, the researcher would use a ______ that will help him/her to systematically organize the data.

A

Convergent validity (high correlations with the same or related traits)
Divergent validity (low correlations with unrelated characteristics)

Convergent

Multitrait-multimethod matrix

35
Q

If a test being validated is a self-rating scale of self-esteem and it has a high correlation with a previously validated parent-rating scale of self-esteem, this would suggest that the test being validated is actually measuring self-esteem. If the researchers are using a multitrait-multimethod matrix, what type of coefficient is this, and what does it indicate?

Within this matrix, which coefficients are reliability coefficients, and should they be large or small?

If the test being validated is a self-rating scale of self-esteem and it has a high correlation with a self-rating scale of neuroticism, this would suggest that the measure of self-esteem may be measuring something other than self-esteem, since these two traits should not correlate. What type of coefficient is this, what does it mean, and what size indicates divergent validity?

A

A large monotrait-heteromethod coefficient provides evidence of a test's convergent validity.

Monotrait-monomethod coefficients (a test correlated with itself) are the reliability coefficients; they need to be large.

Heterotrait-monomethod coefficient: indicates the correlation between two measures that use the same method to assess different traits. A large heterotrait-monomethod coefficient indicates that a test lacks divergent validity, while a small one indicates that it has divergent validity.

36
Q

A test’s communality is interpreted directly as the amount of variability in test scores explained by the identified factors. Therefore, when the communality is .50, this means that ___% of variability in test scores is explained by the identified factors.

A

50

In factor analysis, a test’s communality indicates its “common variance,” or the amount of variability in test scores that is explained by the factors that the test shares in common with the other tests included in the analysis.

37
Q

A factor analysis is another way to obtain information about a test's ___ or ___ validity.

What are the 5 steps to conducting a factor analysis?

The numbers in the last column of the factor matrix are the ______, which indicate a test's "__________."

What indicates the total amount of variability in test scores that is accounted for by all the identified factors and is a component of true score variability (reliability)?

Recall that a reliability coefficient indicates the amount of variability in test scores that represents true score variability. In a factor analysis this is broken into two components: communality and specificity. What is specificity? Is it explained by the factor analysis?

A

Convergent or divergent validity; ultimately, construct validity.

1. Administer a number of measures of the same construct and of different constructs to several groups of examinees.
2. Correlate each test's scores with the scores on every other test to obtain a correlation matrix.
3. Convert the correlation matrix into a factor matrix of factor loadings, which indicate the degree of association between each test and each factor.
4. Simplify by rotating the factors: either orthogonal or oblique.
5. Interpret and name the factors.

Last column: communality, or "common variance": the amount of variability in test scores that is explained by the factors that the test shares in common with the other tests included in the analysis.

  • Communality
  • Specificity: variability that is due to factors unique or specific to the test and not measured by the other tests in the factor analysis. It is not explained by the factor analysis because it is not shared with the other tests.
38
Q

A criterion-related validity coefficient is interpreted like any other correlation coefficient for two variables, i.e., it is _____ to obtain a measure of shared variability.

Thus a validity coefficient of .30 indicates that 9% (.30 squared) of variability is shared by the predictor and criterion; or, put another way, that 9% of variability in criterion scores is explained by variability in predictor scores.

A

SQUARED

39
Q

Inter-rater reliability can be evaluated by correlating the scores assigned by two or more raters (e.g., with the ____) or by calculating __________.

A

kappa statistic

percent agreement

40
Q

What formula is also known as the prophecy formula and is used to estimate the effect of adding or subtracting items on a test's reliability coefficient?

What is used to construct a confidence interval around an examinee's obtained test score? (The confidence interval indicates the range within which the examinee's true score is likely to fall.)
What is the standard error of measurement when the standard deviation is 10 and the test's reliability coefficient is .84? If an applicant received a score of 80, what would the 95% confidence interval for his true score be?

What is the formula for the standard error of estimate? Given an applicant whose predictor score of 80 yields a predicted criterion score of 7,000, a criterion SD of 1,000, and a validity coefficient of .60, how would you construct a confidence interval? In this example, what is the predicted (estimated) criterion score around which the confidence interval is constructed?

A

Spearman-Brown formula

Standard error of measurement

SEM = SD × √(1 - rxx) = 10 × √(1 - .84) = 10 × √.16 = 10 × .4 = 4.0

Think bell curve: 95% corresponds to 2 standard errors from the score, so with SEM = 4 the interval is 80 ± 2(4) = 72 to 88.

Again, to interpret an examinee's true score a confidence interval needs to be constructed; the 95% confidence interval is usually used, and the SEM gives the range within which the true score is likely to fall.

SEE = SD × √(1 - r²)

SEE = 1,000 × √(1 - .60²) = 1,000 × √(1 - .36) = 1,000 × √.64 = 1,000 × .8 = 800. The 68% confidence interval is therefore 7,000 ± 800 (6,200 to 7,800).
7,000, because it is the predicted criterion score that we are estimating.
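
The card's two calculations, reproduced in a small Python sketch (the function names are mine):

```python
import math

def sem(sd: float, r_xx: float) -> float:
    """Standard error of measurement: SD * sqrt(1 - r)."""
    return sd * math.sqrt(1 - r_xx)

def see(sd: float, r_xy: float) -> float:
    """Standard error of estimate: SD * sqrt(1 - r**2)."""
    return sd * math.sqrt(1 - r_xy ** 2)

s = sem(10, 0.84)                    # 4.0
print(s, (80 - 2 * s, 80 + 2 * s))   # 95% interval: 72.0 to 88.0

e = see(1000, 0.60)                  # 800.0
print(e, (7000 - e, 7000 + e))       # 68% interval: 6200.0 to 7800.0
```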

41
Q

In a factor analysis, what is an orthogonal rotation? And an oblique rotation?

A

An orthogonal rotation is used when the variables included in the analysis are believed to be uncorrelated. For example, if you conduct a factor analysis on 50 questionnaire items designed to measure a leader's task- or person-orientation and you believe that these two orientations are independent (uncorrelated), you would perform an orthogonal rotation.

An oblique rotation is used when the variables in the analysis are believed to be correlated. For example, if you conduct a factor analysis on three tests designed to measure verbal ability and three tests designed to measure nonverbal ability, and there's evidence that verbal and nonverbal ability are correlated, you would perform an oblique rotation.

42
Q

The relationship between the likelihood that an examinee will endorse an item and the examinee's level on the attribute measured by the test describes an item's what? What information does the curve provide?

The ICC is part of what theory, which makes up for the shortcomings of classical test theory?

A

Item Characteristic Curve As its name suggests, an item characteristic curve (ICC) provides information about an item’s characteristics.

The curve provides information on the relationship between an examinee’s level on the trait or ability and the probability that he/she will respond correctly to the item.

Item Response Theory

43
Q

In addition to concurrent and predictive validity, criterion-related data can be used to assess incremental validity, the increase in correct decisions that can be expected if the predictor is used as a decision-making tool. To determine incremental validity you first want to create a ____, and the criterion and predictor cutoff scores must be set.
How do you find incremental validity?

A

Scatterplot
Quadrants formed by the criterion and predictor cutoffs:
False Negatives (actually successful, but the predictor says no) ][ True Positives (successful, and the predictor says yes)
________________________
True Negatives (not successful, and the predictor says no) ][ False Positives (not successful, but the predictor says yes)

Incremental validity = positive hit rate − base rate
Positive hit rate = true positives ÷ total positives
Base rate = (true positives + false negatives) ÷ total number of people
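
A small sketch of these formulas using hypothetical counts from the four scatterplot quadrants:

```python
def incremental_validity(tp: int, fp: int, fn: int, tn: int) -> float:
    """Positive hit rate minus base rate, from quadrant counts."""
    total = tp + fp + fn + tn
    positive_hit_rate = tp / (tp + fp)   # of those selected, how many succeed
    base_rate = (tp + fn) / total        # success rate without the predictor
    return positive_hit_rate - base_rate

# 30 true positives, 10 false positives, 20 false negatives, 40 true negatives:
print(incremental_validity(30, 10, 20, 40))  # 0.75 - 0.50 = 0.25
```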

44
Q

Data used to evaluate a predictor's accuracy can be expressed in terms of which 6 validity indexes?

____ and ___ provide information about a predictor's accuracy when administered to a group of individuals who are known to have or not have the disorder of interest.
_____ is the percent of people who do NOT have the disorder and are accurately identified by the predictor as NOT having the disorder.
_____ is the percent of people who have the disorder and are accurately identified by the predictor as having the disorder.

What is it called when the predictor provides the information needed to estimate the probability that people have or do not have the disorder when they test positive or negative on the predictor?

What indicates the extent to which a positive or negative result on the predictor changes the probability that a person has the disorder assessed by the predictor?
The higher the LR+, the ____ the probability that the disorder is present. The closer the LR- is to zero, the ___ the probability that the disorder is present.

A
Sensitivity
Specificity
positive predictive value
negative predictive value
positive likelihood ratio
negative likelihood ratio

Specificity: do NOT have the disorder and are identified as not having it (mnemonic: the doctor specifically said that you do not have the disorder).

Sensitivity: have the disorder and are identified as having it (mnemonic: it makes sense that you have it, since everyone in your family is sick). Sensitivity refers to the proportion of people with the condition who are correctly identified by the test and is calculated by dividing the true positives by the true positives plus false negatives. When sensitivity is high, most of the people with the disorder will be identified as having it by the test (i.e., there will be few false negatives), but some people without the disorder will also be identified as having it (i.e., there will be some false positives).

  • Positive and negative predictive values
  • Positive (LR+) and negative (LR-) likelihood ratios

Greater; lower.

45
Q

In classical test theory, p represents what, and ranges from ____ to _____, with what type of values representing the easier items? What value of p indicates that no examinees answered the question correctly, and what value of p means that all examinees answered the question correctly?

What values of p are retained?

What affects the optimal difficulty level?

What is the preferred difficulty level for true and false tests?

what is the equation for p?

What is item discrimination (D) and how is it calculated? This index ranges from ____ to _____. If all examinees in the upper group and none in the lower group answered the item correctly, then D = ?

According to the Classical Test Theory
X = T + E

A reliability coefficient of .70 indicates that 70% of variability in test scores is due to true score variability, while the remaining 30% is due to measurement error.

To find the standard error of measurement (the estimate of the range within which an examinee's true score is likely to fall, given her obtained score), what formula do you use?

How do you find the communality of a test on a rotated factor matrix when the rotation is orthogonal (the factors are uncorrelated)?

Factor loadings are correlation coefficients indicating the degree of association between each test and each factor. How do you interpret a factor loading to determine the amount of variability in test scores that is accounted for (explained) by that factor?

How do you calculate the standard error of estimate? How do you determine a 95% confidence interval?

A

p is the item difficulty
ranges from 0 to 1.0

Larger values = easier items
p = 0: no examinees answered the item correctly
p = 1.0: all examinees answered it correctly (an easier question)

Items of moderate difficulty (p ≈ .50) are retained.

Guessing affects the optimal difficulty level: the optimal p is halfway between 1.0 and the probability of answering correctly by chance alone.
The goal of testing also determines the optimal level; for selection, the optimal p corresponds to the proportion of examinees to be selected.

For true-false items, the chance of guessing correctly is .50, so the optimal difficulty level is .75, halfway between .50 and 1.0.

p = total number passing the item ÷ total number of examinees

D discriminates between examinees who obtained high and low scores on the entire test. To find it:
D = (proportion of the upper-scoring group passing the item) − (proportion of the lower-scoring group passing the item)
It ranges from −1.0 to +1.0.
D = 1.0

X: examinee’s obtained test score
T: True Score
E: Error Component

SEM = SD × √(1 - r), where r is the reliability coefficient.

Communality: a test's factor loadings can be squared and summed to calculate the communality (the amount of variability in test scores explained by the identified factors).

Square each factor loading: the squared loading on Factor 1 gives the % of variability explained by Factor 1; repeat for Factor 2.

SEE = SD × √(1 - r²)
95% confidence interval: add and subtract 2 SEE around the predicted (estimated) criterion score.
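
A minimal sketch of the item difficulty and item discrimination calculations (function names and numbers are illustrative):

```python
def item_difficulty(num_passing: int, num_examinees: int) -> float:
    """p: the proportion of examinees who answered the item correctly."""
    return num_passing / num_examinees

def item_discrimination(p_upper: float, p_lower: float) -> float:
    """D: proportion correct in the upper-scoring group minus the
    proportion correct in the lower-scoring group (-1.0 to +1.0)."""
    return p_upper - p_lower

print(item_difficulty(50, 100))        # 0.5, a moderate item
print(item_discrimination(1.0, 0.0))   # 1.0, perfect discrimination
```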

46
Q

When do you use the kappa statistic?

A

The kappa statistic is used to measure the consistency of ratings assigned by two raters when data are nominal or ordinal. (Note that some authors use the term "discontinuous" to refer to nominal and other discrete data, i.e., data that represent noncontinuous categories.) Mnemonic: the name (nominal) of your ordinary (ordinal) Kap.
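
A self-contained sketch of Cohen's kappa for two raters, using the standard definition kappa = (p_o − p_e) / (1 − p_e), where p_o is observed agreement and p_e is chance agreement; the ratings below are made up:

```python
from collections import Counter

def cohens_kappa(rater1: list, rater2: list) -> float:
    """Agreement between two raters, corrected for chance agreement."""
    n = len(rater1)
    p_o = sum(a == b for a, b in zip(rater1, rater2)) / n  # observed agreement
    c1, c2 = Counter(rater1), Counter(rater2)
    p_e = sum(c1[k] * c2[k] for k in c1) / (n * n)         # chance agreement
    return (p_o - p_e) / (1 - p_e)

r1 = ["yes", "yes", "no", "yes", "no", "no", "yes", "no"]
r2 = ["yes", "no", "no", "yes", "no", "yes", "yes", "no"]
print(cohens_kappa(r1, r2))  # 0.5 (observed agreement .75, chance .50)
```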

47
Q

Raw scores are often transformed in order to simplify their interpretation, and these transformations can be either linear or nonlinear.

_____________ transformations alter the rank order and relative size of the distance between scores. Percentile ranks (which always have a flat distribution regardless of the distribution of the raw scores) are an example of a nonlinear transformation. Logarithms and square roots are other types of nonlinear transformations.

___________ transformations preserve the rank order and relative size of the distance between scores

A

Nonlinear

Linear

48
Q

The validity of a test may not be _____ than the reliability index.

A test's criterion-related validity coefficient cannot _____ the square root of its reliability coefficient.

A

Higher. For a test to be valid, it has to be reliable; a test that possesses poor reliability cannot be expected to yield high validity.

Exceed.
For criterion-related validity, r is squared to obtain shared variability (variability "accounted for" or "explained by").
A test's reliability always places a ceiling on its validity: when a test has low reliability it cannot have high validity; however, high reliability does not guarantee validity.

49
Q

What is the relationship between communality and a test’s reliability?

A

Communality is variability due to common test factors and is a component of true score variability.
A test's reliability will always be at least as large as its communality; the communality is the lower limit of the test's reliability coefficient.

50
Q
To determine if a student has benefited from an educational program, you would most likely want to determine how much of the information presented in the program has been retained and/or to what degree participation in the program has improved the individual's performance on a task. What score would be most useful?
______ scores (e.g., standard scores, percentile ranks) tell you how well an examinee is doing compared to other examinees.

____________ scores are a type of norm-referenced score.

______ scores indicate the relative strengths of the different characteristics measured by a test for the individual and would be less useful than criterion-referenced scores for the purpose described in the question.

A

Criterion-referenced scores tell you how well an examinee did in absolute terms (e.g., how many questions he or she answered correctly) and, therefore, would be most useful for the purpose described in the question.

Norm-referenced

Standard

Ipsative

51
Q

A percentage score is one type of criterion-referenced score and indicates the percentage of test content that the examinee answered correctly.
This method of score interpretation is often employed in?

What is another type of criterion referenced interpretation?

A

Mastery (criterion-referenced) testing, which involves specifying the terminal level of performance required of all learners and periodically administering the test to assess their mastery.

Another type involves interpreting examinees' test scores in terms of their likely status on an external criterion, using a regression equation or an expectancy table.

52
Q

What is a principal components analysis?
What are the variables called?

Are principal components oblique or orthogonal?

How do you calculate eigenvalues?

A
  • To identify a set of variables (components) that explains all (or nearly all) of the total variance in a set of test scores; i.e., to reduce the number of variables in a data set while preserving as much information as possible. Principal components analysis (PCA) is a dimensionality-reduction method that transforms a large set of variables into a smaller one that still contains most of the information in the large set.

The variables are called principal components (eigenvectors). Each is defined as a linear combination of the set of tests that describes as much of the intercorrelation between the tests as possible. Components are extracted so that the first component accounts for the largest amount of variability in test scores, the second component accounts for the second largest amount, and so on.

Orthogonal: each component accounts for a unique amount of variability in test scores.

To calculate an eigenvalue: sum the squared correlations (loadings) between that component and each test.
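
A small numeric sketch of these two sums, using a hypothetical loading matrix (rows = tests, columns = components):

```python
# Hypothetical factor/component loadings for three tests on two components.
loadings = [
    [0.80, 0.30],   # Test A
    [0.70, 0.40],   # Test B
    [0.20, 0.90],   # Test C
]

# Eigenvalue of a component: sum of squared loadings down its column.
eigenvalues = [sum(row[j] ** 2 for row in loadings) for j in range(2)]

# Communality of a test: sum of its squared loadings across its row.
communalities = [sum(x ** 2 for x in row) for row in loadings]

print([round(v, 2) for v in eigenvalues])    # [1.17, 1.06]
print([round(h, 2) for h in communalities])  # [0.73, 0.65, 0.85]
```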

53
Q

A 200-item test that has been administered to 100 college students has a normal distribution, a mean of 145, and a standard deviation of 12. When the students’ raw scores have been converted to percentile ranks, Alex obtains a percentile rank of 49, while his twin sister Alicia obtains a percentile rank of 90. Whose score will change the most when a raw score of 5 is added?

A

Alex's score will change the most. A problem with percentile ranks is that, when the raw scores are normally distributed, raw score differences near the center of the distribution are exaggerated when they are converted to percentile ranks, while raw score differences at the extremes are reduced. (A useful mnemonic for remembering this is "more movement in the middle.")
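
A sketch of the effect using Python's standard-library normal distribution and the card's mean of 145 and SD of 12; the raw scores are back-solved from the stated percentile ranks:

```python
from statistics import NormalDist

dist = NormalDist(mu=145, sigma=12)

for name, pct in [("Alex", 0.49), ("Alicia", 0.90)]:
    raw = dist.inv_cdf(pct)            # raw score at that percentile
    new_pct = dist.cdf(raw + 5) * 100  # percentile after adding 5 points
    print(f"{name}: {pct * 100:.0f}th -> {new_pct:.0f}th percentile")

# Alex: 49th -> 65th (a 16-point jump near the middle)
# Alicia: 90th -> 96th (only 6 points out in the tail)
```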

54
Q

_______ validity (a type of criterion-related validity) refers to the extent to which test scores correlate with scores on an external criterion

_________ validity is associated with criterion-related validity and refers to the increase in decision-making accuracy that results from use of a predictor

A

Concurrent validity

Incremental: the increase in correct decisions that can be expected if the predictor is used as a decision-making tool. It uses data collected from a concurrent or predictive validity study and involves applying cutoff scores to a scatterplot to determine: true positives (predicted to succeed and successful on the criterion); false positives (predicted to succeed but not successful); true negatives (predicted to be unsuccessful and were unsuccessful); and false negatives (predicted to be unsuccessful but were successful).

55
Q

What is the difference between concurrent and predictive validity?
When are they used?

A

The difference is in timing.

Concurrent: criterion-related data are collected prior to, or at the same time as, the predictor.
Predictive: the criterion is measured some time after the predictor.

Concurrent: used to estimate current status.
Predictive: used to predict future performance.

56
Q

Do you square a criterion-related validity coefficient when determining shared variability?

A

Square

57
Q

Which has higher reliability: true-false, multiple choice with three answer options, or multiple choice with six answer options?

A

All other things being equal, tests containing items that have a low probability of being answered correctly by guessing alone are more reliable than tests containing items that have a high probability of being answered correctly by guessing alone. Of the item types listed, multiple-choice items with six answer options have the lowest probability of being answered correctly by guessing alone.
Multiple choice with six possible answers.

58
Q

An advantage of using the kappa statistic rather than percent agreement when assessing a test’s inter-rater reliability is that the former:

A

The kappa statistic (which is also known as Cohen’s kappa and the kappa coefficient) provides a more accurate estimate of reliability than percent agreement because its calculation includes removing the effects of chance agreement.

59
Q

How do you calculate incremental validity? Base rate?

How do you calculate the positive hit rate?

A

Incremental validity = positive hit rate − base rate

Base rate = (true positives + false negatives) ÷ total number of people

Positive hit rate = true positives ÷ total positives

60
Q

What is the difference between sensitivity and specificity? and what is the purpose? How are they calculated?

Sensitivity and Specificity are two of six validity indexes. What are the other four?

A

Sensitivity and specificity provide information about a predictor's accuracy when administered to a group of individuals who are known to have or not have a disorder.
Sensitivity is the percent of people who HAVE the diagnosis and are accurately identified by the predictor: true positives ÷ (true positives + false negatives), i.e., those above the cutoff. It is the probability that the test will correctly identify people with the disorder from the pool of people who have it.

Specificity is the percent of people who do NOT have the diagnosis and are accurately identified as not having it: true negatives ÷ (true negatives + false positives), i.e., those below the cutoff.

(Mnemonic: sensitive people have the disorder; specific people do not.)

Positive and negative likelihood ratios: the extent to which a positive or negative result on the predictor changes the probability that the person has the disorder assessed by the predictor; both positive and negative results affect that probability.

Positive and negative predictive values: the positive predictive value is the probability that a person identified by the test as having the disorder actually has it: true positives ÷ (true positives + false positives). The negative predictive value is the probability that a person identified by the test as not having the disorder actually doesn't have it: true negatives ÷ (true negatives + false negatives).
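
A sketch computing all six indexes from hypothetical true/false positive and negative counts; the formulas follow the card's definitions, with LR+ = sensitivity / (1 − specificity) and LR- = (1 − sensitivity) / specificity:

```python
def validity_indexes(tp: int, fp: int, fn: int, tn: int) -> dict:
    sensitivity = tp / (tp + fn)   # have the disorder, correctly identified
    specificity = tn / (tn + fp)   # don't have it, correctly identified
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "PPV": tp / (tp + fp),                   # positive predictive value
        "NPV": tn / (tn + fn),                   # negative predictive value
        "LR+": sensitivity / (1 - specificity),
        "LR-": (1 - sensitivity) / specificity,
    }

# Hypothetical screening results: 90 TP, 50 FP, 10 FN, 850 TN.
for name, value in validity_indexes(90, 50, 10, 850).items():
    print(name, round(value, 3))
```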

61
Q

What is cross validation and what causes shrinkage?

What causes Criterion Contamination?

A

Cross-validation: validating a predictor on another sample to eliminate chance factors related to the uniqueness of the original sample. Because the chance factors present in the original group are not present in the second group, the cross-validation coefficient will shrink.
Shrinkage: the smaller the original validation sample, the greater the shrinkage of the validity coefficient when it is cross-validated.
Criterion contamination occurs when a rater's knowledge of a person's predictor scores biases how he/she rates the person on the criterion measure. Criterion contamination artificially inflates the correlation between the predictor and the criterion.

62
Q

A test developer would use the Kuder-Richardson Formula (KR-20) in order to:

A

KR-20 is used to determine a test's internal consistency reliability when test items are scored dichotomously.

63
Q

Content sampling is not a potential source of measurement error for which method for evaluating a test’s reliability?

A

Content sampling refers to the extent to which test scores depend on factors specific to the particular items included in the test (i.e., to its content).
Because test-retest reliability involves administering the same test (i.e., the same content) twice, content sampling is not a source of error.

64
Q

Of the various types of test validity, ________ validity has been described as the broadest category of validity because it overlaps and encompasses all other types.

A
The classical (tripartite) view of validity distinguishes between three major types: content, criterion-related, and construct. From this perspective, construct validity refers to the degree to which a test measures the construct it was designed to measure.
Due to changes in the conceptualization of validity, methods for evaluating validity, and methods of interpreting test scores, construct validity is now described by many experts as a unifying concept of validity.
65
Q

The optimal item difficulty (p) for a true-false test is:

A

.75

66
Q

A screening test for a disorder that has a very low base rate in the population is known to have an overall accuracy rate of 98%. When using this test to identify individuals in the general population who have the disorder, it’s important to keep in mind that the test will produce:

A

This is a difficult question, but you can identify the correct answer if you know that a low base rate means very few people in the population have the disorder, which implies that the most likely predictive error will be to falsely identify those who do not have the disorder as having it.

To see why, assume that the base rate for the disorder is 1% and that you test a random sample of 10,000 people with the screening test. In this situation, 100 people will have the disease, and the test (which has a 98% accuracy rate) will correctly identify 98 of them: 98 true positives and 2 false negatives.

Of the 9,900 people who do not have the disease, the test will correctly identify 9,702: 9,702 true negatives and 198 false positives. In other words, there will be more false positives than false negatives, and this will be true whenever the predictor has a high accuracy rate and the base rate is less than 50%.
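
The card's arithmetic, reproduced as a short sketch:

```python
population = 10_000
base_rate = 0.01    # 1% of the population has the disorder
accuracy = 0.98     # the test classifies 98% of people correctly

have = round(population * base_rate)        # 100 people with the disorder
without = population - have                 # 9,900 people without it

true_positives = round(have * accuracy)     # 98
false_negatives = have - true_positives     # 2
true_negatives = round(without * accuracy)  # 9,702
false_positives = without - true_negatives  # 198

# Far more false positives than false negatives whenever accuracy is high
# and the base rate is below 50%.
print(false_positives, false_negatives)     # 198 2
```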

67
Q

Multiple regression analysis is a

A

A metric statistical technique used to analyze the relationship between a single dependent variable (DV) and several independent variables (IVs).

68
Q

To determine how well an examinee did on a test compared to other examinees, you would use:

A

When using norm-referenced interpretation, an examinee’s test performance is compared to the performance of members of the norm group (other people who have taken the test).

69
Q

The item difficulty index (p) represents which scale of measurement?

A

Ordinal

The item difficulty index (p) is calculated by dividing the number of individuals who answered the item correctly by the total number of individuals.

To understand why the item difficulty index represents an ordinal scale, assume that items 1, 2, and 3 of a test are passed by 10, 20, and 30 percent of examinees, respectively, which will result in p values for these items of .10, .20, and .30. Although these values indicate that item 1 is more difficult than item 2 which, in turn, is more difficult than item 3, it is not possible to say that item 2 is twice as difficult as item 1 or that the difference in difficulty between items 1 and 2 is equal to the difference between items 2 and 3. Moreover, an item difficulty index of 0 would not mean that the item completely lacks difficulty (which doesn't really make any sense). In other words, p values represent an ordinal scale because they do not have the property of equal intervals or an absolute zero point.

70
Q

A criterion has a mean of 100 and a standard deviation of 9. Based on this information, you can conclude that the criterion’s standard error of estimate is between:

A

The standard error of estimate equals 0 when the validity coefficient is equal to 1 (its maximum value) and equals the standard deviation when the validity coefficient is equal to 0 (its minimum value). Therefore, the standard error ranges in value from 0 to the size of the standard deviation (which, in this case, is 9).

(The standard error of estimate is calculated by multiplying the standard deviation of the criterion scores times the square root of 1 minus the validity coefficient squared.)