Ch 6 - Item Statistics Flashcards
Test items
units that make up a test and the means through which samples of test takers' behaviour are gathered
Item analysis
general term that refers to all the techniques used to assess the characteristics of test items and evaluate their quality during the process of test development and test construction
Qualitative item analysis
relies on the judgements of reviewers concerning the substantive/stylistic characteristics of items, as well as their accuracy and fairness
Reviewers typically evaluate:
○ Appropriateness of item content and format to the purpose of the test and the population for which it's designed
○ Clarity of expression
○ Grammatical correctness
○ Adherence to some basic rules for writing items that have evolved over time
Quantitative item analysis
variety of statistical procedures designed to ascertain the psychometric characteristics of items based on the responses obtained from the samples used in the process of test development
Bias in the context of psychometrics
measurement bias: systematic error that enters into scores and affects their meaning in relation to what the scores are designed to measure/predict
Steps of test development
- Generating item pool - creating test items, and their administration/scoring procedures
- Submit the item pool to qualitative analysis by experts
- Revise/replace items that are problematic
- Try out the items on samples that are representative of the intended population
- Evaluate the results through quantitative item analysis
- Add/modify/delete items as needed
- Conduct additional trial administrations to check whether item statistics remain stable across samples (AKA cross-validation)
- Determine the length of the test and the sequencing of items, and the scoring/administration procedures
- Administer the test to a new sample - representative of the population - in order to develop normative data
- Publish the test, along with administration/scoring manual and intended uses, development procedures, standardization data, reliability/validity studies, and materials needed for test administration, scoring and interpretation
**Steps apply mostly to paper-and-pencil tests; for computerized adaptive testing (CAT) the procedures are different - they rely more on item banking
Tests also need to go through this process again when they are revised - due to the changing norms/criteria/Flynn effect mentioned in ch 3
Selected-Response Items
AKA Objective or fixed-response items
Closed-ended in nature - a limited number of alternatives from which the respondent can choose
In ability tests:
• Multiple choice, true-false, ranking, matching
• Usually scored as pass-fail
In personality tests:
• Dichotomous (true-false, yes-no, like-dislike, etc.)
• Polytomous (more than 2 options)
Scaled in terms of degree of acceptance, intensity of agreement, frequency, etc
Forced-Choice items
Respondent needs to choose which option represents them the most or the least
Each of the options represents a construct
Ipsative scores in the context of forced choice
Resulting scores are ipsative in nature: essentially ordinal numbers that reflect test takers' rankings of the constructs assessed by the scales within a forced-choice format test (see the sketch below)
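A minimal sketch of how ipsative scores arise from forced-choice blocks, using hypothetical construct names and made-up ranks (none of these values come from the text):

```python
# Each forced-choice block asks the respondent to rank options, and
# each option is tied to a construct. Accumulating the ranks per
# construct yields ipsative scores: every respondent's total is the
# same constant, so scores only show relative standing.
blocks = [  # hypothetical ranks: 3 = most like me, 1 = least like me
    {"dominance": 3, "sociability": 1, "conscientiousness": 2},
    {"dominance": 2, "sociability": 3, "conscientiousness": 1},
]

totals = {}
for block in blocks:
    for construct, rank in block.items():
        totals[construct] = totals.get(construct, 0) + rank

print(totals)  # {'dominance': 5, 'sociability': 4, 'conscientiousness': 3}
# The grand total (12 here) is identical for every respondent.
```

Because every respondent's scores sum to the same constant, the numbers only rank constructs within a person rather than comparing people on an absolute scale.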
Advantages of Selected-Response Items
• Ease and objectivity of scoring - enhances reliability, saves time
• Make efficient use of testing time
• Can be administered individually, but also collectively
Can be easily transformed into numerical scales - facilitates quantitative analysis
Disadvantages of Selected-Response Items
• Issue of guessing (the chance of a correct guess can be as high as 50% with dichotomous items)
• Similarly, wrong answers can happen due to inattention, haste, etc
• Items can be misleading
• Can be more easily manipulated because of demand characteristics
○ Many personality inventories use validity scales to account for that
• Preparing selected-response items is difficult and requires great skill
○ Carelessly constructed items can include:
§ Options not grammatically consistent with the question
§ Options susceptible to more than one interpretation
§ Options so implausible that they can be easily dismissed
• Selected-response items are less flexible
Constructed-Response Items
AKA free-response items
Variety is limitless - constructed responses may involve writing samples, free oral responses, performances of any kind, and products of all sorts
In ability tests
• Essay questions
• Fill-in-the-blanks
• Thorough instructions and procedural rules are indispensable for standard administration of free-response items
○ Time limits
○ Medium, manner or length of the required response
○ Whether access to materials/instruments is permitted
In personality tests
• Interviews
• Biographical data
• Behavioural observations
• Projective techniques (AKA performance-based measures of personality)
○ Responses to ambiguous stimuli
Respondents can respond freely, revealing aspects of their personality
Advantages of Constructed-Response Items
• Provide richer samples of the behaviour of examinees
• Offer a wider range of possibilities/creative approaches to test/assess
Elicit authentic samples of behaviour
Disadvantages of Constructed-Response Items
• Scoring is more time-consuming and complex because of the subjectivity involved
○ Even with scoring rubrics
• Checking for inter-rater reliability is essential
• Scorers need constant monitoring and thorough training
• Projective responses are even more susceptible to subjective scoring errors
• Because answering takes longer, fewer items can be completed in the same amount of time than with selected-response items
○ Shorter tests are more prone to content sampling errors and produce less consistent scores
○ Lower reliability
Response length can vary - therefore the number of scorable elements also varies
Meaning of “discrimination” in psychometrics
Considered a desirable feature of test items. It refers to the extent to which items elicit responses that accurately differentiate test takers along the dimensions that tests are designed to evaluate
Item validity
most important aspect of quantitative item analysis
• Whether a specific item carries its own weight within a test by eliciting information that advances the purpose of the test
Item discrimination
way to refer to item validity statistics
• Refers to the extent to which an item accurately differentiates among test takers with regard to the trait/behaviour the test is supposed to measure
For ability tests, item analysis for validity includes item validity, discrimination, AND? (2)
Item difficulty
Item fairness
How is Item Difficulty Gauged? (CTT)
At the outset, test specifications prepared by experts in the field can be used as difficulty criteria
Once it’s administered to a group: quantitative indexes can be obtained (normative perspective)
• Using the % of test takers who answer an item correctly (AKA proportion/percentage passing, “p”)
• The higher p, the easier the item is
• p is an ordinal number (like percentile ranks), so it's often converted to a z score
• Once items have z scores, their difficulty can be compared across various groups by administering anchor items (a common set of items) to 2+ groups (see the sketch after this list)
• Formulas to estimate the difficulty of additional items across the groups in question can be derived based on the established relationships among the anchor items - AKA absolute scaling
○ Allows for the difficulty of items to be placed on a uniform numerical scale
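A minimal sketch of the p statistic and one common p-to-z conversion, assuming a small made-up 0/1 response matrix (the data, and the clamping of extreme p values, are illustrative choices):

```python
from statistics import NormalDist

# Rows = test takers, columns = items; 1 = pass, 0 = fail (made-up data)
responses = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 1, 0, 1],
    [1, 1, 0, 0],
]
n_takers = len(responses)

for item in range(len(responses[0])):
    p = sum(row[item] for row in responses) / n_takers  # proportion passing
    # One convention: difficulty as the normal deviate above which a
    # proportion p of the group falls, so harder items get higher z
    p_safe = min(max(p, 0.01), 0.99)  # inv_cdf is undefined at 0 and 1
    z = NormalDist().inv_cdf(1 - p_safe)
    print(f"Item {item + 1}: p = {p:.2f}, z = {z:+.2f}")
```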
Explain this: “For any given group/test, the average score on a test is the same as the average difficulty of its items”
- Ex: in a classroom test designed to evaluate how much of the content students grasped, there will be items that everyone gets (p = 1), others that the average student gets (p = 0.7), and very few, if any, that no student gets (p = 0), so that the average grade will be around 0.7-0.8
- In a test designed to identify the top 10% of students, we expect most items to have a p value of around 0.1, so that the average score will be around 0.1 (a quick numerical check of this identity follows)
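A quick numerical check of the identity with made-up 0/1 data: the mean proportion-correct score across test takers equals the mean of the item p values:

```python
responses = [  # rows = test takers, columns = items (made-up data)
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [0, 1, 1],
]
n, k = len(responses), len(responses[0])

# Average score, expressed as a proportion of items passed
mean_score = sum(sum(row) / k for row in responses) / n
# Average item difficulty (mean of the item p values)
mean_p = sum(sum(row[j] for row in responses) / n for j in range(k)) / k

print(mean_score, mean_p)  # both 0.666... - the two averages coincide
```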
Distractors
the incorrect alternatives in multiple choice items
Can have great influence on item difficulty:
• The number of distractors influences the probability of guessing right/wrong
The plausibility of the distractors to test takers who don't know the right answer significantly influences the difficulty of the item
Analyses of distractors need to be conducted:
• Proportion of time respondents choose each distractor
• To detect possible flaws and eventually replace the ones that don’t work correctly
If a distractor is never chosen, or is chosen more often than the right answer, it's not working (a counting sketch follows)
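A minimal counting sketch of a distractor analysis, assuming made-up responses to a single four-option item whose key is "B":

```python
from collections import Counter

choices = ["B", "C", "B", "A", "B", "D", "B", "C", "B", "A",
           "B", "B", "C", "B", "A", "B", "D", "B", "B", "C"]
key = "B"

counts = Counter(choices)
for option in "ABCD":
    share = counts[option] / len(choices)
    print(f"{option}: {share:.0%}" + (" (key)" if option == key else ""))

# Red flags: a distractor chosen by ~0% of respondents (too implausible
# to attract anyone) or chosen more often than the key (misleading, or
# possibly a miskeyed item).
```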
Is Item Difficulty a Relevant Concept in Personality Testing?
- Selected response: The ability of the test takers to understand the items (reading and vocabulary abilities) must be taken into consideration so that their answers are more truthful
- Projective tasks: require a certain proficiency in the mode of answering (talking/writing)
Item Validity
Refers to the extent to which items elicit responses that accurately differentiate test takers in terms of the behaviours, knowledge, or other characteristics that a test is designed to evaluate
• Discriminating power: the most basic quality of test items
• Validity indexes/indexes of item discrimination - obtained using some criterion of the test takers' standing on the construct that the test assesses. The criterion can be:
○ Internal criteria (ex: total score on the test) - increase the homogeneity of the test (increase reliability due to interitem consistency)
§ Often used for tests evaluating a single construct/trait
§ Based on the assumption that all test items should correlate highly with the construct of interest, and with each other
○ External criteria (ex: age, education, diagnosis, etc.) - increase score validity
§ Often used for tests evaluating many different aspects/constructs
§ The correlation between the items and test scores is not expected to be high
○ A combination of both
Index of discrimination statistic (D) (CTT)
○ Used for hand calculations when a computer is not accessible
○ For validity of items
○ Mainly applied to pass/fail items in ability tests, but other types of binary scoring are also possible
○ Test takers must be divided into criterion groups based on scores or an external criterion
§ Usually the top and bottom thirds are taken as the groups to be compared
§ D is the difference between the % of test takers in the upper and the lower criterion groups who pass a given item (see the sketch after this list)
○ Can range from +100 to -100 (or, as proportions, from +1 to -1)
§ Positive D indicates that more individuals in the upper criterion group than in the lower passed the item (the most desirable values of D are those closest to +1)
§ Negative D indicates that the item discriminates in the opposite direction and needs to be fixed/discarded
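A minimal sketch of computing D, assuming made-up (total score, item result) pairs and using the top and bottom thirds as the criterion groups:

```python
data = [  # (total_score, item_passed) - made-up values
    (95, 1), (90, 1), (88, 0), (85, 1), (80, 0), (78, 1),
    (75, 0), (70, 1), (65, 0), (60, 0), (55, 0), (50, 0),
]
data.sort(key=lambda pair: pair[0], reverse=True)

third = len(data) // 3
upper, lower = data[:third], data[-third:]

p_upper = sum(passed for _, passed in upper) / len(upper)
p_lower = sum(passed for _, passed in lower) / len(lower)

D = p_upper - p_lower  # -1 to +1 (or -100 to +100 as percentages)
print(f"D = {D:+.2f}")  # here +0.75: the item discriminates well
```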
2 other correlational indexes (other than D) to measure item validity
○ Most widely used classical test theory methods for expressing item validity
○ The type of coefficient chosen depends on the nature of the 2 variables that are to be correlated (AKA the item scores and the criterion measures)
§ When item scores are dichotomous and the criterion measure is continuous - the point-biserial correlation (rpb) is best
§ When item and criterion measures are both dichotomous - phi coefficient is best
Both of these can range from -1 to +1 and are interpreted the same way as a Pearson r (both are computed in the sketch below)
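A sketch of both coefficients with made-up data: the point-biserial computed as a Pearson r between a 0/1 item and a continuous criterion, and phi computed from the cell counts of a 2x2 table:

```python
import math

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Point-biserial: Pearson r between a dichotomous item (0/1) and a
# continuous criterion (here, made-up total scores)
item = [1, 1, 0, 1, 0, 0, 1, 0]
totals = [38, 35, 20, 30, 18, 22, 33, 25]
print(f"r_pb = {pearson(item, totals):+.2f}")

# Phi: both variables dichotomous; a, b, c, d are the four cell
# counts of the 2x2 table (hypothetical values)
a, b, c, d = 12, 3, 4, 11
phi = (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
print(f"phi = {phi:+.2f}")
```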
3 types of tests regarding speed
Speed here refers to speed of performance
Tests can be classified in 3 types
• Pure speed tests
○ Simply measure the speed with which test takers can perform a task
○ Difficulty is manipulated mainly through timing
○ Score is often the number of items completed in the allotted time
• Pure power tests
○ Have no time limits
○ Difficulty is manipulated by increasing or decreasing the complexity of items
○ Items are in ascending order of difficulty
○ Only the best respondents can answer all items
• Tests that blend speed and power
In any test that’s closely timed, the p value is a function of the position of items within the test rather than of their intrinsic difficulty/validity
Item-test regression
To construct one, calculate the proportion of individuals at each total-score level who passed a given item
Item-test regression graphs combine info on both item difficulty and item discrimination - they allow one to visualize how each item functions within the group that was tested (see the sketch below)
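A minimal sketch of the computation behind an item-test regression, with made-up (total score, item result) pairs; plotting proportion passing against total score gives the regression graph:

```python
from collections import defaultdict

records = [  # (total_score, item_passed) - made-up values
    (3, 0), (3, 0), (4, 0), (4, 1), (5, 0), (5, 1),
    (6, 1), (6, 1), (7, 1), (7, 1), (8, 1), (8, 1),
]

by_score = defaultdict(list)
for total, passed in records:
    by_score[total].append(passed)

for total in sorted(by_score):
    prop = sum(by_score[total]) / len(by_score[total])
    print(f"total score {total}: {prop:.0%} passed")

# A curve that rises with total score indicates the item discriminates:
# higher-scoring test takers pass it more often.
```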
Item Response Theory
Variety of models that can be used to design/develop new tests and to evaluate existing ones
• IRT differs from classical test theory in:
○ The mathematical formulas the models employ
○ The number of item characteristics they account for
○ The number of trait/ability dimensions they specify as the objective of measurement
○ The methods used, which differ depending on whether items are dichotomous or polytomous (a 2PL sketch follows)
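The chapter doesn't commit to a particular model, but a minimal sketch of one widely used option, the two-parameter logistic (2PL) for dichotomous items, shows the kind of formula involved (parameter values are illustrative):

```python
import math

def p_correct(theta, a, b):
    """2PL: probability of a correct response given ability theta,
    item discrimination a, and item difficulty b."""
    return 1 / (1 + math.exp(-a * (theta - b)))

# Two hypothetical items of equal difficulty but different discrimination
for theta in [-2, -1, 0, 1, 2]:
    steep = p_correct(theta, a=2.0, b=0.0)  # highly discriminating
    flat = p_correct(theta, a=0.5, b=0.0)   # weakly discriminating
    print(f"theta {theta:+}: steep item {steep:.2f}, flat item {flat:.2f}")
```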