Chapter 2: Test Construction, Administration, and Interpretation Flashcards
How are tests constructed? (1) Identify the Need
This could mean measuring something not yet tested or improving an existing test.
How are tests constructed? (2) The Role of Theory
All tests are influenced or guided, implicitly or explicitly, by the theory or theories held by the test constructor.
A theory might yield some specific guidelines.
Example: if a researcher thought that depression was a disturbance in four specific areas (self-esteem, sleep quality, etc.), then that view would dictate the kind of test they build to measure depression.
The theory may also be less explicit and not well formalized. The creation of a test is intrinsically related to the person doing the creating and to their theoretical views.
Even tests said to be empirically developed (based on observations of real-life behaviors) can be influenced by theory.
How are tests constructed? (3) Practical Choices
What format will the items have?
Will they be true or false, multiple choice, or on a rating scale?
Will my instruments be designed for group administration?
How are tests constructed? (4) Pool of Items
The next step is to develop a table of specifications, much like the blueprint needed to construct a house. This table of specifications would indicate the subtopics to be covered by the proposed test.
The table of specifications may reflect the researcher's thinking, theoretical notions in the current literature, other tests on the topic, and the views of other experts.
The table of specifications can be formal, informal, or absent; when present, it guides the writing of the items.
The items on a test reflect the constructor's creativity or draw on other researchers and the literature. Writing good test questions is both a science and an art. Professionals know that they need to write an initial pool of items four or five times larger than the number they actually need.
How are tests constructed? (5) Tryouts and Refinement
The initial pool of items will probably be large and rather unrefined.
The intent of this step is to refine the pool of items to a smaller but usable pool.
Pilot testing is used: a preliminary form is administered to a sample of subjects to determine whether there are any glitches.
We may also do some preliminary statistical work and assemble the test for a trial run called a pretest.
Administer the test to two different groups and carry out item analyses to see which items in fact differentiate the two groups.
Retain the best items and then perform a content analysis, sorting them by category to determine which categories have too many questions and which have too few.
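One common way to carry out such an item analysis is to compare how often each item is passed by high scorers versus low scorers. The sketch below is a hypothetical illustration rather than the text's exact procedure: it simulates dichotomously scored items and contrasts the upper and lower 27% of total scorers; the simulated data, the 27% split, and the 0.20 cutoff are all illustrative assumptions.

```python
# A rough, hypothetical item-discrimination check on simulated 0/1 item responses.
import numpy as np

rng = np.random.default_rng(0)
ability = rng.normal(size=(200, 1))                     # simulated examinee ability
difficulty = rng.normal(size=(1, 12))                   # simulated item difficulty
p_correct = 1 / (1 + np.exp(-(ability - difficulty)))   # simple logistic response model
responses = (rng.random((200, 12)) < p_correct).astype(int)

total = responses.sum(axis=1)
cut_hi, cut_lo = np.quantile(total, [0.73, 0.27])       # classic upper/lower 27% split
upper = responses[total >= cut_hi]
lower = responses[total <= cut_lo]

for i in range(responses.shape[1]):
    # Discrimination index: proportion passing the item in the upper group
    # minus the proportion passing it in the lower group.
    d = upper[:, i].mean() - lower[:, i].mean()
    print(f"item {i + 1:2d}: discrimination = {d:+.2f} "
          f"({'keep' if d >= 0.20 else 'review'})")
```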
How are tests constructed? (6) Reliability and Validity
We need to establish that our measuring instrument is reliable (that is, consistent) and that it measures what we set out to measure (that is, the test is valid).
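Reliability is typically checked statistically. As one hedged example, the sketch below computes Cronbach's alpha, a common internal-consistency estimate, on invented data; the formula is alpha = (k / (k - 1)) * (1 - sum of item variances / variance of total scores).

```python
# A minimal sketch of one internal-consistency reliability estimate (Cronbach's alpha),
# computed on invented data; `scores` is a respondents-by-items array.
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    k = scores.shape[1]                          # number of items
    item_vars = scores.var(axis=0, ddof=1)       # sample variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)   # variance of the total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

rng = np.random.default_rng(1)
trait = rng.normal(size=(150, 1))                        # shared "trait" signal
scores = trait + rng.normal(scale=0.8, size=(150, 8))    # 8 items = trait + noise
print(f"alpha = {cronbach_alpha(scores):.2f}")           # high, since items share the trait
```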
How are tests constructed? (7) Standardization and Norms
We need to standardize the instrument and develop norms. To standardize means that the administration, time limits, scoring procedures, and so on are all carefully spelled out so that no matter who administers the test, the procedure is the same.
Raw scores in psychology are often meaningless. We need to give meaning to raw scores by transforming them into derived scores (see the sketch at the end of this card).
We also need to be able to compare an individual’s performance on a test with the performance of a group of individuals; that information is what we mean by norms.
Simply because a sample is large does not guarantee that it is representative. The sample should be representative of the population to which we generalize.
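For example, a raw score can be converted to a z-score or T-score relative to a normative sample. The sketch below uses invented numbers; in practice the normative mean and standard deviation come from the standardization sample.

```python
# A minimal sketch of deriving z- and T-scores from a raw score, using an
# invented normative sample; real norms come from the standardization sample.
import numpy as np

norm_sample = np.array([34, 41, 38, 45, 29, 40, 37, 44, 31, 39])  # hypothetical norm-group raw scores
norm_mean, norm_sd = norm_sample.mean(), norm_sample.std(ddof=1)

raw = 45                                   # one examinee's raw score
z = (raw - norm_mean) / norm_sd            # z-score: standing relative to the norm group
t = 50 + 10 * z                            # T-score: rescaled to mean 50, SD 10
print(f"z = {z:.2f}, T = {t:.1f}")
```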
How are tests constructed? (8) Further Refinements
Sometimes the changes reflect additional scientific knowledge, and sometimes societal changes, as in our greater awareness of gender bias in language.
One type of revision that often occurs is the development of a short form of the original test.
Typically, a different author takes the original test, administers it to a group of subjects, and shows by various statistical procedures that the test can be shortened without any substantial loss in reliability and validity.
Psychologists and others are always on the lookout for brief instruments, and so short forms often become popular, although as a general rule, the shorter the test the less reliable and valid it is.
Still another type of revision that occurs fairly frequently comes about through factor analysis.
The factor analysis will tell you whether all the items on the test are useful or whether some should be thrown out because their contribution is minimal. It will also tell you whether different aspects of the test should be scored together or separately (see the sketch at the end of this card).
Finally, there are a number of tests that are multivariate; that is, the test is composed of many scales.
The pool of items that comprises the entire test is considered to be an “open system” and additional scales are developed based upon arising needs.
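As an illustration of that factor-analytic check, here is a hypothetical sketch using scikit-learn's FactorAnalysis on simulated responses with a built-in two-factor structure; the sample size, noise level, and item groupings are assumptions, and the sketch omits the rotations that dedicated factor-analysis software usually applies.

```python
# A minimal sketch of inspecting item factor loadings with exploratory factor analysis;
# the data are simulated with a known two-factor structure for illustration only.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(2)
f1 = rng.normal(size=(300, 1))          # simulated underlying factor 1
f2 = rng.normal(size=(300, 1))          # simulated underlying factor 2
items = np.hstack([
    f1 + rng.normal(scale=0.7, size=(300, 4)),   # items 1-4 load on factor 1
    f2 + rng.normal(scale=0.7, size=(300, 4)),   # items 5-8 load on factor 2
])

fa = FactorAnalysis(n_components=2, random_state=0).fit(items)
loadings = fa.components_.T             # rows = items, columns = factors
for i, row in enumerate(loadings, start=1):
    # Items with uniformly small loadings contribute little (candidates for removal);
    # items loading on different factors may belong on separate scales.
    print(f"item {i}: loadings = {np.round(row, 2)}")
```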
What to avoid when writing test items
Biased Questions
Loaded Questions
Double-barreled Questions
Jargon
Double Negatives
Poor answer scale options
Biased Questions
Leading questions that sway people to answer one way or another.
Example: How great is our hard-working customer service team?
Loaded Questions
Contains an assumption about a person's habits or perceptions.
Example: Where do you like to go to happy hour after work?
Double-barreled Questions
Asks multiple questions within one item.
Example: Was the product easy to find and did you buy it?
Jargon
An item includes words, phrases, or acronyms that the person is not familiar with or doesn't understand.
Example: The product helped me meet my OKRs.
Double Negatives
Avoid phrasing that combines two negatives; it is confusing and easy to misread.
Example: I don’t scarcely buy items online.
Poor answer scale options
Make sure your answer scales match the content of your items. They should not be confusing or unbalanced.
Example: How easy was it for you to complete the exam on time?
Answer: Yes | No
Types of Items
- Multiple-choice
- True-false
- Analogies
- Odd-man-out
- Sequences
- Matching
- Completion
- Fill-in-the-blank
- Forced choice items
- Vignettes
- Rearrangement or continuity
What are the incorrect items on a multiple choice test called?
Distractors
What are the correct items on a multiple choice test called?
Keyed response
What is the keyed response on tests with no definitive answer?
In tests that assess mental health, where there is no correct answer, the keyed response is the response that reflects what the test assesses. If you are measuring depression, then the keyed response will be the choice that correlates with depression (e.g., "I feel withdrawn from others").
What are the advantages of a multiple choice test?
They can be answered quickly, so the test can include more items, and they can be scored quickly and inexpensively.
What are the disadvantages of a multiple choice test?
They are better at assessing factual knowledge than problem-solving.
When is the best time to use true or false?
when there is no right answer, as with self-report personality statements answered true or false
Where are analogies usually found?
in tests of intelligence
What are matching tests good at?
assessing factual knowledge
What is a disadvantage of matching tests?
mismatching one item can affect other items and thus the questions are not independent
Where are completion tests usually found?
on personality tests
Where are forced-choice tests usually found?
personality tests
The respondent has to pick one of a few options (e.g., "I would rather spend time alone" vs. "I would rather spend time with friends").
What is a vignette?
A brief scenario, like the synopsis of a play or novel.
The subject is asked to react in some way to the vignette, perhaps by providing a story completion, choosing from a set of alternatives, or making some type of judgment.
What are the two categories of items?
Constructed-response items: subject is presented with a stimulus and produces a response
Example: essay exams or sentence completion
Selected-response items: subject selects the correct or best response from a list of options
Example: multiple choice
Objective test formats
One single response is labeled as “correct.”
Subjective test formats
There is not one single answer or response that is labeled as "correct."
How to decide which Item Format to Use?
Try to increase variation
If it is multiple choice, then offer several response options, such as "strongly agree, agree, undecided, disagree, strongly disagree."
Use more items: a 10-item test scored 0 or 1 can yield scores ranging from 0 to 10; if each item is instead scored on a 1-to-5 scale, raw scores can range from 10 to 50.
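The range arithmetic above, spelled out with the same (purely illustrative) numbers:

```python
# Raw-score ranges for a 10-item test under two scoring schemes (numbers from the example).
n_items = 10
print("true-false, scored 0/1:", n_items * 0, "to", n_items * 1)    # 0 to 10
print("rating scale, scored 1-5:", n_items * 1, "to", n_items * 5)  # 10 to 50
```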
Sequencing of Items
One plan is to use a spiral omnibus format, which involves a series of items from easy to difficult, followed by another series of items from easy to difficult, and so on (see the sketch at the end of this card).
Some scales contain filler items that are not scored but are designed to “hide” the real intent of the scale
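A minimal sketch of building a spiral omnibus ordering, assuming each item already has a difficulty estimate; the item labels, difficulty values, and number of cycles are invented for illustration.

```python
# Arrange items in a spiral omnibus format: repeated cycles, each running easy -> difficult.
# Item difficulties (higher = harder) are invented for illustration.
difficulties = {"A": 0.2, "B": 0.5, "C": 0.8, "D": 0.3, "E": 0.6, "F": 0.9}

ordered = sorted(difficulties, key=difficulties.get)   # easiest to hardest
n_cycles = 2
# Deal every n_cycles-th item into a cycle so each cycle itself runs easy -> difficult.
cycles = [ordered[i::n_cycles] for i in range(n_cycles)]
spiral = [item for cycle in cycles for item in cycle]
print(spiral)   # ['A', 'B', 'C', 'D', 'E', 'F']: two easy-to-hard cycles
```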