Chapter 2: Test Construction, Administration, and Interpretation Flashcards

1
Q

How are tests constructed? (1)

A

Identify the Need
This could mean testing something not yet tested or improving an existing test.

2
Q

How are tests constructed? (2) The Role of Theory

A

All tests are implicitly or explicitly influenced or guided by the theory or theories held by the test constructor.

A theory might yield some specific guidelines

Example: if a researcher thought that depression was a disturbance in four specific areas (self-esteem, sleep quality, etc.), then this would dictate the test they make to measure depression.

The theory may also be less explicit and not well formalized. The creation of a test is intrinsically related to the person doing the creating and to their theoretical views.

Even tests said to be empirically developed (based on observations of real-life behaviors) can be influenced by theory.

3
Q

How are tests constructed? (3) Practical Choices

A

What format will the items have?

Will they be true or false, multiple choice, or on a rating scale?

Will my instruments be designed for group administration?

4
Q

How are tests constructed? (4) Pool of Items

A

The next step is to develop a table of specifications, much like the blueprint needed to construct a house. This table of specifications would indicate the subtopics to be covered by the proposed test.

The table of specifications may reflect the researcher's thinking, theoretical notions in the current literature, other tests on the topic, and the thoughts of other experts.

The table of specifications can be formal, informal, or absent; when present, it guides the writing of the items.

The items on a test may reflect the constructor's creativity or draw on other researchers and the literature. Writing good test questions is both a science and an art. Professionals know that they need to write a pool of items 4 or 5 times greater than the number they actually need.

5
Q

How are tests constructed? (5) Tryouts and Refinement

A

The initial pool of items will probably be large and rather unrefined.

The intent of this step is to refine the pool of items to a smaller but usable pool.

Pilot testing is used: a preliminary form is administered to a sample of subjects to determine whether there are any glitches.

We may also do some preliminary statistical work and assemble the test for a trial run called a pretest.

Administer the test to two different groups and carry out item analyses to see which items in fact differentiate the two groups.

Keep the best items and then perform a content analysis in which you sort them to determine which categories have too many and which have too few questions.

6
Q

How are tests constructed? (6) Reliability and Validity

A

We need to establish that our measuring instrument is reliable, that is, consistent, and that it measures what we set out to measure, that is, that the test is valid.

7
Q

How are tests constructed? (7) Standardization and Norms

A

We need to standardize the instrument and develop norms. To standardize means that the administration, time limits, scoring procedures, and so on are all carefully spelled out so that no matter who administers the test, the procedure is the same.

Raw scores in psychology are often meaningless. We need to give meaning to raw scores by changing them into derived scores.

We also need to be able to compare an individual’s performance on a test with the performance of a group of individuals; that information is what we mean by norms.

Simply because a sample is large does not guarantee that it is representative. The sample should be representative of the population to which we generalize.

8
Q

How are tests constructed? (8) Further Refinements

A

Sometimes the changes reflect additional scientific knowledge, and sometimes societal changes, as in our greater awareness of gender bias in language.
One type of revision that often occurs is the development of a short form of the original test.
Typically, a different author takes the original test, administers it to a group of subjects, and shows by various statistical procedures that the test can be shortened without any substantial loss in reliability and validity.
Psychologists and others are always on the lookout for brief instruments, and so short forms often become popular, although as a general rule, the shorter the test the less reliable and valid it is.
Still another type of revision that occurs fairly frequently comes about by factor analysis.
The factor analysis will tell you if all the items on the test are useful or if some should be thrown out because their contribution is minimal. It will also tell you if different aspects of the test should be scored together or separately.
Finally, there are a number of tests that are multivariate, that is, the test is composed of many scales.
The pool of items that comprises the entire test is considered an "open system," and additional scales are developed based on arising needs.

9
Q

What to avoid when writing test items

A

Biased Questions
Loaded Questions
Double-barreled Questions
Jargon
Double Negatives
Poor answer scale options

10
Q

Biased Questions

A

Leading questions that sway people to answer one way or another.
Example: How great is our hard-working customer service team?

11
Q

Loaded Question

A

Contains an assumption about a person's habits or perceptions.
Example: Where do you like to go to happy hour after work?

12
Q

Double-barreled Questions

A

Asks multiple questions within one item.
Example: Was the product easy to find and did you buy it?

13
Q

Jargon

A

An item includes words, phrases, or acronyms that the person is not familiar with or doesn't understand.
Example: The product helped me meet my OKRs.

14
Q

Double Negatives

A

You need to use proper grammar and avoid double negatives.
Example: I don’t scarcely buy items online.

15
Q

Poor answer scale options

A

Make sure your answer scales match the content of your items. They should not be confusing or unbalanced.
Example: How easy was it for you to complete the exam on time?
Answer: Yes | No

16
Q

Types of Items

A
1. Multiple-choice
2. True-false
3. Analogies
4. Odd-man-out
5. Sequences
6. Matching
7. Completion
8. Fill-in-the-blank
9. Forced choice items
10. Vignettes
11. Rearrangement or continuity
17
Q

What are the incorrect items on a multiple choice test called?

A

Distractors

18
Q

What are the correct items on a multiple choice test called?

A

Keyed response

19
Q

What is the keyed response on tests with no definitive answer?

A

In tests that assess mental health, where there is no correct answer, the keyed response is the one that reflects what the test assesses. If you are measuring depression, then the keyed response will be the choice that correlates with depression, e.g., "I feel withdrawn from others."

20
Q

What are the advantages of a multiple choice test?

A

Can be answered quickly, so the test can include more items; can be scored quickly and inexpensively.

21
Q

What are the disadvantages of a multiple choice test?

A

Better at assessing factual knowledge than problem-solving.

22
Q

When is the best time to use true or false?

A

when there is no right answer

23
Q

Where are analogies usually found?

A

in tests of intelligence

24
Q

What are matching tests good at?

A

assessing factual knowledge

25
Q

What is a disadvantage of matching tests?

A

mismatching one item can affect other items and thus the questions are not independent

26
Q

Where are completion tests usually found?

A

on personality tests

27
Q

Where are forced choice tests usually found?

A

personality tests

Respondents have to pick one of a few options (e.g., "I would rather spend time alone" vs. "I would rather spend time with friends").

28
Q

What is a vignette?

A

A brief scenario, like the synopsis of a play or novel.

The subject is asked to react in some way to the vignette, perhaps by providing a story completion, choosing from a set of alternatives, or making some type of judgment.

29
Q

What are the two categories of items?

A

Constructed-response items: subject is presented with a stimulus and produces a response
Example: essay exams or sentence completion

Selected-response items: subject selects the correct or best response from a list of options
Example: multiple choice

30
Q

Objective test formats

A

One single response is labeled as “correct.”

31
Q

Subjective test formats

A

There is not one single answer or response that is labeled as "correct."

32
Q

How to decide which Item Format to Use?

A

Try to increase variation

If it is multiple choice, then have many choices, such as "strongly agree, agree, undecided, disagree, strongly disagree."

Use more items: a 10-item test scored right/wrong can yield scores ranging from 0 to 10; if each item is instead scored 1 to 5, raw scores can range from 10 to 50.

33
Q

Sequencing of Items

A

One plan is to use a spiral omnibus format, which involves a series of items from easy to difficult, followed by another series of items from easy to difficult, and so on.

Some scales contain filler items that are not scored but are designed to “hide” the real intent of the scale

34
Q

Direct or Performance Assessment

A

“Authentic”
Direct measurement of the product or performance generated.

If we wanted to test the competence of a football player we would not administer a multiple-choice exam, but would observe that person’s ability to play football.

35
Q

How do we know when an item is working? (Philosophical Issues)

A
  1. By fiat
  2. Criterion-keyed tests
  3. Factor analysis
36
Q

Fiat

A

A decree on the basis of authority

Claiming that your test effectively measures depression because you are an expert on depression and the content of the items clearly relates to the subject

Examples: the Beck Depression Inventory and the Stanford-Binet test of intelligence.

37
Q

Criterion-Keyed Tests

A
38
Q

Factor analysis

A
39
Q

Test Administration

A

Standardization is important.
We want to minimize all influences that contribute to error variance and may decrease test validity.

40
Q

Examiners and their tasks

A

should prepare in advance of administration

• Memorizing or familiarizing self with instructions
• Preparation of test materials
• Layout of necessary materials
• Checking and calibration of equipment
• Before administering individual testing unsupervised, the examiner should complete supervised training, including demonstration and practice sessions
• If testing is to be administered in a group setting with multiple examiners, then a briefing of examiners should be completed beforehand to assign what functions each will perform

41
Q

Testing Conditions

A

• Testing environment should be standardized
• Suitable testing room
• Free from undue noise and distraction
• Adequate lighting, seating, ventilation, and workspace
• Prevent interruptions
• Desks and chairs can make a difference
• Type of answer sheet
• Medium of administration (paper and pencil, computer)
• Is examiner familiar or a stranger?
• Manner of the examiner (e.g., smiling, nodding, making positive comments)
• Presence of the examiner in the room (projective tests) or other people

42
Q

Rapport

A

the "bond" between the examiner and the test taker
• Rewards should be consistent across respondents
• Will vary with the test, age of the respondents, group versus individual testing, personalities of respondents, and special difficulties of respondents
• Reassure respondents at the outset
• Eliminate elements of surprise
• With adults, "sell" the purpose of the test and that it is in their best interest to do their best, to reduce faking and encourage frank reporting (personality)

43
Q

Examiner and Situational Variables

A

• More likely with projective tests and individual intelligence tests
• Children are more susceptible
• Studies have examined examiner age, sex, ethnicity, professional or socioeconomic status, training, appearance, and personality. Results are inconclusive
• Examiner's behavior preceding and during administration
• Interactions with examiner
• Examiner expectations
• Timing of the test (e.g., military recruits testing shortly after induction)
• Test taker's activities shortly prior to the test (e.g., emotional disturbance, fatigue, success or failure)
• Effects of feedback

44
Q

Derived Scores

A

Relates the position of a raw score either to
• Other scores in the same distribution
• The distribution of raw scores obtained by a representative group (the norm group), yielding test norms

45
Q

Norm groups vs. test norms

A

Norm group – the reference group with known characteristics
Test norms – the distribution of test scores obtained for the norm group

46
Q

What do derived scores do?

A

Provide a standard frame of reference within which the meaning of a score can be better understood.
Make it possible, under certain conditions, to compare scores from different measures.

47
Q

What are the two kinds of derived score?

A
1. Those that preserve the proportional relation of interscore distances in the distribution (z scores and other linear transformations of raw scores)
2. Those that do not (e.g., percentiles)
48
Q

What is the formula for finding the kth percentile and the kth quartile?

A

kth percentile: i = (k/100) × (n + 1)

kth quartile: i = (k/4) × (n + 1)

i is the index (rank or position of a data value)
n is the total number of data values
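
A minimal sketch of this computation in Python (the function name and data are my own, for illustration):

def percentile_index(k, n):
    # Position of the kth percentile: i = (k/100) * (n + 1)
    return (k / 100) * (n + 1)

# Hypothetical example: in an ordered list of 11 scores, the 25th
# percentile (the 1st quartile) falls at position (25/100) * 12 = 3,
# i.e., the 3rd ordered value.
print(percentile_index(25, 11))  # 3.0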

49
Q

How do you calculate z scores? What is the mean value and SD value?

A

The mean (μ) is always 0 and the standard deviation (σ) is always 1.
The shape of the original distribution is not changed when scores are converted.
z = (X − M)/SD, or z = (x − x̄)/s, or z = (x − μ)/σ
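
A minimal sketch of the conversion in Python, with made-up raw scores:

import statistics

scores = [10, 12, 8, 15, 11, 16, 9]            # hypothetical raw scores
m = statistics.mean(scores)
sd = statistics.pstdev(scores)                 # population SD

z_scores = [(x - m) / sd for x in scores]      # z = (X - M) / SD

# The converted scores have mean 0 and SD 1; the shape is unchanged.
print(round(statistics.mean(z_scores), 6))     # 0.0
print(round(statistics.pstdev(z_scores), 6))   # 1.0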

50
Q

What are the z-score rules for mound-shaped distributions?

A
1. Approximately 68% of the measurements will have a z-score between -1 and 1.
2. Approximately 95% of the measurements will have a z-score between -2 and 2.
3. Approximately 99.7% of the measurements will have a z-score between -3 and 3.
51
Q

How do you translate z scores to item difficulty?

A

use the normal-curve table in the back of the book
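
In place of the printed table, the same lookup can be done with a normal-curve function. A sketch, assuming the common convention that an item passed by a proportion p of examinees maps to the z point above which p of the normal curve lies (this mapping is my assumption, not the book's):

from scipy.stats import norm

def difficulty_z(p):
    # z value with a proportion p of the normal curve above it
    return norm.ppf(1.0 - p)

for p in (0.84, 0.50, 0.16):
    print(p, round(difficulty_z(p), 2))   # about -0.99, 0.0, 0.99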

52
Q

Bandwidth-fidelity dilemma

A

A peaked test measures the people at the peak well, but others very poorly [high fidelity (precision), but low bandwidth].
A rectangular distribution tries to have a few questions at each difficulty level, so that the average difficulty level is around .50. Thus it will help differentiate people no matter where they are on the trait. But because the test has only a few items at each difficulty level, it won't be able to differentiate well between individuals at the various levels. This type of test has good bandwidth, but low fidelity.

53
Q

Define Item Difficulty

A

In psychology, we define item difficulty as the percentage of examinees who answer an item correctly.
p value for item i = (number of persons answering item i correctly) / (number of persons taking the test, n)
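
A minimal sketch of the formula (the numbers are hypothetical):

def item_difficulty(num_correct, n):
    # p value: proportion of examinees answering the item correctly
    return num_correct / n

# Hypothetical: 60 of 100 examinees answer item i correctly.
print(item_difficulty(60, 100))  # 0.6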

54
Q

What is the p score if everyone gets the item correct (low difficulty)?

A

p= 1.00

p= 100 people got it right / 100 people

55
Q

What is the p score if everyone gets the item wrong (high difficulty)?

A

p= 0.00

p= 0 people got it right / 100 people

56
Q

What does calculating item difficulty tell us?

A

The relative frequency that examinees choose the correct response.
It is a characteristic of both the item and the population taking the test.
If we give the item to two different groups, the difficulty will not be the same.
The difficulty of items can be compared across domains.

57
Q

When is variability maximized?

A

When the item difficulty is closer to .50

58
Q

Bandwidth-fidelity dilemma, peaked

A

It is difficult for a test to measure all people well. Generally, it measures some people at a specific ability level better than others.
A peaked conventional test can provide high fidelity (i.e., precision) where it is peaked, but little bandwidth (i.e., it does not differentiate very well individuals at other positions on the scale).

59
Q

Bandwidth-fidelity dilemma, rectangle

A

A rectangular distribution tries to have a few questions for each difficulty level, so that the average difficulty level is around .50. Thus, it will help differentiate people no matter what level they are on the trait. The test will only have a few items at each difficulty level, so it won’t be able to differentiate between the individuals at the various levels well.

60
Q

Problems with guessing

A

This inflates the p value, because a p value of .60 really means that among the 60% who answered the item correctly, a certain percentage answered it correctly by lucky guessing.

61
Q

Ways to minimize the problems of guessing

A

corrected score = R − W/(k − 1), where R = number right, W = number wrong, and k = number of answer choices per item

The more answer choices, the lower the significance of guessing (chance success is 50% on true/false, 20% on five-option multiple choice).
Tell all candidates to do the same thing, that is, guess when unsure, or leave doubtful items blank, etc.
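
A minimal sketch of the correction formula (the numbers are made up):

def corrected_score(right, wrong, k):
    # score = R - W / (k - 1), where k = number of answer choices
    return right - wrong / (k - 1)

# Hypothetical: 40 right and 10 wrong on a 5-option multiple-choice test.
print(corrected_score(40, 10, 5))  # 37.5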

62
Q

Item Discrimination

A

Item discrimination refers to the ability of an item to correctly “discriminate” between those who are higher on the variable in question and those who are lower.
We expect that those people who do well overall on the test will also do well on individual items. We also expect the opposite to be true.

63
Q

How do we determine what are the high scores and what are the low scores?

A

Find the median and label everything above it high and everything below it low.
Advantage: we use all the data we have.
Disadvantage: there is a lot of "noise" at the center of the distribution.

Label the top five high and the bottom five low.
Advantage: scores are unlikely to change on retest; likely not a result of guessing; probably represent "real-life" correspondence.
Disadvantage: the sample size is so small that we can't be sure any calculations performed are stable.

Resolution: select roughly the upper 27% and the lower 27%.

64
Q

Index of Discrimination (D)

A

The index of discrimination is expressed as a percentage and is computed as the difference between two percentages.
This method breaks the test takers into the top test scores and the bottom test scores. We compare the number of people in each group who answered the item correctly. If the item is doing a good job of discriminating between the two groups, then more of the high scorers will answer correctly than the low scorers.
If this is an item where there is a correct answer, a negative D would alert us that there is something wrong with the item, that it needs to be rewritten. If this were an item from a personality test where there is no correct answer, the negative D would in fact tell us that we need to reverse the scoring.
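
A minimal sketch of the computation, using a hypothetical upper/lower 27% split:

def discrimination_index(upper_correct, upper_n, lower_correct, lower_n):
    # D = % correct in the upper group minus % correct in the lower group
    return 100.0 * upper_correct / upper_n - 100.0 * lower_correct / lower_n

# Hypothetical: 24 of 27 high scorers vs. 9 of 27 low scorers answer correctly.
print(round(discrimination_index(24, 27, 9, 27)))  # about 56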

65
Q

internal consistency

A

If we use the total test score as our criterion, then we will be retaining items that tend to be homogeneous, that is, items that tend to correlate highly with each other.

66
Q

external criterion

A

If we use an external criterion, that criterion will most likely be more complex psychologically than the total test score. For example, teachers’ evaluations of being “good at math” may reflect not only math knowledge, but how likable the child is.

67
Q

Item-Total Correlation

A

This statistic is the simple correlation between the score on an item (a correct response usually receives a score of 1; an incorrect response receives a score of 0) and the total test score.
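
A minimal sketch, with made-up data for six examinees (requires Python 3.10+ for statistics.correlation):

import statistics

item = [1, 1, 0, 1, 0, 0]             # 1 = correct, 0 = incorrect
total = [48, 45, 30, 40, 28, 33]      # total test scores

# Pearson correlation between item scores and total scores.
r = statistics.correlation(item, total)
print(round(r, 2))                    # positive: item discriminates well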

68
Q

Meanings of Item-Total Correlation

A

A positive item-total correlation indicates that the item successfully discriminates between those who do well on the test and those who do poorly.
An item-total correlation near zero indicates that the item doesn't differentiate between high and low scorers.
A negative item-total correlation indicates that the item scores and the overall test scores disagree: those who do well on such an item do poorly on the test.

69
Q

Low Interitem Correlations

A

First, the item we are looking at may not be correlated with the other items in the test. If we want the test to be homogeneous, then we should consider dropping the item or rewriting it so that it assesses content similar to the other items.
Second, the item may show positive correlations with some items, but zero or negative correlations with other items on the test. If a test measures more than one attribute, this could occur.

70
Q

What are Interitem Correlations

A

Compute the correlations among all the items.
You can use this information to compute the reliability of a test given the average interitem correlation and the number of items on the test.
You can also use this information to help you interpret the item discrimination numbers you found.
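
The card doesn't give the formula, but the usual reliability-from-average-interitem-correlation computation is the Spearman-Brown based formula for standardized alpha; a sketch under that assumption:

def reliability_from_interitem(avg_r, k):
    # standardized alpha = k * r_bar / (1 + (k - 1) * r_bar)
    return k * avg_r / (1 + (k - 1) * avg_r)

# Hypothetical: 20 items with an average interitem correlation of .25.
print(round(reliability_from_interitem(0.25, 20), 2))  # 0.87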

71
Q

What are the two Philosophies of Testing?

A

Factor Analysis – tests should be pure measures of the dimension being assessed. Items are selected statistically and correlate highly with each other.
Scale is homogeneous.
Con: useful for understanding a psychological phenomenon, but may not relate to real-world behavior.

Empiricism – scales should predict real-life behavior. Items are dropped or kept depending on whether they correlate with the criterion.
Scale is heterogeneous.

72
Q

Item Response Theory (IRT)

A

In "classical" test theory, a test score is made up of two parts: "true" score + random "error."
The more a person has of the variable being measured, the more likely the person will answer the question correctly.

IRT also has a basic assumption and that is that performance on a test is a function of an unobservable proficiency variable.

The characteristics of a test item, such as item difficulty, are a function of the particular sample to whom the item was administered.
Certain vocabulary words are harder for 2nd graders than they are for college students.

IRT, on the other hand, focuses on a theoretical mathematical model that unites the characteristics of an item, such as item difficulty, to an underlying hypothesized dimension.

IRT is concerned with the interplay of four aspects:
(1) the ability of the individual on the variable being assessed
(2) the extent to which a test item discriminates between high- and low-scoring groups
(3) the difficulty of the item
(4) the probability that a person of low ability on that variable makes the correct response.
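
The card doesn't name a specific model, but one common IRT model uniting these four aspects is the three-parameter logistic (3PL); a minimal sketch (parameter values are made up):

import math

def p_correct(theta, a, b, c):
    # 3PL: P(correct) = c + (1 - c) / (1 + exp(-a * (theta - b)))
    # theta = ability, a = discrimination, b = difficulty, c = guessing
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# An item with moderate difficulty (b = 0), good discrimination (a = 1.5),
# and a 20% chance of a correct guess (c = 0.2, e.g., five options):
for theta in (-2, 0, 2):
    print(theta, round(p_correct(theta, 1.5, 0.0, 0.2), 3))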

73
Q

Why do we have norms?

A

We need to have some way to make sense of a test score.
We need to be able to compare a score with the scores of others who have taken the test.
Usually we compare scores that have been obtained for a normative sample.

74
Q

How are norms selected?

A

In a perfect world, tests are administered to a representative group, on the basis of random sampling.
Normative groups are formed.
From these data, we can learn what average scores are to be expected from particular samples.
Norms can be formed on the basis of random sampling or on the basis of certain criteria.

Stratified sampling is used when we test a normative sample that reflects specific percentages.
A sample of convenience is more typical.
Neither is random nor representative.

75
Q

Age Norms

A

Age norms relate a level of test performance to the age of the people who have taken the test.
In establishing age norms, we need to obtain a representative sample at each of several ages and to measure the particular age-related characteristic in each of these samples.
We usually focus on the median because it shows what the typical performance level is at each age level.
Remember that there is considerable variability within the same age.

76
Q

School Grade Norms

A

Very similar to age norms, except the baseline is the grade level rather than the age.
We need to be careful when interpreting scores with grade-level norms.
A child at a lower grade may get a score that is the grade equivalent of a higher grade, but that doesn't mean the child should be in that higher grade.
The higher score may only hold for a subset of the material and doesn't translate to all areas of what a child at that grade can do.

77
Q

Cautions for Interpreting Norms

A

Norms can be based on inappropriate target populations.
Test manuals can be based on samples that don’t adequately represent the populations to which the examinee’s scores should be compared.
Normative data can become out of date quickly.
The sample size of the norm group may be small, and a small sample has more sampling error than a larger one.

78
Q

Expectancy Tables

A

Expectancy tables present data showing the relationship between test scores and some other variable, based on the experience of members of the norm group.
They show what can be expected of a person with a particular score.

79
Q

Relativity of Norms

A

Depending on which set of norms you compare a score to, the meaning given to the score may change.

80
Q

Local Norms

A

Sometimes it may be more appropriate to compare a score to a set of local norms.
These data are gathered from a local group of individuals.
It may depend on what the scores will be used for and whether decisions are to be made using the scores.

81
Q

Criterion-Referenced Testing

A

You assess performance in comparison with some standard or set of standards, not in comparison with what others can do.
We must first of all be able to specify the criterion.
Second, criteria are not usually arbitrary, but are based on real-life observation. Criterion-referenced decisions can be normative decisions, often with the norms not clearly specified.
Lastly, criterion-referenced and norm-referenced refer to how the scores or test results are interpreted, rather than to the tests themselves.

In this course, I could develop a list of topics that I expect each student to master and then assess their ability to meet these standards.
It is very difficult to develop good criterion-referenced tests.
Specifying standards and determining whether people meet or exceed them is still evolving.

82
Q

What is the difference between psychometric and edumetric?

A

Carver (1974) used the terms psychometric to refer to norm referenced and edumetric to refer to criterion referenced.

He argued that the psychometric approach focuses on individual differences, and that item selection and the assessment of reliability and validity are determined by statistical procedures.

The edumetric approach, on the other hand, focuses on the measurement of gain or growth of individuals, and item selection, reliability and validity, all center on the notion of gain or growth.

83
Q

What are the four ways to combine test scores?

A

Combining scores using statistics: convert test scores to z scores so that they can be compared and combined (see the sketch below).

Combining scores using clinical intuition: a college admissions officer deciding "accept" or "reject" based on a combination of test scores, GPA, recommendations, etc.

Multiple cutoff scores: for college admissions, a person may need a GPA of 3.0; anyone lower will not be considered. Cutoffs can be determined by clinical judgment or statistical evidence. Sometimes a high score in one area can compensate for a low score in another, but not always.

Multiple regression: expresses the relationship between a set of variables and a particular outcome that is being predicted, giving differential weighting to each of the variables. First, it is a compensatory model, that is, high scores on one variable can compensate for low scores on another variable. Second, it is a linear model, that is, it assumes that as scores increase on one variable (for example, IQ), scores will increase on the predicted variable (for example, GPA). Third, the variables that become part of the regression equation are those that have the highest correlations with the criterion and low correlations with the other variables in the equation.
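
A minimal sketch of the statistical approach, combining z scores from two hypothetical measures on different scales (all numbers are made up):

import statistics

def to_z(scores):
    # Convert raw scores to z scores using the sample mean and SD.
    m = statistics.mean(scores)
    s = statistics.stdev(scores)
    return [(x - m) / s for x in scores]

test_a = [52, 61, 47, 70, 55]   # e.g., scores on a 0-100 scale
test_b = [12, 15, 9, 18, 14]    # e.g., scores on a 0-20 scale

# Once both tests are on the common z-score metric, they can be averaged.
combined = [(za + zb) / 2 for za, zb in zip(to_z(test_a), to_z(test_b))]
print([round(c, 2) for c in combined])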