Test Construction Flashcards
true score variability
variability due to real differences in ability or knowledge in the test-takers
error variability
variability caused by chance or random factors
classical test theory
observed score = true score + error; total test score variability = true score variability + error variability
reliability
the amount of consistency, repeatability, and dependability in scores obtained on a test
reliability coefficient
- represented as ‘r’
- ranges from 0.00-1.00
- minimum acceptability is 0.8
- two factors that affect the size are: range of test scores and the homogeneity of test content
sources of errors in tests
- content sampling
- time sampling (ex - forgetting over time)
- test heterogeneity
factors that affect reliability
- number of items (the more the better)
- homogeneity (the more similar the items are, the better)
- range of scores (the greater the range, the better)
- ability to guess (true/false = the least reliable)
test-retest reliability
(or coefficient of stability)
- correlating pairs of scores from the same sample of people who are administered the identical test at two points in time
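A minimal Python sketch (hypothetical scores, not from the source) of computing a test-retest coefficient as the Pearson correlation between two administrations:
    from statistics import correlation  # Python 3.10+

    time1 = [12, 15, 9, 20, 17, 11]   # hypothetical scores at the first administration
    time2 = [13, 14, 10, 19, 18, 12]  # the same examinees retested later

    r_test_retest = correlation(time1, time2)  # coefficient of stability
    print(round(r_test_retest, 2))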
parallel forms reliability
(or coefficient of equivalence)
- correlating the scores obtained by the same group of people on two roughly equivalent but not identical forms of the same test administered at two different points in time
internal consistency reliability
- looks at the consistency of the scores within the test
- 2 ways: Kuder-Richardson or Cronbach’s coefficient alpha
split half reliability
- splitting the test in half (ex - odd vs. even numbered items) and correlating the scores on the two halves; because the correlation is based on only half the number of items, it underestimates the full test's reliability
Spearman-Brown prophecy formula
- used to correct split-half reliability estimates for the shortened test length
- tells us how much more reliable the test would be if it were longer
*inappropriate for speeded tests
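A minimal Python sketch (hypothetical half-test scores) of split-half reliability with the Spearman-Brown correction to estimate full-length reliability:
    from statistics import correlation

    # hypothetical totals on the odd- vs. even-numbered items for six examinees
    odd_half = [10, 7, 12, 5, 9, 11]
    even_half = [9, 8, 11, 6, 10, 12]

    r_half = correlation(odd_half, even_half)  # reliability of a half-length test
    r_full = (2 * r_half) / (1 + r_half)       # Spearman-Brown estimate for the full-length test
    print(round(r_half, 2), round(r_full, 2))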
Kuder-Richardson and Cronbach’s coefficient alpha
- involve analysis of the correlation of each item with every other item in the test (reliability/internal consistency)
- KR-20 is used when items are scored dichotomously (ex - right or wrong)
- Cronbach’s is used when items are scored non-dichotomously (ex - Likert scale)
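A minimal Python sketch (hypothetical item responses) of Cronbach's alpha; with dichotomous 0/1 items this reduces to KR-20:
    from statistics import pvariance

    # hypothetical dichotomous responses: rows = examinees, columns = items
    data = [
        [1, 1, 0, 1],
        [1, 0, 0, 1],
        [0, 1, 1, 1],
        [1, 1, 1, 1],
        [0, 0, 0, 1],
    ]

    k = len(data[0])                                    # number of items
    item_vars = [pvariance(col) for col in zip(*data)]  # variance of each item
    total_var = pvariance([sum(row) for row in data])   # variance of total scores
    alpha = (k / (k - 1)) * (1 - sum(item_vars) / total_var)
    print(round(alpha, 2))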
interrater reliability
- the degree of agreement between two or more scorers when a test is subjectively scored
- best way to improve = provide opportunity for group discussion, practice exercises, and feedback
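A minimal Python sketch (hypothetical ratings) of one simple interrater index, percent agreement; chance-corrected indices such as Cohen's kappa are also common:
    # hypothetical pass/fail ratings from two scorers on the same ten responses
    rater_a = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
    rater_b = [1, 0, 0, 1, 0, 1, 1, 1, 1, 1]

    agreements = sum(a == b for a, b in zip(rater_a, rater_b))
    print(agreements / len(rater_a))  # 0.8 -> the two scorers agree on 80% of responses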
validity
- the meaningfulness, usefulness, or accuracy of a measure
3 basic types of validity
- content
- criterion
- construct
face validity
- the degree to which a test subjectively appears to measure what it says it measures
content validity
- how adequately a test samples a particular content area
true positive
- test takers who are accurately identified as possessing what is being measured
*correct prediction
false positive
- test takers who are inaccurately identified as possessing what is being measured
*incorrect prediction
true negative
- test takers who are accurately identified as not possessing what is being measured
*correct prediction
false negatives
- test takers who are inaccurately identified as not possessing what is being measured
*incorrect prediction
item difficulty
- represented by ‘p’
- can range in value from 0 to 1.0 (0 = very difficult, 1.0 = very easy)
- difficulty level (p) = the proportion of test takers who answered the item correctly
- items should have an average difficulty level of 0.50, with a range of about 0.30 to 0.80
example - if ‘p’ is 0.10, that means only 10% of people got the item right (therefore it was difficult)
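A minimal Python sketch (hypothetical responses) of computing an item's difficulty level:
    # hypothetical responses to one item (1 = correct, 0 = incorrect)
    responses = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]

    p = sum(responses) / len(responses)  # proportion who answered correctly
    print(p)  # 0.3 -> a fairly difficult item (only 30% got it right)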
diagnostic validity
- the degree to which a test accurately identifies those who have and those who DO NOT have the disorder
- an ideal situation would result in high sensitivity, specificity, hit rate, and predictive values with few false positives and few false negatives
convergent validity
the degree to which scores have high correlations with scores on another measure that assesses the same thing
standard error of measurement
an index of the amount of measurement error expected in an obtained score; used to construct a confidence interval around that score
confidence intervals
68% = 1 standard error of measurement
95% = 2 standard error of measurement
99% = 3 standard error of measurement
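A minimal Python sketch (hypothetical numbers) using the usual formula SEM = SD × √(1 − reliability) to build these intervals:
    import math

    sd = 15             # hypothetical standard deviation of the test scores
    reliability = 0.91  # hypothetical reliability coefficient
    score = 110         # an examinee's obtained score

    sem = sd * math.sqrt(1 - reliability)       # standard error of measurement
    ci_68 = (score - 1 * sem, score + 1 * sem)
    ci_95 = (score - 2 * sem, score + 2 * sem)
    ci_99 = (score - 3 * sem, score + 3 * sem)
    print(round(sem, 1), ci_95)  # SEM = 4.5, 95% CI = (101.0, 119.0)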
content validity
Does the content accurately measure all of the topics/items it’s intending to measure?
Example: In a stats exam, content validity would be very low if it only asked you how to calculate the mean. Content validity would increase if it asked additional questions about other things within the stats world as well.
criterion validity
Does the content accurately reflect a set of abilities in a current or future setting?
2 kinds:
Concurrent criterion = does this test accurately assess my student’s current level of ability?
Predictive criterion = does this test accurately assess how my students will do in the future?
construct validity
does your test measure the construct it claims to measure (ex - aggression)
construct = a group of interrelated variables that you care about
Increase item difficulty
0= very difficult
1= very easy
So to increase difficulty, you want to add items that have p values as close to 0 as possible
Sensitivity vs specificity
Sensitivity: senses people WITH the diagnosis = true positives / (TP + FN)
Specificity: specifies people who DON’T have the diagnosis = true negatives / (TN + FP)
(Note: these differ from the predictive values: positive predictive value = TP / (TP + FP); negative predictive value = TN / (TN + FN))
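A minimal Python sketch (hypothetical counts) computing all four indices from a 2x2 classification table:
    # hypothetical validation-study counts
    TP, FP, TN, FN = 40, 10, 80, 20

    sensitivity = TP / (TP + FN)  # proportion of people WITH the disorder who test positive
    specificity = TN / (TN + FP)  # proportion of people WITHOUT the disorder who test negative
    ppv = TP / (TP + FP)          # positive predictive value
    npv = TN / (TN + FN)          # negative predictive value
    print(sensitivity, specificity, ppv, npv)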
Factor matrix & communality
Tool that helps us see how different things (variables) are connected or share common traits.
Communality: the proportion of a variable’s variance that is explained by the common factors
We rotate the initial factor matrix to obtain one that is easier to interpret
Size of Standard error of mean increases as
Population SD INCREASES and sample size DECREASES
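A minimal Python sketch (hypothetical values) of the standard error of the mean, SE = SD / √n:
    import math

    population_sd = 15
    n = 25
    se_mean = population_sd / math.sqrt(n)  # larger SD or smaller n -> larger SE
    print(se_mean)  # 3.0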
Raising a test’s cut-off score will have which effects?
Raising the cut-off score will result in fewer applicants being hired.
DECREASE: number of false positives
INCREASE: number of true negatives
When test scores represent an interval or ratio scale and the distribution is skewed, the best measure of central tendency is what?
Median
The point at which an item characteristic curve intercepts the Y (vertical) axis provides information about what?
Probability of answering item correctly by guessing
Banding (statistical banding)
Puts test scores into groups based on a range of possible errors. People with scores in the same group are considered equally good at something (assuming small score differences don’t matter for job performance)
Attenuation formula
Used to estimate what the maximum criterion-related validity coefficient would be if the predictor and/or criterion had a reliability coefficient of 1.0
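A minimal Python sketch (hypothetical coefficients) of the correction for attenuation, r_xy / √(r_xx · r_yy):
    import math

    r_xy = 0.40  # hypothetical observed validity coefficient
    r_xx = 0.80  # hypothetical reliability of the predictor
    r_yy = 0.90  # hypothetical reliability of the criterion

    # estimated validity if predictor and criterion were perfectly reliable
    r_corrected = r_xy / math.sqrt(r_xx * r_yy)
    print(round(r_corrected, 2))  # about 0.47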
When the prevalence of a disorder increases, how are a test’s positive and negative predictive values affected?
When prevalence INCREASES the POSITIVE predictive value INCREASES and NEGATIVE predictive value DECREASES
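A minimal Python sketch (assuming a hypothetical test with sensitivity and specificity of 0.90 each) showing this effect of base rates:
    def predictive_values(prevalence, sens=0.90, spec=0.90):
        tp = prevalence * sens
        fp = (1 - prevalence) * (1 - spec)
        tn = (1 - prevalence) * spec
        fn = prevalence * (1 - sens)
        return tp / (tp + fp), tn / (tn + fn)  # (PPV, NPV)

    print(predictive_values(0.05))  # low prevalence  -> PPV ~0.32, NPV ~0.99
    print(predictive_values(0.30))  # high prevalence -> PPV ~0.79, NPV ~0.95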
How to calculate standard error of estimate
68% = predicted criterion score ± 1 standard error of estimate
95% = predicted criterion score ± 2 standard errors of estimate
99% = predicted criterion score ± 3 standard errors of estimate
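A minimal Python sketch (hypothetical numbers) using the usual formula SEE = SD_y × √(1 − r²) to build such an interval:
    import math

    sd_y = 10       # hypothetical SD of the criterion
    r_xy = 0.60     # hypothetical validity coefficient
    predicted = 75  # a person's predicted criterion score

    see = sd_y * math.sqrt(1 - r_xy ** 2)  # standard error of estimate
    ci_95 = (predicted - 2 * see, predicted + 2 * see)
    print(round(see, 1), ci_95)  # SEE = 8.0, 95% CI = (59.0, 91.0)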