Psychometrics/testing Flashcards
four levels of measurement
nominal, ordinal, interval, ratio
nominal
categories with no inherent order – male/female
ordinal
ranks – 1st, 2nd, 3rd
interval
quantitative scores that tell both relative rank and how far apart scores are, but with no true zero – GPA, IQ
ratio
has an absolute zero – time, length (temperature qualifies only on the Kelvin scale; Celsius/Fahrenheit are interval)
mean
average (M)
median
middle number
mode
most frequent/common number
range
the distance between the highest and lowest scores (highest minus lowest)
-Do the scores cluster close to the mean or are they spread out?
variance
the average squared deviation from the mean: take the difference b/w each score and the mean, square each difference, then average the squares
variance = Σ(X − M)² / N
standard deviation
square root of the variance = variability of a distribution (s)
SD = √ variance
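A quick numeric check of mean, variance, and SD (a minimal Python sketch; the scores are made up):

    scores = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical raw scores
    mean = sum(scores) / len(scores)    # M = 5.0
    # variance: difference from the mean, squared, then averaged
    variance = sum((x - mean) ** 2 for x in scores) / len(scores)   # 4.0
    sd = variance ** 0.5                # square root of the variance = 2.0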
positive skew
inadequate floor, test too HARD; scores pile up at the low (left) end and the tail stretches right
negative skew
inadequate ceiling, test too EASY; scores pile up at the high (right) end and the tail stretches left
kurtosis
the degree to which a distribution is flat or peaked at the top
platykurtic
flat top (flat like a plate)
leptokurtic
pointy on top (leap into the air!)
normal distribution
What % obtain above 1 SD:
-50% fall at or below mean
-½ of 68% = 34% score between mean and 1 SD
-THUS- 50% + 34% = 84% score below 1 SD
-100%- 84% = 16% score higher than 1 SD above mean
68% within 1 SD; 95% within 2 SD; 99.7% within 3 SD
see picture of normal distribution
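These percentages can be double-checked against the normal CDF (a sketch assuming scipy is available):

    from scipy.stats import norm
    norm.cdf(1)                   # ~0.84 -> 84% score at or below 1 SD above the mean
    1 - norm.cdf(1)               # ~0.16 -> 16% score higher than 1 SD above the mean
    norm.cdf(1) - norm.cdf(-1)    # ~0.68 -> within 1 SD
    norm.cdf(2) - norm.cdf(-2)    # ~0.95 -> within 2 SD
    norm.cdf(3) - norm.cdf(-3)    # ~0.997 -> within 3 SD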
developmental norms
norms based on developmental milestones (e.g., grade or age)
-Used on tests of intellectual ability or academic achievement where the skill being measured is thought to develop over time – e.g., the WIAT
within group norms
how an examinee performed relative to the norm group – same age, gender, etc.
-Within-group norms are better for interpreting tests because developmental norms can be easily misinterpreted.
types of within group norms
percentile rank, standard scores, z score, t score
percentile rank
what % of the normative sample obtained scores equal to or lower than that of the examinee
-Calculated directly from frequency distribution
standard scores
uses the mean and SD to transform a raw score into a new score that tells us where the examinee scores relative to their peers
-To obtain a standard score, convert the z score to a scale with a mean of 100 and SD of 15
-SS = 15z + 100
-Used on IQ and achievement testing
-Mean of 100 and SD of 15
z score
measures how far from the mean the examinee scored, in units of standard deviation
-Subtract the mean from the raw score, then divide the difference by the standard deviation: z = (X − M) / SD
t score
linear transformation of the Z score
- Multiply z score by 10 and add 50
-T = 10z + 50
-Used on MMPI
-Mean of 50 and SD of 10
converting t to z to ss scores
-T→ z → SS
Z = (T-50)/10
Z = (SS-100)/15
-T = 10z + 50
Example: Convert T = 63 to SS
(63 − 50)/10 = z = 1.3
1.3(15) + 100 = 119.5
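The same conversion chain as a small Python sketch (the function names are mine):

    def t_to_z(t):
        return (t - 50) / 10      # Z = (T - 50) / 10

    def z_to_ss(z):
        return 15 * z + 100       # SS = 15z + 100

    z = t_to_z(63)                # (63 - 50) / 10 = 1.3
    ss = z_to_ss(z)               # 1.3(15) + 100 = 119.5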
correlation
statistic that describes the relationship between two variables, X and Y
Pearson r: range = −1.00 to +1.00 (magnitude and sign)
r = 0: X and Y are not linearly related
If r > 0, higher scores on X are associated with higher scores on Y
If r < 0, higher scores on X are associated with lower scores on Y
curvilinear
r = 0… but a relationship still exists – Pearson r only detects linear relationships
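A sketch of why a perfect curvilinear relationship can still give r = 0 (numpy assumed; data made up):

    import numpy as np
    x = np.array([-3., -2., -1., 0., 1., 2., 3.])
    y = x ** 2                       # perfect U-shaped relationship
    r = np.corrcoef(x, y)[0, 1]      # r = 0.0 - Pearson r only sees linear trends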
factors that affect the correlation coefficient
heteroscedasticity
homoscedasticity
restriction of range
heteroscedasticity
the spread of scores around the regression line is uneven across the range – dots scattered all over in places
Thus the correlation coefficient will not accurately reflect the relationship
homoscedasticity
scores are evenly spread around the regression line across the range – the better condition for interpreting r
restriction of range
reduces the magnitude of r; occurs when the sample does not cover the full range of scores
When a variable has a restricted range, put little confidence in r
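A simulated sketch of restriction of range shrinking r (numpy assumed; the numbers are illustrative):

    import numpy as np
    rng = np.random.default_rng(0)
    x = rng.normal(size=5000)
    y = 0.6 * x + 0.8 * rng.normal(size=5000)           # built so r is about .6
    r_full = np.corrcoef(x, y)[0, 1]                     # ~.6 over the full range
    keep = x > 1.0                                       # sample only high scorers
    r_restricted = np.corrcoef(x[keep], y[keep])[0, 1]   # noticeably smaller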
factor analysis
goal is to simplify a complex amount of information (e.g., a large correlation matrix)
…with the assumption that the variables in the matrix correlate the way they do because one or more underlying themes or factors link some of the variables together
factor extraction
extracting one or more factors
unrotated factor matrix
the initial solution before rotation; we then rotate it to obtain a simpler, more interpretable pattern of loadings (e.g., fewer negative loadings)
rotations
shifting the factor axes; an oblique rotation allows the factors to correlate, an orthogonal rotation keeps them uncorrelated
rotated factor matrix
represents both how the variables are weighted on each factor and the correlation between the variables and the factors; shows the loading of each variable on the new rotated factors; utilizes…
Thurstone's criteria: to achieve a simple structure
systematic error
error we can detect and eliminate – e.g., a multiple-choice test graded with the wrong key
random error
cannot be fixed; inevitable and inescapable; most concerning b/c it can't be corrected
classical test theory
your obtained score consists of your true score plus error
Obtained score = true score + error → Xo = Xt + e
time sampling and solution
Error due to time sampling; the same group of examinees is tested again after a period of time has elapsed
ISSUES = practice effects and the length of the time interval
SOLUTION: test/re-test – with an interval not too long or too short!
Practice effect – the 2nd administration is never the same as the 1st
item sampling
error resulting from the need to select a subset of the total universe (domain) of items
alternate-form method
-2 forms of the test – the same process is used to select items for both
-Often prohibitive because it is difficult to create twice as many good items
split half method
-Items divided into two half-tests – often odd/even
-Shortening decreases reliability (reliability is related to test length), so the half-test correlation underestimates the full test
-The correlation must be adjusted by means of the Spearman-Brown formula
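The Spearman-Brown formula itself is not written out on the card; this sketch applies the standard version for projecting half-test reliability to full length:

    def spearman_brown(r_half, n=2):
        # n = how many times longer the full test is than the part scored
        return (n * r_half) / (1 + (n - 1) * r_half)

    spearman_brown(0.70)   # 2(.70) / (1 + .70) = ~.82 for the full-length test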
internal consistency
are all the items on the scale equally good measures of the construct? Is the scale homogeneous?
mean item-item (inter item) correlation
-Correlation between the scores on each pair of items on a test
-the larger the mean correlation, the more homogeneous the items
item-total correlation
-Correlation between each item and the total score of the test
-The correlation between the score on a particular item and the total score on the test is also known as the discriminability of the item
Kuder Richardson
reliability estimate for items that are scored either correct or incorrect (dichotomous)
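The standard KR-20 formula (not spelled out on the card): KR-20 = (k / (k − 1)) × (1 − Σpq / σ²), where k is the number of items, p is the proportion passing each item, q = 1 − p, and σ² is the variance of total scores.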
Cronbach's alpha
measure of internal consistency, used for non-dichotomous items; MOST COMMON
-High value = homogeneity = items are interchangeable measures of the construct
-Can be too high!! – identical questions that don't survey the variety of the construct
-Low value = the items might not be measuring the same construct
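A sketch of the standard alpha computation (my implementation; numpy assumed):

    import numpy as np

    def cronbach_alpha(items):
        # items: examinees x items matrix of scores
        items = np.asarray(items, dtype=float)
        k = items.shape[1]
        item_vars = items.var(axis=0).sum()     # sum of the item variances
        total_var = items.sum(axis=1).var()     # variance of the total scores
        return (k / (k - 1)) * (1 - item_vars / total_var)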
inter-rater reliability
assessed with percent agreement: calculate the percentage of items on which the raters agree
Likely to underestimate the actual amount of error due to inter-rater differences, because it does not correct for chance agreement
kappa
a better way of assessing inter-rater reliability – it takes chance agreement into account and corrects for it
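A sketch of Cohen's kappa from a 2x2 agreement table (the formula is standard; the function is mine):

    def cohen_kappa(table):
        # table[i][j] = count where rater 1 says i and rater 2 says j
        n = sum(sum(row) for row in table)
        p_observed = (table[0][0] + table[1][1]) / n
        rows = [sum(row) for row in table]                    # rater-1 marginals
        cols = [table[0][j] + table[1][j] for j in range(2)]  # rater-2 marginals
        p_chance = sum(rows[i] * cols[i] for i in range(2)) / n ** 2
        return (p_observed - p_chance) / (1 - p_chance)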
error variance
1.00 minus reliability
standard error of measurement (SEM)
confidence interval where your score, if retested, will fall
The SEM is INVERSELY related to the reliability
-If reliability is high, SEM is low
-If reliability is low, SEM is high
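The standard formula behind this (not on the card): SEM = SD × √(1 − reliability). E.g., an IQ scale with SD = 15 and reliability .89 gives SEM = 15 × √.11 ≈ 5.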
standard error of estimate (SEE)
confidence interval where your true score is
Example: 95% CI using SEE = 95-105
Means that there is a 95% chance that the person’s true score falls b/w 95 and 105
standard error of difference (SED)
used when you are comparing TWO scores, taking into account the error associated with both scores, to determine whether the scores are in fact different from each other; the SED is always larger than either of the two SEMs
SED = square root of (SEM₁ squared + SEM₂ squared) … aka
SED = √(SEM₁² + SEM₂²)
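Worked example (my numbers): if SEM₁ = 3 and SEM₂ = 4, then SED = √(9 + 16) = √25 = 5 – larger than either SEM alone.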
validity
does the test measure what it claims to measure? *Reliability is a necessary (but not sufficient) condition for validity
face validity
-Does it look like a test that measures what it says it measures? Do the items look like they belong?
-A cosmetic issue – not a requirement for validity
-Possible for a test to be a highly valid measure but have low face validity – ex. MMPI
content validity
-Do the items have sufficient coverage of the domain they are supposed to cover?
-A matter of professional judgment – no statistical test for content validity
-Addressed during the process of test construction
criterion validity
Goal is to demonstrate that the test correlates with other criteria that are important elements of the construct being measured
Ex. A self-report scale of depression compared to psychiatric diagnosis
concurrent validity
criterion recorded at the same time as the test; "now"
-To show that a test can be a substitute for more costly or inconvenient tests
-ex. Freshman GPA and college admissions test
predictive validity
recorded at a later point in time - is our test an accurate predictor?
-To show that the test can accurately predict future performance
-Ex. SATs and college GPA
criterion appropriateness
-The basic manner in which the relationship between a test and a criterion is expressed is via the correlation
- The correlation between a test and a criterion is called the validity coefficient
criterion contamination
those who assess the examinee should be blind to the test scores when assigning the criterion
Ex. A psychiatrist should not know that a patient scored high on a depression scale before diagnosing depression
criterion unreliability
poor reliability of the criterion measure
Correct for the error with the correction for attenuation – but it might overcorrect
differential validity
the validity of a test differs depending on subgroup (e.g., gender) – i.e., the validity coefficient differs across subgroups
incremental validity
Ability of a test to measure more accurately or precisely than other measures
-Does the test add to already existing measures?
stepwise multiple regression
-Comparison of prediction with only the existing measures versus after the new test has been added
-Statistical measure that identifies the optimal battery
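A sketch of the underlying comparison – does adding the new test raise R²? (numpy assumed; data simulated; a true stepwise procedure would automate the selection):

    import numpy as np

    def r_squared(X, y):
        # R^2 from an ordinary least-squares fit with an intercept
        X = np.column_stack([np.ones(len(y)), X])
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        return 1 - (y - X @ beta).var() / y.var()

    rng = np.random.default_rng(1)
    existing = rng.normal(size=(200, 2))        # the current battery
    new_test = rng.normal(size=(200, 1))        # the candidate measure
    y = existing @ [0.5, 0.3] + 0.4 * new_test[:, 0] + rng.normal(size=200)
    gain = r_squared(np.hstack([existing, new_test]), y) - r_squared(existing, y)
    # a positive gain = the new test adds incremental validity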
shrinkage
the predictive power of a regression formula applied to a new sample will always shrink; cross-validation helps estimate by how much – a concern
cross validation
estimates how much shrinkage occurs by applying the regression formula to a new sample
construct validity
Is the test adequately measuring the construct of interest
convergent validity
when a test correlates highly with another test or observation that we would expect it to correlate with
Ex. High IQ and grades
discriminant validity
when a test does not correlate with observations it should not correlate with
multitrait-multimethod matrix
methodology to assess convergent and discriminant validity
factorial validity
-Used to look at construct validity
-Examines the test’s factor structure to determine if it fits what is predicted or theoretically expected
exploratory factor analysis
No theory used – no hypothesis beforehand; let the data tell you (generate hypotheses)
confirmatory factor analysis
start with a hypothesis and test the data against it
decision theory
statistics for evaluating the utility of a test in assigning a diagnosis
statistics depend on the cutting score
cutting score and what increasing/decreasing the cutting score does to sensitivity and specificity
the score that divides our scale into two parts
If above the cutting score – positive for a diagnosis
If below – negative for the diagnosis
Increasing the cutting score will DECREASE the sensitivity and INCREASE the specificity
Decreasing the cutting score will INCREASE the sensitivity and DECREASE the specificity
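A simulated sketch of the tradeoff (numpy assumed; my numbers):

    import numpy as np
    rng = np.random.default_rng(2)
    disordered = rng.normal(65, 10, 500)   # scores of people with the disorder
    healthy = rng.normal(50, 10, 500)      # scores of people without it

    for cut in (50, 60, 70):
        sensitivity = (disordered >= cut).mean()   # hits among the disordered
        specificity = (healthy < cut).mean()       # correct rejections among the healthy
        # raising the cut lowers sensitivity and raises specificity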
base rate within decision theory
Base rate must also be taken into account
Base Rate - Percentage of those in sample who actually have the disorder
Relative “cost” of false positive vs false negative error – depends on setting
What’s the risk of over or under diagnosing? How much do you adjust the cutting score?
true positive
test accurately identifies a person as having the disorder (A)
false positive
test says that the person has the disorder when s/he actually does not (B)
true negative
test accurately identifies a person as NOT having the disorder (D)
false negative
test says that the person does not have the disorder when s/he actually does have the disorder (C)
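Putting the four cells together (standard definitions, not spelled out on the cards): sensitivity = A / (A + C), the proportion of disordered people the test catches; specificity = D / (B + D), the proportion of healthy people the test clears; positive predictive value = A / (A + B), which depends heavily on the base rate.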