3rd exam Flashcards
What is the formula for classical testing theory?
X = T + E, where X = observed score, T = true score, and E = error (systematic and random)
What creates a problem for classical testing theory?
Guessing on an achievement test can make the observed score a poor reflection of the true score
Do we know when people guess?
We never know when someone is guessing
Abbott's formula
allows you to estimate the true score by correcting for blind guessing
If you are guessing wrong what happens within classical testing theory?
the observed score is not reflective of their true score
Abbott's actual math formula
corrected score = R - W / (k - 1), where R = correct responses, W = wrong responses, and k = number of alternatives (only W is divided by k - 1)
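The correction above can be sketched in Python (a minimal illustration; the function name and example numbers are hypothetical):

```python
# A minimal sketch of Abbott's correction for blind guessing:
# corrected score = R - W / (k - 1).

def corrected_score(right: int, wrong: int, alternatives: int) -> float:
    """Estimate the true score after removing credit gained by blind guessing."""
    return right - wrong / (alternatives - 1)

# 60 right and 20 wrong on a 5-alternative multiple-choice test:
print(corrected_score(60, 20, 5))  # -> 55.0
```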
To overcome the influence of blind guessing
one should advise examinees to attempt every question, since not all guessing is blind: examinees can often narrow down the alternatives and answer correctly, and truly blind guessing tends to be infrequent
What is an error in multiple choice questions?
not the question itself but the response options you choose from
What is the error within short-answer questions?
the issue is interpreting what the question is asking and how to answer it; this ambiguity affects reliability
Ebel's idea of reliability and response options
reliability studies have been done on the number of response options; a better way to increase test reliability is to add more items (response options should be around 5)
Speed tests
the best way to calculate reliability for speeded tests is a split-half reliability with separately timed halves (an ordinary split-half can overestimate reliability on a speeded test)
With speed tests how should you do reliability
administer each half of the test with half the time limit, about two weeks apart; this is a better indicator of reliability
Halo Effect
a rater's tendency to perceive an individual who is high (or low) in one area as also high (or low) in other areas
2 kinds of halo effects
general impression model and salient dimension model
General impression model
the tendency of a rater to allow an overall impression of an individual to influence judgments of that person's specific performance (ex: a rater who finds a reporter "impressive" overall may also rate the reporter's speech as strong)
Salient dimension model
one salient quality of a person affects the rating of other qualities (ex: people rated as attractive are also rated as more honest); the rater makes inferences about an individual based on one salient trait or quality
Simpson paradox
aggregating data can change its apparent meaning and can obscure conclusions because of a third variable
Percentages are at the heart of the simpson paradox, why are they bad?
because they obscure the relationship between the numerator and denominator (ex: 8/10 and 80/100 are both 80%, but the number of people who reviewed each restaurant is very different)
What is important in knowing the percentage?
you need to know what the numerator and denominator are; otherwise you are misinterpreting the percentages
What happens when you disaggregate the data?
you can truly see whether the phenomenon is actually occurring (Simpson's paradox)
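Simpson's paradox from the cards above can be demonstrated with a small sketch (all numbers are hypothetical, chosen so each subgroup and the aggregate point in opposite directions):

```python
# Hypothetical illustration of Simpson's paradox: within each subgroup
# treatment A has the higher success rate, but after aggregating, B wins.

groups = {
    "mild":   {"A": (81, 87),   "B": (234, 270)},
    "severe": {"A": (192, 263), "B": (55, 80)},
}

def rate(successes, total):
    return successes / total

for name, g in groups.items():
    a, b = rate(*g["A"]), rate(*g["B"])
    print(f"{name}: A={a:.0%}  B={b:.0%}")   # A is higher in both subgroups

# Aggregated (pooled) data reverses the conclusion:
tot_a = (81 + 192, 87 + 263)   # 273 / 350 = 78%
tot_b = (234 + 55, 270 + 80)   # 289 / 350 = 83%
print(f"overall: A={rate(*tot_a):.0%}  B={rate(*tot_b):.0%}")  # B now looks higher
```

Disaggregating (looking within each severity group) recovers the true pattern that the pooled percentages obscure.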
Clinical Decision-Making
making decisions based on one's own clinical experience
Mechanical decision-making
make decisions based on data or statistics
Clinical psychologists often feel that their decision making is
accurate, but it is flawed because biases we are unaware of affect our decisions
Robin Dawes
asserts that mechanical prediction is better than clinical prediction
Dawes example
asked faculty to rate students in a graduate program from 1964-1967, each on a 5-point scale. There was a very low correlation between current faculty ratings and the ratings by the admissions committee, but the ratings were correlated with GRE scores and undergraduate GPA
quantitative data (mechanical decisions) were
more predictive than clinical judgment
When can mechanical or quantitative prediction work?
when people identify which variables to examine to make the prediction; people are necessary to choose the variables
Dawes's crude mechanical decision making
ex: marital relationship satisfaction was predicted from the ratio of sex to arguments; people tend to rate relationships higher if they have more sex and fewer fights
People are not good at what with the data according to Dawes?
integrating the data in unbiased ways
There is resistance to what prediction
mechanical prediction; our belief in clinical prediction is reinforced by isolated incidents we can easily recall (we rely on testing, which is quantitative data)
Why do we always need to know the base rate?
to make sure we do not make clinical judgment errors
Clinical decision making always has to be balanced by
Mechanical decision making
When people seek out treatment, they seek it out when they are most
severe, or when something is really impacting them
When you are at your most severe, you generally don't get more severe, which relates to the
Regression to the mean: extreme scores tend to move back toward the middle
Why is mechanical better than clinical prediction?
Dawes says that humans make errors in judgment because they ignore base rates, ignore third variables, and ignore regression to the mean
Third variable examples
ice cream sales and crime both go up in the summer; the third variable is heat
Representative thinking
we tend to make decisions based on the information we readily have access to; we use these shortcuts to live our lives, but with diagnosis we need to do more
Using representative thinking
can sometimes cause errors in thinking.
Heuristic
simple rule to make decisions
Factor analysis goes under
Nondichotomous scoring systems
Item response theory goes under both
Item analysis for both dichotomous and nondichotomous
Generalizability theory goes under the
Overall test
Factor analysis
determines which items are associated with latent constructs, i.e., constructs that cannot be measured directly; we do this mathematically (it allows us to look at item quality)
Anxiety as a latent construct
3 buckets (overarching constructs): physical, emotional/psychological and cognitive (every disorder has buckets)
Within anxiety the latent construct, what would the 3 overarching constructs contain?
Physical (heart rate, sweating, shaking, GI distress), Emotional/psychological (irritability, worry, nervousness), Cognitive (poor concentration, rumination)
3 necessary conditions for a sound factor analysis
- factor structure represents what we know about the construct
- factor structure can be replicated
- factor structure is clearly interpretable with precise scaling
what type of sample does a factor analysis require?
need an over-inclusive, larger sample of between 200-500 subjects
facets
well-defined, homogeneous item clusters that map directly onto the higher-order factors
What happens when there are more items in a factor analysis?
creates the ability to tap into aspects of the construct you may not have anticipated; it can also produce facets or sub-constructs
What item response format can you not use in factor analysis?
you cannot use dichotomous item response formats, because they can cause a serious disturbance in the correlation matrix
why do authors suggest having rating scales or likert scales from 5 to 7 points?
with more response options, a greater amount of variance can be captured
Who should you sample for factor analysis?
Heterogeneity is needed, researchers should get a sample that can represent all trait dimensions
one of the reasons for conducting a factor analysis
develop and identify a hierarchical factor structure
Hierarchical factor structure
allows us to statistically identify the items that appear relevant to the construct; it may also identify another area or construct that was not thought of before the items were put together
Major criticism of factor analysis
develop these items on constructs that may or may not have a measurable criterion
the second reason for conducting factor analysis
improving psychometric properties of a test
how to improve psychometric properties of a test?
factor analysis can help developers determine which items to remove, revise, or add in order to improve the internal consistency reliability of the items
all tests with sound items should have a strong?
Internal consistency
with the sample size if the factors are well defined you can use a
smaller sample of between 100-200
The third reason for conducting a factor analysis is developing items that discriminate between samples
some items may be endorsed by certain groups, and then you may need to revise those same items so they are more discriminating for another group
The fourth reason for conducting factor analysis, developing more unique items- decreasing redundancy
having identical items is inefficient; whatever error is present will be associated with both items
Why are short forms good?
more efficient, less time consuming, easier for examinee and assessor
2 primary objections to short form development
1) can the short form give the appropriate information for an appropriate assessment
2) is the short form accurate and valid
General problems for short forms
1) there is an assumption that all the reliability and validity of the long form automatically applies to the abbreviated short form
(due to the reduced coverage, one cannot assume similar reliability and validity)
2) there is an assumption that the new shorter measure requires less validity evidence (the primary problem: when you have fewer items and less content coverage, you compromise the validity of the measure as well)
Empirical evidence of short forms (Smith, McCarthy & Andersen)
Examined 12 short forms to examine equivalence to longer original form,
-found that if large measure does not have good validity, how can a short one?
-by reducing the items, the content coverage may be compromised
-significant reduction in reliability coefficients
-many researchers do not run another factor analysis on short forms
-need to administer short form to an independent sample to determine validity
-need to use short form to classify clinical populations and compare to long form
-need to establish genuine time and money savings with a short form
Item response theory 2 types
difficulty and discriminability
Item Response Theory
a mathematical and statistical tool to determine item quality, to see how items look different based on specific groups or individuals who are a part of a group
Classical testing theory is limited because
all error is lumped together in one term, E (in the formula); we can't determine error at the individual item level
Item Response theory relating to error from Classical Testing theory
allows us to examine error at the item level, using hierarchical mathematical modeling to observe scoring patterns
Two types of item analysis
item difficulty and discriminability
How do we know what a good item is on a test
First we did factor analysis, but there are sometimes problems with this; according to IRT, we examine item difficulty and discriminability
Item difficulty Dichotomous
defined by the proportion of people who get a particular item correct, ex: if 84% of people get item #24 correct, then the difficulty level for that item is .84
Item difficulty levels based on higher or lower Dichotomous
the higher the number the easier the item, the lower the number the harder the item
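The difficulty index for a dichotomous item can be sketched as a proportion (hypothetical response vector, scored 1 = correct, 0 = incorrect):

```python
# Minimal sketch: dichotomous item difficulty is the proportion of
# examinees answering the item correctly.

responses = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]  # hypothetical item responses

difficulty = sum(responses) / len(responses)
print(difficulty)  # -> 0.8 (higher value = easier item)
```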
Item difficulty is based on
Chance
What should item difficulty be set at?
it should be set at a moderate level, with the average item difficulty equal to about .50
When deciding difficulty levels need to consider what
depends on who you are testing, ex: items for medical students should be around .2 vs. .7-.9 for students with disabilities (whose level of skill is limited)
What is the best level of difficulty?
the best tests choose items that are between .3 and .7 in difficulty
Test floor
you should have a sufficient number of easy items (ex: for examinees with disabilities); this tests the floor
Test ceiling
sufficient amount of hard items (for doctoral level students, medical students)
item discriminability Dichotomous
determines whether people who have done well on a particular item have also done well on the entire test
extreme group method
compares people who have done very well with those who have done very poorly on a test
How is discrimination found? Dichotomous
an item that discriminates between the upper group and the lower group is a very good item, because it is able to discriminate between groups
difference between higher and lower numbers for discrimination Dichotomous
the higher the number the more discrimination, the lower the number the less discrimination
overthinking the problem
when there is a negative number in discrimination (high scorers missing an item that low scorers get right)
D= index of discrimination
the number of persons passing the item in the upper and lower groups is expressed as percentages; the difference between those percentages is the index of discrimination
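The index of discrimination can be sketched as follows (hypothetical counts; the function name is invented for illustration):

```python
# Sketch of the extreme-group index of discrimination:
# D = proportion passing in the upper group - proportion passing in the lower group.

def discrimination_index(upper_pass, upper_n, lower_pass, lower_n):
    return upper_pass / upper_n - lower_pass / lower_n

# 15 of the top 20 scorers pass the item, but only 5 of the bottom 20:
print(discrimination_index(15, 20, 5, 20))  # -> 0.5 (item discriminates well)
```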
how do we know it is dichotomous?
whenever we see the word "correct," because dichotomous means right or wrong
Point Biserial Method
find the correlation between performance on the item and performance on the entire test
Point Biserial positive meaning
ranges from -1 to +1; if the number is positive, or closer to +1, it tells us that the item discriminates: those who scored higher on the test also tended to get this particular item correct
Point Biserial negative meaning
ranges from -1 to +1; a negative point biserial indicates that there may be a problem with the item
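The point-biserial method can be sketched as an ordinary Pearson correlation between a 0/1 item vector and total scores, which is equivalent for dichotomous data (function name and data are hypothetical):

```python
from math import sqrt

def point_biserial(item, totals):
    """Pearson correlation between a 0/1 item vector and total test scores."""
    n = len(item)
    mx, my = sum(item) / n, sum(totals) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(item, totals))
    sx = sqrt(sum((x - mx) ** 2 for x in item))
    sy = sqrt(sum((y - my) ** 2 for y in totals))
    return cov / (sx * sy)

# Hypothetical data: the three highest scorers got the item right.
item   = [1, 1, 1, 0, 0]
totals = [48, 45, 40, 30, 25]
print(point_biserial(item, totals))  # positive and near +1: the item discriminates
```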
Point Biserial chart explanation
the chart shows the relationship between an item's difficulty and the proportion of individuals getting it correct
item characteristic curves are
dichotomous and let you know if the item is good
overthinking representation
when the curve rises and then falls on an item characteristic curve
ex: the upper group's curve goes up and then comes back down
we focus on which group?
the upper group for dichotomous
item response function
a mathematical function describing the relation between where an individual falls on the continuum of a given construct such as depression and the probability that he/she will give a particular response to a scale item designed to measure that construct
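The item response function can be illustrated with a common two-parameter logistic form (a sketch only; the discrimination parameter a and difficulty parameter b follow common IRT convention, but exact parameterizations vary by source):

```python
from math import exp

def irf(theta, a=1.0, b=0.0):
    """Probability of a given response at trait level theta (2PL logistic)."""
    return 1 / (1 + exp(-a * (theta - b)))

# At theta == b the probability is exactly .50; a steeper a discriminates more.
print(irf(0.0, a=1.5, b=0.0))   # -> 0.5
print(irf(2.0, a=1.5, b=0.0) > irf(2.0, a=0.5, b=0.0))  # steeper curve separates more
```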
difficulty for non-dichotomous
is symptom severity: L = mild, M = moderate, and U = severe; the farther a curve is from the y-axis, the more severe
Discriminability nondichotomous
means that the item discriminates between individuals that have severe symptoms and mild symptoms
item difficulty curve non-dichotomous
the curve that is furthest from the y-axis is considered the most difficult
item difficulty and discrimination for non-dichotomous will always provide
the mathematical model will always provide a curve to show these
item discrimination curve for non-dichotomous
the curve that has the steepest slope is most discriminating item
advantages of IRT over CTT
IRT can model the probability of getting an item correct based on the test taker's ability and qualities. It can be adapted to computerized administration that presents specific items matched to ability level; IRT lets us better test those at higher and lower abilities; it lets us compare different groups (ethnicities, genders) on the same items to examine patterns of responding; and it allows us to move away from biased questions, with greater accuracy at the item level
generalizability theory is based on what aspects of the test?
the overall test, a new understanding of reliability
Why is generalizability theory moving away from Classical Testing Theory?
to understand how reliability is affected by various sources of error
Classical Testing theory only assumes 2 sources of error
random and systematic error
measurement error
error that is associated with trying to quantify a specific construct or concept
measurement error is associated with 3 errors
procedural error, instrumental error and evaluator error
procedural error
a non-standardized administration; this is not chance based, because the more you practice, the less you will commit this error
instrumental error
error associated with the instrument or the items on the test
evaluator error
any error that is committed by the assessor, one could be making problematic interpretations about the data, or not scoring correctly
measurement error is similar to circumscribed error
in that it tries to account for all possible sources of error
2 components of generalizability theory
generalizability and dependability
generalizability
can we generalize this observed test score to all the possible scores in the universe of scores for that person?
ex: a husband and wife test drove a Prius one time and said it was great; they are generalizing, saying that all Priuses are good
-when testing someone one time, does their observed score represent their true score?
dependability
will the observed score remain constant even if we change the testing parameters
ex: they have a new Prius; it does great in normal weather, but it doesn't work well when it's raining. Will it remain constant in how it drives if the condition of the road changes?
generalizability closer to 1 means
the closer it is to 1, the more confident we are that the observed score can be generalized to all the possible scores for that particular person
dependability closer to 1 means
the closer to 1, the more the observed score will remain constant irrespective of the testing parameters
generalizability theory allows us to look at measurement error, which could include
items on test, raters, setting, assessment, time
ex: setting in a prison, could give different responses
problems with classical testing theory is that they only recognize
two sources of variance (test-retest and internal consistency)
according to generalizability theory, "variance" and "error" in classical testing theory are
synonymous words
how does the generalizability theory extends the true score model?
by acknowledging that multiple factors may affect the error associated with the measurement of one's true score
rater is another way of saying
assessor
Sources of error
noisy room, specific items, examinee fatigue, administrator of the test (some people will have minimal experience, some will have a lot); none of these could be addressed in CTT
Fundamental equation
reliability = Var(T) / Var(X), where Var(X) = Var(T) + Var(E)
the larger the variance of T in relation to X, the higher the reliability
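The fundamental equation can be sketched numerically (the variance values below are hypothetical):

```python
# Numerical sketch of the fundamental equation:
# reliability = Var(T) / Var(X), with Var(X) = Var(T) + Var(E).

def reliability(var_true, var_error):
    return var_true / (var_true + var_error)

print(reliability(80, 20))  # -> 0.8
print(reliability(80, 5))   # smaller error variance, so reliability rises
```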
sources of variance
p = person taking the test, i = items on the test, e = random error, pi = interaction between the person taking the test and the items on the test
the bigger circle on the Venn diagram says what about error
that source contributes more error
adding another source of variance to the Venn diagram
j= judge (evaluator)
pj= person interacting with the judge
ij= item and judge interaction (some judges might favor certain items vs. other items)
pij= interaction with the person taking the test, the items on the test and the judge
Norm oriented perspective
tends to be associated with generalizability coefficients; only uses indices that have p (person) involved
Domain-oriented perspective
associated with the dependability coefficient, and they look at all the indices
whenever you see a T in the formula, what is it equal to?
T is equivalent to P
true score is equivalent to person
What do we use to understand item discriminability with dichotomous scoring
Extreme group & Point Biserial