2nd exam Flashcards
What does validity look at?
Accuracy: whether the test is measuring what it is intended to measure
The overarching validity that all others fall under
Construct Validity
Does validity look at the entire test or item quality?
Like reliability, it looks at the entire test
What are the validities under construct validity
Content, convergent, discriminant, criterion related, incremental, ecological
Construct validity
assessing the accuracy of the test to measure certain personality or psychological constructs
One problem with construct validity
is that the test must be robust enough to accurately measure psychological constructs that often are not stable
With construct validity, we need to understand that when the theories are changing,
the testing is changing too
What does a test need to truly measure what it is intended to measure?
it has to have a certain level of timelessness
Content validity
items need to cover the material that the instrument is supposed to cover, how relevant are the items to the construct
Content validity is looking at 2 questions
does the test cover a representative sample of specified skills and knowledge? Is the test performance reasonably free from influence of irrelevant variables?
What does content validity assume?
assumes that you have a good detailed description of the content domain- which is not always possible
What are the issues with experts administering tests with content validity?
Experts may be too invested in or biased toward the construct, or they have no lived experience, only theoretical knowledge
What happens with content validity, if measure does not have appropriate content?
you will make incorrect or erroneous clinical judgments based on measure
Why is content validity important to consider?
Because a measure may be valid only under certain circumstances or time periods (ex: issues with how gender is understood nowadays)
ex: cultural variability; not all cultures experience depression the same way
Differences between trait-based depression and state-based depression?
Trait (endogenous): no precipitating factors; no situation caused it, it happened chemically
State: there is a stressor; it has precipitating factors
One way we come up with content validity is by using
Focus groups
Focus groups
allow you to go to individuals who have experienced the construct for them to help give appropriate items for that construct in a group setting
How are focus groups beneficial?
get a deeper understanding of the construct and people may feel comfortable to share their experiences in a group
Lived experiences within focus groups helps with
the accuracy of the construct, creates better validity
Why were focus groups not used as much from 1967 to 2002, when many tests were created?
It is hard to find a sample and hard to get funding for one
What are some drawbacks of content validity focus groups?
Some people are limited in their ability to participate (ex: the chronically mentally ill), it is hard to develop focus groups for rare constructs, and some facilitators will lead the group in a biased fashion
Examples of questions for focus groups
-“what was it like when you were deployed?”- for veterans to help with the language content in the test
What happens in content validity after the items have been generated through experts and focus groups?
experts will evaluate your scales and response options and help you write clearer questions
Criterion related validity
assesses the degree to which the scores on an instrument accurately compare with a relevant criterion variable (a real-life variable)
Criterion
real world implication
Two types of criterion related validity
Predictive and concurrent validity, both have a validity coefficient, the closer it is to 1, the more accurate it is in predicting
Predictive validity
predicts an outcome (criterion) (ex: the SAT predicts 1st-year college GPA, a real-life outcome)
In predictive validity, validity coefficients look at
correlating the test scores and the criterion variable for each person
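The card above can be sketched in code: a validity coefficient is just the Pearson correlation between test scores and the criterion variable. A minimal pure-Python sketch with made-up numbers (the function name and all data are illustrative, not from the course material):

```python
# Hypothetical data: a predictor (test scores) and a real-life criterion
# (e.g., first-year GPA) for the same five people. The validity coefficient
# is the Pearson correlation between the two columns.

def pearson_r(xs, ys):
    """Pearson correlation of two equal-length lists."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

test_scores = [1200, 1350, 1100, 1450, 1250]   # made-up predictor scores
gpas        = [3.1,  3.6,  2.8,  3.9,  3.2]    # made-up criterion values

r = pearson_r(test_scores, gpas)
print(round(r, 3))  # the closer to 1, the stronger the predictive validity
```

The same calculation works for concurrent validity; only the timing of the criterion measurement differs.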
Concurrent validity
when new test is administered and criterion variable is also collected at the same time - want to see if the measurement is matched to the criterion (real life variable)
Example of concurrent validity
ex: Chicago School interview day; we wrote a paper and had an interview (the interview is the real-life variable because it is an interaction), and we want to see if they are correlated
Convergent validity
the testing measurement is shown to be correlated with another measurement that examines the same construct (the BDI and Hamilton Depression Inventory should have high convergent validity)
2nd type of convergent validity
measuring the same construct with multiple different measurement methods (ex: self-report of impulse control and family reports of impulse-control problems)
convergent validity should be close to
1, which indicates that both measures are considered to be measuring the same construct
Discriminant validity
is where the testing measurement is shown to be minimally correlated with another measurement that is examining a construct that is dissimilar
With discriminant validity we want the validity to be
Close to 0 to show there is no overlap between measures or minimal association
The problem with discriminant validity
very hard to find constructs that are opposites of one another, many overlap
Face validity (not a real validity)
pertains to whether the test looks valid to the examinees taking it; a subjective and imprecise form of validity
Procedures to ensure validity
contrasted groups, previously available tests, criterion established by the rater, age differentiation, physical evidence of the behavior, real time observations in the world, controlled stimuli that depict variations in behaviors
Contrasted groups
give a test to two different kinds of samples that differ with regard to a specific trait; they contrast
Previously available tests
using previous tests as the criterion to compare new test
ex: the WAIS is always compared to the WISC; they should be correlated with one another
Age differentiation
want to make sure the test is checked against chronological age or developmental milestones to determine whether scores increase with advancing age (ex: a 14 year old cannot do what a 17 year old can do)
Criterion established by the rater
Wants a clinical interview/diagnosis to match the instrument that was given. What you find in the instrument should also be found in a clinical intake interview
Criterion contamination
the rater doing the clinical intake also knows what the testing has shown, and it influences their diagnosis
The criterion established by the rater can sometimes cause
criterion contamination, this is a problem with ensuring validity
physical evidence of the behavior
have people wear pedometers and check whether their reported walking behavior (exercise) matches
real time observations in the world
very labor intensive (ex: GRE and college performance)
Controlled stimuli can be created that depict variations in behaviors
videotape marital interactions and then see whether they correspond to self-report measures of marital satisfaction
Cautions when interpreting validity
1. Testing procedures are not always reproduced accurately
2. The criterion variable means nothing unless it is important and reliable
3. Make sure the sample is representative of the population
4. You need an adequate sample size
5. Don't confuse criterion with predictor (someone doing well on the SAT does not mean they will get a 4.0 in college)
6. Check for a restricted range on both predictor and criterion; you want the full normal curve
7. Is it generalizable?
8. Consider differential predictions (it may not be the SAT that predicts GPA; it could be motivation)
What does sample size do to validity?
A larger sample size increases it
What do outliers do to validity?
They can increase validity, because keeping them gives you the full normal curve and all of the data
Incremental validity
a statistical method that measures how much more predictive a new assessment is than existing ones
Incremental validity is often seen in a chart of a
Regression Table looking at R squared
R squared means
the amount of variance in the criterion that can be predicted by a test
What happens when the R squared values in a regression table are compared across steps?
The increase in R squared at each step shows the incremental validity of the newly added measure
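The hierarchical-regression idea behind incremental validity can be sketched in pure Python: fit the criterion on one predictor, then on two, and compare R squared. All data, names, and the helper function are hypothetical; this is only an illustration of the delta-R-squared logic, not any specific published analysis.

```python
# Tiny hierarchical-regression sketch (made-up data). ols_r2 solves ordinary
# least squares with an intercept via the normal equations and returns R^2.

def ols_r2(X, y):
    """R^2 from least squares with intercept; X is a list of predictor columns."""
    n = len(y)
    cols = [[1.0] * n] + X                 # prepend the intercept column
    k = len(cols)
    # Build the normal equations A b = c
    A = [[sum(cols[i][t] * cols[j][t] for t in range(n)) for j in range(k)]
         for i in range(k)]
    c = [sum(cols[i][t] * y[t] for t in range(n)) for i in range(k)]
    # Gauss-Jordan elimination (fine for these small, well-behaved matrices)
    for i in range(k):
        p = A[i][i]
        A[i] = [v / p for v in A[i]]
        c[i] /= p
        for r in range(k):
            if r != i:
                f = A[r][i]
                A[r] = [A[r][j] - f * A[i][j] for j in range(k)]
                c[r] -= f * c[i]
    b = c
    yhat = [sum(b[i] * cols[i][t] for i in range(k)) for t in range(n)]
    ybar = sum(y) / n
    ss_res = sum((y[t] - yhat[t]) ** 2 for t in range(n))
    ss_tot = sum((y[t] - ybar) ** 2 for t in range(n))
    return 1 - ss_res / ss_tot

# Made-up scores: an existing test, a new test, and the criterion outcome
old_test = [10, 12, 14, 16, 18, 20]
new_test = [3, 1, 4, 1, 5, 9]
outcome  = [20, 23, 30, 29, 40, 46]

r2_step1 = ols_r2([old_test], outcome)             # step 1: old test alone
r2_step2 = ols_r2([old_test, new_test], outcome)   # step 2: add the new test
print(f"delta R^2 = {r2_step2 - r2_step1:.3f}")    # incremental validity
```

The delta R squared at step 2 is what a regression table reports as the new measure's incremental contribution.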
Ecological Validity is related to what type of psychology?
Neuropsychology
Can your measure be valid and not reliable?
No, if you have accuracy you also have consistency
Can your measure be reliable and not valid?
Yes, you can have a test that is reliable but not valid
Ecological Validity
how well test results generalize to real-life or real-world settings
Why do you give tests looking at Ecological Validity to neuropsych patients?
They have issues of daily living
Two ways to establish Ecological Validity
Verisimilitude & Veridicality
Verisimilitude Pt. 1
concerned with the equivalence of tests to the everyday activities they simulate (ex: giving a grocery list to neuropsych patients)
Veridicality Pt. 2
the degree to which the test shows an empirical relation to measures of cognitive functioning (it should map onto every part of problem solving)
Standardization group
group of test takers who represent the population for which the test was intended
Norms
Performance of standardization group
Why are norms important?
you want them because norms allow you to have some point of comparison to the individual you are testing compared to a larger group
Mental age
how far along normal development path one has progressed
Example of mental age
giving a math test to a 7-year-old: getting 15 out of 30 questions correct is developmentally 7, but scoring less than that seems developmentally younger, and this could be a problem
Why is mental age a problematic norm?
Because it labels a child as delayed when they may have a simple issue with math or spelling
What happens if Norms are flawed?
The test becomes flawed; you have to know what the norms are, because otherwise the test is applied inappropriately
Tracking
determining growth along a specific biological path (ex: formula-fed babies ranked at a higher percentile and breastfed babies at a lower percentile; the studies only looked at formula so companies could profit, which creates a problematic norm)
Ordinal Scale
are designed to identify the stage reached by the child in the development of specific behaviors or functions–generally seen in infancy
Grade equivalent
they are determined by computing the raw scores obtained by children in each grade, then converting them to the grade-placement scale
Problems with grade equivalent norms
often misinterpreted; they do not mean the same percentile rank for each content area
Gifted
gifted means they are strong in certain areas or tasks, performing above the norm
Example of grade equivalent
A 30-question math test for 4th graders; Johnny got 24 answers correct, which is a 9th-grade equivalent
"My child is performing at a 9th-grade level in 4th grade": no, this is incorrect; the child is doing extremely well in 4th-grade math, but they are not doing 9th-grade math
What is the gold standard of norms?
Within-group norms; they are easy to use
Within group norms
an individual's performance is evaluated in terms of the most nearly comparable standardization group (comparing a child's raw score with those of other children of a similar age)
How is percentile expressed in within group norms?
expressed in terms of the percentage of persons in the standardization sample who fall below a given raw score
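The percentile definition above is easy to express in code. A minimal sketch with a made-up standardization sample (the function name and all numbers are illustrative):

```python
# Percentile rank: the percentage of people in the standardization sample
# whose raw scores fall below a given raw score.

def percentile_rank(sample, raw_score):
    below = sum(1 for s in sample if s < raw_score)
    return 100.0 * below / len(sample)

# Hypothetical raw scores from a standardization group of 10 people
norm_sample = [8, 12, 15, 15, 18, 20, 22, 25, 27, 30]

print(percentile_rank(norm_sample, 22))  # → 60.0 (6 of 10 scores fall below 22)
```

Real tests use much larger stratified samples, but the comparison logic is the same.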
Problem with within group norms
there is an inequality of units at the extremes of the distribution
How do we determine a stratified sample?
Looking at census data
Standard score
expresses the individual's distance from the mean in terms of the SD and the mean of the distribution
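That definition is a one-line formula. A sketch using the familiar IQ metric (mean 100, SD 15, which matches the 15-point/1-SD fact later in this deck):

```python
# Standard (z) score: distance from the mean in SD units.
def z_score(raw, mean, sd):
    return (raw - mean) / sd

# IQ-style metric: mean 100, SD 15
print(z_score(85, 100, 15))   # → -1.0 (one SD below the mean)
print(z_score(115, 100, 15))  # → 1.0 (one SD above the mean)
```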
Scaled scores
take into account grade-level norms and the difficulty of the test; these range from 1 to 100 or 100 to 1
Does percentile change in scaled scores due to more points?
You may get more points in your score, but your percentile may not change
Stanine
consists of single digits ranging from 1 to 9. Mean score is 5 and standard deviation is 2, cuts up normal curve into 9 pieces
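The stanine card can be sketched with the common conversion from a z-score (stanine = 2z + 5, rounded and clipped to 1-9); the helper function is illustrative:

```python
# Stanine from a z-score: mean 5, SD 2, clipped to the 1-9 range.
def stanine(z):
    return max(1, min(9, round(2 * z + 5)))

print(stanine(0.0))   # → 5 (average performance)
print(stanine(1.5))   # → 8 (well above average)
print(stanine(-3.0))  # → 1 (clipped at the bottom of the range)
```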
Who are domain-referenced tests best used for?
For children with developmental delays or autism, because these children are not compared to others of the same age, only to their own best performance
Domain referenced test
describes the specific types of skills or tasks that the test taker demonstrates; the information is used to determine where the child needs growth
Do domain referenced tests have norms?
They have no norms
Standardization requires
a large enough sample that is representative of the population
What could be an issue with standardization and norms?
tests are normed differently and even if different tests are designed to measure the same construct they may have very different standardization groups
Who was Cicchetti?
developed the overlap sample, which means you use the same standardization group (the same sample of people) for multiple tests
Cicchetti said development has two major focus areas
cognitive functioning (cognitive abilities) & adaptive functioning (daily-living skills, ex: tying your shoes); these both need to be in line with each other
Why do overlap samples tend to not happen?
they are very costly and time consuming, so we have to find other ways to develop norms
Criterion referenced testing
another test that does not have any norms; scores are compared to specified content (ex: mastery testing)
Fixed reference groups
scores on a test are compared to a fixed group at a specific time
ex: the SAT is a fixed-reference-norm test; the fixed group was 11,000 white males who took the test in 1941, and their mean score became the mean of the SAT
Cohort effect
everyone thinking the same thing (groupthink)
Does a fixed group norm have standardization?
No
Criterion Referenced Testing (mastery testing 2 basic components)
to determine the proportion of items that must be correct to establish mastery, and how many items are necessary to determine mastery positively
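The first component (a proportion-correct cut score) can be sketched in a few lines. The 80% cutoff and 25-item test length here are illustrative assumptions, not values from the course material:

```python
# Mastery-testing sketch: did the test taker clear a hypothetical cut score?
def mastered(items_correct, items_total, cutoff=0.80):
    """Return True if the proportion correct meets the mastery cut score."""
    return items_correct / items_total >= cutoff

print(mastered(21, 25))  # 84% correct → True
print(mastered(19, 25))  # 76% correct → False
```

The second component, how many items a test needs, is a separate reliability question about how confidently a cut-score decision can be made.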
Do fixed referenced norm tests work?
They are not fair to everyone!
Is SAT biased?
Yes because of fixed referenced norms
Anything based on census data is not what?
Fixed anymore
With criterion-referenced testing there are cut-off points; what are the disadvantages of these?
They may increase error in judgments (you should always have more than one testing measure to determine mastery)
Test bias applies to both
the item level and the quality of the overall test
Two types of test bias
gender bias in regard to math tests & African Americans (Racial bias) with cognitive tests (these have the most empirical data)
Gender Bias w/ math tests
test scores are systematically different for women who take standardized math tests compared to men
Racial Bias
systematic difference of African American scores on cognitive tests as compared to their white counterparts
What happens to most African Americans when they get to college?
More than half drop out
With PHD, how is that racially biased?
Only 7% of PhDs go to African Americans; 56% go to Whites
Test bias
systematic error that occurs in test scores when tests are applied to other ethnicities
How do African Americans score on IQ tests?
they score 15 points lower on IQ tests than their white counterparts (puts them 1 SD lower)
The Bell Curve book said that
the reason African Americans score lower on IQ tests is that they have predisposed cognitive deficits (a very racist claim)
Claude Steele
developed the concept called the stereotype threat
Stereotype Threat
there are certain stereotypes "in the air" that get activated under certain situations, and those stereotypes dictate the way one behaves
Steele did a study with 2 conditions (one group was told it was problem solving; the other was told it was a math test)
It was the same test, but women underperformed in the math-test condition whereas men overperformed (this is stereotype threat): women think they cannot do well at math due to gender bias
Race example of stereotype threat by Steele
Gave the same SAT test to individuals split into two conditions: one group was told to write down their race first, and the other was told to just start the test. Black participants who wrote down their race did worse compared to the white participants
There is no way around what?
Stereotype threat; it becomes activated in certain situations
Disidentification (With Stereotype Threat) Steele
rejecting specific traits of one's identity; people do this so they can maintain the status quo
Example of disidentification
People who may be bad at academics want to go to school to be an athlete; they reject the specific trait of academics, which closes the door to more academic opportunities
How do you stop from rejecting or disavowing specific traits?
By saying you are working on them
Issues with stereotype threat
According to Steele and Aronson, this does not imply that taking away the stereotype threat will eliminate the bias in scoring between Blacks and Whites
With test bias, what happened when the questions thought to be biased were removed?
There were no real differences in scoring
Differential item functioning
Attempts to identify on standardized tests those items that are biased against ethnic minority populations
Helms (2006) female African American psychologist
believes that the issue is not psychometric equivalence; rather, testing by its nature poses issues of fairness, especially as it is applied to ethnically diverse populations
What does Helms feel is unfair?
Viewing African Americans who perform 1 SD below Whites as part of the achievement gap
According to Helms why are tests not fair?
not because they are psychometrically unsound, but because they minimize the effect of "internalized racial or cultural experience" that affects the test taker and the testing process
What is the reason for ethnic suppression?
African Americans and other ethnic minority groups experience cultural discrimination and prejudice that affect their testing performance
Reasons for test bias? Helms
It is sample dependent and certain ethnic groups respond in specific ways
What does Frisby (1999) say about test bias?
Stereotype threat and Helms's model can affect test scores even if the test is theoretically sound
How do African Americans suffer from test bias?
They would not be selected or promoted for educational or career advancement about 85% of the time, since the gold standard of achievement is how Whites perform
Predictive bias
test scores show different prediction or classification based on the group (majority vs. minority)
Slope Bias
two different regression lines for each group create differential predictions for each group; the test or measurement procedure yields systematically different validity coefficients for members of different groups
Intercept Bias
when two groups have similar slopes but their intercepts differ; in other words, they score similarly on the test but differ on the criterion score
What do you need with slope bias?
you need the two lines to have different slopes, meaning they are going in different directions
Biases in tables look at
P (significance) less than .05 means there is significance, showing some type of bias against African Americans
Y= mx+b what does this mean?
m = slope & b = y-intercept (where the line hits the y-axis)
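The slope/intercept distinction above can be sketched by fitting y = mx + b separately for two groups. All data here are made up to show the pattern: the two hypothetical groups share a slope (no slope bias) but have different intercepts (intercept bias):

```python
# Fit a least-squares line y = m*x + b for each group, then compare
# slopes (slope bias) and intercepts (intercept bias).

def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    m = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    b = my - m * mx
    return m, b

# Made-up (test score, criterion) pairs for two groups
group_a_x, group_a_y = [40, 50, 60, 70], [2.0, 2.5, 3.0, 3.5]
group_b_x, group_b_y = [40, 50, 60, 70], [1.5, 2.0, 2.5, 3.0]

m_a, b_a = fit_line(group_a_x, group_a_y)
m_b, b_b = fit_line(group_b_x, group_b_y)

print(abs(m_a - m_b) < 1e-9)  # True: same slope, so no slope bias here
print(abs(b_a - b_b) > 0.1)   # True: different intercepts = intercept bias
```

With equal slopes and a lower intercept, group B earns systematically lower criterion scores at every test score, which is exactly the intercept-bias pattern described above.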
When is there an overprediction of a groups scores in a graph?
when one line is higher or farther than the other line
When do you see intercept bias?
when the lines hit the y-axis at different points, there is intercept bias
When is criterion the same in a graph?
when the scatter plot points are hitting the center on same line on horizontal axis
Scatterplot divided into four quadrants
correctly accepted- if all scattered closer together
incorrectly accepted- next to correctly accepted below it
correctly rejected- a lot but more scattered from each other
incorrectly rejected- less scatterplots next to correctly rejected
in a validity chart, the values in parentheses are
reliability coefficients of internal consistency
in a validity chart, the ones that are underlined?
convergent validity: different measurement methods looking at the same constructs
Heart of scaling and classification & heart of test construction
Content validity