Psych testing Flashcards
A. What are the two parts of test content evidence of validity?
B. For each of the two parts of test content evidence for validity, give an example of how a test might fail to meet these forms of evidence.
C. How would this test content evidence for validity be identified (i.e., what would a psychologist do to evaluate the evidence of test content validity?)
A. Relevance and Representativeness
B. Test items lack relevance when their content is not related to the construct the test is intended to measure. For example, if an employment test uses vocabulary beyond the English level needed on the job, some groups are disadvantaged because the test assesses their understanding of words rather than their job capabilities.
A test lacks representativeness when it does not cover critical elements of the construct, or is biased towards one aspect. For example, an English test that assesses only spelling, and not reading and writing, is not an accurate representation of overall English ability.
C. Have expert judges rate the relevance and representativeness of the test content.
Name at least four sources of evidence for test validity
Give a brief description of each of these.
- Content
- Item response processes - the items should engage the cognitive processes the test is intended to measure; evidence can come from eye tracking or think-aloud protocols
- Internal structure - items should be related to one another theoretically and empirically
- Relationship to other variables - scores should correlate with measures of similar constructs (convergent evidence) and diverge from unrelated constructs (discriminant evidence). Scores should also predict a criterion, e.g. intelligence test scores correlate with grades.
- Consequences of testing - tests should be used for their intended purpose and should not have negative consequences such as discrimination, e.g. test results feeding league tables that lead students to avoid lower-scoring schools
Consider a test measuring the personality trait of suggestibility, assessed with self-report items such as:
When I hear an unfamiliar statement, I think it is true: 1) strongly disagree; 2) disagree; 3) agree; 4) strongly agree
Name the five sources of validity evidence that could be obtained for this test.
Imagine you are a test developer who needs to design studies to collect validity evidence for this test. Describe a study or series of studies you could run to collect AT LEAST THREE forms of validity evidence.
Content, item response process, internal structure, relationship to other variables, consequences of testing
Suggestibility: how readily people accept and act on the suggestions of others - related to self-esteem, assertiveness
Content validity: have expert judges rate whether the items are relevant to suggestibility and whether, taken together, they represent the whole construct.
Internal structure: administer the test to a sample and check whether the items correlate with one another (e.g. via internal consistency or factor analysis).
Relationship to other variables: correlate test scores with measures of related constructs (e.g. assertiveness, self-esteem) and with unrelated constructs; scores should relate to the former but not the latter.
Consequences of testing: check that the test is used only to assess the extent of a person’s suggestibility, and is not taken to show that highly suggestible people lack confidence in all areas of their lives.
A. Briefly describe computerized adaptive testing (CAT), outlining how it differs from non-adaptive testing.
B. What are the advantages of CAT?
C. What are the disadvantages of CAT?
D. Critically evaluate the CAT for a hypothetical application: nation-wide selection of students for selective high schools.
In adaptive testing, each item is chosen based on the test-taker’s previous responses. This matches item difficulty to the test-taker’s ability level, so they neither get all answers correct nor all incorrect. Computerised adaptive testing (CAT) uses an algorithm to adjust item difficulty for each individual.
Advantages: it reduces the cost of booking test venues and of test proctors, as the test is taken on a computer. Matching difficulty to ability avoids ceiling effects (items too easy, so everyone scores high) and floor effects (items too difficult, so everyone scores low). It reduces cheating, since test-takers receive different items, and better-targeted tests reduce fatigue and boredom.
Disadvantages: it requires a large bank of items, each of which must be calibrated before it can be matched to test-takers; this is time-consuming and expensive. Because the test is completed on a computer, technological problems can occur.
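The adaptive logic can be sketched as a simple staircase rule (a hypothetical simplification: real CATs select items using item response theory and a statistical ability estimate, not a fixed step size):

```python
# Minimal sketch of adaptive item selection (hypothetical simplification;
# real CATs use IRT-based ability estimation, not a fixed step size).

def run_cat(item_bank, answers_correct, start=0.0, step=0.5, n_items=5):
    """item_bank: list of item difficulties; answers_correct: function
    taking a difficulty and returning True if the test-taker gets it right."""
    ability = start
    administered = []
    for _ in range(n_items):
        # give the unused item whose difficulty is closest to the estimate
        item = min((d for d in item_bank if d not in administered),
                   key=lambda d: abs(d - ability))
        administered.append(item)
        if answers_correct(item):
            ability += step   # correct: probe harder items
        else:
            ability -= step   # incorrect: probe easier items
    return ability, administered
```

A test-taker who can solve every item easier than some threshold is quickly steered to items around that threshold, which is how CAT avoids the floor and ceiling effects mentioned above.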
CAT for nation-wide selection for selective high schools: with so many test-takers, computer administration would be more cost-effective. However, this is high-stakes testing that determines whether a student enters a selective school, so technical failures and unequal access to technology are serious risks; CAT may therefore not be appropriate for this purpose.
A test-developer wants to collect data to test whether her rating scale of maturity (designed for sixth-graders) is reliable and valid. Critically evaluate EACH of the following scenarios for collecting such evidence:
A. She tests 50 students from the nearest primary school to her university (the Tiliopoulos School for the Gifted Child), getting them to complete the test 2 months apart, and also gets peer-ratings of maturity.
B. She tests 100 students drawn from 5 schools (1 Catholic girls school, 1 Catholic boys school, 2 public mixed-gender schools from high and low SES areas, and 1 private mixed-gender school). They complete the assessment 2 years apart so she can monitor the growth of maturity. She obtains teacher and peer ratings of maturity.
A. Small sample size (n = 50) - estimates of reliability and validity will be imprecise.
The nearest school is for gifted children - scores may show ceiling effects if gifted students are more mature, so evidence from this sample may not generalise to typical sixth-graders.
2-month test-retest interval - students may remember the items, which can inflate the test-retest reliability estimate.
Peer ratings only - peers may be biased raters; the agreement evidence would be stronger if teachers also provided ratings.
B. Larger sample size (n = 100) - more precise estimates.
Sampling from different types of schools - a more representative sample, so findings generalise better.
2-year test-retest interval - real growth in maturity is expected over 2 years, so a low correlation would reflect development as much as unreliability; this interval suits monitoring change more than estimating reliability.
Teacher and peer ratings - these can be correlated to estimate inter-rater reliability.
What are the six domains of occupational interests in Holland’s interest inventory?
Provide a brief definition of each of these domains.
What is the framework for organizing these (i.e., how do the six interests relate to each other)?
The six types are arranged in a hexagon in RIASEC order: adjacent types are most similar, and opposite types least similar (e.g. Realistic correlates only .06 with Social).
Realistic - practical, hands-on jobs e.g. chef, plumber, firefighter, florist, driver, surgeon (close to C, I)
Investigative - scientific pursuits, a thinker e.g. professor, pharmacist, psychologist, dietician (close to R, A)
Artistic - creative e.g. artist, designer, writer, musician (close to I, S)
Social - helping others e.g. counsellor, teacher, customer service (close to A, E)
Enterprising - business, leading, persuaders e.g. PR, marketer, manager, entrepreneur, HR (close to S, C)
Conventional - organising e.g. accountant, actuary, maths teacher (close to E, R)
Name and briefly describe FIVE major uses of psychological tests. An example use would be “certification for employment: when tests are used to provide a credential required to practice a particular occupation” (you will not receive marks for re-stating this example use).
For at least three of these major uses, name one or more example tests.
Classification: selecting or placing students in educational programs e.g. OC and selective-school tests, GAMSAT, UMAT
Diagnosis: identifying clinical conditions e.g. Depression Anxiety Stress Scales (DASS), McGill Pain Questionnaire
Coaching: guiding self-development and career choice e.g. Career Assessment Inventory
Forensic: assessing, for example, an individual’s fitness to stand trial, or impairment for compensation purposes
Research: e.g. personality research uses the NEO-PI-R and the Satisfaction with Life Scale
Based on research and theory in response distortion, what recommendations would you make to psychologists using psychometric tests for high-stakes applications?
Use items that are less transparent (whose desirable answer is not obvious), so that they are harder to fake.
A. Name and briefly describe each of the three research methods used to study faking on personality assessments
B. Critically evaluate each of these methods
Group Comparison - compare scores from a group with an incentive to distort (e.g. job applicants) with a comparable group without one (e.g. incumbents); the difference estimates how much responses are distorted in practice. However, the groups may also differ in real ways, so not all score differences reflect faking.
Instructed Faking - compare answers given under honest instructions with answers given under instructions to maximise one’s score. However, the “honest” condition may itself be distorted: self-deceptive bias means people are not completely aware of their own personality.
Incentive Manipulation - compare scores when there is an incentive to do well (e.g. money) with scores when there is not. If the answers differ, faking is likely under incentive, and the no-incentive scores are taken as the more accurate baseline.
Test-takers are known to fake their responses on personality assessments.
A. Name and briefly describe the biases that cause response distortion in psychological testing.
B. Give an example item for each of these.
C. What is the likely effect size for changes to big five personality traits due to response distortion?
Moralistic bias - people want to appear good and deny negative behaviours, e.g. claiming never to lie, steal, or lose their temper.
Egotistic bias - people value independence and try to appear brave, competent, and talented, e.g. claiming they can achieve something when they are only somewhat competent.
Instructed faking produces a large change on neuroticism (d ≈ 0.93), as people fake lower N. Effects are also large for O and C, and moderate for E and A.
Group comparisons show moderate changes on C and N.
A. Name the two ways that mental speed is measured.
B. Describe the procedure that is used in each case.
C. How do each of these paradigms for measuring mental speed relate to intelligence?
Hick’s paradigm: tests the time it takes to react to a single stimulus (simple RT) or to choose among many stimuli (choice RT). The test-taker rests a finger on a home button; when a light comes on, they must press the corresponding response button, and RT is the time taken to release the home button. Someone whose RT stays fast when there are multiple stimuli to choose from has greater mental speed. Faster choice RT is associated with higher IQ, because the task grows more complex as the number of alternatives to process increases.
Inspection time: The time it takes to judge a stimulus accurately based on its features after perception. E.g. line length test - two lines of different lengths are covered with a backwards mask then shown briefly. A lower IT is associated with higher IQ.
In both procedures, test-takers can adopt strategies that artificially improve their RT or IT scores, so neither is a pure measure of intelligence.
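The choice-RT result can be summarised by Hick’s law: mean RT rises linearly with the information content log2(n) of n response alternatives. A toy model (the intercept and slope values here are made up for illustration):

```python
import math

def hicks_law_rt(n_alternatives, a=0.2, b=0.15):
    """Hick's law: predicted mean reaction time in seconds,
    RT = a + b * log2(n). Intercept a and slope b are hypothetical."""
    return a + b * math.log2(n_alternatives)

# Simple RT (one alternative) equals the intercept a; each doubling of
# the number of alternatives adds one slope unit b to the predicted RT.
```

A steeper slope b means each additional alternative slows the person more, which is the complexity argument linking choice RT to IQ above.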
What is a “lagged panel model”?
Describe the lagged panel models that were used to test the likely causal direction underlying the inspection time/intelligence association.
What is the causal direction (and is it the same for Gf and Gc)?
A (cross-)lagged panel model measures two variables at two or more time points and compares the cross-lagged correlations to infer the likely causal direction.
The correlation between inspection time (IT) at T1 and intelligence at T2 was greater than the correlation between intelligence at T1 and IT at T2, suggesting that earlier IT drives later (fluid) intelligence.
Likewise, the correlation between auditory IT at T1 and Mill Hill Vocabulary Scale scores at T2 was larger than the correlation between vocabulary scores at T1 and auditory IT at T2, suggesting that earlier IT also drives later crystallised ability.
The causal direction is the same for Gf and Gc: faster mental speed leads to higher intelligence, not the reverse.
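The cross-lagged comparison can be illustrated on simulated panel data (all numbers below are fabricated for illustration; `pearson` is a hypothetical helper, not any study’s actual method):

```python
import random
import statistics

def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)
    return cov / (statistics.pstdev(x) * statistics.pstdev(y))

# Simulate a panel in which early inspection time drives later intelligence
random.seed(1)
n = 500
it_t1 = [random.gauss(0, 1) for _ in range(n)]
iq_t1 = [random.gauss(0, 1) for _ in range(n)]
iq_t2 = [0.6 * it + 0.5 * random.gauss(0, 1) for it in it_t1]  # strong IT1 -> IQ2 path
it_t2 = [0.1 * iq + 1.0 * random.gauss(0, 1) for iq in iq_t1]  # weak IQ1 -> IT2 path

r_forward = pearson(it_t1, iq_t2)   # IT at T1 with intelligence at T2
r_backward = pearson(iq_t1, it_t2)  # intelligence at T1 with IT at T2
# |r_forward| > |r_backward| is the pattern taken to favour IT -> intelligence
```

Under these simulated parameters the forward cross-lag comes out clearly larger than the backward one, mirroring the pattern reported for IT and intelligence.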
There are several tests that may be used as alternatives to the MSCEIT (Mayer-Salovey-Caruso Emotional Intelligence Test)
- Name and describe an assessment of emotion recognition
- Critically evaluate this assessment, outlining both the positives and negatives
The Reading the Mind in the Eyes Test assesses emotion recognition from photographs of the eye region alone. Strengths: it is brief and widely used. Weaknesses: the stimuli are static, so the test does not capture dynamic, real-life emotion recognition such as reading emotions during conversation (limited external validity).
There are several tests that may be used as alternatives to the MSCEIT (Mayer-Salovey-Caruso Emotional Intelligence Test)
- Name and describe an assessment of strategic EI
- Critically evaluate this assessment, outlining both the positives and negatives
STEM/STEU, MEIS - stream 1
The MEIS (Multifactor Emotional Intelligence Scale) is an ability-based EI test: rather than self-report, it requires the individual to solve problems involving emotion understanding and management.
Strengths: the Blends and Changes subtests are short, and most items have clear answers. Because performance depends on actual judgement rather than self-description, it is harder to fake a better score.
Weaknesses: the correct answer can be subjective, e.g. which emotion results from blending several simpler emotions. And because the items are verbal, someone who understands an emotion but not the wording may still miss the item, so the test may partly measure verbal intelligence.
Training programs targeting emotional intelligence are stronger for some types of EI than others. Which types of EI show the strongest effects of training?
How large are the improvements in EI?
Training effects are strongest for strategic EI (emotion understanding, EU, and emotion management, EM), which requires analysis of emotion. The effect sizes were d = 0.83-1.3, an increase of roughly 12-20 points on an IQ-style metric; EU and EM improved by 5-6.5 such points.
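The translation from a standardized effect size to “IQ points” assumes an IQ-style scale with a standard deviation of 15; a minimal sketch:

```python
def d_to_iq_points(d, sd=15):
    """Convert Cohen's d (change in standard-deviation units) to points
    on an IQ-style scale, which by convention has a standard deviation of 15."""
    return d * sd

# d = 0.83 to 1.3 corresponds to roughly 12.5 to 19.5 IQ-style points
```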