Psychometrics Flashcards
What determines choice of format in item writing? (2 marks)
Objectives and purposes of test (eg do we want to measure extent/amount of interaction, or quality of interaction)
Difference between objective and purpose of a study?
Purpose - broad goal of research
Objective - how are we practically going to achieve that
List 4 of the 9 item writing guidelines
- clearly define what you want to measure
- generate an item pool (best items are selected after analysis)
- Avoid long items
- Keep the reading difficulty appropriate
- use clear and concise wording (avoid double-barrelled items and double negatives)
- Use both pos & neg worded items
- use culturally neutral items
- (for MCQS) - make all distractors plausible & vary position of correct answer
- (for true/false Qs) - equal numbers of both and make both statements the same length
List the 5 categories of item formats
- Dichotomous
- Polytomous
- The Likert format
- The Category format
- Checklists and Q-sorts
Advantages of the dichotomous format (3 marks)
- easy to administer
- quick to score
- requires absolute judgement
Disadvantages of the dichotomous format (3 marks)
- less reliable (50% chance of correct answer)
- encourages memorization instead of understanding
- often the truth is not black and white (true false is an oversimplification)
Minimum number of options for a polytomous format?
3 (but 4 is commonly used, and considered more reliable)
3 guidelines in writing distractors in the polytomous format
- distractors must be clearly written
- distractors must be plausible as correct answers
- avoid “cute” distractors
Advantages of polytomous questions (4 marks)
- easy to administer
- easy to score
- requires absolute judgement
- more reliable than dichotomous (less chance of guessing correctly)
Formula for correcting guessing
Corrected score = R − W/(n − 1), where R = number right, W = number wrong, and n = number of options per item
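A minimal Python sketch of the correction (function and variable names are illustrative, not from the course):

```python
def corrected_score(num_right: int, num_wrong: int, num_options: int) -> float:
    """Correct a raw test score for guessing: R - W / (n - 1).

    Omitted items count as neither right nor wrong.
    """
    return num_right - num_wrong / (num_options - 1)

# E.g. 60 right, 20 wrong on 4-option MCQs -> 60 - 20/3 ≈ 53.3
print(corrected_score(60, 20, 4))
```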
Fields in which Likert scales are predominantly used (2 marks)
Attitude and Personality questionnaires
How can one avoid the neutral response bias in Likert Scales
have an even number of options
How does one score negatively worded items from a Likert scale
Reverse score
Suggested best no. of options in a category format question?
7
Disadvantages of the category format (2 marks)
- tendency to spread answers across all categories
- susceptible to the grouping of the things being rated (an item may be rated lower if the other items in the group are really good - i.e. not objective)
When best to use category format questions? (2 marks)
- when people are highly involved in a subject (more motivated to make a finer discrimination)
- when you want to measure the amount of something (eg levels of road rage)
Two tips when using the category format
- make sure your endpoints are clearly defined
- use a visual analogue scale (ideal with kids, e.g. a smiley face on one side of the scale and a frowny face on the other to describe how they're feeling)
Where are Checklists format questions commonly found?
Personality measures (e.g a list of adjectives, tick those that describe you)
Describe the process of Q-sort format questions
Place statements into piles, piles indicate the degree to which you think a statement describes a person/yourself
In terms of Item analysis, describe item difficulty and give another name for it
The proportion of people who get a particular item correct (higher value = easier item)
AKA facility index
p = no of correct answers/no of participants
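As a sketch, p can be computed per item from a 0/1 response matrix (the data below are made up for illustration):

```python
import numpy as np

# Rows = participants, columns = items; 1 = correct, 0 = incorrect (toy data)
responses = np.array([
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 0],
    [0, 1, 0, 1],
])

# p per item = number of correct answers / number of participants (higher = easier)
p = responses.mean(axis=0)
print(p)  # [0.75 0.75 0.25 0.75]
```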
Ideal range for optimum difficulty level
0.3 - 0.7
How to calculate ODL (optimum difficulty level) for an item
Halfway between 100% and the chance of guessing the answer correctly: ODL = (1 + chance)/2
E.g. for an item with 4 options, ODL = (1 + 0.25)/2 = 0.625
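The same calculation as a small sketch:

```python
def optimum_difficulty(num_options: int) -> float:
    """ODL = halfway between 100% and the chance of guessing correctly."""
    chance = 1 / num_options
    return (1 + chance) / 2

print(optimum_difficulty(4))  # 0.625 (4-option MCQ)
print(optimum_difficulty(2))  # 0.75  (true/false item)
```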
How should difficulty levels range across items in a questionnaire
You want most items around the ODL and a few at the extremes. The distribution of p-values (difficulty levels) should be approximately normal
Why does one need a range of item difficulty levels?
To discriminate between ability of test-takers
List 3 exceptions to having optimum difficulty levels
- need for difficult items (e.g selection process)
- need easier items (e.g special education)
- need to consider other factors (e.g boost confidence/morale at start of test)
p (an item difficulty level) tells us nothing about…
…the intrinsic characteristics of an item. Its value is related to a given sample
Item discriminability is good when…
people who did well on the test overall get the item correct (and vice versa)
Describe the extreme groups method when calculating item discriminability
calculated by looking at proportion of people in the upper quartile who got the item correct minus the proportion of people in the lower quartile who got the item correct
{in other words, the difference in item difficulty when comparing the top and bottom 25%}
Di = U/NU − L/NL (U, L = number correct in the upper/lower quartile; NU, NL = number of people in each quartile)
*Should be a positive number if the item has good discriminability
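A sketch of the extreme-groups index in Python, assuming a 0/1 vector for one item plus the matching total test scores:

```python
import numpy as np

def discrimination_index(item: np.ndarray, totals: np.ndarray) -> float:
    """Di = (proportion correct in top 25%) - (proportion correct in bottom 25%)."""
    upper = item[totals >= np.percentile(totals, 75)]  # upper quartile on the test
    lower = item[totals <= np.percentile(totals, 25)]  # lower quartile on the test
    return upper.mean() - lower.mean()

item = np.array([1, 1, 1, 0, 0, 1, 0, 0])
totals = np.array([40, 38, 35, 30, 28, 36, 20, 18])
print(discrimination_index(item, totals))  # positive = good discriminability
```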
A red flag in item discriminability?
A negative number
Describe the point biserial method when calculating item discriminability
Calculate an item-total correlation
(if test-taker fails the item but does well on the overall test, i-tc will be negative)
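A sketch of an item-total correlation; this uses the common "corrected" variant that removes the item from the total so it is not correlated with itself (a standard refinement, though not specified on the card):

```python
import numpy as np

def item_total_correlation(responses: np.ndarray, item_index: int) -> float:
    """Point-biserial correlation between one item (0/1) and the rest-of-test total."""
    item = responses[:, item_index]
    rest_total = responses.sum(axis=1) - item  # total score excluding this item
    return np.corrcoef(item, rest_total)[0, 1]
```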
Can item-total correlations be used for Likert-type scales and other formats such as category and polytomous formats?
yes
Results from item-total correlations can be used to decide….
which items to remove from the questionnaire
Item characteristic curves (ICCs) are visual depictions of…
the relationship between performance on an item and performance on the overall test
Give the x- and y-axes of an ICC
x-axis = total score on test
y-axis = proportion {of test takers who got the item} correct
3 steps to drawing ICCs
- Define categories of test performance (eg specific total scores/percentages)
- Determine what proportion of people w/in each category got the item correct
- Plot your ICC
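A sketch of the three steps with matplotlib (toy data; the score bins are arbitrary):

```python
import numpy as np
import matplotlib.pyplot as plt

totals = np.array([5, 12, 18, 22, 27, 33, 38])  # total test scores (toy data)
item = np.array([0, 0, 1, 0, 1, 1, 1])          # 1 = got the item correct

# Step 1: define categories of test performance (bins of total scores)
categories = np.digitize(totals, bins=[0, 10, 20, 30, 40])

# Step 2: proportion of people within each category who got the item correct
xs = [totals[categories == c].mean() for c in np.unique(categories)]
ys = [item[categories == c].mean() for c in np.unique(categories)]

# Step 3: plot the ICC
plt.plot(xs, ys, marker="o")
plt.xlabel("Total score on test")
plt.ylabel("Proportion correct on item")
plt.show()
```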
Briefly explain Item Response Theory (IRT)
Test difficulty is tailored to the individual - wrong answer = decrease difficulty, right answer = increase difficulty. Test performance is defined by the level of difficulty of items answered correctly
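The up/down logic can be sketched as a toy loop (purely illustrative; operational IRT systems select items via probabilistic item models, which these cards don't cover):

```python
def run_adaptive_test(answer_item, num_items: int = 10,
                      difficulty: float = 0.5, step: float = 0.1) -> float:
    """Toy adaptive loop: right answer -> harder item, wrong answer -> easier item.

    `answer_item(difficulty)` is a hypothetical callable returning True/False.
    The final difficulty reached stands in for the performance estimate.
    """
    for _ in range(num_items):
        if answer_item(difficulty):
            difficulty = min(1.0, difficulty + step)  # correct: increase difficulty
        else:
            difficulty = max(0.0, difficulty - step)  # wrong: decrease difficulty
    return difficulty

# E.g. a test-taker who can handle anything up to difficulty 0.7:
print(run_adaptive_test(lambda d: d <= 0.7))  # settles around 0.7
```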
Name the program through which Item Response Theory is often administered
Computerized adaptive testing (CAT)
Advantages of Item Response Theory (3 marks)
- increase morale
- quicker tests
- decrease chance of cheating
In terms of measurement precision, name the three types of tests
- Peaked conventional
- Rectangular conventional
- Adaptive
Describe Peaked Conventional tests (3 points)
- Test individuals at average ability.
- Doesn’t assess high or low levels well
- high precision for average ability levels, low precision at either end
Describe Rectangular Conventional tests (2 points)
- equal number of items assessing all ability levels
- relatively low precision across the board
Describe Adaptive tests (2 points)
- test focuses on the range that challenges each individual test-taker
- precision is high at every ability level
Describe criterion-referenced tests
The test is developed based on learning outcomes - compares performance with some objectively defined criterion (What should the test-taker be able to do?)
How does one evaluate items in criterion-referenced tests? And how should the score/frequency graph look
2 Groups - one given the learning unit and one not given the learning unit. Graph should look like a V
List 3 limitations of criterion-referenced tests
- tell you you got something wrong, but not why
- Emphasis on ranking students rather than identifying gaps in knowledge
- Teaching to the test - not to education
What is referred to as the “test blueprint”
The test specifications
List 4 of the 7 things that test specifications should describe
1 Test (response) format
2 Item format
3 Total number of test items (test length)
4 Content areas of the construct(s) tested
5 Whether items or prompts will contain visual stimuli
6 How test scores will be interpreted
7 Time limits
In terms of response format, list 3 ways in which participants can demonstrate their skills
- Selected response (eg Likert scale/MCQ/dichotomous)
- Constructed response (eg essay/fill-in-the-blank)
- Performance response (eg block design task)
In terms of response format, give an example of objective vs subjective formats
Obj - MCQ or Likert
Subj - Essays, projective tests
List 5 types of item response format
- Open-ended - eg open ended essay q (no limitations on the test taker)
- Forced-choice items - MCQS, true/false qs.
- Ipsative forced choice (leads the test-taker into a certain direction, but still somewhat open. e.g I find work from home….)
- Sentence completion
- Performance based items
List the two determinants of test length
- Amount of administration time available
- Purpose of the measure (eg screening vs comprehensive)
When test length increases compliance ….. because people get ….. and …..
decreases; fatigued and bored
How many more items should be in the initial version of the test than the final one?
50%
Having good ….. ensures that all domains of a construct are tested
Content areas
….. refers to the ways in which knowledge or symptoms are demonstrated (and these are therefore tested for)
manifestations
Reliability is the desired ….. or ….. of test scores and tells us about the amount of ….. in a measurement tool
consistency or reproducibility; error
Why is test-retest not always a good measure of reliability?
Participants learn skills from the first administration of the test
Normally we perform roughly around our true score, and so our scores are…..distributed
normally
……is something we can use to increase reliability
internal consistency
Name the 4 classical test theory assumptions (NB to know these)
- Each person has a true score we could obtain if there was no measurement error
- There is measurement error - but this error is random
- The true score of an individual doesn’t change with repeated applications of the same test, even though their observed score does
- The distribution of random errors and thus observed test scores will be the same for all people
List the 2 assumptions of Classical test theory: the domain sampling model
- If we construct a test on something, we can’t ask all possible questions - So we only use a few test items (sample)
- Using fewer items can lead to the introduction of error
Reliability = variance of true score / X
X = ?
X = variance of the observed scores on the test
* this is a logical estimate - not a calculation we can actually do (true scores are unknowable)
An individual's true score is unknowable - but we can calculate the range in which it should fall by taking into account the reliability of the measurement tool, otherwise known as the….
….Standard Error of Measurement (SEM)
SEM = SD × √(1 − r) (r = reliability of the test)
Formula for creating confidence intervals with the SEM
The z-score for a 95% confidence interval = 1.96
Therefore:
Lower bound = x-1.96(SEM)
Upper bound = x+1.96(SEM)
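A sketch combining the two formulas (SD and reliability would come from the test's norms/manual):

```python
import math

def sem(sd: float, reliability: float) -> float:
    """Standard Error of Measurement: SEM = SD * sqrt(1 - r)."""
    return sd * math.sqrt(1 - reliability)

def confidence_interval_95(observed: float, sd: float, reliability: float):
    """95% CI for the true score: observed ± 1.96 * SEM."""
    margin = 1.96 * sem(sd, reliability)
    return observed - margin, observed + margin

# E.g. an IQ-style scale with SD = 15 and reliability .91: SEM = 4.5,
# so an observed score of 100 gives a CI of roughly (91.2, 108.8)
print(confidence_interval_95(100, 15, 0.91))
```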
List the 4 types of reliability (and two sub-types of type 4)
- Test-retest rel
- Parallel forms rel
- Inter-rater rel
- Internal consistency
  - split-half
  - coefficient/Cronbach's alpha
Give the name of the correlation between the 2 scores in test/re-test reliability and the source of error in t-rt rel
- the coefficient of stability
- source of error = time sampling
Issues with test-retest rel (3 marks)
- Carry-over effects (attitude or performance at T2 influenced by performance at T1)
- Practice effects
- Time between testing (too little time = remember responses, too much time = maturation)
In Parallel forms reliability, name the correlation between the 2 scores and give the source of error
Name = coefficient of equivalence
Source of error = item sampling
In terms of Parallel forms reliability, list four ways to create a parallel test to give the participant
- response alternatives can be reworded
- order of questions changed
- change wording of question
- different items altogether
Explain Inter-rater reliability (1 mark). Give the names of the correlation between raters' scores (2 marks) and give acceptable ranges of correlation scores
- IRR = how consistently multiple raters agree (more raters = more reliability)
- correlation between 2 raters = Cohen's Kappa; between more than 2 raters = Fleiss' Kappa
- > .75 = excellent agreement
- .50 - .75 = satisfactory
- < .40 = poor
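For two raters, Cohen's Kappa can be computed with scikit-learn; the labels below follow the card's bands (the card leaves .40 - .50 unlabelled):

```python
from sklearn.metrics import cohen_kappa_score

rater_a = [1, 0, 1, 1, 0, 1, 0, 0]
rater_b = [1, 0, 1, 0, 0, 1, 0, 1]

kappa = cohen_kappa_score(rater_a, rater_b)
if kappa > 0.75:
    label = "excellent agreement"
elif kappa >= 0.50:
    label = "satisfactory"
else:
    label = "poor (below .40) or marginal"
print(f"kappa = {kappa:.2f} ({label})")  # kappa = 0.50 (satisfactory)
```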
Describe internal consistency and give the source of error
IC = the extent to which different items within a test measure the same thing
Source of error = item sampling (how homogeneous the items are)
Give one advantage and one disadvantage of split-half reliability
ADV = only need 1 test
DISADV = how do we divide the test into equivalent halves? (the correlation will change each time depending on which items go to each half)
What problem is created by splitting a test in half for split-half reliability?
halving the length of the test also decreases the reliability (domain sampling model says fewer items = lower reliability)
Name the correction used to adjust for the number of items in each half of the test when calculating split-half reliability
Spearman-Brown correction
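The card doesn't give the formula; the standard Spearman-Brown "prophecy" formula is r_new = k·r / (1 + (k − 1)·r), where k is the factor by which the test is lengthened (k = 2 estimates full-test reliability from a half-test correlation):

```python
def spearman_brown(r: float, length_factor: float = 2.0) -> float:
    """Spearman-Brown prophecy: estimated reliability of a test k times as long."""
    return (length_factor * r) / (1 + (length_factor - 1) * r)

# A split-half correlation of .70 implies full-test reliability of about .82
print(spearman_brown(0.70))
```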
What does Cronbach’s/ Coefficient Alpha measure
the error associated with each test item as well as error associated with how well the test items fit together
What level of reliability is satisfactory for Cronbach’s alpha?
≥ 0.70 = exploratory research
≥ 0.80 = basic research
≥ 0.90 = applied scenarios
When does Cronbach’s alpha become unhelpful?
When there are too many items - as this artificially inflates your CA scores
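A sketch of the usual computation, alpha = (k/(k − 1)) × (1 − sum of item variances / variance of total scores), on made-up Likert data:

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Cronbach's alpha for a (participants x items) score matrix."""
    k = scores.shape[1]
    item_variances = scores.var(axis=0, ddof=1)
    total_variance = scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

scores = np.array([
    [4, 5, 4, 5],
    [2, 3, 2, 2],
    [3, 3, 4, 3],
    [5, 4, 5, 5],
])
print(cronbach_alpha(scores))  # ≈ .94; compare against the .70/.80/.90 benchmarks above
```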
List 3 factors that influence Reliability
- Number of items in a test
- Variability of the sample
- Extraneous variables (testing situation, ambiguous items, unstandardized procedures, demand effects, etc.)
List 5 ways to improve reliability
- Increase/decrease the number of items
- Item analysis
- Inter-rater training
- Pilot-testing
- Clear conceptualisation
List three things that can affect your Cronbach’s Alpha score
- Number of test items
- Bad test items (too broad, ambiguous, easy etc)
- Multi-dimensionality
Explain the difference between internal consistency and homogeneity
IC = how inter-related the items are
HG = unidimensionality, the extent to which it is made up of only one thing
Name 3 of the 5 ways that Cronbach’s alpha is often described
- The mean of all split-half reliabilities (not accurate)
- A measure of first-factor saturation
- The lower bound of the reliability of a test
- It is equal to reliability in conditions of essential tau-equivalence
- A more general version of the KR coefficient of equivalence
Describe the difference between Cronbach’s Alpha and Standardized Item Alpha
CA -> deals with variance (how much scores vary) and covariance (the amount by which items vary together, i.e. co-vary)
SIA -> deals with inter-item correlation (the correlation of each item with every other item); SIA is derived from CA
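Standardized item alpha can be sketched from the mean inter-item correlation r̄, using alpha_std = k·r̄ / (1 + (k − 1)·r̄):

```python
import numpy as np

def standardized_alpha(scores: np.ndarray) -> float:
    """Standardized item alpha from the mean inter-item correlation."""
    k = scores.shape[1]
    corr = np.corrcoef(scores, rowvar=False)       # k x k inter-item correlations
    mean_r = corr[np.triu_indices(k, k=1)].mean()  # mean of the unique item pairs
    return (k * mean_r) / (1 + (k - 1) * mean_r)
```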
Give an example to illustrate the difference between variance and co-variance
Think about your group of friends – all of you probably ‘fit together’ pretty well
You co-vary a lot: You have a lot of shared variance and little unshared variance
As a group, you are internally consistent
Now think about the PSY3007S class as a whole
There is a fair amount of shared variance between people in the class, but the class probably has a lot more varied people in it than your group of friends
The PSY3007S class therefore has more variance and less covariance than your group of friends
As a class, you are less internally consistent than your group of friends
Why is Cronbach’s Alpha a better measure of reliability than split-half?
SH reliability relies on inter-item covariance, but doesn’t take variance into account. CA takes variance into account, which accounts for error of measurement. CA will therefore be smaller than split-half rel, and is a better estimate
If a test measures only one factor =
If a test measures more than one factor =
unidimensional
multi-dimensional
True or false: the higher the level of Cronbach’s Alpha, the more likely it is that the test is made up of one factor
FALSE
Multi-dimensional tests can have high CA values too. (EG the WAIS-III measures 2 factors - Verbal IQ and Performance IQ, yet it has very good reliability and CA scores)
People assume the above to be true because they confuse the terms INTERNAL CONSISTENCY and HOMOGENEITY
Why do people assume high CA values indicate unidimensionality?
CA measures how well items fit together (the covariance of items).
It makes sense that some people assume that the more covariance items have, the more they should fit together to make up one thing (i.e the more they should measure one factor only).
BUT, internal consistency and unidimensionality are not the same thing!
The question behind Validity is….
is the test measuring what it claims to measure?
Why is validity important? (2 marks)
- Gives meaning to a test score
- Indication of the usefulness of a test
If a test is not valid, then reliability is….
If a test is not reliable then it is…
moot
also not valid
Name the four broad types of validity, and two sub-types of two of the broad types
- Face validity
- Content validity
- Criterion validity - Concurrent & Predictive
- Construct validity - convergent & divergent
Briefly describe face validity and how it is determined
On the surface (its face) the measure seems to measure what it claims to measure.
Determined through a review of the items, not through a statistical analysis
Content validity is the….
…degree to which a test measures an intended content area
How is content validity established?
It is established through judgement by expert judges and statistical analysis such as factor analysis
Name and briefly describe 2 potential errors in content validity
- Construct under-representation: A test does not capture important components of the construct
- Construct-irrelevant variance:
When test scores are influenced by things other than the construct the test is supposed to measure (e.g. test score influenced by reading ability or performance anxiety)
Which needs to be established first; reliability or validity?
Reliability
Criterion validity is…
how well a test score estimates or predicts a criterion behaviour or outcome, now or in future
Name and briefly describe 2 types of criterion validity
Concurrent criterion validity: The extent to which test scores can correctly identify the current state of individuals
Predictive validity: How well does performance on one test predict future performance on some other measure?
In construct validity we look at…
the relationship between the construct we want to measure and other constructs (to what other constructs is it similar or different?)
A construct is …
A hypothetical attribute
Something we think exists, but is not directly measurable or observable (e.g., anxiety)
Name and briefly describe the 2 sub-types of construct validity.
- Convergent validity: scores on a test have high correlations with other tests that measure similar constructs
- Divergent/discriminant validity: scores on a test have low correlations with other tests that measure different constructs
Name and briefly describe 2 factors affecting validity
- Reliability: any form of measurement error can reduce validity (you can have reliability without validity, but the test would then be useless)
- Social diversity: tests may not be equally valid for different social/cultural groups (e.g., a test of superstition in one culture might be a test of religiosity in another)
How does one establish construct validity properly? Briefly describe this method
Multitrait-Multimethod (MTMM) matrix: A correlation matrix which shows correlations between tests measuring different traits/factors, measured according to different methods
List the 4 rules of MTMM
Rule 1: The values in the validity diagonal should be more than 0, and large enough to encourage further exploration of validity (evidence of convergent validity)
Rule 2: A value in the validity diagonal should be higher than the values lying in its column and row, in the heterotrait-heteromethod triangles (HTHM triangles are divergent validity values, validity diagonal values are convergent validity values - conv val must be > div val)
Rule 3: A value in the validity diagonal should be higher than the values lying in its column and row, in the heterotrait-monomethod triangles (HTMM triangles also = divergent val values)
Rule 4: There should be more or less the same pattern of correlations in all the different triangles
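A sketch of assembling and reading an MTMM matrix with pandas (traits, methods, and data are all made up for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Two traits (anxiety, extraversion), each measured by two methods
anxiety = rng.normal(size=n)
extraversion = rng.normal(size=n)
data = pd.DataFrame({
    "anxiety_self": anxiety + rng.normal(scale=0.5, size=n),
    "extraversion_self": extraversion + rng.normal(scale=0.5, size=n),
    "anxiety_observer": anxiety + rng.normal(scale=0.7, size=n),
    "extraversion_observer": extraversion + rng.normal(scale=0.7, size=n),
})

mtmm = data.corr()

# Validity diagonal value (monotrait-heteromethod): same trait, different method
# Rule 1: should be well above 0
print(mtmm.loc["anxiety_self", "anxiety_observer"])

# Heterotrait-heteromethod value: different trait AND different method
# Rule 2: should be lower than the validity diagonal value above
print(mtmm.loc["anxiety_self", "extraversion_observer"])
```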
To establish validity of a new scale, one could…..
correlate it with an already established scale via a MTMM matrix
In the MTMM, the reliability diagonals are:
A) the intercepts of different traits within the same method
B) the intercepts of the same traits across different measures
C) the intercepts of different traits across different methods
D) the intercepts of the same traits within the same method
D
Reliability diagonals are also called…
monotrait-monomethod values
In the MTMM, the reliability diagonals are:
A) the intercepts of different traits measured by the same method
B) the intercepts of the same traits measured by different measures
C) the intercepts of different traits measured by different methods
D) The intercepts of the same traits measured by the same measure
D
Validity diagonals are also called..
monotrait-heteromethod values
In the MTMM matrix, the heteromethod block is made up by the….
Validity diagonal and the triangles (HTHM values)
In the MTMM matrix, the monomethod blocks are made up by the….
Reliability diagonals and the triangles (HTMM values)
MTMM matrix rule 4 interpretation
This allows us to see if the pattern of convergent and divergent validity is about the same
- run through lecture 8/9 slides and see analysis of MTMM matrix
do it
Name 3 approaches to intelligence testing and what they are concerned with respectively
- The psychometric approach (structure of a test, its correlates and underlying dimensions)
- The information processing approach (how we learn and solve problems)
- The cognitive approach (how we adapt to real-world demands)
What are the four common definitions of intelligence
Ability to adapt to new situations
Ability to learn new things
Ability to solve problems
Ability for abstraction
Today, intellectual ability is conceptualized as ….
multiple intelligences
Name two NB intelligence concepts from Binet
- Age differentiation - older children have greater ability than younger children, and mental age can be differentiated from chronological age
- General mental ability - which is the total product of different and distinct elements of intelligence
Name two of Wechsler's contributions to the field of intelligence testing
Intelligence has certain specific functions
Intelligence is related to separate abilities
Name the 4 critiques of Binet's work by Wechsler
- Binet scale was not appropriate for adults
- Non-intellective factors were not emphasized (e.g social skills and motivation)
- Binet did not take into account the decline of performance that should be expected with aging
- Mental age norms do not apply to adults
Briefly explain the difference between fluid intelligence and crystallized intelligence
Fluid intelligence (gf)
Abilities that allow us to think, reason, problem-solve, and acquire new knowledge
Crystallized intelligence (gc)
The knowledge and understanding already acquired
What is the purpose of intelligence testing in children ( 1 mark) and adults (4 marks)?
Children - school placement
Adults - Neuropsychological assessment
Forensic assessment
Disability grants
Work placement
The ….. measure is the gold standard of intelligence testing and provides a measure of ….
Wechsler intelligence tests
Full scale IQ (FSIQ)
List the 2 subscales of FSIQ and the 2 index sections of each of these subscales. Additionally, give a test used to assess each of these 4 categories
Verbal IQ (VIQ):
- Verbal comprehension index (VCI); test = vocabulary
- Working memory index (WMI); test = arithmetic
Performance IQ (PIQ):
- Perceptual organization index (POI); test= picture completion
- Processing speed index (PSI); test = digit-symbol coding
- breakdown of this on lecture 10 slide 10/11
What is one of the most stable measures of intelligence, and the last to be affected by brain deterioration
Vocabulary
Which intelligence test assesses concentration, motivation and memory, and is the most sensitive to educationally deprived/intellectually disabled individuals?
Arithmetic
Which intelligence test measures ability to comprehend instructions, follow directions and provide a response
Information
Which intelligence test measures judgement in everyday situations? Also list 3 types of questions used in it.
Comprehension
- Situational action
- Logical explanation
- Proverb definition
For FSIQ scoring, do raw scores carry meaning?
No. Different subtests have different ranges, and the same raw scores for people of different ages are not comparable. Raw scores are converted to scale scores with set means and SDs
Briefly describe picture completion intelligence tests
A picture in which an important detail is missing
Missing details become smaller and harder to spot
Which intelligence test tests New learning, Visuo-motor dexterity, Degree of persistence and speed of processing information
Digit-symbol coding
In which intelligence test must participants find some sort of relationship between figures?
Matrix reasoning
This measures:
Fluid intelligence
Information processing
Abstract reasoning
Describe the implications of possible relationships between PIQ and VIQ
VIQ = PIQ
If both are low, can provide evidence for intellectual disability
PIQ > VIQ
Cultural, language, and/or educational factors
Possible language deficits (e.g., dyslexia)
VIQ > PIQ
Common for Caucasians and African-Americans
List the 4 personality types according to Hippocrates
- Sanguine
- Phlegmatic
- Choleric
- Melancholic
List the 3 tenets of personality theory today
- Stable characteristics (personality traits, basic behavioural/emotional tendencies)
- Personal projects and concerns - what a person is doing and wants to achieve
- Life story/narrative - construction of integrated identity
What are traits? (3 marks)
- Basic tendencies/predispositions to act in a certain way
- Consistencies in behaviour
- Influence behaviour across a variety of situations
Traits are measured via….
structured personality measures
What are the BIG 5 in terms of traits
Openness
Conscientiousness
Extraversion
Agreeableness
Neuroticism
List 4 structured personality tests
- The Big 5 Test
- The 16 personality factor test
- Myers-Briggs type indicator
- Minnesota multiphasic personality inventory (MMPI)
The Big 5 traits provide a framework for understanding ….
…personality disorders
What is the most widely used objective personality test?
The Minnesota multiphasic personality inventory (MMPI)
List 3 of the 5 ways in which the MMPI is used
- Help develop treatment plans
- Help with diagnosis
- Help answer legal questions
- Screen job candidates
- Part of therapeutic assessment
….. Personality tests assess tenet two of modern personality theory (personal projects and concerns). These tests measure….
Unstructured
…motives that underlie behaviour
3 examples of unstructured personality tests are….
- Thematic Apperception Test (TAT)
- The Rorschach test
- Draw a person test
Describe the process of the Thematic Apperception Test (TAT).
Then list and briefly describe the three major motives put forward by the TAT.
One must make up a dramatic story about ambiguous black and white pictures, describing the feelings and thoughts of characters
- Achievement motive: the need to do better
- Power motive: the need to make an impact on people
- Intimacy motive: the need to feel close to people
List 2 precautions when using personality test cross-culturally
- Constructs must have the same meaning across cultures
- Bias analysis must be done
List 2 solutions to culturally biased tests
- Caution in interpretation
- Cross-cultural adaption of test
In the MTMM, the validity diagonals are:
A) the intercepts of different traits measured by the same method
B) the intercepts of the same traits measured by different measures
C) the intercepts of different traits measured by different methods
D) The intercepts of the same traits measured by the same measure
B