Bias, Validity Scales, Methods of Measurement Instrument Construction, Item Types and Response Scales, Translation / Cross-Cultural Adaptation: Flashcards
Test bias
-The public sometimes has the impression that all assessment instruments are biased (e.g., by age, by sex/gender, by ethnic group, by clinical group, etc.).
-This is sometimes the case, and it is the duty of the test user to be aware of it.
Reminder:
Bias = systematic error; it is not random
One very important thing is not to confuse a difference in means between groups with bias
Test bias
Differences in means between certain groups are not a priori a bias since some are theoretically/conceptually expected
e.g., In adolescence, few or no differences in means between ethnic groups for behavior problems, but differences by sex/gender
e.g., In adulthood, presence of sex differences in some personality traits, but few or none in adolescence
Normative: compared with the general population
Test bias - When is an assessment instrument biased?
An assessment instrument is biased “if differences between members of different groups are identified on the basis of characteristics other than those the instrument purports to assess” (Merrell, 2008; Whitcomb, 2017)
In other words, bias is present for an instrument if the content, procedure, or use systematically favors or disfavors members of one group over another and if this differentiation is irrelevant to the purpose of the instrument
How are reliability and validity affected?
As we have seen, the reliability of scores on an assessment instrument can be compromised by various sources of measurement error
-We have also seen that the inferences and interpretations permitted with scores provided by an assessment instrument are dependent on the degree of validity of those scores.
-Validity can be affected directly by (a) response bias on individual items or (b) scale score bias.
-The presence of bias is a critical issue for both test developers and test users.
Response bias: heuristics or cognitive biases
People who are being assessed and asked questions, whether about themselves or as an informant for a third party, are always at risk of being partially biased
For example, in a job interview where a person has to answer a personality questionnaire, would they want to look their best? Or even better than their best?
Even at a basic level, it is now recognized that the human cognitive system is “victimized” by several heuristics or cognitive biases (Kahneman, 2011; Kahneman, Slovic & Tversky, 1982)
Heuristics associated with response styles
Heuristics: Cognitive strategies used to simplify and speed up a decision under uncertainty (Kahneman, 2011)
Sometimes referred to as “mental shortcuts.”
Apply to behavioral evaluation/estimation
Very useful when one does not know the person being evaluated well enough.
Can also lead to misjudgment and “stereotyping” of people.
Four well-known examples of heuristics
- Representativeness heuristic
Evaluating a specific characteristic in terms of how well it matches a prototype (e.g., evaluating a child’s attention based on our ADHD prototype)
- Availability heuristic
Rating that is influenced by the things that come most easily (or frequently) to mind for the rater (e.g., children’s aggressive behaviors)
Those things that come to mind more easily are considered more frequent and more representative of reality
- Primacy / recency heuristics
Evaluation that is influenced by the first vs. last impression of the individual
- Affect heuristic
Assessment colored by the rater’s current emotional and affective state (e.g., a bad mood leads to an estimate of more behavior problems)
Response biases may seem trivial, but they can be very serious because they directly influence the validity of test scores
“Diminished” validity can in turn compromise the quality of inferences and clinical decisions that are made about an individual (or group) being assessed
Eight major types of response bias (see pictures)
1. Extremity: responses are extreme
2. Indecision: neutral (middle) responses
3. Acquiescence: saying yes to everything
4. Objection: always saying no
5. Social desirability: giving socially acceptable answers, exaggerating the positive
6. Unfavorable impression management (malingering): answering in an exaggeratedly negative way
7. Random or careless responding: answering at random
8. Guessing
9. Halo
What can be done to prevent or minimize response bias?
Three things to do:
1. Manage the assessment situation
Anonymity, minimize frustration, give warnings (i.e., warn that there are validity scales)
2. Manage the content of the tests
Simple items (language level), content-neutral items (i.e., non-suggestive), conceptually clear response options
3. Specialized validity tests or scales
Some examples of validity scales
All these scales are based on the same principle: very high or extreme scores suggest a potential problem
Indeterminacy scale (e.g., the MMPI-2; Ben-Porath & Tellegen, 2008)
The full MMPI-2 questionnaire has 567 items
Unanswered items, and items given more than one response, are counted and summed
Validity scales - Social desirability scales
- Marlowe-Crowne Social Desirability Scale (Crowne & Marlowe, 1960)
e.g., “I never lie”; “I like everyone I know”; “I have never been angry”.
- Balanced Inventory of Desirable Responding
Self-deception: generally honest, but overly positive responses
Impression management: dishonest responses; positive bias is used to (a) please others or (b) gain an advantage
Validity scales - Unfavorable impression management scale
Unfavorable impression management scale (e.g., the MMPI-2; Ben-Porath & Tellegen, 2008)
Tendency to endorse unlikely negative items (e.g., “I’m no good at anything”; “I have no talent”)
The effect is difficult to distinguish from that of severe clinical cases (e.g., major depression or depressive personality disorder, etc.)
Validity scales - Extreme response style scale and indecision scale
- Extreme response style scale
Criteria proposed by the EDC (Parent et al., 2006)
i.e., choosing the 1st or 7th choice of the items an abnormally high number of times
- Indecision scale
Criteria proposed by the EDC (Parent et al., 2006)
i.e., choosing the central category, i.e., the 4th choice (the one in the middle) of the items, an abnormally high number of times
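The counting behind these two scales can be sketched in a few lines of Python. This is only an illustration: the function names and the 0.8 cutoff are hypothetical, and the actual EDC criteria (Parent et al., 2006) are not given in these notes.

```python
from collections import Counter


def response_style_counts(responses, low=1, high=7):
    """Count extreme (lowest/highest) and middle answers on a rating scale."""
    if not responses:
        raise ValueError("at least one response is required")
    midpoint = (low + high) // 2  # 4 on a 1-to-7 scale
    counts = Counter(responses)
    extreme = counts[low] + counts[high]
    middle = counts[midpoint]
    n = len(responses)
    return {
        "extreme": extreme,
        "middle": middle,
        "extreme_rate": extreme / n,
        "middle_rate": middle / n,
    }


def flag_response_style(responses, cutoff=0.8):
    """Flag a protocol when a style is used 'abnormally often'.

    The cutoff is a made-up value for illustration, not the EDC criterion.
    """
    stats = response_style_counts(responses)
    return {
        "extreme_style": stats["extreme_rate"] >= cutoff,
        "indecision": stats["middle_rate"] >= cutoff,
    }


answers = [1, 7, 7, 1, 7, 4, 1, 7, 7, 1]
print(response_style_counts(answers))
print(flag_response_style(answers))
```

In practice the cutoff would come from the instrument's manual or from the distribution observed in a normative sample.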
Validity scales - Variable response inconsistency (VRIN)
- Variable response inconsistency (VRIN)
Sum of the number of item pairs that were answered inconsistently
Similar: “I don’t think before I act” - “I act without thinking about the consequences”
Different: “I don’t think before I act” - “I think carefully before I make decisions”
We give 1 pt for each inconsistent pair and calculate a sum
Used to detect random responses (intentional or not) or confusion in a questionnaire
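As a rough sketch of the VRIN logic described above, assuming true/false answers: a pair of similar items is inconsistent when the two answers differ, and a pair of different (opposite) items is inconsistent when the two answers match. The item identifiers and the function name are hypothetical, and this is not the MMPI-2 scoring key.

```python
def vrin_score(answers, similar_pairs, opposite_pairs):
    """Return the number of inconsistently answered item pairs.

    answers: dict mapping an item id to True or False.
    similar_pairs: pairs of items with the same meaning.
    opposite_pairs: pairs of items with opposite meanings.
    """
    score = 0
    for a, b in similar_pairs:
        if answers[a] != answers[b]:  # similar items, different answers
            score += 1
    for a, b in opposite_pairs:
        if answers[a] == answers[b]:  # opposite items, same answer
            score += 1
    return score


# q1: "I don't think before I act"
# q2: "I act without thinking about the consequences"
# q3: "I think carefully before I make decisions"
answers = {"q1": True, "q2": False, "q3": True}
print(vrin_score(answers, [("q1", "q2")], [("q1", "q3")]))  # prints 2
```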
Validity scales - True response inconsistency (TRIN)
True response inconsistency (TRIN)
In this one, only pairs of items that are conceptually different are used
Calculates the sum of the item pairs answered inconsistently true minus the sum of the item pairs answered inconsistently false
Used to detect inconsistent responses that indicate acquiescence (very high score) or objection (very low score, possibly negative)
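The TRIN calculation can be sketched as follows, assuming true/false answers and a list of conceptually opposite item pairs. The item identifiers and function name are hypothetical, and no constant or norm-based transformation from a real scoring key is applied.

```python
def trin_score(answers, opposite_pairs):
    """True-True pairs minus False-False pairs, over opposite-content pairs."""
    true_true = sum(1 for a, b in opposite_pairs if answers[a] and answers[b])
    false_false = sum(
        1 for a, b in opposite_pairs if not answers[a] and not answers[b]
    )
    return true_true - false_false


pairs = [("q1", "q2"), ("q3", "q4"), ("q5", "q6")]

yes_to_everything = {q: True for pair in pairs for q in pair}
no_to_everything = {q: False for pair in pairs for q in pair}

print(trin_score(yes_to_everything, pairs))  # prints 3 (acquiescence)
print(trin_score(no_to_everything, pairs))   # prints -3 (objection)
```

A respondent who answers each opposite pair consistently (one true, one false) contributes nothing to either sum, so the score stays at zero.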
Item and test bias
Item (or indicator) bias
Not differences in scores on the trait, but systematic differences in the probability of responding in a given way for each item individually, once the trait level is controlled for
Also called “differential item functioning”
Compares the probability of endorsing the items of a scale among individuals from different groups who have the same score/level on the trait
Same principle as control variables in predictive studies (e.g., when “controlling for SES”)
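One way to make "once the trait level is controlled for" concrete is to compare the two groups only within strata of people who have the same trait score. The sketch below uses the Mantel-Haenszel common odds ratio, a standard technique that these notes do not name; it is brought in here as an assumption, and the group labels and function names are hypothetical. A value near 1 suggests no differential functioning for the item; values far from 1 suggest that one group endorses the item more often at equal trait levels.

```python
from collections import defaultdict


def mantel_haenszel_odds_ratio(records):
    """records: iterable of (group, trait_score, endorsed).

    group is "ref" or "focal"; endorsed is True or False.
    Respondents are matched (stratified) on trait_score.
    """
    # For each trait score: [endorsed, not endorsed] counts per group.
    tables = defaultdict(lambda: {"ref": [0, 0], "focal": [0, 0]})
    for group, score, endorsed in records:
        tables[score][group][0 if endorsed else 1] += 1

    numerator = denominator = 0.0
    for cell in tables.values():
        a, b = cell["ref"]    # reference group: endorsed, not endorsed
        c, d = cell["focal"]  # focal group: endorsed, not endorsed
        n = a + b + c + d
        numerator += a * d / n
        denominator += b * c / n
    if denominator == 0:
        raise ValueError("odds ratio is undefined for these data")
    return numerator / denominator


def make(group, score, n_yes, n_no):
    """Build fictitious records for one group at one trait level."""
    return [(group, score, True)] * n_yes + [(group, score, False)] * n_no


no_dif = (make("ref", 1, 2, 2) + make("focal", 1, 2, 2)
          + make("ref", 2, 3, 1) + make("focal", 2, 3, 1))
with_dif = (make("ref", 1, 3, 1) + make("focal", 1, 1, 3)
            + make("ref", 2, 3, 1) + make("focal", 2, 1, 3))

print(mantel_haenszel_odds_ratio(no_dif))    # 1.0: same odds at each level
print(mantel_haenszel_odds_ratio(with_dif))  # 9.0: item favors "ref"
```

The data are invented for illustration; a real analysis would also test whether the ratio differs significantly from 1.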
Item and test bias - Structural bias
Structural bias
For a unidimensional instrument, this may take the form of significant differences in factor loadings between two groups
Not trivial, since this means that the trait is not measured in the same way in different groups
For a multidimensional instrument, (a) differences in loadings and (b) the factor structure is not the same in different groups
e.g., factor analysis reveals 3 factors for men, but only two for women
Item and test bias - Criterion (or criterion-referenced) bias
Criterion (or criterion-referenced) bias
Applies to both concurrent criterion validity (independent criteria and contrasting groups) and predictive validity
e.g., a temperamental trait that predicts later adjustment for one group of children, but not for another
e.g., an IQ test predicts success for one cultural group, but not for another
Caution: differences between groups in predictive relationships can be expected when they are theoretically justified; in that case, they are not a bias
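A first descriptive look at criterion bias is to estimate the predictive relationship separately in each group and compare the results. The sketch below fits a simple least-squares slope per group; the data, group labels, and function names are invented for illustration, and a real analysis would also test whether the slopes differ significantly (for example with a group-by-score interaction term).

```python
from statistics import mean


def ols_slope(x, y):
    """Least-squares slope of y regressed on x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return sxy / sxx


def slopes_by_group(data):
    """data: dict mapping a group label to (test_scores, criterion_scores)."""
    return {group: ols_slope(x, y) for group, (x, y) in data.items()}


data = {
    "group_a": ([1, 2, 3, 4], [2, 4, 6, 8]),  # test predicts the criterion
    "group_b": ([1, 2, 3, 4], [5, 5, 5, 5]),  # test predicts nothing
}
print(slopes_by_group(data))
```

Here the test score predicts the criterion in group_a (slope of 2) but not in group_b (slope of 0), which is the pattern described in the examples above.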
Item and test bias - Reliability bias
Reliability bias
Reliability estimates are significantly different in different groups
Can be potentially important for interpretation
If bias is present, the level of confidence one can have in the scale scores varies across groups
Observed group differences in means can then be partly explained by error
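To compare fidelity (reliability) estimates across groups, one option is to compute an internal-consistency coefficient separately in each group. The sketch below uses Cronbach's alpha, which is an assumption on my part since the notes do not say which estimate is meant; the data are invented.

```python
from statistics import variance


def cronbach_alpha(rows):
    """rows: one list of item scores per respondent (same items for all)."""
    k = len(rows[0])
    item_variances = [variance([r[i] for r in rows]) for i in range(k)]
    total_variance = variance([sum(r) for r in rows])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


group_a = [[1, 1, 1], [2, 2, 2], [3, 3, 3], [4, 4, 4]]  # items move together
group_b = [[1, 2, 1], [2, 1, 3], [3, 4, 2], [4, 3, 4]]  # noisier answers

print(round(cronbach_alpha(group_a), 3))  # 1.0
print(round(cronbach_alpha(group_b), 3))  # 0.724
```

With real data, the two coefficients would be compared with a formal test before concluding that reliability differs between groups.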
Item and test bias
Although testing by comparing groups by sex/gender, ethnicity, cultural background, clinical group, etc., can be informative for many researchers, it often results in “over-generalization”
Variation between individuals in the same group (intragroup variance) can be enormous (see figure distributions)
As a psychoeducator, one must never lose sight of the fact that the purpose of a psychoeducational assessment is to interpret the scores and make recommendations for ONE particular individual
Test construction methods
There are a wide variety of tests useful in psychoeducation (Hogan et al., 2017)
Tests of intellectual ability/cognitive skills
Achievement tests
Neuropsychological tests
Measures of personality/temperament
Measures of interests, attitudes, and values
Measures of psychopathology
One major category is often overlooked in psychology books
Measures of environmental constructs
CONSTRUCTION OF TESTS
In general, professional organizations expect authors to have constructed their instrument in accordance with the criteria listed in the Standards for Educational and Psychological Testing (AERA, APA, & NCME, 2014)
Test construction and validation is a long-term process
Requires revisions before it is fully satisfactory
Can take place over several years, even a few decades
Two main test construction methods
Deductive (or rational)
“conclude from propositions taken as premises”
Inductive (or empirical)
“conclude by going from the facts to the law”
Two main test construction methods - Deductive (rational) method
From a theoretical framework
Scientific theory
Constructs, domains, indicators (the test designers determine them according to the theory)
Clinical theory
We want to answer a practical problem or need
e.g., how to measure PES intervention? How to measure motivation to change?
Advantage: Clear theoretical context, logical consistency (i.e., nomological network often known a priori)
Two main test construction methods - Inductive (empirical) method
Based on an empirical (or factual, or pragmatic) approach
1. Item analysis / factor analysis: items statistically related to the construct are selected (may also include internal consistency, criterion validity, etc.)
2. Criterion-referenced selection: e.g., only items that differentiate groups are selected (e.g., MMPI-2 Antisociality scale)
However, the approach is never completely empirical: to generate items, there is always an underlying theory, even if it is implicit
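As an illustration of empirical item selection by item analysis, the sketch below computes a corrected item-total correlation (each item against the sum of the other items) and keeps the items above a threshold. The 0.30 threshold is a common rule of thumb used here as an assumption, and the data and function names are invented.

```python
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Pearson correlation (assumes neither variable is constant)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def corrected_item_total(rows):
    """Correlate each item with the sum of the remaining items."""
    k = len(rows[0])
    result = []
    for i in range(k):
        item = [r[i] for r in rows]
        rest = [sum(r) - r[i] for r in rows]
        result.append(pearson(item, rest))
    return result


def select_items(rows, min_r=0.30):
    """Return the indices of items whose corrected correlation is >= min_r."""
    return [i for i, r in enumerate(corrected_item_total(rows)) if r >= min_r]


# Five respondents, four items; the last item does not follow the others.
rows = [
    [1, 1, 1, 3],
    [2, 2, 2, 5],
    [3, 3, 3, 1],
    [4, 4, 4, 4],
    [5, 5, 5, 2],
]
print(select_items(rows))  # prints [0, 1, 2]
```

This also shows the caveat raised in these notes: the data, not the author, decide which items survive, so a small or unrepresentative sample can lead to dropping a clinically important item.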
Inductive (empirical) method - Advantages and disadvantages
Advantages: Greater objectivity and more representative of reality; we verify our understanding of a construct, explicitly supported by data
Disadvantages:
You don’t necessarily get what you want; the data dictate the final outcome (e.g., factor structure, etc.)
e.g., data suggest that anxiety and depression items are combined into one factor
Statistics can sometimes distort concepts due to sampling bias
e.g., statistics suggest eliminating a clinically important aspect, while discordant results are mostly the result of poor sampling, or too small a sample
Types of questions and response choices
There are a host of different types of items and even more different response choices, making it challenging to present and categorize them (Urbina, 2014)
Different items and possible response choices, depending on:
Type of construct being assessed
Specific uses of an instrument
Personal preferences of the authors
Questions can also be presented in several ways
verbally in an interview
visually in a paper and pencil version
visually in a computerized version (on a fixed computer, or with an application on a smart phone, a tablet)
etc.
The most basic distinction is the type of response that is asked of the person being assessed:
(a) constructed response items and (b) selected response items (Urbina, 2014)
Response choices or scales - Constructed response items
Also called “essay” or “open-ended” or “free-response” questions
A premise is presented to the test taker, but there is no constraint on a fixed answer choice
There are, however, some rules, so there are (a) long-answer and (b) short-answer open-ended questions
Response choices or scales - Constructed response items (continued)
An example of a long answer would be:
“Describe your usual relationship with your child.”
An example of a short answer would be:
“Using no more than 4 or 5 words, complete the following sentence: ‘My usual relationship with my child is __________’”
Constructed response questions are essential in interviews
Response choices or scales - Selected response items
Also called “objective,” “forced-choice,” “multiple-choice,” or “true or false” questions
A premise is presented to the respondent, who is placed under the cognitive constraint of a fixed response choice
This is the most common type of item used in humanities, social sciences and psychological assessment instruments
More objective, easier to derive a numerical score, more reliable, shorter, etc.
Response choices or scales
When a person is asked to answer a selected-response question, he or she must perform four cognitive tasks (Tourangeau, Rips, & Rasinski, 2000)
Comprehension: understanding the relevant content
Retrieval: retrieving from memory the relevant information needed to answer
Judgment: making a judgment based on the retrieved information
Responding: reporting this judgment, based on the available response options
Translation / cross-cultural adaptation
Crucial issue in Quebec
Very often “home-made translations” are used, (1) without any study verifying their psychometric properties and/or (2) without collecting Quebec norms
Sometimes a simple translation is sufficient, but often an adaptation is necessary
Important: Understanding the content and the meaning/significance of the items is more important than an exact translation
The principle of a standardized instrument suggests that this is a step that must be taken seriously
Six steps of cross-cultural adaptation
1. Translate and adapt items (minimum 2 people)
Method of choice: back translation
2. Independent experts review the translation
3. Eliminate or adapt items according to their comments
4. Pilot study with targeted individuals
5. Empirical validation (detailed evaluation of psychometric properties)
6. Standardization (establishing norms)
Five ways to establish cross-cultural equivalence
1. Semantic equivalence
Do items mean the same thing in both languages/cultures?
2. Content equivalence
Is each item relevant in both languages/cultures to measure the construct?
3. Construct equivalence
Are the factor loadings similar? Is the factor structure the same? Is the convergent/discriminant (C/D) validity similar?
4. Criterion-based equivalence
Sometimes quite difficult to conclude that there is no criterion-based equivalence
e.g., with a measure of parenting practices, a practice related to adjustment problems only in one version (or culture) is not necessarily a problem of the instrument
5. Reliability equivalence