W2 - what makes a good test and norms Flashcards
Rational-empirical approach to test development
Rational: knowledge of the construct and psychological theory drives the process
Empirical: collecting data to evaluate individual items and overall test
- Some tests, like the MMPI, are based purely on empirical grounds (i.e. items are kept because they have predictive validity)
Assumptions about psychological testing
- Psychological Traits/States actually exist
- Psychological Traits can be measured
- Psychological Traits predict future behaviour
- Tests have strengths and weaknesses
- Various sources of error are part of assessment
- Testing/Assessment can be conducted in a fair and unbiased manner
- Testing and assessment benefits society
What are traits and states
Traits – defined as “any distinguishable, relatively enduring way in which one individual varies from another”
e.g. self-esteem, extraversion, optimism/pessimism
States – also distinguish one person from another, but are more temporary.
e.g. mood/affect (sad, but not all the time)
Assumptions - Psychological Traits/States actually exist
- This assumes that people HAVE recognizable traits (characteristics)
- People differ on these and are not homogeneous (individual differences)
- These are (relatively) stable over time
- They may change over time, but there will be a high correlation between trait scores at different timepoints.
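The stability claim can be checked by correlating scores from the same people at two timepoints. The following is a minimal sketch: the scores are made up for illustration, and `pearson_r` is a hypothetical helper name, not something from these notes.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products of deviations from each mean.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical extraversion scores for five people, measured twice.
time_1 = [12, 18, 25, 30, 22]
time_2 = [14, 17, 27, 31, 20]

r = pearson_r(time_1, time_2)
print(f"Correlation between timepoints: {r:.2f}")
```

Individual scores shift a little between the two occasions, but the rank order of people is largely preserved, which is what a high correlation reflects.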
Assumptions - Psychological Traits can be Measured
- Psychological traits exist as constructs - an informed, scientific concept developed or constructed to describe or explain behaviour.
- We can’t see, hear, or touch constructs, but we can infer their existence from behaviour (including test scores).
How to measure psychological traits
- Test developers start with a definition of the construct, then construct items that would provide insight into that trait.
- Content (breadth of coverage) is important for tests
- A consistent scoring system and a way to interpret the results need to be devised (e.g. a Likert scale, or for an ability test, 0 for incorrect answers and 1 for correct)
- This is harder for projective/open-ended responses
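The two scoring systems mentioned above can be sketched as follows. This is an illustration only: the function names, the 1 to 5 scale range, and the example responses are assumptions, and real tests may use other rules (for example weighted items).

```python
def score_likert(responses, scale_min=1, scale_max=5):
    """Total score for Likert items: the sum of the ratings.

    The 1-5 range is an assumed default; pass other bounds as needed.
    """
    for r in responses:
        if not scale_min <= r <= scale_max:
            raise ValueError(f"response {r} is outside the scale")
    return sum(responses)

def score_ability(answers, answer_key):
    """Total score for an ability test: 1 per correct answer, 0 otherwise."""
    if len(answers) != len(answer_key):
        raise ValueError("answers and answer_key must be the same length")
    return sum(1 if a == k else 0 for a, k in zip(answers, answer_key))

# Hypothetical responses to a four-item questionnaire and a four-item quiz.
likert_total = score_likert([4, 5, 3, 2])
ability_total = score_ability(["a", "c", "b", "d"], ["a", "b", "b", "d"])
print(likert_total, ability_total)
```

Both rules are mechanical, so two examiners scoring the same responses get the same total. That consistency is what is hard to achieve with open-ended responses.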
Assumptions - Psychological Traits Predict Future Behaviour
- Traits (if measured well) are thought to predict real-world behaviour.
- For example, an aptitude test should be able to predict the future work performance of potential job applicants.
- The rationale is that if we take a sample of behaviour (personality trait, ability), then it provides insight into that person.
- > e.g. does sensation-seeking as a trait predict intentions to undertake risky behaviours?
Assumptions - Tests have Strengths and Weaknesses
- No matter how well constructed, all tests have strengths and weaknesses.
- Competent test users appreciate the limitations of the tests they use, and should use other tools in making evaluations as well (e.g. case history, structured clinical interview, etc.) to compensate.
- > e.g. is the test appropriate for this particular use/population? Can I really predict the future likelihood of criminality by using the PCL-R in children?
Assumptions - Various sources of Error are part of Assessment
- Error refers to the assumption that factors other than what a test attempts to measure will always influence performance on the test.
- > e.g. test anxiety, mood on the day, perhaps even weather?
- Error variance - the component of a test score attributable to sources other than the trait or ability measured.
- > Both the assessee and assessor are potential sources of error variance. Error variance is to be expected and is accounted for in psychometrics, as described in classical test theory (CTT).
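In classical test theory this idea is usually written as an observed score made up of a true score plus error. The symbols below (X, T, E) follow the common convention and are not taken from these notes. The variance decomposition assumes that error is uncorrelated with the true score.

```latex
\[
X = T + E,
\qquad
\sigma_X^2 = \sigma_T^2 + \sigma_E^2
\]
```

Here \sigma_E^2 is the error variance: the part of the observed score variance that comes from sources other than the trait or ability being measured.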
Assumptions - Testing/Assessment can be Conducted in a Fair and Unbiased Manner
- All major test publishers strive to develop instruments that are fair when used in strict accordance with guidelines in the test manual (test protocol).
- Tests give a standardized set of instructions, for consistency across testing situations. If a test is timed, the timing needs to be measured accurately.
- Problems arise if a test is administered to a different population than the one it was intended for (e.g. an intelligence test relying on culture-specific knowledge, or an adult test given to a child), or if it systematically discriminates against particular groups (e.g. females, immigrants)
Assumptions - Testing and Assessment Benefits Society
When used correctly by a skilled assessor, good tests can take the subjectivity out of evaluations
- e.g. selecting the right applicant for the job, regardless of gender, race, religion
- Alternative would be nepotism (who you know, who you are related to)
What makes for a ‘good’ test
- reliability
- validity
- other considerations
Reliability
The consistency of the measuring tool: that is, the precision with which the test measures (across time, across groups of people), and the extent to which error (e) is present.
Validity
A test is a valid measure if it actually measures what it sets out to measure (and doesn’t measure something unwanted!)
For example, a test on values might also capture socially desirable responses
Other considerations for a ‘good’ test
Administration, scoring, interpretation should be straightforward (hence repeatable) for trained examiners.
A good test is one that ultimately benefits testtakers, researchers, educators, and society at large – all of the above properties are important
What makes a ‘good’ score
Consider how scores on the test will be interpreted:
- Criterion-referenced tests
- Norm-referenced testing and assessment
- percentages and cutoffs
Criterion referenced tests
- assess whether a particular criterion is met:
- > Scoring “high”, or “coming first”, is not important.
- > Only important that the criterion is met (non-graded pass/fail)
Norm-referenced tests
- a method of evaluating performance and deriving meaning from test scores by evaluating an individual testtaker’s score and comparing it to scores of a group of test takers
- > The meaning of an individual’s test score is understood relative to others’ scores on the same test (NAPLAN, IQ, etc)
Percentages and cutoffs
- Another way to evaluate performance (at least for a test of ability) is to look at the percentage
e.g. Toni scored 34 out of 40 on the exam = 85%
- We could then establish cutoffs, like 50% = pass. These are arbitrary decisions and academic conventions.
- Some ability tests have a higher cutoff, such as 85% for a medical exam.
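The arithmetic for Toni's score and the two cutoffs mentioned above can be sketched as follows. The function names are hypothetical, and treating a score exactly at the cutoff as a pass is an assumption, since the notes do not say how ties are handled.

```python
def percentage(raw_score, max_score):
    """Convert a raw score to a percentage of the maximum possible score."""
    return 100 * raw_score / max_score

def passes(raw_score, max_score, cutoff_percent=50):
    """True if the percentage meets the cutoff (assumed: at-cutoff passes)."""
    return percentage(raw_score, max_score) >= cutoff_percent

toni_percent = percentage(34, 40)
print(toni_percent)                            # 85.0
print(passes(34, 40))                          # 50% cutoff
print(passes(34, 40, cutoff_percent=85))       # stricter 85% cutoff
print(passes(33, 40, cutoff_percent=85))       # one mark lower
```

The same raw score can pass or fail depending only on where the cutoff is set, which is why the cutoff is a convention and not a property of the test.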
Norms testing
- Different tests have different scoring systems. The total score on a test is rather arbitrary (determined by number of items, and weighting). So how do we interpret it?
- One way is to determine what a “typical” / normal score is.
- We would call this the average or mean.
- We could then look at the variability (how far scores typically differ from the mean). We would call this the standard deviation.
- For each participant, we could then calculate a standard score (z-score): the participant's score minus the mean, divided by the standard deviation.
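The steps above (mean, standard deviation, then a standard score per participant) can be sketched with Python's `statistics` module. The raw scores are made up, and using the population standard deviation (`pstdev`) is an assumption; the sample standard deviation (`stdev`) is the usual alternative when the data are treated as a sample.

```python
from statistics import mean, pstdev

def z_scores(scores):
    """Return the z-score for each raw score: (score - mean) / SD."""
    m = mean(scores)
    sd = pstdev(scores)  # population SD; an assumption for this sketch
    if sd == 0:
        raise ValueError("all scores are identical, so z-scores are undefined")
    return [(s - m) / sd for s in scores]

# Hypothetical raw scores for eight participants (mean 5, SD 2).
raw = [2, 4, 4, 4, 5, 5, 7, 9]
z = z_scores(raw)
print(z)
```

A z-score of 0 means the participant scored exactly at the mean. Positive and negative values show how many standard deviations above or below the mean the score falls, regardless of the test's original scoring system.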
What are ‘norms’ in scoring and testing
- Norms are the test performance data of a particular group of test-takers that are designed for use as a reference when evaluating or interpreting individual test scores
- > Hence the term norm-referenced testing.
- A normative sample is just the reference group we use to compare an individual’s score against.
Keep in mind:
- > Who is the group we are comparing scores against?
- > Therefore, you should always ask: “compared to whom?”
- > Who is the normative sample to which this test-taker is being compared? Is this a useful/fair comparison?
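A small sketch shows why “compared to whom?” matters. The same raw score is placed against two different normative samples; both samples and the group names are invented for illustration. `percentile_rank` is a hypothetical helper using one simple definition (percentage of the norm group scoring below the score); other definitions exist.

```python
def percentile_rank(score, norm_scores):
    """Percentage of the normative sample scoring strictly below `score`."""
    below = sum(1 for s in norm_scores if s < score)
    return 100 * below / len(norm_scores)

# Hypothetical scores on a depression measure for two norm groups.
general_adults = [10, 12, 15, 18, 20, 22, 25, 28, 30, 35]
clinical_group = [22, 25, 27, 28, 30, 32, 34, 36, 38, 40]

score = 26
rank_general = percentile_rank(score, general_adults)
rank_clinical = percentile_rank(score, clinical_group)
print(rank_general, rank_clinical)
```

The raw score is identical in both cases, but it looks high against one group and low against the other, so the interpretation depends entirely on the choice of normative sample.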
Sampling to develop norms - standardisation
Standardization: The process of administering a test to a representative sample of testtakers for the purpose of establishing norms.
- Keyword here – representative sample.
- Generally impractical to administer to an entire population, though some exceptions do exist (e.g. NAPLAN of all school students in target grades).
- Test developers recruit a sample, so that individual scores can be compared against this group.
Sampling to develop norms - sampling
Sampling: Test developers select a population group for which the test is intended. The group can be defined more broadly (“adults”) or narrowly (“criminal offenders”).
- > For example, a clinical population would be suitable for sampling for a measure of depression.
- Remember though that we want our sample to be representative of the population.
- > Gold standard is stratified-random sampling, where every member of the population has an equal opportunity of being included. This is rarely done due to cost.
Stratified sampling
Involves recruiting from different subgroups (e.g. by socioeconomic status, ethnicity, age, gender, etc.) so that the sample is representative of the population. This reduces selection bias.
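One common way to do this is to draw the same fraction at random from every subgroup, so the sample mirrors the population's makeup. The sketch below assumes that approach (proportional allocation). The function name, the `age_group` field, and the population are all hypothetical.

```python
import random
from collections import defaultdict

def stratified_sample(population, stratum_of, fraction, seed=None):
    """Randomly draw the same fraction of people from every stratum."""
    rng = random.Random(seed)
    # Group people by stratum (e.g. age group).
    strata = defaultdict(list)
    for person in population:
        strata[stratum_of(person)].append(person)
    sample = []
    for members in strata.values():
        k = round(len(members) * fraction)
        # Sampling without replacement within the stratum.
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical population: 60 younger adults and 40 older adults.
population = (
    [{"id": i, "age_group": "18-39"} for i in range(60)]
    + [{"id": i, "age_group": "40+"} for i in range(60, 100)]
)

sample = stratified_sample(population, lambda p: p["age_group"], 0.1, seed=1)
younger = sum(1 for p in sample if p["age_group"] == "18-39")
older = sum(1 for p in sample if p["age_group"] == "40+")
print(younger, older)
```

The sample keeps the population's 60/40 split between the two age groups, which a simple random draw of ten people would not guarantee.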