W2 - what makes a good test and norms Flashcards by Melanie Powell

Rational-empirical approach to test development

Rational: knowledge of the construct and psychological theory drives the process

Empirical: collecting data to evaluate individual items and overall test
- Some tests are based purely on empirical grounds, like the MMPI (i.e. items are kept because they have predictive validity)

Assumptions about psychological testing

Psychological Traits/States actually exist
Psychological Traits can be measured
Psychological Traits predict future behaviour
Tests have strengths and weaknesses
Various sources of error are part of assessment
Testing/Assessment can be conducted in a fair and unbiased manner
Testing and assessment benefits society

What are traits and states

Traits – defined as “any distinguishable, relatively enduring way in which one individual varies from another”
e.g. self-esteem, extraversion, optimism/pessimism
States – also distinguish one person from another, but are more temporary.
e.g. mood/affect (sad, but not all the time)

Assumptions - Psychological Traits/States actually exist

This assumes that people HAVE recognizable traits (characteristics)
People differ on these and are not homogeneous (individual differences)
These are (relatively) stable over time
They may change over time, but there will be a high correlation between trait scores at different timepoints.

Assumptions - Psychological Traits can be Measured

Psychological traits exist as constructs - an informed, scientific concept developed or constructed to describe or explain behaviour.
We can’t see, hear, or touch constructs, but we can infer their existence from behaviour (including test scores).

How to measure psychological traits

Test developers start with a definition of the construct, then construct items that would provide insight into that trait.
Content (breadth of coverage) is important for tests
A consistent scoring system and a way to interpret the results need to be devised (e.g. a Likert scale, or for an ability test, 0 for an incorrect answer and 1 for a correct one)
This is harder for projective/open-ended responses

Assumptions - Psychological Traits Predict Future Behaviour

Traits (if measured well) are thought to predict real-world behaviour.
For example, an aptitude test should be able to predict the future work performance of potential job applicants.
The rationale is that if we take a sample of behaviour (personality trait, ability), then it provides insight into that person.
> e.g. does sensation-seeking as a trait predict intentions to undertake risky behaviours?

Assumptions - Tests have Strengths and Weaknesses

No matter how well constructed, all tests have strengths and weaknesses.
Competent test users appreciate the limitations of the tests they use, and should use other tools in making evaluations as well (e.g. case history, structured clinical interview, etc.) to compensate.
> e.g. is the test appropriate for this particular use/population? Can I really predict the likelihood of future criminality by using the PCL-R with children?

Assumptions - Various sources of Error are part of Assessment

Error refers to the assumption that factors other than what a test attempts to measure will always influence performance on the test.
> e.g. test anxiety, mood on the day, perhaps even weather?
Error variance - the component of a test score attributable to sources other than the trait or ability measured.
> Both the assessee and assessor are potential sources of error variance. Error variance is to be expected and is accounted for in psychometrics, as described in classical test theory (CTT).
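
To make error variance concrete, here is a minimal Python sketch of the classical test theory idea that an observed score is a true score plus error (X = T + E). The function name, the testtaker, and all of the numbers are hypothetical and chosen only for illustration.

```python
from statistics import mean

def observed_score(true_score, error):
    """Classical test theory: observed score X = true score T + error E."""
    return true_score + error

# Hypothetical testtaker with a true score of 50, tested on five occasions.
# Each error value stands in for things like test anxiety or mood on the day.
true_score = 50
errors = [3, -2, 0, -4, 3]  # illustrative values that happen to sum to zero

observed = [observed_score(true_score, e) for e in errors]
print(observed)        # [53, 48, 50, 46, 53]
print(mean(observed))  # the errors cancel here, so the mean equals the true score
```

No single observed score equals the true score exactly, which is why error is treated as an expected part of any assessment.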

Assumptions - Testing/Assessment can be Conducted in a Fair and Unbiased Manner

All major test publishers strive to develop instruments that are fair when used in strict accordance with guidelines in the test manual (test protocol).
Tests give a standardized set of instructions, for consistency across testing situations. If the test is timed, the timing needs to be measured accurately.
Problems arise if a test is administered to a different population than the one it was intended for (e.g. an intelligence test relying on culture-specific knowledge, or an adult test given to a child), or if it systematically discriminates against particular groups (e.g. females, immigrants).

Assumptions - Testing and Assessment Benefits Society

When used correctly by a skilled assessor, good tests can take the subjectivity out of evaluations

e.g. selecting the right applicant for the job, regardless of gender, race, religion
Alternative would be nepotism (who you know, who you are related to)

What makes for a ‘good’ test

reliability
validity
other considerations

Reliability

The consistency of the measuring tool: that is, the precision with which the test measures (across time, across groups of people), and the extent to which error (e) is present

Validity

A test is a valid measure if it actually measures what it sets out to measure (and doesn’t measure something unwanted!)
For example, a test on values might also capture socially desirable responses

Other considerations for a ‘good’ test

Administration, scoring, interpretation should be straightforward (hence repeatable) for trained examiners.
A good test is one that ultimately benefits testtakers, researchers, educators, and society at large – all of the above properties are important

What makes a ‘good’ score

Consider how scores on the test will be interpreted:

Criterion-referenced tests
Norm-referenced testing and assessment
percentages and cutoffs

Criterion-referenced tests

Assess whether particular criteria are met:
> Scoring “high”, or “coming first”, is not important.
> It is only important that the criteria are met (non-graded pass/fail)

Norm-referenced tests

a method of evaluating performance and deriving meaning from test scores by evaluating an individual testtaker’s score and comparing it to scores of a group of test takers
> The meaning of an individual’s test score is understood relative to others’ scores on the same test (NAPLAN, IQ, etc)

Percentages and cutoffs

Another way to evaluate performance (at least for a test of ability) is to look at the percentage
e.g. Toni scored 34 out of 40 on the exam = 85%
We could then establish cutoffs, like 50% = pass. These are arbitrary decisions and academic conventions.
Some ability tests have a higher cutoff, such as 85% for a medical exam.
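
The arithmetic above can be sketched in a few lines of Python. The function names are hypothetical, and treating a score exactly on the cutoff as a pass is an assumption, since the card does not say how ties are handled.

```python
def percentage(raw_score, max_score):
    """Convert a raw score to a percentage of the maximum possible score."""
    return 100 * raw_score / max_score

def passes(raw_score, max_score, cutoff_percent):
    """True if the percentage meets or exceeds the cutoff (assumed tie rule)."""
    return percentage(raw_score, max_score) >= cutoff_percent

# Toni scored 34 out of 40 on the exam.
print(percentage(34, 40))   # 85.0
print(passes(34, 40, 50))   # True: clears a 50% pass mark
print(passes(34, 40, 85))   # True: exactly on an 85% cutoff
print(passes(33, 40, 85))   # False: 82.5% falls short of 85%
```

The same raw score can pass or fail depending on where the cutoff is set, which is why the cutoff itself is a convention rather than a property of the test.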

Norms testing

Different tests have different scoring systems. The total score on a test is rather arbitrary (determined by number of items, and weighting). So how do we interpret it?
One way is to determine what a “typical” / normal score is.
We would call this the average or mean.
We could then look at the variability (how far scores typically differ from the mean). We would call this the standard deviation.
For each participant, we could then calculate a standard score (z-score): the number of standard deviations their raw score lies above or below the mean.
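
As a rough illustration of these steps, the sketch below computes the mean, the standard deviation, and a z-score for each raw score. The scores are made up, and using the population standard deviation (`pstdev`) rather than the sample version is an assumption made to keep the numbers tidy.

```python
from statistics import mean, pstdev

def z_scores(scores):
    """Return each score as a number of standard deviations from the mean."""
    m = mean(scores)
    sd = pstdev(scores)  # population SD; an assumption for this sketch
    return [(x - m) / sd for x in scores]

# Hypothetical raw scores from eight testtakers (mean 5, SD 2).
raw = [2, 4, 4, 4, 5, 5, 7, 9]
print(z_scores(raw))
# [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```

A z-score of 0 is exactly average, and a z-score of 2.0 means the raw score sits two standard deviations above the mean, whatever the original scoring system was.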

What are ‘norms’ in scoring and testing

Norms are the test performance data of a particular group of test-takers that are designed for use as a reference when evaluating or interpreting individual test scores
> Hence the term norm-referenced testing.
A normative sample is just the reference group we use to compare an individual’s score against.

Keep in mind:

> Who is the group we are comparing scores against?
> Therefore, you should always ask, “compared to whom?”
> Who is the normative sample to which this test-taker is being compared? Is this a useful/fair comparison?

Sampling to develop norms - standardisation

Standardization: The process of administering a test to a representative sample of testtakers for the purpose of establishing norms.

Keyword here – representative sample.
Generally impractical to administer to an entire population, though some exceptions do exist (e.g. NAPLAN of all school students in target grades).
Test developers recruit a sample, so that individual scores can be compared against this group.

Sampling to develop norms - sampling

Sampling: Test developers select a population group for which the test is intended. The group can be defined more broadly (“adults”) or more narrowly (“criminal offenders”).

> For example, a clinical population would be suitable for sampling for a measure of depression.
Remember though that we want our sample to be representative of the population.
> The gold standard is stratified-random sampling, where every member of the population has an equal chance of being included. It is rarely done due to cost.

Stratified sampling

Involves recruiting from different subgroups (e.g. socioeconomic status, ethnicity, age, gender, etc.) so that the sample is representative of the population. This reduces selection bias.
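
One way to picture stratified-random sampling is the sketch below, which groups a population into strata and then draws the same fraction at random from each. Everything here is hypothetical: the function name, the two age bands, the 10% sampling fraction, and the fixed seed (used only so the example is repeatable).

```python
import random

def stratified_sample(population, stratum_of, fraction, seed=0):
    """Draw the same fraction at random from every stratum (subgroup)."""
    rng = random.Random(seed)
    strata = {}
    for person in population:
        strata.setdefault(stratum_of(person), []).append(person)
    sample = []
    for members in strata.values():
        k = round(len(members) * fraction)  # keep each stratum's share
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical population: 60 people aged 18-39 and 40 people aged 40+.
population = [("18-39", i) for i in range(60)] + [("40+", i) for i in range(40)]
sample = stratified_sample(population, stratum_of=lambda p: p[0], fraction=0.1)

counts = {}
for band, _ in sample:
    counts[band] = counts.get(band, 0) + 1
print(counts)  # {'18-39': 6, '40+': 4}
```

The sample keeps the 60/40 split of the population, which is what makes it representative on that variable.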

Sampling to develop norms - purposive sample and incidental sample

- Purposive sample: specifically selecting a sample that is believed to be representative of a population (e.g. a clinical sample from psychological clinics)
> but may be hard/expensive to access
- Incidental/convenience sample: a sample that is convenient to access and available to use (e.g. subject pools of students)
> However, there may be a selection bias present, and so generalisations must be made with caution

Sampling sizes for developing norms

- Extremely small samples: even a few individuals might skew the interpretation of the data
> The central limit theorem suggests that a sample size above about 30 is desirable, as the distribution of sample means then approaches a normal distribution
- Extremely large samples: sampling error decreases as sample size increases, up to a point, beyond which further increases have minimal effect
> e.g. the WAIS-IV standardization sample was 2,200, so that it could cover 13 different age bands and different subgroups.
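
The diminishing return from larger samples can be sketched with the standard error of the mean, SD divided by the square root of n. Using this as the index of error is an assumption for illustration, as is the SD of 15 and the particular sample sizes.

```python
import math

def standard_error_of_mean(sd, n):
    """Standard error of the mean: shrinks with the square root of n."""
    return sd / math.sqrt(n)

# Hypothetical test with SD = 15, normed on samples of increasing size.
for n in (10, 30, 100, 1000, 2200, 10000):
    print(n, round(standard_error_of_mean(15, n), 2))
```

Going from 10 to 100 people cuts the standard error by a large amount, while going from 2,200 to 10,000 changes it very little, which matches the "up to a point" caveat above.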

Types of norms - percentile

Percentile – the percentage of people in the normative sample who score below a given raw score on the test
> Low numbers (e.g. 10th percentile) mean you score at the BOTTOM of a distribution.
> High numbers (e.g. 95th percentile) mean you score at the TOP of a distribution.
Advantage is that they are (relatively) easy to calculate. Developers provide the raw scores and associated percentiles.
But most scores fall near the middle of the distribution, so a small difference in raw scores there produces a large difference in percentile, while the same raw-score difference at the low and high tails produces only a small one.
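
A percentile can be worked out directly from a set of norm scores, as in the sketch below. It assumes the "percentage scoring below" definition (some sources count scores at or below instead), and the function name and the ten norm scores are hypothetical.

```python
def percentile_rank(score, norm_scores):
    """Percentage of the normative sample scoring below the given raw score."""
    below = sum(1 for s in norm_scores if s < score)
    return 100 * below / len(norm_scores)

# Hypothetical normative sample of ten raw scores.
norm_scores = [12, 15, 15, 18, 20, 21, 23, 25, 27, 30]
print(percentile_rank(21, norm_scores))  # 50.0
print(percentile_rank(30, norm_scores))  # 90.0
print(percentile_rank(12, norm_scores))  # 0.0
```

In practice a test manual supplies a table of raw scores and their percentiles, so the user looks the value up rather than computing it.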

Types of norms - age/grade, national norms and national anchor norms

- Age/grade norms: indicate the average performance of samples at particular ages or grades. Often used when developmental effects are expected (e.g. reading level, vocabulary, performance on a maths test)
- National norms: norms derived from a sample that was nationally representative of the population at the time of testing
> Population demographics might change over time, so norming may need to be repeated
- National anchor norms: an equivalency table that allows conversion between scores on two tests (assuming both used nationally representative samples).

Types of norms - subgroup norms

Subgroup norms: separate norms may be developed for specific demographic subgroups (e.g. socioeconomic status, geographic region, etc.)
> For example, a test of gender-role identification (traditionally masculine and feminine personality traits) might provide separate norms tables for those who identify as male or female

Types of norms - local norms

Local norms: provide normative information for a local population if locals are believed to differ in some way from the national norms
> e.g. Queensland might adopt different norms on a standardized test due to differences in curriculum, or for regional/urban school districts

Limitations of norms

- An assumption of norms is that they remain valid over time (i.e. the population being assessed remains stable). For some types of tests, this assumption will hold.
- Normative samples for intelligence tests quickly become dated. Why?
> Average IQ test scores have been RISING by about 3 points per decade
> This is termed the Flynn effect
> Many explanations have been proposed, including prenatal care and nutrition, literacy rate, education
- Thus a limitation of some tests is obsolescence.

Fixed reference group scoring systems

- The distribution of scores obtained from the initial standardization is used as the basis for calculating scores on future administrations
> For example, most intelligence tests are standardized with a mean of 100 and a standard deviation of 15 (arbitrary values). Raw scores are scaled to fit that standard. New norm tables must be created periodically, but we can compare across different tests, time periods, etc.
> NAPLAN and the American Scholastic Aptitude Test (SAT) are other examples of fixed reference group scoring systems
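
The scaling to a mean of 100 and a standard deviation of 15 can be sketched as a linear conversion from a z-score. This is a simplification under stated assumptions: the function name is hypothetical, and real tests convert raw scores through published norm tables rather than a single formula.

```python
def to_standard_scale(z, mean=100, sd=15):
    """Place a z-score on a scale with the chosen mean and standard deviation."""
    return mean + sd * z

print(to_standard_scale(0))    # 100: exactly average
print(to_standard_scale(1))    # 115: one SD above the mean
print(to_standard_scale(-2))   # 70: two SDs below the mean
```

Because the mean and SD are fixed in advance, a score of 115 carries the same relative meaning on any test that uses this scale.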

Culture and inference

- In selecting a test for use, check how appropriate it is for use with the targeted testtaker population
> We would use slightly different norms (and in fact, different test items, at least for the vocabulary subtest) for the Australian edition of the WAIS-IV intelligence test compared to the American edition.
- When interpreting test results, consider whether alternate forms of assessment are necessary
> For example, if dealing with someone from a non-English speaking background, the Cattell Culture Fair Intelligence Test (CCFIT) might be better to use, or a test in their native language

W2 - what makes a good test and norms Flashcards

(33 cards)