W2 - what makes a good test and norms Flashcards

1
Q

Rational-empirical approach to test development

A

Rational: knowledge of the construct and psychological theory drives the process

Empirical: collecting data to evaluate individual items and overall test
- Some tests are developed on purely empirical grounds, like the MMPI (items were retained because they showed predictive validity, not because of theory)

2
Q

Assumptions about psychological testing

A
  • Psychological Traits/States actually exist
  • Psychological Traits can be measured
  • Psychological Traits predict future behaviour
  • Tests have strengths and weaknesses
  • Various sources of error are part of assessment
  • Testing/Assessment can be conducted in a fair and unbiased manner
  • Testing and assessment benefits society
3
Q

What are traits and states

A

Traits – defined as “any distinguishable, relatively enduring way in which one individual varies from another”
e.g. self-esteem, extraversion, optimism/pessimism
States – also distinguish one person from another, but are more temporary.
e.g. mood/affect (sad, but not all the time)

4
Q

Assumptions - Psychological Traits/States actually exist

A
  • This assumes that people HAVE recognizable traits (characteristics)
  • People differ on these and are not homogenous (individual differences)
  • These are (relatively) stable over time
  • They may change over time, but there will be a high correlation between trait scores at different timepoints.
5
Q

Assumptions - Psychological Traits can be Measured

A
  • Psychological traits exist as constructs - an informed, scientific concept developed or constructed to describe or explain behaviour.
  • We can’t see, hear, or touch constructs, but we can infer their existence from behaviour (incl test scores).
6
Q

How to measure psychological traits

A
  • Test developers start with a definition of the construct, then write items that would provide insight into that trait.
  • Content (breadth of coverage) is important for tests
  • A consistent scoring system and a way to interpret the results needs to be devised (e.g. a Likert scale, or for an ability test, 0 for incorrect answers and 1 for correct)
  • This is harder for projective/open-ended responses
7
Q

Assumptions - Psychological Traits Predict Future Behaviour

A
  • Traits (if measured well) are thought to predict real-world behaviour.
  • For example, an aptitude test should be able to predict the future work performance of potential job applicants.
  • The rationale is that if we take a sample of behaviour (personality trait, ability), then it provides insight into that person.
  • > e.g. does sensation-seeking as a trait predict intentions to undertake risky behaviours?
8
Q

Assumptions - Tests have Strengths and Weaknesses

A
  • No matter how well constructed, all tests have strengths and weaknesses.
  • Competent test users appreciate the limitations of the tests they use, and should use other tools in making evaluations as well (e.g. case history, structured clinical interview, etc.) to compensate.
  • > e.g. is the test appropriate for this particular use/population? Can I really predict future likelihood of criminality from using the PCL-R in children?
9
Q

Assumptions - Various sources of Error are part of Assessment

A
  • Error refers to the assumption that factors other than what a test attempts to measure will always influence performance on the test.
  • > e.g. test anxiety, mood on the day, perhaps even weather?
  • Error variance - the component of a test score attributable to sources other than the trait or ability measured.
  • > Both the assessee and assessor are potential sources of error variance. Error variance is to be expected, and considered in psychometrics. Described in classical test theory (CTT)
10
Q

Assumptions - Testing/Assessment can be Conducted in a Fair and Unbiased Manner

A
  • All major test publishers strive to develop instruments that are fair when used in strict accordance with guidelines in the test manual (test protocol).
  • Tests give a standardized set of instructions, for consistency across testing situations. If the test is timed, timing must be measured accurately.
  • Problems arise if a test is administered to a different population than the one it was intended for (e.g. an intelligence test relying on culture-specific knowledge, or an adult test given to a child), or if it systematically discriminates against particular groups (e.g. females, immigrants)
11
Q

Assumptions - Testing and Assessment Benefits Society

A

When used correctly by a skilled assessor, good tests can take the subjectivity out of evaluations

  • e.g. selecting the right applicant for the job, regardless of gender, race, religion
  • Alternative would be nepotism (who you know, who you are related to)
12
Q

What makes for a ‘good’ test

A
  • reliability
  • validity
  • other considerations
13
Q

Reliability

A

The consistency of the measuring tool: that is, the precision with which the test measures (across time, across groups of people), and the extent to which error (e) is present

14
Q

Validity

A

A test is a valid measure if it actually measures what it sets out to measure (and doesn’t measure something unwanted!)
For example, a test on values might also capture socially desirable responses

15
Q

Other considerations for a ‘good’ test

A

Administration, scoring, and interpretation should be straightforward (hence repeatable) for trained examiners.
A good test is one that ultimately benefits testtakers, researchers, educators, and society at large – all of the above properties are important

16
Q

What makes a ‘good’ score

A

Consider how scores on the test will be interpreted:

  • Criterion-referenced tests
  • Norm-referenced testing and assessment
  • percentages and cutoffs
17
Q

Criterion referenced tests

A
  • assess whether a particular criterion is met:
  • > Scoring “high”, or “coming first”, is not important.
  • > Only important that the criterion is met (non-graded pass/fail)
18
Q

Norm-referenced tests

A
  • a method of evaluating performance and deriving meaning from test scores by taking an individual testtaker’s score and comparing it to the scores of a group of testtakers
  • > The meaning of an individual’s test score is understood relative to others’ scores on the same test (NAPLAN, IQ, etc)
19
Q

Percentages and cutoffs

A
  • Another way to evaluate performance (at least for a test of ability) is to look at the percentage
    e.g. Toni scored 34 out of 40 on the exam = 85%
  • We could then establish cutoffs, like 50% = pass. These are arbitrary decisions and academic conventions.
  • Some ability tests have a higher cutoff, such as 85% for a medical exam.
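The percentage-and-cutoff arithmetic above can be sketched in Python (a minimal illustration; the function names are my own, not from any testing library):

```python
def percent_score(raw: int, max_score: int) -> float:
    """Convert a raw score to a percentage of the maximum possible score."""
    return 100 * raw / max_score

def meets_cutoff(raw: int, max_score: int, cutoff_pct: float) -> bool:
    """Check whether a score reaches an (arbitrary) percentage cutoff."""
    return percent_score(raw, max_score) >= cutoff_pct

print(percent_score(34, 40))       # Toni's exam: 85.0
print(meets_cutoff(34, 40, 50.0))  # True: clears the conventional 50% pass mark
print(meets_cutoff(34, 40, 85.0))  # True: also clears a stricter 85% cutoff
```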
20
Q

Norms testing

A
  • Different tests have different scoring systems. The total score on a test is rather arbitrary (determined by number of items, and weighting). So how do we interpret it?
  • Well one way is to determine what is a “typical” / normal score.
  • We would call this the average or mean.
  • We could then look at the variability (how far scores typically differ from the mean). We would call this the standard deviation.
  • For each participant, we could then calculate a standard score (z-score)
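The mean / standard deviation / z-score steps above can be sketched as follows (a minimal illustration; the group scores are hypothetical):

```python
from statistics import mean, stdev

def z_score(raw: float, scores: list[float]) -> float:
    """Standard score: how many standard deviations a raw score
    sits above (+) or below (-) the group mean."""
    return (raw - mean(scores)) / stdev(scores)

# Hypothetical raw scores from a normative group
group = [12.0, 15.0, 18.0, 21.0, 24.0]
print(z_score(18.0, group))  # 0.0 — exactly at the group mean
print(z_score(24.0, group))  # positive — above the group mean
```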
21
Q

What are ‘norms’ in scoring and testing

A
  • Norms are the test performance data of a particular group of test-takers that are designed for use as a reference when evaluating or interpreting individual test scores
  • > Hence the term norm-referenced testing.
  • A normative sample is just the reference group we use to compare an individual’s score against.

Keep in mind:

  • > Who is the group we are comparing scores against?
  • > Therefore, you should always ask: “compared to whom?”
  • > Who is the normative sample to which this test-taker is being compared? Is this a useful/fair comparison?
22
Q

Sampling to develop norms - standardisation

A

Standardization: The process of administering a test to a representative sample of testtakers for the purpose of establishing norms.

  • Keyword here – representative sample.
  • Generally impractical to administer to an entire population, though some exceptions do exist (e.g. NAPLAN, administered to all school students in the target grades).
  • Test developers recruit a sample, so that individual scores can be compared against this group.
23
Q

Sampling to develop norms - sampling

A

Sampling: Test developers select a population group for which the test is intended. The group can be defined more broadly (“adults”) or more narrowly (“criminal offenders”).

  • > For example, a clinical population would be suitable for sampling for a measure of depression.
  • Remember though that we want our sample to be representative of the population.
  • > Gold standard is stratified random sampling, which combines random selection (every member of the population has a chance of being included) with deliberate representation of key subgroups. Rarely done due to cost
24
Q

Stratified sampling

A

involves sampling from each relevant subgroup (e.g. socioeconomic status, ethnicity, age, gender, etc.) in order to build a representative sample. This minimizes selection bias.
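A stratified draw can be sketched as below (a simplified illustration with a hypothetical population; real norming studies would usually sample each stratum in proportion to its share of the population, rather than equal numbers per stratum):

```python
import random

def stratified_sample(population, stratum_of, per_stratum, seed=0):
    """Draw an equal-sized random sample from each subgroup (stratum)."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    strata = {}
    for person in population:
        strata.setdefault(stratum_of(person), []).append(person)
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, min(per_stratum, len(members))))
    return sample

# Hypothetical population tagged by age band
people = [{"id": i, "age_band": "18-34" if i % 2 else "35+"} for i in range(100)]
norm_group = stratified_sample(people, lambda p: p["age_band"], per_stratum=10)
print(len(norm_group))  # 20 — ten drawn from each of the two age bands
```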

25
Q

Sampling to develop norms - purposive sample and incidental sample

A

purposive sample: specifically selecting a sample that is believed to be representative of a population (e.g. a clinical sample from psychological clinics),
-> but may be hard/expensive to access

incidental/convenience sample: a sample that is convenient to access and available to use (i.e. subject pools of students)

However, there may be a selection bias present, and so generalisations must be made with caution

26
Q

Sampling sizes for developing norms

A
  • extremely small samples: even a few unusual individuals might skew the interpretation of the data
  • > The central limit theorem suggests an n of at least 30 is desirable, as the sampling distribution approaches normality
  • extremely large samples: measurement error decreases as sample size increases, up to a point, beyond which further increases have minimal effect
  • > e.g. the WAIS-IV standardization sample was 2,200, so that it could cover 13 different age bands and various subgroups.
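The diminishing-returns point can be illustrated with the standard error of the mean, sd / sqrt(n) (a sketch; sd = 15 is just the familiar IQ convention, not a claim about any particular test):

```python
import math

def standard_error(sd: float, n: int) -> float:
    """Standard error of the mean: sd / sqrt(n).
    Shrinks quickly for small n, then flattens out."""
    return sd / math.sqrt(n)

# With sd = 15, note how little is gained past a few hundred cases:
for n in (10, 30, 100, 1000, 2200):
    print(n, round(standard_error(15, n), 2))
```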
27
Q

Types of norms - percentile

A

Percentile – the percentage of people in the normative sample who score at or below a given raw score on the test

  • > Low numbers (eg 10th percentile) mean you score at the BOTTOM of a distribution.
  • > High numbers (eg 95th percentile) mean you score at the TOP of a distribution.

An advantage is that they are (relatively) easy to calculate: developers provide the raw scores and their associated percentiles. But most scores fall near the middle of the distribution (e.g. the 34th–68th percentile), so equal differences in raw scores translate into large percentile differences near the middle and only small percentile differences at the low and high tails.
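The percentile calculation can be sketched as below (a minimal illustration with hypothetical norm scores; note that conventions differ, with some definitions counting strictly-below rather than at-or-below):

```python
def percentile_rank(raw: float, norm_scores: list[float]) -> float:
    """Percentage of the normative sample scoring at or below a raw score."""
    at_or_below = sum(1 for s in norm_scores if s <= raw)
    return 100 * at_or_below / len(norm_scores)

norms = [10, 12, 15, 15, 18, 20, 22, 25, 28, 30]  # hypothetical raw scores
print(percentile_rank(20, norms))  # 60.0 — 6 of the 10 norm scores are at or below 20
print(percentile_rank(30, norms))  # 100.0 — top of this normative sample
```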

28
Q

Types of norms - age/grade, national norms and national anchor norms

A
  • Age/grade norms : indicates average performance of samples at particular ages or grades. Often used when developmental effects are expected (e.g. reading level, vocabulary, performance on maths test)
  • National norms : norms derived from a sample that was nationally representative of the population at the time of testing
  • > Population demographics might change over time, so may need to be repeated
  • National anchor norms: equivalency table to allow for conversion between two tests (assuming both nationally representative samples).
29
Q

Types of norms - subgroup norms

A

Subgroup norms: separate norms may be developed for specific demographic subgroups (e.g. socioeconomic status, geographic region, etc.)
-> For example, a test of gender-role identification (traditionally masculine and feminine personality traits) might provide separate norms tables for those who identify as male/female

30
Q

Types of norms - local norms

A

Local norms : provides normative information for a local population if locals are believed to differ in some way from the national norms
-> Queensland might adopt different norms on a standardized test due to differences in curriculum, or for regional/urban school districts

31
Q

Limitations of norms

A
  • Assumption of norms is that they remain valid over time (i.e. the population being assessed remains stable). For some types of tests, this assumption will hold.
  • Normative samples for intelligence tests quickly become dated. Why?
  • > Intelligence (IQ) is actually RISING by 3 points per decade
  • > This is termed the Flynn effect
  • > Many explanations, including prenatal care and nutrition, literacy rate, education
  • Thus a limitation of some tests is obsolescence.
32
Q

Fixed reference group scoring systems

A
  • The distribution of scores obtained from the initial standardization is used as the basis for calculating scores on future administrations
  • > For example, most intelligence tests are standardized with a mean of 100, and a standard deviation of 15 (arbitrary). Raw scores are scaled to fit that standard. New norm tables must be created periodically, but we can compare across different tests, time periods, etc.
  • > NAPLAN and the American Scholastic Aptitude Test (SAT) are other examples of fixed reference group scoring
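The rescaling described above can be sketched as follows (a minimal illustration; the reference-sample statistics are hypothetical, and only the mean-100/sd-15 convention comes from the card):

```python
def scaled_score(raw: float, ref_mean: float, ref_sd: float,
                 scale_mean: float = 100, scale_sd: float = 15) -> float:
    """Rescale a raw score against the standardization (reference) sample
    onto a fixed metric, e.g. the IQ convention of mean 100, sd 15."""
    z = (raw - ref_mean) / ref_sd  # position within the reference distribution
    return scale_mean + scale_sd * z

# Hypothetical reference-sample statistics: mean raw score 50, sd 10
print(scaled_score(50, 50, 10))  # 100.0 — exactly average
print(scaled_score(60, 50, 10))  # 115.0 — one sd above the reference mean
```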
33
Q

Culture and inference

A
  • In selecting a test for use, check how appropriate it is for use with the targeted testtaker population
  • > We would use slightly different norms (and in fact, different test items, at least for vocabulary subtest) for the Australian edition of the WAIS-IV intelligence test to that of the American edition.
  • When interpreting test results consider whether alternate forms of assessment are necessary
  • > For example, when dealing with a testtaker from a non-English-speaking background, the Cattell Culture Fair Intelligence Test (CCFIT) might be a better choice, or a test in their native language