WK 2 Norms and Reliability Flashcards

Norms and Reliability

1
Q

Describe the main premise of classical test theory. How does it relate to reliability?

A

CTT says that every person’s observed score is made up of a true score (of the trait) plus error. For a population, the total variance is the true variance plus the error variance. Reliability is the proportion of total variance that is true variance. That is, reliability is directly influenced by true variance, but note that we can only ever estimate the true variance.
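A quick simulation sketch of the CTT identity (hypothetical numbers; true scores are only knowable in a simulation like this):

```python
import random

random.seed(0)
# Simulate true scores (latent trait) and random error for 10,000 people
true_scores = [random.gauss(100, 15) for _ in range(10_000)]
errors = [random.gauss(0, 5) for _ in range(10_000)]
# Observed score = true score + error
observed = [t + e for t, e in zip(true_scores, errors)]

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

# Reliability = true variance / total variance
reliability = variance(true_scores) / variance(observed)
print(round(reliability, 2))  # close to 15**2 / (15**2 + 5**2) = 0.90
```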

2
Q

Describe measurement error.

A

Measurement error is also known as error variance. It is made up of both systematic error (predictable and constant) and random error (unpredictable, unrelated noise). Random error is less of a problem because it tends to cancel out across observations, leaving the mean roughly unchanged. Systematic error is more serious, but if you know what could be causing it, you can adjust your numbers accordingly.

3
Q

List common sources of measurement error

A
  1. Test Construction
  2. Test Administration
  3. Test Scoring and Interpretation
  4. Sampling Error
  5. Methodological Errors
4
Q

Describe Test Construction error

A

Variation due to differences in items on the same test or between tests

5
Q

Describe Test Administration error

A

Variation due to testing environment
(test-taker: anxiety, stress, drugs, sleep, physical discomfort)
(Examiner: appearance, demeanour)

6
Q

Describe Test Scoring and Interpretation error

A

Variation due to scoring and interpretation, e.g. scoring a video of a mother’s warmth behaviours towards an aggressive child

7
Q

Describe Sampling Error

A

Variation due to the representativeness of the sample, e.g. gathering a sample that doesn’t represent the population, such as one containing only educated people

8
Q

Describe Methodological Errors

A

Variation due to poor training, unstandardised administration, unclear questions, biased questions

9
Q

What is the difference between CTT and IRT?

A

CTT assumes just two components of measurement (true score and error) and that all items have equal ability to measure the target in question.
IRT is very powerful for understanding how well an individual item taps a latent trait: it examines items specifically and can reveal different levels of the latent trait being examined.

10
Q

IRT incorporates considerations of item difficulty and discrimination. Can you describe what they mean in the context of IRT?

A

Difficulty relates to how hard an item is to complete, solve or comprehend.
Discrimination refers to the degree to which an item differentiates between high and low levels of the construct; e.g. if the discrimination slope is steep, the item is good at discriminating between different levels.
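The two parameters can be sketched with a standard two-parameter logistic (2PL) item response function (illustrative values only):

```python
import math

def p_correct(theta, a, b):
    """2PL item response function: probability of solving/endorsing an item.
    theta: person's latent trait level
    a: discrimination (slope), b: difficulty (location)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# At theta == b (trait level matches difficulty), probability is exactly 0.5:
print(p_correct(0.0, a=1.5, b=0.0))  # 0.5
# A steeper slope (higher a) separates nearby trait levels more sharply:
print(p_correct(0.5, a=2.5, b=0.0) > p_correct(0.5, a=0.5, b=0.0))  # True
```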

11
Q

List the common estimates of reliability.

A
  1. Test-retest reliability
  2. Parallel and Alternate Forms Reliability
  3. Internal consistency reliability (split-half, inter item correlation, Cronbach’s alpha)
  4. Inter-rater/ inter-scorer reliability
12
Q

Describe Test-retest reliability

A

Estimate of reliability over time/the consistency of a test over time
How? Correlate pairs of scores from the same people, on the same test, at different time points
Good for? Stable variables e.g. personality
Bad? Estimates tend to decrease as time passes
Not good for fluctuating variables e.g. mood

13
Q

Describe Parallel and Alternate Forms Reliability

A

If the MEANS and VARIANCES are equal in both versions of a test = PARALLEL
If not = ALTERNATE
How? Correlate the scores of the same people measured by the different forms
E.g. Does cognitive function improve over time? Use the Montreal Cognitive Assessment (MoCA): two different versions, so the patient can’t use answers from the first version to help them in the second

14
Q

Describe split-half (internal consistency)

A

How? Correlate equivalent halves of the one test with each other, then generalise the half-test reliability to the full-test internal consistency reliability using the Spearman-Brown formula. By changing the ‘n’ (length) of your final test, you can change the reliability of your test.

S-B predicted reliability = (n × half-test correlation) / (1 + (n − 1) × half-test correlation)
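The formula as a one-line function (example correlation is hypothetical):

```python
def spearman_brown(r, n):
    """Predicted reliability when a test is lengthened by factor n.
    r: correlation between the two halves (or current reliability);
    n: factor by which test length changes (n = 2 for half -> full test)."""
    return (n * r) / (1 + (n - 1) * r)

# A half-test correlation of .60 projects to a full-test reliability of .75:
print(round(spearman_brown(0.60, 2), 2))  # 0.75
```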

15
Q

Describe inter-item consistency/ correlation (internal consistency)

A

The degree of relatedness of items on a test. HOMOGENEITY. Basically you take the average of the inter-item correlations.

16
Q

Describe Kuder-Richardson Formula 20 (internal consistency)

A

Statistic of choice for determining the inter-item consistency of dichotomous (binary) items, e.g. yes/no.
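A minimal sketch of KR-20 on made-up yes/no data (population variance used throughout; real implementations may differ on that choice):

```python
def kr20(items):
    """KR-20 for dichotomous (0/1) items.
    items: list of item-score lists, one inner list per item (same people)."""
    k = len(items)
    n = len(items[0])
    # p = proportion passing each item, q = 1 - p
    pq = [(sum(item) / n) * (1 - sum(item) / n) for item in items]
    # Total score per person, and its variance
    totals = [sum(item[i] for item in items) for i in range(n)]
    mean_t = sum(totals) / n
    var_t = sum((t - mean_t) ** 2 for t in totals) / n
    return (k / (k - 1)) * (1 - sum(pq) / var_t)

# Three yes/no items, five people (hypothetical data):
scores = [[1, 1, 0, 1, 0],
          [1, 0, 0, 1, 0],
          [1, 1, 0, 1, 1]]
print(round(kr20(scores), 2))  # 0.79
```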

17
Q

Describe coefficient/ Cronbach’s Alpha (internal consistency).

A

Conceptually, you get the mean of all possible split-half correlations, corrected by the S-B formula. A very popular approach for internal consistency. Values range from 0 to 1.
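In practice alpha is computed from item and total-score variances; a sketch on made-up Likert data (population variance used throughout):

```python
def cronbach_alpha(items):
    """Coefficient alpha from item scores.
    items: list of item-score lists, one inner list per item (same people)."""
    k = len(items)
    n = len(items[0])

    def var(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    # Sum of item variances vs variance of the total scores
    item_vars = sum(var(item) for item in items)
    totals = [sum(item[i] for item in items) for i in range(n)]
    return (k / (k - 1)) * (1 - item_vars / var(totals))

# Three Likert-style items, four respondents (hypothetical data):
data = [[3, 4, 2, 5],
        [2, 4, 3, 5],
        [3, 5, 2, 4]]
print(round(cronbach_alpha(data), 2))  # 0.89
```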

18
Q

Describe Inter-rater/ inter-scorer reliability

A

Degree of agreement/consistency between two or more scorers, obtained by correlating scores from different raters. Often used in behavioural measures. Aims to guard against biases or idiosyncrasies in scoring. Use the INTRACLASS correlation for CONTINUOUS measures (it allows you to adjust for systematic differences between raters) and COHEN’S KAPPA for CATEGORICAL measures.
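Cohen’s kappa corrects raw agreement for chance agreement; a sketch on hypothetical ratings:

```python
def cohens_kappa(rater1, rater2):
    """Cohen's kappa for two raters' categorical judgements."""
    n = len(rater1)
    categories = set(rater1) | set(rater2)
    # Observed proportion of agreement
    p_observed = sum(a == b for a, b in zip(rater1, rater2)) / n
    # Chance agreement: product of each rater's marginal proportions
    p_chance = sum(
        (rater1.count(c) / n) * (rater2.count(c) / n) for c in categories
    )
    return (p_observed - p_chance) / (1 - p_chance)

# Two raters coding the same six observations (hypothetical data):
r1 = ["warm", "warm", "cold", "warm", "cold", "warm"]
r2 = ["warm", "cold", "cold", "warm", "cold", "warm"]
print(round(cohens_kappa(r1, r2), 2))  # 0.67
```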

19
Q

What should you consider when choosing reliability estimates?

A
  1. Homogeneous/heterogeneous
  2. Dynamic/static (how will it change over time)
  3. Restricted/not (restriction of range, or not enough restriction, affects your correlation)
  4. Speed/power test (speed tests are likely to be homogeneous; power tests are likely to be heterogeneous)
  5. Criterion-/not criterion-referenced
20
Q

Why might we want to consider reliability of a single test score?

A

For example, in the clinical setting we want to know how much to trust one person’s score on our test. We can use our reliability coefficient and generalise it to a single score.

21
Q

How do you use reliability of tests to get precision?

A
  1. Standard Error of Measurement (SEM)
    - estimates how close a single observed score is to the true score, i.e. its precision/amount of error
    - generally, the higher the reliability, the lower the SEM
    - estimates the extent of deviation between observed and true score
    - SEM CHANGES BASED ON THE SD AND RELIABILITY OF THE TEST
  2. Standard Error of the Difference (SED)
    - estimates whether the difference between 2 test scores is statistically significant
    - MUST use standardised variables, i.e. compare the apples with the apples, or convert the oranges to apples
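Both quantities follow directly from the SD and reliability; a sketch using the standard formulas (IQ-style numbers are illustrative):

```python
import math

def sem(sd, reliability):
    """Standard error of measurement: SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1 - reliability)

def sed(sem1, sem2):
    """Standard error of the difference between two standardised scores."""
    return math.sqrt(sem1 ** 2 + sem2 ** 2)

# IQ-style scale: SD = 15, reliability = .91 -> SEM = 4.5
s = sem(15, 0.91)
print(round(s, 2))  # 4.5
# 95% confidence band around an observed score of 100: 100 +/- 1.96 * SEM
print(round(100 - 1.96 * s, 1), round(100 + 1.96 * s, 1))
```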
22
Q

Explain the difference between norm-referenced and criterion-referenced tests?

A

Norm-referenced compares a single person’s test score to a normative sample. e.g. IQ test. Criterion-referenced tests compare a single person’s test score to a pre-determined standard criterion/ threshold. e.g. passing first aid course/ driving test

23
Q

What does Standardisation in sampling to develop norms mean?

A

It means the process of administering a test to a representative sample in order to establish norms

24
Q

What does SAMPLING in sampling to develop norms mean?

A

the selection of an intended population for the test, that has at least one observable characteristic

25
Q

What does STRATIFIED SAMPLING in sampling to develop norms mean? How is it different to STRATIFIED RANDOM SAMPLING?

A

Stratified sampling purposefully includes representation of subgroups in a population. Stratified random sampling forcibly samples in a way that makes the demographics equal, e.g. equal numbers from metropolitan and rural areas, compared to the naturally higher metro and naturally lower rural numbers.

26
Q

What are the two less robust types of sampling, but also more common?

A
Purposive sampling (arbitrarily selecting a sample believed to be representative of the population)
Incidental/convenience sampling (a convenient sample which may or may not be representative of the population) > need to be cautious when generalising these findings
27
Q

What is the process of developing norms?

A
  1. Administer the test with standard set of instructions
  2. Recommend a setting for testing
  3. Collect and analyse data
  4. Summarise data using descriptive statistics including MEASURE OF CENTRAL TENDENCY and VARIABILITY
  5. Provide a detailed description of the standardisation and administration protocol, so others who read your paper can replicate it in the same context
28
Q

What are the different types of norms? Describe them.

A
  1. Percentiles NB: at the extremes, a one-point percentile difference (e.g. ATAR 97 vs ATAR 98) reflects a large underlying difference, whereas in the middle small underlying differences produce exaggerated percentile differences
  2. Age norms
  3. Grade norms
  4. National norms
  5. National anchor norms
  6. Subgroup norms
  7. Local norms
29
Q

What is a standard score?

A

A raw score converted from one scale to another that has a predefined scale, i.e. a set mean and SD
e.g.
z-score (M = 0, SD = 1)
t-score (M = 50, SD = 10)
You can calculate difference scores depending on the set mean + SD; it all corresponds. The scale just needs to be pre-determined, and you can compare as long as everyone is on the same scale.
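The conversion is just two linear steps (raw score and scale values are illustrative):

```python
def to_z(raw, mean, sd):
    """Convert a raw score to a z-score (M = 0, SD = 1)."""
    return (raw - mean) / sd

def z_to_scale(z, new_mean, new_sd):
    """Re-express a z-score on any predefined scale."""
    return new_mean + z * new_sd

# Raw score of 130 on an IQ-style scale (M = 100, SD = 15):
z = to_z(130, 100, 15)
print(z)                      # 2.0
print(z_to_scale(z, 50, 10))  # 70.0 (t-score)
```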

30
Q

Why is it important for test users to research all available norms before prescribing to a patient?

A

It is important to conduct culturally informed assessment. E.g. there is no point comparing someone with low education against university-level norms -> it affects the test results and misleads you.