WK 2 Norms and Reliability Flashcards

Norms and Reliability

1
Q

Describe the main premise of classical test theory. How does it relate to reliability?

A

CTT says that every person’s observed score is made up of a true score (of the trait) plus error. For a population, the total variance is the true variance plus the error variance. Reliability is the proportion of total variance that is true variance. That is, reliability is directly influenced by true variance, but note that we can only ever estimate the true variance.
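A quick simulation sketch of the CTT identity (hypothetical numbers; true scores are only knowable in a simulation like this):

```python
import random

random.seed(0)
# Simulate true scores (latent trait) and random error for 10,000 people
true_scores = [random.gauss(100, 15) for _ in range(10_000)]
errors = [random.gauss(0, 5) for _ in range(10_000)]
# Observed score = true score + error
observed = [t + e for t, e in zip(true_scores, errors)]

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

# Reliability = true variance / total variance
reliability = variance(true_scores) / variance(observed)
print(round(reliability, 2))  # close to 15**2 / (15**2 + 5**2) = 0.90
```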

2
Q

Describe measurement error.

A

Measurement error is also known as error variance. It is made up of both systematic error (predictable and constant) and random error (unpredictable, unrelated noise). Random error is less of a problem because it tends to cancel out across observations, leaving the mean roughly unchanged. Systematic error is more serious, but if you know what could be causing it, you can adjust your numbers accordingly.

3
Q

List common sources of measurement error

A
  1. Test Construction
  2. Test Administration
  3. Test Scoring and Interpretation
  4. Sampling Error
  5. Methodological Errors
4
Q

Describe Test Construction error

A

Variation due to differences in items on the same test or between tests

5
Q

Describe Test Administration error

A

Variation due to testing environment
(test-taker: anxiety, stress, drugs, sleep, physical discomfort)
(Examiner: appearance, demeanour)

6
Q

Describe Test Scoring and Interpretation error

A

Variation due to scoring and interpretation, e.g. scoring a video of a mother’s warmth behaviours towards an aggressive child

7
Q

Describe Sampling Error

A

Variation due to the representativeness of the sample, e.g. gathering a sample that doesn’t represent the population, such as one containing only educated people

8
Q

Describe Methodological Errors

A

Variation due to poor training, unstandardised administration, unclear questions, biased questions

9
Q

What is the difference between CTT and IRT?

A

CTT assumes just two components of measurement (true score and error) and that all items have equal ability to measure the target in question.
IRT is very powerful for understanding how well an individual item taps a latent trait: it examines items specifically and can reveal different levels of the latent trait being examined.

10
Q

IRT incorporates considerations of item difficulty and discrimination. Can you describe what they mean in the context of IRT?

A

Difficulty relates to how hard an item is to complete, solve or comprehend.
Discrimination refers to the degree to which an item differentiates between high and low levels of the construct; e.g. if the discrimination slope is steep, the item is good at discriminating between different levels.
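The two parameters can be sketched with a standard two-parameter logistic (2PL) item response function (illustrative values only):

```python
import math

def p_correct(theta, a, b):
    """2PL item response function: probability of solving/endorsing an item.
    theta: person's latent trait level
    a: discrimination (slope), b: difficulty (location)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# At theta == b (trait level matches difficulty), probability is exactly 0.5:
print(p_correct(0.0, a=1.5, b=0.0))  # 0.5
# A steeper slope (higher a) separates nearby trait levels more sharply:
print(p_correct(0.5, a=2.5, b=0.0) > p_correct(0.5, a=0.5, b=0.0))  # True
```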

11
Q

List the common estimates of reliability.

A
  1. Test-retest reliability
  2. Parallel and Alternate Forms Reliability
  3. Internal consistency reliability (split-half, inter item correlation, Cronbach’s alpha)
  4. Inter-rater/ inter-scorer reliability
12
Q

Describe Test-retest reliability

A

Estimate of reliability over time/the consistency of a test over time
How? Correlate pairs of scores from the same people, on the same test, at different time points
Good for? Stable variables e.g. personality
Bad? Estimates tend to decrease as time passes
Not good for fluctuating variables e.g. mood

13
Q

Describe Parallel and Alternate Forms Reliability

A

If the MEANS and VARIANCES are equal in both versions of a test = PARALLEL
If not = ALTERNATE
How? Correlate the scores of the same people measured by the different forms
E.g. Does cognitive function improve over time? Use the Montreal Cognitive Assessment (MoCA): two different versions, so the patient can’t use answers from the first version to help them in the second

14
Q

Describe split-half (internal consistency)

A

How? Correlate equivalent halves of the one test with each other, then generalise the half-test reliability to the full-test internal consistency reliability using the Spearman-Brown formula. By changing the ‘n’ (length) of your final test, you can change the reliability of your test.

S-B predicted reliability = (n × half-test correlation) / (1 + (n − 1) × half-test correlation)
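The formula as a one-line function (example correlation is hypothetical):

```python
def spearman_brown(r, n):
    """Predicted reliability when a test is lengthened by factor n.
    r: correlation between the two halves (or current reliability);
    n: factor by which test length changes (n = 2 for half -> full test)."""
    return (n * r) / (1 + (n - 1) * r)

# A half-test correlation of .60 projects to a full-test reliability of .75:
print(round(spearman_brown(0.60, 2), 2))  # 0.75
```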

15
Q

Describe inter-item consistency/ correlation (internal consistency)

A

The degree of relatedness of items on a test. HOMOGENEITY. Basically you take the average of the inter-item correlations.

16
Q

Describe Kuder-Richardson Formula 20 (internal consistency)

A

Statistic of choice for determining the inter-item consistency of dichotomous (binary) items, e.g. yes/no.
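A minimal sketch of KR-20 on made-up yes/no data (population variance used throughout; real implementations may differ on that choice):

```python
def kr20(items):
    """KR-20 for dichotomous (0/1) items.
    items: list of item-score lists, one inner list per item (same people)."""
    k = len(items)
    n = len(items[0])
    # p = proportion passing each item, q = 1 - p
    pq = [(sum(item) / n) * (1 - sum(item) / n) for item in items]
    # Total score per person, and its variance
    totals = [sum(item[i] for item in items) for i in range(n)]
    mean_t = sum(totals) / n
    var_t = sum((t - mean_t) ** 2 for t in totals) / n
    return (k / (k - 1)) * (1 - sum(pq) / var_t)

# Three yes/no items, five people (hypothetical data):
scores = [[1, 1, 0, 1, 0],
          [1, 0, 0, 1, 0],
          [1, 1, 0, 1, 1]]
print(round(kr20(scores), 2))  # 0.79
```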

17
Q

Describe coefficient/ Cronbach’s Alpha (internal consistency).

A

Conceptually, you get the mean of all possible split-half correlations, corrected by the S-B formula. A very popular approach for internal consistency. Values range from 0 to 1.
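In practice alpha is computed from item and total-score variances; a sketch on made-up Likert data (population variance used throughout):

```python
def cronbach_alpha(items):
    """Coefficient alpha from item scores.
    items: list of item-score lists, one inner list per item (same people)."""
    k = len(items)
    n = len(items[0])

    def var(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    # Sum of item variances vs variance of the total scores
    item_vars = sum(var(item) for item in items)
    totals = [sum(item[i] for item in items) for i in range(n)]
    return (k / (k - 1)) * (1 - item_vars / var(totals))

# Three Likert-style items, four respondents (hypothetical data):
data = [[3, 4, 2, 5],
        [2, 4, 3, 5],
        [3, 5, 2, 4]]
print(round(cronbach_alpha(data), 2))  # 0.89
```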

18
Q

Describe Inter-rater/ inter-scorer reliability

A

Degree of agreement/consistency between two or more scorers, obtained by correlating scores from different raters. Often used in behavioural measures. Aims to guard against biases or idiosyncrasies in scoring. Use the INTRACLASS correlation for CONTINUOUS measures (it allows you to adjust for systematic differences between raters) and COHEN’S KAPPA for CATEGORICAL measures.
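Cohen’s kappa corrects raw agreement for chance agreement; a sketch on hypothetical ratings:

```python
def cohens_kappa(rater1, rater2):
    """Cohen's kappa for two raters' categorical judgements."""
    n = len(rater1)
    categories = set(rater1) | set(rater2)
    # Observed proportion of agreement
    p_observed = sum(a == b for a, b in zip(rater1, rater2)) / n
    # Chance agreement: product of each rater's marginal proportions
    p_chance = sum(
        (rater1.count(c) / n) * (rater2.count(c) / n) for c in categories
    )
    return (p_observed - p_chance) / (1 - p_chance)

# Two raters coding the same six observations (hypothetical data):
r1 = ["warm", "warm", "cold", "warm", "cold", "warm"]
r2 = ["warm", "cold", "cold", "warm", "cold", "warm"]
print(round(cohens_kappa(r1, r2), 2))  # 0.67
```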

19
Q

What should you consider when choosing reliability estimates?

A
  1. Homogeneous/heterogeneous
  2. Dynamic/static (how will it change over time)
  3. Restricted/not (restriction of range, or not enough restriction, affects your correlation)
  4. Speed/power test (speed tests are likely to be homogeneous; power tests are likely to be heterogeneous)
  5. Criterion-/not criterion-referenced
20
Q

Why might we want to consider reliability of a single test score?

A

For example, in the clinical setting we want to know how much to trust one person’s score on our test. We can use our reliability coefficient and generalise it to a single score.

21
Q

How do you use reliability of tests to get precision?

A
  1. Standard Error of Measurement (SEM)
    - estimates how close a single observed score is to the true score, i.e. its precision/amount of error
    - generally, the higher the reliability, the lower the SEM
    - estimates the extent of deviation between observed and true score
    - SEM CHANGES BASED ON THE SD AND RELIABILITY OF THE TEST
  2. Standard Error of the Difference (SED)
    - estimates whether the difference between 2 test scores is statistically significant
    - MUST use standardised variables, i.e. compare the apples with the apples, or convert the oranges to apples
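Both quantities follow directly from the SD and reliability; a sketch using the standard formulas (IQ-style numbers are illustrative):

```python
import math

def sem(sd, reliability):
    """Standard error of measurement: SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1 - reliability)

def sed(sem1, sem2):
    """Standard error of the difference between two standardised scores."""
    return math.sqrt(sem1 ** 2 + sem2 ** 2)

# IQ-style scale: SD = 15, reliability = .91 -> SEM = 4.5
s = sem(15, 0.91)
print(round(s, 2))  # 4.5
# 95% confidence band around an observed score of 100: 100 +/- 1.96 * SEM
print(round(100 - 1.96 * s, 1), round(100 + 1.96 * s, 1))
```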
22
Q

Explain the difference between norm-referenced and criterion-referenced tests?

A

Norm-referenced compares a single person’s test score to a normative sample. e.g. IQ test. Criterion-referenced tests compare a single person’s test score to a pre-determined standard criterion/ threshold. e.g. passing first aid course/ driving test

23
Q

What does Standardisation in sampling to develop norms mean?

A

It means the process of administering a test to a representative sample in order to establish norms

24
Q

What does SAMPLING in sampling to develop norms mean?

A

the selection of an intended population for the test, that has at least one observable characteristic

25
Q

What does STRATIFIED SAMPLING in sampling to develop norms mean? How is it different to STRATIFIED RANDOM SAMPLING?

A

Stratified sampling purposefully includes representation of subgroups in a population. Stratified random sampling forcibly samples in a way that makes the demographics equal, e.g. equal numbers from metropolitan and rural areas, compared to the naturally higher metro and naturally lower rural numbers.

26
Q

What are the two less robust types of sampling, but also more common?

A
Purposive sampling (arbitrarily selecting a sample believed to be representative of the population)
Incidental/convenience sampling (a convenient sample which may or may not be representative of the population) > need to be cautious when generalising these findings
27
Q

What is the process of developing norms?

A
  1. Administer the test with standard set of instructions
  2. Recommend a setting for testing
  3. Collect and analyse data
  4. Summarise data using descriptive statistics including MEASURE OF CENTRAL TENDENCY and VARIABILITY
  5. Provide a detailed description of the standardisation and administration protocol, so others who read your paper can replicate it in the same context
28
Q

What are the different types of norms? Describe them.

A
  1. Percentiles NB: at the extremes, a one-point percentile difference (e.g. ATAR 97 vs ATAR 98) reflects a large underlying difference, whereas in the middle small underlying differences produce exaggerated percentile differences
  2. Age norms
  3. Grade norms
  4. National norms
  5. National anchor norms
  6. Subgroup norms
  7. Local norms
29
Q

What is a standard score?

A

A raw score converted from one scale to another that has a predefined scale, i.e. a set mean and SD
e.g.
z-score (M = 0, SD = 1)
t-score (M = 50, SD = 10)
You can calculate difference scores depending on the set mean + SD; it all corresponds. The scale just needs to be pre-determined, and you can compare as long as everyone is on the same scale.
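The conversion is just two linear steps (raw score and scale values are illustrative):

```python
def to_z(raw, mean, sd):
    """Convert a raw score to a z-score (M = 0, SD = 1)."""
    return (raw - mean) / sd

def z_to_scale(z, new_mean, new_sd):
    """Re-express a z-score on any predefined scale."""
    return new_mean + z * new_sd

# Raw score of 130 on an IQ-style scale (M = 100, SD = 15):
z = to_z(130, 100, 15)
print(z)                      # 2.0
print(z_to_scale(z, 50, 10))  # 70.0 (t-score)
```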

30
Q

Why is it important for test users to research all available norms before prescribing to a patient?

A

It is important to conduct culturally informed assessment. E.g. there is no point comparing someone with low education against university-level norms -> it affects the test results and misleads you.