Module 2: Norms and Reliability Flashcards

1
Q

What is Classical Test Theory (CTT)?

A

CTT is a model for understanding measurement.
CTT is based on the True Score Model…

… for each person, the observed score on a test is composed of a true score plus error:

  Observed score (X) = True score (T) + Error (E)

2
Q

What is a true score?

A

A true score is a person's actual ability or trait level (i.e., what would be measured without error).

3
Q

What is error?

A

Error is the component of an observed score that is unrelated to the test-taker's true ability or the trait being measured.

True variance and error variance thus refer to the portions of the variability in a collection/population of test scores attributable to true differences and to error, respectively.

4
Q

What is reliability?

A

Reliability refers to consistency in measurement.

- According to CTT: reliability is the proportion of the total variance attributable to true variance
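
In symbols (a standard CTT formulation, with σ² denoting variance):

```latex
r_{XX} = \frac{\sigma^2_T}{\sigma^2_X} = \frac{\sigma^2_T}{\sigma^2_T + \sigma^2_E}
```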

5
Q

What is test administration error?

A

Test administration: variation due to the testing environment

  • Test-taker variables (e.g., arousal, stress, physical discomfort, lack of sleep, drugs, medication)
  • Examiner variables (e.g., physical appearance, demeanour)

6
Q

What is test scoring and interpretation error?

A

Test scoring and interpretation:

Variation due to differences in scoring and interpretation

7
Q

What are methodological errors?

A

Variation due to poor training, unstandardized administration, unclear questions, or biased questions.

8
Q

CTT True Score Model vs. Alternatives

A
  • The True Score Model of measurement (based on CTT) is simple, intuitive, and thus widely used
  • Another widely used model of measurement is Item Response Theory (IRT)
  • CTT's assumptions are more readily met than IRT's, and CTT assumes only two components to measurement (true score and error)
  • However, CTT assumes all items on a test have an equal ability to measure the underlying construct of interest.

9
Q

Item Response Theory (IRT)

A
  • IRT provides a way to model the probability that a person with a given ability level will correctly answer an item that is 'tuned' to that ability level.

10
Q

What does IRT incorporate and consider?

A
  • IRT incorporates considerations of item difficulty and discrimination (see the sketch below):
    o Difficulty relates to how readily an item can be accomplished, solved, or comprehended.
    o Discrimination refers to the degree to which an item differentiates among people with higher or lower levels of the trait, ability, or construct being measured.
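
A minimal sketch of how these two parameters enter an IRT model, using the common two-parameter logistic (2PL) form; the item and ability values are made up for illustration:

```python
import math

def irt_2pl(theta, a, b):
    """Probability that a person with ability theta answers correctly
    an item with discrimination a and difficulty b (2PL model)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# A hard, highly discriminating item: difficulty b=1.5, discrimination a=2.0
for theta in (-1.0, 0.0, 1.5, 3.0):
    print(f"ability {theta:+.1f} -> P(correct) = {irt_2pl(theta, a=2.0, b=1.5):.2f}")
```
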
11
Q

Reliability estimates

A

Because a person’s true score is unknown, we use different mathematical methods to estimate the reliability of tests.

Common examples include:

  • Test-retest reliability
  • Parallel and alternate forms reliability
  • Internal consistency reliability
    o E.g., split-half, inter-item correlation, Cronbach's alpha
  • Interrater/interscorer reliability

12
Q

Test-retest reliability

A

Test-retest reliability is an estimate of reliability over time.

  • Obtained by correlating pairs of scores from the same people on administrations of the same test at different times
  • Appropriate for stable variables (e.g., personality)
  • Estimates tend to decrease as time passes
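
A minimal sketch of a test-retest estimate as a Pearson correlation, using scipy and made-up scores:

```python
from scipy.stats import pearsonr

# Hypothetical scores for five people tested twice, two weeks apart
time1 = [12, 18, 25, 31, 40]
time2 = [14, 17, 27, 30, 38]

r, p = pearsonr(time1, time2)
print(f"test-retest reliability r = {r:.2f}")
```
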
13
Q

Parallel and Alternate Forms Reliability

A
  • Parallel forms: two versions of a test are parallel if, in both versions, the means and variances of test scores are equal
  • Alternate forms: two forms of a test are created, but they do not meet the strict requirements of parallel forms
  • Obtained by correlating the scores of the same people measured with the two different forms.

14
Q

Split half reliability

A

Obtained by correlating the pairs of scores from two equivalent halves of a single test administered once.

Entails three steps:

  • Step 1: Divide the test into two halves
  • Step 2: Correlate scores on the two halves of the test.
  • Step 3: Generalise the half-test reliability to the full-test reliability using the Spearman-Brown formula (see the sketch below).
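
A minimal sketch of the three steps with made-up data; the Spearman-Brown correction for doubling test length is r = 2·r_hh / (1 + r_hh):

```python
from scipy.stats import pearsonr

# Step 1 (hypothetical): odd-item and even-item half scores for six people
odd_half = [10, 13, 15, 18, 22, 25]
even_half = [11, 12, 16, 17, 23, 24]

r_hh, _ = pearsonr(odd_half, even_half)  # Step 2: correlate the two halves
r_full = (2 * r_hh) / (1 + r_hh)         # Step 3: Spearman-Brown correction
print(f"half-test r = {r_hh:.2f}, full-test estimate = {r_full:.2f}")
```
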
15
Q

Inter-item correlation

A

The degree of relatedness of items on a test; used to gauge the homogeneity of a test.

16
Q

Kuder-Richardson formula 20

A

The statistic of choice for determining the inter-item consistency of dichotomous items (i.e., items scored right/wrong).
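
The standard KR-20 formula, where k is the number of items, p_j the proportion of test-takers answering item j correctly, q_j = 1 − p_j, and σ²_X the variance of total test scores:

```latex
KR_{20} = \frac{k}{k-1}\left(1 - \frac{\sum_{j=1}^{k} p_j q_j}{\sigma^2_X}\right)
```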

17
Q

Coefficient alpha

A

The mean of all possible split-half correlations, corrected by the Spearman-Brown formula. The most popular approach for estimating internal consistency. Values range from 0 to 1.
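
A minimal sketch of computing coefficient alpha directly from a people-by-items score matrix, using the standard variance-based formula α = k/(k−1) · (1 − Σσ²_item / σ²_total) and made-up ratings:

```python
import numpy as np

def cronbach_alpha(items):
    """Coefficient alpha for a (people x items) score matrix."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)      # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)  # variance of total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical responses: five people x four items
scores = [[3, 4, 3, 4],
          [2, 2, 3, 2],
          [4, 5, 5, 4],
          [1, 2, 1, 2],
          [3, 3, 4, 3]]
print(f"alpha = {cronbach_alpha(scores):.2f}")
```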

18
Q

Interrater/InterScorer Reliability

A

The degree of agreement/consistency between two or more scorers (or judges or raters).

  • Often used with behavioural measures
  • Guards against biases or idiosyncrasies in scoring
  • Obtained by correlating scores from different raters:
    o Use intraclass correlation for continuous measures
    o Use Cohen's Kappa for categorical measures (see the sketch below)
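
A minimal sketch of an interrater check for categorical ratings using scikit-learn's Cohen's kappa, with made-up codes from two raters:

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical categorical codes assigned by two raters to eight behaviours
rater_a = ["on-task", "off-task", "on-task", "on-task",
           "off-task", "on-task", "off-task", "on-task"]
rater_b = ["on-task", "off-task", "on-task", "off-task",
           "off-task", "on-task", "off-task", "on-task"]

kappa = cohen_kappa_score(rater_a, rater_b)
print(f"Cohen's kappa = {kappa:.2f}")
```
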
19
Q

Choosing Reliability Estimates

A

The nature of the test will often determine the appropriate reliability estimate, e.g.:

  • Are the test items homogeneous or heterogeneous in nature?
  • Is the characteristic, ability, or trait being measured presumed to be dynamic or static?
  • Is the range of test scores restricted or unrestricted?
  • Is the test a speed test (how many items can you complete in a set time) or a power test (items of increasing difficulty)?
  • Is the test criterion-referenced (to pass, you must reach a threshold) or not?

Otherwise, select whichever estimate you judge appropriate.

20
Q

How do we account for reliability in a single score?

A
  • Our reliability coefficient tells us about error in our test in general
  • We can use this reliability estimate to understand how confident we can be in a single observed score for one person.
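
The usual quantity for this (the standard error of measurement, which also underlies the SED on the next card) is computed from the test's standard deviation s and its reliability r_xx:

```latex
SEM = s\sqrt{1 - r_{xx}}
```
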
21
Q

Standard Error of the Difference (SED)

A

The SED is a measure of how large a difference between two test scores must be before it can be considered 'statistically significant'.

Helps with three questions (note: tests 1 and 2 must be on the same scale):

  1. How did Person A's performance on test 1 compare with their own performance on test 2?
  2. How did Person A's performance on test 1 compare with Person B's performance on test 1?
  3. How did Person A's performance on test 1 compare with Person B's performance on test 2?
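
The standard formula combines the standard errors of measurement of the two tests:

```latex
SED = \sqrt{SEM_1^2 + SEM_2^2}
```
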
22
Q

Standardization

A

is the process of administering tests to representative samples to establish norms.

23
Q

Sampling

A

the process of selecting a portion of the test's intended population (a group sharing at least one common, observable characteristic) deemed representative of the whole.

24
Q

Stratified-random sampling

A

is a sampling design in which the population is divided into subgroups (strata) and members are randomly sampled from each stratum, helping to ensure the sample represents the population.

25
Q

Purposive sample

A

is arbitrarily selecting a sample believed to be representative of the population.

26
Q

Incidental/convenience sample

A

A sample that is convenient or available for use; it may not be representative of the population.

  • Generalisation of findings from convenience samples must be made with caution.

27
Q

Process of developing norms:

A

Having obtained the normative sample:

  1. Administer the test with a standard set of instructions
  2. Recommend a setting for test administration
  3. Collect and analyse data
  4. Summarize data using descriptive statistics including measures of central tendency and variability
  5. Provide a detailed description of the standardization and administration protocol
28
Q

Types of Norms

A

Percentiles: the percentage of people in the normative sample whose score was below a particular raw score (see the sketch at the end of this card).

  • Percentiles are popular because they are easily calculated and interpreted.
  • Problem: real differences between raw scores may be minimized near the ends of the distribution and exaggerated in the middle of the distribution.

Age norms: average performance of normative sample segmented by age.

Grade norms: average performance of normative sample segmented by grade.

Subgroup: a normative sample can be segmented by any criteria initially used in selecting sample.

National norms: derived from a normative sample that was nationally representative of the population.

National anchor norms: equivalency table for scores on two different tests. Allows common comparison.

Local norms: provide normative information with respect to the local population's performance on some test.
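
For percentile norms, a minimal sketch of converting a raw score to a percentile against a normative sample, using scipy and made-up scores (kind="strict" counts only scores strictly below):

```python
from scipy.stats import percentileofscore

# Hypothetical normative-sample raw scores
norm_scores = [10, 12, 15, 15, 18, 20, 22, 25, 28, 30]

# Percentage of the normative sample scoring below a raw score of 22
pct = percentileofscore(norm_scores, 22, kind="strict")
print(f"raw score 22 -> {pct:.0f}th percentile")
```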

29
Q

The normal curve

A

The normal curve is a bell-shaped, smooth, mathematically defined curve that is highest at its centre and perfectly symmetrical, tapering toward both tails.

30
Q

Standard Scores

A

Standard score: a raw score converted from its original scale to another scale with a predefined mean and standard deviation.

31
Q

Z-score

A

Z-Score: conversion of a raw score into a number indicating how many standard deviation units the raw score is below or above the mean
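
The standard conversion, using the group mean X̄ and standard deviation s:

```latex
z = \frac{X - \bar{X}}{s}
```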

32
Q

T-scores

A

T-Scores: aka ‘fifty plus or minus ten scale’ – scale has set mean = 50 and standard deviation = 10
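
A T-score is a linear transformation of the z-score onto this scale, so a z of +1 corresponds to a T of 60:

```latex
T = 10z + 50
```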

33
Q

Culture and Inference

A
  • In selecting a test for use, responsible test users should research all available norms to check whether the norms are appropriate for use with their test-taker
  • When interpreting test results, it helps to know about the culture and era of the test-taker
  • It is important to conduct culturally informed assessment.