Module 2: Norms and Reliability Flashcards
Classical Test Theory (CTT):
is a model for understanding measurement.
CTT is based on the True Score Model. See Notes.
True score:
is a person’s actual true ability level (i.e., measured without error).
Error:
is the component of the observed score that is unrelated to the test-taker’s true ability or the trait being measured.
True variance and error variance:
thus refer to the portions of variability in a collection/population of test scores that are attributable to true scores and to error, respectively.
Reliability:
refers to consistency in measurement. See Notes.
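A minimal sketch (with made-up numbers) of the True Score Model, X = T + E, and of reliability as the ratio of true-score variance to observed-score variance:

```python
import numpy as np

rng = np.random.default_rng(0)

# True Score Model: observed score X = true score T + random error E.
# In practice T is unknown; here we simulate it to illustrate the definitions.
n_people = 1000
true_scores = rng.normal(loc=100, scale=15, size=n_people)   # T
error = rng.normal(loc=0, scale=5, size=n_people)            # E (random error)
observed = true_scores + error                                # X

# Reliability = true variance / total (observed) variance
reliability = true_scores.var() / observed.var()
print(round(reliability, 2))  # ~0.90 with these made-up parameters
```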
Systematic error:
affects everyone in the same way, e.g., being in a noisy classroom made everyone perform 10 points worse.
Random error:
is unrelated to the person’s true score or to the testing environment. There is nothing you can do about random error; you just need to be aware it is there. Reliability estimates are concerned with random error, because a systematic error that affects everyone treats everyone the same.
Sources of Measurement Error:
- Test Construction: Variation due to differences in items on the same test or between tests (i.e., item/content sampling).
- Test Administration: Variation due to testing environment.
• Test taker variables (e.g., arousal, stress, physical discomfort, lack of sleep, drugs, medication).
• Examiner variables (e.g., physical appearance, demeanour).
- Test scoring and interpretation: Variation due to differences in scoring and interpretation.
- Sampling error: Variation due to representativeness of sample.
• The larger the sample, the smaller the sampling error.
- Methodological errors: Variation due to poor training, unstandardized administration, unclear questions, biased questions.
Item Response Theory (IRT).
IRT provides a way to model the probability that a person with X ability level will correctly answer a question that is “tuned” to that ability level.
IRT incorporates considerations of item Difficulty and Discrimination.
o Difficulty: relates to an item not being easily accomplished, solved, or comprehended.
o Discrimination: refers to the degree to which an item differentiates among people with higher or lower levels of the trait, ability, or construct being measured.
You want items varying in degree of difficulty (e.g., speeding fine vs. heroin use) and items that discriminate between levels of the trait (a speeding fine reflects a low level of the trait; heroin use reflects a high level of the trait).
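A rough sketch of how an IRT item response function behaves, using a two-parameter logistic model; the “speeding fine” and “heroin use” items and all parameter values here are invented for illustration:

```python
import numpy as np

def p_endorse(theta, a, b):
    """2PL model: probability that a person with trait level theta endorses the item,
    where b = item difficulty and a = item discrimination."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

theta = np.linspace(-3, 3, 7)            # trait level (e.g., risk taking)
print(p_endorse(theta, a=1.5, b=-1.0))   # "speeding fine": easy item, endorsed even at low trait levels
print(p_endorse(theta, a=1.5, b=2.0))    # "heroin use": hard item, endorsed only at high trait levels
```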
CTT True Score Model vs. Alternatives
- The True Score Model of measurement (based on CTT) is simple, intuitive, and thus widely used.
- Another widely used model of measurement is Item Response Theory (IRT).
o CTT assumptions are more readily met than IRT’s, and CTT assumes only two components of measurement (true score and error).
o But CTT assumes all items on a test have equal ability to measure the underlying construct of interest.
o E.g., a test measures risk taking: “Have you ever had a speeding fine?” (risk taker) and “Have you ever used heroin?” (risk taker). A speeding fine is far more common than heroin use, yet under Classical Test Theory those two items would be treated as equivalent indicators of risk-taking behaviour, which in reality is not a fair assumption.
o Item Response Theory can address this by examining items individually and seeing how each item performs with respect to the construct being measured.
Reliability Estimates:
Because a person’s true score is unknown, we use different mathematical methods to estimate the reliability of tests.
Common examples include:
• Test-retest reliability
• Parallel and alternate forms reliability
• Internal consistency reliability
o E.g., split half, item correlation, Cronbach’s alpha
• Interrater/interscorer reliability
Test-retest reliability:
is an estimate of reliability over time.
- Obtained by correlating pairs of scores from the same people on administrations of the same test at different times (as sketched below).
- Appropriate for stable variables (e.g., personality, NOT mood); if the construct is meant to stay the same over, say, 1 week, then it is fine.
- Estimates tend to decrease as time passes.
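A small sketch of the test-retest calculation, with invented scores for the same people at two time points:

```python
import numpy as np

# Test-retest reliability: correlate scores from the same people at two times.
time1 = np.array([12, 18, 25, 30, 22, 15, 28, 20])  # made-up scores, time 1
time2 = np.array([14, 17, 24, 31, 20, 16, 27, 22])  # made-up scores, time 2 (same people)

r_test_retest = np.corrcoef(time1, time2)[0, 1]
print(round(r_test_retest, 2))
```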
Parallel forms:
Two versions of a test are parallel if, in both versions, the means and variances of test scores are equal. E.g., in neuropsychology we might want to test the same thing twice; we can’t use the exact same test because the participant might remember the answers. The forms have to be equivalent for this to work, otherwise you are testing separate things on separate tests.
• More strict
Alternate forms:
there is an attempt to create two forms of a test, but they do not meet the strict requirements of parallel forms.
• Obtained by correlating the scores of the same people measured with the different forms.
Internal consistency measures:
Split half reliability
Inter-item consistency/correlation
Kuder-Richardson formula 20
Coefficient alpha
Split half reliability:
obtained by correlating two pairs of scores obtained from equivalent halves of a single test administered once.
Entails 3 steps:
1. Divide the test into two halves.
2. Correlate scores on the two halves of the test.
3. Step up the half-test reliability to the full-test reliability using the Spearman-Brown formula (see the sketch below).
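A sketch of the three steps on a made-up item-score matrix, using an odd-even split:

```python
import numpy as np

# Made-up item scores: rows = people, columns = items
scores = np.array([
    [1, 1, 0, 1, 1, 0],
    [0, 1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0, 0],
    [1, 0, 1, 1, 1, 1],
])

# Step 1: divide the test into two halves (here, odd-numbered vs. even-numbered items).
half_a = scores[:, 0::2].sum(axis=1)
half_b = scores[:, 1::2].sum(axis=1)

# Step 2: correlate scores on the two halves.
r_half = np.corrcoef(half_a, half_b)[0, 1]

# Step 3: step up to the full-test reliability with the Spearman-Brown formula.
r_full = (2 * r_half) / (1 + r_half)
print(round(r_half, 2), round(r_full, 2))
```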
Inter-item consistency/correlation:
the degree of relatedness of items on a test. Used to gauge the homogeneity of a test.
Kuder-Richardson formula 20:
statistic of choice for determining the inter-item consistency of dichotomous items.
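A minimal sketch of the KR-20 calculation on made-up dichotomous (0/1) item scores:

```python
import numpy as np

# KR-20 = (k / (k - 1)) * (1 - sum(p*q) / var(total)), for dichotomous items.
items = np.array([
    [1, 1, 0, 1],
    [0, 1, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
    [1, 0, 1, 1],
])  # made-up data: rows = people, columns = items

k = items.shape[1]
p = items.mean(axis=0)                       # proportion passing each item
q = 1 - p                                    # proportion failing each item
total_var = items.sum(axis=1).var(ddof=1)    # variance of total test scores
kr20 = (k / (k - 1)) * (1 - (p * q).sum() / total_var)
print(round(kr20, 2))
```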
Coefficient alpha:
the mean of all possible split-half correlations, corrected by the Spearman-Brown formula. The most popular internal consistency estimate. Values range from 0 to 1.
Cronbach’s Alpha
- Cronbach’s Alpha is often incorrectly used.
- It is a lower-bound estimate of reliability.
- It is not a measure of unidimensionality.
- I.e., it is a function only of the number of items and the average inter-item correlation.
- Only use it if you think the items on your test are homogeneous. If your items are not similar (do not measure the same thing), then don’t use Cronbach’s alpha.
- You need factor analysis to examine the underlying correlation structure between items (a computational sketch of alpha follows below).
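A sketch of coefficient (Cronbach’s) alpha computed from a made-up item-score matrix; in practice you would first check that the items are homogeneous:

```python
import numpy as np

def cronbach_alpha(items):
    """Coefficient alpha: k/(k-1) * (1 - sum of item variances / variance of total scores)."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Made-up Likert-type responses: rows = people, columns = items
data = np.array([
    [4, 5, 4, 3],
    [2, 2, 3, 2],
    [5, 5, 4, 5],
    [3, 2, 2, 3],
    [4, 4, 5, 4],
])
print(round(cronbach_alpha(data), 2))
```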
Interrater/interscorer reliability:
is the degree of agreement/consistency between two or more scorers (or judges or raters).
• Often used with behavioural measures (e.g., the Strange Situation procedure).
• Guards against biases or idiosyncrasies in scoring.
• Obtained by correlating scores from different raters:
• Use intraclass correlation for continuous measures.
• Use Cohen’s kappa for categorical measures.
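A sketch of Cohen’s kappa for two raters’ categorical codes; the ratings below are invented:

```python
import numpy as np

# Cohen's kappa = (observed agreement - chance agreement) / (1 - chance agreement)
rater1 = np.array(["secure", "avoidant", "secure", "secure", "resistant", "avoidant"])
rater2 = np.array(["secure", "avoidant", "secure", "avoidant", "resistant", "avoidant"])

categories = np.unique(np.concatenate([rater1, rater2]))
p_observed = np.mean(rater1 == rater2)                                   # raw agreement
p_chance = sum(np.mean(rater1 == c) * np.mean(rater2 == c) for c in categories)
kappa = (p_observed - p_chance) / (1 - p_chance)
print(round(kappa, 2))
```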
Choosing Reliability Estimates:
We will never know the true score.
The nature of the test will often determine the reliability metric, e.g.:
• Are the test items homogeneous or heterogeneous in nature?
• Is the characteristic, ability, or trait being measured presumed to be dynamic or static?
• The range of test scores is or is not restricted.
• The test is a speed or power test.
• Speed: how fast you can do the questions.
• Power: increasing difficulty across the test. (A simple first-half vs. second-half split is not good for a power test, because the two halves differ in difficulty.)
• The test is or is not criterion-referenced.
• Criterion-referenced: in order to pass, you need to reach a certain threshold.
…Otherwise, you can select whatever you like.
The Standard Error of Measurement (SEM):
provides a measure of the precision of an observed test score, i.e., an estimate of the amount of error in an observed score.
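A common formula is SEM = SD × √(1 − reliability); a small sketch with made-up values:

```python
import numpy as np

# SEM: the standard deviation of the error around an observed score.
sd = 15               # test standard deviation (made-up, IQ-style scale)
reliability = 0.90    # made-up reliability estimate
sem = sd * np.sqrt(1 - reliability)

# A rough 95% confidence band around an observed score of 110:
observed = 110
print(round(sem, 2), (round(observed - 1.96 * sem, 1), round(observed + 1.96 * sem, 1)))
```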
The Standard Error of the Difference (SED):
is a measure of how large a difference between two test scores must be before it can be considered “statistically significant”. See Notes.
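The SED can be computed from the SEMs of the two scores; a small sketch with made-up reliabilities:

```python
import numpy as np

# SED = sqrt(SEM1**2 + SEM2**2): how large a difference between two scores needs to be
# before it is unlikely to reflect measurement error alone.
sd = 15
sem_test1 = sd * np.sqrt(1 - 0.90)   # made-up reliability for test/score 1
sem_test2 = sd * np.sqrt(1 - 0.85)   # made-up reliability for test/score 2
sed = np.sqrt(sem_test1**2 + sem_test2**2)

# A difference larger than about 1.96 * SED would be called "statistically significant" (p < .05).
print(round(sed, 2), round(1.96 * sed, 2))
```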