PA 2 Flashcards
What is CTT?
Classical test theory
What model is CTT based on?
The True Score Model
What are the basic principles of the True Score Model, for an individual and for a population?
Individual: Observed Score (X) = True Score (T) + Error (E)
Population: Total variance = true variance + error variance
Define “error” in CTT
Error is the component of the observed score that is unrelated to the test taker's true ability or the trait being measured
Define “reliability” in simple terms
Consistency in measurement
Define “reliability” in CTT
Reliability is the proportion of the total variance attributed to true variance
What is the basic formula for reliability in CTT?
Reliability = true variance / total variance, where total variance = true variance + error variance
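The formula can be sketched in a couple of lines of Python; the variance values here are made up purely for illustration.

```python
# Illustrative only: CTT reliability from hypothetical variance components.
true_variance = 80.0   # hypothetical variance of true scores
error_variance = 20.0  # hypothetical variance of measurement error

total_variance = true_variance + error_variance
reliability = true_variance / total_variance

print(reliability)  # 0.8 -> 80% of observed-score variance is true variance
```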
Define systematic v random error
Systematic Error: Source of error that is constant, proportionate, predictable.
Random Error: Source of error that is unpredictable, inconsistent, unrelated i.e. noise.
List and define 4 types of possible measurement error
Test Construction: Variation due to differences in items on same test or between tests (i.e. item/content sampling).
Test Administration: Variation due to testing environment:
• Testtaker variables (e.g., stress, discomfort, lack of sleep)
• Examiner variables (e.g., demeanor).
Test Scoring and Interpretation: Variation due to differences between scorers or in how scoring criteria are applied.
Sampling Error: representativeness of sample.
Methodological errors: poor training, unstandardized administration, unclear questions, biased questions.
What is IRT?
Item Response Theory
What is the core difference between CTT and IRT?
CTT assumes all items on a test have an equal ability to measure the underlying construct of interest.
IRT provides a way to model the probability that a person with a particular ability level will correctly answer a question that is “tuned” to that ability level.
Define “difficulty” and “discrimination” in IRT
Difficulty relates to an item not being easily accomplished, solved, or comprehended.
Discrimination refers to the degree to which an item differentiates among people with higher or lower levels of the trait, ability, or construct being measured.
List 4 main types of reliability
Test‐retest reliability
Parallel and Alternate forms reliability
Internal consistency reliability
Interrater/interscorer reliability
What is test-retest reliability and how is it obtained?
An estimate of reliability over time.
Obtained by correlating pairs of scores from the same people doing the same test at different times.
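In practice this is just a Pearson correlation between the two testing occasions. A minimal sketch, using made-up scores for five hypothetical people:

```python
import math

# Hypothetical scores for the same five people on the same test at two times.
time1 = [10, 12, 14, 16, 18]
time2 = [11, 13, 13, 17, 19]

def pearson_r(x, y):
    """Pearson correlation between two paired score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

r_test_retest = pearson_r(time1, time2)
print(round(r_test_retest, 3))  # 0.962 -> high stability over time
```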
Name situations where the test-retest is not recommended
- Unstable variables (e.g., mood vs. personality)
- Long intervals between tests (reliability tends to decrease)
Define and distinguish between Parallel and Alternate Forms Reliability methods
- Parallel forms: Two versions of a test in which the means and variances of the test scores are equal.
- Alternate forms: two similar forms of a test, but they do not meet the strict requirement of parallel forms.
- In both cases, reliability obtained by correlating the scores of the same people using the different forms.
Define Split Half Reliability
Split‐half reliability: a measure of internal consistency, obtained by correlating two pairs of scores obtained from equivalent halves of a single test administered once.
Describe the 3 basic steps of Split Half Reliability method
- Step 1. Divide the test into two halves.
- Step 2. Correlate scores on the two halves of the test.
- Step 3. Generalise the half‐test reliability to the full‐test reliability using the Spearman‐Brown formula.
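The three steps can be sketched as follows; the half-test scores for six hypothetical people are invented for illustration (Step 1, the odd/even split, is assumed to have been done already).

```python
import math

# Step 1 (done by hand here): odd- and even-item half scores, six people.
odd_half = [3, 4, 2, 5, 1, 4]
even_half = [2, 4, 3, 5, 2, 3]

# Step 2: correlate the two halves (Pearson r).
n = len(odd_half)
mo, me = sum(odd_half) / n, sum(even_half) / n
cov = sum((o - mo) * (e - me) for o, e in zip(odd_half, even_half))
so = math.sqrt(sum((o - mo) ** 2 for o in odd_half))
se = math.sqrt(sum((e - me) ** 2 for e in even_half))
r_half = cov / (so * se)

# Step 3: Spearman-Brown correction to estimate full-test reliability.
r_full = (2 * r_half) / (1 + r_half)

print(round(r_half, 2), round(r_full, 2))  # 0.79 0.89
```

Note that the corrected full-test estimate is higher than the half-test correlation, since longer tests are generally more reliable.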
Define the Spearman‐Brown (S‐B) formula
The S‐B formula allows one to estimate internal consistency reliability from the correlation between two halves of the one test, and to predict how reliability changes when the number of items is increased or decreased.
List and define 4 methods of estimating Internal Consistency
- Spearman‐Brown (S‐B) formula: correlation between two halves of the one test
- Inter‐item consistency/correlation: the degree of relatedness of items on a test; used to gauge the homogeneity of a test.
- Kuder‐Richardson formula 20: best choice for determining the inter‐item consistency of DICHOTOMOUS items.
- Coefficient (Cronbach’s) alpha: mean of all possible split‐half correlations, corrected by the Spearman‐Brown formula. The most popular approach for internal consistency. (Values range from 0 to 1)
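One common computational form of coefficient alpha is α = (k / (k − 1)) × (1 − Σ item variances / total variance). A sketch with a made-up 4-item test and five hypothetical respondents:

```python
# Hypothetical 4-item test, scores for 5 people (rows = people, cols = items).
scores = [
    [2, 3, 3, 4],
    [4, 4, 5, 5],
    [1, 2, 2, 3],
    [3, 3, 4, 4],
    [5, 4, 5, 5],
]

k = len(scores[0])  # number of items

def variance(xs):
    """Population variance of a list of scores."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

item_vars = [variance([row[i] for row in scores]) for i in range(k)]
total_var = variance([sum(row) for row in scores])  # variance of total scores

alpha = (k / (k - 1)) * (1 - sum(item_vars) / total_var)
print(round(alpha, 2))  # 0.96 -> items hang together strongly
```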
Name 2 disadvantages of Cronbach’s alpha
- Tends to give a lower-bound (conservative) estimate of reliability
- Not a measure of unidimensionality i.e., it is a function only of the number of items, and the average inter‐item correlation.
If a test measures more than one variable, what is the best way to test reliability?
Factor analysis
Define Interrater (Interscorer) reliability. What sorts of studies often need this?
The degree of agreement/consistency between two or more scorers (or judges or raters).
Often used in behavioural studies
What are the two main ways to obtain Interrater reliability, and when are they used?
- Use intraclass correlation for continuous measures.
- Use Cohen's kappa for categorical measures.
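Cohen's kappa corrects raw agreement for chance agreement: κ = (p_observed − p_expected) / (1 − p_expected). A sketch with invented ratings from two hypothetical raters classifying ten behaviours:

```python
# Hypothetical: two raters code 10 behaviours as present (1) or absent (0).
rater_a = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
rater_b = [1, 0, 1, 0, 0, 0, 1, 0, 1, 1]

n = len(rater_a)
p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n  # raw agreement

# Chance agreement, from each rater's marginal proportions.
pa1 = sum(rater_a) / n
pb1 = sum(rater_b) / n
p_expected = pa1 * pb1 + (1 - pa1) * (1 - pb1)

kappa = (p_observed - p_expected) / (1 - p_expected)
print(round(kappa, 2))  # 0.8 -> strong agreement beyond chance
```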
Name 5 different test characteristics that determine what type of reliability measure you should use
- Are the items homogeneous or heterogeneous?
- Is the trait being measured dynamic or static?
- Is the range of test scores restricted or not?
- Is it a speed or a power test?
- Criterion‐referenced or norm referenced?
What is SEM and how does it relate to reliability?
Standard Error of Measurement. Provides measure of precision of an observed test score (i.e., estimate of amount of error in an observed score; estimate of the extent of deviation between observed and true score).
Generally: higher reliability = lower SEM.
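Using the standard formula SEM = SD × √(1 − reliability), here is a sketch with hypothetical IQ-style numbers, including the resulting 95% confidence interval around an observed score:

```python
import math

# Hypothetical test: SD = 15, reliability = 0.91, observed score = 100.
sd = 15.0
reliability = 0.91
observed = 100.0

sem = sd * math.sqrt(1 - reliability)  # standard error of measurement
lower = observed - 1.96 * sem          # 95% CI lower bound
upper = observed + 1.96 * sem          # 95% CI upper bound

print(round(sem, 2), round(lower, 2), round(upper, 2))  # 4.5 91.18 108.82
```

Note how a higher reliability would shrink the SEM and tighten the interval.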
In test measures, what does CI stand for, and what does it mean?
Confidence Interval. It gives a probable spread of true scores based on the observed score, e.g. a 95% CI.
In test measures, what does SED stand for, and what does it mean?
Standard Error of the Difference. A measure of how large a difference between two test scores must be before it is considered statistically significant.
List 3 situations where you would use SED
• How did Person A’s performance on test 1 compare with own performance on test 2?
• How did Person A’s performance on test 1 compare with Person B’s performance on test 1?
• How did Person A’s performance on test 1 compare with Person B’s performance on test 2?
(NB. Both tests must be on same scale)
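A common computational form combines the two tests' SEMs: SED = √(SEM₁² + SEM₂²). A sketch with hypothetical reliabilities for two tests on the same scale:

```python
import math

# Hypothetical: two tests on the same scale (SD = 10), r = 0.84 and 0.91.
sd = 10.0
sem1 = sd * math.sqrt(1 - 0.84)  # SEM of test 1
sem2 = sd * math.sqrt(1 - 0.91)  # SEM of test 2

sed = math.sqrt(sem1 ** 2 + sem2 ** 2)
# A score difference larger than about 1.96 * SED would be significant
# at the .05 level.
print(round(sed, 2), round(1.96 * sed, 2))  # 5.0 9.8
```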
What is norm-referenced testing?
Deriving meaning from a person’s test score by comparing it to a reference group.
What is a normative sample?
The reference group to which test‐takers are compared.
What is a criterion‐referenced test?
A test that compares an individual’s score to a particular predetermined standard, criterion, level of performance, or mastery (e.g. a driving exam)
Define standardization
The process of administering a test to a representative sample to establish norms
Define sampling
The selection of an intended population for the test, one that has at least one common, observable characteristic.
Define stratified sampling
Purposefully including a representation of different subgroups of a population
Define stratified random sampling
Divide into strata, then randomly sample from each strata. Final numbers can be proportionate or not, depending on the study requirements.
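The divide-then-sample procedure can be sketched as below; the strata, member IDs, and 10% proportionate allocation are all hypothetical.

```python
import random

random.seed(0)  # for reproducibility of this sketch

# Hypothetical population member IDs grouped into strata (e.g., age bands).
strata = {
    "18-29": list(range(0, 50)),     # 50 people
    "30-49": list(range(50, 130)),   # 80 people
    "50+":   list(range(130, 160)),  # 30 people
}

# Proportionate allocation: randomly sample 10% from each stratum.
sample = []
for name, members in strata.items():
    k = max(1, round(0.10 * len(members)))
    sample.extend(random.sample(members, k))  # without replacement

print(len(sample))  # 16 (5 + 8 + 3): each stratum represented proportionately
```

Disproportionate allocation would simply use a different `k` per stratum (e.g., oversampling a small subgroup).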
What is a purposive sample?
Selecting a sample believed to be representative of the intended population
What is incidental or convenience sampling?
Using a sample that is convenient or available for use. May not be representative of the population, so may be hard to generalise.
Describe 6 steps in the process of developing norms
- Obtain a normative sample.
- Standardise a setting for test administration.
- Administer the test with a standard set of instructions.
- Collect and analyze data.
- Summarize data using descriptive statistics, including measures of central tendency and variability.
- Provide a detailed description of the standardization and administration protocol.
What is a percentile, and what is a potential problem with this method of assessing a norm?
A percentile is the percentage of people in the normative sample whose score was below a particular raw score.
Easy to calculate and popular; however, real differences between raw scores may be minimized near the ends of the distribution and exaggerated in the middle.
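The percentile definition above ("percentage of the normative sample scoring below") can be sketched directly; the normative scores here are invented.

```python
# Hypothetical normative sample of 20 raw scores.
norms = [3, 5, 6, 7, 8, 8, 9, 10, 10, 11, 12, 12, 13, 14, 15, 15, 16, 17, 18, 20]

def percentile_rank(raw, sample):
    """Percentage of the normative sample whose score is below the raw score."""
    below = sum(s < raw for s in sample)
    return 100 * below / len(sample)

print(percentile_rank(13, norms))  # 60.0 -> 12 of 20 scores fall below 13
```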
Define age norm
The average performance of a normative sample segmented by age
Define grade norm
The average performance of a normative sample segmented by grade
Define subgroup norm
A normative sample can be segmented by any of the criteria initially used in selecting the sample
Define national norm
Derived from normative sample that is nationally representative of the population
Define national anchor norm
Equivalency table for scores on two different tests. Allows common comparison.
Define local norms
Normative information with respect to the local population’s performance on some test
Describe the “normal” curve
A bell‐shaped, symmetrical, mathematically defined curve that is highest at its center. Can be conveniently divided into areas defined by units of standard deviations.
Define “standard score”
A raw score converted from original scale to another with a predefined scale (i.e., set mean and standard deviation)
Define Z score
Conversion of a raw score into a number indicating how many standard deviation units the score falls above or below the mean: z = (raw score − mean) / SD
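A one-line worked example, using a hypothetical scale with mean 100 and SD 15:

```python
# Hypothetical: raw score of 120 on a scale with mean 100 and SD 15.
raw, mean, sd = 120, 100, 15
z = (raw - mean) / sd
print(round(z, 2))  # 1.33 -> score is 1.33 SDs above the mean
```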
Define T scores
Scores using a scale where the mean is 50 and the SD is 10. Also known as a “fifty plus or minus ten” scale.
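Converting a z score to a T score is a linear rescaling, T = 50 + 10z; for example:

```python
# Hypothetical: a z score of 1.5 converted to the T-score scale (mean 50, SD 10).
z = 1.5
t = 50 + 10 * z
print(t)  # 65.0 -> 1.5 SDs above the T-scale mean of 50
```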