Al psychometrics notes Flashcards
What are the 2 processes that are part of test standardization?
1. Uniform administration and scoring procedures
2. Development of test norms
What does reliability refer to?
A test’s consistency
What does reliability provide no information on?
What is being measured
What does classical test theory propound?
That an obtained test score (X) is composed of two additive and independent components:
- True score (T): actual status on the attribute
- Error (E): random measurement error
What is the ideal (but unobtainable) formula for reliability?
True variance/observed variance
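In symbols (standard classical test theory notation, combining the two cards above; σ² denotes variance):

$$X = T + E, \qquad r_{xx} = \frac{\sigma^2_T}{\sigma^2_X} = \frac{\sigma^2_T}{\sigma^2_T + \sigma^2_E}$$

The ratio is "unobtainable" because true score variance can never be observed directly.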
What do reliability estimates assume?
1. Variability that is consistent is true variance
2. Variability that is inconsistent is random error
What is the range of a reliability coefficient?
0.0-1.0
What does a reliability coefficient of 0.0 indicate?
That all variability obtained in a test’s scores is attributable to measurement error
What does a reliability coefficient of 1.0 indicate?
That all variability obtained in a test’s scores reflects true score variability
What is the difference between the reliability coefficient and other correlation coefficients?
It is never squared
What does the reliability coefficient estimate?
The proportion of variability in obtained test scores that reflects true scores
What are the 5 main types of reliability?
- Test-retest reliability
- Alternate-forms reliability
- Split-half reliability
- Coefficient Alpha
- Inter-rater reliability
What is test-retest reliability?
The test is given to the same group twice, and then the two sets of scores are correlated
What is the coefficient given from test-retest reliability?
A coefficient of stability (tests the degree of stability over time)
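A minimal computational sketch (hypothetical scores; the coefficient of stability is simply the Pearson correlation between the two administrations):

```python
import numpy as np

# Hypothetical scores for the same five examinees at two administrations
time1 = np.array([98, 105, 110, 92, 120])
time2 = np.array([101, 103, 112, 95, 118])

# Coefficient of stability: Pearson r between administration 1 and administration 2
r_stability = np.corrcoef(time1, time2)[0, 1]
print(round(r_stability, 2))
```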
What is the source of measurement error in test-retest reliability?
Time sampling error
Random factors occurring between the two test administrations, e.g., fluctuations in examinees' states such as anxiety
What kind of tests is test-retest reliability most suitable for?
Aptitude tests - a stable characteristic
What kind of tests is test-retest reliability least suitable for?
Tests of mood - fluctuates over time
What do you do in alternate-forms reliability? What does it indicate?
Two equivalent forms of a test are administered to the same group, and then the two sets of scores are correlated
It indicates the consistency of responding to different item samples
What is the coefficient derived from alternate forms reliability?
Coefficient of equivalence
In alternate-forms reliability, when the forms are administered at different times the test also measures consistency over time - what is the reliability coefficient derived?
Coefficient of equivalence and stability
What kind of error is associated with alternate-forms reliability?
Content sampling
The interaction between different examinees' knowledge and the different content assessed by the items in the forms, e.g., Form A matches one examinee's knowledge better than Form B does
Alternate-form reliability is a rigorous form of reliability, but what is the problem with it?
It is difficult to develop truly equivalent forms
When is alternate-form reliability inappropriate?
When the attribute is likely to fluctuate over time
In what two ways are split-half reliability and coefficient alpha similar?
1. Both involve administering a test once to a single group
2. Both yield a reliability coefficient called a “coefficient of internal consistency”
How is split-half reliability conducted?
The test is split into halves so that each examinee has two scores. The scores on the two halves are then correlated.
What is a problem with split half reliability?
It yields a coefficient derived from only half the test length (remember that reliability decreases as test length decreases).
A problem with split-half reliability is that it is derived from only half a test, what does this mean?
It underestimates true reliability
Split-half reliability underestimates true reliability, how is this corrected?
using the Spearman-Brown prophecy formula
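The standard Spearman-Brown correction for estimating full-length reliability from the correlation between the two halves is:

$$r_{\text{full}} = \frac{2\,r_{\text{half}}}{1 + r_{\text{half}}}$$

For example, a half-test correlation of .60 corrects to 2(.60)/(1 + .60) = .75.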
What is the full name for the coefficient alpha?
Cronbach’s coefficient alpha
How is Cronbach’s coefficient alpha derived?
The test is administered to one group of examinees at a single time point. The formula determines the average inter-item consistency, which is equivalent to the average reliability obtained from all possible split halves of the test.
The coefficient alpha is conservative, and consequently can be considered a X of the test’s reliability
lower bound estimate
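For reference, the usual formula is α = [k/(k − 1)] × (1 − Σσ²_item / σ²_total), where k is the number of items. A minimal sketch with made-up item scores:

```python
import numpy as np

def cronbach_alpha(item_scores):
    """item_scores: examinees x items array of item scores."""
    k = item_scores.shape[1]
    item_variances = item_scores.var(axis=0, ddof=1)       # variance of each item
    total_variance = item_scores.sum(axis=1).var(ddof=1)   # variance of total test scores
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Hypothetical data: 5 examinees x 4 items
scores = np.array([
    [4, 5, 4, 5],
    [2, 3, 2, 3],
    [5, 5, 4, 4],
    [1, 2, 2, 1],
    [3, 3, 4, 3],
])
print(round(cronbach_alpha(scores), 2))
```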
Split-half reliability and coefficient alpha are both measures of what?
Internal consistency
What is an error source for internal consistency?
Content sampling
How does split-half reliability contain content sampling error?
Because of differences between the content of the test halves - items in one half better fit the knowledge of some examinees than items in the other half
How does coefficient alpha contain content sampling error?
Because of differences between individual test items
What is the specific term for the type of content sampling error found in the coefficient alpha?
heterogeneity of the content domain
i.e., the greater the heterogeneity of the content, the lower the inter-item correlations and the lower the coefficient alpha
When is inter-rater reliability used?
In situations where the test scores involve a rater’s judgement (e.g., essay tests or projective tests)
What is the method for determining inter-rater reliability?
Compute a coefficient of agreement between raters (e.g., a kappa coefficient) or determine the % agreement between raters
What is wrong with the method of determining the % agreement between raters in inter-rater reliability?
It leads to erroneous conclusions because it doesn’t take into account the level of chance agreement, which is especially high when the behavior has a high rate of occurrence.
How does Cohen’s kappa adjust % agreement?
By removing the effects of chance
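A minimal sketch of the adjustment, using a hypothetical 2 × 2 agreement table for two raters: kappa = (observed agreement − chance agreement) / (1 − chance agreement).

```python
import numpy as np

# Hypothetical agreement table for two raters classifying 100 cases
# rows = rater A (yes / no), columns = rater B (yes / no)
table = np.array([[40, 10],
                  [ 5, 45]])

n = table.sum()
p_observed = np.trace(table) / n                                   # actual proportion of agreement
p_chance = (table.sum(axis=1) * table.sum(axis=0)).sum() / n**2    # agreement expected by chance
kappa = (p_observed - p_chance) / (1 - p_chance)
print(round(kappa, 2))  # 0.7 here, versus raw agreement of 0.85
```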
What is the term for a type of error that artificially inflates a measure of inter-rater reliability?
consensual observer drift
Consensual observer drift is one kind of error associated with inter-rater reliability - what is another?
Rater factors, such as lack of motivation or rater biases
How can observer drift be eliminated?
By having raters work independently
When raters are told ratings are checked what happens?
Accuracy improves
What three factors affect the reliability coefficient?
- Test length (longer is better)
- Range of test scores (the larger the range, the higher the reliability coefficient)
- Guessing
How does guessing affect the reliability coefficient?
As the probability of correctly guessing answers increases, the reliability coefficient decreases
When is a test reliable?
When it has small measurement error
Does the WAIS-IV have good reliability?
Yes - both internal (split-half) and temporal
What does SEM index?
SEM indexes the amount of error that can be expected in obtained scores due to test unreliability
Also: the SEM is the SD of the scores each examinee would obtain over an infinite number of test administrations, averaged across examinees
How do you calculate a 99% CI?
Obtained score ± 2.58 × SEM
How do you calculate a 68% CI?
Obtained score ± 1 × SEM
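The SEM itself is usually computed as SEM = SD × √(1 − reliability); that formula is not on the cards above but is the standard one. A minimal sketch with made-up numbers:

```python
import math

# Hypothetical test: SD = 15, reliability coefficient = .91, obtained score = 100
sd = 15.0
reliability = 0.91
obtained = 100.0

sem = sd * math.sqrt(1 - reliability)                   # standard error of measurement (4.5 here)
ci_68 = (obtained - 1.00 * sem, obtained + 1.00 * sem)  # 68% confidence interval
ci_99 = (obtained - 2.58 * sem, obtained + 2.58 * sem)  # 99% confidence interval
print(round(sem, 1), ci_68, ci_99)
```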
What is test validity?
The extent to which a test measures what it is supposed to measure. How successful a test is for its intended use.
What are the 3 main types of validity?
- Content validity
- Construct validity
- Criterion-related validity
What is content validity?
The extent to which a test adequately samples a particular content or behavior domain (e.g., our test)
Used in achievement tests
What is construct validity?
The extent to which an examinee possesses a particular hypothetical trait (e.g., aggressiveness or intelligence)
Validity in the measurement of a hypothetical construct that cannot be measured directly
What is criterion-related validity?
The extent to which a measure is related to an outcome (e.g., GRE scores to graduate grades)
What results in poor content validity?
Inadequate sampling of the content domain
What three things indicate adequate content validity?
- Strong internal consistency
- Correlations with other tests of the same domain
- Sensitivity to manipulations that increase the familiarity with the domain
What is construct validity trying to do?
provide evidence that the test measures the construct that it is supposed to measure
Construct validity may entail 5 different things, what are they?
- Internal consistency
- Group differences
- Research on the construct
- Convergent and discriminant validity
- Factorial validity
What is internal consistency?
Whether all of the test's items/measures are measuring the same thing
How can group differences be used to determine construct validity?
Do scores distinguish between people with different levels of the construct?
How can research on the construct be used to determine construct validity?
Do test scores change following manipulation of the construct, as predicted by the theory?
How can convergent and discriminant validity be used to determine construct validity?
You should find a high correlation with measures of the same trait (convergent validity) and a low correlation with measures of unrelated traits (discriminant validity)
How can factorial validity be used to determine construct validity?
Does the test have the predicted factorial composition?
What is the multitrait-multimethod matrix?
An approach to examining construct validity. It organizes convergent and discriminant validity evidence for comparison of how a measure relates to other measures.
monotrait-heteromethod coefficients
Indicate the correlation between different measures of the same trait
(when large, they provide evidence of convergent validity)
heterotrait-monomethod coefficients
Show the correlation between different traits measured by the same method
(when small, this indicates that the test has discriminant validity)
heterotrait-heteromethod coefficients
Correlation between different traits measured by different methods
(when small, they provide evidence of discriminant validity)
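A hypothetical illustration of the three coefficient types, with two traits (anxiety, aggressiveness) each measured by two methods (self-report, observer rating); the correlations are made up to show the expected pattern:
- Anxiety (self-report) with anxiety (observer rating): r = .65 (monotrait-heteromethod; large, so evidence of convergent validity)
- Anxiety (self-report) with aggressiveness (self-report): r = .25 (heterotrait-monomethod; small, so evidence of discriminant validity)
- Anxiety (self-report) with aggressiveness (observer rating): r = .10 (heterotrait-heteromethod; smallest, so evidence of discriminant validity)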
What is factor analysis used for?
Evaluating construct validity
Criterion-related validity
Measures how well scores on one measure (the predictor) predict or estimate an outcome on another measure (the criterion)
Assessed by correlating the scores of a sample of individuals on the predictor with their status on the criterion
When the criterion-related validity coefficient is large…
It confirms that the predictor is useful for predicting or estimating an examinee's status on the criterion
What are the two types of criterion-related validity?
1. Predictive validity
2. Concurrent validity
Predictive validity is a subtype of criterion-related validity - describe it.
It is testing to predict future performance on the criterion
Concurrent validity is a subtype of criterion-related validity - describe it.
Estimates the current status on the criterion
What kinds of validity are associated with incremental validity?
concurrent and predictive
What is the formula for incremental validity?
incremental validity = positive hit rate - base rate
What is the formula for the positive hit rate?
true positives/total positives
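A worked example with hypothetical numbers: if 100 applicants are selected on the basis of the test and 70 of them succeed, the positive hit rate is 70/100 = .70; if 60% of applicants would succeed anyway without the test (base rate = .60), then incremental validity = .70 − .60 = .10.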