Test Construction Flashcards
A test’s reliability (true score variability) cannot be measured directly but must be estimated. In order to estimate a test’s reliability, what must be assessed?
Consistency of scores over time, across different content samples, and across different scorers. Estimation is based on the assumption that variability that is consistent is true score variability, while variability that is inconsistent reflects measurement (random) error.
Most methods for estimating reliability produce a reliability coefficient, which is a correlation coefficient. What range of values can it take? When the coefficient is 0, scores are due to what?
When test-retest reliability is 1.0, what does this mean? (rxx denotes a test correlated with itself.)
If a reliability coefficient is .84, what percentage of variability in scores is due to true score differences, and what percentage is due to measurement error?
0.0 to 1.0 (0 = all measurement error; 1.0 = all true score differences)
Measurement Error
All variability in scores reflects true score variability.
Reliability coefficient = .84: 84% of score variability reflects true score differences; the remaining .16 (16%, i.e., 1.0 − .84) reflects measurement error.
What are the methods for estimating reliability? How does each work, what does it indicate, and what is its measurement error due to?
Test-retest: administer the test to the same group on two different occasions and correlate the two sets of scores. Indicates the degree of stability (consistency) of examinees’ scores over time and yields the coefficient of stability. Measurement error is due to any factors that occur over time: random fluctuations in examinees and variations in the testing situation.
- Alternate (equivalent, parallel) forms reliability: two equivalent forms of the test are administered to the same group and the two sets of scores are correlated. Indicates consistency of responding to two test forms (different item samples); when the forms are administered at different times, it also reflects consistency over time. When administered at the same time, it yields the coefficient of equivalence; when administered at different times, the coefficient of equivalence and stability. Measurement error is due to content sampling (the interaction between an examinee’s knowledge and the different content assessed by each form’s items). Considered the best estimate of reliability, but it is difficult to create truly equivalent forms.
- Internal consistency reliability: split-half reliability and coefficient alpha are the two methods. Both involve administering the test once to a single group, and both yield a coefficient of internal consistency.
- Split-half: the two halves of the test are correlated, and the Spearman-Brown prophecy formula is used to correct the coefficient for the shortened test length.
- Cronbach’s coefficient alpha: a formula that determines the average degree of inter-item consistency (the average of the coefficients that would be obtained from all possible splits); it tends to be a conservative (lower-bound) estimate. When items are scored dichotomously (right or wrong), the Kuder-Richardson Formula 20 (KR-20) can be used instead.
- Error for split-half is due to content sampling across the split (an examinee’s knowledge better matches one half than the other); for coefficient alpha, error is due to content sampling and the heterogeneity of the content domain. Coefficient alpha is most appropriate when the test measures a single characteristic, and internal consistency methods cannot be used for speeded tests.
- Inter-rater reliability (inter-scorer, inter-observer): used when scores depend on a rater’s judgment. Assessed with a correlation coefficient or percent agreement. Correlation coefficients include the kappa statistic (kappa coefficient, Cohen’s kappa), used when ratings represent a nominal or ordinal scale, and the coefficient of concordance (Kendall’s coefficient of concordance), used when there are three or more raters. Percent agreement is calculated by dividing the number of items the raters agree on by the total number of items; a drawback is that some agreement may be due to chance alone. Error is due to factors related to the raters, such as consensual observer drift. (A computational sketch of several of these estimators follows this list.)
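A minimal sketch of three of these estimators (the Spearman-Brown correction, coefficient alpha, and Cohen’s kappa), assuming Python with NumPy; all data values are hypothetical and the functions are illustrative, not a standard library API:

```python
import numpy as np

def spearman_brown(r_half, factor=2.0):
    # Spearman-Brown prophecy formula: estimated reliability when test
    # length changes by `factor` (factor=2.0 corrects a split-half r).
    return factor * r_half / (1 + (factor - 1) * r_half)

def cronbach_alpha(items):
    # Coefficient alpha for an (examinees x items) score matrix:
    # alpha = k/(k-1) * (1 - sum of item variances / variance of totals).
    # For dichotomously scored (0/1) items this equals KR-20.
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars / total_var)

def cohen_kappa(r1, r2):
    # Cohen's kappa: inter-rater agreement corrected for chance agreement.
    r1, r2 = np.asarray(r1), np.asarray(r2)
    p_observed = (r1 == r2).mean()
    p_chance = sum((r1 == c).mean() * (r2 == c).mean()
                   for c in np.union1d(r1, r2))
    return (p_observed - p_chance) / (1 - p_chance)

# Hypothetical data: 5 examinees x 4 dichotomously scored items.
scores = np.array([[1, 1, 1, 0],
                   [1, 0, 1, 1],
                   [0, 0, 1, 0],
                   [1, 1, 1, 1],
                   [0, 1, 0, 0]])
print(spearman_brown(0.70))                     # split-half r of .70 -> ~.82
print(cronbach_alpha(scores))                   # internal consistency
print(cohen_kappa([1, 0, 2, 1], [1, 0, 1, 1]))  # two raters, 3 categories
```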
What factors affect the reliability coefficient?
Test length: the longer the test, the larger the reliability coefficient. In addition to correcting split-half reliability, the Spearman-Brown formula can be used to estimate the effect of lengthening or shortening a test.
Range of test scores: reliability is maximized when examinees are heterogeneous on the attribute measured (an unrestricted range of scores) and when items are of moderate difficulty.
Guessing
When a predictor’s reliability coefficient is .75, its criterion-related validity can be:
No greater than the square root of .75 (√.75 ≈ .87)
rxy ≤ √rxx
A screening test for a condition with a low base rate has an overall accuracy rate of 98%. When using this test to identify people in the general population, what is it good to keep in mind?
The base rate is the proportion of people selected without the predictor (test) who are successful on the criterion (behavior). The positive hit rate is the proportion of people who were selected based on their predictor scores and are successful on the criterion.
Example: a psychologist wants to substitute a short screening tool for a lengthier one and wants to determine whether it has adequate incremental validity.
The predictor determines whether a person is classified positive or negative (on the screening tool), and the criterion determines whether he or she is a “true” or “false” classification (on the lengthier tool); use a scatterplot (decision table).
When the predictor has a high accuracy rate (e.g., 98%) but the base rate is well below 50%, a large number of false positives relative to false negatives will result.
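A worked example of why this happens, as a minimal Python sketch; the 10,000-person population, 1% base rate, and 98% sensitivity/specificity are hypothetical values chosen to produce 98% overall accuracy:

```python
# Apply a 98%-accurate test to 10,000 people when only 1% have the condition.
population = 10_000
base_rate = 0.01            # 1% truly have the condition
sensitivity = 0.98          # P(test positive | condition)
specificity = 0.98          # P(test negative | no condition)

with_condition = population * base_rate              # 100 people
without_condition = population - with_condition      # 9,900 people

true_positives = with_condition * sensitivity            # 98
false_negatives = with_condition - true_positives        # 2
false_positives = without_condition * (1 - specificity)  # 198

# Despite ~98% overall accuracy, false positives (198) far outnumber
# both false negatives (2) and true positives (98).
print(true_positives, false_positives, false_negatives)
```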
What are the different methods for establishing validity?
Content validity: the extent to which the test adequately samples the content domain or behavior that it is designed to measure (e.g., achievement tests); established by experts in the field.
Construct validity: the extent to which the test measures a hypothetical trait (theoretical construct).
Criterion-related validity: the extent to which the test predicts an examinee’s performance on an external criterion.
What are some ways that you can test construct validity?
Convergent and discriminant validity: correlate test scores with scores on other measures that do and do not assess the same trait. Convergent validity requires high correlations with measures of the same or related traits; discriminant validity requires low correlations with measures of unrelated traits.
Factor analysis: another way to obtain information on a test’s convergent and discriminant validity.
What is a multitrait-multimethod matrix, and what does it assess?
A table of correlations that provides information about the degree of association between two or more traits that have been assessed using two or more methods.
- It is used to assess a test’s convergent and discriminant validity.
A supervisor measures two unrelated traits, assertiveness and aggression, using two different methods, a test and supervisor ratings. She calculates the correlations and puts them in a ____.
What are the four types of correlation coefficients that will be created, and what do they mean?
Multitrait-multimethod matrix
Monotrait-monomethod: a reliability coefficient (the same trait assessed with the same method, i.e., a measure correlated with itself). Not a measure of convergent or discriminant validity, but it needs to be high in order to use the matrix.
Monotrait-heteromethod: same trait, different methods; a large correlation indicates convergent validity.
Heterotrait-monomethod: different traits, same method; a small correlation indicates discriminant validity.
Heterotrait-heteromethod: different traits, different methods; a small correlation indicates discriminant validity. (See the hypothetical matrix below.)
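A hypothetical illustration for the assertiveness/aggression scenario above (all values invented for this example; T = test, R = supervisor rating):

```
              Assert-T  Aggress-T  Assert-R  Aggress-R
Assert-T       (.90)
Aggress-T       .25      (.88)
Assert-R        .70       .20      (.85)
Aggress-R       .15       .65       .30      (.82)
```

The parenthesized diagonal values are monotrait-monomethod (reliability) coefficients; .70 and .65 are monotrait-heteromethod (convergent validity) coefficients; .25 and .30 are heterotrait-monomethod coefficients; and .20 and .15 are heterotrait-heteromethod coefficients (both of the latter should be small for discriminant validity).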
To ensure a work sample has adequate content validity, you would:
Make sure that the skills (behaviors) required by the work sample represent the skill domain required by the job.
What do you need to construct a confidence interval around an obtained test score?
The examinee’s test score and the standard error of measurement (which is calculated from the test’s standard deviation and reliability coefficient).
What is the difference between the standard error of measurement, the standard error of estimate, and the standard deviation?
Standard error of measurement: used to construct a confidence interval around an examinee’s obtained score; for a 68% interval, you add and subtract one standard error of measurement to and from the obtained score.
Standard error of estimate: used to construct a confidence interval around an estimated (predicted) criterion score. The 68% confidence interval is constructed by adding/subtracting 1 standard error of estimate to and from the predicted criterion score; the 95% interval uses 2 standard errors, and the 99.7% interval uses 3. (The standard deviation simply describes the variability of obtained scores around the group mean.)
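A minimal sketch of both standard errors using the standard psychometric formulas, assuming Python; the statistics used (SD = 15, reliability = .84, obtained score = 110; criterion SD = 10, validity = .60, predicted score = 50) are hypothetical:

```python
import math

def sem(sd, rxx):
    # Standard error of measurement: SD * sqrt(1 - reliability coefficient).
    return sd * math.sqrt(1 - rxx)

def see(sd_y, rxy):
    # Standard error of estimate: SDy * sqrt(1 - validity coefficient squared).
    return sd_y * math.sqrt(1 - rxy ** 2)

# Confidence interval around an obtained score of 110.
s = sem(15, 0.84)                                      # 15 * sqrt(.16) = 6.0
print(f"68% CI: {110 - s:.0f} to {110 + s:.0f}")       # 104 to 116
print(f"95% CI: {110 - 2*s:.0f} to {110 + 2*s:.0f}")   # 98 to 122

# Confidence interval around a predicted criterion score of 50.
e = see(10, 0.60)                                      # 10 * sqrt(.64) = 8.0
print(f"68% CI: {50 - e:.0f} to {50 + e:.0f}")         # 42 to 58
```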
What are internal and external validity? Are they used to establish the validity of tests?
Internal and external validity refer to the validity of research studies, not tests. A research study has adequate internal validity when its results allow a researcher to conclude that there is a cause-effect relationship between the independent and dependent variables. A research study has adequate external validity when it allows the researcher to generalize conclusions about the cause-effect relationship to other people and conditions.
Tests of statistical significance assess what?
How likely it is that the difference between groups is due to sampling error.
What are the steps for a factor analysis?
(Factor analysis is one way to assess construct validity.) The steps:
1. Administer several tests to a group of examinees.
2. Correlate the scores on all of the tests to produce a correlation matrix, then use factor analytic techniques to convert it into a factor matrix.
3. Simplify the interpretation and naming of the factors by rotating them. There are two types of rotation: orthogonal, in which the factors are kept at a 90-degree angle (uncorrelated), and oblique, in which the factors are not at a 90-degree angle (correlated).
4. Interpret and name the factors (a sketch of these steps follows).
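A minimal sketch of these steps, assuming Python with NumPy and scikit-learn (whose FactorAnalysis supports varimax rotation); the six simulated “tests” and their two underlying factors are hypothetical:

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)

# Step 1: "administer several tests" -- simulate 200 examinees on 6 tests,
# constructed so tests 1-3 share one factor and tests 4-6 share another.
f1, f2 = rng.normal(size=(2, 200))
scores = np.column_stack(
    [f1 + rng.normal(scale=0.5, size=200) for _ in range(3)]
    + [f2 + rng.normal(scale=0.5, size=200) for _ in range(3)]
)

# Step 2: the correlation matrix that the factor analysis summarizes.
print(np.corrcoef(scores, rowvar=False).round(2))

# Steps 3-4: extract two factors, apply an orthogonal (varimax) rotation,
# and inspect the loadings to interpret and name the factors.
fa = FactorAnalysis(n_components=2, rotation="varimax").fit(scores)
print(fa.components_.round(2))  # factor loadings, shape (factors, tests)
```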
When a test has high sensitivity, what does that mean?
What is specificity?
When sensitivity is high, most of the people with the disorder will be identified as having the disorder by the test (i.e., there will be few false negatives), but there will be some people without the disorder who will also be identified as having it (i.e., there will be some false positives). Specificity is the proportion of people without the disorder whom the test correctly identifies as not having it; high specificity means few false positives.
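A minimal sketch of the two indexes computed from a 2x2 classification table, assuming Python; all counts are invented for illustration:

```python
# Hypothetical screening results for 1,000 people.
true_positives = 90    # have disorder, test positive
false_negatives = 10   # have disorder, test negative
true_negatives = 850   # no disorder, test negative
false_positives = 50   # no disorder, test positive

# Sensitivity: proportion of people WITH the disorder the test catches.
sensitivity = true_positives / (true_positives + false_negatives)  # 0.90
# Specificity: proportion of people WITHOUT the disorder the test clears.
specificity = true_negatives / (true_negatives + false_positives)  # ~0.94
print(sensitivity, specificity)
```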
Criterion-related validity is used when?
What is the predictor and what is the criterion?
What are the two forms of criterion related validity?
When test scores are to be used to draw conclusions about an examinee’s likely status on another measure; for example, to ensure that an employee selection test can actually predict how well an applicant will do on a measure of job performance after she is hired.
The test is the predictor, and the other measure is the criterion.
- Concurrent validity: criterion data are collected at the same time as, or prior to, the predictor data (used to estimate current status).
- Predictive validity: the criterion is measured some time after the predictor (used to predict future performance).
What is shared variability, and when can you square a correlation coefficient?
Shared variability is the proportion of variability in one measure that is accounted for by the other; it is represented by the squared correlation between the two measures.
You can square the criterion-related validity coefficient (yielding the coefficient of determination).
When an exam question gives you the correlation coefficient for variable X and variable Y and asks how much variability in variable Y is explained or accounted for by variable X, what do you do to answer the question correctly?
Square the correlation coefficient
A psychologist conducts a criterion-related validity study by administering an assertiveness test (predictor) to 100 salespersons and determining their average monthly sales over 3 months (criterion). She correlates their test scores with sales and obtains a validity coefficient of .60. This means that __ of the variability in sales is accounted for by differences in assertiveness, while the remaining __ is due to other factors.
36% (.60 × .60 = .36 = 36%)
64% (1.00 − .36 = .64 = 64%)
Convergent and discriminant (divergent) validity are associated with ___, and concurrent and predictive validity are associated with ___. Concurrent and predictive validity studies can also assess a predictor’s ___, which is ____ and is evaluated by looking at the data in a _____.
construct validity
Criterion related validity
incremental validity: the increase in correct decisions that can be expected if the predictor is used as a decision-making tool
Scatter plot
What are these?
Communality
Factor loading
Principal component
Eigenvalue
Communality: each test included in a factor analysis has a communality, which indicates the total amount of variability in that test’s scores that is explained by the factor analysis (i.e., by all of the identified factors).
The factor loading is the correlation between a single test and an identified factor.
In principal components analysis (which is similar to factor analysis), the principal component is equivalent to a factor.
In principal components analysis, the eigenvalue indicates the total amount of variability accounted for by each component (factor).
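A minimal sketch tying these terms together via a principal components analysis of a correlation matrix, assuming Python with NumPy; the simulated four-test data are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)
# Simulate 300 examinees x 4 tests that share one strong common factor.
factor = rng.normal(size=300)
scores = np.column_stack([factor + rng.normal(scale=s, size=300)
                          for s in (0.4, 0.6, 0.8, 1.0)])

R = np.corrcoef(scores, rowvar=False)        # 4x4 correlation matrix
eigenvalues, eigenvectors = np.linalg.eigh(R)
order = np.argsort(eigenvalues)[::-1]        # largest component first
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Loadings on the first principal component: eigenvector * sqrt(eigenvalue),
# i.e., each test's correlation with the component (eigenvector sign is
# arbitrary, so loadings may all come out negative).
loadings = eigenvectors[:, 0] * np.sqrt(eigenvalues[0])

# With one retained component, each test's communality is its squared loading.
communalities = loadings ** 2

print(eigenvalues.round(2))     # variability accounted for by each component
print(loadings.round(2))        # test-component correlations
print(communalities.round(2))   # variance in each test that is explained
```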
Criterion-related validity refers to the relationship between _______ and a __________measure.
test scores; criterion
______________ is a type of construct validity. Construct validity refers to the degree to which a test measures a theoretical construct. Establishing this type of validity may include correlating scores on the test with scores on another test that does not purport to measure the same or a related construct. If the correlations are low, this provides evidence that the test measures the construct it purports to measure (and not the construct the other test purports to measure). What is this referred to as?
Discriminant validity
Content validity refers to the degree to which a test samples the content domain it purports to sample. For example, an algebra test with many questions about calculus has low content validity. Content validity, a concern with most academic tests as well as with the psychology licensing exam, is primarily subjectively determined by whom?
What is construct validity?
Experts in the given content domain
Construct validity refers to the degree to which a test measures a theoretical construct.
An item characteristic curve (ICC) indicates:
the relationship between the likelihood that an examinee will endorse the item and the examinee’s level on the attribute measured by the test.
As its name suggests, an item characteristic curve (ICC) provides information about an item’s characteristics. It is part of item response theory, and a curve is constructed for each item. The curve shows the relationship between an examinee’s level on the ability or trait measured by the test and the probability that he or she will respond correctly to that item.
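A minimal sketch of a typical ICC using the two-parameter logistic model common in item response theory, assuming Python; the discrimination and difficulty values are hypothetical:

```python
import numpy as np

def icc(theta, a=1.5, b=0.0):
    # 2PL model: P(correct | ability theta) with discrimination a
    # and difficulty b; the curve is S-shaped and centered at theta = b.
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

# Probability of a correct response rises with the examinee's trait level.
for theta in (-2, -1, 0, 1, 2):
    print(f"theta = {theta:+d} -> P(correct) = {icc(theta):.2f}")
```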
The internal consistency of test items can be assessed with what? What are the possible sources of error? When is it useful?
The kappa statistic is used to measure what?
Alternate forms reliability provides information on what, and is used when ___________?
What provides information on the consistency of a test over time?
Split-half reliability (with the Spearman-Brown correction), Cronbach’s coefficient alpha (a conservative, lower-bound estimate), or the Kuder-Richardson Formula 20 (when test items are scored dichotomously). Error results from content sampling: for split-half, differences in content between the two halves of the test; for coefficient alpha, differences between individual test items (heterogeneity of the test items). Most useful when the test measures a single characteristic.
Inter-rater reliability, when scores or ratings represent a nominal or ordinal scale of measurement.
The equivalence of items contained in two different forms of the test; used when two forms of the test are administered to the same group of examinees.
A test-retest reliability coefficient (the coefficient of stability).