Measurement Flashcards
Classical Test Theory/True Score Theory
Classical test theory (CTT), or true score theory, was developed by Spearman in 1904. Researchers typically want the variation in scores on a measure (e.g., of health anxiety) to reflect genuine differences in the construct rather than measurement error. CTT enables researchers to examine reliability by partitioning an individual's observed score into a true score and error, expressed as observed score = true score + error (X = T + E). Observed scores are measured from a sample of participants; true scores are defined as the average score an individual would obtain if they completed the measure an infinite number of times; and measurement error is the noise contributing to unreliability. True scores are not directly observable. CTT also assumes that measurement error is random and normally distributed, that true scores and error scores are uncorrelated, and that error scores on separate tests are uncorrelated.
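A minimal simulation of the CTT model (assuming hypothetical, normally distributed true scores and errors) makes the decomposition concrete: reliability is the ratio of true-score variance to observed-score variance.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
true = rng.normal(50, 10, n)   # unobservable true scores T
error = rng.normal(0, 5, n)    # random, mean-zero measurement error E
observed = true + error        # observed scores X = T + E

# Reliability = true-score variance / observed-score variance;
# with these parameters it should land near 100 / 125 = .80.
print(f"estimated reliability: {true.var() / observed.var():.2f}")
```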
Content validity
Content validity refers to the degree to which item content actually reflects the construct of interest. To strengthen content validity, the construct should be well defined, items should be generated from theory and expert judgment, and the items should then be examined statistically (e.g., with factor analysis) to confirm item content and overlap. Content validity differs from face validity, which concerns whether the test's purpose is discernible to the layperson, so that test takers understand what is being measured and feel motivated to complete it.
Criterion validity
Criterion validity speaks to the degree to which a measure is related to relevant criterion variables (e.g., is health anxiety related to anxiety sensitivity?) and is split into concurrent and predictive validity. Concurrent validity examines the relationship between the measure's scores and a criterion measured at the same time, while predictive validity examines whether the measure's scores can predict criterion variables measured at a later time.
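Both coefficients are ordinarily just correlations; a hypothetical sketch (all data simulated, variable names invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
health_anxiety = rng.normal(20, 5, 200)
# simulated criteria that share variance with the test score
anxiety_sensitivity_now = 0.6 * health_anxiety + rng.normal(0, 4, 200)
doctor_visits_next_year = 0.3 * health_anxiety + rng.normal(0, 6, 200)

# concurrent: criterion measured at the same time as the test
print(np.corrcoef(health_anxiety, anxiety_sensitivity_now)[0, 1])
# predictive: criterion measured later
print(np.corrcoef(health_anxiety, doctor_visits_next_year)[0, 1])
```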
Construct validity
Construct validity speaks to whether the measure being examined actually reflects the psychological construct it aims to measure and comprises convergent and discriminant validity. Convergent validity examines whether the test's scores correlate with scores on measures of related constructs (e.g., resilience and grit), while discriminant validity examines whether the test's scores do not correlate with measures they should not theoretically be related to (e.g., resilience and coffee-drinking habits). These can be examined using a multitrait-multimethod matrix, in which correlations across other measures and methods are calculated. Evidence for construct validity also draws on the test's content, its internal structure, its associations with other test scores, the psychological processes responsible for test responses, and the test's consequences.
Internal Consistency vs. Stability (Coefficient of Stability) in Reliability
The coefficient of stability in reliability is calculated as the correlation between the scores of the same group of individuals who took the same test at two separate times (i.e., test-retest reliability). This coefficient can be affected by practice effects in the sample, instability of the construct across time, or administrative and participant errors. For instance, a group of individuals taking a depression measure might have expectations about how to answer on the second administration, or their experience of depression the next week might have shifted. Another way of estimating a reliability coefficient is with alternate forms, in which one form of the test is administered, followed by a break, and then a second, parallel form. Crocker and Algina (2008) state that there is no perfectly valid standard for an acceptable coefficient of stability.
On the other hand, internal consistency (IC) examines how participants perform across items within a single test administration. IC speaks to the degree of homogeneity of the items and the degree to which the items measure the same construct. IC can be estimated with Cronbach's alpha, based on item variances and covariances, or with split-half methods. Researchers might use a domain of items on a single test to explore the consistency with which participants answered. Errors can arise from content validity issues (e.g., depression items do not accurately capture the disorder's diagnostic criteria) or measurement issues (e.g., faulty administration or scoring, participant guessing, fluctuations in performance). IC also tends to increase as more test items are included.
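A minimal sketch of Cronbach's alpha from an items matrix (rows = examinees, columns = items; data simulated for illustration):

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """alpha = k/(k-1) * (1 - sum of item variances / variance of totals)."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars / total_var)

rng = np.random.default_rng(2)
trait = rng.normal(0, 1, (300, 1))
items = trait + rng.normal(0, 1, (300, 5))  # 5 items tapping one construct
print(f"alpha = {cronbach_alpha(items):.2f}")
```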
Multitrait-Multimethod Matrix
This method, developed by Campbell and Fiske (1959), is used to examine the adequacy of a measure's construct validity using the same sample of participants. The matrix is especially helpful for examining a measure's convergent and discriminant validity along with method variance. It contains correlation coefficients between the measure of interest and measures of similar and dissimilar constructs. Reliability coefficients are reported for the same construct measured by the same method, and should be high. Convergent validity coefficients are reported for the same construct measured by differing methods, and should also be high. Discriminant validity coefficients are reported for differing constructs measured by the same method (heterotrait-monomethod) or differing constructs measured by different methods (heterotrait-heteromethod); both should be lower than the prior two coefficients.
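A toy layout helps locate each coefficient; the values below are invented for illustration (traits A and B, each measured by method 1, self-report, and method 2, clinician rating):

```python
import numpy as np

labels = ["A1", "B1", "A2", "B2"]
mtmm = np.array([
    [0.90, 0.30, 0.65, 0.15],  # A1: .90 reliability, .65 convergent (A1-A2)
    [0.30, 0.88, 0.20, 0.60],  # B1
    [0.65, 0.20, 0.85, 0.25],  # A2
    [0.15, 0.60, 0.25, 0.86],  # B2
])
# Diagonal: reliability (same trait, same method) -> highest.
# A1-A2 and B1-B2: convergent validity (same trait, different method) -> high.
# A1-B1 and A2-B2: heterotrait-monomethod -> lower.
# A1-B2 and B1-A2: heterotrait-heteromethod -> lowest.
for label, row in zip(labels, mtmm):
    print(label, " ".join(f"{v:.2f}" for v in row))
```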
Percentiles/Percentile Rank
A percentile is defined as the point on a scale at or below which a specified percentage of scores fall; percentiles divide the distribution into hundredths, and percentile ranks range from the 1st to the 99th. For example, an individual taking an anxiety measure might score at the 75th percentile, the point on the distribution below which 75% of the scores lie. Percentile ranks are a type of standardized score and are defined relative to the scores of examinees in the norm group. They are nonlinear transformations of raw scores, which is why a gain of 1 raw-score point can correspond to different magnitudes of change in percentile rank. Weaknesses: percentile ranks are less stable in the center of the distribution, where raw scores are densely packed, and equal raw-score differences do not correspond to equal percentile-rank differences at different points in the distribution (i.e., a raw-score increase of 5 points could mean different percentile changes for different individuals).
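A short sketch of one common definition of percentile rank (the percentage of norm-group scores at or below a raw score; conventions vary), using simulated norms. It also shows the nonlinearity: the same 5-point raw gain translates to different percentile gains in different parts of the distribution.

```python
import numpy as np

rng = np.random.default_rng(3)
norm_group = rng.normal(50, 10, 1_000)  # hypothetical norm-group scores

def percentile_rank(score: float, norms: np.ndarray) -> float:
    return 100.0 * np.mean(norms <= score)

# Same 5-point raw gain, different percentile gains:
for raw in (57, 62, 65, 70):
    print(f"raw {raw} -> PR {percentile_rank(raw, norm_group):.0f}")
```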
Validity vs Reliability (Types)
Reliability is the consistency or reproducibility of test scores: the degree to which individuals' deviation scores remain relatively consistent over (1) repeated administrations or (2) alternate forms. Reliability is a large concern for test creators and those utilizing measures, because some degree of unreliability is reasonably expected in most situations. There are many ways of assessing reliability, including test-retest (producing the same score over multiple administrations), parallel forms (obtaining similar scores by administering two forms that are parallel in content), interrater (the degree to which different raters agree in their assessment decisions), and internal consistency using a single test administration (the degree to which items relate to the same construct). While reliability is important to establish to ensure examinee scores are consistent, it does not ensure that the information being collected is accurate.
Validity refers to the content of a measure and whether the measure assesses the domain of interest. Validation refers to the collection of evidence to support the types of inferences that are to be drawn from test scores. There are three general types of validity: (1) content validity, making inferences to a larger domain of similar items; (2) criterion validity, making concurrent or predictive inferences from a test to some other behavioral variable; and (3) construct validity, whether the measure assesses the construct it intends to measure, examined by comparing it with measures of similar and dissimilar concepts. Validity matters so that the use of a test is appropriate given its content and the intention behind the measurement; poor validity can lead to faulty conclusions (e.g., a creativity test that actually measures drawing ability rather than creative ideas).
Scales of Measurement
Nominal, ordinal, interval, ratio (NOIR). Nominal: labels and assigns things to categories, like gender; it has no meaningful order, equal distance between units, or fixed origin. Ordinal: numbers are ordered using values of the real-number system, but they lack equal distances between units and a fixed origin (e.g., military rank). Interval: the scale includes rank order and the distances between the numbers have meaning, but the point of origin is arbitrarily chosen and does not represent total absence of the property being measured (e.g., Celsius temperature). Ratio: the scale orders the properties, has equal distances between units, and has a fixed origin or absolute zero point, allowing nonzero measurements to be expressed as ratios of one another; for example, on a ratio scale such as height we know that 2 ft is half of 4 ft, and weight behaves the same way.
Spearman Brown prophecy formula
The Spearman-Brown prophecy formula estimates the reliability of a lengthened (or shortened) test from the reliability of a component test, assuming that all component tests are parallel in content and difficulty. For instance, it can predict the reliability of a test whose length has changed. The formula is commonly used to correct the split-half method, which tends to underestimate the reliability coefficient because shorter tests are less reliable. For instance, if the correlation between two half-tests of three items each is .34, the corrected reliability estimate for six items is .51. This process is usually used in measurements of internal consistency, in which one test is given in a single administration and split into two halves for scoring.
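The formula is rho_kk = k * rho / (1 + (k - 1) * rho), where k is the factor by which test length changes; a quick check of the worked example above:

```python
def spearman_brown(rho: float, k: float) -> float:
    """Predicted reliability after changing test length by a factor of k."""
    return k * rho / (1 + (k - 1) * rho)

# Split-half correction: r = .34 between two 3-item halves; doubling to
# six items (k = 2) gives 2(.34) / (1 + .34), about .51, matching the card.
print(f"{spearman_brown(0.34, 2):.2f}")
```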
Test Norms
Normative scores provide information about an examinee's performance compared to the score distribution of some reference group. Raw scores alone make useful interpretation difficult; norms therefore enhance test score interpretation. Scores can be compared across people taking the exam (peer norm group) or to the test score distribution of a sample representing a well-defined population. Test norms are meaningful based on two characteristics: (1) the extent to which the test user is interested in comparing the examinee to the normative population and (2) the adequacy of the norming sample in representing that population. For example, if an examinee does not characteristically fit within the norming sample, the norms are likely not a good representation of the population from which the examinee came. Similarly, if the norming sample is not adequately large or diverse, its usefulness will be narrower. The normative sample should be described in sufficient demographic detail to permit a test user to assess whether it is meaningful to compare an examinee's performance to the norm group's.
Steps of conducting a norming study: (1) ID population of interest (2) ID the most critical statistics that will be computed for the sample data (3) Decide on the tolerable amount of sampling error (4) Devise a procedure for drawing a sample from the population of interest (5) Estimate minimum sample size required to hold the sampling error within limits (6) Draw the sample and collect the data (7) Compute the values of the group statistics of interest (8) ID the types of normative scores that will be needed (9) Prepare written documentation of the norming procedure.
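For step 5, if the statistic of interest is the sample mean (an assumption here; other statistics need different formulas), the minimum n that keeps sampling error within a tolerable bound E at a given confidence level follows from n = (z * sigma / E)^2:

```python
import math

def min_sample_size(sigma: float, error: float, z: float = 1.96) -> int:
    """Smallest n keeping the mean's sampling error within +/- error."""
    return math.ceil((z * sigma / error) ** 2)

# e.g., SD of 15 score points, tolerable error of 1 point, 95% confidence
print(min_sample_size(sigma=15, error=1))  # 865
```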
Standardized Scores for Reporting Test Results/Types of Normative Scores
According to Crocker and Algina (2008), there are a few different types of normative scores. Normative scores are used to interpret a raw score in terms of its relative location or frequency within the distribution of total scores. In other words, standard scores involve a transformation of the data that allows comparison with other standardized data sets.
First, percentile rank allows a researcher to interpret an examinee's raw score according to the percentage of norm-group scores falling below it (e.g., 31st percentile = 31% of participants scored below the examinee).
Second, normalized z-scores are raw scores transformed so that fixed proportions of examinees fall below or above a given point; they follow an approximately normal distribution with a mean of 0 and an SD of 1. It is common for tests to further transform z-scores; for instance, Wechsler tests use standard scores with a mean of 100 and an SD of 15, and T-scores use a mean of 50 and an SD of 10 (see the sketch after this list).
Third, stanines broadly depict examinee test scores on a scale from 1 to 9, with a mean of 5 and an SD of 2, making scores easier to interpret.
Fourth, scaled scores are converted scores that allow normative comparisons between groups but place the distributions on a single continuum; in other words, this allows interpretation of an examinee's scores in an educational area (e.g., reading) beyond just comparing to their grade group. For example, an examinee who tests at the 50th percentile in reading in second grade would have a scaled score of 571, and if they maintain the 50th percentile in fourth grade, their scaled score would be 674.
Fifth, grade- and age-equivalent scores are normative scores that indicate how an examinee's performance compares with that of other grade or age groups. For example, a third grader's GE score of 4.0 indicates that they scored on that same third-grade content as a typical fourth grader would be expected to perform.
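A minimal sketch of the linear versions of the transformations named above (normalized scores would first convert percentile ranks through the normal curve; raw scores below are hypothetical):

```python
import numpy as np

raw = np.array([35.0, 50.0, 62.0, 71.0])
z = (raw - raw.mean()) / raw.std(ddof=1)       # z: mean 0, SD 1
t_scores = 50 + 10 * z                         # T: mean 50, SD 10
wechsler = 100 + 15 * z                        # Wechsler-style: mean 100, SD 15
stanines = np.clip(np.round(5 + 2 * z), 1, 9)  # stanine: 1-9, mean 5, SD 2

for r, zi, ti, wi, si in zip(raw, z, t_scores, wechsler, stanines):
    print(f"raw {r:4.0f}  z {zi:+.2f}  T {ti:5.1f}  std {wi:6.1f}  stanine {si:.0f}")
```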
Generalizability Theory
Crocker and Algina (2008) describe generalizability theory as a framework for designing measurement studies, evaluating reliability, and estimating measurement error, particularly for carrying out decision studies (D-studies). Measurement conditions are considered facets (e.g., in studying teens' math ability, the teacher administering the test is a facet); the theory allows a researcher to account for the variance contributed by these facets (e.g., teacher's mood, testing environment). It lets a researcher examine whether a particular set of measurements of an examinee would generalize to a more extensive set of measurements of that examinee; in other words, it is used to minimize and predict error. The hypothetical exercise of measuring an examinee under all conditions in the universe of generalization and averaging the results produces their universe score. A generalizability coefficient can then be formed as the ratio of universe-score variance to observed-score variance.
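A sketch of the simplest case, a one-facet (persons x items) G-study on simulated data: variance components are estimated from the two-way mean squares, and the G coefficient is universe-score variance over universe-score variance plus relative error.

```python
import numpy as np

rng = np.random.default_rng(4)
n_p, n_i = 100, 8
person = rng.normal(0, 1, (n_p, 1))              # universe scores
scores = person + rng.normal(0, 0.8, (n_p, n_i))

person_means = scores.mean(axis=1)
grand = scores.mean()
ms_person = n_i * ((person_means - grand) ** 2).sum() / (n_p - 1)
ms_resid = ((scores - person_means[:, None]
             - scores.mean(axis=0) + grand) ** 2).sum() / ((n_p - 1) * (n_i - 1))

var_person = (ms_person - ms_resid) / n_i        # universe-score variance
g_coef = var_person / (var_person + ms_resid / n_i)
print(f"G coefficient over 8 items: {g_coef:.2f}")
```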
Standard error of measurement (SEM)
The SEM is the standard deviation of the errors of measurement in the observed scores of a group on a test. The SEM provides an estimate of how far an examinee's true score may lie from their observed score. For example, if the reliability of a test were 1.00, indicating no errors of measurement, the SEM would be 0. Because an examinee's true score is not realistically known, the SEM is used to create a confidence band around the observed test score that has a known probability of containing the examinee's true score. A concern with the SEM is its assumption that standard errors are equal for all examinees, which is often faulty.
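A quick worked example using the standard formula SEM = SD * sqrt(1 - reliability) with hypothetical values, plus a 95% band around an observed score:

```python
import math

sd_x, reliability, observed = 15.0, 0.90, 108.0
sem = sd_x * math.sqrt(1 - reliability)            # about 4.74
lo, hi = observed - 1.96 * sem, observed + 1.96 * sem
print(f"SEM = {sem:.2f}; 95% band: {lo:.1f} to {hi:.1f}")
```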
Psychological Construct
A psychological construct is an abstract idea, variable, or concept that is not directly observable (e.g., creativity). A construct often serves as a label for an observed behavior or group of behaviors; it should be operationally defined and agreed upon. With that in place, researchers can begin exploring the construct's associations with other variables, observed behaviors, and theories. There are five primary issues in measuring psychological constructs: (1) there is no single way of defining a construct (no universal acceptance); (2) psychological measurements are usually based on limited samples of behavior; (3) the measurement obtained is always subject to error; (4) the units of measurement are not well defined; and (5) measurements must have demonstrated relationships to other variables to have meaning (they must be based on observable behavior and be related to other, similar constructs).