Psychometrics Flashcards
Zickar & Broadfoot (2009)
“The notion that CTT is antiquated and that IRT will abolish it is, we believe, an urban legend.”
CTT uses terms such as true score or reliability.
A true score is the expected value of an individual's observed score on a particular test, so true scores are defined by both the person and the scale; an individual would have a different true score for each IQ test.
The false notion of a true score that exists independently of any test has been called the platonic notion of a CTT true score.
Defining the true score as an expected value (rather than as a platonic entity) narrows generalizability but avoids ontological difficulties.
Reliability, denoted rxx, is the proportion of observed score variance that is due to true score variance: true score variance divided by observed (total) variance. Estimates of reliability include:
Test-retest reliability, split-half, alternate forms, and internal consistency methods (e.g., Cronbach's alpha).
Reliability provides a measure of precision for a test in CTT.
The standard error of measurement is a function of reliability.
Generally, as reliability increases, the standard error decreases, meaning greater confidence about the precision of individual test scores.
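A minimal sketch (not from the article; values are hypothetical) of how reliability and the standard error of measurement relate, using SEM = SD_observed * sqrt(1 - r_xx):

```python
# Hypothetical illustration of the CTT standard error of measurement.
import math

def standard_error_of_measurement(sd_observed: float, reliability: float) -> float:
    """SEM = SD_x * sqrt(1 - r_xx); higher reliability -> smaller SEM."""
    return sd_observed * math.sqrt(1 - reliability)

# With an IQ-style SD of 15 (hypothetical values):
print(standard_error_of_measurement(15, 0.90))  # ~4.7
print(standard_error_of_measurement(15, 0.70))  # ~8.2
```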
Though CTT focuses on the scale score as the unit of analysis, several statistics are also used to assess item functioning. For example:
Item difficulty - the proportion of test takers who affirm the item (answer correctly, for ability items, or agree with the item, for personality items).
Item discrimination - describes how well an item differentiates between test takers with different levels of the trait measured by the scale (both statistics are computed in the sketch below).
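A minimal sketch with made-up response data (not from the article), computing item difficulty as the proportion affirming each item and item discrimination as a corrected item-total correlation (one common CTT index):

```python
# Hypothetical dichotomous responses: rows = test takers, columns = items; 1 = correct/agree.
import numpy as np

responses = np.array([
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 0],
])

# Item difficulty: proportion of respondents affirming each item.
difficulty = responses.mean(axis=0)

# Item discrimination: correlation of each item with the total of the remaining items.
totals = responses.sum(axis=1)
discrimination = np.array([
    np.corrcoef(responses[:, j], totals - responses[:, j])[0, 1]
    for j in range(responses.shape[1])
])

print(difficulty)       # e.g., 0.8, 0.6, 0.2, 0.8
print(discrimination)
```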
Item Response Theory (IRT)
Focuses on measuring a latent construct believed to underlie the responses to a given test; the latent trait is symbolized by theta. IRT has 2 primary assumptions:
*(1) Unidimensionality - the test measures only one latent trait; and (2) local independence - items in a scale are correlated solely because of theta, so if theta were partialled out, the items would be uncorrelated.
A cornerstone of IRT is the item response function (IRF), which relates theta to the expected probability of affirming an item. The IRF is determined by item parameters, which consist of the following (see the sketch after this list):
Item difficulty - the location on the theta continuum where the item is most discriminating; items with high difficulty will only be endorsed by respondents with large positive thetas.
Item discrimination - has the same goal as under CTT; it is reflected in the slope of the IRF at its inflection point, which falls at or near the item difficulty.
Pseudo-guessing parameter - relates to the probability that an individual with an extremely low theta will answer an item correctly (e.g., extremely low-theta people can still guess correctly with a probability of roughly 1 / number of response options).
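A hedged sketch of the standard three-parameter logistic (3PL) item response function, with hypothetical parameter values matching the three parameters listed above (a = discrimination, b = difficulty, c = pseudo-guessing):

```python
# 3PL item response function: P(affirm | theta) = c + (1 - c) / (1 + exp(-a * (theta - b))).
import math

def irf_3pl(theta: float, a: float, b: float, c: float) -> float:
    return c + (1 - c) / (1 + math.exp(-a * (theta - b)))

# A hard, discriminating four-option item: low-theta respondents hover near c = .25.
print(irf_3pl(-2.0, a=1.5, b=1.0, c=0.25))  # ~0.26
print(irf_3pl( 1.0, a=1.5, b=1.0, c=0.25))  # ~0.63 (theta equal to the item difficulty)
print(irf_3pl( 3.0, a=1.5, b=1.0, c=0.25))  # ~0.96
```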
Information quantifies the amount of uncertainty removed by considering item responses. Items with a difficulty parameter close to the person's theta level, and strong discrimination, provide more information.
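A small sketch of the item information function under a 2PL model (a simplifying assumption; the 3PL version adds a correction for guessing), showing that information peaks at the item difficulty and scales with the squared discrimination:

```python
# 2PL item information: I(theta) = a^2 * P(theta) * (1 - P(theta)), hypothetical parameters.
import math

def info_2pl(theta: float, a: float, b: float) -> float:
    p = 1 / (1 + math.exp(-a * (theta - b)))
    return a ** 2 * p * (1 - p)

print(info_2pl(theta=0.0, a=2.0, b=0.0))  # 1.0, the peak (theta equals difficulty)
print(info_2pl(theta=2.0, a=2.0, b=0.0))  # ~0.07, far from the item's difficulty
```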
In contrast to CTT true scores, theta estimates from two different tests measuring the same construct should be equivalent within sampling error (remember, in CTT the true score depends on the test, because different tests each yield their own true score).
*A serious limitation of CTT is that its statistics and parameters are both sample dependent and test dependent: true scores (person parameters) are test dependent, and item difficulty and item discrimination (item parameters) are sample dependent.
It is therefore not easy to compare true scores across tests.
Also, in CTT, item statistics are dependent on the sample used to estimate them (Zickar & Broadfoot, 2009).
Downsides of IRT: to estimate parameters with the same precision as CTT, larger sample sizes are needed, because IRT models are more complex and contain more parameters.
IRT also requires a strong assumption of unidimensionality, and its specialized estimation software can be difficult to run.
Re: the assumption of unidimensionality - it goes with local independence, which means that if you control for theta, there should be no correlation between items (i.e., all correlation among items is due solely to theta). This is often not the case (e.g., Big Five traits may correlate with each other).
Ironically, with somewhat (but not too) multidimensional data, IRT may do better, because deviant items can be weighted less when computing trait scores, whereas in CTT items are usually weighted equally.
CTT more easily supports other statistical methods such as EFA, CFA, and SEM, which are built on the CTT measurement foundation (X = T + E).
But no doubt about it, IRT is better if you want to concentrate measurement precision at a certain range of the latent trait.
Applications such as differential item functioning (DIF), appropriateness measurement, and computerized adaptive testing (CAT) are advanced psychometric tools that require IRT.
With appropriateness measurement, researchers try to identify respondents who answer items in an idiosyncratic way that sets them apart from other respondents. IRT looks for people who deviate from the IRT model, and it does better than CTT at this.
IRT advances the psychometric quality of our instruments, particularly by allowing tests of more specific hypotheses, being theory based, and facilitating advanced psychometric tools.
But it requires unidimensionality and large samples because of its greater number of parameters.
Melchers et al. (2020)
Review of applicant faking in selection interviews
Researching applicant faking in employment interviews is new, mostly emerging in the 2010s
This is surprising given that the vast majority of applicants engage in some degree of faking in interviews.
Faking good is more common, but faking bad happens in rare cases such as attempting to receive further unemployment benefits or avoiding compulsory military service.
Socially desirable responding (SDR) is usually described as comprising two facets: self-deception and impression management (directed at others).
SDR is more of a trait, while faking is more of a state.
It is possible that applicants also use nonverbal behaviors deceptively (e.g., fake smile or laugh), and future research could explore this form of faking (Melchers et al., 2020).
Future research should look for whether context factors like competitive industry or rough economy increase likelihood of faking.
Criterion-related validity can be influenced in different directions, and data often become homogenized as applicants fake to appear similarly desirable.
Faking is less common among more qualified applicants than among less qualified applicants.
Practical applications:
Increasing the structure of the interview reduces applicant faking and its effects on interview performance ratings.
*Training based on content-based lie detection strategies is a viable strategy to help interviewers deal with faking.
DO NOT: Use warnings that faking can be detected (unless you are prepared to deceive applicants about the actual possibility of detecting faking).
DO NOT: Rely on interviewers’ intuition, experience, or abilities (e.g., emotional intelligence) to try to detect when/how applicants fake.
*DO NOT: Rely on non-verbal behaviors to identify whether applicants are honest or faking
DO:
When trying to assess the veracity of applicants' responses, train interviewers to focus on content and rely on a combination of indicators (e.g., level of detail, plausibility).
DO: Increase the degree of interview structure (e.g., use standardized and job-related questions).
Applicants usually fake to compensate for a lack of qualification or fit, and thus faking should negatively impact criterion-related validity
Conscientiousness and Agreeableness are usually negatively related to faking, whereas Extraversion and Neuroticism are usually positively related to faking. In addition to the Big Five, the Dark Triad, which is comprised of Psychopathy, Narcissism, and Machiavellianism (Paulhus & Williams, 2002), is linked to faking.
Levashina & Campion (2006)
There is just one model that is specifically tailored to faking in interviews (Levashina & Campion, 2006). In this model, Levashina and Campion (2006) consider faking as a function of capacity, willingness, and opportunity. Levashina and Campion’s model is a multiplicative model, so that all factors must be present at least to some extent for faking to occur.
Kuncel (2001)
MA of GRE score predictive validity
The verbal, quantitative, analytical, and subject tests showed validity generalization (i.e., predicted performance across fields) for criteria including GPA, 1st-year GPA, faculty ratings, comprehensive exam scores, citation counts, and degree attainment.
Not a lot of evidence of moderation of the effect.
GRE general score & Undergrad GPA had very similar validities.
Subject GRE scores were better predictors than general GRE scores. It may be that motivation for a given field makes subject tests better, or that subject-test takers have had a head start on the content compared to classmates; future research should answer this question.
WHY THE SIOP Principles summary says predictors shouldn't be redundant: for other predictors to provide incremental validity, they must BOTH be correlated with the criterion AND be weakly related (ideally uncorrelated) to the other predictors used in the selection system.
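A hedged illustration (hypothetical correlations, not from Kuncel) of why redundancy matters: with the standard two-predictor formula R^2 = (r_y1^2 + r_y2^2 - 2*r_y1*r_y2*r_12) / (1 - r_12^2), the increment a second test adds over the first shrinks as the predictor intercorrelation r_12 grows:

```python
# Incremental validity of a second predictor under different predictor intercorrelations.
def r_squared_two_predictors(r_y1: float, r_y2: float, r_12: float) -> float:
    return (r_y1**2 + r_y2**2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12**2)

r_y1, r_y2 = 0.50, 0.40           # each predictor's (hypothetical) validity
for r_12 in (0.0, 0.3, 0.7):      # predictor intercorrelation
    r2 = r_squared_two_predictors(r_y1, r_y2, r_12)
    print(f"r_12={r_12}: R^2={r2:.3f}, increment over test 1 alone={r2 - r_y1**2:.3f}")
# Increments of about .16, .07, and .005: near-redundant predictors add little.
```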
Murphy (2009)
Content Validation Is Useful for Many Things, but Validity Isn’t One of Them
Content-oriented validation strategies establish the validity of selection tests as predictors of performance by comparing the content of the tests with the content of the job. These comparisons turn out to have little if any bearing on the predictive validity of selection tests.
***There is little empirical support for the hypothesis that the match between job content and test content influences validity, and there are often structural factors in selection (e.g., positive correlations among selection tests) that strongly limit the possible influence of test content on validity.
Comparisons between test content and job content have important implications for the acceptability of testing, the defensibility of tests in legal proceedings, and the transparency of test development and validation, but these comparisons have little if any bearing on validity.
The legislative and judicial history of the current Civil Rights Act suggests that job relatedness and validity are not necessarily synonymous; tests that are valid predictors of performance (e.g., Raven's Progressive Matrices) might not be judged to be job related because of the lack of any manifest relationship between the test content and the content of a job.
Numerous factors, ranging from poor item writing and range restriction to differences in respondents’ understandings of and reaction to response formats, might lead to situations in which scores on tests that seem job related turn out to provide little help in making valid selection decisions (Murphy, 2009).
When the tests considered as possible predictors of performance are positively correlated with one another and with the criterion (i.e., they show positive manifold, which is often the case), content-oriented assessments of validity have very little to do with whether or not test scores predict job performance.
*The effects of positive manifold are most pronounced when the correlations among tests are large (e.g., .50 or above), but they are substantial even when the correlations among tests are considerably smaller.
A third approach is one in which subject matter experts are asked to make judgments about overlaps in the knowledge, skills, and abilities (KSAs) required to do well on a test and those required to do well on a job.
Suppose a consulting firm develops selection test batteries for entry-level machine operator jobs in one organization and for data entry clerks in another. It uses careful job analyses to develop reliable measures of the knowledge, skills, and abilities that closely match the content of each job. As a result of a mix-up in the mailroom, the test batteries are sent to the wrong organizations. Murphy (2009) notes that, given the existing literature on KSAs and prediction, the batteries would probably predict roughly the same at each organization anyway.
WOW: Both the Uniform Guidelines and case law suggest that job relatedness represents an adequate justification for a set of tests, regardless of the empirical relationship between test scores and measures of job performance and effectiveness.
A test battery that is both job related and valid as a predictor of job performance is best, but if forced to make a choice between job relatedness and validity, most stakeholders are likely to prefer tests that seem logically related to the job over alternatives that might show equivalent or even higher validity but no apparent job relatedness.
Griggs & Civil Rights Act of 1991
A selection procedure is job related if there is a manifest relationship between the test and the job, based on a structured analysis of the tasks, duties, and responsibilities of the job.
Thus, according to Murphy (2009), there is little doubt that content-oriented methods of validating tests are useful for establishing the job relatedness of selection tests.
The consistent pattern of positive correlations among ability tests and criteria (positive manifold) means that the choice of which tests to use to predict performance in which jobs will not usually have a substantial impact on the validity of a test battery.
According to Murphy (2009), the three key challenges to content-oriented validation strategies are:
A - Given its widespread use, there is surprisingly little evidence showing that content validation actually works (at predicting perf)
B - The use of composites designed to maximize validity within a specific job family did not, in general, lead to higher validities in those families than the validities achieved using more general composites (Peterson et al., 2001).
C - The structure of selection tests limits the potential relevance of content matching. Murphy (2009) asked why comparisons between the content of tests and the content of jobs turn out to have so little bearing on the validity of selection tests.
According to Murphy (2009), the three benefits of content validation are:
- Literature suggests that employment tests that are seen as relevant to the job are much more likely to be acceptable to applicants and are less likely to be challenged or to cause applicants to develop negative views of organizations. Also more acceptable to the org, stakeholders (e.g., unions, political groups)
- Litigation defense
- Tie-breaker when different applicant test batteries have equal validity but differences in content validity
Hinkin (1998)
A brief tutorial on the development of measures for use in survey questionnaires
In an extensive review of the organizational behavior literature, Hinkin (1995) found that inappropriate domain sampling, poor factor structure, low internal consistency reliability and poor reporting of newly developed measures continue to threaten our understanding of organizational phenomena.
There are three major aspects of construct validation: (a) specifying the domain of the construct, (b) empirically determining the extent to which items measure that domain, and (c) examining the extent to which the measure produces results that are predictable from theoretical hypotheses (Nunnally, 1978).
Domain sampling theory states that it is not possible to measure the complete domain of interest, but that it is important that the sample of items drawn from potential items adequately represents the construct under examination (Hinkin, 1998).
There are two approaches to item generation. The deductive approach is sometimes called logical partitioning or classification from above; the inductive approach is also known as grouping, or classification from below (Hunt, 1991). The deductive approach requires a thorough review of the literature to develop the theoretical definition of the construct, which is then used as a guide for the development of items (Schwab, 1980).
The inductive approach is done by asking a sample of respondents to provide descriptions of their feelings about their organizations or to describe some aspect of behavior. An example might be, “Describe how your manager communicates with you.” Responses are then classified into a number of categories by content analysis based on key words or themes
There are a number of guidelines that one should follow in writing items. Statements should be simple and as short as possible, and the language used should be familiar to target respondents.
It is also important to keep all items consistent in terms of perspective, being sure not to mix items that assess behaviors with items that assess affective responses.
A very common question in scale construction is, “How many items?” There are no hard-and-fast rules guiding this decision, but keeping a measure short is an effective means of minimizing response biases caused by boredom or fatigue.
Adequate internal consistency reliabilities can be obtained with as few as three items (Cook et al., 1981).
Three items per factor are usually adequate; four to six items per construct may be appropriate, depending on the construct and other contingencies (Hinkin, 1998).
With respect to scaling the items, it is important that the scale used generate sufficient variance among respondents for subsequent statistical analyses.
Likert (1932) developed scales composed of five equal-appearing intervals with a neutral midpoint: strongly disagree, disagree, neither agree nor disagree, agree, strongly agree.
Coefficient alpha reliability with Likert scales has been shown to increase up to the use of five points, but then it levels off (Lissitz & Green, 1975). If the scale is to assess the frequency of a behavior, it is very important that the researcher accurately benchmark the response range to maximize the obtained variance on the measure.
The new items should be administered along with other established measures to examine the “nomological network”— the relationship between existing measures and the newly developed scales.
Prior to conducting the factor analysis, the researcher may find it useful to examine the inter-item correlations; any item that correlates at less than .4 with all other items may be deleted from the analysis (Kim & Mueller, 1978).
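A minimal sketch on simulated data (not Hinkin's example) of that screen: flag items that correlate below .40 with every other item before factoring. The .40 cutoff follows the Kim & Mueller (1978) guideline cited above:

```python
# Simulated item scores: five items load on one common factor, one item is pure noise.
import numpy as np

rng = np.random.default_rng(0)
factor = rng.normal(size=(200, 1))
good_items = factor + rng.normal(scale=0.7, size=(200, 5))
odd_item = rng.normal(size=(200, 1))
data = np.hstack([good_items, odd_item])

corr = np.corrcoef(data, rowvar=False)
np.fill_diagonal(corr, 0.0)
flagged = [j for j in range(corr.shape[1]) if np.all(np.abs(corr[:, j]) < 0.40)]
print(flagged)  # expected: [5], the candidate for deletion before factor analysis
```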
The American Psychological Association (APA, 1995) states that an appropriate operational definition of the construct a measure purports to represent should include a demonstration of content validity, criterion-related validity, and internal consistency. Together, these provide evidence of construct validity: the extent to which the scale measures what it is purported to measure.
The researcher should have a strong theoretical justification for determining the number of factors to be retained.
Kerlinger (1986)
Construct validity forms the link between theory and psychometric measurement.
Schmitt & Klimoski (1991)
Construct validation is essential for the development of quality measures.
Cohen (1969)
It is important to note the difference between statistical and practical significance.
Cortina (1993)
Found that alpha is very sensitive to the number of items in a measure, and that alpha can be high in spite of low item intercorrelations and multidimensionality. This suggests that .70 should serve as an absolute minimum for newly developed measures and that, through appropriate use of factor analysis, the internal consistency reliability should be considerably higher than .70.
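A small sketch of Cortina's point using the standardized-alpha formula alpha = k * r_bar / (1 + (k - 1) * r_bar); the item counts and average inter-item correlations below are hypothetical:

```python
# Alpha rises with the number of items (k) even when the average inter-item correlation is modest.
def standardized_alpha(k: int, r_bar: float) -> float:
    return k * r_bar / (1 + (k - 1) * r_bar)

print(standardized_alpha(k=5,  r_bar=0.30))   # ~0.68
print(standardized_alpha(k=20, r_bar=0.15))   # ~0.78, despite weak intercorrelations
```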
Hinkin (1998) - Chi-Square
The chi-square statistic permits the assessment of fit of a specific model as well as the comparison between two models. The smaller the chi-square the better the fit of the model.
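A hedged sketch (hypothetical fit statistics; scipy assumed available) of the nested-model chi-square difference comparison implied here:

```python
# Chi-square difference test between nested models (values are made up for illustration).
from scipy.stats import chi2

chi2_constrained, df_constrained = 85.4, 35   # more restrictive model
chi2_full, df_full = 61.2, 33                 # less restrictive model

delta_chi2 = chi2_constrained - chi2_full
delta_df = df_constrained - df_full
p = chi2.sf(delta_chi2, delta_df)
print(f"delta chi2 = {delta_chi2:.1f}, df = {delta_df}, p = {p:.4f}")
# A significant difference favors the less restrictive, better-fitting model.
```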