Psychometrics Flashcards
Zickar & Broadfoot (2009)
“The notion that CTT is antiquated and that IRT will abolish it is, we believe, an urban legend.”
CTT uses terms such as true score or reliability.
A true score is the expected value of an individual's observed score on a particular test, so true scores are defined by both the person and the scale; an individual would have a different true score for each IQ test.
The false notion of a true score that exists independently of any test has been called the platonic notion of a CTT true score.
Defining the true score as an expected value (rather than as a platonic entity) narrows generalizability but avoids ontological difficulties.
Reliability, denoted rxx, is the proportion of observed score variance that is due to true score variance: true score variance divided by observed (total) variance. Estimates of reliability include:
Test-retest reliability, split-half, alternate forms, and internal consistency methods (e.g., Cronbach's alpha).
Reliability provides a measure of precision for a test in CTT.
The standard error of measurement is a function of reliability.
Generally, as reliability increases, the standard error decreases, meaning greater confidence about the precision of individual test scores.
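A minimal sketch (not from the article; values are hypothetical) of how reliability and the standard error of measurement relate, using SEM = SD_observed * sqrt(1 - r_xx):

```python
# Hypothetical illustration of the CTT standard error of measurement.
import math

def standard_error_of_measurement(sd_observed: float, reliability: float) -> float:
    """SEM = SD_x * sqrt(1 - r_xx); higher reliability -> smaller SEM."""
    return sd_observed * math.sqrt(1 - reliability)

# With an IQ-style SD of 15 (hypothetical values):
print(standard_error_of_measurement(15, 0.90))  # ~4.7
print(standard_error_of_measurement(15, 0.70))  # ~8.2
```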
Though CTT focuses on the scale score as the unit of analysis, several statistics are also used to assess item functioning. For example:
Item difficulty - the proportion of test takers who affirm the item (answer correctly, for ability items, or agree with the item, for personality items).
Item discrimination - describes how well an item differentiates between test takers with different levels of the trait measured by the scale (both statistics are computed in the sketch below).
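A minimal sketch with made-up response data (not from the article), computing item difficulty as the proportion affirming each item and item discrimination as a corrected item-total correlation (one common CTT index):

```python
# Hypothetical dichotomous responses: rows = test takers, columns = items; 1 = correct/agree.
import numpy as np

responses = np.array([
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 0],
])

# Item difficulty: proportion of respondents affirming each item.
difficulty = responses.mean(axis=0)

# Item discrimination: correlation of each item with the total of the remaining items.
totals = responses.sum(axis=1)
discrimination = np.array([
    np.corrcoef(responses[:, j], totals - responses[:, j])[0, 1]
    for j in range(responses.shape[1])
])

print(difficulty)       # e.g., 0.8, 0.6, 0.2, 0.8
print(discrimination)
```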
Item Response Theory (IRT)
Focuses on measuring a latent construct believed to underlie the responses to a given test; the latent trait is symbolized by theta. IRT has 2 primary assumptions:
*(1) Unidimensionality - the test measures only one latent trait; and (2) local independence - items in a scale are correlated solely because of theta, so if theta were partialled out, the items would be uncorrelated.
A cornerstone of IRT is the item response function (IRF), which relates theta to the expected probability of affirming an item. The IRF is determined by item parameters, which consist of the following (see the sketch after this list):
Item difficulty - the location on the theta continuum where the item is most discriminating; items with high difficulty will only be endorsed by respondents with large positive thetas.
Item discrimination - has the same goal as under CTT; it is reflected in the slope of the IRF at its inflection point, which falls at or near the item difficulty.
Pseudo-guessing parameter - relates to the probability that an individual with an extremely low theta will answer an item correctly (e.g., extremely low-theta people can still guess correctly with a probability of roughly 1 / number of response options).
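A hedged sketch of the standard three-parameter logistic (3PL) item response function, with hypothetical parameter values matching the three parameters listed above (a = discrimination, b = difficulty, c = pseudo-guessing):

```python
# 3PL item response function: P(affirm | theta) = c + (1 - c) / (1 + exp(-a * (theta - b))).
import math

def irf_3pl(theta: float, a: float, b: float, c: float) -> float:
    return c + (1 - c) / (1 + math.exp(-a * (theta - b)))

# A hard, discriminating four-option item: low-theta respondents hover near c = .25.
print(irf_3pl(-2.0, a=1.5, b=1.0, c=0.25))  # ~0.26
print(irf_3pl( 1.0, a=1.5, b=1.0, c=0.25))  # ~0.63 (theta equal to the item difficulty)
print(irf_3pl( 3.0, a=1.5, b=1.0, c=0.25))  # ~0.96
```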
Information quantifies the amount of uncertainty removed by considering item responses. Items with a difficulty parameter close to the person's theta level, and strong discrimination, provide more information.
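A small sketch of the item information function under a 2PL model (a simplifying assumption; the 3PL version adds a correction for guessing), showing that information peaks at the item difficulty and scales with the squared discrimination:

```python
# 2PL item information: I(theta) = a^2 * P(theta) * (1 - P(theta)), hypothetical parameters.
import math

def info_2pl(theta: float, a: float, b: float) -> float:
    p = 1 / (1 + math.exp(-a * (theta - b)))
    return a ** 2 * p * (1 - p)

print(info_2pl(theta=0.0, a=2.0, b=0.0))  # 1.0, the peak (theta equals difficulty)
print(info_2pl(theta=2.0, a=2.0, b=0.0))  # ~0.07, far from the item's difficulty
```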
In contrast to CTT true scores, theta estimates from two different tests measuring the same construct should be equivalent within sampling error (remember, in CTT the true score depends on the test, because different tests each yield their own true score).
*A serious limitation of CTT is that its statistics and parameters are both sample dependent and test dependent: true scores (person parameters) are test dependent, and item difficulty and item discrimination (item parameters) are sample dependent.
It is therefore not easy to compare true scores across tests.
Also, in CTT, item statistics are dependent on the sample used to estimate them (Zickar & Broadfoot, 2009).
Downsides of IRT: to estimate parameters with the same precision as CTT, larger sample sizes are needed, because IRT models are more complex and contain more parameters.
IRT also requires a strong assumption of unidimensionality, and its specialized estimation software can be difficult to run.
Re: the assumption of unidimensionality - it goes with local independence, which means that if you control for theta, there should be no correlation between items (i.e., all correlation among items is due solely to theta). This is often not the case (e.g., Big Five traits may correlate with each other).
Ironically, with somewhat (but not too) multidimensional data, IRT may do better, because deviant items can be weighted less when computing trait scores, whereas in CTT items are usually weighted equally.
CTT more easily supports other statistical methods such as EFA, CFA, and SEM, which are built on the CTT measurement foundation (X = T + E).
But no doubt about it, IRT is better if you want to concentrate measurement precision at a certain range of the latent trait.
Applications such as differential item functioning (DIF), appropriateness measurement, and computerized adaptive testing (CAT) are advanced psychometric tools that require IRT.
With appropriateness measurement, researchers try to identify respondents who answer items in an idiosyncratic way that sets them apart from other respondents. IRT looks for people who deviate from the IRT model, and it does better than CTT at this.
IRT advances the psychometric quality of our instruments, particularly by allowing tests of more specific hypotheses, being theory based, and facilitating advanced psychometric tools.
But it requires unidimensionality and large samples because of its greater number of parameters.
Melchers et al. (2020)
Review of applicant faking in selection interviews
Researching applicant faking in employment interviews is new, mostly emerging in the 2010s
This is surprising given that the vast majority of applicants engage in some degree of faking in interviews.
Faking good is more common, but faking bad happens in rare cases such as attempting to receive further unemployment benefits or avoiding compulsory military service.
Socially desirable responding (SDR) is usually described as comprising two facets: self-deception and impression management (directed at others).
SDR is more of a trait, while faking is more of a state.
It is possible that applicants also use nonverbal behaviors deceptively (e.g., fake smile or laugh), and future research could explore this form of faking (Melchers et al., 2020).
Future research should look for whether context factors like competitive industry or rough economy increase likelihood of faking.
Criterion-related validity can be influenced in different directions, and data often become homogenized as applicants fake to appear similarly desirable.
Faking is less common among more qualified applicants than among less qualified applicants.
Practical applications:
Increasing the structure of the interview reduces applicant faking and its effects on interview performance ratings.
*Training based on content-based lie detection strategies is a viable strategy to help interviewers deal with faking.
DO NOT: Use warnings that faking can be detected (unless you are prepared to deceive applicants about the actual possibility of detecting faking).
DO NOT: Rely on interviewers’ intuition, experience, or abilities (e.g., emotional intelligence) to try to detect when/how applicants fake.
*DO NOT: Rely on non-verbal behaviors to identify whether applicants are honest or faking
DO:
When trying to assess the veracity of applicants' responses, train interviewers to focus on content and rely on a combination of indicators (e.g., level of detail, plausibility).
DO: Increase the degree of interview structure (e.g., use standardized and job-related questions).
Applicants usually fake to compensate for a lack of qualification or fit, and thus faking should negatively impact criterion-related validity
Conscientiousness and Agreeableness are usually negatively related to faking, whereas Extraversion and Neuroticism are usually positively related to faking. In addition to the Big Five, the Dark Triad, which is comprised of Psychopathy, Narcissism, and Machiavellianism (Paulhus & Williams, 2002), is linked to faking.
Levashina & Campion (2006)
There is just one model that is specifically tailored to faking in interviews (Levashina & Campion, 2006). In this model, Levashina and Campion (2006) consider faking as a function of capacity, willingness, and opportunity. Levashina and Campion’s model is a multiplicative model, so that all factors must be present at least to some extent for faking to occur.
Kuncel (2001)
MA of GRE score predictive validity
The verbal, quantitative, analytical, and subject tests showed validity generalization (i.e., predicted performance across fields) for criteria including GPA, 1st-year GPA, faculty ratings, comprehensive exam scores, citation counts, and degree attainment.
Not a lot of evidence of moderation of the effect.
GRE general score & Undergrad GPA had very similar validities.
Subject GRE scores were better predictors than general GRE scores. It may be that motivation for a given field makes subject tests better, or that subject-test takers have had a head start on the content compared to classmates; future research should answer this question.
WHY THE SIOP Principles summary says predictors shouldn't be redundant: for other predictors to provide incremental validity, they must BOTH be correlated with the criterion AND be weakly related (ideally uncorrelated) to the other predictors used in the selection system.
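A hedged illustration (hypothetical correlations, not from Kuncel) of why redundancy matters: with the standard two-predictor formula R^2 = (r_y1^2 + r_y2^2 - 2*r_y1*r_y2*r_12) / (1 - r_12^2), the increment a second test adds over the first shrinks as the predictor intercorrelation r_12 grows:

```python
# Incremental validity of a second predictor under different predictor intercorrelations.
def r_squared_two_predictors(r_y1: float, r_y2: float, r_12: float) -> float:
    return (r_y1**2 + r_y2**2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12**2)

r_y1, r_y2 = 0.50, 0.40           # each predictor's (hypothetical) validity
for r_12 in (0.0, 0.3, 0.7):      # predictor intercorrelation
    r2 = r_squared_two_predictors(r_y1, r_y2, r_12)
    print(f"r_12={r_12}: R^2={r2:.3f}, increment over test 1 alone={r2 - r_y1**2:.3f}")
# Increments of about .16, .07, and .005: near-redundant predictors add little.
```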
Murphy (2009)
Content Validation Is Useful for Many Things, but Validity Isn’t One of Them
Content-oriented validation strategies establish the validity of selection tests as predictors of performance by comparing the content of the tests with the content of the job. These comparisons turn out to have little if any bearing on the predictive validity of selection tests.
***There is little empirical support for the hypothesis that the match between job content and test content influences validity, and there are often structural factors in selection (e.g., positive correlations among selection tests) that strongly limit the possible influence of test content on validity.
Comparisons between test content and job content have important implications for the acceptability of testing, the defensibility of tests in legal proceedings, and the transparency of test development and validation, but these comparisons have little if any bearing on validity.
The legislative and judicial history of the current Civil Rights Act suggests that job relatedness and validity are not necessarily synonymous; tests that are valid predictors of performance (e.g., Raven's Progressive Matrices) might not be judged to be job related because of the lack of any manifest relationship between the test content and the content of a job.
Numerous factors, ranging from poor item writing and range restriction to differences in respondents’ understandings of and reaction to response formats, might lead to situations in which scores on tests that seem job related turn out to provide little help in making valid selection decisions (Murphy, 2009).
When the tests considered as possible predictors of performance are positively correlated with one another and with the criterion (i.e., they show positive manifold, which is often the case), content-oriented assessments of validity have very little to do with whether or not test scores predict job performance.
*The effects of positive manifold are most pronounced when the correlations among tests are large (e.g., .50 or above), but they are substantial even when the correlations among tests are considerably smaller.
A third approach is one in which subject matter experts are asked to make judgments about overlaps in the knowledge, skills, and abilities (KSAs) required to do well on a test and those required to do well on a job.
Suppose a consulting firm develops selection test batteries for entry-level machine operator jobs in one organization and for data entry clerks in another. It uses careful job analyses to develop reliable measures of the knowledge, skills, and abilities that closely match the content of each job. As a result of a mix-up in the mailroom, the test batteries are sent to the wrong organizations. Murphy (2009) notes that, given the existing literature on KSAs and prediction, the batteries would probably predict roughly the same at each organization anyway.
WOW: Both the Uniform Guidelines and case law suggest that job relatedness represents an adequate justification for a set of tests, regardless of the empirical relationship between test scores and measures of job performance and effectiveness.
A test battery that is both job related and valid as a predictor of job performance is best, but if forced to make a choice between job relatedness and validity, most stakeholders are likely to prefer tests that seem logically related to the job over alternatives that might show equivalent or even higher validity but no apparent job relatedness.
Griggs & Civil Rights Act of 1991
A selection procedure is job related if there is a manifest relationship between the test and the job, based on a structured analysis of the tasks, duties, and responsibilities of the job.
Thus, according to Murphy (2009), there is little doubt that content-oriented methods of validating tests are useful for establishing the job relatedness of selection tests.
The consistent pattern of positive correlations among ability tests and criteria (positive manifold) means that the choice of which tests to use to predict performance in which jobs will not usually have a substantial impact on the validity of a test battery.
According to Murphy (2009), the three key challenges to content-oriented validation strategies are:
A - Given its widespread use, there is surprisingly little evidence showing that content validation actually works (at predicting perf)
B - The use of composites designed to maximize validity within a specific job family did not, in general, lead to higher validities in those families than the validities achieved using more general composites (Peterson et al., 2001).
C - The structure of selection tests limits the potential relevance of content matching. Murphy (2009) asked why comparisons between the content of tests and the content of jobs turn out to have so little bearing on the validity of selection tests.
According to Murphy (2009), the three benefits of content validation are:
- Literature suggests that employment tests that are seen as relevant to the job are much more likely to be acceptable to applicants and are less likely to be challenged or to cause applicants to develop negative views of organizations. Also more acceptable to the org, stakeholders (e.g., unions, political groups)
- Litigation defense
- Tie-breaker when different applicant test batteries have equal validity but differences in content validity
Hinkin (1998)
A brief tutorial on the development of measures for use in survey questionnaires
In an extensive review of the organizational behavior literature, Hinkin (1995) found that inappropriate domain sampling, poor factor structure, low internal consistency reliability and poor reporting of newly developed measures continue to threaten our understanding of organizational phenomena.
There are three major aspects of construct validation: (a) specifying the domain of the construct, (b) empirically determining the extent to which items measure that domain, and (c) examining the extent to which the measure produces results that are predictable from theoretical hypotheses (Nunnally, 1978).
Domain sampling theory states that it is not possible to measure the complete domain of interest, but that it is important that the sample of items drawn from potential items adequately represents the construct under examination (Hinkin, 1998).
There are two approaches to item generation. The deductive approach is sometimes called logical partitioning or classification from above; the inductive approach is also known as grouping, or classification from below (Hunt, 1991). The deductive approach requires a thorough review of the literature to develop the theoretical definition of the construct, which is then used as a guide for the development of items (Schwab, 1980).
The inductive approach is done by asking a sample of respondents to provide descriptions of their feelings about their organizations or to describe some aspect of behavior. An example might be, “Describe how your manager communicates with you.” Responses are then classified into a number of categories by content analysis based on key words or themes
There are a number of guidelines that one should follow in writing items. Statements should be simple and as short as possible, and the language used should be familiar to target respondents.
It is also important to keep all items consistent in terms of perspective, being sure not to mix items that assess behaviors with items that assess affective responses.
A very common question in scale construction is, “How many items?” There are no hard-and-fast rules guiding this decision, but keeping a measure short is an effective means of minimizing response biases caused by boredom or fatigue.
Adequate internal consistency reliabilities can be obtained with as few as three items (Cook et al., 1981).
Three items per factor are usually adequate; four to six items per construct may be appropriate, depending on the construct and other contingencies (Hinkin, 1998).
With respect to scaling the items, it is important that the scale used generate sufficient variance among respondents for subsequent statistical analyses.
Likert (1932) developed scales composed of five equal-appearing intervals with a neutral midpoint: strongly disagree, disagree, neither agree nor disagree, agree, strongly agree.
Coefficient alpha reliability with Likert scales has been shown to increase up to the use of five points, but then it levels off (Lissitz & Green, 1975). If the scale is to assess the frequency of a behavior, it is very important that the researcher accurately benchmark the response range to maximize the obtained variance on the measure.
The new items should be administered along with other established measures to examine the “nomological network”— the relationship between existing measures and the newly developed scales.
Prior to conducting the factor analysis, the researcher may find it useful to examine the inter-item correlations; any item that correlates at less than .4 with all other items may be deleted from the analysis (Kim & Mueller, 1978).
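A minimal sketch on simulated data (not Hinkin's example) of that screen: flag items that correlate below .40 with every other item before factoring. The .40 cutoff follows the Kim & Mueller (1978) guideline cited above:

```python
# Simulated item scores: five items load on one common factor, one item is pure noise.
import numpy as np

rng = np.random.default_rng(0)
factor = rng.normal(size=(200, 1))
good_items = factor + rng.normal(scale=0.7, size=(200, 5))
odd_item = rng.normal(size=(200, 1))
data = np.hstack([good_items, odd_item])

corr = np.corrcoef(data, rowvar=False)
np.fill_diagonal(corr, 0.0)
flagged = [j for j in range(corr.shape[1]) if np.all(np.abs(corr[:, j]) < 0.40)]
print(flagged)  # expected: [5], the candidate for deletion before factor analysis
```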
The American Psychological Association (APA, 1995) states that an appropriate operational definition of the construct a measure purports to represent should include a demonstration of content validity, criterion-related validity, and internal consistency. Together, these provide evidence of construct validity: the extent to which the scale measures what it is purported to measure.
The researcher should have a strong theoretical justification for determining the number of factors to be retained.
Kerlinger (1986)
Construct validity forms the link between theory and psychometric measurement.
Schmitt & Klimoski (1991)
Construct validation is essential for the development of quality measures.
Cohen (1969)
It is important to note the difference between statistical and practical significance.
Cortina (1993)
Found that alpha is very sensitive to the number of items in a measure, and that alpha can be high in spite of low item intercorrelations and multidimensionality. This suggests that .70 should serve as an absolute minimum for newly developed measures and that, through appropriate use of factor analysis, the internal consistency reliability should be considerably higher than .70.
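A small sketch of Cortina's point using the standardized-alpha formula alpha = k * r_bar / (1 + (k - 1) * r_bar); the item counts and average inter-item correlations below are hypothetical:

```python
# Alpha rises with the number of items (k) even when the average inter-item correlation is modest.
def standardized_alpha(k: int, r_bar: float) -> float:
    return k * r_bar / (1 + (k - 1) * r_bar)

print(standardized_alpha(k=5,  r_bar=0.30))   # ~0.68
print(standardized_alpha(k=20, r_bar=0.15))   # ~0.78, despite weak intercorrelations
```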
Hinkin (1998) - Chi-Square
The chi-square statistic permits the assessment of fit of a specific model as well as the comparison between two models. The smaller the chi-square the better the fit of the model.
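A hedged sketch (hypothetical fit statistics; scipy assumed available) of the nested-model chi-square difference comparison implied here:

```python
# Chi-square difference test between nested models (values are made up for illustration).
from scipy.stats import chi2

chi2_constrained, df_constrained = 85.4, 35   # more restrictive model
chi2_full, df_full = 61.2, 33                 # less restrictive model

delta_chi2 = chi2_constrained - chi2_full
delta_df = df_constrained - df_full
p = chi2.sf(delta_chi2, delta_df)
print(f"delta chi2 = {delta_chi2:.1f}, df = {delta_df}, p = {p:.4f}")
# A significant difference favors the less restrictive, better-fitting model.
```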