Selection (Measurement/Testing/Reliability) Flashcards

1
Q

define measurement in the context of selection

A

“systematic application of pre-established rules or standards for assigning scores to the attributes or traits of an individual.” - Gatewood & Field, 7e

2
Q

what is the overarching purpose of selection measures?

A

to be used as a predictor or criterion; to detect any true differences that may exist among individuals with regard to the attribute being measured

3
Q

A predictor or criterion measure is standardized if it possesses each of the following characteristics

A
1. Content: All persons being assessed are measured by the same information or content. This includes the same format (for example, multiple-choice, essay, and so on) and medium (for example, paper-and-pencil, computer, video).
2. Administration: Information is collected the same way in all locations and across all administrators, each time the selection measure is applied.
3. Scoring: Rules for scoring are specified before administering the measure and are applied the same way with each application. For example, if scoring requires subjective judgment, steps should be taken (such as rater training) to ensure inter-rater agreement or reliability.
4
Q

what scale must a selection criterion be measured at?

A

interval

5
Q

List and describe types of criterion methods

A
1. Objective production data: These data tend to be physical measures of work. Number of goods produced, amount of scrap left, and dollar sales are examples of objective production data.
2. Personnel data: Personnel records and files frequently contain information on workers that can serve as important criterion measures. Absenteeism, tardiness, voluntary turnover, accident rates, salary history, promotions, and special awards are examples of such measures.
3. Judgmental data: Performance appraisals or ratings frequently serve as criteria in selection research. They most often involve a supervisor's rating of a subordinate on a series of behaviors or outcomes found to be important to job success, including task performance, citizenship behavior, and counterproductive behavior. Supervisor or rater judgments play a predominant role in defining this type of criterion data.
4. Job or work sample data: These data are obtained from a measure developed to resemble the job in miniature or to sample specific aspects of the work process or outcomes (for example, a typing test for a secretary). Measurements (for example, quantity and error rate) are taken on individual performance of these job tasks, and these measures serve as criteria.
5. Training proficiency data: This type of criterion focuses on how quickly and how well employees learn during job training activities. Often, such criteria are labeled trainability measures. Error rates during a training period and scores on training performance tests administered during training sessions are examples of training proficiency data.
6
Q

what are the two basic options for choosing selection measures?

A

locate existing measures or create your own measures

7
Q

locating existing selection measures: discuss the advantages

A
1. Use of existing measures is usually less expensive and less time-consuming than developing new ones.
2. If previous research was conducted, we will have some idea about the reliability, validity, and other characteristics of the measures.
3. Existing measures often will be superior to what could be developed in-house.
8
Q

List the basic steps involved in developing your own selection measure

A
1. Analyzing the job for which a measure is being developed
2. Selecting the method of measurement to be used
3. Planning and developing the measure
4. Administering, analyzing, and revising the preliminary measure
5. Determining the reliability and validity of the revised measure for the jobs studied
6. Implementing and monitoring the measure in the human resource selection system
9
Q

creating your own selection measure: 1. work analysis

A

- A broader analysis of work can be used when technology and jobs are changing too rapidly for a traditional job analysis to be carried out.
- The purpose is to determine the KSAs necessary for the work activities or to identify employee competencies from a broader perspective.
- Provides the foundation for the criterion measures to be chosen or developed.

10
Q

creating selection measures: selecting the measurement method

A

Depends on:
- nature of the job tasks and level of responsibility
- skill of the people who will administer and score the measure
- costs
- resources available for development
- applicant characteristics
Choose the method that is most appropriate; for example, to test an industrial electrician applicant's ability to solder connections, you would not give a paper-and-pencil test, but probably a work sample test.

11
Q

creating selection measures: planning and developing the selection measure; specifications required for each measure

A

Prepare an initial version of the measure. Specifications should cover:
1. The purposes and uses the measure is intended to serve.
2. The nature of the population for which the measure is to be designed.
3. The way the behaviors or knowledge, skills, abilities, and other attributes (KSAOs) will be gathered and scored. This includes decisions about the method of administration, the format of test items and responses, and the scoring procedures.

12
Q

describe the general method for generating items for selection measures

A

Substantial work is involved in selecting and refining the items or questions to be used to measure the attribute of interest. This often involves having subject-matter experts (SMEs) create the items or rewrite them. In developing these items, the reviewers (for example, SMEs) should consider the appropriateness of item content and format for fulfilling its purpose, including characteristics of the applicant pool; clarity and grammatical correctness; and consideration of bias or offensive portrayals of a subgroup of the population.

13
Q

discuss the two types of response formats for selection measure responses

A

Broadly, there are two types of formats: the first uses objective or fixed-response items (multiple-choice, true-false); the second elicits open-ended, free-response formats (essay or fill-in-the-blank). The fixed-response format is the most popular; it makes efficient use of testing time, results in few (or no) scoring errors, and can easily and reliably be transformed into a numerical scale for scoring purposes. The primary advantage of the free-response format is that it can provide greater detail or richer samples of the candidates' behavior and may allow unique characteristics, such as creativity, to emerge. Primarily due to both the ease of administration and objectivity of scoring, fixed-response formats are most frequently utilized today, particularly if the measure is likely to be administered in a group setting. Finally, explicit scoring of the measure is particularly critical. Well-developed hiring tools will provide an "optimal" score for each item that is uniformly applied.

14
Q

creating selection measures: administering, analyzing, and revising

A

- Pilot testing.
- The measure should be administered to a sample of people from the same population for which it is being developed.
- Choice of participants should take into account the demographics, motivation, ability, and experience of the applicant pool of interest.
- If a test is being developed for which item analyses (for example, factor analyses or the calculation of means, standard deviations, and reliabilities) are to be performed, a sample of at least a hundred, preferably several hundred, will be needed.
- Based on the data collected, item analyses are performed on the preliminary data. The objective is to revise the proposed measure by correcting any weaknesses and deficiencies noted. Item analyses are used to choose content that discriminates between those who know and those who do not know the information covered.

15
Q

creating selection measures: psychometric characteristics to consider when analyzing pilot test data

A
1. The reliability or consistency of scores on the items. In part, reliability is based on the consistency and precision of the results of the measurement process and indicates whether items are free from measurement error.
2. The validity of the intended inferences. Do responses to an item differentiate among applicants with regard to the characteristics or traits that the measure is designed to assess? For example, if the test measures verbal ability, high-ability individuals will answer an item differently than those with low verbal ability. Often the items that differentiate are those with moderate difficulty, where 50 percent of applicants answer the item correctly. This is true for measures of ability, which have either a correct or incorrect answer.
3. Item fairness or differences among subgroups. A fair test has scores that have the same meaning for members of different subgroups of the population. Such tests would have comparable levels of item difficulty for individuals from diverse demographic groups. Panels of demographically heterogeneous raters, who are qualified by their expertise or sensitivity to linguistic or cultural bias in the areas covered by the test, may be used to revise or discard offending items as warranted. An item sensitivity review is used to eliminate or revise any item that could be demeaning or offensive to members of a specific subgroup.
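As an illustration of the item-level statistics described above, here is a minimal Python sketch using hypothetical pilot data; the difficulty (proportion correct) and item-total discrimination indices are standard formulas, not taken from the source.

```python
import numpy as np

# Rows = applicants, columns = items; 1 = correct, 0 = incorrect (hypothetical pilot data)
responses = np.array([
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 0],
])

# Item difficulty: proportion of examinees answering each item correctly
difficulty = responses.mean(axis=0)

# Item discrimination: correlation of each item with the total score on the remaining items
totals = responses.sum(axis=1)
discrimination = [
    np.corrcoef(responses[:, j], totals - responses[:, j])[0, 1]
    for j in range(responses.shape[1])
]

print("difficulty:", difficulty)                     # items near .50 discriminate best for ability tests
print("item-total r:", np.round(discrimination, 2))
```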
16
Q

creating selection measures: implementing the measure

A

After we obtain the necessary reliability and validity evidence, we can then implement our measure. Cut-off or passing scores may be developed. Norms or standards for interpreting how various groups score on the measure (categorized by gender, ethnicity, level of education, and so on) will be developed to help interpret the results. Once the selection measure is implemented, we will continue to monitor its performance to ensure that it is performing the function for which it is intended. Ultimately, this evaluation should be guided by whether the current decision-making process has been improved by the addition of the test.

17
Q

Using norms to interpret scores on selection measures

A

- A score may take on different meanings depending on how it stands relative to the scores of others in particular groups. Our interpretation will depend on the score's relative standing in these other groups.
- The norm group used for comparison should be relevant and comparable to the applicant group.
- Use local norms.
- Norms are transitory: they are specific to the point in time when they were collected and probably change over time.
- Norms are not always necessary in HR selection. For example, if five of the best performers on a test must be hired, or if persons with scores of 70 or better on the test are known to make suitable employees, then a norm is not necessary in employment decision making. One can simply use the individuals' test scores. On the other hand, if one finds that applicants' median selection test scores are significantly below that of a norm group, then the firm's recruitment practices should be examined. The practices may not be attracting the best job applicants; normative data would help in analyzing this situation.

18
Q

Reliability: definition (selection measures)

A

degree of dependability, consistency, or stability of scores on a measure used in selection - Gatewood, 7e

19
Q

In general, how is reliability of a measure determined?

A

by the degree of consistency between two sets of scores on the measure

20
Q

In general, what determines whether a measure has low or high reliability?

A

More measurement error = lower reliability; less measurement error = higher reliability.

21
Q

Discuss the concept of “true scores” in the context of reliability of selection measures

A

The true score is really an ideal conception. It is the score individuals would obtain if conditions external and internal to a measure were perfect. For example, in our mathematics ability test, an ideal or true score would be one for which both of the following conditions existed:
1. Individuals answered correctly the same percentage of problems on the test that they would have if all possible problems had been given and the test were a construct-valid measure of the underlying phenomenon of interest (see next chapter).
2. Individuals answered correctly the problems they actually knew without being affected by external factors such as lighting or temperature of the room in which the testing took place, their emotional state, or their physical health.
Because a true score can never be measured exactly, the obtained score is used to estimate the true score. Reliability answers this question: How confident can we be that an individual's obtained score represents his or her true score?

22
Q

Discuss the idea of error score in the context of reliability of selection measures

A

A second part of the obtained score is the error score. This score represents errors of measurement. Errors of measurement are those factors that affect obtained scores but are not related to the characteristic, trait, or attribute being measured. These factors, present at the time of measurement, distort respondents' scores either over or under what they would have been on another measurement occasion. There are many reasons why individuals' scores differ from one measurement occasion to the next. Fatigue, anxiety, or noise during testing that distracts some test takers but not others are only a few of the factors that explain differences in individuals' scores over different measurement occasions.

23
Q

discuss the function of the reliability coefficient in the context of selection measures

A

A reliability coefficient is simply an index of relationship. It summarizes the relation between two sets of measures for which a reliability estimate is being made. The calculated index varies from 0.00 to 1.00. In calculating reliability estimates, the correlation coefficient obtained is regarded as a direct measure of the reliability estimate. The higher the coefficient, the less the measurement error and the higher the reliability estimate. Conversely, as the coefficient approaches 0.00, errors of measurement increase and reliability correspondingly decreases. Of course, we want to employ selection measures having high reliability coefficients. With high reliability, we can be more confident that a particular measure is giving a dependable picture of true scores for whatever attribute is being measured.

24
Q

list the primary types of methods of estimating reliability for selection measures

A
1. test-retest
2. parallel (equivalent) forms
3. internal consistency
4. interrater
25
Q

Discuss the idea of test-retest reliability

A

Administer the measure twice and then correlate the two sets of scores using the Pearson product-moment correlation coefficient. This method is referred to as test-retest reliability because the same measure is used to collect data from the same respondents at two different points in time. Because a correlation coefficient is calculated between the two sets of scores over time, the obtained reliability coefficient represents a coefficient of stability. The coefficient indicates the extent to which the test can be generalized from one time period to the next. The higher the test-retest reliability coefficient, the greater the true score component and the less error present. If reliability were equal to 1.00, no error would exist in the scores; true scores would be perfectly represented by the obtained scores. Any factor that causes scores within a group to change differentially over time will decrease test-retest reliability. Similarly, any factor that causes scores to remain the same over time will increase the reliability estimate.
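A minimal Python sketch of the computation described above, using hypothetical scores for the same applicants tested on two occasions:

```python
import numpy as np

# Hypothetical scores for the same applicants at time 1 and time 2
time1 = np.array([78, 85, 62, 90, 71, 83, 67, 75])
time2 = np.array([80, 82, 65, 88, 70, 85, 64, 77])

# Test-retest reliability = Pearson correlation between the two administrations
r_tt = np.corrcoef(time1, time2)[0, 1]
print(f"coefficient of stability: {r_tt:.2f}")
```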

26
Q

guidelines for using test-retest reliability

A
1. Test-retest reliability is appropriate when the length of time between the two administrations is long enough to offset the effects of memory or practice.
2. When there is little reason to believe that memory will affect responses to a measure, test-retest reliability may be employed. Memory may have minimal effects in situations where (a) a large number of items appear on the measure, (b) the items are too complex to remember (for example, items involving detailed drawings, complex shapes, or detailed questions), and (c) retesting occurs after at least eight weeks.
3. When it can be confidently determined that nothing has occurred between the two testings that will affect responses (learning, for example), test-retest can be used.
4. When information is available on only a single-item measure, test-retest reliability may be used.
27
Q

Parallel forms; what does it do, how it’s measured, what it means, etc.

A

- Helps offset the effects of memory on test-retest reliability.
- A Pearson correlation is calculated between two sets of scores on different but equivalent items.
- As the coefficient approaches 1.00, the two forms are viewed as equivalent measures of the attribute.
- If equivalent forms are administered on different occasions, then this design also reflects the degree of temporal stability of the measure. In such cases, the reliability estimate is referred to as a coefficient of equivalence and stability. The use of equivalent forms administered over time accounts for the influence of random error due to the test content (over equivalent forms) and transient error (across situations).

28
Q

Parallel forms: the basic process

A
1. The process of developing equivalent forms begins with the identification of a universe of possible items (for example, all possible math ability items).
2. Items from this domain are administered to a large sample of individuals representative of those to whom the math ability test will be given.
3. Individuals' responses are used to identify the difficulty of the items through item analyses and to ensure the items are measuring the same math ability construct.
4. Next, the items are rank-ordered according to their difficulty and randomly assigned in pairs to form two sets of items (forms A and B).
29
Q

difference between alternate forms and parallel forms?

A

Because it is difficult to meet all of the criteria of equivalent forms, some writers use the term alternate forms to refer to forms that approximate but do not meet the criteria of parallel forms.

30
Q

internal consistency reliability measurement

A

An index of a measure's similarity of content is an internal consistency reliability estimate. Basically, an internal consistency reliability estimate shows the extent to which all parts of a measure (such as items or questions) are similar in what they measure. Thus a selection measure is internally consistent or homogeneous when individuals' responses on one part of the measure are related to their responses on other parts.
If the sample of selected items truly assesses the same concept, then respondents should answer these items in the same way. What must be determined for the items chosen is whether respondents answer the sample of items similarly.

31
Q

list the procedures most often applied to obtain internal consistency estimates

A

split-half, Kuder-Richardson, and Cronbach's alpha

32
Q

describe split half reliability

A

- Single administration of the measure; the measure is then divided or split into two halves.
- Performance on half 1 should be associated with performance on half 2 if all items measure the same attribute.
- The most common method is to split by even- and odd-numbered test items.
- Problem: there are many ways to split the test, and not all splits yield the same reliability estimate; this method is not used as much today.
- Limitation: cannot detect any errors associated with time.
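A minimal Python sketch of an odd/even split, with the Spearman-Brown correction used to step the half-test correlation up to full test length (hypothetical data):

```python
import numpy as np

# Hypothetical item responses: rows = examinees, columns = items (1 = correct, 0 = incorrect)
items = np.array([
    [1, 1, 1, 0, 1, 1, 0, 1],
    [1, 0, 1, 0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1, 0, 0, 1],
    [1, 1, 1, 0, 1, 1, 1, 0],
    [0, 0, 1, 0, 0, 1, 0, 0],
])

odd_half = items[:, 0::2].sum(axis=1)   # odd-numbered items (1, 3, 5, 7)
even_half = items[:, 1::2].sum(axis=1)  # even-numbered items (2, 4, 6, 8)

r_half = np.corrcoef(odd_half, even_half)[0, 1]

# Spearman-Brown correction projects the half-length correlation to the full-length test
r_full = (2 * r_half) / (1 + r_half)
print(f"split-half r = {r_half:.2f}, corrected reliability = {r_full:.2f}")
```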

33
Q

Describe Kuder-Richardson reliability estimates (KR20)

A

- Single administration of the measure.
- Used to determine the consistency of answers on any measure whose items are scored dichotomously (for example, right or wrong).
- Estimates the average of the reliability coefficients that would result from all possible ways of subdividing the test.
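A sketch of the KR-20 formula for dichotomously scored items (hypothetical data; texts differ slightly on the variance denominator convention):

```python
import numpy as np

def kr20(responses):
    """KR-20 for dichotomously scored items; rows = examinees, columns = items."""
    responses = np.asarray(responses, dtype=float)
    k = responses.shape[1]
    p = responses.mean(axis=0)                     # proportion answering each item correctly
    total_var = responses.sum(axis=1).var(ddof=1)  # variance of total test scores
    return (k / (k - 1)) * (1 - (p * (1 - p)).sum() / total_var)

# Hypothetical right/wrong data for six examinees on five items
data = [[1, 1, 0, 1, 1],
        [1, 0, 0, 1, 0],
        [1, 1, 1, 1, 1],
        [0, 0, 0, 1, 0],
        [1, 1, 0, 0, 1],
        [0, 1, 0, 0, 0]]
print(round(kr20(data), 2))
```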

34
Q

Describe Cronbach's alpha reliability estimate

A

- Can be used for continuous item responses.
- A general version of KR20.
- Still represents the average reliability computed from all possible split-half reliabilities.
- Gives the average correlation of each item with every other item on a measure.
- If coefficient alpha is unacceptably low, the items on the selection measure may be assessing more than one characteristic.
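A minimal sketch of coefficient alpha computed from an examinee-by-item score matrix (hypothetical Likert-type ratings):

```python
import numpy as np

def cronbach_alpha(items):
    """Coefficient alpha for an examinee-by-item matrix (continuous or dichotomous scores)."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)          # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)      # variance of total scale scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Example: 5 respondents rating 4 Likert-type items (hypothetical)
ratings = [[4, 5, 4, 4],
           [2, 3, 2, 3],
           [5, 5, 4, 5],
           [3, 3, 3, 2],
           [4, 4, 5, 4]]
print(round(cronbach_alpha(ratings), 2))
```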

35
Q

Describe interrater reliability estimates

A

- Used when scoring is based on individual judgment.
- Rater behaviors and rater biases contribute to rater error.
- Defined as the consistency among raters; determines whether multiple raters are consistent in their judgments.
- Computation can involve interrater agreement, interclass correlation, and intraclass correlation.

36
Q

Describe interrater agreement

A

- Some agreement indices are not good estimators of reliability.
- They often do not take chance agreement among raters into account.
- They are generally restricted to nominal or categorical data, which reduces their flexibility.

37
Q

interclass correlation

A

- Employed when two raters make judgments on an interval scale.
- Computed with the Pearson product-moment correlation or Cohen's weighted kappa.
- Shows the amount of error between the two raters.
- Relatively low interclass correlations indicate that more specific operational criteria for making the ratings, or additional rater training in how to apply the rating criteria, may be needed to enhance interrater reliability.

38
Q

describe Intraclass correlation

A

- When three or more raters have made ratings on one or more targets, intraclass correlations can be calculated.
- This procedure is usually viewed as the best way to determine whether multiple raters differ in their subjective scores or ratings on the trait or behavior being assessed.
- Intraclass correlation shows the average relationship among raters for all targets being rated.
- Indicates how much of the difference in ratings is due to true differences versus measurement error.
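A sketch of one common variant, the one-way random-effects ICC(1,1), computed directly from ANOVA mean squares; this is only an illustration with hypothetical ratings, and other ICC forms exist:

```python
import numpy as np

def icc_1_1(ratings):
    """One-way random-effects ICC(1,1): rows = targets (ratees), columns = raters."""
    x = np.asarray(ratings, dtype=float)
    n, k = x.shape
    grand_mean = x.mean()
    row_means = x.mean(axis=1)
    ms_between = k * ((row_means - grand_mean) ** 2).sum() / (n - 1)   # between-target mean square
    ms_within = ((x - row_means[:, None]) ** 2).sum() / (n * (k - 1))  # within-target mean square
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

# Hypothetical ratings of 5 candidates by 3 interviewers
print(round(icc_1_1([[4, 5, 4], [2, 3, 2], [5, 4, 5], [3, 3, 2], [4, 4, 5]]), 2))
```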

39
Q

How high should a reliability coefficient be?

A
- Unfortunately, there is no clear-cut, generally agreed upon value above which reliability is acceptable and below which it is unacceptable.
- Obviously, we want the coefficient to be as high as possible; however, how low a coefficient can be and still be used will depend on the purpose for which the measure is to be used.
- The following principle generally applies: the more critical the decision to be made, the greater the need for precision of the measure on which the decision will be based, and the higher the required reliability coefficient.
40
Q

list the factors that influence the reliability of a measure

A

- method of estimation
- stability
- sample used
- individual respondent differences
- length of measure
- test question difficulty
- homogeneity of measure content
- response format
- administration of the measure

41
Q

discuss the standard error of measurement in terms of reliability coefficients

A

- To obtain an estimate of the error for an individual, we can use another statistic called the standard error of measurement.
- The statistic is simply a number in the same measurement units as the measure for which it is being calculated.
- The higher the standard error, the more error present in the measure, and the lower its reliability.
- Thus the standard error of measurement is not another approach to estimating reliability; it is just another way of expressing it (using the test's standard deviation units).
- Because of its importance, the statistic should be routinely reported with each reliability coefficient computed for a measure. In practice, if you want to make a cursory comparison of the reliability of different measures, you can use the reliability coefficient. However, to obtain a complete picture of the dependability of information produced by a measure and to help interpret individual scores on the measure, the standard error of measurement is essential.
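A minimal sketch of the usual formula, SEM = SD * sqrt(1 - reliability), and the rough confidence band it gives around an obtained score (all values hypothetical):

```python
import math

sd_test = 10.0          # standard deviation of test scores (hypothetical)
reliability = 0.90      # reliability coefficient of the measure (hypothetical)

sem = sd_test * math.sqrt(1 - reliability)   # standard error of measurement
obtained = 75
# Roughly 68% confidence band for the true score (obtained score +/- 1 SEM)
print(f"SEM = {sem:.2f}; band = {obtained - sem:.1f} to {obtained + sem:.1f}")
```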

42
Q

when is it appropriate to use GMA tests in selection?

A

most appropriate for use at the entry level, not for jobs where prior job-specific preparation is required; when that is the case, content-valid tests are more appropriate and acceptable to users and applicants than GMA tests

43
Q

effect size

A

the measure of the difference in the average scores of subgroups, expressed in standard deviation (SD) units; an effect size of .88 means that one subgroup scores .88 SD units above the mean of the other subgroup
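A minimal sketch of a standardized mean difference (Cohen's d with a pooled SD), which is one common way such an effect size is computed; the data are hypothetical:

```python
import numpy as np

def cohens_d(group1, group2):
    """Standardized mean difference (Cohen's d) using the pooled standard deviation."""
    g1, g2 = np.asarray(group1, float), np.asarray(group2, float)
    n1, n2 = len(g1), len(g2)
    pooled_sd = np.sqrt(((n1 - 1) * g1.var(ddof=1) + (n2 - 1) * g2.var(ddof=1)) / (n1 + n2 - 2))
    return (g1.mean() - g2.mean()) / pooled_sd

# Hypothetical test scores for two subgroups
print(round(cohens_d([88, 92, 75, 81, 95, 84], [70, 78, 65, 74, 80, 72]), 2))
```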

44
Q

job knowledge tests; what are some keys to developing them?

A

making them job specific; this increases the criterion related validity, especially in high complexity jobs. It also increases face validity and perceptions of fairness. Using content validity as a way to develop job knowledge tests is likely to result in high criterion related validity.

45
Q

when are job knowledge tests inappropriate?

A

for entry level jobs where no prior job specific training or experience is required

46
Q

SJTs vs. GMA

A

SJTs correlate well with GMA tests, but have less adverse impact. They also have higher face validity than GMA tests.

47
Q

what is FA and how can it be used in creating a test or measurement process?

A

Factor analysis is a data reduction technique used to reduce the original data into a common set of categories, factors, or traits (Anastasi & Urbina, 1997). These factors are considered to make up the factorial composition of the test. Simply put, factor analysis allows us to determine relationships between variables. Factor analysis is most useful for more complex or latent variables (i.e., variables that cannot be directly measured). The result is that a large number of variables are collapsed into a smaller number of more interpretable variables, while accounting for as much variance as possible. The idea is that each of these variables will produce a similar pattern of responses that all tie to the latent trait being measured. Factors that explain the most variance in the observed variables are retained, while those that do not account for as much variance can be dropped.
In the context of test development, factor analysis can help us improve the psychometric properties of the test, provide evidence for validity, and decide which items to retain or drop from the test itself (e.g., via factor loadings). In particular, factor analysis can help in an exploratory sense by determining the dimensionality of the constructs to be measured through the instrument (Floyd & Widaman, 1995). The factors identified are then used as subtests for the overall instrument. Through the identification of common vs. unique variance, factor analysis can help us determine the structure of the instrument.
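A minimal sketch of an exploratory factor analysis on simulated item data, using scikit-learn's FactorAnalysis; the two-trait structure and all values are made up for illustration:

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)

# Simulate 200 respondents answering 6 items driven by 2 latent traits (hypothetical data)
latent = rng.normal(size=(200, 2))
true_loadings = np.array([[0.8, 0.0], [0.7, 0.1], [0.9, 0.0],    # items 1-3 load on trait 1
                          [0.0, 0.8], [0.1, 0.7], [0.0, 0.9]])   # items 4-6 load on trait 2
items = latent @ true_loadings.T + rng.normal(scale=0.5, size=(200, 6))

fa = FactorAnalysis(n_components=2).fit(items)
print(np.round(fa.components_, 2))   # estimated loadings: which items go with which factor
```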

48
Q

what is test bias and how can you reduce or possibly reduce test bias?

A

Test bias occurs when a test results in statistical errors of prediction for certain groups/subgroups. Bias can mean that a test functions differently or produces different results for certain groups. Some tests may not predict outcomes or behaviors consistently among subgroups. Bias can be a result of internal factors, such as the psychometric properties of the test, or external factors, such as selection. There are different types of test bias, such as construct, method, or item bias. One relevant type is cultural bias, in which the test bias occurs as a function of a cultural variable, such as ethnicity or gender (Reynolds & Suzuki). One common example of cultural test bias is intelligence tests, which have come under legal and social scrutiny in the past. When a test produces different scores between groups, but that difference is considered true, then there is no bias present. Sources of test bias can include inappropriate content, inappropriate standardization samples, test language, inequitable social consequences, and differential predictive validity (Reynolds & Suzuki).
Some ways to avoid test bias include using neutral language in test items, using standard English (i.e., avoiding slang), avoiding references to race, ethnicity, gender, age, etc., and avoiding language specific to a certain geographical location. Additionally, avoiding external influences such as stereotype threat can help reduce biases. It is also recommended that items asking for demographic information, such as race and gender, not be placed before the test items. With respect to the test development process, ways to prevent test bias include using a heterogeneous group of test developers, having minority individuals review test items, using performance items over language-heavy items, and pilot testing effectively (Green & Ross, 1979). Finally, objectively scorable measures should be used in order to reduce the possibility of bias resulting from subjective judgments and test scoring.

49
Q

explain what distracter analysis means and why you might perform this type of analysis

A

Distractor analysis involves comparing how the highest- and lowest-scoring 25% of examinees respond to the incorrect options presented on the test. Ideally, fewer of the higher scorers on the test should choose each of the distractors as their answers compared with the lower scorers. Distractor analysis helps us determine whether the alternative choices we provide on the test are functioning effectively; these options should draw the examinee away from the correct answer. A good distractor is one that is selected often enough; acceptable values depend on item difficulty and involve some judgment. The number of people who select each distractor also helps determine that particular distractor's effectiveness. For example, suppose 10 people took a test, the item difficulty is .6, and 6 of the 10 answered correctly (answer choice "B"). If all 4 examinees who answered incorrectly selected distractor A, then distractors C and D would be deemed ineffective; we should have roughly the same number of examinees selecting each distractor. Item discrimination can also be used, such that a negative value would be expected for a distractor because more low performers should select it (each option has its own discrimination value). Distractor analysis is important because it can affect test scores: if the distractors provided are not effective, a test might become too easy, and scores may become inflated, reflecting the weakness of the distractors rather than examinees' true ability levels.
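A minimal sketch of the high/low-group comparison described above, using hypothetical total scores and option choices for one item:

```python
import numpy as np

# Hypothetical multiple-choice data: total test scores and the option chosen on one item
totals = np.array([95, 90, 88, 72, 70, 65, 60, 55, 50, 45, 40, 35])
choices = np.array(['B', 'B', 'B', 'B', 'A', 'B', 'A', 'A', 'C', 'A', 'D', 'A'])
correct = 'B'

cut = len(totals) // 4                    # size of the top and bottom 25% groups
order = np.argsort(totals)[::-1]          # examinees sorted from highest to lowest total score
high, low = choices[order[:cut]], choices[order[-cut:]]

for option in ['A', 'C', 'D']:            # distractors only
    print(option, "high group:", (high == option).sum(), "low group:", (low == option).sum())
# A functioning distractor is chosen more often by the low-scoring group than by the high group
```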

50
Q

What is meant by “the criterion problem?” What impact could this problem have on criterion validation studies?

A

In selection, the criterion problem refers to issues that revolve around what is measured as the criterion, how it is measured, and how it is defined. A criterion measure that is easily accessible is not necessarily a valid measure of the criterion (e.g., job performance), which means that the interpretations made from it can be inaccurate or invalid. In organizations this can be especially difficult because of the fast-paced environment in which they often operate. Additionally, it may be hard to convince organizational leadership that one way of measuring job performance is more effective and valid than another. This often results in the use of convenience measures of performance, which may not be solid indicators of actual performance but rather a matter of ease of measurement. From a philosophical standpoint, it is difficult to be sure that even what we perceive as the "best" measures of performance actually are the best. Although we can do our best, it is never guaranteed that we are measuring performance properly, even when a suitable definition of performance is available. Jex (2002) discusses the tradeoff between what is easy to measure and what may be more meaningful. The criterion problem exists mainly in selection, but problems can also arise with organizational initiatives when organizations insist on using global measures of effectiveness rather than defining effectiveness using a variety of criteria. In terms of selection and validation studies, a domino effect takes place: if you are not using a valid criterion measure, you simply cannot interpret the study results as valid either.

51
Q

Briefly explain what the multi-trait multi-method matrix does and how it works.

A

The multi-trait multi-method matrix is an approach to convergent and discriminant validation proposed by Campbell & Fiske (1959). Two or more traits are measured using two or more methods. The matrix is a visualization of this process and includes all possible correlations among scores when multiple traits are measured by multiple methods. Reliability and validity coefficients are also presented. The validity coefficients are the correlations between scores obtained for the same trait by the different methods, so each measure is compared against other measures of that same trait. The table also includes correlations between different traits measured by the same method and correlations between different traits measured by different methods. The validity coefficients should be higher than the correlations for different traits measured by both the same and different methods in order to demonstrate construct validity.

52
Q

How do you decide when to use a passing point?

A

A passing point should be used to determine the differentiation point between minimally qualified applicants and non-qualified applicants, particularly for tests in which a pass/fail dichotomous scoring method is going to be applied. The purposes of the test should be the basis for setting passing points, and a passing point should be set by consulting SMEs rather than arbitrarily (for example, using the percentage of minimally qualified people who should be able to respond correctly to each item). The SMEs should be chosen based on their familiarity with the test content and the job being considered (i.e., the KSAs needed to perform the job). One should also consider the consequences of classification errors before putting a passing point in place.

53
Q

how do you decide when to use a rank order list in selection test development?

A

- Similar to strict top-down selection; maximizes utility.
- Used when combining two assessment scores to produce a ranked list based on the scores of the two tests.
- Use Spearman's r to evaluate relationships involving ordinal (rank-ordered) variables.

54
Q

Explain how you might use Fisher’s r to z Transformation for examining differential item functioning.

A

- Fisher's r to z transformation looks at differences between groups; it is used to find confidence intervals and to test differences between correlations.
- It tests the difference between two correlation coefficients from independent samples.
- DIF procedures differ from the standard methods of item analysis; they show the extent to which an item might be measuring different abilities for members of separate subgroups.
- The analysis looks at whether the items truly have differing correlational strength across groups, examining item characteristics and statistics to determine test differences.
- If the correlations between subgroups' scores and a given criterion (job performance, for example) are significantly different, this can serve as evidence of DIF.
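A minimal sketch of the Fisher r-to-z comparison of two independent correlations (the subgroup correlations and sample sizes are hypothetical):

```python
import math
from scipy.stats import norm

def compare_correlations(r1, n1, r2, n2):
    """Fisher r-to-z test for the difference between two independent correlations."""
    z1, z2 = math.atanh(r1), math.atanh(r2)        # Fisher transformation: 0.5 * ln((1+r)/(1-r))
    se = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    z = (z1 - z2) / se
    p = 2 * (1 - norm.cdf(abs(z)))                 # two-tailed p-value
    return z, p

# Hypothetical: item-criterion correlation of .45 (n = 120) in one subgroup vs .15 (n = 110) in another
z, p = compare_correlations(0.45, 120, 0.15, 110)
print(f"z = {z:.2f}, p = {p:.3f}")                 # a significant gap suggests possible DIF
```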

55
Q

a criterion measure is standardized if

A

- Content: all persons being assessed are measured by the same content and the same format.
- Administration: information is collected the same way in all locations and across administrations.
- Scoring: rules for scoring are specified before administration and applied the same way for each application of the measure.

56
Q

list the primary considerations when choosing or developing a selection measure (predictor)

A

what it’s going to measure: concisely define the construct of measurement cost effectiveness standardization: this includes how to handle situations for disabled applicants ease of use: less expensive when people scoring it don’t need a lot of training is it acceptable to the org and the candidate?

57
Q

list the primary considerations when choosing or developing a criterion

A

- Is the criterion job relevant?
- Is it acceptable to management? Managers must support its use based on their values.
- Are work changes likely to alter the need for the criterion? Criteria need to be reviewed periodically to ensure they are still relevant and appropriate for use.
- Is it free of bias so that meaningful comparisons can be made between individuals? Differences in tools and equipment, differences in work shifts, etc. may cause bias in measurement.
- Will it detect differences if differences actually do exist?

58
Q

list primary considerations that apply to both predictor and criterion choice/development

A

- Discrimination: significant differences among protected groups that systematically differ should be further examined.
- Does it lend itself to quantification? Quantitative data are preferable for selection; analysis is more meaningful and more sophisticated analyses can be used.
- Measure scoring consistency: all raters should receive the same training.
- How reliable are the data provided by the measure? Scores should be consistent across individuals on the same measure.
- Construct validity: how well is it measuring the construct it is intended to measure?

59
Q

steps in developing a selection measure

A
1. analyze the work
2. select the method of measurement
3. develop the specifications or plan of the measure
4. construct the preliminary measure
5. administer and analyze the preliminary measure
6. prepare a revised measure
7. determine the reliability and validity of the revised measure
8. implement and monitor the measure in the HR system
60
Q

steps in developing a selection measure: Step 1: analyze the work

A

This is how you know which constructs you want to measure; it provides information about criterion measures as well.

61
Q

describe the movement from job analysis to work analysis

A

The world of work is seeing an enormous shift in job responsibilities due to changes in the workforce and technology. This means that the specificity of a traditional job analysis is not always the best approach. Work analysis takes a broader scope, analyzing activities common across jobs or across the entire organization.

62
Q

steps in developing a selection measure: step 2: select the method of measurement

A

The specific nature of the job and the skill of the individuals responsible for scoring the measure need to be considered. Consider the costs of testing and the resources available, and think about the organizational context in which the job is performed. Which method of measurement is likely to yield the most information? What makes the most sense? For example, written tests make the most sense for knowledge constructs, while a work sample test would work better for evaluating an ability.

63
Q

steps in developing a selection measure: step 3: developing specifications for the selection measure

A

You need to clarify the nature and details of the measure by developing specifications for its construction. These specifications should include:
- purposes of the measure
- operational definition of the construct
- the way the KSAOs will be gathered and scored (administration, format, scoring procedures)
- nature of the population for which it is designed
- the time limits for completing the measure
- which statistical procedures to use in selecting and editing items
Develop a content outline that describes what a person in the job needs to know; this comes from job analysis information and discussions with incumbents and supervisors. For example, the knowledge statement "knowledge of principles of electrical wiring" would be broken down into further detail: "installing electrical grounds" and "reading drawings." Next, develop a "test item budget": how many items will you use to assess each content area?

64
Q

steps in developing a selection measure: step 4: construct the preliminary measure

A

Develop items according to the content outline developed when you made the specifications. Follow these steps:
1. Generate behavioral samples or items.
2. SMEs review and evaluate the items, considering the appropriateness of the item content and format, clarity, and whether an item might contain bias.
3. Develop the scoring and administration procedures; consider fixed-response (multiple-choice) vs. open-ended formats.

65
Q

steps in developing a selection measure: step 5: administer the preliminary form of the measure

A

Pilot test the initial form of the test. Administer it to a sample from the same population for which it is being developed; consider the demographics, motivation, ability, and experience of the applicant pool of interest. Make sure the sample is sizable; several hundred are preferred if doing any item or factor analyses.

66
Q

steps in developing a selection measure: step 6: revise or replace items as needed

A

Based on the pilot test data, perform item analyses to learn whether and how the measure should be revised. Consider these psychometric characteristics when doing item analyses:
- reliability or consistency of item scores
- validity of the intended inferences: do the items actually differentiate between ability levels of applicants? Look at item difficulty, especially for ability measures with right/wrong answers.
- item fairness: scores should have the same meaning for members of different subgroups of the population; compare levels of item difficulty to make sure. If revisions are needed, consult SMEs who have the relevant expertise.

67
Q

steps in developing a selection measure: step 7: determine reliability and validity of measure

A

conduct reliability and validity studies on the revised measure.

68
Q

steps in developing a selection measure: step 8: implement the selection measure

A

after getting reliability and validity evidence the measure can be integrated into the selection system. continue to monitor the measure’s performance to ensure it is performing well.

69
Q

list the primary ways to evaluate selection measure scores

A

norms, percentiles, or standard scores

70
Q

scoring selection measures: using norms; what are they and how should you decide to use them?

A

You need to know how others scored on the measure, and the validity of the measure, in order to interpret a score meaningfully. Norms are the scores of relevant others in comparison groups; they show how well an individual performs relative to a specified comparison group (called a normative sample). To choose the norming group and normative data to use, keep the following in mind:
- The norm group selected should be relevant for its purpose (if you are selecting experienced electricians, do not use a norm group of recent trade school graduates).
- Use local norms rather than national data; a local norm is one based on selection measures given to people applying for employment with a particular employer. At first you may have to rely on test manuals when the test is first rolled out, but local norms can be developed once you have 100 or more scores for a particular group. Passing scores should then be established on the basis of these local data.
- Norms are transitory (specific to the time point at which they were collected); they change over time. For attributes that are likely to change over time this is especially relevant, and more current normative data should be used.
- Norms do not always need to be used in selection. If you are doing top-down selection, for example, you would not need them; but if you notice that applicants' median scores are falling below a normative group's median scores, that organization's selection and recruitment practices should be examined.
- Be very cautious when using norms based on protected-class information like sex or race; it is best to avoid these.

71
Q

scoring selection measures: using percentiles

A

most frequently used in reporting normative data; percentiles show the percentage of persons in a norm group who fall below a given score on a measure.
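A minimal sketch using SciPy's percentileofscore with a hypothetical local norm group:

```python
from scipy.stats import percentileofscore

# Scores from the local norm group (hypothetical) and one applicant's score
norm_group = [52, 60, 61, 65, 68, 70, 72, 75, 78, 80, 84, 88, 90]
applicant = 75

# Percentage of the norm group scoring strictly below the applicant
print(percentileofscore(norm_group, applicant, kind='strict'))
```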

72
Q

scoring selection measures: using standard scores

A

They represent adjustments to raw scores used to determine the proportion of people who fall at various levels. z scores are a common standard score, typically ranging from about -4 to +4; you can use T scores to avoid negative numbers. The biggest issue is that standard scores are subject to misinterpretation because they do not tell you what a score means in terms of job performance.
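A minimal sketch of converting raw scores to z scores and T scores (hypothetical data):

```python
import numpy as np

raw_scores = np.array([52, 60, 61, 65, 68, 70, 72, 75, 78, 80, 84, 88, 90])

z_scores = (raw_scores - raw_scores.mean()) / raw_scores.std(ddof=1)
t_scores = 50 + 10 * z_scores          # T scores shift the scale to avoid negative values

print(np.round(z_scores, 2))
print(np.round(t_scores, 1))
```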

73
Q

discuss how to conduct content validation

A
1. Job analysis: define the job content domain.
2. Select the SMEs: select based on qualifications and experience.
3. Specification of measure content (domain sampling): the items are chosen for the measure to represent the KSAs determined via the job analysis; ensure that the questions have physical and psychological fidelity as much as possible.
4. Assessment of selection measure and job content relevance: SMEs rate the items for content relevance. A content validation index (CVI) can be computed to determine the overlap between the measure and the job content.
74
Q

list and discuss the primary factors that influence reliability of a measure

A

- Method of estimation.
- Individual differences: if variation among respondents is wide, the device can more reliably distinguish among people; greater variability = higher reliability.
- Stability: of the construct and of measurement over time; for example, measures of anxiety or emotion are going to be less reliable because they are less stable over time.
- The sample: should be representative of the population for which the measure will be used. Sample size also affects reliability; larger samples help ensure the influence of random error is normally distributed, so you get a better estimate.
- Length of measure: longer = higher reliability; remember you are sampling from the "universe" of all test items for a construct.
- Test question difficulty: with right/wrong answers, difficulty affects reliability by affecting differences among applicant test scores. If the items are very hard or very easy, differences in individuals' scores will be reduced because many will have roughly the same test score (either very low or very high). Finer distinctions between ability levels can be made with moderately difficult questions, which will increase reliability.
- Response format: more response categories = greater reliability via a greater spread in scores.
- Administration/scoring of the measure: can contribute to errors in scoring, which decreases reliability.
- Standard error of measurement: a number in the same units as the measure that indicates whether scores for individuals differ significantly from one another and how confident we can be in the scores we get from different groups of people.

75
Q

when might content validation not be appropriate?

A

For KSAs that are more abstract and less observable, it is difficult to accurately establish content validity; other validation strategies, such as criterion-related ones, may be necessary. Regarding job analysis and content validation, a link needs to be established between the KSAs required for job performance and the KSAs needed to do well on the selection measure; this highlights the criticality of doing a very thorough and well-rounded job analysis. According to the Uniform Guidelines, content validation is not appropriate when psychological constructs are not directly observable, when the selection procedure involves KSAs an employee is expected to learn on the job, or when the content of the measure does not resemble work behavior.

76
Q

discuss the requirements for a criterion related validity study

A
1. The job should be reasonably stable, not in a period of transition.
2. The criterion should be reasonably free from contamination.
3. Access to a representative sample is required.
4. A large enough and representative sample of people on whom both the predictor and criterion data have been collected must be available; several hundred or more are needed, or a Type II error may occur.
77
Q

steps to conducting a construct validation study

A

This is where you establish that the measure is assessing the construct it is supposed to measure via its relationships with other constructs.
1. Define the construct and formulate hypotheses about its relationships with other variables.
2. Construct the measure.
3. Test the hypothesized relationships between the measure and measures of the other constructs.

78
Q

In terms of criterion related validity, what range of a coefficient are you looking for?

A

.3-.5; it is unlikely that a measure will exceed .5

79
Q

discuss the idea of cross-validation in relation to criterion related validity

A

Because of the possibility of error when using a regression equation for criterion-related validity estimation on a new group, cross-validation is important; it can be done using empirical or formula-based estimations.
Empirical estimation (split-sample method): a group of people with predictor and criterion data is randomly divided into two groups. A regression equation is developed on one of the groups (the weighting group), and this equation is used to predict the criterion for the other (hold-out) group. These predicted criterion scores are then correlated with the hold-out group's actual criterion scores. Look for a significant correlation between the two; this indicates that the regression equation is useful for people other than those on whom the equation was developed.
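A minimal sketch of the split-sample approach with simulated predictor and criterion data (all values hypothetical):

```python
import numpy as np
from scipy.stats import linregress, pearsonr

rng = np.random.default_rng(1)

# Hypothetical predictor (test score) and criterion (performance rating) for 200 employees
test = rng.normal(50, 10, 200)
perf = 0.04 * test + rng.normal(0, 0.6, 200)

# Split-sample cross-validation: develop the equation on the weighting group,
# then apply it to the hold-out group and correlate predicted with actual criterion scores
idx = rng.permutation(200)
weight_idx, holdout_idx = idx[:100], idx[100:]

fit = linregress(test[weight_idx], perf[weight_idx])
predicted = fit.intercept + fit.slope * test[holdout_idx]

r, p = pearsonr(predicted, perf[holdout_idx])
print(f"cross-validity r = {r:.2f} (p = {p:.3f})")
```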

80
Q

discuss factors affecting the size of validity coefficients

A

- Reliability of the criterion and predictor: error lowers validity; reliability is a necessary precursor to validity. Use a correction formula (correction for attenuation) when needed.
- Range restriction: the variance in scores is reduced when people scoring low on a test were not hired; this is especially true for predictive validation strategies. Criterion scores are also often restricted because of turnover, good people being hired, and so on. A correction formula can be applied.
- Criterion contamination.
- Violation of statistical assumptions: there must be a linear relationship between predictor and criterion, or else the validity coefficient will be an underestimate of the true relationship.
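A minimal sketch of two standard corrections of the kind mentioned above: correction for attenuation due to criterion unreliability, and the Thorndike Case II range-restriction correction (illustrative values only):

```python
import math

def correct_for_attenuation(r_xy, r_yy):
    """Correct an observed validity for unreliability in the criterion only."""
    return r_xy / math.sqrt(r_yy)

def correct_for_range_restriction(r, u):
    """Thorndike Case II correction; u = unrestricted SD / restricted SD of the predictor."""
    return (r * u) / math.sqrt(1 - r**2 + (r**2) * (u**2))

r_obs = 0.30                                            # observed validity (hypothetical)
print(round(correct_for_attenuation(r_obs, r_yy=0.70), 2))
print(round(correct_for_range_restriction(r_obs, u=1.5), 2))
```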

81
Q

discuss the idea of validity generalization and its steps

A

Validity generalization involves evidence that validity information accumulated from selection measures across a series of studies can be applied to a new setting involving the same or a similar job.
1. Obtain a large sample of published and unpublished validation studies.
2. Compute the average validity coefficient for these studies.
3. Calculate the variance of the differences among these validity coefficients.
4. Subtract from the amount of the differences the variance due to the effects of small sample size.
5. Correct the coefficient for error.
6. Compare the corrected variance to the average validity coefficient to determine the variation in study results.
7. If the differences among the validity coefficients are very small, conclude that validity coefficient differences are due to methodological issues and not to the nature of the situation; the validity is therefore generalizable across situations.
It is used mainly with general mental ability testing, paper-and-pencil tests, biodata, personality inventories, and assessment centers, and takes a global perspective on jobs.

82
Q

discuss the idea of job component validity and steps to conduct a job component validity study

A

It is a standardized way of obtaining information on the jobs for which a validation study is being conducted; it involves a more detailed examination of a job. You assume that KSAs are similar across similar jobs and that the validity of a KSA for a job performance component is reasonably consistent across jobs. Steps:
1. Conduct an analysis of the job using the PAQ.
2. Identify the major components of work for the job.
3. Identify the attributes required for performing these components using SME ratings.
4. Choose tests that measure these attributes as identified via the PAQ.

83
Q

what is the PAQ?

A

a commercially available questionnaire that contains a comprehensive list of general job behaviors. a respondent answers the PAQ by indicating the extent to which these descriptions accurately reflect work behaviors of a certain job.

84
Q

methods for collecting and combining predictor information

A

In selection there are various methods for collecting information on job applicants and for combining that information. The basic collection methods are either mechanical (e.g., an applicant taking a mental ability test) or judgmental (e.g., an applicant interview), and these can also be combined. In order to reach a selection decision, you must combine this information. Combining data mechanically may involve using statistics, such as an additive or regression equation that predicts job performance. Judgmental combining might involve "gut instincts" about applicants, or human intuition forming an overall impression of an applicant. The possible approaches include different combinations of judgmental vs. mechanical collection and combining, but the best approach, I think, is the "mechanical composite," which involves collecting both mechanical and judgmental data and then combining the data mechanically using regression. Mechanical techniques yield better results: they involve less error, accuracy can be built up over time as more data are collected for prediction, and they remove the bias involved in human judgment in decision making.

85
Q

selection decision makers should consider what things?

A

- Use standardized selection procedures that are reliable and valid.
- Use selection procedures that minimize the use of human judgment in information collection; create keys and scoring criteria for those that do involve human judgment.
- It is generally better to use mechanical combining techniques, especially statistical equations.

86
Q

when is it best to use multiple cutoffs with P/F, vs multiple hurdle, vs. combination method for combining predictor scores

A

It's best to use multiple cutoffs with pass/fail scoring for each predictor when physical abilities are important to job performance.

Multiple hurdle is best when subsequent training is expensive or complex, or when one KSA cannot be compensated for by another KSA. Range restriction is likely to be a problem, but this approach is less costly.

Combination method: uses cutoffs and multiple regression to calculate an overall score for those who pass the cutoffs. Best used when more of one KSA can compensate for a lack of another KSA. Everyone completes every measure, so it can be costly.

87
Q

main approaches for making selection decisions

A

- Top-down selection: maximum utility, but problems with adverse impact.
- Cutoff scores: can be based on how job applicants or other people performed on the measure (by developing local norms) or on SME judgments.
- Banding.

88
Q

scale development process

A
1. Determine what you want to measure: theory guides this step and must be formulated and laid out first. Think about how general or specific you want your measure to be; remember that broad predicts broad and specific predicts specific, so consider the intended function of the scale. Determine if and how the construct is conceptually and theoretically distinct from other constructs.

2. Generate an item pool: each item should reflect the purpose of the scale and the definition of the latent variable you’re measuring. Redundancy can be good or bad, so find a balance: you want redundancy in measurement content to increase reliability and to help decide which items are best, but not redundancy in wording or grammatical structure. The more specific your construct, the more similar some of your items will probably be. Number of items: the more the better, but not so many that your sample suffers from fatigue; if an item pool is too large, cut it down before administering. Item-writing suggestions: avoid lengthy items and double-barreled items; consider and quantify the reading level (5th–7th grade is good for the general population); negatively worded items can confuse people.

3. Determine the format for measurement: consider the type of scale and the number of response categories. The physical placement of options matters, as does whether respondents can discriminate between response options meaningfully.

4. SME item pool review: provide a working definition of the construct and have SMEs rate items on relevance, clarity, and conciseness; discuss items if necessary to determine their quality and inclusion in the scale (Mumford et al. method).

5. Consider inclusion of validation items: you may want to include a social desirability scale or another measure so construct validity can be examined in the same sample.

6. Administer items to a pilot sample: about 300 people is adequate to eliminate subject variance; the needed size is a function of the number of items and the number of dimensions to be extracted. Too few people can distort the psychometric properties and paint an inaccurate picture of the item covariances.

7. Evaluate the items: inter-item correlations tell you about the items’ relationships to true scores; higher correlations imply higher individual item reliabilities, and we want items to be highly intercorrelated (inspect the correlation matrix). Item variances: higher is better, indicating good discrimination between different levels of the construct or ability. Item means: you want the mean close to the center of the range of possible scores, not at an extreme end of the distribution, so the item can assess all levels of the construct. Coefficient alpha indicates the proportion of variance in scale scores attributable to true score and assumes the items are unidimensional: .7 is a lower bound, .8–.9 is very good, and over .9 suggests you can shorten the scale because of item redundancy. (See the sketch after this list.)

8. Optimize scale length: the covariation among items plus the number of items influences the optimal length; adding items raises alpha, removing them lowers it. Determine the optimal trade-off between brevity and reliability by checking alpha-if-item-deleted; an item with a low item-total correlation will raise alpha if dropped. Look at item loadings if examining multidimensionality. Consider splitting the sample to cross-validate, since subsamples are more likely to represent the same population than a whole new sample.
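
A minimal Python sketch of the item-evaluation step (step 7 above), computing coefficient alpha, corrected item-total correlations, and alpha-if-item-deleted for a tiny hypothetical pilot data set:

# Minimal sketch of item evaluation. Rows = respondents, columns = items;
# the data are hypothetical and far too small for a real pilot study.
import numpy as np

data = np.array([
    [4, 5, 4, 3],
    [2, 2, 3, 2],
    [5, 4, 5, 4],
    [3, 3, 2, 3],
    [4, 4, 4, 5],
], dtype=float)

def cronbach_alpha(x):
    # alpha = k/(k-1) * (1 - sum of item variances / variance of total score)
    k = x.shape[1]
    item_vars = x.var(axis=0, ddof=1)
    total_var = x.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

alpha = cronbach_alpha(data)
for i in range(data.shape[1]):
    rest = np.delete(data, i, axis=1)
    # corrected item-total correlation: item vs. sum of the other items
    r_it = np.corrcoef(data[:, i], rest.sum(axis=1))[0, 1]
    print(f"item {i + 1}: item-total r = {r_it:.2f}, "
          f"alpha if deleted = {cronbach_alpha(rest):.2f}")
print(f"overall alpha = {alpha:.2f}")
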
89
Q

describe banding

A

A compromise between top-down selection and passing scores. Takes into consideration the degree of error associated with any test score: how many points apart do two applicants have to be before we say their test scores are significantly different? Uses the standard error of measurement (SEM). Top scorers are hired while still allowing flexibility for affirmative action (Campion et al., 2001).

90
Q

discuss criticisms of banding

A

Banding can result in lower utility than top-down hiring. It may not actually reduce adverse impact in any significant way. Its usefulness for achieving affirmative action goals is affected by factors like the selection ratio and the percentage of minority applicants. The SEM formula isn’t the correct one, and the standard error of estimate (SEE) should be used instead; but bands resulting from use of the SEE will be smaller than those resulting from use of the SEM, further diminishing the already questionable usefulness of banding as a means of reducing adverse impact.

91
Q

What is test score banding? How/when do you use banding. Provide a brief example.

A

Banding is the process of treating certain groups of test scores as the same. The basic premise lies in psychometric theory (CTT): measurement error influences the accuracy of test scores and therefore our ability to determine one’s true score (T). Small differences in test scores may not reflect true differences in ability between candidates once we consider the effects of measurement error on observed scores. This is compounded by the fact that many studies indicate selection tests are imperfect predictors of job performance. These issues can lead to false hits and false misses in the selection process. To counter these error effects, certain scores are grouped into bands and treated the same.

There are different types of banding. In traditional banding, the bands are determined by trend analysis and expert opinion. In the SED (standard error of difference) method, statistical significance testing determines how to group the scores into bands. Bands may also be fixed or sliding: in fixed bands, everyone in the top band is selected before anyone in the next lower band; in sliding bands, the band slides down and is re-formed as each person is removed from the top.

Banding serves several purposes. The primary purpose is to ensure fairness by treating small differences in scores as the same. There is some loss of predictive power compared with traditional top-down selection, although much research indicates the loss in utility is minor. Another purpose is to increase diversity: especially with sliding bands and points for minority status, minorities become more likely to be selected. For banding to adhere to legal standards, selection from within bands has to be based on a variety of factors, not solely on minority status. Organizations should educate applicants and employees on their use of banding, provide the rationale, and put policies and processes in writing.

An example of when an organization might use banding is as part of a mandated AA plan, perhaps to correct for past transgressions and/or mitigate adverse impact (under the assumption that the test is non-biased, reliable, and valid).

92
Q

Discuss/explain how you decide to use banding methods or strict top-down selection when referring job-candidates for interview with hiring supervisors.

A

When deciding, you want to consider the effects on utility and adverse impact that may result. Strict top-down selection produces the highest utility but also the most adverse impact. Top-down within-groups selection produces the greatest loss in utility. Banding methods that use minority preference selection within bands eliminate adverse impact; the utility loss with this type is no more than about 3 points on the selection procedure composite score. Sliding bands with minority preference and top-down selection within bands don’t produce adverse impact and yield utility comparable to strict top-down selection. Critics of banding state that there is no justification for using such a method. The legal reasons for banding include ease of administration, eliminating non-meaningful differences, and improving the chances of hiring members of traditionally lower-scoring groups. If the difference between scores has no practical significance, it is safe to band. An organization would benefit from banding if it is trying to improve its diversity or if it is important for its employees to be representative of the population (e.g., a police department). If a top-down approach is used, it is likely that the majority of hires will be white, since minorities, on average, score lower than whites on cognitively loaded tests. Banding is useful when small differences in test scores are not expected to represent meaningful differences in actual job performance, since all scores within a band are treated the same.

93
Q

under what conditions should banding definitely be used?

A

Should be used when an organization has been legally reprimanded for adverse impact, or when an organization has been mandated to implement an affirmative action plan to compensate for past adverse-impact misconduct.

94
Q

establishing bands: statistical methods (general)

A

the standard error of measurement: results in a range of scores that are treated as basically equivalent 95% of the time

the standard error of difference: most frequently used; determines the range of scores within which you can be 95% sure that a difference is real
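
A minimal Python sketch, assuming a hypothetical test with SD = 10 and reliability of .85; the 1.96 multiplier corresponds to the 95% confidence level mentioned above:

# Minimal sketch of the two statistical banding quantities; the test SD
# and reliability below are hypothetical.
sd, r_xx, z = 10.0, 0.85, 1.96        # z for ~95% confidence

sem = sd * (1 - r_xx) ** 0.5          # standard error of measurement
sed = sem * 2 ** 0.5                  # standard error of the difference

band_width = z * sed
print(f"SEM = {sem:.2f}, SED = {sed:.2f}, 95% band width = {band_width:.1f} points")
# Two applicants whose scores differ by less than roughly band_width points
# would be treated as equivalent within a band.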

95
Q

describe fixed bands

A

the range of the top band is determined from the highest selection procedure score achieved by an applicant. all applicants within this top band must be selected before applicants in lower bands can be chosen. since scores are considered equivalent within the band, people can be chosen in any order. the number of bands created depends on how many people the org needs to fill a position.

96
Q

describe sliding bands

A

they are also anchored at the top applicant’s score, but once that top applicant is selected, the band is recalculated using the next highest applicant score. this is a sequential process that recalculates the band after each person is chosen, because people are picked relative to what’s left in the applicant pool
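
A minimal Python sketch contrasting fixed and sliding bands with a hypothetical 5-point band width; the rule for picking within a band is a placeholder, since that decision is covered in the next card:

# Minimal sketch of fixed vs. sliding bands; scores and band width are
# hypothetical, and scores are already sorted high to low.
scores = [94, 92, 91, 88, 87, 85, 80]
band_width = 5

# Fixed band: anchored at the single highest score until it is exhausted
top = scores[0]
fixed_band = [s for s in scores if s > top - band_width]

# Sliding band: after each selection the band is re-anchored at the
# highest score remaining in the pool, which can pull new scores into it
remaining = scores.copy()
selections = []
for _ in range(3):                       # e.g., three openings
    band = [s for s in remaining if s > remaining[0] - band_width]
    pick = band[0]                       # placeholder rule: take top of band
    selections.append(pick)
    remaining.remove(pick)

print("fixed band:", fixed_band, "| sliding-band picks:", selections)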

97
Q

selecting within a band

A

once bands are created you need to decide how you’re going to select individuals from within the band. minority group status is considered after band creation ALONG WITH other criteria in choosing people from within the same band. these criteria can include job experience, personal skills, training, professional conduct, etc. you CANNOT use minority status as the sole basis for choosing people within a band!