Selection (Measurement/Testing/Reliability) Flashcards
define measurement in the context of selection
“systematic application of pre-established rules or standards for assigning scores to the attributes or traits of an individual.” - Gatewood & Field, 7e
what is the overarching purpose of selection measures?
to be used as a predictor or criterion; to detect any true differences that may exist among individuals with regard to the attribute being measured
A predictor or criterion measure is standardized if it possesses each of the following characteristics:
1. Content—All persons being assessed are measured by the same information or content. This includes the same format (for example, multiple-choice, essay, and so on) and medium (for example, paper-and-pencil, computer, video).
2. Administration—Information is collected the same way in all locations and across all administrators, each time the selection measure is applied.
3. Scoring—Rules for scoring are specified before administering the measure and are applied the same way with each application. For example, if scoring requires subjective judgment, steps should be taken (such as rater training) to ensure inter-rater agreement or reliability.
what scale must a selection criterion be measured at?
interval
List and describe types of criterion methods
1. Objective production data—These data tend to be physical measures of work. Number of goods produced, amount of scrap left, and dollar sales are examples of objective production data.
2. Personnel data—Personnel records and files frequently contain information on workers that can serve as important criterion measures. Absenteeism, tardiness, voluntary turnover, accident rates, salary history, promotions, and special awards are examples of such measures.
3. Judgmental data—Performance appraisals or ratings frequently serve as criteria in selection research. They most often involve a supervisor's rating of a subordinate on a series of behaviors or outcomes found to be important to job success, including task performance, citizenship behavior, and counterproductive behavior. Supervisor or rater judgments play a predominant role in defining this type of criterion data.
4. Job or work sample data—These data are obtained from a measure developed to resemble the job in miniature or to sample specific aspects of the work process or outcomes (for example, a typing test for a secretary). Measurements (for example, quantity and error rate) are taken on individual performance of these job tasks, and these measures serve as criteria.
5. Training proficiency data—This type of criterion focuses on how quickly and how well employees learn during job training activities. Often, such criteria are labeled trainability measures. Error rates during a training period and scores on training performance tests administered during training sessions are examples of training proficiency data.
what are the two basic options for choosing selection measures?
locate existing measures or create your own measures
locating existing selection measures: discuss the advantages
1. Use of existing measures is usually less expensive and less time-consuming than developing new ones.
2. If previous research was conducted, we will have some idea about the reliability, validity, and other characteristics of the measures.
3. Existing measures often will be superior to what could be developed in-house.
List the basic steps involved in developing your own selection measure
1. Analyzing the job for which a measure is being developed
2. Selecting the method of measurement to be used
3. Planning and developing the measure
4. Administering, analyzing, and revising the preliminary measure
5. Determining the reliability and validity of the revised measure for the jobs studied
6. Implementing and monitoring the measure in the human resource selection system
creating your own selection measure: 1. work analysis
- A broader analysis of work can be used when technology or jobs are changing too rapidly for a traditional job analysis to be carried out.
- The purpose is to determine the KSAs necessary for the work activities or to identify employee competencies from a broader perspective.
- Provides the foundation for the criterion measures to be chosen or developed.
creating selection measures: selecting the measurement method
Depends on:
- the nature of the job tasks and level of responsibility
- the skill of the people administering and scoring the measure
- costs
- resources available for development
- applicant characteristics
Choose the method that is most appropriate; for example, to test an industrial electrician applicant's ability to solder connections, you would not give a paper-and-pencil test, but probably a work sample test.
creating selection measures: planning and developing the selection measure; specifications required for each measure
Prepare an initial version of the measure. Specifications include:
1. The purposes and uses the measure is intended to serve.
2. The nature of the population for which the measure is to be designed.
3. The way the behaviors or knowledge, skills, abilities, and other attributes (KSAOs) will be gathered and scored. This includes decisions about the method of administration, the format of test items and responses, and the scoring procedures.
describe the general method for generating items for selection measures
Substantial work is involved in selecting and refining the items or questions to be used to measure the attribute of interest. This often involves having subject-matter experts (SMEs) create the items or rewrite them. In developing these items, the reviewers (for example, SMEs) should consider the appropriateness of item content and format for fulfilling its purpose, including characteristics of the applicant pool; clarity and grammatical correctness; and consideration of bias or offensive portrayals of a subgroup of the population.
discuss the two types of response formats for selection measure responses
Broadly, there are two types of formats—the first uses objective or fixed-response items (multiple-choice, true-false); the second elicits open-ended, free-response formats (essay or fill-in-the-blank). The fixed-response format is the most popular; it makes efficient use of testing time, results in few (or no) scoring errors, and can easily and reliably be transformed into a numerical scale for scoring purposes. The primary advantage of the free-response format is that it can provide greater detail or richer samples of the candidates' behavior and may allow unique characteristics, such as creativity, to emerge. Primarily due to both the ease of administration and the objectivity of scoring, fixed-response formats are most frequently utilized today, particularly if the measure is likely to be administered in a group setting. Finally, explicit scoring of the measure is particularly critical. Well-developed hiring tools will provide an "optimal" score for each item that is uniformly applied.
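A minimal sketch of what a uniformly applied, pre-specified scoring rule for fixed-response items can look like (the answer key, responses, and function name are all hypothetical):

```python
# Hypothetical answer key for four multiple-choice items; the same key is
# applied identically to every candidate.
ANSWER_KEY = {"q1": "B", "q2": "D", "q3": "A", "q4": "C"}

def score_candidate(responses):
    """Return the number of items answered correctly against the fixed key."""
    return sum(1 for item, key in ANSWER_KEY.items() if responses.get(item) == key)

print(score_candidate({"q1": "B", "q2": "A", "q3": "A", "q4": "C"}))  # prints 3
```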
creating selection measures: administering, analyzing, and revising
- Pilot testing: the measure should be administered to a sample of people from the same population for which it is being developed.
- The choice of participants should take into account the demographics, motivation, ability, and experience of the applicant pool of interest.
- If a test is being developed for which item analyses (for example, factor analyses or the calculation of means, standard deviations, and reliabilities) are to be performed, a sample of at least a hundred, preferably several hundred, will be needed.
- Based on the data collected, item analyses are performed on the preliminary data. The objective is to revise the proposed measure by correcting any weaknesses and deficiencies noted. Item analyses are used to choose the content, permitting it to discriminate between those who know and those who do not know the information covered.
creating selection measures: psychometric characteristics to consider when analyzing pilot test data
1. The reliability or consistency of scores on the items. In part, reliability is based on the consistency and precision of the results of the measurement process and indicates whether items are free from measurement error.
2. The validity of the intended inferences. Do responses to an item differentiate among applicants with regard to the characteristics or traits that the measure is designed to assess? For example, if the test measures verbal ability, high-ability individuals will answer an item differently than those with low verbal ability. Often the items that differentiate are those with moderate difficulty, where about 50 percent of applicants answer the item correctly. This is true for measures of ability, which have either a correct or incorrect answer.
3. Item fairness or differences among subgroups. A fair test has scores that have the same meaning for members of different subgroups of the population. Such tests would have comparable levels of item difficulty for individuals from diverse demographic groups. Panels of demographically heterogeneous raters, who are qualified by their expertise or sensitivity to linguistic or cultural bias in the areas covered by the test, may be used to revise or discard offending items as warranted. An item sensitivity review is used to eliminate or revise any item that could be demeaning or offensive to members of a specific subgroup.
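A rough sketch of the kind of item analysis described above, run on simulated pilot data (all values and names are illustrative, not from the text): item difficulty as the proportion answering correctly, and a simple item-total correlation as a discrimination index.

```python
import numpy as np

# Simulated pilot data: rows = examinees, columns = dichotomously scored items.
# Responses are driven by a single latent ability so items intercorrelate.
rng = np.random.default_rng(0)
ability = rng.normal(0, 1, size=300)
thresholds = np.linspace(-1.5, 1.5, 12)            # items of varying difficulty
p_correct = 1 / (1 + np.exp(-(ability[:, None] - thresholds)))
responses = (rng.random((300, 12)) < p_correct).astype(float)

# Item difficulty: proportion answering each item correctly.
# Items near .50 tend to differentiate best for ability measures.
difficulty = responses.mean(axis=0)

# A simple discrimination index: each item's correlation with the total score
# computed from the remaining items (corrected item-total correlation).
total = responses.sum(axis=1)
item_total_r = np.array([
    np.corrcoef(responses[:, j], total - responses[:, j])[0, 1]
    for j in range(responses.shape[1])
])

print(np.round(difficulty, 2))
print(np.round(item_total_r, 2))
```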
creating selection measures: implementing the measure
After we obtain the necessary reliability and validity evidence, we can then implement our measure. Cut-off or passing scores may be developed. Norms or standards for interpreting how various groups score on the measure (categorized by gender, ethnicity, level of education, and so on) will be developed to help interpret the results. Once the selection measure is implemented, we will continue to monitor its performance to ensure that it is performing the function for which it is intended. Ultimately, this evaluation should be guided by whether the current decision-making process has been improved by the addition of the test.
Using norms to interpret scores on selection measures
- A score may take on different meanings depending on how it stands relative to the scores of others in particular groups. Our interpretation will depend on the score's relative standing in these other groups.
- The norm group used for comparison should be relevant and comparable to the applicant group; use local norms where possible.
- Norms are transitory: they are specific to the point in time when they were collected and will probably change over time.
- Norms are not always necessary in HR selection. For example, if the five best performers on a test must be hired, or if persons with scores of 70 or better on the test are known to make suitable employees, then a norm is not necessary in employment decision making; one can simply use the individuals' test scores. On the other hand, if applicants' median selection test scores are significantly below those of a norm group, then the firm's recruitment practices should be examined. The practices may not be attracting the best job applicants; normative data would help in analyzing this situation.
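A small illustration of interpreting a score against local norms, using hypothetical scores; scipy's percentileofscore is simply one convenient way to get a percentile rank (a real local norm group would be far larger and clearly defined).

```python
from scipy import stats

# Hypothetical local norms: selection test scores from prior applicants to this job.
local_norms = [52, 61, 58, 70, 65, 49, 73, 68, 55, 62, 59, 64, 71, 47, 66]

applicant_score = 66

# Percentile rank of the applicant's score within the local norm group.
pct = stats.percentileofscore(local_norms, applicant_score)
print(f"Score of {applicant_score} falls at the {pct:.0f}th percentile of the local norms")
```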
Reliability: definition (selection measures)
degree of dependability, consistency, or stability of scores on a measure used in selection - Gatewood, 7e
In general, how is reliability of a measure determined?
by the degree of consistency between two sets of scores on the measure
In general, what determines whether a measure has low or high reliability?
More measurement error = lower reliability; less measurement error = higher reliability.
Discuss the concept of “true scores” in the context of reliability of selection measures
The true score is really an ideal conception. It is the score individuals would obtain if conditions external and internal to a measure were perfect. For example, in our mathematics ability test, an ideal or true score would be one for which both of the following conditions existed:
1. Individuals answered correctly the same percentage of problems on the test that they would have if all possible problems had been given and the test were a construct-valid measure of the underlying phenomenon of interest (see next chapter).
2. Individuals answered correctly the problems they actually knew without being affected by external factors such as the lighting or temperature of the room in which the testing took place, their emotional state, or their physical health.
Because a true score can never be measured exactly, the obtained score is used to estimate the true score. Reliability answers this question: How confident can we be that an individual's obtained score represents his or her true score?
Discuss the idea of error score in the context of reliability of selection measures
A second part of the obtained score is the error score. This score represents errors of measurement. Errors of measurement are those factors that affect obtained scores but are not related to the characteristic, trait, or attribute being measured. These factors, present at the time of measurement, distort respondents' scores either over or under what they would have been on another measurement occasion. There are many reasons why individuals' scores differ from one measurement occasion to the next. Fatigue, anxiety, or noise during testing that distracts some test takers but not others are only a few of the factors that explain differences in individuals' scores over different measurement occasions.
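In standard classical test theory notation (implied by these cards but not written out in them), the obtained, true, and error scores and the reliability coefficient discussed on the next card are related as:

```latex
X = T + E, \qquad
\sigma^2_X = \sigma^2_T + \sigma^2_E \ \ (\text{assuming } T \text{ and } E \text{ are uncorrelated}), \qquad
r_{XX'} = \frac{\sigma^2_T}{\sigma^2_X} = 1 - \frac{\sigma^2_E}{\sigma^2_X}
```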
discuss the function of the reliability coefficient in the context of selection measures
A reliability coefficient is simply an index of relationship. It summarizes the relation between two sets of measures for which a reliability estimate is being made. The calculated index varies from 0.00 to 1.00. In calculating reliability estimates, the correlation coefficient obtained is regarded as a direct measure of the reliability estimate. The higher the coefficient, the less the measurement error and the higher the reliability estimate. Conversely, as the coefficient approaches 0.00, errors of measurement increase and reliability correspondingly decreases. Of course, we want to employ selection measures having high reliability coefficients. With high reliability, we can be more confident that a particular measure is giving a dependable picture of true scores for whatever attribute is being measured.
list the primary types of methods of estimating reliability for selection measures
1. test-retest
2. parallel (equivalent) forms
3. internal consistency
4. interrater
Discuss the idea of test-retest reliability
Administer the measure twice and then correlate the two sets of scores using the Pearson product-moment correlation coefficient. This method is called test-retest reliability because the same measure is used to collect data from the same respondents at two different points in time. Because a correlation coefficient is calculated between the two sets of scores over time, the obtained reliability coefficient represents a coefficient of stability. The coefficient indicates the extent to which the test can be generalized from one time period to the next. The higher the test-retest reliability coefficient, the greater the true score component and the less error present. If reliability were equal to 1.00, no error would exist in the scores; true scores would be perfectly represented by the obtained scores. Any factor that causes scores within a group to change differentially over time will decrease test-retest reliability. Similarly, any factor that causes scores to remain the same over time will increase the reliability estimate.
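A quick illustration, on simulated data, of the coefficient of stability as a Pearson correlation between two administrations; the true-score-plus-error setup is an assumption made only for the demonstration.

```python
import numpy as np
from scipy import stats

# Each obtained score is a stable true score plus independent error at each administration.
rng = np.random.default_rng(1)
true_scores = rng.normal(50, 10, size=300)          # var(T) = 100
time1 = true_scores + rng.normal(0, 5, size=300)    # var(E) = 25
time2 = true_scores + rng.normal(0, 5, size=300)

# Coefficient of stability: Pearson correlation between the two administrations.
r, _ = stats.pearsonr(time1, time2)
print(f"test-retest reliability ≈ {r:.2f}")          # near 100 / (100 + 25) = .80
```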
guidelines for using test re test reliability
1. Test-retest reliability is appropriate when the length of time between the two administrations is long enough to offset the effects of memory or practice.
2. When there is little reason to believe that memory will affect responses to a measure, test-retest reliability may be employed. Memory may have minimal effects in situations where (a) a large number of items appear on the measure, (b) the items are too complex to remember (for example, items involving detailed drawings, complex shapes, or detailed questions), and (c) retesting occurs after at least eight weeks.
3. When it can be confidently determined that nothing has occurred between the two testings that will affect responses (learning, for example), test-retest can be used.
4. When information is available on only a single-item measure, test-retest reliability may be used.
Parallel forms; what does it do, how it’s measured, what it means, etc.
- Helps offset the effects of memory on test-retest reliability.
- A Pearson correlation is calculated between the two sets of scores from different but equivalent items.
- As the coefficient approaches 1.00, the set of measures is viewed as equivalent, or the same, for the attribute measured.
- If equivalent forms are administered on different occasions, then this design also reflects the degree of temporal stability of the measure. In such cases, the reliability estimate is referred to as a coefficient of equivalence and stability. The use of equivalent forms administered over time accounts for the influence of random error due to the test content (over equivalent forms) and transient error (across situations).
Parallel forms: the basic process
1. The process of developing equivalent forms begins with the identification of a universe of possible math ability items (the item domain).
2. Items from this domain are administered to a large sample of individuals representative of those to whom the math ability test will be given.
3. Individuals' responses are used to identify the difficulty of the items through item analyses and to ensure the items are measuring the same math ability construct.
4. Next, the items are rank-ordered according to their difficulty and randomly assigned in pairs to form two sets of items (Form A and Form B).
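A sketch of step 4 only, with made-up item difficulty values from a hypothetical pilot item analysis; the pairing-and-random-assignment logic, not the specific values, is the point.

```python
import random

# Hypothetical item difficulties (proportion correct) from the pilot analysis.
item_difficulty = {"item_1": 0.35, "item_2": 0.72, "item_3": 0.48, "item_4": 0.61,
                   "item_5": 0.55, "item_6": 0.44, "item_7": 0.67, "item_8": 0.39}

ranked = sorted(item_difficulty, key=item_difficulty.get)   # rank-order by difficulty
random.seed(0)

form_a, form_b = [], []
for first, second in zip(ranked[0::2], ranked[1::2]):
    pair = [first, second]          # adjacent items form a matched pair
    random.shuffle(pair)            # randomize which form receives which item
    form_a.append(pair[0])
    form_b.append(pair[1])

print("Form A:", form_a)
print("Form B:", form_b)
```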
difference between alternate forms and parallel forms?
Because it is difficult to meet all of the criteria of equivalent forms, some writers use the term alternate forms to refer to forms that approximate but do not meet the criteria of parallel forms.
internal consistency reliability measurement
An index of a measure's similarity of content is an internal consistency reliability estimate. Basically, an internal consistency reliability estimate shows the extent to which all parts of a measure (such as items or questions) are similar in what they measure. Thus a selection measure is internally consistent or homogeneous when individuals' responses on one part of the measure are related to their responses on other parts. If the sample of selected items truly assesses the same concept, then respondents should answer these items in the same way. What must be determined for the items chosen is whether respondents answer the sample of items similarly.
list the procedures most often applied to obtain internal consistency estimates
split-half, Kuder-Richardson (KR-20), and Cronbach's alpha
describe split half reliability
- Single administration of the measure.
- The measure is then divided or split into two halves.
- Performance on half 1 should be associated with performance on half 2 if all items measure the same attribute.
- The most common method is to split by even- and odd-numbered test items.
- Problem: there are many ways to split the test, and not all splits yield the same reliability estimate; this method is not used as much today.
- Limitation: it cannot detect any errors associated with time.
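A sketch of an odd/even split-half estimate on simulated data; the Spearman-Brown correction at the end is common practice for adjusting the half-test correlation to full test length, though the card above does not mention it.

```python
import numpy as np

# Simulated item responses driven by a single latent ability, so the halves
# should correlate. Data are made up for illustration only.
rng = np.random.default_rng(2)
ability = rng.normal(0, 1, size=200)
thresholds = np.linspace(-1, 1, 20)
p_correct = 1 / (1 + np.exp(-(ability[:, None] - thresholds)))
responses = (rng.random((200, 20)) < p_correct).astype(float)

# Split into odd- and even-numbered items and correlate the two half-scores.
odd_half = responses[:, 0::2].sum(axis=1)
even_half = responses[:, 1::2].sum(axis=1)
r_half = np.corrcoef(odd_half, even_half)[0, 1]

# Spearman-Brown correction to estimate reliability at full test length.
r_full = 2 * r_half / (1 + r_half)
print(f"half-test r = {r_half:.2f}, corrected split-half reliability = {r_full:.2f}")
```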
Describe Kuder-Richardson reliability estimates (KR20)
- Single administration of the measure.
- Used to determine the consistency of answers on any measure whose items are scored dichotomously (such as right or wrong answers).
- Estimates the average of the reliability coefficients that would result from all possible ways of subdividing the test.
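A minimal implementation of the standard KR-20 formula (the function name and example data are mine, not from the text):

```python
import numpy as np

def kr20(responses: np.ndarray) -> float:
    """KR-20 for dichotomously scored items (rows = people, columns = items)."""
    k = responses.shape[1]
    p = responses.mean(axis=0)                       # proportion passing each item
    q = 1 - p
    total_var = responses.sum(axis=1).var()          # variance of total scores
    return (k / (k - 1)) * (1 - (p * q).sum() / total_var)

# Example with made-up 0/1 data: 5 people x 4 items
x = np.array([[1, 1, 1, 0],
              [1, 1, 0, 0],
              [1, 0, 1, 1],
              [0, 0, 0, 0],
              [1, 1, 1, 1]], dtype=float)
print(round(kr20(x), 2))
```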
Describe Cronbach's alpha reliability estimate
- Can be used for continuous item responses.
- A general version of KR-20.
- Still represents the average reliability computed from all possible split-half reliabilities.
- Reflects the average correlation of each item with every other item on a measure.
- If coefficient alpha is unacceptably low, then the items on the selection measure may be assessing more than one characteristic.
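A minimal implementation of coefficient alpha using the standard formula (the function name and example ratings are mine):

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Coefficient alpha for item scores (rows = people, columns = items);
    items may be dichotomous or continuous (e.g., Likert ratings)."""
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1)           # variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)       # variance of total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Example with made-up 5-point ratings: 6 respondents x 4 items
ratings = np.array([[4, 5, 4, 5],
                    [2, 3, 2, 2],
                    [5, 5, 4, 4],
                    [3, 2, 3, 3],
                    [1, 2, 1, 2],
                    [4, 4, 5, 4]], dtype=float)
print(round(cronbach_alpha(ratings), 2))
```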
Describe interrater reliability estimates
- Used when scoring is based on individual judgment.
- Rater behaviors and rater biases contribute to rater error.
- Defined as the consistency among raters; determines whether multiple raters are consistent in their judgments.
- Computation can involve interrater agreement, interclass correlation, and intraclass correlation.
Describe interrater agreement
- Some agreement indices are not good estimators of reliability.
- Often fails to take into account rater agreement that occurs by chance.
- Generally restricted to nominal or categorical data, which reduces its flexibility.
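A small illustration of why raw agreement can overstate reliability: percent agreement versus Cohen's kappa, which corrects for chance agreement. The ratings are hypothetical, and sklearn is just one convenient implementation.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Hypothetical categorical judgments of 10 candidates by two raters.
rater1 = np.array(["hire", "hire", "reject", "hold", "hire",
                   "reject", "hold", "hire", "reject", "hire"])
rater2 = np.array(["hire", "hold", "reject", "hold", "hire",
                   "reject", "hire", "hire", "reject", "hire"])

percent_agreement = (rater1 == rater2).mean()      # ignores agreement expected by chance
kappa = cohen_kappa_score(rater1, rater2)          # corrects for chance agreement

print(f"percent agreement = {percent_agreement:.2f}, Cohen's kappa = {kappa:.2f}")
```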
interclass correlation
- Employed when two raters make judgments on an interval scale.
- Computed with the Pearson product-moment correlation or Cohen's weighted kappa.
- Shows the amount of error between two raters.
- Relatively low interclass correlations indicate that more specific operational criteria for making the ratings, or additional rater training in how to apply the rating criteria, may be needed to enhance interrater reliability.
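An illustration with hypothetical ratings from two raters on an interval scale, computing the Pearson correlation and, as one option, a quadratic-weighted kappa via sklearn:

```python
import numpy as np
from scipy import stats
from sklearn.metrics import cohen_kappa_score

# Hypothetical interview ratings of 8 candidates by two raters on a 1-7 scale.
rater1 = np.array([5, 6, 3, 7, 4, 5, 2, 6])
rater2 = np.array([4, 6, 3, 6, 5, 5, 2, 7])

r, _ = stats.pearsonr(rater1, rater2)                           # interclass correlation
wk = cohen_kappa_score(rater1, rater2, weights="quadratic")     # Cohen's weighted kappa
print(f"Pearson r = {r:.2f}, weighted kappa = {wk:.2f}")
```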
describe Intraclass correlation
- When three or more raters have made ratings on one or more targets, intraclass correlations can be calculated.
- This procedure is usually viewed as the best way to determine whether multiple raters differ in their subjective scores or ratings on the trait or behavior being assessed.
- The intraclass correlation shows the average relationship among raters for all targets being rated.
- Indicates how much of the difference in ratings is due to true differences versus measurement error.
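A sketch of one common formulation, the one-way random-effects intraclass correlation ICC(1); the function name and ratings matrix are illustrative, not from the text.

```python
import numpy as np

def icc_oneway(ratings: np.ndarray) -> float:
    """One-way random-effects ICC(1): rows = targets (ratees), columns = raters."""
    n, k = ratings.shape
    target_means = ratings.mean(axis=1)
    grand_mean = ratings.mean()
    # Between-target and within-target mean squares from a one-way ANOVA.
    ms_between = k * ((target_means - grand_mean) ** 2).sum() / (n - 1)
    ms_within = ((ratings - target_means[:, None]) ** 2).sum() / (n * (k - 1))
    # Proportion of rating variance attributable to true differences among targets.
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

# Hypothetical ratings of 5 ratees by 3 raters on a 1-7 scale.
ratings = np.array([[6, 5, 6],
                    [3, 4, 3],
                    [7, 6, 7],
                    [4, 4, 5],
                    [2, 3, 2]], dtype=float)
print(f"ICC(1) = {icc_oneway(ratings):.2f}")
```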