Selection (Measurement/Testing/Reliability) Flashcards
define measurement in the context of selection
“systematic application of pre-established rules or standards for assigning scores to the attributes or traits of an individual.” - Gatewood & Field, 7e
what is the overarching purpose of selection measures?
to be used as a predictor or criterion; to detect any true differences that may exist among individuals with regard to the attribute being measured
A predictor or criterion measure is standardized if it possesses each of the following characteristics:
1. Content—All persons being assessed are measured by the same information or content. This includes the same format (for example, multiple-choice, essay, and so on) and medium (for example, paper-and-pencil, computer, video).
2. Administration—Information is collected the same way in all locations and across all administrators, each time the selection measure is applied.
3. Scoring—Rules for scoring are specified before administering the measure and are applied the same way with each application. For example, if scoring requires subjective judgment, steps should be taken (such as rater training) to ensure inter-rater agreement or reliability.
what scale must a selection criterion be measured at?
interval
List and describe types of criterion methods
1. Objective production data—These data tend to be physical measures of work. Number of goods produced, amount of scrap left, and dollar sales are examples of objective production data.
2. Personnel data—Personnel records and files frequently contain information on workers that can serve as important criterion measures. Absenteeism, tardiness, voluntary turnover, accident rates, salary history, promotions, and special awards are examples of such measures.
3. Judgmental data—Performance appraisals or ratings frequently serve as criteria in selection research. They most often involve a supervisor's rating of a subordinate on a series of behaviors or outcomes found to be important to job success, including task performance, citizenship behavior, and counterproductive behavior. Supervisor or rater judgments play a predominant role in defining this type of criterion data.
4. Job or work sample data—These data are obtained from a measure developed to resemble the job in miniature or to sample specific aspects of the work process or outcomes (for example, a typing test for a secretary). Measurements (for example, quantity and error rate) are taken on individual performance of these job tasks, and these measures serve as criteria.
5. Training proficiency data—This type of criterion focuses on how quickly and how well employees learn during job training activities. Often, such criteria are labeled trainability measures. Error rates during a training period and scores on training performance tests administered during training sessions are examples of training proficiency data.
what are the two basic options for choosing selection measures?
locate existing measures or create your own measures
locating existing selection measures: discuss the advantages
1. Use of existing measures is usually less expensive and less time-consuming than developing new ones.
2. If previous research was conducted, we will have some idea about the reliability, validity, and other characteristics of the measures.
3. Existing measures often will be superior to what could be developed in-house.
List the basic steps involved in developing your own selection measure
1. Analyzing the job for which a measure is being developed
2. Selecting the method of measurement to be used
3. Planning and developing the measure
4. Administering, analyzing, and revising the preliminary measure
5. Determining the reliability and validity of the revised measure for the jobs studied
6. Implementing and monitoring the measure in the human resource selection system
creating your own selection measure: 1. work analysis
- A broader analysis of work can be used when technology or jobs are changing too rapidly for a traditional job analysis to be carried out.
- The purpose is to determine the KSAs necessary for the work activities or to identify employee competencies from a broader perspective.
- Provides the foundation for the criterion measures to be chosen or developed.
creating selection measures: selecting the measurement method
Depends on:
- the nature of the job tasks and level of responsibility
- the skill of the people administering and scoring the measure
- costs
- resources available for development
- applicant characteristics
Choose the method that is most appropriate; for example, to test an industrial electrician applicant's ability to solder connections, you would not give a paper-and-pencil test, but probably a work sample test.
creating selection measures: planning and developing the selection measure; specifications required for each measure
Prepare an initial version of the measure. Specifications include:
1. The purposes and uses the measure is intended to serve.
2. The nature of the population for which the measure is to be designed.
3. The way the behaviors or knowledge, skills, abilities, and other attributes (KSAOs) will be gathered and scored. This includes decisions about the method of administration, the format of test items and responses, and the scoring procedures.
describe the general method for generating items for selection measures
Substantial work is involved in selecting and refining the items or questions to be used to measure the attribute of interest. This often involves having subject-matter experts (SMEs) create the items or rewrite them. In developing these items, the reviewers (for example, SMEs) should consider the appropriateness of item content and format for fulfilling its purpose, including characteristics of the applicant pool; clarity and grammatical correctness; and consideration of bias or offensive portrayals of a subgroup of the population.
discuss the two types of response formats for selection measure responses
Broadly, there are two types of formats—the first uses objective or fixed-response items (multiple-choice, true-false); the second elicits open-ended, free-response formats (essay or fill-in-the-blank). The fixed-response format is the most popular; it makes efficient use of testing time, results in few (or no) scoring errors, and can easily and reliably be transformed into a numerical scale for scoring purposes. The primary advantage of the free-response format is that it can provide greater detail or richer samples of the candidates' behavior and may allow unique characteristics, such as creativity, to emerge. Primarily due to both the ease of administration and the objectivity of scoring, fixed-response formats are most frequently utilized today, particularly if the measure is likely to be administered in a group setting. Finally, explicit scoring of the measure is particularly critical. Well-developed hiring tools will provide an "optimal" score for each item that is uniformly applied.
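A minimal sketch of what a uniformly applied, pre-specified scoring rule for fixed-response items can look like (the answer key, responses, and function name are all hypothetical):

```python
# Hypothetical answer key for four multiple-choice items; the same key is
# applied identically to every candidate.
ANSWER_KEY = {"q1": "B", "q2": "D", "q3": "A", "q4": "C"}

def score_candidate(responses):
    """Return the number of items answered correctly against the fixed key."""
    return sum(1 for item, key in ANSWER_KEY.items() if responses.get(item) == key)

print(score_candidate({"q1": "B", "q2": "A", "q3": "A", "q4": "C"}))  # prints 3
```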
creating selection measures: administering, analyzing, and revising
- Pilot testing: the measure should be administered to a sample of people from the same population for which it is being developed.
- The choice of participants should take into account the demographics, motivation, ability, and experience of the applicant pool of interest.
- If a test is being developed for which item analyses (for example, factor analyses or the calculation of means, standard deviations, and reliabilities) are to be performed, a sample of at least a hundred, preferably several hundred, will be needed.
- Based on the data collected, item analyses are performed on the preliminary data. The objective is to revise the proposed measure by correcting any weaknesses and deficiencies noted. Item analyses are used to choose the content, permitting it to discriminate between those who know and those who do not know the information covered.
creating selection measures: psychometric characteristics to consider when analyzing pilot test data
1. The reliability or consistency of scores on the items. In part, reliability is based on the consistency and precision of the results of the measurement process and indicates whether items are free from measurement error.
2. The validity of the intended inferences. Do responses to an item differentiate among applicants with regard to the characteristics or traits that the measure is designed to assess? For example, if the test measures verbal ability, high-ability individuals will answer an item differently than those with low verbal ability. Often the items that differentiate are those with moderate difficulty, where about 50 percent of applicants answer the item correctly. This is true for measures of ability, which have either a correct or incorrect answer.
3. Item fairness or differences among subgroups. A fair test has scores that have the same meaning for members of different subgroups of the population. Such tests would have comparable levels of item difficulty for individuals from diverse demographic groups. Panels of demographically heterogeneous raters, who are qualified by their expertise or sensitivity to linguistic or cultural bias in the areas covered by the test, may be used to revise or discard offending items as warranted. An item sensitivity review is used to eliminate or revise any item that could be demeaning or offensive to members of a specific subgroup.
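A rough sketch of the kind of item analysis described above, run on simulated pilot data (all values and names are illustrative, not from the text): item difficulty as the proportion answering correctly, and a simple item-total correlation as a discrimination index.

```python
import numpy as np

# Simulated pilot data: rows = examinees, columns = dichotomously scored items.
# Responses are driven by a single latent ability so items intercorrelate.
rng = np.random.default_rng(0)
ability = rng.normal(0, 1, size=300)
thresholds = np.linspace(-1.5, 1.5, 12)            # items of varying difficulty
p_correct = 1 / (1 + np.exp(-(ability[:, None] - thresholds)))
responses = (rng.random((300, 12)) < p_correct).astype(float)

# Item difficulty: proportion answering each item correctly.
# Items near .50 tend to differentiate best for ability measures.
difficulty = responses.mean(axis=0)

# A simple discrimination index: each item's correlation with the total score
# computed from the remaining items (corrected item-total correlation).
total = responses.sum(axis=1)
item_total_r = np.array([
    np.corrcoef(responses[:, j], total - responses[:, j])[0, 1]
    for j in range(responses.shape[1])
])

print(np.round(difficulty, 2))
print(np.round(item_total_r, 2))
```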
creating selection measures: implementing the measure
After we obtain the necessary reliability and validity evidence, we can then implement our measure. Cut-off or passing scores may be developed. Norms or standards for interpreting how various groups score on the measure (categorized by gender, ethnicity, level of education, and so on) will be developed to help interpret the results. Once the selection measure is implemented, we will continue to monitor its performance to ensure that it is performing the function for which it is intended. Ultimately, this evaluation should be guided by whether the current decision-making process has been improved by the addition of the test.
Using norms to interpret scores on selection measures
- A score may take on different meanings depending on how it stands relative to the scores of others in particular groups. Our interpretation will depend on the score's relative standing in these other groups.
- The norm group used for comparison should be relevant and comparable to the applicant group; use local norms where possible.
- Norms are transitory: they are specific to the point in time when they were collected and will probably change over time.
- Norms are not always necessary in HR selection. For example, if the five best performers on a test must be hired, or if persons with scores of 70 or better on the test are known to make suitable employees, then a norm is not necessary in employment decision making; one can simply use the individuals' test scores. On the other hand, if applicants' median selection test scores are significantly below those of a norm group, then the firm's recruitment practices should be examined. The practices may not be attracting the best job applicants; normative data would help in analyzing this situation.
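A small illustration of interpreting a score against local norms, using hypothetical scores; scipy's percentileofscore is simply one convenient way to get a percentile rank (a real local norm group would be far larger and clearly defined).

```python
from scipy import stats

# Hypothetical local norms: selection test scores from prior applicants to this job.
local_norms = [52, 61, 58, 70, 65, 49, 73, 68, 55, 62, 59, 64, 71, 47, 66]

applicant_score = 66

# Percentile rank of the applicant's score within the local norm group.
pct = stats.percentileofscore(local_norms, applicant_score)
print(f"Score of {applicant_score} falls at the {pct:.0f}th percentile of the local norms")
```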
Reliability: definition (selection measures)
degree of dependability, consistency, or stability of scores on a measure used in selection - Gatewood, 7e
In general, how is reliability of a measure determined?
by the degree of consistency between two sets of scores on the measure
In general, what determines whether a measure has low or high reliability?
More measurement error = lower reliability; less measurement error = higher reliability.
Discuss the concept of “true scores” in the context of reliability of selection measures
The true score is really an ideal conception. It is the score individuals would obtain if conditions external and internal to a measure were perfect. For example, in our mathematics ability test, an ideal or true score would be one for which both of the following conditions existed:
1. Individuals answered correctly the same percentage of problems on the test that they would have if all possible problems had been given and the test were a construct-valid measure of the underlying phenomenon of interest (see next chapter).
2. Individuals answered correctly the problems they actually knew without being affected by external factors such as the lighting or temperature of the room in which the testing took place, their emotional state, or their physical health.
Because a true score can never be measured exactly, the obtained score is used to estimate the true score. Reliability answers this question: How confident can we be that an individual's obtained score represents his or her true score?
Discuss the idea of error score in the context of reliability of selection measures
A second part of the obtained score is the error score. This score represents errors of measurement. Errors of measurement are those factors that affect obtained scores but are not related to the characteristic, trait, or attribute being measured. These factors, present at the time of measurement, distort respondents' scores either over or under what they would have been on another measurement occasion. There are many reasons why individuals' scores differ from one measurement occasion to the next. Fatigue, anxiety, or noise during testing that distracts some test takers but not others are only a few of the factors that explain differences in individuals' scores over different measurement occasions.
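In standard classical test theory notation (implied by these cards but not written out in them), the obtained, true, and error scores and the reliability coefficient discussed on the next card are related as:

```latex
X = T + E, \qquad
\sigma^2_X = \sigma^2_T + \sigma^2_E \ \ (\text{assuming } T \text{ and } E \text{ are uncorrelated}), \qquad
r_{XX'} = \frac{\sigma^2_T}{\sigma^2_X} = 1 - \frac{\sigma^2_E}{\sigma^2_X}
```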
discuss the function of the reliability coefficient in the context of selection measures
A reliability coefficient is simply an index of relationship. It summarizes the relation between two sets of measures for which a reliability estimate is being made. The calculated index varies from 0.00 to 1.00. In calculating reliability estimates, the correlation coefficient obtained is regarded as a direct measure of the reliability estimate. The higher the coefficient, the less the measurement error and the higher the reliability estimate. Conversely, as the coefficient approaches 0.00, errors of measurement increase and reliability correspondingly decreases. Of course, we want to employ selection measures having high reliability coefficients. With high reliability, we can be more confident that a particular measure is giving a dependable picture of true scores for whatever attribute is being measured.
list the primary types of methods of estimating reliability for selection measures
1. test-retest
2. parallel (equivalent) forms
3. internal consistency
4. interrater
Discuss the idea of test-retest reliability
Administer the measure twice and then correlate the two sets of scores using the Pearson product-moment correlation coefficient. This method is called test-retest reliability because the same measure is used to collect data from the same respondents at two different points in time. Because a correlation coefficient is calculated between the two sets of scores over time, the obtained reliability coefficient represents a coefficient of stability. The coefficient indicates the extent to which the test can be generalized from one time period to the next. The higher the test-retest reliability coefficient, the greater the true score component and the less error present. If reliability were equal to 1.00, no error would exist in the scores; true scores would be perfectly represented by the obtained scores. Any factor that causes scores within a group to change differentially over time will decrease test-retest reliability. Similarly, any factor that causes scores to remain the same over time will increase the reliability estimate.
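A quick illustration, on simulated data, of the coefficient of stability as a Pearson correlation between two administrations; the true-score-plus-error setup is an assumption made only for the demonstration.

```python
import numpy as np
from scipy import stats

# Each obtained score is a stable true score plus independent error at each administration.
rng = np.random.default_rng(1)
true_scores = rng.normal(50, 10, size=300)          # var(T) = 100
time1 = true_scores + rng.normal(0, 5, size=300)    # var(E) = 25
time2 = true_scores + rng.normal(0, 5, size=300)

# Coefficient of stability: Pearson correlation between the two administrations.
r, _ = stats.pearsonr(time1, time2)
print(f"test-retest reliability ≈ {r:.2f}")          # near 100 / (100 + 25) = .80
```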
guidelines for using test re test reliability
1. Test-retest reliability is appropriate when the length of time between the two administrations is long enough to offset the effects of memory or practice.
2. When there is little reason to believe that memory will affect responses to a measure, test-retest reliability may be employed. Memory may have minimal effects in situations where (a) a large number of items appear on the measure, (b) the items are too complex to remember (for example, items involving detailed drawings, complex shapes, or detailed questions), and (c) retesting occurs after at least eight weeks.
3. When it can be confidently determined that nothing has occurred between the two testings that will affect responses (learning, for example), test-retest can be used.
4. When information is available on only a single-item measure, test-retest reliability may be used.
Parallel forms; what does it do, how it’s measured, what it means, etc.
- Helps offset the effects of memory on test-retest reliability.
- A Pearson correlation is calculated between the two sets of scores from different but equivalent items.
- As the coefficient approaches 1.00, the set of measures is viewed as equivalent, or the same, for the attribute measured.
- If equivalent forms are administered on different occasions, then this design also reflects the degree of temporal stability of the measure. In such cases, the reliability estimate is referred to as a coefficient of equivalence and stability. The use of equivalent forms administered over time accounts for the influence of random error due to the test content (over equivalent forms) and transient error (across situations).
Parallel forms: the basic process
1. The process of developing equivalent forms begins with the identification of a universe of possible math ability items (the item domain).
2. Items from this domain are administered to a large sample of individuals representative of those to whom the math ability test will be given.
3. Individuals' responses are used to identify the difficulty of the items through item analyses and to ensure the items are measuring the same math ability construct.
4. Next, the items are rank-ordered according to their difficulty and randomly assigned in pairs to form two sets of items (Form A and Form B).
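A sketch of step 4 only, with made-up item difficulty values from a hypothetical pilot item analysis; the pairing-and-random-assignment logic, not the specific values, is the point.

```python
import random

# Hypothetical item difficulties (proportion correct) from the pilot analysis.
item_difficulty = {"item_1": 0.35, "item_2": 0.72, "item_3": 0.48, "item_4": 0.61,
                   "item_5": 0.55, "item_6": 0.44, "item_7": 0.67, "item_8": 0.39}

ranked = sorted(item_difficulty, key=item_difficulty.get)   # rank-order by difficulty
random.seed(0)

form_a, form_b = [], []
for first, second in zip(ranked[0::2], ranked[1::2]):
    pair = [first, second]          # adjacent items form a matched pair
    random.shuffle(pair)            # randomize which form receives which item
    form_a.append(pair[0])
    form_b.append(pair[1])

print("Form A:", form_a)
print("Form B:", form_b)
```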
difference between alternate forms and parallel forms?
Because it is difficult to meet all of the criteria of equivalent forms, some writers use the term alternate forms to refer to forms that approximate but do not meet the criteria of parallel forms.
internal consistency reliability measurement
An index of a measure's similarity of content is an internal consistency reliability estimate. Basically, an internal consistency reliability estimate shows the extent to which all parts of a measure (such as items or questions) are similar in what they measure. Thus a selection measure is internally consistent or homogeneous when individuals' responses on one part of the measure are related to their responses on other parts. If the sample of selected items truly assesses the same concept, then respondents should answer these items in the same way. What must be determined for the items chosen is whether respondents answer the sample of items similarly.
list the procedures most often applied to obtain internal consistency estimates
split-half, Kuder-Richardson (KR-20), and Cronbach's alpha
describe split half reliability
- Single administration of the measure.
- The measure is then divided or split into two halves.
- Performance on half 1 should be associated with performance on half 2 if all items measure the same attribute.
- The most common method is to split by even- and odd-numbered test items.
- Problem: there are many ways to split the test, and not all splits yield the same reliability estimate; this method is not used as much today.
- Limitation: it cannot detect any errors associated with time.
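A sketch of an odd/even split-half estimate on simulated data; the Spearman-Brown correction at the end is common practice for adjusting the half-test correlation to full test length, though the card above does not mention it.

```python
import numpy as np

# Simulated item responses driven by a single latent ability, so the halves
# should correlate. Data are made up for illustration only.
rng = np.random.default_rng(2)
ability = rng.normal(0, 1, size=200)
thresholds = np.linspace(-1, 1, 20)
p_correct = 1 / (1 + np.exp(-(ability[:, None] - thresholds)))
responses = (rng.random((200, 20)) < p_correct).astype(float)

# Split into odd- and even-numbered items and correlate the two half-scores.
odd_half = responses[:, 0::2].sum(axis=1)
even_half = responses[:, 1::2].sum(axis=1)
r_half = np.corrcoef(odd_half, even_half)[0, 1]

# Spearman-Brown correction to estimate reliability at full test length.
r_full = 2 * r_half / (1 + r_half)
print(f"half-test r = {r_half:.2f}, corrected split-half reliability = {r_full:.2f}")
```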
Describe Kuder-Richardson reliability estimates (KR20)
- Single administration of the measure.
- Used to determine the consistency of answers on any measure whose items are scored dichotomously (such as right or wrong answers).
- Estimates the average of the reliability coefficients that would result from all possible ways of subdividing the test.
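A minimal implementation of the standard KR-20 formula (the function name and example data are mine, not from the text):

```python
import numpy as np

def kr20(responses: np.ndarray) -> float:
    """KR-20 for dichotomously scored items (rows = people, columns = items)."""
    k = responses.shape[1]
    p = responses.mean(axis=0)                       # proportion passing each item
    q = 1 - p
    total_var = responses.sum(axis=1).var()          # variance of total scores
    return (k / (k - 1)) * (1 - (p * q).sum() / total_var)

# Example with made-up 0/1 data: 5 people x 4 items
x = np.array([[1, 1, 1, 0],
              [1, 1, 0, 0],
              [1, 0, 1, 1],
              [0, 0, 0, 0],
              [1, 1, 1, 1]], dtype=float)
print(round(kr20(x), 2))
```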
Describe Cronbach's alpha reliability estimate
- Can be used for continuous item responses.
- A general version of KR-20.
- Still represents the average reliability computed from all possible split-half reliabilities.
- Reflects the average correlation of each item with every other item on a measure.
- If coefficient alpha is unacceptably low, then the items on the selection measure may be assessing more than one characteristic.
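A minimal implementation of coefficient alpha using the standard formula (the function name and example ratings are mine):

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Coefficient alpha for item scores (rows = people, columns = items);
    items may be dichotomous or continuous (e.g., Likert ratings)."""
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1)           # variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)       # variance of total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Example with made-up 5-point ratings: 6 respondents x 4 items
ratings = np.array([[4, 5, 4, 5],
                    [2, 3, 2, 2],
                    [5, 5, 4, 4],
                    [3, 2, 3, 3],
                    [1, 2, 1, 2],
                    [4, 4, 5, 4]], dtype=float)
print(round(cronbach_alpha(ratings), 2))
```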
Describe interrater reliability estimates
- Used when scoring is based on individual judgment.
- Rater behaviors and rater biases contribute to rater error.
- Defined as the consistency among raters; determines whether multiple raters are consistent in their judgments.
- Computation can involve interrater agreement, interclass correlation, and intraclass correlation.
Describe interrater agreement
- Some agreement indices are not good estimators of reliability.
- Often fails to take into account rater agreement that occurs by chance.
- Generally restricted to nominal or categorical data, which reduces its flexibility.
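A small illustration of why raw agreement can overstate reliability: percent agreement versus Cohen's kappa, which corrects for chance agreement. The ratings are hypothetical, and sklearn is just one convenient implementation.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Hypothetical categorical judgments of 10 candidates by two raters.
rater1 = np.array(["hire", "hire", "reject", "hold", "hire",
                   "reject", "hold", "hire", "reject", "hire"])
rater2 = np.array(["hire", "hold", "reject", "hold", "hire",
                   "reject", "hire", "hire", "reject", "hire"])

percent_agreement = (rater1 == rater2).mean()      # ignores agreement expected by chance
kappa = cohen_kappa_score(rater1, rater2)          # corrects for chance agreement

print(f"percent agreement = {percent_agreement:.2f}, Cohen's kappa = {kappa:.2f}")
```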
interclass correlation
- Employed when two raters make judgments on an interval scale.
- Computed with the Pearson product-moment correlation or Cohen's weighted kappa.
- Shows the amount of error between two raters.
- Relatively low interclass correlations indicate that more specific operational criteria for making the ratings, or additional rater training in how to apply the rating criteria, may be needed to enhance interrater reliability.
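An illustration with hypothetical ratings from two raters on an interval scale, computing the Pearson correlation and, as one option, a quadratic-weighted kappa via sklearn:

```python
import numpy as np
from scipy import stats
from sklearn.metrics import cohen_kappa_score

# Hypothetical interview ratings of 8 candidates by two raters on a 1-7 scale.
rater1 = np.array([5, 6, 3, 7, 4, 5, 2, 6])
rater2 = np.array([4, 6, 3, 6, 5, 5, 2, 7])

r, _ = stats.pearsonr(rater1, rater2)                           # interclass correlation
wk = cohen_kappa_score(rater1, rater2, weights="quadratic")     # Cohen's weighted kappa
print(f"Pearson r = {r:.2f}, weighted kappa = {wk:.2f}")
```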
describe Intraclass correlation
- When three or more raters have made ratings on one or more targets, intraclass correlations can be calculated.
- This procedure is usually viewed as the best way to determine whether multiple raters differ in their subjective scores or ratings on the trait or behavior being assessed.
- The intraclass correlation shows the average relationship among raters for all targets being rated.
- Indicates how much of the difference in ratings is due to true differences versus measurement error.
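A sketch of one common formulation, the one-way random-effects intraclass correlation ICC(1); the function name and ratings matrix are illustrative, not from the text.

```python
import numpy as np

def icc_oneway(ratings: np.ndarray) -> float:
    """One-way random-effects ICC(1): rows = targets (ratees), columns = raters."""
    n, k = ratings.shape
    target_means = ratings.mean(axis=1)
    grand_mean = ratings.mean()
    # Between-target and within-target mean squares from a one-way ANOVA.
    ms_between = k * ((target_means - grand_mean) ** 2).sum() / (n - 1)
    ms_within = ((ratings - target_means[:, None]) ** 2).sum() / (n * (k - 1))
    # Proportion of rating variance attributable to true differences among targets.
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

# Hypothetical ratings of 5 ratees by 3 raters on a 1-7 scale.
ratings = np.array([[6, 5, 6],
                    [3, 4, 3],
                    [7, 6, 7],
                    [4, 4, 5],
                    [2, 3, 2]], dtype=float)
print(f"ICC(1) = {icc_oneway(ratings):.2f}")
```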