Test Bias and Test Utility Flashcards
Statistical decision analysis is:
(a) A method of selecting the best pass mark for your diagnostic test by plotting hit rate against false positive rate.
(b) A method of determining how many false positives are present in your data.
(c) A method for testing the suitability of job applicants for job positions.
(d) A method that can be used to evaluate the usefulness of a test.
The answer was d. Statistical decision analysis (utility analysis) can be used to evaluate the usefulness of a test (for example, in selecting employees). Note that decision analysis and utility analysis are the same thing. The Lecture 9 slide entitled “utility analysis: more sophisticated options” states: “A number of researchers have demonstrated that the use of utility analysis (statistical decision analysis) can save employers substantial amounts of money”.
Imagine I used a psychological test to decide which of 30 applicants to hire for 15 potential job positions. The test has high criterion-related validity with respect to job performance. If I set a very high cut-off for the test to minimise false positives, then what would be the most likely outcome?
(a) I would be unlikely to be able to select enough people to fill all the job positions.
(b) The worst candidates would be likely to perform better at the test than the best candidates.
(c) I could obtain my 15 applicants but it would be unlikely that I had selected the best people for the job.
(d) Most of the applicants would pass the test and I would end up with more people than job positions.
The answer was a. See Lecture 9. A high cut-off for a high-validity test will result in selecting the best people, but I risk not ending up with enough of them (because only a tiny number of people would pass the test and hence be eligible).
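The trade-off in this answer can be checked with a little arithmetic. The sketch below is illustrative only: it assumes test scores are standardised and normally distributed and that applicants pass independently of one another (neither assumption comes from the lecture), and the function names are made up for this example. It computes how many of the 30 applicants would be expected to clear a cut-off, and the chance that at least 15 of them do.

```python
from math import comb
from statistics import NormalDist


def pass_probability(cutoff_z: float) -> float:
    """Chance that one applicant scores above the cut-off (in SD units),
    assuming standardised, normally distributed test scores."""
    return 1.0 - NormalDist().cdf(cutoff_z)


def prob_enough_passers(n_applicants: int, n_positions: int, cutoff_z: float) -> float:
    """Binomial probability that at least n_positions applicants pass."""
    p = pass_probability(cutoff_z)
    return sum(
        comb(n_applicants, k) * p**k * (1 - p) ** (n_applicants - k)
        for k in range(n_positions, n_applicants + 1)
    )


# Hypothetical very high cut-off: 2 standard deviations above the mean.
expected_passers_high = 30 * pass_probability(2.0)
chance_to_fill_high = prob_enough_passers(30, 15, 2.0)

# Hypothetical moderate cut-off at the mean, for comparison.
chance_to_fill_mean = prob_enough_passers(30, 15, 0.0)

print(f"Expected passers at z = 2: {expected_passers_high:.2f}")
print(f"P(at least 15 pass) at z = 2: {chance_to_fill_high:.2e}")
print(f"P(at least 15 pass) at z = 0: {chance_to_fill_mean:.2f}")
```

Under these assumptions, fewer than one applicant is expected to pass the very high cut-off, so filling all 15 positions is practically impossible, whereas a cut-off at the mean fills them more often than not.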
One group of people scores higher on a test designed to predict job performance than another group of people. Overall the test was found to be valid. On a scatterplot (job performance versus test score), the two groups are best modelled with a single regression line. What does this mean?
(a) Either the test is not biased or the test score and job performance measure are both biased.
(b) The test is biased but can still predict job performance to the same degree within each group separately.
(c) The test is biased and yields a different criterion validity coefficient for each group.
(d) The test is differentially valid.
The answer was a. See Lecture 9. The question describes Scenario 1 in the slides “Conceptualising test bias”. Either the test is not biased, or both the job performance measure and the test score are equivalently biased.
Statement 1: In US court cases, intelligence tests, such as the WISC and Stanford-Binet, have been CONSISTENTLY judged to be racially biased when used in educational settings.
Statement 2: In Australian employment law, it is possible to overcome a claim of discrimination on the basis of a disability, if you can demonstrate that the disability is directly relevant to some inherent requirement of the job.
(a) Both statements are true.
(b) Statement 1 true; Statement 2 false.
(c) Statement 1 false; Statement 2 true.
(d) Both statements are false.
The answer was c. See Lecture 9. Statement 1 is not true because US courts have actually been inconsistent in their rulings regarding the use of IQ testing in education. In the example in the lecture, two courts in different parts of the US upheld opposite verdicts at virtually the same time. Statement 2 is true: “To overcome a claim of discrimination, the deficit must be directly tied to an inherent requirement of the job.”
Statement 1: When using aptitude and personality tests for employee recruitment in Australia, the tests need to have content validity with regard to the requirements of the job.
Statement 2: In principle, it is possible for a test to have good utility even if it does not have decent reliability and validity.
(a) Both statements are true.
(b) Statement 1 true; Statement 2 false.
(c) Statement 1 false; Statement 2 true.
(d) Both statements are false.
The answer was a. See Lecture 9. Statement 1 is true. “Under current Australian law, all tests must measure the person for the inherent requirements of the job and not the person in the abstract (i.e. content validity as well as criterion validity needed).” Statement 2 is true. This might happen when a test is being used for a purpose other than generating a meaningful test score. For example, lie detector tests might be useful for putting pressure on individuals regardless of whether the tests actually work or not.
Schmidt et al. (1979) used utility analysis to evaluate the efficacy of a test (the Programmer Aptitude Test) for selecting programmers over traditional non-test methods such as interviews. The test led to an estimated saving of $6 million per year. What was the key factor behind this saving?
(a) The test could be used in addition to the non-test methods.
(b) The test was quicker to administer than the non-test methods.
(c) The test had much better psychometric validity than the non-test methods.
(d) The test had much better face validity than the non-test methods.
The answer was c. See Lecture 9. The key difference was the test validity (i.e. it was much better at selecting out the best people for the job than previous options).
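To see why validity drives the saving, here is a minimal sketch of one widely used form of the utility model (often called the Brogden-Cronbach-Gleser model). The formula, the function names, and every number below are assumptions for illustration; they are not taken from the lecture slides or from the data of Schmidt et al. The sketch also assumes normally distributed test scores and top-down hiring.

```python
from statistics import NormalDist


def mean_z_of_selected(selection_ratio: float) -> float:
    """Average standardised test score of those hired, assuming normal
    scores and hiring the top `selection_ratio` proportion of applicants."""
    cutoff_z = NormalDist().inv_cdf(1.0 - selection_ratio)
    return NormalDist().pdf(cutoff_z) / selection_ratio


def utility_gain(n_hired, years_tenure, validity_new, validity_old,
                 sd_performance_dollars, selection_ratio,
                 extra_cost_per_applicant=0.0):
    """Estimated dollar gain from replacing the old selection method with
    the new one. The benefit scales with the difference in validity."""
    n_applicants = n_hired / selection_ratio
    benefit = (n_hired * years_tenure * (validity_new - validity_old)
               * sd_performance_dollars * mean_z_of_selected(selection_ratio))
    return benefit - n_applicants * extra_cost_per_applicant


# Hypothetical inputs: 15 hires from 30 applicants, 5 years average tenure,
# job performance SD worth $10,000 per year, $20 extra testing cost each.
gain_better_test = utility_gain(15, 5, 0.6, 0.2, 10_000, 0.5, 20.0)
gain_same_validity = utility_gain(15, 5, 0.2, 0.2, 10_000, 0.5, 0.0)

print(f"Gain when the test is more valid: ${gain_better_test:,.0f}")
print(f"Gain when validity is unchanged:  ${gain_same_validity:,.0f}")
```

In this form of the model, a test that is no more valid than the method it replaces yields no benefit at all, however quick or convenient it is, which is consistent with validity being the key factor in the answer above.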
Imagine I used a psychological test to decide which of 30 applicants to hire for 15 potential job positions. If the test had a trivially small (but positive) criterion validity coefficient with respect to job performance, what would be the most likely outcome?
(a) I would be unlikely to be able to select enough people to fill all the job positions.
(b) The worst candidates would be likely to perform better at the test than the best candidates.
(c) I could obtain my 15 applicants but it would be unlikely that I had selected the best people for the job.
(d) Most of the applicants would pass the test and I would end up with more people than job positions.
The answer was c. See Lecture 9. If the validity coefficient is close to zero, then I would be effectively selecting applicants at chance. Hence it would be unlikely I would end up selecting the best people for the job.
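The "selecting at chance" point can be illustrated with a small simulation. This is a sketch under stated assumptions, not something from the lecture: true job performance and test scores are simulated as standard normal variables correlated at the chosen validity, the top 15 test scorers are hired, and we count how many of them are also among the 15 truly best performers. The function name and the validity values 0.05 and 0.9 are hypothetical.

```python
import random


def mean_overlap(validity, n_applicants=30, n_positions=15,
                 n_trials=2000, seed=0):
    """Average number of hires (top scorers on the test) who are also
    among the truly best performers, over many simulated applicant pools."""
    rng = random.Random(seed)
    noise_weight = (1 - validity**2) ** 0.5
    total = 0
    for _ in range(n_trials):
        performance = [rng.gauss(0, 1) for _ in range(n_applicants)]
        # Test score correlates with performance at roughly `validity`.
        scores = [validity * p + noise_weight * rng.gauss(0, 1)
                  for p in performance]
        by_score = sorted(range(n_applicants),
                          key=lambda i: scores[i], reverse=True)
        by_performance = sorted(range(n_applicants),
                                key=lambda i: performance[i], reverse=True)
        hired = set(by_score[:n_positions])
        best = set(by_performance[:n_positions])
        total += len(hired & best)
    return total / n_trials


overlap_low = mean_overlap(0.05)   # trivially small validity
overlap_high = mean_overlap(0.9)   # high validity, for comparison

print(f"Validity 0.05: about {overlap_low:.1f} of 15 hires are truly top-15")
print(f"Validity 0.90: about {overlap_high:.1f} of 15 hires are truly top-15")
```

Picking 15 of 30 applicants purely at random would capture 7.5 of the best 15 on average, and the near-zero-validity test does barely better than that. All 15 positions still get filled, which is why option (c) rather than option (a) is the likely outcome.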
One group of people scores higher on a test designed to predict job performance than another group of people. Overall, the test is found to be valid. On a scatterplot (job performance versus test score), the two groups are best modelled using two separate but parallel regression lines. What does this mean?
(a) The test and the measure of job performance are equally biased.
(b) The test is biased but can still predict job performance within each group separately.
(c) The test is biased and yields a different criterion validity coefficient for each group.
(d) The test is differentially valid.
The answer was b. See Lecture 9. The question describes Scenario 2 in the slides “Conceptualising test bias”. There is an intercept bias, but the test is equally predictive of job performance within each group separately.
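The parallel-lines pattern can be made concrete by fitting a separate least-squares line to each group. The sketch below uses made-up data (not from the lecture) in which both groups show identical job performance, but group B scores 20 points lower on the test. The helper `fit_line` is a hypothetical name for an ordinary least-squares fit written out by hand.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: same job performance in both groups,
# but group B's test scores are shifted 20 points lower.
scores_a = [40, 50, 60, 70, 80]
scores_b = [20, 30, 40, 50, 60]
performance_a = [31, 33, 42, 43, 51]
performance_b = [31, 33, 42, 43, 51]

slope_a, intercept_a = fit_line(scores_a, performance_a)
slope_b, intercept_b = fit_line(scores_b, performance_b)

print(f"Group A: slope {slope_a:.2f}, intercept {intercept_a:.2f}")
print(f"Group B: slope {slope_b:.2f}, intercept {intercept_b:.2f}")
```

Equal slopes mean the test predicts performance equally well within each group; the differing intercepts are the intercept bias, because a single common line would under-predict performance for group B at any given test score. If both groups fell on the same line (equal slopes and equal intercepts), that would correspond to Scenario 1 instead.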
Statement 1: In the case “Australian Industrial Relations Commission vs. Coms21” (1999), the court held that Coms21 had terminated the employment of five individuals unfairly because, IN ADDITION TO the usual competency tests, the terminations were also based on personality profiles.
Statement 2: The current national Australian body dealing with unfair workplace practices is the Fair Work Commission.
(a) Both statements are true.
(b) Statement 1 true; Statement 2 false.
(c) Statement 1 false; Statement 2 true.
(d) Both statements are false.
The answer was c. See Lecture 9. Statement 1 is not true, because Coms21 ONLY used personality tests and one of the criticisms levelled at them by the court was that they failed to use competency tests. Statement 2 is true - see www.fwc.gov.au.
The research on the “Pygmalion effect” (Rosenthal & Jacobson, 1966) found that if teachers were told that certain students were likely to do well academically (when in reality these students were identified at random) then the selected students showed greater IQ score gains:
(a) Across all age groups.
(b) Only in the younger classes.
(c) Only in the older classes.
(d) Only if they were of African-American descent.
The answer was b. See Lecture 9. The bar graph depicting the results of the study indicates that there is a difference in IQ gain between the experimental groups in Grades 1 and 2, but no difference for older grades.