The Tyranny of Metrics Flashcards
Jerry Muller
The series, Bodies, written by Jed Mercurio, a former hospital physician, takes place in the obstetrics and gynecology ward of a metropolitan hospital. In the first episode, a newly arrived senior surgeon performs an operation on a patient with complex comorbidities, after which she dies. His rival then provides him with this advice: “The superior surgeon uses his superior judgment to steer clear of any situation that might test his superior ability.”
Bodies is a medical drama, but the phenomena it depicts exist in the real world. Numerous studies have shown that when surgeons, for example, are rated or remunerated according to their success rates, some respond by refusing to operate on patients with more complex or critical conditions. Excluding the more difficult cases—those that involve the likelihood of poorer outcomes—improves the surgeons’ success rates, and hence their metrics, their reputation, and their remuneration. That of course comes at the expense of the excluded patients, who pay with their lives. But those deaths do not show up in the metrics.
Gaming the metrics occurs in every realm: in policing; in primary, secondary, and higher education; in medicine; in nonprofit organizations; and, of course, in business. And gaming is only one class of problems that inevitably arise when using performance metrics as the basis of reward or sanction. There are things that can be measured. There are things that are worth measuring. But what can be measured is not always what is worth measuring; what gets measured may have no relationship to what we really want to know. The costs of measuring may be greater than the benefits. The things that get measured may draw effort away from the things we really care about. And measurement may provide us with distorted knowledge—knowledge that seems solid but is actually deceptive.
Starting in the late 1990s, Gallup conducted a multiyear examination of high-performing teams that eventually involved more than 1.4 million employees, 50,000 teams, and 192 organizations. Gallup asked both high- and lower-performing teams questions on numerous subjects, from mission and purpose to pay and career opportunities, and isolated the questions on which the high-performing teams strongly agreed and the rest did not. It found at the beginning of the study that almost all the variation between high- and lower-performing teams was explained by a very small group of items. The most powerful one proved to be “At work, I have the opportunity to do what I do best every day.” Business units whose employees chose “strongly agree” for this item were 44% more likely to earn high customer satisfaction scores, 50% more likely to have low employee turnover, and 38% more likely to be productive.
Taken from an HBR Article on performance management
We set out to see whether those results held at Deloitte. First we identified 60 high-performing teams, which involved 1,287 employees and represented all parts of the organization. For the control group, we chose a representative sample of 1,954 employees. To measure the conditions within a team, we employed a six-item survey. When the results were in and tallied, three items correlated best with high performance for a team: “My coworkers are committed to doing quality work,” “The mission of our company inspires me,” and “I have the chance to use my strengths every day.” Of these, the third was the most powerful across the organization.
At the end of every project (or once every quarter for long-term projects) we will ask team leaders to respond to four future-focused statements about each team member. We’ve refined the wording of these statements through successive tests, and we know that at Deloitte they clearly highlight differences among individuals and reliably measure performance. Here are the four:
- Given what I know of this person’s performance, and if it were my money, I would award this person the highest possible compensation increase and bonus [measures overall performance and unique value to the organization on a five-point scale from “strongly agree” to “strongly disagree”].
- Given what I know of this person’s performance, I would always want him or her on my team [measures ability to work well with others on the same five-point scale].
- This person is at risk for low performance [identifies problems that might harm the customer or the team on a yes-or-no basis].
- This person is ready for promotion today [measures potential on a yes-or-no basis].
Taken from an HBR Article on performance management
We ask leaders what they’d do with their team members, not what they think of them.
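For concreteness, here is a minimal sketch of how responses to the four statements above could be captured as a data record. It is an illustrative assumption, not Deloitte’s actual system: the PerformanceSnapshot class, the field names, and the scale values are all hypothetical, chosen only to mirror the two five-point items and two yes-or-no items quoted above.

```python
from dataclasses import dataclass
from enum import IntEnum


class AgreementScale(IntEnum):
    """Five-point agreement scale for the first two items (illustrative values)."""
    STRONGLY_DISAGREE = 1
    DISAGREE = 2
    NEUTRAL = 3
    AGREE = 4
    STRONGLY_AGREE = 5


@dataclass
class PerformanceSnapshot:
    """One leader's end-of-project (or quarterly) responses about one team member.

    Hypothetical structure modeled on the four statements quoted above.
    """
    team_member_id: str
    team_leader_id: str
    # Item 1: would award the highest possible compensation increase and bonus
    award_top_compensation: AgreementScale
    # Item 2: would always want this person on my team
    always_want_on_team: AgreementScale
    # Item 3: at risk for low performance (yes/no)
    at_risk_of_low_performance: bool
    # Item 4: ready for promotion today (yes/no)
    ready_for_promotion: bool


# Example usage: a leader records one snapshot at the end of a project.
snapshot = PerformanceSnapshot(
    team_member_id="tm-001",
    team_leader_id="lead-042",
    award_top_compensation=AgreementScale.AGREE,
    always_want_on_team=AgreementScale.STRONGLY_AGREE,
    at_risk_of_low_performance=False,
    ready_for_promotion=True,
)
print(snapshot)
```

Note how the record asks the leader what they would do with the team member (pay, keep, promote), not what they think of them, which is the framing the article emphasizes.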
Research into the practices of the best team leaders reveals that they conduct regular check-ins with each team member about near-term work. These brief conversations allow leaders to set expectations for the upcoming week, review priorities, comment on recent work, and provide course correction, coaching, or important new information. The conversations provide clarity regarding what is expected of each team member and why, what great work looks like, and how each can do his or her best work in the upcoming days—in other words, exactly the trinity of purpose, expectations, and strengths that characterizes our best teams.
Taken from an HBR Article on performance management
Our design calls for every team leader to check in with each team member once a week. For us, these check-ins are not in addition to the work of a team leader; they are the work of a team leader. If a leader checks in less often than once a week, the team member’s priorities may become vague and aspirational, and the leader can’t be as helpful—and the conversation will shift from coaching for near-term work to giving feedback about past performance. In other words, the content of these conversations will be a direct outcome of their frequency: If you want people to talk about how to do their best work in the near future, they need to talk often. And so far we have found in our testing a direct and measurable correlation between the frequency of these conversations and the engagement of team members. Very frequent check-ins (we might say radically frequent check-ins) are a team leader’s killer app.
A key premise of metric fixation concerns the relationship between measurement and improvement. There is a dictum (wrongly) attributed to the great nineteenth-century physicist Lord Kelvin: “If you cannot measure it, you cannot improve it.” In 1986 the American management guru Tom Peters embraced the motto “What gets measured gets done,” which became a cornerstone belief of metrics.3 In time, some drew the conclusion that “anything that can be measured can be improved.”
The key components of metric fixation are
The belief that it is possible and desirable to replace judgment, acquired by personal experience and talent, with numerical indicators of comparative performance based upon standardized data (metrics);
The belief that making such metrics public (transparent) assures that institutions are actually carrying out their purposes (accountability);
The belief that the best way to motivate people within these organizations is by attaching rewards and penalties to their measured performance, rewards that are either monetary (pay-for-performance) or reputational (rankings).
Metric fixation is the persistence of these beliefs despite their unintended negative consequences when they are put into practice.6 It occurs because not everything that is important is measurable, and much that is measurable is unimportant. (Or, in the words of a familiar dictum, “Not everything that can be counted counts, and not everything that counts can be counted.”7) Most organizations have multiple purposes, and that which is measured and rewarded tends to become the focus of attention, at the expense of other essential goals. Similarly, many jobs have multiple facets, and measuring only a few aspects creates incentives to neglect the rest.8 When organizations committed to metrics wake up to this fact, they typically add more performance measures—which creates a cascade of data, data that becomes ever less useful, while gathering it sucks up more and more time and resources.
Because the theory of motivation behind pay for measured performance is stunted, results are often at odds with expectations. The typical pattern of dysfunction was formulated in 1975 by two social scientists operating on opposite sides of the Atlantic, in what appears to have been a case of independent discovery. What has come to be called “Campbell’s Law,” named for the American social psychologist Donald T. Campbell, holds that “[t]he more any quantitative social indicator is used for social decision-making, the more subject it will be to corruption pressures and the more apt it will be to distort and corrupt the social processes it is intended to monitor.”
Trying to force people to conform their work to preestablished numerical goals tends to stifle innovation and creativity—valuable qualities in most settings. And it almost inevitably leads to a valuation of short-term goals over long-term purposes.
Some Flaws:
Measuring the most easily measurable. There is a natural human tendency to try to simplify problems by focusing on the most easily measurable elements.1 But what is most easily measured is rarely what is most important, indeed sometimes not important at all. That is the first source of metric dysfunction.
Measuring inputs rather than outcomes. It is often easier to measure the amount spent or the resources injected into a project than the results of the efforts. So organizations measure what they’ve spent, rather than what they produce, or they measure process rather than product.
An example: Arnold frequently found himself inspecting schools where students ingested mountains of facts and arithmetic, but were bereft of analytic ability and utterly incapable of understanding sophisticated prose or poetry. They were taught not to reason but to cram.6 Both before and especially after the adoption of “payment for performance,” he criticized such education for being “far too little formative and humanizing . . . much in it, which its administrators point to as valuable results, is in truth mere machinery.”
Degrading information quality through standardization. Quantification is seductive, because it organizes and simplifies knowledge. It offers numerical information that allows for easy comparison among people and institutions.2 But that simplification may lead to distortion, since making things comparable often means that they are stripped of their context, history, and meaning.3 The result is that the information appears more certain and authoritative than is actually the case: the caveats, the ambiguities, and uncertainties are peeled away, and nothing does more to create the appearance of certain knowledge than expressing it in numerical form.
Gaming the metrics takes a variety of forms.
Gaming through creaming. This takes place when practitioners find simpler targets or prefer clients with less challenging circumstances, making it easier to reach the metric goal, but excluding cases where success is more difficult to achieve.
Improving numbers by lowering standards. One way of improving metric scores is by lowering the criteria for scoring. Thus, for example, graduation rates of high schools and colleges can be increased by lowering the standards for passing. Or airlines improve their on-time performance by increasing the scheduled flying time of their flights.
Improving numbers through omission or distortion of data. This strategy involves leaving out inconvenient instances, or classifying cases in a way that makes them disappear from the metrics. Police forces can “reduce” crime rates by booking felonies as misdemeanors, or by deciding not to book reported crimes at all.
Cheating. Outright cheating can take many forms. The higher the rewards and stakes, the higher the motivation to cheat.
McNamara’s Pentagon was characterized by what the military strategist Edward Luttwak called “the wholesale substitution of civilian mathematical analysis for military expertise. The new breed of the ‘systems analysts’ introduced new standards of intellectual discipline and greatly improved bookkeeping methods, but also a trained incapacity to understand the most important aspects of military power, which happen to be non-measurable.” The various armed forces sought to maximize measurable “production”: the air force through the number of bombing sorties; artillery through the number of shells fired; infantry through body counts, reflecting statistical indices devised by McNamara and his associates in the Pentagon. But, as Luttwak writes, “In frontless war where there are no clear lines on the map to show victory and defeat, the only true measure of progress must be political and non-quantifiable: the impact on the enemy’s will to continue to fight.”
What could be precisely measured tended to overshadow what was really important.
What about the things that cannot be measured?
Primary schools, for example, have their tasks of teaching reading, writing, and numeracy, and these perhaps could be monitored through standardized tests. But what about goals that are less measurable but no less important, such as instilling good behavior, inspiring a curiosity about the world, and fostering creative thought?
As a number of contemporary critics have observed, the fixation on quantifiable goals so central to metric fixation—though often implemented by politicians and policymakers who proclaim their devotion to capitalism—replicates many of the intrinsic faults of the Soviet system. Just as Soviet bloc planners set output targets for each factory to produce, so do bureaucrats set measurable performance targets for schools, hospitals, police forces, and corporations. And just as Soviet managers responded by producing shoddy goods that met the numerical targets set by their overlords, so do schools, police forces, and businesses find ways of fulfilling quotas with shoddy goods of their own: by graduating pupils with minimal skills, or downgrading grand theft to misdemeanor-level petty larceny, or opening dummy accounts for bank clients.