Your Performance Rating Says More About Your Manager Than You


Ever switched managers and watched your performance rating suddenly change? Your work stayed the same. Your projects, your output, your contributions. But the number on your review shifted. Maybe you went from "exceeds expectations" to "meets expectations." Or the reverse. Same work. Different rating.

That's not because one manager was better at evaluating you. It's because performance ratings measure the person giving the rating more than the person receiving it.

Deloitte went public with this problem in 2015 while trying to solve a different one entirely. By their own estimate, they were spending close to 2 million hours annually on performance management. Cascading objectives, consensus meetings, year-end reviews. The equivalent of 1,000 full-time employees doing nothing but paperwork.

They wanted to save time. What they found was that the entire measurement system was broken.

The Discovery

When Deloitte's leadership team started looking for ways to streamline performance management, they did what any large professional services firm would do: they reviewed the research.

What they found was unsettling.

Multiple studies showed that the majority of variance in performance ratings comes not from the employee being rated, but from the manager doing the rating. Your score reflects your boss's rating tendencies, their personal standards, their unconscious biases, their mood on the day they filled out the form.

The technical term is "idiosyncratic rater effect." Your rating is idiosyncratic to your rater.

One study analyzing over 4,000 managers found that, across performance dimensions, idiosyncratic rater variance accounted for 62% of the total variance in ratings. Actual employee performance accounted for only 21%. The rest was noise.

Think about that. When you get a performance rating, roughly three times as much of its variance traces to your manager's rating patterns as to your actual work.

I see this constantly when I work with scaling companies. An employee transfers teams. Their work doesn't change. Their rating does. Leadership assumes the employee improved or declined. What actually changed was the measurement instrument.

It's like switching thermometers and concluding the room got warmer.
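The variance decomposition above can be made concrete with a toy simulation (this is an illustration of the idea, not a reconstruction of the Scullen et al. study): generate each rating as true performance plus a per-rater offset plus noise, with the components sized to the reported 62/21 split, then recover the variance shares from the data.

```python
import numpy as np

# Toy model: rating = true performance + rater's personal offset + noise.
# Component variances are chosen to mirror the reported 62% / 21% split;
# the remaining 17% is residual noise.
rng = np.random.default_rng(0)
n_employees, n_raters = 500, 200

true_perf = rng.normal(0, np.sqrt(21), n_employees)   # employee signal
rater_bias = rng.normal(0, np.sqrt(62), n_raters)     # each rater's offset
noise = rng.normal(0, np.sqrt(17), n_employees)       # residual

# Each employee is rated by one randomly assigned rater.
assigned = rng.integers(0, n_raters, n_employees)
ratings = true_perf + rater_bias[assigned] + noise

total_var = ratings.var()
print(f"share from rater offsets: {rater_bias[assigned].var() / total_var:.0%}")
print(f"share from performance:   {true_perf.var() / total_var:.0%}")
```

Swap in different component variances and the recovered shares follow: the point is that an observer looking only at the ratings cannot tell the rater offset apart from the employee signal.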

Why Ratings Fail

In 1920, psychologist Edward Thorndike noticed that when Army officers rated soldiers positively on one trait, every other trait followed suit, regardless of evidence. He called it the "halo effect," and a century of research has confirmed it operates unconsciously and resists training.

But the halo effect is only part of the problem. The math reveals how deep the inconsistency goes. A 1996 meta-analysis examining interrater reliability across thousands of performance evaluations found that the correlation between two different supervisors rating the same employee was just .52. Squaring that correlation, the two raters shared only about a quarter of their variance (.52² ≈ .27). For a measurement system that drives compensation, promotions, and terminations, that level of agreement is remarkably low.

When the same manager rates the same employee at different times, reliability jumps above .80. The manager is highly consistent with themselves. They just disagree with other managers rating the same person.

This isn't random measurement error. It's systematic rater bias. Each manager has their own internal scale, their own standards, their own unconscious weights they apply to different aspects of performance. Those scales don't align.

Two managers can watch the same employee do the same work and come to different conclusions about quality. Both are being "objective" by their own standards. Those standards are just fundamentally different.

Ask yourself: if two of your managers rated the same employee right now, would they agree? If not, which one is measuring performance accurately?

The answer is neither. They're both measuring their own perception of performance, filtered through their individual biases, shaped by their personal experiences, and anchored on their previous ratings.

The Anchoring Problem

A 2025 study from Harvard researchers revealed just how self-perpetuating this bias becomes. They analyzed performance appraisals at a multinational company and found that when managers see employee self-evaluations before rating them, they anchor on those scores.

Employees who rated themselves lower received lower final ratings from managers. Not because their work was worse. Because the manager unconsciously adjusted their evaluation toward the anchor they'd been shown.

Worse, when the researchers looked at situations where managers couldn't see self-evaluations, they found managers were anchoring on something else: the employee's rating from the previous year. Last year's biased rating became this year's anchor, creating a self-perpetuating loop.

The study also found significant racial disparities in ratings that the anchoring mechanism helped perpetuate. The measurement system wasn't just capturing noise. It was amplifying and locking in bias through anchoring.

This is the fundamental problem Deloitte identified. You can't fix a measurement instrument that's measuring the wrong thing. Adding more raters doesn't help because you're just averaging multiple biased measurements. Training raters doesn't work because the bias operates unconsciously. Better rating scales don't solve it because the scale isn't the problem.

The question itself is wrong.

What Deloitte Built Instead

Deloitte realized something counterintuitive: people are terrible at rating other people's qualities, but they're quite good at describing their own intentions.

Ask a manager "How innovative is this employee?" and you get idiosyncratic noise. Ask a manager "Would you want this employee on your team?" and you get consistent, actionable data.

The difference is subtle but fundamental. The first question asks the manager to evaluate the employee. The second asks the manager to predict their own future behavior.

Deloitte replaced their entire performance management system with four questions that team leaders answer about each employee, either quarterly or at the end of each project:

"Given what I know of this person's performance, and if it were my money, I would award this person the highest possible compensation increase and bonus."

Not "rate their performance." Would you give them your money?

"Given what I know of this person's performance, I would always want them on my team."

Not "rate their teamwork." Do you want to work with them again?

"This person is at risk for low performance."

Not "rate their risk level." Yes or no, are they at risk?

"This person is ready for promotion today."

Not "rate their potential." Yes or no, promote them now?

Each question asks the manager about their own future actions, not their perception of the employee's qualities. The questions are phrased in extremes to force differentiation. And each question captures a single, concrete concept.

The system takes less than five minutes to complete. No lengthy forms. No rating scales. No paragraphs of written feedback. Just four questions about what the manager would actually do.

Why This Works

The logic of Deloitte's approach is that it sidesteps the idiosyncratic rater effect.

When you ask someone to rate another person's creativity, communication skills, or strategic thinking, you're asking them to project their internal standards onto someone else. Those standards vary wildly between raters. That's where the 62% variance comes from.

When you ask someone what they would do (give them a raise, keep them on the team, promote them), you're asking them to report their own intention. People are much more consistent about their own intended behaviors than about their perceptions of others.

A manager who says "I would always want this person on my team" is making a prediction about their own future behavior. That prediction might be wrong, but it's not suffering from idiosyncratic rater effects. It's measuring what the manager intends to do, which is ultimately what matters for compensation, development, and promotion decisions.

The shift is from "How do I perceive this person?" to "What would I do with this person?"

But Deloitte never published outcome data validating that the new system actually reduced idiosyncratic rater effects or improved the accuracy of performance differentiation. They reported internally that the design was "having a positive impact" and tested it incrementally across 2,000, then 7,000, then 40,000 employees. The published evidence speaks to process efficiency and employee engagement, not measurement validity.

The four questions still filter through a single manager's judgment. The halo effect still applies. A manager who doesn't like someone still won't want them on their team, regardless of that person's actual performance. Anchoring on last quarter's snapshot answers is just as possible as anchoring on last year's annual rating.

And the broader picture is sobering. Deloitte's own 2025 Global Human Capital Trends survey found that 61% of managers and 72% of workers still can't say they trust their organization's performance management process. A decade after Deloitte redesigned their own system and inspired an industry-wide movement away from traditional ratings, the trust problem persists across the field.

Deloitte asked a better question. But better questions don't eliminate human judgment. They redirect it. Whether that redirection produces meaningfully less biased outcomes remains an open question.

What This Means for Your Organization

Most companies aren't going to replicate Deloitte's system wholesale. But the principle applies everywhere: if your performance ratings show high variance between raters, you're not measuring performance. You're measuring raters.

Four questions diagnose whether your system has this problem:

  1. If the same employee moved to a different manager tomorrow, would their rating change? If yes, you're measuring the manager's tendencies, not the employee's work.
  2. Do your managers disagree about who your top performers are? If two managers can't agree on whether someone is excellent or merely good, your measurement system isn't capturing an objective reality. It's capturing subjective perception.
  3. Could you predict an employee's rating just by knowing which manager rates them? If certain managers consistently rate higher or lower than others, the rating reflects the manager's personal scale, not differentiated employee performance.
  4. Do employees' ratings change when they switch teams, even when their role and responsibilities stay similar? Same work, different rating means the measurement changed, not the performance.
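Question 3 can be checked directly from rating data. A minimal sketch: group ratings by manager and compute how much of the total variance manager identity alone explains (a one-way eta-squared). The data format here is hypothetical; substitute an export from your own HR system.

```python
from collections import defaultdict

def rater_variance_share(records):
    """Fraction of rating variance explained by manager identity alone
    (one-way eta-squared). `records` is a list of (manager, rating)
    pairs -- a hypothetical export from an HR system."""
    ratings = [r for _, r in records]
    grand_mean = sum(ratings) / len(ratings)
    by_manager = defaultdict(list)
    for manager, rating in records:
        by_manager[manager].append(rating)
    # Between-manager sum of squares vs. total sum of squares.
    ss_between = sum(len(rs) * (sum(rs) / len(rs) - grand_mean) ** 2
                     for rs in by_manager.values())
    ss_total = sum((r - grand_mean) ** 2 for r in ratings)
    return ss_between / ss_total

# Two managers on the same five-point scale, one systematically harsher.
data = [("A", 4), ("A", 5), ("A", 4), ("B", 2), ("B", 3), ("B", 2)]
print(f"{rater_variance_share(data):.0%} of variance traces to the manager")
# prints: 82% of variance traces to the manager
```

A high share doesn't prove bias on its own (managers may genuinely lead stronger or weaker teams), but if it stays high after accounting for team composition, you are looking at the idiosyncratic rater effect in your own data.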

If you answered yes to any of these, you have an idiosyncratic rater effect problem. The solution isn't better training, more detailed rubrics, or additional raters. The solution is changing what you measure.

Instead of asking managers to evaluate qualities, ask them about intentions and actions. Instead of "rate this person's leadership," ask "would you promote this person to a leadership role?" Instead of "rate their collaboration skills," ask "would you want them on your team for your most important project?"

You can't eliminate human judgment from performance evaluation. But you can measure things humans judge more consistently (their own intentions) rather than things they judge inconsistently (other people's qualities).

Deloitte moved the conversation forward by asking a better question. But asking a better question is not the same as solving the problem. The proof that intention-based measurement produces fairer, more accurate outcomes at scale doesn't exist yet.

Your performance rating might say you're a 3 out of 5. The question isn't whether that rating is accurate. It's whether we're measuring the right thing at all. And until someone builds the evidence base for what actually works, this remains a psychometric problem dressed up as an HR process.


References

Bohnet, I., Hauser, O. P., & Kristal, A. S. (2025). Can gender and race dynamics in performance appraisals be disrupted? The case of social influence. Journal of Economic Behavior & Organization, 235. https://doi.org/10.1016/j.jebo.2025.107032

Buckingham, M., & Goodall, A. (2015). Reinventing performance management. Harvard Business Review, 93(4), 40-50.

Deloitte. (2025). Employee performance management. 2025 Global Human Capital Trends. https://www.deloitte.com/us/en/insights/topics/talent/human-capital-trends/2025/employee-performance-management-optimization-effective-strategy.html

Scullen, S. E., Mount, M. K., & Goff, M. (2000). Understanding the latent structure of job performance ratings. Journal of Applied Psychology, 85(6), 956-970.

Thorndike, E. L. (1920). A constant error in psychological ratings. Journal of Applied Psychology, 4(1), 25-29.

Viswesvaran, C., Ones, D. S., & Schmidt, F. L. (1996). Comparative analysis of the reliability of job performance ratings. Journal of Applied Psychology, 81(5), 557-574.

Yalin Consulting
anil@yalin.consulting
