Performance reviews are one of the most important responsibilities of an engineering manager — and one of the hardest. The challenge isn't effort or intent. It's that the information needed to write fair reviews is spread across months of work, conversations, and tools. This page explains why reviews feel so difficult, what managers commonly try, and what's often missing.
Why performance reviews feel harder than they should
Most engineering managers genuinely want to write fair, accurate reviews. They care about their reports and take the responsibility seriously. And yet, when review season arrives, the process still feels broken.
The problem isn't a lack of diligence. It's that the human brain wasn't designed to retain six months of nuanced observations about multiple people working on different projects. Without a system for preserving context, managers are forced to reconstruct history from fragments: a Slack thread here, a vague memory of a standup comment there, maybe a few bullet points from a 1:1 doc that hasn't been updated since September.
What gets remembered tends to be whatever happened recently, whatever was dramatic, or whoever advocated loudest for their own work. The steady contributor who quietly shipped critical infrastructure in Q1? Easy to underweight. The engineer who had a rough week right before reviews? Easy to overweight.
There's also a structural mismatch. Performance reviews ask managers to evaluate growth, impact, and behavior over an extended period — but the information needed to do that is scattered across dozens of tools, conversations, and contexts that weren't designed for retrieval. Pull requests don't capture mentorship. Jira tickets don't reflect how someone handled a production incident at 2am. Slack messages disappear into the void.
The result is that even well-intentioned managers end up writing reviews that are more impressionistic than evidential. They know something is missing, but there's no practical way to recover it.
What managers usually try (and why it doesn't work)
When managers realize their memory isn't enough, they reach for workarounds. These approaches are reasonable — they're what anyone would try given the constraints. But each comes with tradeoffs that make reviews less fair, not more.
Spreadsheets and running docs. The most common approach is some version of "I'll just write things down as they happen." A shared doc, a personal spreadsheet, a note in the 1:1 file. In theory, this works. In practice, it requires consistent effort over months, often during the busiest periods when there's no time for documentation. The spreadsheet gets updated enthusiastically for the first few weeks, then sporadically, then not at all until two days before reviews are due. What you end up with is a record of whatever you happened to notice during the brief windows when you remembered to write things down — which is its own form of bias.
Last-minute PR scraping. When the spreadsheet fails, the backup plan is usually to pull a list of merged PRs and use that as a proxy for contribution. This is fast and feels objective — there's a number attached. But PR counts reward a specific type of work: small, frequent, code-centric changes. They penalize engineers who spend weeks on a single complex refactor, or who primarily contribute through design docs, code review, mentoring, or incident response. The engineer who merged 47 PRs looks more productive than the one who shipped one critical system that took three months of careful work. Neither number tells you much about actual impact.
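For what it's worth, the scrape itself is trivial, which is part of its appeal. Here's a minimal Python sketch against GitHub's search API that returns a merged-PR count for one engineer over a review window; the org, repo, username, dates, and token are placeholders. It produces a number in seconds, and that number still says nothing about design docs, code review, mentoring, or incident response.

```python
# Minimal sketch of the "PR scraping" shortcut, assuming a GitHub-hosted repo.
# org, repo, author, dates, and token are hypothetical placeholders.
import requests

def merged_pr_count(org: str, repo: str, author: str,
                    start: str, end: str, token: str) -> int:
    """Count PRs by `author` merged between `start` and `end` (YYYY-MM-DD)."""
    query = f"repo:{org}/{repo} is:pr is:merged author:{author} merged:{start}..{end}"
    resp = requests.get(
        "https://api.github.com/search/issues",
        params={"q": query, "per_page": 1},
        headers={"Authorization": f"Bearer {token}",
                 "Accept": "application/vnd.github+json"},
        timeout=30,
    )
    resp.raise_for_status()
    # total_count is a count of merges, not a measure of impact.
    return resp.json()["total_count"]

# e.g. merged_pr_count("example-org", "payments", "some-engineer",
#                      "2024-01-01", "2024-06-30", token="<token>")
```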
Relying on memory anyway. Even with docs and data, most reviews still come down to what the manager can recall. And memory is shaped by factors that have nothing to do with performance. You remember the engineer who pushed back in a meeting last week more vividly than the one who quietly unblocked a teammate three months ago. You remember the incident that woke you up at 3am, but not the dozens of potential incidents that someone prevented through careful work you never saw. Memory is a highlight reel, not a documentary — and the highlights are chosen by cognitive biases, not by relevance.
Recency bias. This deserves its own mention because it's so pervasive. The last few weeks before reviews carry disproportionate weight simply because they're easier to recall. An engineer who struggled in Q1 but finished strong looks like they're "on a good trajectory." An engineer who had a great first half but hit a rough patch in November looks like they're "slipping." Six months of work gets compressed into whatever happened most recently, which isn't fair to anyone.
Over-weighting visible work. Some contributions are naturally visible: launching features, presenting in team meetings, responding quickly in Slack. Others are almost invisible: improving test coverage, mentoring a struggling teammate, doing the unglamorous maintenance work that keeps systems running. Without deliberate effort, reviews tend to reward visibility over value. The engineers who self-promote get credit; the ones who assume their work speaks for itself often don't.
None of these approaches are wrong, exactly. They're just incomplete. Managers use them because they're the best options available without better infrastructure for preserving context over time. The problem isn't that managers aren't trying hard enough — it's that the information they need to write fair reviews decays faster than any manual process can capture it.
Why output alone can be misleading
AI tools connected to engineering systems are excellent at analyzing artifacts. They can read pull requests, summarize tickets, map timelines, and count activity at a speed no manager could match.
But performance is not the same as output.
The parts of an engineer's work that matter most often live outside systems of record:
- 1:1 conversations where priorities shifted
- Periods of intentional learning
- Mentoring others without visible deliverables
- Navigating ambiguous problems
- Collaboration that prevented future issues
Looking only at artifacts can make very different situations appear identical:
- An engineer moving slowly because they are stuck
- An engineer moving slowly because they are mastering a new domain
- An engineer with low output while unblocking the rest of the team
- An engineer investing in quality that prevents incidents later
From a distance, the signals look the same. The context tells a different story.
Good reviews depend on understanding trajectory, not just totals. They require knowing how someone responded to feedback, how their decision-making evolved, and how their influence changed over time. Those elements rarely show up in PR counts or ticket histories.
This is why many managers feel uneasy relying solely on dashboards. The data is real, but incomplete. Without the narrative layer that lives in notes, check-ins, and ongoing conversations, even accurate analysis can lead to unfair conclusions.
Some teams use tools like Vereda AI as a data layer that preserves the why behind the work alongside the patterns and context, so review season isn't a reconstruction exercise.
What good review inputs actually look like
Good review inputs have three qualities: they're specific, they capture context alongside outcome, and they're collected continuously rather than reconstructed at the end.
Specific observations, not general impressions. "Strong communicator" is an impression. "Proactively flagged a schema migration risk during sprint planning that would have caused a data loss incident" is an observation. The difference matters because observations survive calibration conversations — impressions don't. When you go to bat for someone in a review panel, you need evidence, not adjectives.
Context alongside outcome. "Shipped the payments refactor" is an outcome. What makes it useful for a review is the context: how long it took, what complexity they navigated, whether they did it independently or with heavy support, and what tradeoffs they made along the way. Two engineers can ship the same feature with very different levels of growth and contribution. The outcome looks identical; the context tells the real story.
Coverage of the whole period. A good review reflects January through December, not October through December. That requires some system for keeping track — whether it's timestamped 1:1 notes, a standup history you can query, or a habit of writing short observations after significant events.
Multiple signal types. Code output is one signal. How someone handled conflict, whether they mentored junior engineers, how they behaved during an incident, whether they communicated proactively when things slipped — these are also evidence. A review that only looks at what someone shipped misses most of what makes an engineer effective.
The practical implication: you need inputs from at least three sources. Artifacts (PRs, tickets), written records (1:1 notes, standup updates), and direct observation (things you noticed and wrote down when they happened). None of the three is sufficient alone.
How to prepare for review season before it starts
The managers who feel least stressed during review season are rarely the ones who start preparing two weeks out. They're the ones who built lightweight habits earlier in the year. The goal isn't elaborate documentation—it's having enough context that you're not reconstructing from scratch when deadlines hit.
Start with your 1:1s. After each conversation, spend two or three minutes noting the key themes: what the engineer accomplished since your last meeting, what challenges came up, any feedback you gave or received. You don't need transcripts. A few bullet points are enough. The discipline isn't in the format—it's in doing it consistently. Those notes compound over time into a record that covers the entire review period, not just the parts you happen to remember.
Capture context, not just outcomes. When something notable happens—a project ships, an incident gets resolved, a difficult conversation goes well—write a sentence about why it mattered, not just what happened. "Led migration to new auth system" tells you less than "Led migration under tight deadline; coordinated across three teams; caught edge case that would have caused outage." The context is what makes evidence useful later. Without it, you're left with a list of accomplishments that all look roughly the same.
Track behaviors alongside results. Reviews shouldn't just assess what someone shipped—they should also reflect how they worked. Did they mentor others? Did they handle ambiguity well? Did they communicate proactively when things went sideways? These behaviors matter for growth conversations and for calibration, but they're easy to forget if you're only tracking deliverables. When you notice someone demonstrating (or struggling with) a behavior that matters for their level, note it. A few observations per month is enough to establish patterns.
Don't let recent events dominate. Recency bias is the most common distortion in reviews. Whatever happened in the last few weeks is vivid; whatever happened in March is vague. The structural fix is to ensure you have records from the beginning of the review period that are as accessible as recent ones. Before you write, go back to your notes from the first quarter. Look at projects that wrapped months ago. If you can't remember what someone did early in the cycle and have no records, that's a gap in your process—don't fill it with assumptions.
Keep the system sustainable. Any note-taking habit that requires significant effort will collapse under a busy quarter. The best systems are ones you'll actually use. Five minutes after a 1:1 is sustainable. Thirty-minute weekly documentation sessions are not. A running doc with loose structure beats a detailed spreadsheet you abandon in October. The goal is minimum viable context—enough signal to write fair reviews, captured in a way that doesn't become a burden.
Preparation isn't about creating more work. It's about distributing the work across the year so that review season becomes synthesis rather than archaeology. The managers who do this well aren't working harder—they're working at a more sustainable pace, with better information, and less anxiety when deadlines arrive.
Where bias shows up even with good intent
Most review bias doesn't come from bad managers. It comes from structural gaps — the absence of a system that makes all contributions equally visible.
Recency bias is the most common. Whatever happened in the last six weeks before reviews carries disproportionate weight. An engineer who had a rocky Q3 but finished strong looks like they're "growing." An engineer who had a great first half but hit a rough patch in Q4 looks like they're "sliding." Neither characterization reflects the full year. The fix is having records from the beginning of the period that are as accessible as recent memory — which requires a system.
Visibility bias rewards engineers who make their work visible: presenting in team meetings, responding quickly in Slack, shipping features with demos. It penalizes engineers who do critical but invisible work: improving test coverage, mentoring quietly, preventing incidents through careful design, maintaining infrastructure. Without deliberate tracking, reviews tend to measure how loudly someone communicated about their work, not the value of the work itself.
Affinity bias makes it easier to recall and characterize positively the work of engineers you interact with most. Managers who work closely with certain engineers know their context well. Managers who are more distant see the output but miss the circumstances. Reviews often overweight engineers who are in frequent contact and underweight engineers who are more heads-down.
Attribution bias in collaborative work is particularly hard to avoid. When three engineers shipped a feature together, who gets credit? The one who talked about it in the retrospective? The one whose name is on the largest PR? The one who made the key architectural decision that went undocumented? Collaborative work is hard to evaluate fairly without records of who contributed what.
How some teams preserve context over time
The teams that write the best reviews aren't working harder during review season — they're working at a more sustainable pace throughout the year. They've replaced the end-of-cycle reconstruction problem with a continuous capture habit.
Post-1:1 notes. Two minutes after each 1:1, write three bullet points: what they accomplished since last time, any challenges or feedback that came up, and anything you want to remember for their review. Do this every week and you have 25 data points per person per review cycle. Do it inconsistently and you have the same problem as before.
Async standup history. When engineers post daily updates in Slack, that history accumulates into a surprisingly rich record. Standup updates capture when someone was blocked, when they shipped something notable, when a blocker persisted for days, and when sentiment shifted. Teams that use a standup tool (rather than freeform Slack) can query this history at review time instead of trying to remember it.
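If updates live in a plain Slack channel instead, the history is still recoverable with a small script. The sketch below uses the slack_sdk package to collect one engineer's posts from a standup channel over a review window; the token, channel ID, and user ID are placeholders, and a dedicated standup tool would typically expose the same history without any code.

```python
# Rough sketch: pull one engineer's standup posts from a Slack channel.
# Token, channel ID, user ID, and dates are hypothetical placeholders.
from datetime import datetime, timezone
from slack_sdk import WebClient

def standup_updates(token: str, channel_id: str, user_id: str,
                    start: datetime, end: datetime) -> list[dict]:
    """Collect messages posted by `user_id` in `channel_id` between two dates."""
    client = WebClient(token=token)
    updates, cursor = [], None
    while True:
        resp = client.conversations_history(
            channel=channel_id,
            oldest=str(start.timestamp()),
            latest=str(end.timestamp()),
            cursor=cursor,
            limit=200,
        )
        updates += [m for m in resp["messages"] if m.get("user") == user_id]
        cursor = resp.get("response_metadata", {}).get("next_cursor")
        if not cursor:
            break
    # Raw posts only; they still need a human read for context.
    return updates

# e.g. standup_updates("<bot token>", "C0123456789", "U0123456789",
#                      datetime(2024, 1, 1, tzinfo=timezone.utc),
#                      datetime(2024, 6, 30, tzinfo=timezone.utc))
```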
Observation notes. When something notable happens — an engineer steps up during an incident, gives particularly clear feedback in a design review, or handles a difficult stakeholder conversation well — write it down immediately. Not in a review template. Just a timestamped note that you can find later. The moment is memorable now; it won't be in six months.
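The format matters less than the timestamp. As one possible shape, here's a tiny Python sketch that appends observations to a per-person JSON-lines file; the paths, field names, and tags are illustrative rather than a prescribed format.

```python
# Illustrative only: append-only, timestamped observation notes, one file per report.
import json
from datetime import datetime, timezone
from pathlib import Path

def log_observation(person: str, note: str, tags: list[str] | None = None,
                    directory: str = "observations") -> None:
    """Append a timestamped observation for one report to a JSON-lines file."""
    Path(directory).mkdir(parents=True, exist_ok=True)
    entry = {
        "when": datetime.now(timezone.utc).isoformat(),
        "note": note,
        "tags": tags or [],
    }
    with open(Path(directory) / f"{person}.jsonl", "a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")

# e.g. log_observation("jamie", "Caught a schema migration risk in sprint planning "
#                      "that would have caused data loss.", ["judgment", "q2"])
```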
Goal tracking tied to daily work. When goals are connected to the actual work being done — rather than sitting in a separate system that only gets updated during check-ins — they provide a running record of progress and context that makes review conversations much easier to anchor.
The common thread: information that accumulates automatically or with minimal friction beats information that requires deliberate effort to capture. Any system that depends on managers remembering to document during their busiest weeks will fail. The best systems capture context as a byproduct of work that's already happening.
What good reviews look like in practice
A good engineering performance review does three things: it reflects the full review period, it separates impact from activity, and it provides the engineer with information they can actually act on.
It reflects the full review period. Not just the last quarter, not just the most dramatic moments. A good review acknowledges the project that wrapped in March, the period of slower output that turned out to be deep learning, the sustained work that doesn't show up in any single highlight. This requires records. Without records, you can only reflect what you remember — which is a biased sample.
It separates impact from activity. Shipping 40 PRs is activity. Shipping the authentication rewrite that unblocked three teams and reduced support tickets by 30% is impact. Good reviews make this distinction explicit. They tell the engineer what their work actually achieved, not just that they did a lot of it.
It gives the engineer something to act on. The most useful part of a review isn't the rating — it's the specific feedback about what to change, develop, or continue. Feedback like "improve communication" is not actionable. "When a project is at risk, flag it earlier — ideally in your standup update before the deadline becomes visible to stakeholders" is something an engineer can change next week.
It holds up in a calibration conversation. Most companies run calibration sessions where managers compare ratings across their reports. A review backed by specific observations and timestamped evidence is far easier to defend than a review based on general impressions. "I rated them strong because they led the payments migration — here's the standup thread from Q2 showing how they handled the complexity" is a different conversation than "they're just really solid."
The irony is that reviews written this way are also faster to write. When you have the evidence accumulated, drafting is straightforward. The time savings come from not spending hours trying to reconstruct six months of work from fragments. The investment is in the system — not in the review itself.