mth

March 2, 2017

Code Review

Note: this is taken from something I wrote to encourage discussion and thought around code reviews within my team. In the end, I thought it was worth making public, but readability without context may vary. There are also some edits to protect the innocent.

Code reviews, specifically in the form of pull requests, have long been an integral part of engineering lore and process at $DAYJOB. However, sometimes it feels like we are going through the motions without getting any real value and sometimes they even manage to make changes worse^[0]. So why do we do them? Why do we persist? Should we persist?

In this post, I want to question some of the commonly made assumptions, and present my view on the value of code reviews, hopefully changing the perspective of people putting code up for review and people reviewing that code.

Why do we even review code?

Interestingly, when I look back at the best teams I have ever worked with, there have been two things that have remained very consistent.

We cared deeply about the quality of the code we produced.
We developed directly on master/HEAD/trunk and shipped from there continuously^[1].

I don’t think these things are related or even particularly correlated, it is just an observation I have made. Interestingly, it also aligns with the earliest days of product development at $DAYJOB where the approach was to produce code in such a way (types, tests, whatever is required) that you have absolute confidence in pushing code to master every time, a part of this was also a lot more emphasis on deep test suites^[2], but regardless, there is a notable contrast to our current process.

So given my past experiences, and a weight of the existing approach at $DAYJOB, why did I push for and implement a far removed, ‘PR and review’ branch based process?

At the time, the team was a lot smaller, but we were also more distributed than we currently are. We spent almost no time in the same room ^[3]. As with a lot of our process related things it started with the first point above, we were seeing code quality issues, not in that there were a lot of issues or massive bugs breaking things, but in that there was code being produced but it wasn’t even the right thing, or there was code being produced and never being delivered. This initial reasoning for the pull-request first approach across the team was about visibility and communication of what was happening, and ensuring people weren’t spinning in circles. The problems that we deal with day-to-day can be challenging, and pull-requests give us a chance to support each other.

Despite the deeply embedded process we have now, it took quite a while from the earliest ideas and discussions around switching to a code-review centric process to actually having a process that every one had bought into and was following^[4]. But, as the process become a part of what we just did, we started to reap additional benefits. Firstly, the obvious benefits of more eyes on code, reducing faults and improving code quality. But secondly, and less recognised, PRs become the focal point for conceptual clarity^[5], which is something I value deeply (above pretty much everything else) as a contributing process of building working software.

During the post PR instigation period there was significant work and collaboration in and around Ivory, and I still think of this work as the canonical examples of code-review giving us extra-ordinarily good results along those fault-reduction and clarity lines. PRs resulted in design discussions and decisions that genuinely improved what we were delivering.

So there are reasons for PRs. These reasons cross three dimensions: visibility, clarity and fault-reduction. And, they have worked for us over a long period of time. So what gives?

The Contention of Responsibility

My biggest frustration, and one of the largest process inequalities I see day-to-day in our work is the simple idea of who is responsible that a piece of code works?

Context free, I think the majority (if not everyone) would agree (or at least find it hard to argue against) that the responsibility of delivering working code is on the person producing that code. The problem is how people think, and more importantly act, with respect to the role of code review and how it changes or diminishes this responsibility. Because when we don’t get it right, there is a big impact on clarity and fault-reduction.

Rather than try to speculate on the diverse opinions and positions I see reflected in differing approaches across the team, I would rather present how I think about responsibility and code review. The goal here is to give everyone something to reflect on, perhaps you will see something of how you approach things (or don’t) in some or all of these thoughts.

The burden of responsibility for delivering working code is on the person producing that code.

Processes such as code review are there to assist that person in making the decision whether to ship a change, or not.

Code review does not diminish the responsibility (or power to deliver) of the person producing the code.

Code reviews are not about permission to make a change they are about asking for feedback for things you might miss or might have unexpected consequences.

Reviewers are not there to relieve responsibility or give permission. Reviewers are there to provide different perspectives on the impact of a change.

Reviewers are not there to prevent people getting code through to production. Reviewers are there because the person working on the code has asked them for help to get this change to production in a way that isn’t going to cause issues.

Reviewers can have several perspectives that are useful to this process: suitability for production; likely impacts on other systems; cleanliness/maintainability of the code; likelihood of subsequent changes being made correctly to this code when it needs to be changes. All of these provide valid feedback to the review process, that ultimately must be processed and evaluated by the person responsible as to whether the feedback is relevant, necessary, and required now as opposed to later.

If you are nodding along with these, I would ask you to stop and think, make sure you really understand what these thoughts would mean for you and how they relate to how you act. If you were to write down the thoughts that go through your head every review, how well do these line up? Why / why not?

The key things I would want someone to take away from this are:

As someone producing code, code review is an opportunity to ask for help on things you may not see or know enough about or may just be challenging to get right.
As someone reviewing code, code review is an opportunity to provide feedback from your unique perspective, your code review can and should be looking for the things you worry about, but ultimately the final responsibility rests with the author. And remember you are there to help.

So what goes wrong, and what should we be looking for from code review?

When reviews go bad

I keep a notebook of big production issues, ones that cause widespread issues and ultimately cause me to lose sleep at night. For each of these issues, I do some basic root-cause analysis, a part of this is always looking at the pull-requests that contributed to the issue. In these pull-requests, there are five dominant patterns:

The big pass. A Big PR that is either too complex or the problem isn’t well understood. The negative pattern here is they often get a passing comment or two and some superficial :+1:s. We see this often where we have 1000+ line changes with one or two comments, compared to 10 line changes that have twenty comments. If we frame this in terms of what I said about responsibility above, this is on the person putting a change up to do a better job of having a path that is able to get reviewed and have useful feedback. It is also on the people ok’ing the change to not distill false confidence by being clear and assertive that this isn’t easy to review and they are not confident that they can provide the requested feedback.

The quick release. A PR is raised and several people are tagged. The negative pattern here is that the first sign of a :+1: and the change is quickly merged, compounded by the fact that the first :+1: is the quickest and often most superficial review to land. If we look at framing this in the responsibilities I outlined above, this doesn’t make sense, we are asking two, three, four people for review so they can give their different perspectives, if we are only going to wait for one of those perspectives was there any point in even asking for them?

The dilution. A PR is raised and several people are tagged or opt-in to the review where there is significant opinion. The negative pattern here is where the person putting the review up feels compelled to incorporate a huge mix of feedback, especially when some of it will be contradictory or incompatibility. Often the original PR is self-consistent and makes sense, but slowly one comment at a time, different bits of coded are twisted in different directions. The reviewers all end up happy but the code has lost any semblance of consistency or clarity.

The dog-pile. A PR is raised and several people are tagged or opt-in to the review where there is significant opinion. The negative pattern here where the person putting the review up feels overwhelmed with disproportionate level of feedback and ends up giving-up on perfectly good (and sometimes critical) changes because it all seems too hard.

The someone-should-of-said-no. I am hesitant to bring this one up (and didn’t in an earlier draft) but it happens enough it is worth mentioning. The negative pattern here is where a PR gets ok’d through attrition, basically people being too polite to say I don’t think this is good enough, more time should be spent on it. We demonstrate a comfort with negative comments in the small, on individual bits of code, but the real value is when we can have this conversation around design and the larger ways our programs work.

These five patterns contribute more issues than anything else by a significant margin. I think it is worth noting that nowhere in this list are there situations where someone producing a piece of code takes ownership and decides to push through a PR that they understand the scope of and decide to take the steps necessary to merge through a change. Yet, when ever I talk to people about issues with PRs this is what they always assume is the worst. If we go back to the responsibilities, it would seem to flow that people do this is a good thing.

So if these patterns are bad, what should code-reviews look like?

A good review

The Author

A good review starts with a good patch, and a good patch starts with a clear goal.

If we are working on a project, there needs to be purpose. What does this project do? How does it do it? What are the main moving parts? How do they hang together? What are the interfaces? If these things aren’t clear to the author and a decent pool of reviewers we shouldn’t even start writing code until they are. This type of clarity normally involves spending a fair bit of time thinking, it will involve talking to other people getting their opinions or at least their buy-in to what is going on, and it will involve recording it, whether that is in a README, a little doc, a photograph of a whiteboard or the mona-lisa of ascii art.

Once it is clear what a project is about, fingers start hitting keyboards. What happens here, is very context dependent, be it on the person making the change, the maturity of the project, or the change itself. It might involve 10 minutes doing something mechanical, it might take a day or two prototyping and trying things to work out where things are heading, it might be somewhere in between trying to carve out a small chunk to make progress on. The variety here is natural, and should be embraced.

At this point, it is worth noting something that I often struggle getting across, which is the false tension between visibility, coherent changes and meaningful reviews. Outside of code-review, there are a lot of drivers for shorter, sharper changes (PR per day etc…), there is also my constant push for reviews that matter (in the past I have often pushed the phrase ‘conceptual review’). This means that there are going to be a wide variety of changes, and if we are really following our other processes there will be a high ratio of “mechanical” PRs to “meaningful” PRs. I don’t think it is unreasonable to expect 3 or 4 mechanical PRs before each meaningful change. The false tension I am often created from the idea that the same level of review be applied to all changes, when that really doesn’t make sense, each PR is different, we should be thinking about what makes sense for each change. If 3 or 4 changes are just a heads up, then that leaves more time for the deeper review when the time comes. The deeper review will probably be accompanied by offline chatter and re-work or re-doc as well as looking at code outside of the patch to fully work through what his happening, but given the different types of PRs this makes sense. In the past I have made the mistake of pushing for heavier reviews, and have only more recently realised, I have unintentionally caused an over-emphasis on mechanical PRs without fixing the under-emphasis on meaningful PRs.

We have a change that we want to push through, now it is time to ask for help, put up a PR and get some reviewers. Is the patch in good shape to be reviewed? Would a reviewer be able to ask useful questions of the code? Should it be split? Would you want to review it? Have you reviewed it yourself? If any of the starting assumptions or designs have changed, have you updated any design notes or other docs to reflect the new world view?

When thinking about who to ask to review a change^[6], it is all about perspective. There should probably a reasonable idea of who understands the project enough to do a deep review before you even started your change, but now that something is about to be delivered, it is a good time to carefully think about the perspectives that will be most useful to the change. If it is a production service, what other services does it communicate with, who know about them? Who else knows about the design of this code base and ensure I am not breaking any previous design decisions? Who else knows about this library/technology/swizzle-stick? Does it even need a deep review, or could you just tag someone for a sanity check, making that clear it only needs a cursory glance at the moment? If you don’t feel like you know who is a good person to tag, this is a good opportunity to talk to people around you, start a conversation in the engineering room, or with the person sitting next you, try to work out who is interested and who has those skills (and take some notes in the engineering documentation for next time).

The Reviewer

Once tagged on a change, there should first be some reflection on are you the right person? What is the change about? Do you have enough context? Can you provide a useful perspective? This may involve talking to the author and getting a better understanding of the change. The author can help here by having a good PR comment, potentially mentioning the types of things they are worried about, a link to any related doc or previous PRs for context, a link to an engineering issue for how this fits into the larger system.

At this point, I feel there is currently a weakness in the process where the default reaction in the past has been if you are tagged on a review, and don’t feel like you can provide deeper feedback, then there is a tendency to revert to a surface/syntactical review rather than opt-out (although I think it has been improving a little more recently).

Now that there is a decision to proceed with the review, what is a reasonable way to proceed? Given we are all come at problems from a personal perspective, I can’t outline a reasonable solution for everyone, I can only give an idea of how I think about it. Maybe this will give you a good model, maybe you have something better that works for you, either way it is good to actively think through.

For me, I am looking to question invariants, validate code against design and documents/readmes, poke holes in the design. This process can be broken down into two steps:

Try to gather a conceptual understanding of the flow and structures required. Most of my time is spent understanding and questioning the basic concepts.
Validate that the code looks like the conceptual view of the problem. If it does not, push people to build up code, types, functions, abstractions at a higher-level so that it does. After that is in order, it is just checking that there are tests for the complicated issues and edge-cases raised during part 1. I basically don’t think I should have to work too hard at checking the minutia of the code if I follow this rigorously.

To achieve this, depending on the context of the change and the perspective I am reviewing code from I will be asking different questions.

For a data processing tool, where I am doing a review for production impacts the questions would be along the lines of:

What happens if this fails half way? If it is retried will it work?
What happens if we see bad data? Do I have operational control to decide whether it is catastrophic or whether best effort is ok? If I get a failure do I have to re-run everything or can I use the results safely?
How are we managing the input and output formats? What if they need to change? How about arguments, are they going to be hard to keep compatible in future changes?
How does this scale with small X, but big Y, what about the other way around?
If we have to run this everyday, does it get slower?

For a haskell library, where I am doing a review for code quality and usage:

How does a client do X? Is there enough examples tests to cover it? Repeat for as many values of X as I can.
How will a client consume this interface? What is the surface area, 10 imports to use 1 functions, or something better? If we get it wrong is there going to be churn or is there an obvious way to change or incorporate new ideas without having to re-imagine the interface?
Is it really a library? Or are we going to be developing it primarily as some top level tool and having to change it in lockstep.

The common pattern here is the best types of code-review comments are questions. Questions that I can try to answer myself as the reviewer by looking at the change, and if I can’t see an obvious answer ask the question on the PR. Even better if the question is structured in a way that can be clarified by better code and types, or in lieu of that better doc, or just more explanations if all that is in place. (Note I am not saying that questions are the only valid PR comments, it is just a fairly good way in many cases, sometimes a piece of code has an obvious bug or edge case that needs to be called out).

Dealing With Comments

As the author, receiving these comments (or questions as they may be), I now have to decide how and if I want to incorporate them. I am looking through the comments and classifying them into a few broad buckets:

The comments tell me my code isn’t as clear as I thought, I need to tidy it up.
Fair point, but it is something I don’t need to worry about now, nothing can break at this point, I will raise an issue or add a FIX tag for future me as a reminder, but I am going to move forward without that.
That is interesting, but it doesn’t fit with the rest of the design so I am not going to incorporate it. If I need to drastically revisit the basic approach at some point I might bring this in.
Mea-culpa; I knew I needed that test/type/whatever but I was dreading it, I should fix this now. Thanks for calling me on it.
Good catch, I hadn’t thought of that, or the code isn’t doing what I intended, I will go fix it.
Whoah, I am way off here, I didn’t understand this well enough, my solution just isn’t clear or complete. I need to close this and revisit everything.

From there I action them, work out if the changes are significant enough for a re-tag, and then push it through.

This isn’t the only process or approach that works, but I feel like it is one that gives me good results from both sides of the equation. Importantly, as an author I am maintaining control and responsibility for when and if I am pushing things through. And, as a reviewer, I am providing the best value for the lowest effort, and never instilling a false sense of confidence.

When does the team know better?

So if the responsibility falls on individuals, will someone make a bad mistake?

It is a fair question, people, especially people who program, are not great at extrapolating risk or unintended consequences of change. At some point people will make mistakes or at the very least not the best decisions, how do we deal with that?

I would first soften the issue, but pointing out that from my observations of the team working on problems, we make far more mistakes from over-confidence and over-reliance on poor or superficial code-reviews then we do from people taking ownership and control of their own PRs. That said I don’t think the we need to live in the extremes on either side of this, we should have appropriate checks and balances in place outside of reviews that give the team a chance to set minimal standards. But we should be able to do that in a way that doesn’t impact on anything I outlined about a good code-review, we need to be optimising for putting the right information in front of people so they are able to make good decisions.

Controls and expectations I think the team should have to mitigate this:

Language guidelines that set minimum standards. These should evolve to capture things we don’t want variability on over time.
Program guidelines that set minimum standards. Again, these should evolve over time.
Mandatory approvals required for critical code bases. At times there are small parts of the system, which have a disproportionate impact / risk, and may require specialised skills to review. Identifying these code bases and sticking to the mandatory rules even for trivial changes is important.
Opt-in reviews for things that individuals care about. Firstly, people should be able to control their review surface area to a large, and having people review the things they care most about is going to give good results.
A comfort to defer to the driver of a specific project. If someone has spent a lot of time in a code base, working hard on its design and clarity, authors visiting or transitioning into these code bases should take extra care and time, ensuring they are engaging with the earlier author, both in asking for reviews and spending more time with their comments.
Normalising opt-outs. There is such a wide variety of reasons for opting out, often it can be a lack of confidence or certainty, these are things that can be dealt with helped out of the code review cycle. No one should feel uncomfortable tapping out of a code review if they don’t think they can help.
Normalising honest negative comments. Where changes are not in good enough shape to provide meaningful feedback on, we should feel comfortable calling it out. To have this, we can’t dismiss the position even if it is only one person, and we need to be respectful of the authors but trust each other enough that we can clearly state what is wrong without it turning into something that is taken personally or escalates into personal tension.

Taken together, these things help prevent people making the occasional mistake, or guarding against changes we know are higher risk. They shouldn’t be used to police people who are repeatedly not using the best judgment, this too quickly gets into the preventing people from progressing, which quickly turns into preventing everyone from progressing. If there is a repeated lack of judgment it is on me to work with that person to let them know and help them improve, and it is on all of us to keep each other honest and call it out if we see it, but it isn’t something we should be working into the normal review process.

I would re-emphasise that I don’t have concerns in this area, it is the least of the issues with the reviews that happen, but if we can work on the processes outlined above we can still continue to improve.

In the End

I think that the argument for pull-requests and code review is still far stronger than without, however the current benefits we get most certainly do give me pause from time to time. Along the visibility dimension, things are quite healthy, but we need to be getting better results for clarity and fault reduction. This may happen through tools to codify better behaviours^[7], but we can also get a long way by just reinforcing and refining the process and expectations we have on each other as a group of people who want to deliver working programs, and I don’t use the phrase working program lightly. It is worth remembering that code review is a skill, and not all of this is easy, we all need to work hard to be better at it.

And in the end, we have to be careful that we don’t fall into one of the classical programmer mind-traps of chasing the metric or procedure rather than the goal of reducing problems with the software we produce. As an author make sure you are asking the right people and using the feedback in an appropriate way. As someone reviewing the code remember that you are there to help them get their change to production, so are you actually being helpful?

[0] - Adding a note, because undoubtedly someone will call this out. I am not going to cite specific examples, but if you flick through the PR history for the last 12 months I am sure you can find examples of changes watered down to the lowest-common reviewer. These changes lack the fundamental integrity of what a coherent PR should be. Bits of code, syntactically or otherwise are changed to suit review comments leaving it inconsistent with the original goal.

[1] - HEAD or trunk based development for those who aren’t familiar: HEAD based development is the idea that there is only ever 1 shared branch, sometimes referred to as “trunk” based development for the Subversion users, or “master” for the git only users. In this approach people commit their work directly to master when it is ready. In the teams that I have practiced this, there are several other practices that come along with this such as very rapid integration (max 5-10 minutes between commits hitting master when you are developing), rigorous testing regimes (they varied but things like 100% coverage from three independent test suites), very fast build/test cycles (first green light < 30 seconds, second green light < 2 minutes, …), mandatory roll-back on any failed commit and (possibly) more than anything else we probably weren’t friendly, caring individuals if it impacted the result.

[2] - Deep test suites in the past: It isn’t that testing is less important now, it is just we have more operational structures in place that we often place different emphasis on how/where they deep tests run, though that is perhaps the topic of another post), as an example almost every project had at least one test that actually started an instances on EC2, ssh’d across and tested some code, which isn’t something I would want repeated it was largely due to operational immaturity at the time, and most of the code we produce should be verifiable to work under many deployment scenarios orthogonal to the actual project under development.

[3] - No time in the same room: This probably surprises some people who think of the early days as everyone sitting together, however the reality was very different, myself and Russell were spending 40 hours a week on client sites, I was dev’ing platform related things from coffee shops before and after work.

[4] - There were many factors in this. There was a bit of convincing Ben that it wouldn’t get in the way of continuing to produce the code we had to. There was also the practicalities of getting everyone on-board and changing their process when we weren’t spending a lot of time together.

[5] - Conceptual Clarity: Is the catch phrase I have used to capture my ideal development style, it is the process of developing by first identifying the high-level concepts in the system and how they interact. Building out code from the bottom-up that allows the manipulation of these concepts at a high-level to the point where the final code looks like a clear sketch of those high-level concepts and how they hang together. If the code doesn’t look as clear as my initial design, that iterate and refine it until it does. The goal here is to have better discussions about the problems at hand as you are not dealing with irrelevant bits of code, and following on the code should be easier to read and change.

[6] - I would really love this to be unconscious, it seems so obvious that we could tool up here through jury so that this isn’t something we ever have to think about.

[7] - Codifing what we can in tools is important and something I have talked about in the past. Tooling I think can help considerably:

Pick the right reviewers. When we raise a PR jury is able to make a smart decision based on the code bases you are changing, maybe on the diff, are you editing haskell or shell or whatever.
Opt-in to the right reviews. You should be able to control the surface area of what you review.
Better manage review decisions. What type of feedback are you looking for? PRs are “signed-off” when you get that feedback. The most naive implementation of this is a build status that is red when there are assigned reviewers without approvals and it goes green when everyone has either approved or opted-out.
Better statistics and feedback around PRs that aren’t getting the attention that people want. Nag-bot. Escalations. Auto-reassign/ re-balance when people have too many reviews outstanding.