#5 Not Quite Continuous Integration

Aug 19, 2020

One of my teams is struggling with continuous integration. I’ve been trying to find a good solution to resolve the tension between continuously integrating, code review and branching workflows. So much so that it’s the only topic this week. Let’s get to it!

CI is great, but to be honest: it’s hard. Continuous Integration is a practice, not a tool. So setting up Jenkins is not enough to “do CI” - Jenkins is an automation tool you can use to continuously integrate work. But, like any tool, you have to use it right. To paraphrase Martin Fowler’s Bliki: “Continuous Integration is when each team member integrates their work frequently, at least once daily, leading to multiple integrations per day.” That’s at the core of it: integrating work. Typically that means merging to the mainline branch.

Now why is that hard? Well, it’s really a function of (1) how many people collaborate on the same code base and (2) to what extent the development culture includes collaboration. I’ve seen this work marvelously well in small teams (2-4 people) of experienced developers who habitually work in small increments. I’ve seen it fall apart in larger teams (10-20 people) with many inexperienced members who are not used to the discipline of working in small increments. When it breaks down, it is often because of large change sets (for example 10,000 lines of code in a diff) that originate in a long-lived feature branch, or even just on a developer’s machine 😬. This is a problem in work style, not in tooling. Technology cannot solve human problems.

So if feature branches are the problem… why did we start using feature branches in the first place? The seminal 2010 article that first described gitflow, a pattern that rose to popularity and made feature branching a de-facto standard in many teams, describes the feature branch as follows (emphasis mine):

Feature branches are used to develop new features for the upcoming or a distant future release. The essence of a feature branch is that it exists as long as the feature is in development, but will eventually be merged back into [the mainline branch] or discarded.

It doesn’t necessitate long-lived feature branches, but it pretty much suggests them. Another aspect it points out is that feature branches allow us to effectively separate work into different branches. Later on we decide to take it or leave it (merge / discard). Makes sense, right?

Another advantage we get from feature branches is that we have a well-defined point of control - before merging to mainline we can enforce… well, anything we want. In my teams we require code review, integration with latest mainline, passing tests and passing static analysis. SCM platforms facilitate that with pull/merge-request workflows (PR for short), tool integrations and discussions on diffs. These pull requests have become a cornerstone of the development process in many teams.
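To make that point of control concrete, here is a minimal sketch of such a merge gate in Python. It is not how any particular SCM platform implements it; the `PullRequest` fields, the function names and the default of one required approval are all assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PullRequest:
    """Hypothetical snapshot of a PR's state, for illustration only."""
    approvals: int                 # number of approving reviews
    up_to_date: bool               # branch already contains latest mainline
    tests_passed: bool
    static_analysis_passed: bool


def blocking_reasons(pr, required_approvals=1):
    """Return the list of reasons why this PR may not merge yet."""
    reasons = []
    if pr.approvals < required_approvals:
        reasons.append("needs code review approval")
    if not pr.up_to_date:
        reasons.append("not integrated with latest mainline")
    if not pr.tests_passed:
        reasons.append("tests failing")
    if not pr.static_analysis_passed:
        reasons.append("static analysis failing")
    return reasons


def can_merge(pr, required_approvals=1):
    """A PR may merge only when nothing blocks it."""
    return not blocking_reasons(pr, required_approvals)
```

Note that `up_to_date` is the one condition that other people's merges can flip back to false, which matters for the delays discussed further down.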

Personally I see great benefits from code review: sharing knowledge about the system design, implementation and history enables shared ownership. Moreover every PR is a democratized teaching opportunity: every reviewer (we do peer review, i.e. everybody in the team is invited to comment) can propose different or better solutions and point out potential problems with the proposed changes.

Such control functions (code review, static analysis, testing) are great, but inherently delay integration. The more obvious slowdown is code review, because we cannot automate it. Less obvious are the automatable tasks: those should not really hold up integration into mainline, right? Even if the build takes 20 minutes to run - it’s only 20 minutes, right? Well… it depends: in small teams yes, but in a team with 10 developers trying to land changes into mainline there are often 5 pull requests ready to merge at any given point in time. Once PR #1 merges, the developers of PRs #2-#5 have to pull in the latest mainline, possibly fix conflicts and again prove that their changes are good - another 20 minutes waiting for the automation tool to run. After PR #2 merges, the developers of PRs #3-#5… you get where I’m going with this.

Let’s do some middle school math: in the example each developer waits an average minimum of (20mins + 40mins + 60mins + 80mins + 100mins) / 5 developers = 60 mins/dev. In other words the average integration delay is 60 minutes for each developer, and the unlucky author of PR #5 waits 100 minutes. Only from running automated checks (does it compile? do tests pass? does static analysis find any problems?). Code review not included, getting coffee or taking a break while waiting for the build not included. We’re back to the good old days…
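The arithmetic is easy to check with a few lines of Python. This is only a sketch of the thought experiment, assuming PRs merge strictly one after another and every merge forces a full CI run for each PR still waiting; the function names are made up for illustration.

```python
def wait_times(n, t):
    """Minimum CI wait per PR, in merge order.

    The k-th PR to merge has run CI k times: once initially and once
    after each of the k - 1 merges that landed before it.
    """
    return [k * t for k in range(1, n + 1)]


def total_wait(n, t):
    """CI wait summed over all n PRs."""
    return sum(wait_times(n, t))


def average_wait(n, t):
    """CI wait per developer, averaged over all n PRs."""
    return total_wait(n, t) / n


if __name__ == "__main__":
    # Print total and average wait for a few queue sizes, 20-minute build.
    for n in (2, 5, 10):
        print(n, total_wait(n, 20), average_wait(n, 20))
```

For 5 concurrent PRs and a 20-minute build, `wait_times(5, 20)` gives 20, 40, 60, 80 and 100 minutes, which adds up to 300 minutes in total and 60 minutes on average.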

To state it as a formula, with n = number of concurrent pull requests, t = time for CI to build, with some basic math we arrive at:
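A reconstruction under the assumptions of the example (PRs merge strictly one after another, and every merge forces a full CI run of duration t for each PR still waiting). The symbols W_k (wait of the k-th PR to merge), W_total and W_avg are introduced here for illustration:

```latex
W_k = k \, t, \qquad
W_{\mathrm{total}} = \sum_{k=1}^{n} k \, t = \frac{n (n + 1)}{2} \, t, \qquad
W_{\mathrm{avg}} = \frac{W_{\mathrm{total}}}{n} = \frac{n + 1}{2} \, t
```

With n = 5 and t = 20 minutes this gives a total of 300 minutes and an average of 60 minutes. The total is quadratic in n, while the per-developer average is linear in n.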

Summed over all open PRs, the waiting time grows quadratically in the number of concurrent PRs 😬, and even the per-developer average grows with every additional PR in the queue. Which means bigger teams suffer significantly more from this problem - if they run in the feature-branch-pull-request mode. The formula is just a thought experiment, take it with a grain of salt, but it is very much in line with what I see; it rings true. Of course there’s an argument to be made that CI automation should never take that long, but in reality sometimes it does. In one of my teams it does. And it’s not a matter of basic build script tuning or buying bigger machines.

We’ve finally reached the point where the rubber hits the road, where I’m stuck between a rock and a hard place:

I want continuous integration, i.e. every developer integrates at least once daily.
I want automated builds. That delays integration.
I want code review. That delays integration.
Our team is big, our build is slow. That amplifies the delay significantly.

We fail to continuously integrate, because the delays are too long. Our team size (large), collaboration practices (feature branches, asynchronous code review) and confidence in tests (low-ish) pull us towards delayed integration. Like gravity. Bear with me.

The theory goes: factors on the left pull you towards delayed integration, on the right towards continuous integration. For example, a small team with high confidence in its tests that does trunk-based development will achieve continuous integration with relative ease, even if code review is asynchronous and their leaders have a control mindset. On the other hand, a large team using gitflow-style feature branches, with no tests and a control freak manager, doesn’t have a chance. That integration will be delayed AF. Hello 10k LOC pull request, goodbye serenity.

In my real team I’m further on the left than I’d like to be. I don’t have a cool resolution about how I solved all the problems, and saved the day, and made everybody happy, and got carried on the team’s shoulders, while everybody’s chanting my name. (Maybe next time.) But at least now I know a couple of angles to tackle the problem: pair programming (= synchronous code review), trunk-based development, limits on diff size, investment into fast, automated tests…
I just thought I’d share my train of thought to see if it resonates with you.
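Of the angles listed above, the limit on diff size is probably the easiest one to automate. A minimal sketch, assuming the check is fed the text output of `git diff --numstat` (one `added<TAB>deleted<TAB>path` line per file, with `-` in place of the counts for binary files). The function names and the 400-line default are hypothetical, not a recommendation.

```python
def changed_lines(numstat_output):
    """Sum added and deleted lines from `git diff --numstat` output."""
    total = 0
    for line in numstat_output.splitlines():
        parts = line.split("\t")
        if len(parts) < 3:
            continue  # blank or malformed line
        added, deleted = parts[0], parts[1]
        # Binary files are reported as "-" for both counts; skip them.
        if added == "-" or deleted == "-":
            continue
        total += int(added) + int(deleted)
    return total


def within_limit(numstat_output, max_lines=400):
    """True if the diff stays at or below the (hypothetical) size limit."""
    return changed_lines(numstat_output) <= max_lines
```

In a CI job the input could come from something like `git diff --numstat origin/main...HEAD`, where `origin/main` stands in for whatever your mainline branch is called.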

Hyperlinks

Executive Communication / SCQA

What [SCQA] as a framework does for you is it forces you to organize your thoughts.
You're going to get a lot of value out of it, and more importantly, your colleagues are going to get a lot of value out of it because it's such a high signal way to communicate.

A fantastic primer on a way to structure communication for high signal-to-noise ratio. How to write the email or report that is a delight to read and provides actionable information.

Generative Engineering Cultures

A generative engineering culture is one where nothing seems to fall through the cracks, “we should” gets prioritized and becomes reality, and original ideas and value come primarily from engineers, rather than management.

Don’t we all wish to work in such a culture?
David Kaplan gives a great summary of the what, why and how of generative engineering cultures (also on the podcast). He ends with:

I view creating a generative engineering culture as an extension of that role — multiplying force not just for who can take on pre-prescribed problems and solutions, but for how many people can identify and solve problems independent of management. In such a culture, nothing falls through the cracks.

That’s very similar to how I see my role as software architect, working across multiple teams. Amplifying the teams rather than driving everything through my own individual contribution. I hope every (aspiring) leader in my technology organization reads this.

That’s it for this week, subscribe to get the next edition directly into your mailbox 📬

Nemo on Software
