In this post, I will summarize my approach to CI/CD automation. I believe it is pragmatic, keeps the primary focus on product rather than tooling, and gradually sets up small to medium teams for success with quality engineering operations. For a concrete instance of one such automation evolution, you can read Build System At Genus AI, where this approach sustained automation for more than 3 years.

Commands

The natural first step, before any automation, is when development tasks, like running tests, building artifacts, or deploying them, are just a sequence of commands that everyone on the team knows and runs. It is simple, straightforward, and pragmatic. However, over time it can lead to problems (more on those later).

During this phase, my focus is usually on working out the sequence of commands for each use case: testing, building, and deployment. It is also the time to understand the differences, if any, between executing something locally and remotely. Everything should be documented in some structured format: Markdown, reStructuredText, a wiki, or Notion.

Local Automation

At some point, it should start to feel repetitive to execute the same sequence of commands over and over again. That is the time to bring the different commands together as a bash script, a Makefile, an Ansible playbook, or some other multi-purpose and robust tool. This stage takes almost no time, because the commands are already known and the tools to glue them together should be simple.

This is the local automation stage. It is about building one script that executes all the commands on a local machine. That is probably an excellent place to be for a small team, and it also sets the team up for the next step: running the same script on an automation service.
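
As a rough illustration of what such glue can look like, here is a minimal sketch in Python. The task names and the commands behind them (`pytest`, `build`, `./deploy.sh`) are placeholders made up for this example, not taken from a real project; a Makefile or a bash script would do the same job.

```python
"""Minimal local task runner: one name per documented command sequence."""
import shlex
import subprocess

# Hypothetical command sequences; substitute the ones your team documented.
TASKS = {
    "test": ["python -m pytest"],
    "build": ["python -m build"],
    "deploy": ["python -m pytest", "python -m build", "./deploy.sh"],
}


def run_task(name, tasks=TASKS, run=subprocess.run):
    """Run the commands of one task in order and stop at the first failure."""
    for command in tasks[name]:
        print(f"$ {command}")
        result = run(shlex.split(command))
        if result.returncode != 0:
            return result.returncode  # remaining commands are skipped
    return 0


def main(argv):
    """Entry point: expects exactly one argument, the task name."""
    if len(argv) != 1 or argv[0] not in TASKS:
        print("usage: tasks.py <" + "|".join(TASKS) + ">")
        return 2
    return run_task(argv[0])
```

Saved as `tasks.py` (a hypothetical file name) with `sys.exit(main(sys.argv[1:]))` as the last line, `python tasks.py deploy` becomes the single command that replaces the documented list.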

Remote Automation

Depending on how the above two steps went, this one is usually the easiest. It requires understanding the differences between running commands locally and on a CI service, and parametrizing the scripts accordingly. Once the scripts are ready, use a CI service, such as GitHub Actions, CircleCI, AWS CodeBuild, or Jenkins, to invoke the same local automation commands. I want to emphasize this: the continuous integration service should run only the same script an engineer would run locally.
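
One way to handle that parametrization is to read every local-versus-CI difference from environment variables, so the script itself stays identical in both places. The sketch below assumes the CI service sets `CI=true` (GitHub Actions and CircleCI do); `DEPLOY_ENV`, `./deploy.sh`, `--env`, and `--yes` are hypothetical names used only for illustration.

```python
import os


def load_settings(environ=os.environ):
    """Collect everything that differs between a laptop and a CI runner."""
    on_ci = environ.get("CI", "").lower() == "true"
    return {
        "on_ci": on_ci,
        # DEPLOY_ENV is a hypothetical variable; default to the safest target.
        "deploy_env": environ.get("DEPLOY_ENV", "staging"),
        # Nobody is around to answer prompts on a CI runner.
        "interactive": not on_ci,
    }


def deploy_command(settings):
    """Build the deploy command; only the parameters vary by environment."""
    # ./deploy.sh and its flags are hypothetical placeholders.
    command = ["./deploy.sh", "--env", settings["deploy_env"]]
    if not settings["interactive"]:
        command.append("--yes")  # skip confirmation prompts
    return command
```

With this shape, the CI configuration only sets variables and calls the script; it never grows logic of its own.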

For a concrete example, when I was deploying my blog using CodeBuild (before the billing incident), I had an Ansible playbook, blog_deploy.yml, that I could run from my machine. However, I only did that early on; later I set up the AWS CodeBuild service to run it automatically whenever I pushed a new article to the repository. My CodeBuild buildspec file looked like this:

version: 0.2
phases:
   install:
      runtime-versions:
         ruby: 2.6
         python: 3.7
      commands:
      - pip3 install -r requirements.txt --no-deps  # install ansible
      - (cd blogs/seporaitis-net; bundle install)   # install jekyll
   build:
      commands:
      - (cd ansible; ./blog_deploy.yml)             # deploy

There are two phases: install the dependencies, then execute the same command I would use locally. That is all.

Isn’t this obvious?

It turns out it is not. The DevOps Research Report says that although automated builds are popular, at 64% of the companies surveyed, this number goes down quickly with each stage of automation closer to production, down to just 17% of companies having automated deployment. Implementation details are not clear from the survey results, but some of that automation very likely suffers from problems that the approach in this article can solve.

Personal anecdata: in one medium-sized company I worked at, the deployment process was documented as a list of commands to run. Engineers would have to sit and wait 20-30 minutes to go through the process, looking at the screen, watching the output, and then invoking the next command.

Think about it: if a team consists of four engineers, each deploying one change per day, that is up to 10 hours of productive time per week lost to mindless staring at the screen. That does not include mistakes, regressions, or the number of environments (staging, QA, etc.), each of which further amplifies this waste.

Doing this for a year is equivalent to losing almost 2 months of productive person-time, if not more.
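
The arithmetic is easy to redo for your own team. The sketch below uses the four engineers, one deploy per day each, and 20-30 minutes per deploy from above; the 5-day week, 52 weeks per year, and 40-hour work week are my assumptions.

```python
ENGINEERS = 4
DEPLOYS_PER_ENGINEER_PER_DAY = 1
WORKDAYS_PER_WEEK = 5       # assumption
WEEKS_PER_YEAR = 52         # assumption: holidays are not deducted
HOURS_PER_WORK_WEEK = 40    # assumption


def hours_lost_per_week(minutes_per_deploy):
    """Hours per week the whole team spends watching deployments."""
    deploys = ENGINEERS * DEPLOYS_PER_ENGINEER_PER_DAY * WORKDAYS_PER_WEEK
    return deploys * minutes_per_deploy / 60


for minutes in (20, 30):
    weekly = hours_lost_per_week(minutes)
    work_weeks = weekly * WEEKS_PER_YEAR / HOURS_PER_WORK_WEEK
    print(f"{minutes} min/deploy: {weekly:.1f} h/week, {work_weeks:.1f} work weeks/year")
# 20 min/deploy: 6.7 h/week, 8.7 work weeks/year
# 30 min/deploy: 10.0 h/week, 13.0 work weeks/year
```

Under these assumptions the yearly loss lands somewhere between roughly two and three months of one person's working time.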

What you get for free

On the flip side - this approach brings some benefits for free.

This setup is not intrusive to product work and should require very little approval to implement. Commands and local automation should flow naturally out of the daily work. The remote automation with CI can be just a simple YAML file that checks out the code and executes the commands (e.g. GitHub Actions, or CodeBuild as shown above).
Deployments become consistent. In my experience, once set up and settled, this automation reaches a maturity level where material changes happen only when a big upgrade is needed or when the program changes significantly. Meanwhile, it remains untouched, and the chance of human error while running the list of commands is almost zero.
Fallback is local. If the CI service is having a bad day and does not work, you can be confident that running the deployment command locally will work, because it is the same command that has been exercised daily. At the same time, if some commands start failing for some reason, it will most often be simpler to debug them by running them locally.
More productivity. Time spent waiting for a test suite to finish could be spent working on some small improvement or a follow-up task. Essentially, you swap time spent watching and waiting for time spent building product or innovating.
Less coordination. Similar to the above, the time two engineers spend coordinating the deployment of two separate changes can be turned into productive time. Instead of the engineers waiting on each other to finish, modern CI services can ensure that all the changes go out one by one. In this case, make sure that error alerts reach the engineers automatically, e.g. by a Slack message or email.
Rollbacks are easy. Triggering deployment for a previous code version can be as trivial as reverting the commit and deploying it.
Diverse workflows. Test, build, and deploy are just the obvious ones. This model makes it easy to add things like side-deploys, canary testing, or anything else you can think of. As long as it is scriptable, it is automatable. This in turn enables GitOps.
Single source of truth. There is only one way to execute an action - the way documented in a single place in the code.

When does this not work?

The most common scenario where this does not work is when the CI/CD environment and the local environment do not match, e.g. they use a different interpreter or operating system, or when the commands executed in CI/CD are different from those that engineers use locally. There is an implicit, and I think natural, assumption that if CI/CD is running a test suite, it is doing the same thing an engineer would do locally. When that assumption breaks, it leads to “but it works on my machine” frustration and sometimes hours wasted fighting something that is not even broken.
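
One cheap guard against environment drift is a preflight check that runs at the start of the script, both locally and on CI. This is only a sketch: the expected Python version mirrors the buildspec above, while the expected operating system and the function name are assumptions of mine.

```python
import platform
import sys

# Assumption: the setup the team agreed on, stored next to the automation scripts.
EXPECTED = {"python": (3, 7), "system": "Linux"}


def environment_mismatches(expected=EXPECTED, python=None, system=None):
    """List differences between the current machine and the expected setup."""
    python = tuple(python or sys.version_info[:2])
    system = system or platform.system()
    problems = []
    if python != tuple(expected["python"]):
        want = ".".join(map(str, expected["python"]))
        have = ".".join(map(str, python))
        problems.append(f"python: expected {want}, found {have}")
    if system != expected["system"]:
        problems.append(f"system: expected {expected['system']}, found {system}")
    return problems
```

Printing the returned list as warnings at the top of every run makes a mismatch visible before anyone spends an afternoon debugging it.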

Conclusion

All the arguments boil down to this: CI/CD services should run the same commands an engineer would run locally. I hope this article has convinced you, or at least planted a seed, that by using a single source of truth for your automation, you can make yourself and your team more productive.

On Continuous Automation