Continuous Integration
Started | 2022-12-06 |
Decided | 2023-01-27 |
Last amended | 2023-04-04 |
Why
The sudden decommissioning of Hydra forces us to revisit our Continuous Integration (CI) setup.
Decision
We predominantly rely on Buildkite as our CI system.
We build artifacts and run checks on them. Artifacts include compiled executables, but also the source code itself. Checks include unit and integration tests, but also source code linters. We specify our artifacts in `flake.nix`, and most of our checks, too.
We perform these builds and checks at different granularities. To keep our use of computing resources within reasonable limits, we don't automatically check everything on every commit. Instead, the granularities are:
- post-commit = (at most once) after each commit; for a quick sanity check after a push
- pre-merge = before merging each pull request; for a reasonably complete, automated check that master will satisfy functional requirements
- post-merge = after merging each pull request to master; for an exhaustive check that master satisfies functional requirements on all platforms
- nightly = every night; for an exhaustive, automated check of functional and non-functional requirements
The following table lists our artifacts, the checks performed on them (`.` for build), the granularity at which the check is performed, and the CI system used for doing that:
Artifact | Check | Granularity | CI System | (Status) |
---|---|---|---|---|
Source code | Code formatting style | post-commit | Buildkite | 🔵 |
Documentation | . | post-commit | Github Action | 🔵 |
Pull request (PR) | Mergeable to master concurrently with other PRs | pre-merge | Bors | 🔵 |
Compiled modules | . | post-commit | Buildkite | 🔵 |
Compiled modules | Unit tests (linux) | post-commit | Buildkite | 🔵 |
Compiled modules | Unit tests (macos) | post-merge | Buildkite | 🔵 |
Compiled modules | Unit tests (windows) | nightly | Github Action | 🔵 |
Executables / Release archive | . (linux) | pre-merge | Buildkite | 🟡 ADP-2502 |
Executables / Release archive | . (macos) | post-merge | Buildkite | 🔵 |
Executables / Release archive | . (windows, cross-compiled) | pre-merge | Buildkite | 🔵 |
Executables | Integration tests (linux) | pre-merge | Buildkite | 🔵 |
Executables | | | Buildkite | 🔴 ADP-2522 |
Executables | | | Github Action | 🔴 ADP-2517 |
Executables | Benchmarks | nightly | Buildkite | 🔵 |
Release archive | E2E tests | nightly | Github Action | 🔵 |
Docker image | . | post-commit | Buildkite | 🔵 |

Legend: Status 🔵 = working; 🟡 = needs work; 🔴 = not working
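To make the table easier to query, it can be encoded as plain data. The following is only an illustrative sketch: the names `Row`, `TABLE` and `actions_for` are hypothetical, only a few rows are transcribed, and attributing the unit-test rows to "Compiled modules" is an assumption based on the table layout. It selects the actions scheduled at exactly one granularity and does not assume that granularities are cumulative.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Row:
    artifact: str
    check: str        # "." stands for "build the artifact"
    granularity: str  # post-commit, pre-merge, post-merge or nightly
    ci_system: str


# A subset of the rows from the table; rows with missing cells are left out.
TABLE = [
    Row("Source code", "Code formatting style", "post-commit", "Buildkite"),
    Row("Documentation", ".", "post-commit", "Github Action"),
    Row("Compiled modules", "Unit tests (linux)", "post-commit", "Buildkite"),
    Row("Compiled modules", "Unit tests (macos)", "post-merge", "Buildkite"),
    Row("Compiled modules", "Unit tests (windows)", "nightly", "Github Action"),
    Row("Executables", "Integration tests (linux)", "pre-merge", "Buildkite"),
    Row("Executables", "Benchmarks", "nightly", "Buildkite"),
    Row("Release archive", "E2E tests", "nightly", "Github Action"),
]


def actions_for(granularity: str, ci_system: Optional[str] = None) -> List[Row]:
    """Return the rows scheduled at exactly this granularity,
    optionally restricted to one CI system."""
    return [
        row
        for row in TABLE
        if row.granularity == granularity
        and (ci_system is None or row.ci_system == ci_system)
    ]
```

For example, `actions_for("nightly", "Buildkite")` returns only the benchmarks row.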
Details
Granularity
- Granularity refers to automatic actions taken by the CI system. It should be possible to trigger a build or check manually at any time.
- The purpose of granularity is to conserve computing resources; in a world with infinite resources, the system would perform every build and check on every commit.
- The name of the granularity “post-commit” was chosen for brevity: the action is performed automatically on the latest commit after a `git push`, not on the git commits in between. In other words, the action is performed at most once per commit.
- We use the “post-merge” granularity for actions that
  - consume scarce resources and have a high chance of failing, e.g. builds and checks on macOS
- We use the “nightly” granularity for actions that
  - consume many resources, e.g. benchmarks
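The “post-commit” rule above (only the latest commit of a push is checked, and each commit at most once) can be sketched as follows. This is a minimal model, not how any CI system implements it; the function name `commits_to_check` and the representation of a push as a list of commit ids, oldest first, are assumptions made for illustration.

```python
from typing import List


def commits_to_check(pushes: List[List[str]]) -> List[str]:
    """Pick the commits that a post-commit action would run on.

    Each push is a list of commit ids, oldest first. Only the last
    commit of a push is checked, and no commit is checked twice.
    """
    checked: List[str] = []
    seen = set()
    for push in pushes:
        if not push:
            continue  # nothing was pushed
        head = push[-1]  # the latest commit of this push
        if head not in seen:  # at most once per commit
            seen.add(head)
            checked.append(head)
    return checked
```

With the pushes `[["a", "b", "c"], ["d"], ["d"]]`, only `c` and `d` are checked: the intermediate commits `a` and `b` are skipped, and the repeated push of `d` does not trigger a second run.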
CI System
As a general rule, we choose
- Github Actions for actions that
  - are very simple and do not require a nix store / environment
  - run on Windows
- Buildkite otherwise
  - especially for actions that require a nix store
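The general rule above can be written down as a small decision function. This is a sketch under stated assumptions: the function name and its three boolean parameters are hypothetical, and the two Github Actions criteria are read as alternatives (either one is sufficient).

```python
def choose_ci_system(
    requires_nix_store: bool,
    runs_on_windows: bool,
    is_very_simple: bool,
) -> str:
    """Apply the general rule for picking a CI system for one action."""
    # Actions that run on Windows go to Github Actions.
    if runs_on_windows:
        return "Github Actions"
    # Very simple actions that need no nix store / environment also go there.
    if is_very_simple and not requires_nix_store:
        return "Github Actions"
    # Everything else, especially anything needing a nix store, uses Buildkite.
    return "Buildkite"
```

Note that a cross-compiled Windows build does not itself run on Windows, so under this reading it still falls through to Buildkite, which matches the table.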
This creates a tension: we have to set up some checks (e.g. unit tests) in two different environments, because the operating systems are available on different CI systems:
- Linux, macOS: in Buildkite
- Windows: in Github Actions
We hope to address this tension by requesting a Windows machine for use with Buildkite.
Platform macOS
At the time of writing, we have two Mac mini machines that act as Buildkite agents. Unfortunately, they are frequently overloaded, which causes builds or checks to fail. Hence, we only use the “post-merge” or “nightly” granularity for them.
Company Processes
For developing and maintaining our CI, we may use DevX/SRE expertise from IOG.
- Our tribe is responsible for choosing our CI tooling
- Our tribe should have a process for getting DevX/SRE support
- Our tribe's DevX/SRE resources can help teach us how to debug problems that arise
- Link to the SRE Chapter of IOG
Rationale
Artifacts and checks
The two main concerns of a CI pipeline are: building artifacts and running checks.
The purpose of building an artifact is to produce, say, an executable or HTML. The purpose of running a check is to verify that the artifact satisfies certain properties, e.g. that all unit tests pass.
Different CI systems, like Hydra, Cicero, Buildkite or Github Actions, have a different focus regarding these concerns.
- The world view of Hydra is that everything is about building artifacts. Hydra was surprisingly successful as a CI tool, because this world view can be used for running checks, too: they can be expressed as trivial artifacts, where success of the check is equivalent to success of building `()`, and failure of the check is equivalent to failure of the artifact build.
- The world view of Github Actions, Buildkite or Cicero is that everything is about running checks. The drawback is that building artifacts is more difficult and we have problems managing the build cache.
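The idea of expressing a check as a trivial artifact can be illustrated with a small sketch. This is not how Hydra or Nix implement it; `BuildFailure` and `check_as_artifact` are hypothetical names, and a check is modelled simply as a function returning `True` on success.

```python
from typing import Callable, Tuple


class BuildFailure(Exception):
    """Raised when an artifact cannot be built."""


def check_as_artifact(check: Callable[[], bool]) -> Callable[[], Tuple[()]]:
    """Wrap a check so that it looks like an artifact build.

    The resulting build produces the trivial artifact () when the check
    succeeds, and fails when the check fails.
    """
    def build() -> Tuple[()]:
        if not check():
            raise BuildFailure("check failed, so the artifact build fails")
        return ()  # the trivial artifact carries no information

    return build
```

A system that only knows how to build artifacts can then run checks by building these wrappers: a successful build means the check passed.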
For us, the main takeaway is that we should try to separate these concerns clearly.
Our artifacts include source code and compiled executables. We have different checks on these: linters and style checkers on the source code, unit and integration tests on the executables.
As we are coming from Hydra, compiling executables is easiest to do through a cached nix store. At the moment, it looks like only Buildkite has good support for that; hence we choose Buildkite.
Our options for CI system
Buildkite
- Pro: Good at artifacts, working nix cache
- Pro: Good documentation, easy to write
- Con: Dependency on machine (currently provided by SRE / Samuel Leathers)
- Con: No Windows machine
- Con: Dependency on permissions (currently only SRE / Samuel Leathers has write permission)
In a pinch, the dependencies can be solved by forking the repository and providing our own machines.
Github Action
- Neutral: Good at small actions, but problems at scale
- Neutral: Good documentation, but a bit cumbersome to write
- Pro: No dependency on machine
- Pro: Windows machine
- Pro: No dependency on permissions
Cicero
- Con: Poor at artifacts, nix cache currently not working properly
- Con: Poor documentation
- Neutral: Dependency on machine (provided by SRE, but they have a long-term commitment)
- Con: No Windows machine
- Pro: No dependency on permissions
References
[1] G Kim, K Behr, G Spafford; The Phoenix Project; IT Revolution Press (2013). A business novel about the DevOps movement: make the flow of work visible and automate it, to an extreme of, say, 30 releases per day.
[2] Cicero on Github
Scratchbook
Random Findings
Installing Nix with the `cachix/install-nix-action` Github Action: https://github.com/input-output-hk/cardano-node/blob/db396b163af615aa89286aa985583ef8843cfcde/.github/workflows/check-mainnet-config.yml#L16-L23
Documentation Findings
Cicero
Cicero = An engine for executing actions. An “action” is an arbitrary program (Bash, Python, Nix, …) that is run in the Nomad execution environment.
Tullia = A domain specific language, embedded in the Nix language, for expressing actions to be run with Cicero. This is useful when writing Cicero actions that mainly build stuff with Nix.