Every project can benefit from build automation, and there are plenty of excellent tools nowadays eager to help with that. In this blog post, I will share my experiences with some of them, talk about the major problems we had along the way, and explain our current solution with Bamboo and Docker.
Bamboo, like most other continuous integration tools, has a concept of remote agents. Since building a project is usually an intensive and long-running task, it is nice to be able to dedicate a number of build machines and let Bamboo load-balance between them. Also, as projects grow, evolve, and branch, different build requirements start to pop up. One project might require one version of Java, but another needs a different one. One project might need Node.js 5, while another works with Node.js 8. Sometimes it is possible to keep multiple versions of these tools installed on the same machine, but very often something will clash, and the simplest way out is to get yourself another build agent.
Working in an enterprise environment doesn’t make things any easier. Maintenance periods for software can be quite long, and you never know when someone will have to make a bug fix for a five-year-old release. This means we have to make sure that even after that amount of time we can rebuild our old software with the same build environment that was used for the original release.
Ten years ago, every build agent was a physical machine. This meant that adding new build agents wasn’t an easy task and the total number of agents was very limited. That limited number of agents had to be shared between dozens of teams and hundreds of different projects. And, in the end, this meant that our builds never worked for more than two weeks without a problem. Usually, a random guy from another floor (let’s call him Bill) desperately needed to remount some directories to get his tests working properly, which, of course, broke all other builds that were using the same agent. Then the rest of the company would spend hours trying to figure out why the builds were suddenly failing, and in the end, we all had to sit together to find a solution that would work for everyone. A couple of weeks later someone else would again make “a tiny benevolent change” to an agent and, of course, break a number of other builds. And the entire process had to be repeated.
We had excellent ideas about making maintenance of build agents a strict responsibility of the IT department and preventing everyone else from messing with them, but that never worked in practice. Requesting every change from IT was a huge overhead that meant waiting for days, or even weeks, for the tiniest change on the agent. And, in the end, IT would simply do exactly what was requested of them, no questions asked, thus breaking some builds. :)
Then virtual machines came. We immediately virtualized all agents, dumped the old machines and everything was great. Now we could trivially clone agents, create snapshots and recover from them. If someone needed a different build environment, IT was able to simply provide a new agent, instead of messing with the old ones.
But the joy didn’t last for long. The number of agents quickly grew, making the Bamboo license more and more expensive. We soon hit the licensed agent limit and were back at square one: sharing agents between projects and constantly messing with each other’s work.
Finally, four years ago, we got Docker. After ignoring it for a while, I gave it a try and still decided to ignore it for our production deployments. However, it seemed like a helpful tool that could finally bring our build environment up to a level we all desperately needed.
Here was the idea: we replace agents running as virtual machines with agents running inside Docker containers. Instead of having a bunch of VMs running all the time, we let Bamboo start the container it needs before each build. And we let each team maintain their own 'dockerized' agents. Voilà, problem solved!
However, after some initial investigation, I was a bit disappointed to find out that Bamboo doesn’t come with any support for something like that (well, not sure what I was expecting when the cost of the license is calculated based on the number of active agents). It seemed that Jenkins had better support in this area, but we were a bit locked into Atlassian’s ecosystem. And we also liked the integration between Bamboo, Confluence and JIRA. So, there was no other choice; we had to do it ourselves. And here’s how we did it.
Creating a dockerized Bamboo agent
First, we needed to make a Docker image that would become our Bamboo agent. This was probably the simplest task of all; there’s even an official image of a Bamboo agent made by Atlassian that can be found on Docker Hub and used as a starting point. We just had to make sure that the image had all the right tools installed and it was ready to go.
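As a rough sketch, such a Dockerfile can stay very small: start from Atlassian's agent base image and layer the team-specific tools on top. The exact image tag, package names, and user are assumptions here, not what we necessarily used.

```dockerfile
# Hypothetical example of a team-specific agent image.
FROM atlassian/bamboo-agent-base:latest

USER root
# Install the build tools this team's projects need
# (Maven and Node.js are just illustrative choices).
RUN apt-get update && \
    apt-get install -y --no-install-recommends maven nodejs npm && \
    rm -rf /var/lib/apt/lists/*
USER bamboo
```

Each team can keep a Dockerfile like this next to their project, so the build environment is versioned together with the code it builds.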
Creating the “master build agent”
The next step was to get ourselves the master build agent. This was a beefy virtual machine with a bunch of CPU cores and even more RAM. That machine would serve as the Docker host and would be used to run all other 'dockerized' agents. This agent was the only one that was permanently online and connected to the Bamboo server. All other agents would start before the build and shut down afterwards.
Regarding the Bamboo configuration, it is important to mention that this agent had a “Docker” capability, meaning it could run Docker containers.
Automatically starting the dockerized agent before the build
This was the tricky bit. First, we separated all our build plans into two stages:
1. “Start Docker container” stage
2. “Build and test” stage
The first stage, as you can probably tell from the name, was responsible for starting the container. This stage was configured with the “Docker” requirement (meaning it had to run on the master agent) and executed a bash script that started the container. To know which Docker image to use, this script read a simple file that we called “docker.definition” and checked in together with our source code. Once the container started, the Bamboo agent installed inside it would boot up, and within a couple of seconds it would be ready to receive tasks.
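The startup script can be sketched roughly like this. The file layout, function names, and docker flags are assumptions; only the overall flow follows the description above.

```shell
#!/bin/sh
# Sketch of the "Start Docker container" stage script.

# docker.definition is checked in next to the source code and holds a
# single line: the agent image to use for this plan.
read_image() {
    head -n 1 "$1"
}

# Start a detached agent container, forwarding the plan key so the agent
# can register it as a dynamic capability when it boots.
start_agent() {
    image=$1
    plan_key=$2
    docker run -d -e "PLAN_KEY=${plan_key}" "$image"
}

# Usage inside the Bamboo task (Bamboo substitutes ${bamboo.planKey}):
#   start_agent "$(read_image docker.definition)" "${bamboo.planKey}"
```

Keeping docker.definition in the repository means each branch can pin its own build environment, which is exactly what long maintenance periods require.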
The second stage had the real requirements; usually something like Java, Maven and NPM. The container started in the previous stage provided all the required capabilities, and once the agent was fully up and running, it would happily pick up the second stage and finish the build.
However, this still had a problem: since a lot of build plans had similar requirements, it was possible for different jobs to steal each other’s agents. To solve this, we added a dynamic capability to each 'dockerized' agent. This capability was called “planKey” and had the actual plan key as its value. To put this to use, the second stage of every build plan had to be configured with a requirement for that specific plan key.
Injecting the dynamic capability was quite easy. On startup, the Bamboo agent reads a file called bamboo-capabilities.template, if one is present. So, the only thing that had to be done was to put such a file inside the Docker image and have it define the planKey capability from an environment variable.
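The template line itself would look something like the following; the PLAN_KEY variable name is an assumption (it just has to match whatever the startup script forwards into the container).

```properties
# bamboo-capabilities.template -- the ${...} placeholder is filled in
# from the container's environment when the agent starts
planKey=${PLAN_KEY}
```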
The startup script simply had to forward that environment variable to the newly started container and that was it.
Stopping the container after the build
The last piece of the puzzle was stopping the container once it was no longer needed. The first thing we tried was a script at the end of the second stage that would stop the Bamboo agent (thus also stopping the container running it). However, this didn’t work, because killing the agent from within the build stage didn’t give it a chance to finish all its post-build tasks properly.
In the end, we came up with the following solution:
a) On startup, the script that launched the container would also write a marker file, named after the plan key, to the master agent’s hard drive.
b) A script on that agent would constantly watch for such files. For each one, it would take the plan key, check whether that plan was still running (by querying Bamboo’s REST API), stop the container once the build was done, and delete the file.
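The watchdog loop can be sketched as follows. The watch directory, Bamboo URL, REST endpoint, and the convention that the marker file stores the container id are all assumptions about a typical setup, not the exact production script.

```shell
#!/bin/sh
# Sketch of the watchdog running on the master agent.
WATCH_DIR=${WATCH_DIR:-/var/bamboo/running-plans}
BAMBOO_URL=${BAMBOO_URL:-http://bamboo.example.com}

# Each marker file is named after the plan key.
plan_key_from_file() {
    basename "$1"
}

plan_is_running() {
    # Bamboo's result resource exposes a lifeCycleState field; anything
    # other than "InProgress" means the latest build has finished.
    curl -sf "${BAMBOO_URL}/rest/api/latest/result/$1/latest.json" \
        | grep -q '"lifeCycleState":"InProgress"'
}

# Run one sweep over all marker files: stop containers whose plans are
# done, then remove the corresponding marker files.
sweep_once() {
    for f in "$WATCH_DIR"/*; do
        [ -e "$f" ] || continue
        if ! plan_is_running "$(plan_key_from_file "$f")"; then
            docker stop "$(cat "$f")" && rm -f "$f"
        fi
    done
}
```

Running sweep_once from cron (or in a simple sleep loop) every minute or so is enough, since a container lingering a minute past its build costs essentially nothing.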
You might call all of this overly complicated and hacky, and you would probably be right. I certainly didn’t expect this level of complexity when the original idea of automatic dockerized agents came to me. However, after polishing some initial quirks, I can happily say that this has been working extremely well for over a year. A full year without a single incident of Bill breaking our builds, and a full year of freedom to choose any technology we like, without asking ourselves how we are going to build it on that old shared build machine, all while greatly reducing licensing costs.