Red/Black Deployments at PipelineDeals

Martin Fowler’s post on BlueGreenDeployment gives a name to a deployment practice that is used by many different organizations. Our deployment practice is quite similar to the process that Martin describes, with a few distinct differences.

In his post, he describes using two identical stacks, one of which is the hot stack, servicing production requests. The other is the warm stack, which has the newest build and can be quickly switched to.

Our deployment process differs slightly from what Martin describes in that we don’t keep unused instances up for staging purposes. For us, that isn’t a great use of capital or resources. Because we are firm believers in immutable infrastructure, we instead fire up the new instances we need on demand, and retire the old stack after the deployment is complete.

This practice is shared by many other teams, and I’m going to call it Red/Black deployment, as a tribute to Netflix’s deployment setup. We execute it with a combination of tools centered on AWS and the Ansible configuration management tool.

On top of Ansible, we have a tool called Deployer, which listens for Hubot commands run in our Operations HipChat channel and responds accordingly.

The deployment process is a state machine, and there are three states the system can be in at any given time. The Hubot commands are the arrows that transition the machine from one state to the next.
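
To make that concrete, here is a minimal sketch of the transitions in Python. Only Cruise and the built state are named explicitly in this post, so the "deployed" state and the lowercase names below are illustrative stand-ins, not our actual code.

```python
# A minimal sketch of the deploy state machine. "deployed" is an assumed
# name for the third state; the post only names "Cruise" and "built".
TRANSITIONS = {
    ("cruise",   "build"):    "built",
    ("built",    "deploy"):   "deployed",
    ("deployed", "cleanup"):  "cruise",
    ("deployed", "rollback"): "built",
}

def next_state(state, command):
    """Return the state the machine moves to, or raise on an illegal move."""
    try:
        return TRANSITIONS[(state, command)]
    except KeyError:
        raise ValueError(f"cannot run '{command}' from state '{state}'")
```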

Why you should do this

There are a few reasons to consider adopting this type of deployment. The first is a less jarring experience than more traditional approaches: routing requests to separate infrastructure that is ready to receive them is generally smoother than bouncing application servers and letting requests queue up.

Another reason is the rollback option. If things go wrong, you can roll back to the exact same hardware you were running on just before the deploy.

Third, with a Red/Black deployment you exercise your configuration management every time you deploy. That keeps the recipes from going stale and breaking over time from lack of use.

We’re not the only engineering team pushing this practice forward. Betterment recently released slides that describe their deployment process, which is very similar to ours. Netflix essentially does the same process with two autoscaling groups.

Drawbacks

One of the biggest drawbacks of a Red/Black deployment strategy is the build time, which for us currently takes about 8 minutes. This limits the number of deployments we can do daily, and eventually will not scale with our pace of development.

Another drawback is the complexity of the deployment machine. It takes a long time to groom your Ansible recipes to the point where they are completely autonomous and reliable.

So if you’re ready, we’ll go through how PipelineDeals implements Red/Black deploys below.

Cruise

Cruise is the state where we live most of the time: cruising along, servicing requests, with no active deploys in flight.

Build

To start the deployment process, new servers need to be provisioned. It all starts with one simple Hubot command:

```
hubot deploy pld:build
```

This will do the following:

  1. Hubot sends an API command to our Deployer app, which is responsible for running the actual Ansible commands.
  2. The Deployer app spins up the instances that make up PipelineDeals: 2 app instances, 2 sidekiq instances, and 2 API instances.
  3. The instances undergo a health check.
  4. If all the instances pass the health check, they are tagged as new. For example, the healthy app servers that were just spun up get tagged as new-app-server.

Afterwards the app instances are attached to a test load balancer, where we can run any final sanity checks or tests that absolutely require the production environment (we do our best to minimize this case, but it happens).
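
The sketch below shows roughly what this provisioning flow looks like using boto3 against EC2 and a classic ELB. The AMI ID, instance type, load balancer name, and the healthy() helper are placeholders, and the real Deployer also drives Ansible, which is omitted here.

```python
import boto3

ec2 = boto3.client("ec2")
elb = boto3.client("elb")  # classic ELB client, typical for the era

ROLES = {"app": 2, "sidekiq": 2, "api": 2}  # instance counts from the post

def healthy(instance_id):
    """Placeholder: the real check hits an HTTP health endpoint on the box."""
    return True

def build():
    new_ids = {}
    for role, count in ROLES.items():
        resp = ec2.run_instances(
            ImageId="ami-XXXXXXXX",    # placeholder AMI
            InstanceType="m3.large",   # placeholder instance type
            MinCount=count,
            MaxCount=count,
        )
        new_ids[role] = [i["InstanceId"] for i in resp["Instances"]]

    # Wait for everything to boot, then health-check and tag as "new".
    all_ids = [i for ids in new_ids.values() for i in ids]
    ec2.get_waiter("instance_running").wait(InstanceIds=all_ids)

    for role, ids in new_ids.items():
        for instance_id in ids:
            assert healthy(instance_id), f"{instance_id} failed health check"
        ec2.create_tags(
            Resources=ids,
            Tags=[{"Key": "Name", "Value": f"new-{role}-server"}],
        )

    # Attach the app instances to a *test* load balancer for final checks.
    elb.register_instances_with_load_balancer(
        LoadBalancerName="pld-test",   # placeholder name
        Instances=[{"InstanceId": i} for i in new_ids["app"]],
    )
```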

The build is the longest step of the process, taking about 8 minutes to complete.

[Figure: The deploy build process at work]

Deploy

After the build checks have run and any manual verification is complete, the deploy is ready to go.

[Figure: Running the deploy]

The deploy command is very quick, and does the following (a boto3 sketch follows the list):

  1. The new servers get attached to the production load balancer.
  2. The old servers are immediately removed from the production load balancer.
  3. All servers get re-tagged: new-app-server instances become hot-app-server, and the previous hot-app-server instances become old-app-server.
  4. The developers responsible check New Relic and other sources for any anomalies in error rates or response times.
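
Here is a rough sketch of that swap, under the same assumptions as the build sketch. The "pld-production" load balancer name and the tag-query helper are illustrative, not our actual code.

```python
import boto3

ec2 = boto3.client("ec2")
elb = boto3.client("elb")

def instances_tagged(name):
    """Return IDs of running instances carrying the given Name tag."""
    resp = ec2.describe_instances(Filters=[
        {"Name": "tag:Name", "Values": [name]},
        {"Name": "instance-state-name", "Values": ["running"]},
    ])
    return [i["InstanceId"]
            for r in resp["Reservations"] for i in r["Instances"]]

def deploy():
    new = instances_tagged("new-app-server")
    hot = instances_tagged("hot-app-server")

    # 1. Attach the new servers to the production load balancer.
    elb.register_instances_with_load_balancer(
        LoadBalancerName="pld-production",   # placeholder name
        Instances=[{"InstanceId": i} for i in new],
    )
    # 2. Immediately pull the old servers out of rotation.
    elb.deregister_instances_from_load_balancer(
        LoadBalancerName="pld-production",
        Instances=[{"InstanceId": i} for i in hot],
    )
    # 3. Re-tag: the current hot servers become old, the new ones become hot.
    if hot:
        ec2.create_tags(Resources=hot,
                        Tags=[{"Key": "Name", "Value": "old-app-server"}])
    ec2.create_tags(Resources=new,
                    Tags=[{"Key": "Name", "Value": "hot-app-server"}])
```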

If everything looks good, then the next command, cleanup, is run.

Cleanup

Cleanup is another very fast command. It brings the deployment state back to Cruise by terminating the old app servers.
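
In boto3 terms, and reusing the hypothetical instances_tagged() helper from the deploy sketch, cleanup amounts to something like this:

```python
# Cleanup sketch: terminate whatever is still tagged "old", returning the
# machine to the Cruise state. Reuses ec2 and instances_tagged() from the
# deploy sketch above.
def cleanup():
    old = instances_tagged("old-app-server")
    if old:
        ec2.terminate_instances(InstanceIds=old)
```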

Whoops! Rollback!

Ruh roh. On the rare occasion that we detect a problem after deploying, we execute the rollback command. This immediately backs out the deploy and puts us back into the built state, where we can do further analysis into what happened.
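
Under the same assumptions as before, rollback is essentially the deploy sketch in reverse: put the previous servers back in rotation, pull the new ones out, and restore the tags.

```python
# Rollback sketch: reuses ec2, elb, and instances_tagged() from the deploy
# sketch above. This lands us back in the built state, with the previous
# servers serving production again.
def rollback():
    hot = instances_tagged("hot-app-server")   # the servers just deployed
    old = instances_tagged("old-app-server")   # the previous production set

    # Put the previous servers back into production rotation...
    elb.register_instances_with_load_balancer(
        LoadBalancerName="pld-production",     # placeholder name
        Instances=[{"InstanceId": i} for i in old],
    )
    # ...and pull the just-deployed servers out.
    elb.deregister_instances_from_load_balancer(
        LoadBalancerName="pld-production",
        Instances=[{"InstanceId": i} for i in hot],
    )
    # Restore the tags: hot -> new, old -> hot.
    ec2.create_tags(Resources=hot,
                    Tags=[{"Key": "Name", "Value": "new-app-server"}])
    ec2.create_tags(Resources=old,
                    Tags=[{"Key": "Name", "Value": "hot-app-server"}])
```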

Going forward

If you’re already using configuration management and practicing immutable infrastructure, this deployment strategy might help. Not only will it keep your recipes in shape and well tested, it also helps ensure deployments are smooth.

The future is looking good for this deployment method. As tools like Docker mature, we should be able to reduce the build time from minutes to a matter of seconds.