The bus that couldn't slow down

Consider a solitary gold miner. 99% of the time mining for gold is spent at the face, making incremental progress. A vein followed here, a dead end routed around there. Then there are those rare moments when the only way to make further progress is to make a lot at once - blow the face away to discover what lies beyond the blocked shaft.

Too tortured a metaphor? Perhaps. But keeping a software product going is a lot like this. Most of the time you’ll make progress in bits and pieces, and once in a while you’ll take a bigger jump. The following is a discussion of how the happy band of hackers at PipelineDeals took one of these jumps recently, and how our infrastructure and deployment setup made that possible without any customers noticing.

The road to here

A year and a bit ago, we were in a very different place. PipelineDeals was running Ruby 1.9.3 and Rails 2.3, and we were using Jammit for asset compilation. Being on so ancient a version of Rails was the most pressing pain point - it locked us into older versions of the gems we depend on and prevented us from ditching Jammit for the Rails asset pipeline. What we had worked well enough, but we knew we were living on borrowed time.

We had one goal:

Set up and use a repeatable process for major upgrades of pieces of our software stack, such that customers don’t even notice we did it.

That means no ‘log out and log in again’, no ‘clear your cache, please!’, and certainly no downtime (scheduled or otherwise). We decided that upgrading to Rails 3.0 would be our first big bite out of the technical debt sandwich, and it’s that upgrade I’ll be covering here.

Deployments: A new hope

We’ve blogged about our deployment strategy before - we love it, and it gives us a ton of flexibility. It turns out that this flexibility is crucial to accomplishing what we set out to do. Because our infrastructure and deployment logic are just code, this repeatable process starts out with a pull request against our Ansible playbook repository.

Step 0: two lanes

We run two app servers behind our load balancer in production, serving app.pipelinedeals.com:

[Image: pipelinedeals production]

Our infrastructure PR changes our deployment process so that our setup looks like this:

[Image: pipelinedeals rails3]

We’re now at a place where we can test rails3 in production by hitting rails3.pipelinedeals.com instead of app.pipelinedeals.com, noticing and fixing errors that occur only under rails3 (using NewRelic), while regular deployments carry on unaffected. Step 0 achieved. This feedback loop - where we can dogfood our upgrade, making it available to ourselves and the rest of the company to harden before a real customer sees it - is the heart of this process.
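To make the two lanes concrete, here’s roughly what the deploy now boils down to - a minimal sketch, where the Deploy class, playbook name, and host group names are illustrative stand-ins rather than our real code:

```ruby
# Sketch only: each deploy pushes two builds from the same codebase,
# one per lane, by limiting the ansible-playbook run to a host group.
class Deploy
  def run
    deploy_lane("master", "app_servers")    # app.pipelinedeals.com
    deploy_lane("rails3", "rails3_servers") # rails3.pipelinedeals.com
  end

  private

  def deploy_lane(branch, hosts)
    system("ansible-playbook", "deploy.yml",
           "--limit", hosts,
           "--extra-vars", "app_branch=#{branch}") or
      raise "deploy of #{branch} to #{hosts} failed"
  end
end
```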

Step 1: get a job

Our app does a lot of background processing. Imports, exports, sending emails, bulk actions, etc. - there’s a lot going on outside of a user’s request-response cycle. Alongside our two app servers, our deployments also stand up two queue servers running sidekiq, which share a single redis instance so that jobs can be executed on any machine. To be confident in our upgrade, we’ll need to send some jobs to a rails3 queue server and see what breaks. We extend the PR above so that a build also includes a queue server running the rails3 upgrade branch.

[Image: pipelinedeals queue servers rails3]
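The shared-redis arrangement itself is just ordinary sidekiq configuration. A minimal sketch, assuming an initializer along these lines (the redis URL is a made-up placeholder, not our real one):

```ruby
# config/initializers/sidekiq.rb
# Every queue server - old stack and rails3 alike - points at the same Redis,
# so any machine can pick up any job from the shared queues.
redis_config = { url: ENV.fetch("REDIS_URL", "redis://redis.internal:6379/0") }

Sidekiq.configure_server { |config| config.redis = redis_config }
Sidekiq.configure_client { |config| config.redis = redis_config }
```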

We now have a queue server in play that we fully expect to start erroring on some jobs. Two things stop customers noticing this: no error page is returned to users, since this is asynchronous processing, and sidekiq jobs are automatically retried several times (with three queue servers now in play, each retry has a roughly two-in-three (~66%) chance of landing on a safe one). At this point, we’re running through our feedback loop again, fixing errors and redeploying as we go.
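Sidekiq’s built-in retries are what make that safety net work. A sketch of a typical worker - the class name, retry count, and model call here are illustrative, not one of our actual jobs:

```ruby
# Illustrative worker: if it raises on the rails3 queue server, sidekiq
# re-enqueues it with a backoff, and the retry may well be picked up by a
# queue server still running the old stack.
class ExportWorker
  include Sidekiq::Worker
  sidekiq_options retry: 5

  def perform(export_id)
    Export.find(export_id).generate!  # hypothetical model call
  end
end
```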

Step 2: (partial) showtime

We’ve de-risked the deploy as much as we can, and now it’s time to put it in front of customers. We do this with a small change: on deploy, we move the rails3 app server onto the production load balancer alongside our other two app servers.

From now on, customers have a 1 in 3 chance of their next request being served by a rails3 server. This is where the rubber meets the road, and we find out how good a job we did weeding out the bugs. Since every step in our deployment process is just a method call (invoking an ansible playbook under the hood), it only takes a few seconds to yank the rails3 app server out of rotation if things go very wrong.
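That escape hatch is nothing fancier than rerunning the load balancer playbook with a different input. Something along these lines - the method, playbook, and variable names are made up for illustration:

```ruby
# One method call adds or removes the rails3 app server from rotation by
# shelling out to ansible-playbook with a flag.
def set_rails3_in_rotation(enabled)
  system("ansible-playbook", "load_balancer.yml",
         "--extra-vars", "rails3_in_rotation=#{enabled}") or
    raise "load balancer update failed"
end

set_rails3_in_rotation(true)   # step 2: put rails3 in front of customers
set_rails3_in_rotation(false)  # pull it back out if error rates spike
```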

Step 3: nothing to see here

By this time, we’ve had a few days to observe our upgrade under live traffic. This is where we’ll notice any lingering errors that occur infrequently or in our cron jobs. Once we’ve fixed all we’ve found and the error rates have fallen off, it’s time to party!

The cleanup is uneventful - we merge our branch into master and revert our Ansible PR, taking us back to a single deployment path. A little automation goes a long way, and in our case it gave us the flexibility to bite off only as much as we could chew.