Global Software Delivery: Best Practices
Jun 16, 2017 by DROdio
Andy Glover recently gave a Spinnaker Deep Dive presentation at an LA DevOps Meetup at Tinder HQ. Here are some of the global software delivery best practices Andy shared in his presentation:
Favor regional deployments over global ones
Even though Spinnaker (the continuous delivery and infrastructure platform open-sourced by Netflix) facilitates simultaneous regional deployments, Netflix doesn’t recommend doing them.
In the event of a bad deployment — be it a software bug or infrastructure issue — keeping deployments focused to a region means your blast radius is limited (rather than it affecting the entire globe).
How to leverage Spinnaker to employ this best practice: The screenshot below shows a deployment pipeline in Spinnaker for a Netflix app called “api,” which is dedicated to deploying to US-East-1. What’s more, this particular pipeline also makes use of an additional best practice (described below) by using a deployment window.
You can also see a separate pipeline for a different region, US-West-2.
NOTE: Spinnaker enables one pipeline to kick a second pipeline off — so in this example, an application developer could easily trigger the US-West-2 pipeline to run once the US-East-1 pipeline completes successfully.
The take-aways:
- Limit blast radius and impact to your users
- In a fast-moving environment, it’s tempting to push software globally all at once. Don’t do it! Take advantage of AWS regionality, and limit your blast radius. Deploy to one region at a time and ensure that region is properly functioning before moving to the next.
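The region-at-a-time approach can be sketched in a few lines of Python. This is only an illustration of the idea, not how Spinnaker implements it: `deploy_region_by_region`, the `deploy` callback, and the `is_healthy` callback are hypothetical names standing in for real deployment and health-check calls.

```python
from typing import Callable, List


def deploy_region_by_region(
    regions: List[str],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> List[str]:
    """Deploy to each region in order and stop at the first unhealthy one.

    Returns the regions that were deployed and verified healthy.
    """
    verified = []
    for region in regions:
        deploy(region)
        if not is_healthy(region):
            # Halt here so later regions never receive the bad build.
            break
        verified.append(region)
    return verified


# Hypothetical stand-ins: "deploying" just records the region,
# and us-west-2 is made to fail its health check.
deployed = []
result = deploy_region_by_region(
    ["us-east-1", "us-west-2", "eu-west-1"],
    deploy=deployed.append,
    is_healthy=lambda region: region != "us-west-2",
)
print(result)    # ['us-east-1']
print(deployed)  # ['us-east-1', 'us-west-2']
```

Because the loop stops at the failing region, eu-west-1 is never touched, which is the limited blast radius described above.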
Red/black deployments are your friend
Red/Black (also commonly referred to as “blue/green”) deployments take advantage of the key concept of immutable infrastructure and AWS elasticity, where the next version of a particular software package is stood up in a new Auto Scaling Group (ASG). Once health checks pass for this new ASG, traffic is routed to it and the previous ASG is disabled. The benefit of this style of deployment is that you can easily and rapidly roll back to the previous ASG in the event of a failure.
How to leverage Spinnaker to employ this best practice: In this screenshot, Spinnaker has Red/Blacked a new ASG into EU-West-1, and moved the previous to a disabled state.
In the event of an issue with an ASG, Spinnaker makes it extremely easy to roll back, either via automation or manually:
The take-away:
- Red/Black deployments, as opposed to rolling deployments (where you’re deploying over a previous version of your application — losing the ability to roll back to that previous version), offer the most rapid and reliable means to back out of a bad deployment.
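To make the mechanics concrete, here is a minimal in-memory sketch of a red/black cutover and rollback. The `ServerGroup` class and the `red_black_deploy` and `roll_back` functions are hypothetical and only model the state changes; they do not call AWS or Spinnaker.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ServerGroup:
    """Hypothetical stand-in for an Auto Scaling Group."""
    name: str
    enabled: bool = False   # is it receiving traffic?
    healthy: bool = True    # did its health checks pass?


def red_black_deploy(groups: List[ServerGroup], new_group: ServerGroup) -> bool:
    """Add new_group next to the existing ones and cut traffic over if it is healthy."""
    groups.append(new_group)
    if not new_group.healthy:
        return False  # leave traffic on the previous group
    for group in groups:
        # Older groups are disabled, not destroyed, so they stay available for rollback.
        group.enabled = group is new_group
    return True


def roll_back(current: ServerGroup, previous: ServerGroup) -> None:
    """Re-enable the previous group first, then disable the bad one."""
    previous.enabled = True
    current.enabled = False


v001 = ServerGroup("api-v001", enabled=True)
v002 = ServerGroup("api-v002")
groups = [v001]

red_black_deploy(groups, v002)
print([g.name for g in groups if g.enabled])  # ['api-v002']

roll_back(current=v002, previous=v001)
print([g.name for g in groups if g.enabled])  # ['api-v001']
```

The rollback is fast because the previous group still exists; nothing has to be rebuilt or redeployed.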
Use deployment windows
Netflix regional traffic is fairly cyclic. For the most part, people in any particular region tend to watch more Netflix during the evening hours. For instance, in the evenings when US-East is beginning to see peak traffic, US-West traffic is still largely in its trough (as most people on the west coast are at work). Accordingly, service teams at Netflix can take advantage of the trough in any particular region and deploy during that time.
How to leverage Spinnaker to employ this best practice: In this case, you can see that this particular pipeline in Spinnaker will only deploy to US-East from 10am to 2pm Pacific time.
The take-aways:
- Deploying during off hours limits how many people would be affected should a deployment go bad.
- Deployment windows can ensure automation occurs during working hours when maximum coverage is available in case of an issue.
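A deployment window boils down to a time-of-day check in the window's time zone. The sketch below assumes the 10am to 2pm Pacific window from the example; `in_deployment_window` is a hypothetical helper, and the fixed UTC-7 offset is a simplification that only holds while daylight saving time is in effect (a real check would use a proper time zone database, for example `zoneinfo.ZoneInfo("America/Los_Angeles")`).

```python
from datetime import datetime, time, timedelta, timezone

# Simplifying assumption: Pacific Daylight Time as a fixed UTC-7 offset.
PACIFIC = timezone(timedelta(hours=-7))
EASTERN = timezone(timedelta(hours=-4))

WINDOW_START = time(10, 0)  # 10am Pacific
WINDOW_END = time(14, 0)    # 2pm Pacific


def in_deployment_window(moment: datetime) -> bool:
    """Return True if a timezone-aware moment falls inside the window."""
    local = moment.astimezone(PACIFIC).time()
    return WINDOW_START <= local < WINDOW_END


noon_pacific = datetime(2017, 6, 16, 12, 0, tzinfo=PACIFIC)
evening_eastern = datetime(2017, 6, 16, 20, 0, tzinfo=EASTERN)  # 5pm Pacific

print(in_deployment_window(noon_pacific))     # True
print(in_deployment_window(evening_eastern))  # False
```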
Use Automated Canary Analysis as a last line of defense
Netflix has a sophisticated telemetry platform that allows them to compare two different versions of running software. They call this Automated Canary Analysis (ACA). With its ACA platform, Netflix can compare two different versions of software taking production traffic. It’s a great last line of defense to ensure things are working well before opening the flood gates.
How to leverage Spinnaker to employ this best practice: The open-source version of Spinnaker does not yet have ACA built in, but it’s a priority that the community is working to deliver. Spinnaker does, however, enable pipelines to be configured with canaries out of the box.
Below is a screenshot to illustrate how ACA works within Netflix, and the kind of functionality that will be coming to the open-source version of Spinnaker. The pipeline below shows a production push that includes an ACA step. In this case, the ACA for a newly deployed service was scored at 93. ACA scores range from 1 to 100, and Spinnaker allows the service owner to define the “go/no-go” threshold for the pipeline to continue running or stop automatically. If ACA reports a score below that threshold, that pipeline stage is considered a failure and the overall pipeline is halted. The service owner can also pre-define a failure path in Spinnaker to roll back a deployment, for instance.
Part of the value of having a “paved path” via Spinnaker is deep integration. In the case of ACA, the service owner can easily view a detailed ACA report should s/he want more information about why a particular ACA received its score:
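The go/no-go decision itself is a simple comparison. The following sketch assumes only what is described here (a score from 1 to 100 and an owner-defined threshold); `canary_gate` is a hypothetical function, and the threshold of 90 is a made-up value for illustration.

```python
def canary_gate(score: float, threshold: float) -> str:
    """Return 'continue' if the canary score meets the threshold, else 'halt'."""
    if not 1 <= score <= 100:
        raise ValueError(f"score out of range: {score}")
    # A score below the threshold fails the stage and halts the pipeline.
    return "continue" if score >= threshold else "halt"


# 93 is the score from the example; 90 is an assumed threshold.
print(canary_gate(93, threshold=90))  # continue
print(canary_gate(72, threshold=90))  # halt
```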
The take-aways:
- Automated Canary Analysis provides the most realistic testing possible
- The data combinatorics in a complex microservice architecture (like what Netflix has) make it cost prohibitive to test every possible path before releasing something to production. Consequently, ACA allows Netflix to verify things are working well in production with live traffic, and Spinnaker makes it easy to back things out should this last gate uncover issues.
Use pipelines for non-typical deployments
Invariably, there are one-off infrastructure management tasks that need to be done from time to time. These could be emergency fixes or even occasional updates to infrastructure, for example. It’s easy to overlook automating these tasks, and that’s bad, because manual tasks tend to create towers of knowledge. By codifying this automation in pipelines, anyone can run them with the benefit of consistency.
How to leverage Spinnaker to employ this best practice: With Spinnaker, service owners can create pipelines that can be used in an emergency situation to deploy to any region (via a parameter).
Here’s another pipeline in Spinnaker that can set a runtime environmental property, again based upon a parameter fed into this pipeline:
The take-aways:
- Automate all the things
- Automation is a no-brainer! Anytime a task is manually executed more than once, consider it an opportunity to create a pipeline. Otherwise, you risk creating towers of knowledge that become problems when those individuals aren’t around to execute those tasks.
No one reads email anymore
Email is a terrible way to get someone’s attention. This means that a company’s “code red” emails requiring urgent action can get inadvertently missed. When it comes to notifying people of important events or actions being required (such as deployments or manual stages) Netflix recommends using alternate channels, like Slack.
How to leverage Spinnaker to employ this best practice: In the screenshot below, a pipeline in Spinnaker uses email and Slack to notify relevant people when it starts, when it completes, and most importantly, if it fails. Notice in the event of a failure, this pipeline is specifically configured to notify a support channel in Slack.
Here’s a notification that requires a human to take action: in this case, to approve a deployment to a particular environment. Below that is a Slack notification that a deployment into production completed successfully.
The take-away:
- Email wasn’t ever intended to be a real-time communication mechanism. Consequently, if particular events require immediate action to be taken, use alternate channels like Slack or SMS.
Sometimes the humans are required
The machines haven’t completely replaced all of us! Sometimes the judgement of a human is needed in a deployment pipeline.
How to leverage Spinnaker to employ this best practice: Pipelines in Spinnaker can leverage a specific stage called “Manual Judgement” where a human must manually initiate a positive or negative acknowledgment before the pipeline continues.
The screenshot below shows a Manual Judgement stage following an Automated Canary Analysis stage. In the manual judgement stage, there’s an actual button for either stopping the pipeline, or continuing it. This particular pipeline is using this stage to pause the pipeline and let a human review it before it proceeds to a production deployment.
The take-aways:
- Humans understand nuance and have a gut
- What differentiates humans from machines is that we understand nuance and frankly, we have a gut. Netflix trusts its engineers to use that gut. Manual Judgement can be a powerful gate that makes sense in certain situations.
Don’t assume
Things change — entropy is a thing. Therefore, guard against it.
How to leverage Spinnaker to employ this best practice: The epitome of recognizing entropy and not assuming reliability is Netflix’s Chaos Monkey, a resiliency tool that helps applications tolerate random instance failures by randomly killing instances. Chaos Monkey is tightly integrated into Spinnaker but also into the ethos of Netflix’s culture that demands service reliability.
Another way to “not assume” is to only trigger pipelines when you know people will be in the office, ready to handle any issues. Don’t assume that by deploying at 9pm on Friday night, your team will be happy to get paged at 3am on Saturday to handle a failure scenario, because they most definitely won’t.
Spinnaker pipelines can be triggered using a cron expression — but note, in this case, the cron expression excludes the weekend. This particular pipeline is deploying into production, and this team wants to be sure someone is around in the event of a problem (and they want to enjoy their weekends!).
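As an illustration of a weekday-only schedule, the sketch below checks whether a given date is allowed by the day-of-week field of a cron expression. The expression itself, and its Quartz-style field order (seconds first, day-of-week sixth), are assumptions made for this example rather than something taken from the pipeline shown; `allowed_weekdays` and `fires_on` are hypothetical helpers that look only at the day, not the time.

```python
from datetime import datetime
from typing import Set

DAYS = ["MON", "TUE", "WED", "THU", "FRI", "SAT", "SUN"]  # Monday = 0, as in datetime.weekday()

# Assumed example: 10:00 on weekdays.
# Field order assumed: seconds minutes hours day-of-month month day-of-week
CRON = "0 0 10 ? * MON-FRI"


def allowed_weekdays(day_of_week_field: str) -> Set[int]:
    """Expand a field such as 'MON-FRI' or 'MON,WED' into weekday indexes.

    Wrap-around ranges such as 'FRI-MON' are not handled in this sketch.
    """
    allowed = set()
    for part in day_of_week_field.split(","):
        if "-" in part:
            start, end = (DAYS.index(day) for day in part.split("-"))
            allowed.update(range(start, end + 1))
        else:
            allowed.add(DAYS.index(part))
    return allowed


def fires_on(moment: datetime, cron: str = CRON) -> bool:
    """Return True if the cron's day-of-week field permits this date."""
    return moment.weekday() in allowed_weekdays(cron.split()[5])


print(fires_on(datetime(2017, 6, 16)))  # True  (a Friday)
print(fires_on(datetime(2017, 6, 17)))  # False (a Saturday)
```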
Here’s a pipeline taking advantage of a precondition stage within Spinnaker. The execution of a pipeline can take a long time, especially if it’s waiting on a deployment window. And in those scenarios, the underlying cloud infrastructure might have changed. For example, someone might have manually deployed a new ASG for that application. Consequently, in those cases, it’s prudent to use a precondition stage to ensure things are as they should be before taking some action.
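The idea behind such a check can be sketched as comparing what the pipeline expects against what is actually running. This is a hypothetical illustration, not Spinnaker's precondition stage: `check_precondition` and the server group names are invented for the example.

```python
from typing import Iterable


def check_precondition(expected: Iterable[str], current: Iterable[str]) -> None:
    """Raise if the running server groups differ from what the pipeline expects."""
    unexpected = sorted(set(current) - set(expected))
    missing = sorted(set(expected) - set(current))
    if unexpected or missing:
        raise RuntimeError(
            f"infrastructure drifted: unexpected={unexpected}, missing={missing}"
        )


# Snapshot taken when the pipeline started (hypothetical names).
expected_groups = ["api-v001"]

check_precondition(expected_groups, ["api-v001"])  # passes silently

try:
    # Someone manually deployed api-v002 while the pipeline was waiting.
    check_precondition(expected_groups, ["api-v001", "api-v002"])
except RuntimeError as error:
    print(error)  # infrastructure drifted: unexpected=['api-v002'], missing=[]
```

Failing loudly here is deliberate: stopping the pipeline is cheaper than acting on a stale picture of the infrastructure.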
Netflix believes strongly in the notion of immutable infrastructure. Consequently, when service owners create an AMI for deployment, they do it once and promote that same AMI through environments rather than creating one for each. This ensures consistency. In Spinnaker, you can use a “Find AMI” stage to pull an AMI from a test environment and push it forward into a production one.
Here’s another “Find AMI” before an ACA:
The take-aways:
- The only constant is change.
- Embrace it and construct pipelines accordingly. Otherwise, you run the risk that things will break when you least expect it — and usually in the worst possible situation, like in production during your peak hours.
Make it easy to escalate
Even if you follow every best practice here, failure will eventually happen. So you might as well make it as easy as possible to reduce the time to fix the issue.
How to leverage Spinnaker to employ this best practice: Spinnaker tightly integrates with PagerDuty. In the event of an issue with a particular application, Spinnaker makes it easy to page that application’s on-call person. In fact, a PagerDuty key can be required when defining an application.
Spinnaker’s tight integration with PagerDuty makes it super easy to link your application with a PagerDuty key:
And since manually entering a PagerDuty service key is error-prone, Spinnaker exposes the linkage between PagerDuty service keys and apps, allowing service owners to select the corresponding service name:
The take-aways:
- Enable rapid resolution
- Tight integration with PagerDuty reduces the time to fix an issue by making it easy to get ahold of the proper people.
Other Global Software Delivery best practices:
- Advocate guard rails, not gates: The cost to the organization of implementing gates as a culture (for example, Manual Judgements on every pipeline, requiring a human to OK it before it goes to production) is exceptionally high, especially in stifling innovation and lowering employee productivity and happiness. Accept that accidents will still happen. As such, use a standardized deployment platform like Spinnaker, and use it to make it easy to get ahold of accountable people who can solve the issue quickly.
- Blameless Postmortems: Providing guard rails to avoid major disasters means the organization must accept that failures will happen and that people will make mistakes. As such, conduct blameless postmortems to learn from mistakes and avoid repeating them. In these postmortems, there’s no finger pointing! Many of the best practices here regarding how to effectively move fast in a multi-region landscape have come out of these postmortems.
In summary, if you want to rapidly deliver software with confidence across multiple AWS regions:
- Avoid simultaneous regional deployments
- Use Red/Black deployments
- Use deployment windows, especially if you have cyclic traffic patterns
- Leverage automated testing as a part of a pipeline
- Automate one-off manual tasks
Want to see Andy share these best practices in person? Watch the video of Andy’s talk.