Global Software Delivery: Best Practices

Jun 16, 2017 by DROdio

Andy Glover recently gave a Spinnaker Deep Dive presentation at an LA DevOps Meetup at Tinder HQ. Here are some of the global software delivery best practices Andy shared in his presentation:

Favor regional deployments over global ones

Even though Spinnaker (the Continuous Delivery & Infrastructure platform open-sourced by Netflix) facilitates simultaneous regional deployments, they don’t recommend doing them.

In the event of a bad deployment — be it a software bug or infrastructure issue — keeping deployments focused to a region means your blast radius is limited (rather than it affecting the entire globe).

How to leverage Spinnaker to employ this best practice: The screenshot below shows a deployment pipeline in Spinnaker for a Netflix app called “api,” which is dedicated to deploying to US-East-1. What’s more, this particular pipeline also make use of an additional best practice (described below) by using a deployment window.

You can also see a separate pipeline for a different region, US-West-2.

NOTE: Spinnaker enables one pipeline to kick a second pipeline off — so in this example, an application developer could easily trigger the US-West-2 pipeline to run once the US-East-1 pipeline completes successfully.

The take-aways:

Red/black deployments are your friend

Red/Black (also commonly referred to as “blue/green”) deployments take advantage of the key concept of immutable infrastructure and AWS elasticity, where the next version of a particular software package is stood up in a new Auto Scaling Group (ASG). Once health checks pass for this new ASG, traffic is routed to it and the previous ASG is disabled. The benefit of this style of deployment is that you can easily and rapidly rollback to the previous ASG in the event of a failure.

How to leverage Spinnaker to employ this best practice: In this screenshot, Spinnaker has Red/Blacked a new ASG into EU-West-1, and moved the previous to a disabled state.

In the event of an issue with an ASG, it accordingly becomes extremely easy with Spinnaker to rollback either via automation or even manually:

The take-away:

Use deployment windows

Netflix regional traffic is fairly cyclic. For the most part, people in any particular region tend to watch more Netflix during the evening hours. For instance, in the evenings when US-East is beginning to see peak traffic, US-West traffic is still largely in its trough (as most people on the west coast are at work). Accordingly, service teams at Netflix can take advantage of the trough in any particular region and deploy during that time.

How to leverage Spinnaker to employ this best practice: In this case, you can see that this particular pipeline in Spinnaker will only deploy to US-East from 10am to 2pm Pacific time.

The take-aways:

Use Automated Canary Analysis as a last line of defense

Netflix has a sophisticated telemetry platform that allows them to compare two different versions of running software. They call this Automated Canary Analysis (ACA). With its ACA platform, Netflix can compare two different versions of software taking production traffic. It’s a great last line of defense to ensure things are working well before opening the flood gates.

How to leverage Spinnaker to employ this best practice: The open-source version of Spinnaker does not yet have ACA built in, but it’s a priority that the community is working to deliver. Spinnaker does, however, enable pipelines to be configured with canaries out of the box.

Below is a screenshot to illustrate how ACA works within Netflix, and the kind of functionality that will be coming to the open-source version of Spinnaker. The pipeline below shows a production push that includes an ACA step. In this case, the ACA for a newly deployed service was scored at 93. ACA scores are between 1 to 100, and Spinnaker allows the service owner to define the “go/no-go” threshold for the pipeline to continue running or stop automatically. If ACA reports a score below that threshold, that pipeline stage is considered a failure and the overall pipeline is halted. The service owner can also pre-define a failure path in Spinnaker to rollback a deployment, for instance.

Part of the value of having a “paved path” via Spinnaker is deep integration. In the case of ACA, the service owner can easily view a detailed ACA report should s/he want to get more information as to why a particular ACA was scored:

The take-aways:

Use pipelines for non-typical deployments

Invariably, there are one-off infrastructure management tasks that need to be done from time-to-time. These could be emergency fixes or even occasional updates to infrastructure, for example. It’s easy to overlook automating these tasks — and that’s bad, because manual tasks tend to create towers of knowledge. By codifying this automation in pipelines, anyone can run them with the benefit of consistency.

How to leverage Spinnaker to employ this best practice: With Spinnaker, service owners can create pipelines that can be used in an emergency situation to deploy to any region (via a parameter).

Here’s another pipeline in Spinnaker that can set a runtime environmental property, again based upon a parameter fed into this pipeline:

The take-aways:

No one reads email anymore

Email is a terrible way to get someone’s attention. This means that a company’s “code red” emails requiring urgent action can get inadvertently missed. When it comes to notifying people of important events or actions being required (such as deployments or manual stages) Netflix recommends using alternate channels, like Slack.

How to leverage Spinnaker to employ this best practice: In the screenshot below, a pipeline in Spinnaker uses email and Slack to notify relevant people when it starts, when it completes, and most importantly, if it fails. Notice in the event of a failure, this pipeline is specifically configured to notify a support channel in Slack.

Here’s a notification that requires a human to take action — In this case, to approve a deployment to a particular environment. Below that is a Slack notification that a deployment into production completed successfully.

The take-away:

Sometimes the humans are required

The machines haven’t completely replaced all of us! Sometimes the judgement of a human is needed in a deployment pipeline.

How to leverage Spinnaker to employ this best practice: Pipelines in Spinnaker can leverage a specific stage called “Manual Judgement” where a human must manually initiate a positive or negative acknowledgment before the pipeline continues.

The screenshot below shows a Manual Judgement stage following an Automated Canary Analysis stage. In the manual judgement stage, there’s an actual button for either stopping the pipeline, or continuing it. This particular pipeline is using this stage to pause the pipeline and let a human review it before it proceeds to a production deployment.

The take-aways:

Don’t assume

Things change — entropy is a thing. Therefore, guard against it.

How to leverage Spinnaker to employ this best practice: The epitome of recognizing entropy and not assuming reliability is Netflix’s Chaos Monkey, a resiliency tool that helps applications tolerate random instance failures by randomly killing instances. Chaos Monkey is tightly integrated into Spinnaker but also into the ethos of Netflix’s culture that demands service reliability.

Another way to “not assume” is to only trigger pipelines when you know people will be in the office, ready to handle any issues. Don’t assume that by deploying at 9pm on Friday night, your team will be happy to get paged at 3am in the morning on Saturday to handle a failure scenario — because they most definitely won’t.

Spinnaker pipelines can be triggered using a cron expression — but note, in this case, the cron expression excludes the weekend. This particular pipeline is deploying into production, and this team wants to be sure someone is around in the event of a problem (and they want to enjoy their weekends!).

Here’s a pipeline taking advantage of a precondition stage within Spinnaker. The execution of a pipeline can take a long time, especially if it’s waiting on a deployment window. And in those scenarios, the underlying cloud infrastructure might have changed. For example, someone might have manually deployed a new ASG for that application. Consequently, in those cases, it’s prudent to use a precondition stage to ensure things are as they should be before taking some action.

Netflix believes strongly in the notion of immutable infrastructure. Consequently, when service owners create an AMI for deployment, they do it once and promote that same AMI through environments rather than creating one for each. This ensures consistency. Consequently, in Spinnaker you can use a “Find AMI” stage to pull an AMI from a test environment and push it forward into a production one.

Here’s another “Find AMI” before an ACA:

The take-aways:

Make it easy to escalate

Even if you follow every best practice here, failure will eventually happen. So you might as well make it as easy as possible to reduce the time to fix the issue.

How to leverage Spinnaker to employ this best practice: Spinnaker tightly integrates with PagerDuty. In the event of an issue with a particular application, Spinnaker makes it it easy to page that application’s on-call person. In fact, a PagerDuty key can be required when defining an application.

Spinnaker’s tight integration with PageDuty makes it super easy to link your application with a PagerDuty key:

And since manually entering in a PagerDuty Service key is error prone, Spinnaker exposes the linkage between PagerDuty service keys and apps, allowing service owners to select the corresponding service name:

The take-aways:

Other Global Software Delivery best practices:

In summary, if you want to rapidly deliver software with confidence across multiple AWS regions:

Want to see Andy share these best-practices in person? Watch the video of Andy’s talk.

Recently Published Posts

Introducing Pipelines-as-Code Plugin for Open Source Spinnaker

Jul 21, 2023

Easily Scale and Automate with Version Control in Git Developers choose best-of-breed version control systems like GitHub for a reason: they need the ability to collaborate and improve code together.  But a broken Spinnaker deployment pipeline can often be the last thing standing in the way of getting your application to market.  Until now. Armory’s […]

Read more

What is FedRAMP and Why It Matters

Jun 8, 2023

What’s FedRAMP? Federal Risk and Authorization Management Program (FedRAMP) is a government-wide program that provides a standardized approach to security assessment, authorization, and continuous monitoring for cloud products and services. FedRAMP is important since it’s the gold standard for assessing cloud service providers (CSP) within the government. Under this program, authorized FedRAMP cloud service providers […]

Read more

New Spinnaker Operator Updates Now available for the Spinnaker Community

Mar 15, 2023

Stay up-to-date with the latest Kubernetes release with Spinnaker. The Armory crew has worked diligently the past several weeks to release a new stable version of OSS Operator (1.3.0). This is the first release in just over 18 months and is now available for the open source community.  What Changed? The Spinnaker Operator is the […]

Read more