Deep Dive into Clouddriver

Jun 20, 2019 by Nicolas Cohen

A lot of the questions we get from customers are really about Clouddriver: how to scale it, and how to diagnose errors or performance issues. We’re sharing an overview of the service (no wading through source code, I promise) and some tips for operating Clouddriver at scale, in the hope that it will help the Spinnaker community. This is the first in a series of posts on Clouddriver.

What is Clouddriver used for?

When deploying your app, Clouddriver will create server groups, change load balancers, and inform the rest of the services of what’s out there. It is the service that discovers the state of the world and changes it.

Clouddriver works by polling your cloud infrastructure at a regular interval and storing the results in a shared cache (more on that later).

It is used by the following services:

Clouddriver itself initiates communication with:

How Clouddriver works

Clouddriver defines cloud providers (such as AWS, Azure, GCP, CloudFoundry, Oracle, DC/OS, Kubernetes, Docker). Each provider can have accounts (such as a Kubernetes cluster or an AWS account).

There are two main functional areas in Clouddriver: caching and mutating operations.

Caching in detail

Caching agents query your cloud infrastructure for resources and store the results in a cache store. Each provider has its own set of caching agents that are instantiated per account and sometimes per region. Each caching agent is specialized in one type of resource such as server groups, load balancers, security groups, instances, etc.

In reality, the number of caching agents varies greatly between providers and with your Clouddriver configuration.

For instance, AWS might have between 16 and 20 agents per region, performing tasks such as caching the status of IAM roles, instances, and VPCs as well as some agents operating globally for tasks such as cleaning up detached instances. And Kubernetes (v2) might have a few agents per cluster, caching things like custom resources and Kubernetes manifests. We’ll go over some of these specifics in a later post.
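As a rough mental model, a caching agent boils down to “poll one resource type for one account/region and write the result to the cache.” Here is a minimal sketch of that idea in Java; the interfaces below are simplified stand-ins invented for illustration, not Clouddriver’s actual CachingAgent or Cache classes:

import java.util.Map;

interface Cache {
    // Store all resources of one type (e.g. "serverGroups") for one account/region.
    void putAll(String type, Map<String, Object> resourcesById);
}

interface CachingAgent {
    String type();               // the resource type this agent is specialized in
    void loadData(Cache cache);  // query the cloud provider and cache the result
}

interface CloudApi {
    // Hypothetical SDK wrapper for whatever provider the agent talks to.
    Map<String, Object> listServerGroups(String account, String region);
}

// One instance of this agent exists per account/region pair.
class ServerGroupCachingAgent implements CachingAgent {
    private final String account;
    private final String region;
    private final CloudApi api;

    ServerGroupCachingAgent(String account, String region, CloudApi api) {
        this.account = account;
        this.region = region;
        this.api = api;
    }

    @Override
    public String type() { return "serverGroups"; }

    @Override
    public void loadData(Cache cache) {
        // One call per polling cycle: ask the provider what exists right now...
        Map<String, Object> serverGroups = api.listServerGroups(account, region);
        // ...and overwrite the cached view for this account and region.
        cache.putAll(type(), serverGroups);
    }
}

The real agents are considerably richer (relationships between resources, eviction of entries that disappeared, metrics), but the shape is the same: a specialized, per-account/region poller writing into a shared cache.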

The cache store is where Clouddriver stores cloud resources. It comes in different flavors: Redis, SQL, and an in-memory store.

All these stores – with the exception of the in-memory store – work across multiple Clouddriver instances.

The agent scheduler is in charge of running caching agents at regular intervals across all Clouddriver instances. There are 5 types of schedulers:

Note that the cache store does not dictate the type of agent scheduler. For instance, you could use the SQL cache store along with the Redis-backed scheduler.

If you read Clouddriver source code, you’ll see references to cats (aka Cache All The Stuff), which is the framework that manages agent scheduler + agents + cache store.

Putting it all together

Now that we have all the primitives, the startup sequence should be intuitive: Clouddriver inspects its configuration and instantiates the cache store and the agent scheduler. For each provider enabled, agents are instantiated per account/region and added to the scheduler.

When the scheduler runs:

- it determines which agents are due based on each agent’s polling interval (clustered schedulers also acquire a per-agent lock, so each agent runs on only one Clouddriver instance at a time);
- each due agent queries the cloud provider for its resource type;
- the results are written to the cache store, replacing the previous view for that account and region.

The sketch below puts these pieces together.
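Here is a simplified, single-instance version of that startup and scheduling flow, reusing the hypothetical Cache, CachingAgent, ServerGroupCachingAgent, and CloudApi types from the earlier sketch. A real clustered scheduler would also coordinate agent runs across replicas (for example with Redis locks); this sketch just runs every registered agent on a timer:

import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

class AgentScheduler {
    private final ScheduledExecutorService executor = Executors.newScheduledThreadPool(4);
    private final Cache cache;

    AgentScheduler(Cache cache) { this.cache = cache; }

    void schedule(CachingAgent agent, long intervalSeconds) {
        // Each agent gets its own fixed-rate polling cycle.
        executor.scheduleAtFixedRate(() -> {
            try {
                agent.loadData(cache);  // query the provider and refresh the cache
            } catch (Exception e) {
                // A failed cycle is simply retried on the next tick.
                System.err.println("agent " + agent.type() + " failed: " + e.getMessage());
            }
        }, 0, intervalSeconds, TimeUnit.SECONDS);
    }
}

class ClouddriverStartup {
    // Mirrors the startup sequence: build the cache store and scheduler,
    // then instantiate and register agents for every enabled account/region.
    static void start(Cache cacheStore, CloudApi api, List<String> accounts, List<String> regions) {
        AgentScheduler scheduler = new AgentScheduler(cacheStore);
        for (String account : accounts) {
            for (String region : regions) {
                scheduler.schedule(new ServerGroupCachingAgent(account, region, api), 60);
            }
        }
    }
}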

Operations in detail

Clouddriver has the concept of atomic operations, each representing a single unit of work. Spinnaker pipeline tasks trigger these operations to mutate cloud resources.

There are more than 200 atomic operations available in Clouddriver, such as creating a server group, terminating EC2 instances, or deploying Kubernetes manifests.

Operation statuses are saved in a task repository, which can be backed by Redis, SQL, an in-memory store, or a “dual” repository used to migrate seamlessly from one store to another.
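The “dual” repository is essentially a small adapter that writes to the new store while still being able to read tasks from the old one. A rough sketch of the idea, using hypothetical interfaces rather than Clouddriver’s actual task repository classes:

import java.util.Optional;

interface TaskRepository {
    void save(String taskId, String taskStateJson);
    Optional<String> find(String taskId);
}

class DualTaskRepository implements TaskRepository {
    private final TaskRepository primary;   // the store being migrated to, e.g. SQL
    private final TaskRepository previous;  // the store being migrated from, e.g. Redis

    DualTaskRepository(TaskRepository primary, TaskRepository previous) {
        this.primary = primary;
        this.previous = previous;
    }

    @Override
    public void save(String taskId, String taskStateJson) {
        // New task state only ever lands in the primary store.
        primary.save(taskId, taskStateJson);
    }

    @Override
    public Optional<String> find(String taskId) {
        // Tasks started before the cutover are still readable from the old store.
        Optional<String> fromPrimary = primary.find(taskId);
        return fromPrimary.isPresent() ? fromPrimary : previous.find(taskId);
    }
}

Once all in-flight tasks written to the old store have expired, the previous repository can be dropped and the primary used on its own.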

Operation Execution

Note that atomic operations that are sent together are executed immediately, one after the other, on the same thread.

Atomic operations vary greatly in their complexity. They generally try to be atomic, but not always (e.g. deploying multiple Kubernetes manifests). We won’t cover atomic operation implementations here, but if you’re interested, check out Clouddriver’s code.
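To make that execution model concrete, here is a sketch of an orchestration loop: one task, one thread, each operation executed in turn while the task’s status history grows. As before, these are hypothetical types; Clouddriver’s real orchestration adds authorization checks, error handling, and result objects:

import java.util.List;

interface AtomicOperation {
    String description();     // e.g. "DisableAsgAtomicOperation"
    void operate(Task task);  // perform the mutation, reporting progress as it goes
}

interface Task {
    void updateStatus(String phase, String status);  // appends an entry to the history
    void complete();
    void fail(Exception e);
}

class Orchestrator {
    // Operations submitted together are processed sequentially on the calling thread.
    static void orchestrate(List<AtomicOperation> operations, Task task) {
        task.updateStatus("ORCHESTRATION", "Initializing Orchestration Task...");
        try {
            for (AtomicOperation op : operations) {
                task.updateStatus("ORCHESTRATION", "Processing op: " + op.description());
                op.operate(task);
            }
            task.complete();
        } catch (Exception e) {
            task.fail(e);
        }
    }
}

The status updates written by a loop like this are exactly what surfaces in the kato.tasks history shown in the next section.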

Looking into Clouddriver tasks

From a user perspective, Clouddriver tasks are not very visible. You can, however, spot these tasks in the source link of a stage:

Each stage and its tasks will contain the history of Clouddriver executions under the kato.tasks key:

"context": {  ...
  "kato.last.task.id": {
    "id": "1f24bc99-2b96-451a-807e-0c459fed12eb"
  },
  "kato.task.firstNotFoundRetry": -1,
  "kato.task.notFoundRetryCount": 0,
  "kato.tasks": [{
    "history": [{
      "phase": "ORCHESTRATION",
      "status": "Initializing Orchestration Task..."
     }, {
      "phase": "ORCHESTRATION",
      "status": "Processing op: DisableAsgAtomicOperation"
     }, {
       "phase": "DISABLE_ASG",
       "status": "Initializing Disable ASG operation for [us-west-2:deploy-preprod-v015]..."
     },
     ...],
     "id": "1f24bc99-2b96-451a-807e-0c459fed12eb",
     "resultObjects": [],
     "status": {
        "completed": true,
        "failed": false
     }
  }],
  ...

The history lists each task’s successive status changes as well as any output. It’s quite useful for understanding what Spinnaker is actually doing under the hood and for troubleshooting potential issues.
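If you find yourself digging through many stages, pulling that history out programmatically is straightforward. Here is a small sketch using Jackson, assuming you pass it the stage’s context object as JSON (the kato.tasks, history, phase, and status fields are the ones shown above; everything else is plumbing):

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

class KatoHistoryPrinter {
    // Prints lines such as:
    //   [ORCHESTRATION] Processing op: DisableAsgAtomicOperation
    static void printHistory(String stageContextJson) throws Exception {
        JsonNode context = new ObjectMapper().readTree(stageContextJson);
        for (JsonNode katoTask : context.path("kato.tasks")) {
            System.out.println("task " + katoTask.path("id").asText()
                    + " completed=" + katoTask.path("status").path("completed").asBoolean()
                    + " failed=" + katoTask.path("status").path("failed").asBoolean());
            for (JsonNode entry : katoTask.path("history")) {
                System.out.printf("  [%s] %s%n",
                        entry.path("phase").asText(), entry.path("status").asText());
            }
        }
    }
}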

On-demand caching agents

We now have the main pieces of the puzzle: caching agents that discover what’s deployed, a cache store that holds that state, and atomic operations that change it.

However, most cloud mutating operations are not synchronous. For instance, when Clouddriver sends a request to AWS to launch a new EC2 instance, the API call returns successfully, but the instance takes a while to become ready. Even in Kubernetes, a manifest is accepted immediately, but it can take a few seconds before the resource is considered ready. This is where Spinnaker uses on-demand caching agents.

On-demand caching agents are – as their name implies – created on demand by the client (Orca) in tasks such as Force Cache Refresh or Wait for Up Instances. They are used to ensure cache freshness and to know when a resource has been created or effectively deleted.

The main gotcha is that when using a cache store that works across multiple Clouddrivers (like Redis), Clouddriver will wait for the next regular caching agent of the same type to run before declaring the cache consistent. It gives the cache store one more chance to replicate its state (to other replicas in the case of Redis).
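To make the pattern a bit more concrete, here is a sketch of what an on-demand agent adds on top of a regular caching agent: an entry point that re-caches one specific resource immediately instead of waiting for the next polling cycle. It reuses the hypothetical Cache interface from the earlier sketch; the remaining types are also invented for illustration and do not match Clouddriver’s actual on-demand agent API:

import java.util.Map;

interface SingleResourceApi {
    // Hypothetical single-resource lookup against the provider.
    Object getServerGroup(String account, String region, String name);
}

interface OnDemandAgent {
    boolean handles(String type);                         // e.g. "ServerGroup"
    void handle(Cache cache, Map<String, String> data);   // refresh one resource now
}

class ServerGroupOnDemandAgent implements OnDemandAgent {
    private final String account;
    private final String region;
    private final SingleResourceApi api;

    ServerGroupOnDemandAgent(String account, String region, SingleResourceApi api) {
        this.account = account;
        this.region = region;
        this.api = api;
    }

    @Override
    public boolean handles(String type) { return "ServerGroup".equals(type); }

    @Override
    public void handle(Cache cache, Map<String, String> data) {
        String name = data.get("serverGroupName");
        // Fetch just this one resource and overwrite its cache entry right away,
        // rather than waiting for the next full polling cycle of the regular agent.
        Object current = api.getServerGroup(account, region, name);
        cache.putAll("serverGroups", Map.of(name, current));
    }
}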

Other operations

Clouddriver handles a couple more important functions that aren’t described above:

And voilà! We’re now equipped to understand potential bottlenecks and troubleshoot issues. We’ll cover that in the next post.
