Kubernetes Cluster Autoscaler in Action


Effective cost savings with Kubernetes Cluster Autoscaler

Are you running your Kubernetes clusters in production? Great. How many nodes are they running on? Personally, I have been running a Kubernetes cluster hosting a simple self-service application, initially with ~40 worker nodes.

We determined this was necessary due to the higher loads we’d get during peak hours, but we noticed most worker nodes were left idle during lower load periods like nights and weekends, thus wasting our budget.

Enter the Kubernetes Cluster Autoscaler. Thanks to the autoscaler, we managed to reduce our resource usage by ~50 percent, all while keeping the application performant and responsive.

The Kubernetes Cluster Autoscaler automatically resizes the number of worker nodes in a cluster based on the demands of your workloads. You don’t need to manually add or remove nodes or over-provision your cluster. Instead, you specify a minimum and maximum size for each node group, and the Cluster Autoscaler handles the scaling on its own.
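For instance, one documented way to express these bounds is the autoscaler’s --nodes flag, which takes a min:max:name triple per node group (the group name below is a hypothetical placeholder, not one from this article’s setup):

# Illustrative only: allow this node group to scale between 1 and 10 nodes
--nodes=1:10:my-worker-instance-group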

Cluster Autoscaler in Action

What is the entire story all about? (TLDR)

  1. Getting to know about Cluster Autoscaler.
  2. Implementing the cluster autoscaler on a Kubernetes Cluster on GCE.

Prerequisites

  1. A GCP account (you can get a free-tier account with $300 of free credits).
  2. The Helm binary installed on your machine.

Story Resources

  1. GitHub Link: https://github.com/pavan-kumar-99/medium-manifests
  2. GitHub Branch: cluster-autoscaler-gce

Cluster Autoscaler

1) Scaling Up

Let us first understand how the Cluster Autoscaler works when scaling nodes up. The CA (Cluster Autoscaler) checks for unschedulable pods every 10 seconds. If there are any pods in the unschedulable list, the Cluster Autoscaler tries to find a new place to run them. The CA assumes that the underlying nodes are part of autoscaling groups and tries to scale those groups accordingly. It may take some time before the newly created nodes appear in Kubernetes; this depends almost entirely on the cloud provider and the speed of node provisioning.
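A quick way to see whether there are pods the scheduler could not place (the condition that triggers a scale-up) is to list pods stuck in the Pending phase:

# List all pods the scheduler has not been able to place yet
$ kubectl get pods --all-namespaces --field-selector=status.phase=Pending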

2) Scaling Down

The Cluster Autoscaler determines which nodes are unneeded using the following checks:

a) The sum of the CPU and memory requests of all pods running on the node is smaller than 50% of the node’s allocatable resources.

b) No pod scheduled on the node blocks eviction. A node will not be scaled down if it runs any of the following:

  • Pods with the annotation "cluster-autoscaler.kubernetes.io/safe-to-evict": "false" (these cannot be evicted).
  • Pods using local storage.
  • Pods that are not backed by a controller object.
  • kube-system pods that don’t have a PodDisruptionBudget (PDB) set.
  • Pods with restrictive PDBs (for example, a PDB requiring 100% of replicas to be available).

c) Kubernetes worker nodes annotated with the following annotation are excluded from scale-down:

"cluster-autoscaler.kubernetes.io/scale-down-disabled": "true"

The CA checks all the aforementioned conditions and terminates a node only if it has been unneeded for more than 10 minutes.
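To make these controls concrete, here is a minimal sketch of how the two annotations mentioned above are applied. The pod name is hypothetical; the node name is taken from the cluster we build later in this article.

# A pod the autoscaler must never evict (the annotation goes in the pod metadata)
apiVersion: v1
kind: Pod
metadata:
  name: critical-worker            # hypothetical name
  annotations:
    cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
spec:
  containers:
    - name: app
      image: nginx

# Exclude a specific node from scale-down
$ kubectl annotate node nodes-us-central1-a-q9ns \
    cluster-autoscaler.kubernetes.io/scale-down-disabled=true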

Expanders

The scaling logic should be clear by now. But what if you want to apply some logic before scaling an instance group? For example, when multiple instance groups are attached, you might want to scale nodes from a particular autoscaling group, or pick the node group that costs the least while its machines still match the cluster size. This is where expanders come into the picture. Expanders provide different strategies for selecting the node group to which new nodes will be added. As of now, the CA has 5 types of expanders:

  1. random: This is the default expander. It selects a node group at random.
  2. most-pods: The CA selects the node group that would schedule the most pending pods when scaling up.
  3. least-waste: The CA selects the node group that will have the least idle CPU after scale-up.
  4. price: The CA selects the node group that will cost the least and, at the same time, whose machines would match the cluster size (currently works for GCE and GKE).
  5. priority: The CA selects the node group based on a custom priority given by the user. You can learn how to configure custom priorities here, and a rough sketch follows this list.
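As a sketch of the priority expander, the priorities are supplied through a ConfigMap named cluster-autoscaler-priority-expander in the kube-system namespace, mapping a numeric priority to a list of node-group name regular expressions (higher values win; the name patterns below are hypothetical):

apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-autoscaler-priority-expander
  namespace: kube-system
data:
  priorities: |-
    10:
      - .*on-demand.*     # hypothetical pattern: fall back to on-demand groups
    50:
      - .*spot.*          # hypothetical pattern: prefer spot/preemptible groups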

Custom Parameters

Apart from running the Cluster Autoscaler with its default values, one can also override them explicitly. For example, you can configure how long a node should be unneeded before it is eligible for scale-down, how long after a scale-up the scale-down evaluation resumes, and so on. All of the CA’s parameters can be found here.

Parameters to the Cluster Autoscaler
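As a small illustration of such overrides, these knobs end up as command-line flags on the cluster-autoscaler container. The values below are examples only, not recommendations:

# Illustrative cluster-autoscaler flags
--scale-down-unneeded-time=10m     # how long a node must be unneeded before scale-down
--scale-down-delay-after-add=10m   # how long after a scale-up before scale-down evaluation resumes
--scan-interval=10s                # how often the cluster state is re-evaluated
--expander=least-waste             # which expander strategy to use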

Well, enough talk; let’s get into action. As part of this article, I have created the cluster using kOps on GCE. As of now, the CA supports the following Kubernetes cluster installation methods:

  1. GCE (installation of a Kubernetes cluster on GCE nodes)
  2. AWS
  3. Azure AKS
  4. OpenStack Magnum

The Cluster Autoscaler can be installed using the Helm Chart here.

Before we install the autoscaler, let us first create a Kubernetes cluster using kOps. There are many other ways to create a Kubernetes cluster; however, for the scope of this article I am using kOps. Let us also install the Prometheus Operator with Grafana (here) so that we can visualize the scaling in a nice dashboard. Once the Prometheus stack is installed, import this dashboard into Grafana and you should find it already showing data.
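If you want to follow along, a minimal sketch of installing the kube-prometheus-stack chart looks like this (the release name and namespace are arbitrary choices here):

$ helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
$ helm repo update
$ helm install kube-prometheus prometheus-community/kube-prometheus-stack \
    --namespace monitoring --create-namespace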

Create Cluster using Kops
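The embedded snippet above creates the cluster; a rough equivalent is sketched below. The state bucket, GCP project, and machine sizes are hypothetical placeholders, so adjust them to your environment.

# Rough sketch of creating the cluster with kOps on GCE
$ export KOPS_STATE_STORE=gs://medium-kops-state-store/   # hypothetical bucket
$ kops create cluster medium.k8s.local \
    --cloud gce \
    --zones us-central1-a \
    --project my-gcp-project \
    --node-count 1 \
    --node-size e2-standard-2
$ kops update cluster medium.k8s.local --yes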

Once the cluster is ready, let us configure our CA Helm chart with the instance group prefix. We will go to the GCP console and fetch the prefix of the instance group.

GCP Cloud Console

In my case the name of the worker node instance group is a-nodes-us-central1-a-medium-k8s-local. Here is my values.yaml file (a condensed sketch of it follows the list below). Let us explore the options present there.

  1. autoDiscovery.clusterName: The name of your Kubernetes cluster.
  2. cloudProvider: The name of the cloud provider. The value should be either gce, aws, azure, or magnum.
  3. autoscalingGroupsnamePrefix.name: Prefix of the worker node instance group.
    minSize: The minimum number of nodes for the K8s cluster.
    maxSize: The maximum number of nodes that the cluster can scale to.
  4. extraArgs: Additional container arguments.
    skip-nodes-with-local-storage=false: This is only for development purposes and not recommended for production environments. With this set to false, a pod using local storage no longer prevents its node from being scaled down; the pod is rescheduled onto another node. The default value is true.
    scale-down-delay-after-add=2m: How long the CA waits after adding a node before it resumes evaluating nodes for scale-down. This is set low here to speed up scaling during development; for production, consider a higher value.
    scale-down-unneeded-time=2m: How long a node must be marked unneeded before it is scaled down. This is also set low here for development purposes; for production, consider a higher value.
  5. serviceMonitor: The ServiceMonitor has a label selector to select Services and their underlying Endpoint objects. It describes the set of targets to be monitored by Prometheus.
  6. prometheusRule: The Prometheus rules to be created in Kubernetes.
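Putting those options together, the as.yaml values file used below looks roughly like this. The min/max sizes are my own illustrative choices; check the chart’s values.yaml for the authoritative keys.

# as.yaml - rough sketch of the values passed to the cluster-autoscaler chart
autoDiscovery:
  clusterName: medium.k8s.local

cloudProvider: gce

autoscalingGroupsnamePrefix:
  - name: a-nodes-us-central1-a-medium-k8s-local   # prefix of the worker instance group
    minSize: 1
    maxSize: 10                                    # illustrative upper bound

extraArgs:
  skip-nodes-with-local-storage: false
  scale-down-delay-after-add: 2m
  scale-down-unneeded-time: 2m

serviceMonitor:
  enabled: true

prometheusRule:
  enabled: true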

Alright, let us now install the Cluster Autoscaler helm chart.

$ git clone https://github.com/pavan-kumar-99/medium-manifests.git \
-b cluster-autoscaler-gce
$ helm repo add autoscaler https://kubernetes.github.io/autoscaler
$ helm upgrade -i medium-ca-gce autoscaler/cluster-autoscaler -f as.yaml

From the logs, you should see that the Managed Instance group should be automatically discovered.

kubectl logs medium-ca-gce-cluster-autoscaler-568f44fc67-675dx

Highlighted the auto-discovery logs

kubectl get no | awk '{print $1}'

NAME

master-us-central1-a-l918

nodes-us-central1-a-q9ns

This is our cluster topology now, with 1 master node and 1 worker node. Let us create a simple nginx deployment and scale it to 40 replicas.

kubectl create deploy nginx --image=nginx

kubectl scale deploy nginx --replicas=40

As my pods go into the Pending state, the CA calculates the number of nodes needed and triggers the provisioning of new nodes.

Pods being scaled up

After a minute or so, you should find the new nodes added to the cluster, and all the pods are now in the Running state.

kubectl get no

Let us now check the dashboard in Grafana.

Cluster Autoscaler dashboard

Now let us scale our nginx deployment down to 1 replica.

kubectl scale deploy nginx --replicas=1

After a couple of minutes, the CA identifies the unneeded nodes, reschedules their pods onto other nodes, and terminates the unneeded nodes gracefully.

Pods being rescheduled to another node

Once a node is considered unneeded by the CA, the CA adds the following taints to it so that no new pods are scheduled on it.

ToBeDeletedByClusterAutoscaler=1620921499:NoSchedule

DeletionCandidateOfClusterAutoscaler=1620921376:PreferNoSchedule
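You can verify this yourself by inspecting the taints on a node being drained (the node name comes from the earlier output):

# Show the taints the autoscaler has placed on the node
$ kubectl describe node nodes-us-central1-a-q9ns | grep -A2 Taints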

If any PDBs apply to the pods on the node, the eviction happens only after the PDB constraints are satisfied.

Nodes being scaled down
Grafana dashboard showing the scaling activity

And now, once all the activity is completed by the CA, we are back to the original count of master and worker nodes in the cluster.

kubectl get no

Clean Up

Well, let’s delete the whole cluster now.

kops delete cluster medium.k8s.local --yes

Conclusion

With this, we have seen how the Cluster Autoscaler works and how it helps reduce cost by terminating unneeded resources. In an upcoming article, I will cover how to use the Cluster Autoscaler with Spot Instances (preemptible nodes). Feel free to reach out to me with any new ideas or questions, and feel free to share your thoughts in the comments section.

Until next time…
