The Horizontal Pod Autoscaler (HPA) is a fundamental feature of Kubernetes. It enables automatic scale-up and scale-down of containerized applications based on CPU usage, memory usage, or custom metrics.

Traditionally, when scaling software, we first think of vertical scaling: increasing the CPU and RAM so the application consuming them can perform better. While this seems like a flawless mechanism on paper, it actually comes with many drawbacks.

First, upgrading the CPU or RAM of a physical machine (or VM) requires downtime, and unless a Pod Disruption Budget (PDB) is in place to handle the disruption, all Pods will be evicted and recreated on the resized node.

Node resource usage is also not optimized: scaling vertically requires sufficient resources to be available on a single node, while horizontal scaling can distribute the same amount of resources across multiple nodes.

Additionally, vertical scaling is not as resilient as horizontal scaling: fewer replicas mean a higher risk of disruption in case of node failure.

Finally, past a certain threshold, scaling only vertically becomes very expensive and, most importantly, isn’t limitless: there is only so much CPU and RAM a single physical machine (or VM) can handle.
This is where horizontal scaling comes into play!

At some point, it simply becomes more efficient to duplicate an instance than to keep increasing its resources.

🎬 Hi there, I’m Jean!

In this two-part series, we’re going to explore several ways to scale services horizontally in Kubernetes, and the first one is…
🥁
… using Metrics Server! 🎊

Requirements


Before we start, make sure you have the following tools installed:

  • kind
  • kubectl
  • helm
  • k6

You will also need Docker, since Kind runs its Kubernetes nodes as Docker containers.

Note: for macOS users, or Linux users with Homebrew, simply run:
brew install kind kubectl helm k6

All set? Let’s go! 🏁

Creating Kind Cluster


Kind is a tool for running local Kubernetes clusters using Docker container “nodes”. It was primarily designed for testing Kubernetes itself, but may be used for local development or CI.

I don’t expect you to have a demo project at hand, so I built one for you.

git clone https://github.com/jhandguy/horizontal-pod-autoscaler.git
cd horizontal-pod-autoscaler
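
The cluster is created from the kind/cluster.yaml file included in the repository. I won’t reproduce its exact content here, but a typical Kind config for this kind of setup (a single control-plane node, with port mappings so the NGINX Ingress Controller can later be reached from the host) looks roughly like the sketch below; the label and ports are assumptions, not necessarily the repository’s actual values.

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
    # Label the node so the ingress controller can be scheduled onto it (assumed convention)
    kubeadmConfigPatches:
      - |
        kind: InitConfiguration
        nodeRegistration:
          kubeletExtraArgs:
            node-labels: "ingress-ready=true"
    # Map HTTP/HTTPS ports to the host so requests can reach the ingress controller (assumed ports)
    extraPortMappings:
      - containerPort: 80
        hostPort: 80
        protocol: TCP
      - containerPort: 443
        hostPort: 443
        protocol: TCP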

Alright, let’s spin up our Kind cluster! 🚀

➜ kind create cluster --image kindest/node:v1.27.3 --config=kind/cluster.yaml
Creating cluster "kind" ...
 ✓ Ensuring node image (kindest/node:v1.27.3) 🖼
 ✓ Preparing nodes 📦
 ✓ Writing configuration 📜
 ✓ Starting control-plane 🕹️
 ✓ Installing CNI 🔌
 ✓ Installing StorageClass 💾
Set kubectl context to "kind-kind"
You can now use your cluster with:

kubectl cluster-info --context kind-kind

Have a question, bug, or feature request? Let us know! https://kind.sigs.k8s.io/#community 🙂

Installing NGINX Ingress Controller


NGINX Ingress Controller is one of the many available Kubernetes Ingress Controllers: it acts as a load balancer and satisfies the routing rules specified in Ingress resources, using NGINX as a reverse proxy.

NGINX Ingress Controller can be installed via its Helm chart.

helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm install ingress-nginx/ingress-nginx --name-template ingress-nginx --create-namespace -n ingress-nginx --values kind/ingress-nginx-values.yaml --version 4.8.3 --wait
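
The kind/ingress-nginx-values.yaml file referenced above adapts the chart to Kind. Its exact content may differ, but for a Kind cluster set up with host port mappings as sketched earlier, the values typically look something like this (the value names come from the ingress-nginx chart; the node selector is an assumption tied to the ingress-ready label):

controller:
  # Bind the controller directly to the node's ports 80/443 (matches the Kind port mappings)
  hostPort:
    enabled: true
  # Schedule the controller on the node labeled for ingress (assumed label)
  nodeSelector:
    ingress-ready: "true"
  # No cloud LoadBalancer in Kind, so expose the controller as a NodePort service
  service:
    type: NodePort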

Now, if everything goes according to plan, you should be able to see the ingress-nginx-controller Deployment running.

➜ kubectl get deploy -n ingress-nginx
NAME                       READY   UP-TO-DATE   AVAILABLE   AGE
ingress-nginx-controller   1/1     1            1           4m35s

Installing Metrics Server


Metrics Server is a source of container resource metrics: it collects them from the Kubelets and exposes them in the Kubernetes API server through the Metrics API, for use by the Horizontal Pod Autoscaler and the Vertical Pod Autoscaler.

Metrics Server can be installed via its Helm chart.

helm repo add metrics-server https://kubernetes-sigs.github.io/metrics-server
helm install metrics-server/metrics-server --name-template metrics-server --create-namespace -n metrics-server --values kind/metrics-server-values.yaml --version 3.11.0 --wait
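
On Kind, the Kubelets serve their metrics endpoint with self-signed certificates, so Metrics Server usually needs to be told to skip TLS verification. The kind/metrics-server-values.yaml file presumably contains something along these lines (args is a value of the metrics-server chart, and the flag itself is a standard Metrics Server option):

# Assumption: allow scraping Kubelets that use self-signed certificates (typical for local Kind clusters)
args:
  - --kubelet-insecure-tls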

Now, if everything goes according to plan, you should be able to see the metrics-server Deployment running.

➜ kubectl get deploy -n metrics-server
NAME             READY   UP-TO-DATE   AVAILABLE   AGE
metrics-server   1/1     1            1           38s

Horizontal Pod Autoscaler


Before we dive in, let’s quickly remind ourselves of what a Horizontal Pod Autoscaler in Kubernetes actually is:

A HorizontalPodAutoscaler (HPA for short) automatically updates a workload resource (such as a Deployment or StatefulSet), with the aim of automatically scaling the workload to match demand.
Horizontal scaling means that the response to increased load is to deploy more Pods. This is different from vertical scaling, which for Kubernetes would mean assigning more resources (for example: memory or CPU) to the Pods that are already running for the workload.
If the load decreases, and the number of Pods is above the configured minimum, the HorizontalPodAutoscaler instructs the workload resource (the Deployment, StatefulSet, or other similar resource) to scale back down.

Now that we know what an HPA is, let’s get started, shall we? 🧐

Autoscaling Native Services based on CPU and Memory Usage


A native service is a piece of software that does not require a virtual environment (such as a language runtime or virtual machine) in order to run: it is compiled ahead of time into a binary for the target OS and CPU architecture. This is the case for C/C++, Golang, Rust, and more: those languages compile into a binary that is directly executable by the Pod.
This means that native services can utilize all of the CPU and memory available to the Pod they run in, without an intermediary environment.

Let’s try it out with a Golang service!

helm install golang-sample-app/helm-chart --name-template sample-app --create-namespace -n sample-app --wait

If everything goes fine, you should eventually see one Deployment with the READY state.

➜ kubectl get deploy -n sample-app
NAME         READY   UP-TO-DATE   AVAILABLE   AGE
sample-app   2/2     2            2           44s

Once the Pods are running, Metrics Server will start collecting their resource metrics from the node’s Kubelet and exposing them in Kubernetes through the Metrics API, for use by the Horizontal Pod Autoscaler and the Vertical Pod Autoscaler.

Let’s see what the resource usage for those Pods currently is!

➜ kubectl top pods -n sample-app
NAME                          CPU(cores)   MEMORY(bytes)
sample-app-6bcbfc8b49-j6xmq   1m           1Mi
sample-app-6bcbfc8b49-wtd8g   1m           1Mi

Pretty low, right? 🤔
This is obviously expected since our Go service currently isn’t handling any load.

Alright, now let’s have a look at the HPA!

➜ kubectl describe hpa -n sample-app
...
Metrics: ( current / target )
  resource memory on pods (as a percentage of request):  8% / 50%
  resource cpu on pods (as a percentage of request):     10% / 50%
Min replicas:                                            2
Max replicas:                                            8
...

As you can see, this HPA is configured to scale the service based on both CPU and memory, with average utilization of 50% each.

This means that as soon as either the CPU or memory utilization breaches the 50% threshold, the HPA will trigger an upscale.

Under minimal load, the HPA will still retain a replica count of 2, while the maximum number of Pods the HPA is allowed to spin up under high load is 8.

Note: in a production environment, it is recommended to have a minimum replica count of at least 3, to maintain availability in case of Pod disruption.
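
For reference, the HPA rendered by the chart should be roughly equivalent to the following autoscaling/v2 manifest; the exact template in the repository may differ, but the thresholds and replica bounds match the describe output above.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: sample-app
  namespace: sample-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: sample-app
  minReplicas: 2
  maxReplicas: 8
  metrics:
    # Scale up when average CPU utilization across Pods exceeds 50% of the requests
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 50
    # Scale up when average memory utilization across Pods exceeds 50% of the requests
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 50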

Now, this is the moment you’ve certainly expected… It’s Load Testing time! 😎

For Load Testing, I really recommend k6 from the Grafana Labs team. It is a dead-simple yet super powerful tool with very extensive documentation.

See for yourself!

k6 run k6/script.js

While the load test is running, I suggest watching the HPA in a separate tab.

kubectl get hpa -n sample-app -w

As the load test progresses and the 2 starting Pods struggle to handle incoming requests, you should see both CPU and memory targets increasing, and ultimately, the replica count reaching its maximum!

Deployment/sample-app   16%/50%, 10%/50%   2         8         2
Deployment/sample-app   16%/50%, 15%/50%   2         8         2
Deployment/sample-app   17%/50%, 40%/50%   2         8         2
Deployment/sample-app   18%/50%, 50%/50%   2         8         2
Deployment/sample-app   19%/50%, 60%/50%   2         8         2
Deployment/sample-app   22%/50%, 75%/50%   2         8         3
Deployment/sample-app   27%/50%, 85%/50%   2         8         3
Deployment/sample-app   24%/50%, 80%/50%   2         8         4
Deployment/sample-app   27%/50%, 80%/50%   2         8         5
Deployment/sample-app   22%/50%, 72%/50%   2         8         5
Deployment/sample-app   23%/50%, 70%/50%   2         8         6
Deployment/sample-app   25%/50%, 64%/50%   2         8         7
Deployment/sample-app   24%/50%, 61%/50%   2         8         7
Deployment/sample-app   25%/50%, 61%/50%   2         8         7
Deployment/sample-app   27%/50%, 60%/50%   2         8         8
Deployment/sample-app   28%/50%, 60%/50%   2         8         8
Deployment/sample-app   27%/50%, 57%/50%   2         8         8

When relying on multiple target metrics for a single HPA, you can find out which of them triggered the scale-up or scale-down by consulting the Kubernetes events.

➜ kubectl get events -n sample-app
...
New size: 3; reason: cpu resource utilization (percentage of request) above target
Scaled up replica set sample-app-6bcbfc8b49 to 3
New size: 4; reason: cpu resource utilization (percentage of request) above target
Scaled up replica set sample-app-6bcbfc8b49 to 4
New size: 5; reason: cpu resource utilization (percentage of request) above target
Scaled up replica set sample-app-6bcbfc8b49 to 5
New size: 6; reason: cpu resource utilization (percentage of request) above target
Scaled up replica set sample-app-6bcbfc8b49 to 6
New size: 7; reason: cpu resource utilization (percentage of request) above target
Scaled up replica set sample-app-6bcbfc8b49 to 7
New size: 8; reason: cpu resource utilization (percentage of request) above target
Scaled up replica set sample-app-6bcbfc8b49 to 8
...

Now, let’s quickly have a look at the Load Test summary and the result of the http_req_duration metric in particular!

          /\      |‾‾| /‾‾/   /‾‾/
     /\  /  \     |  |/  /   /  /
    /  \/    \    |     (   /   ‾‾\
   /          \   |  |\  \ |  ()  |
  / __________ \  |__| \__\ \_____/ .io

  execution: local
     script: k6/script.js
     output: -

  scenarios: (100.00%) 1 scenario, 100 max VUs, 5m30s max duration (incl. graceful stop):
           * load: Up to 100.00 iterations/s for 5m0s over 2 stages (maxVUs: 100, gracefulStop: 30s)


     ✓ status code is 200
     ✓ node is kind-control-plane
     ✓ namespace is sample-app
     ✓ pod is sample-app-*

   ✓ checks.........................: 100.00% ✓ 60360     ✗ 0
     data_received..................: 3.5 MB  12 kB/s
     data_sent......................: 1.7 MB  5.8 kB/s
     http_req_blocked...............: avg=17.57µs  min=1µs      med=9µs    max=5.48ms  p(90)=17µs    p(95)=21µs
     http_req_connecting............: avg=4.36µs   min=0s       med=0s     max=5.26ms  p(90)=0s      p(95)=0s
   ✓ http_req_duration..............: avg=17.63ms  min=496µs    med=3.16ms max=1.76s   p(90)=11.38ms p(95)=51.18ms
       { expected_response:true }...: avg=17.63ms  min=496µs    med=3.16ms max=1.76s   p(90)=11.38ms p(95)=51.18ms
     http_req_failed................: 0.00%   ✓ 0         ✗ 15090
     http_req_receiving.............: avg=107.72µs min=10µs     med=78µs   max=7.62ms  p(90)=164µs   p(95)=214µs
     http_req_sending...............: avg=53.87µs  min=5µs      med=35µs   max=15.33ms p(90)=73µs    p(95)=95µs
     http_req_tls_handshaking.......: avg=0s       min=0s       med=0s     max=0s      p(90)=0s      p(95)=0s
     http_req_waiting...............: avg=17.47ms  min=423µs    med=2.99ms max=1.76s   p(90)=11.08ms p(95)=51.02ms
     http_reqs......................: 15090   50.29863/s
     iteration_duration.............: avg=18.11ms  min=626.66µs med=3.64ms max=1.76s   p(90)=12.03ms p(95)=51.78ms
     iterations.....................: 15090   50.29863/s
     vus............................: 0       min=0      max=18
     vus_max........................: 100     min=100    max=100


running (5m00.0s), 000/100 VUs, 15090 complete and 0 interrupted iterations
load ✓ [======================================] 000/100 VUs  5m0s  000.65 iters/s

As you can observe, our Golang service has performed very well under heavy load, with a Success Share of 100%, a median latency of ~3ms, and a 95th percentile latency of ~50ms!

We have the HPA to thank for that, as it scaled the Deployment from 2 to 8 Pods swiftly and automatically, based on the Pods’ resource usage!

We definitely would not have had the same results without an HPA… Actually, why don’t you try it yourself? 😉

Just delete the HPA (kubectl delete hpa sample-app -n sample-app), run the load test again (k6 run k6/script.js) and see what happens! (spoiler alert: it’s not pretty 😬)

Once you are done, don’t forget to uninstall the Helm release! (we won’t be needing this one anymore)

helm uninstall sample-app -n sample-app

Autoscaling JVM Services based on CPU Usage


While native services run as executable binaries, JVM services need an extra environment layer in order to run on various OSs and CPU architectures: the Java Virtual Machine (JVM).

This creates an issue for Pod autoscaling, as the JVM pre-allocates more memory from the Pod’s resources than it actually needs, for its Garbage Collector (GC). This makes memory an unreliable metric altogether for autoscaling a JVM-based service in Kubernetes via Metrics Server.

Thus, in the case of Java or other JVM-based services, when utilizing Metrics Server for HPA, one can only rely on the CPU metric for autoscaling.
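
Concretely, compared to the HPA manifest sketched earlier, the metrics section for a JVM service would presumably shrink to a single CPU entry:

metrics:
  # Memory is left out on purpose: the JVM's pre-allocation makes it a misleading signal
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50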

Let’s experience it with a Kotlin/JVM service!

helm install kotlin-sample-app/helm-chart --name-template sample-app --create-namespace -n sample-app --wait

If everything goes fine, you should eventually see one Deployment with the READY state.

➜ kubectl get deploy -n sample-app
NAME         READY   UP-TO-DATE   AVAILABLE   AGE
sample-app   2/2     2            2           52s

Let’s see what the resource usage for those Pods running a JVM currently is!

➜ kubectl top pods -n sample-app
NAME                         CPU(cores)   MEMORY(bytes)
sample-app-8df8cfcd4-lg9s8   7m           105Mi
sample-app-8df8cfcd4-r9fjh   7m           105Mi

Interesting! As you can see, while being idle, both Pods consume ~105Mi (~110MB) of memory, which is almost 50% of the Pods’ memory limit! 😱
As previously stated, this is due to the JVM pre-allocating memory for its Garbage Collector (GC).

Alright, now let’s have a look at the HPA!

➜ kubectl describe hpa -n sample-app
...
Metrics: ( current / target )
  resource cpu on pods (as a percentage of request):  10% / 50%
Min replicas:                                         2
Max replicas:                                         8
...

As announced, this time the HPA only relies on one resource metric: the CPU.

Alright, let’s give our favorite Load Testing tool another go! 🚀

k6 run k6/script.js

As previously mentioned, I suggest watching the HPA in a separate tab.

kubectl get hpa -n sample-app -w

As the load test progresses and the 2 starting Pods struggle to handle incoming requests, you should see the CPU target increasing, and ultimately, the replica count reaching its maximum!

Deployment/sample-app   10%/50%   2         8         2
Deployment/sample-app   15%/50%   2         8         2
Deployment/sample-app   36%/50%   2         8         2
Deployment/sample-app   64%/50%   2         8         2
Deployment/sample-app   37%/50%   2         8         3
Deployment/sample-app   41%/50%   2         8         3
Deployment/sample-app   51%/50%   2         8         3
Deployment/sample-app   99%/50%   2         8         3
Deployment/sample-app   56%/50%   2         8         6
Deployment/sample-app   50%/50%   2         8         6
Deployment/sample-app   76%/50%   2         8         6
Deployment/sample-app   74%/50%   2         8         8
Deployment/sample-app   61%/50%   2         8         8
Deployment/sample-app   58%/50%   2         8         8

Now, let’s quickly have a look at the Load Test summary and the result of the http_req_duration metric in particular!

          /\      |‾‾| /‾‾/   /‾‾/
     /\  /  \     |  |/  /   /  /
    /  \/    \    |     (   /   ‾‾\
   /          \   |  |\  \ |  ()  |
  / __________ \  |__| \__\ \_____/ .io

  execution: local
     script: k6/script.js
     output: -

  scenarios: (100.00%) 1 scenario, 100 max VUs, 5m30s max duration (incl. graceful stop):
           * load: Up to 100.00 iterations/s for 5m0s over 2 stages (maxVUs: 100, gracefulStop: 30s)


     ✓ status code is 200
     ✓ node is kind-control-plane
     ✓ namespace is sample-app
     ✓ pod is sample-app-*

   ✓ checks.........................: 100.00% ✓ 60360     ✗ 0
     data_received..................: 3.3 MB  11 kB/s
     data_sent......................: 1.7 MB  5.8 kB/s
     http_req_blocked...............: avg=18.56µs  min=1µs    med=9µs    max=2.37ms p(90)=17µs    p(95)=20µs
     http_req_connecting............: avg=5.52µs   min=0s     med=0s     max=1.65ms p(90)=0s      p(95)=0s
   ✓ http_req_duration..............: avg=13.4ms   min=864µs  med=3.7ms  max=1.96s  p(90)=11.43ms p(95)=43.16ms
       { expected_response:true }...: avg=13.4ms   min=864µs  med=3.7ms  max=1.96s  p(90)=11.43ms p(95)=43.16ms
     http_req_failed................: 0.00%   ✓ 0         ✗ 15090
     http_req_receiving.............: avg=101.68µs min=10µs   med=79µs   max=5.31ms p(90)=167µs   p(95)=217µs
     http_req_sending...............: avg=47.68µs  min=4µs    med=37µs   max=5.87ms p(90)=70µs    p(95)=88µs
     http_req_tls_handshaking.......: avg=0s       min=0s     med=0s     max=0s     p(90)=0s      p(95)=0s
     http_req_waiting...............: avg=13.25ms  min=803µs  med=3.54ms max=1.96s  p(90)=11.25ms p(95)=43.05ms
     http_reqs......................: 15090   50.306331/s
     iteration_duration.............: avg=13.84ms  min=1.06ms med=4.17ms max=1.96s  p(90)=12.02ms p(95)=43.55ms
     iterations.....................: 15090   50.306331/s
     vus............................: 0       min=0       max=13
     vus_max........................: 100     min=100     max=100


running (5m00.0s), 000/100 VUs, 15090 complete and 0 interrupted iterations
load ✓ [======================================] 000/100 VUs  5m0s  000.65 iters/s

As you can observe, our Kotlin/JVM service has performed very well under heavy load, with a Success Share of 100%, a median latency of ~4ms, and a 95th percentile latency of ~43ms!

Once again, the HPA was able to scale the Deployment from 2 to 8 Pods swiftly and automatically, based on the Pods’ CPU usage alone!

Note: if you keep the Deployment idle for a few minutes, you should see the HPA gradually scaling back down to 2 Pods, due to low CPU usage.
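
This scale-down is deliberately slower than the scale-up: by default, the HPA applies a 5-minute stabilization window before removing Pods, which can be tuned through the behavior field of the HPA spec (shown here with its default value, purely as an illustration):

behavior:
  scaleDown:
    # Wait 5 minutes of consistently low usage before scaling down (Kubernetes default)
    stabilizationWindowSeconds: 300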

Wrapping up


That’s it! You can now stop and delete your Kind cluster.

kind delete cluster

To summarize, using Metrics Server we were able to:

  • Horizontally autoscale our native service written in Golang, based on both CPU and memory usage;
  • Horizontally autoscale our JVM service written in Kotlin, based on CPU usage alone.

Was it worth it? Did that help you understand how to implement Horizontal Pod Autoscaler in Kubernetes using Metrics Server?

If so, follow me on Twitter, I’ll be happy to answer any of your questions and you’ll be the first one to know when a new article comes out! 👌

See you next month, for Part 2 of my series Horizontal Pod Autoscaler in Kubernetes!

Bye-bye! 👋