Horizontal Pod Autoscaler in Kubernetes (Part 1) — Simple Autoscaling using Metrics Server
The Horizontal Pod Autoscaler (HPA) is a fundamental feature of Kubernetes. It enables automatic scale-up and scale-down of containerized applications based on CPU usage, memory usage, or custom metrics.
Traditionally, when scaling software, we first think of vertical scaling: the CPU and the RAM are increased so the application consuming them can perform better. While this seems like a flawless mechanism on paper, it actually comes with many drawbacks.
First, upgrading the CPU or RAM of a physical machine (or VM) requires downtime, and unless a Pod Disruption Budget (PDB) is in place to handle the disruption, all Pods will be evicted and recreated on the resized node.
Node resource usage is also harder to optimize, as scaling vertically requires enough free resources on a single node, whereas horizontal scaling can spread the same amount of resources across multiple nodes.
Additionally, vertical scaling is not as resilient as horizontal scaling, since fewer replicas mean a higher risk of disruption in case of node failure.
Finally, past a certain threshold, scaling only vertically becomes very expensive and, most importantly, isn’t limitless: there is only so much CPU and RAM a single physical machine (or VM) can handle.
This is where horizontal scaling comes into play!
At some point, it becomes more efficient to duplicate an instance than to keep increasing its resources.
🎬 Hi there, I’m Jean!
In this two-part series, we’re going to explore several ways to scale services horizontally in Kubernetes, and the first one is…
🥁
… using Metrics Server! 🎊
Before we start, make sure you have the following tools installed:
- kind
- kubectl
- Helm
- k6
Note: macOS users, or Linux users with Homebrew, can simply run:
brew install kind kubectl helm k6
All set? Let’s go! 🏁
Kind is a tool for running local Kubernetes clusters using Docker container “nodes”. It was primarily designed for testing Kubernetes itself, but may be used for local development or CI.
I don’t expect you to have a demo project handy, so I built one for you.
git clone https://github.com/jhandguy/horizontal-pod-autoscaler.git
cd horizontal-pod-autoscaler
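The --config flag in the next command points to kind/cluster.yaml. I won’t reproduce the repository’s file here, but a typical Kind config for exposing an Ingress controller on localhost looks roughly like the sketch below (the actual cluster.yaml may differ): it labels the control-plane node as ingress-ready and maps ports 80/443 from the node to the host.

```yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
    # label the node so the Ingress controller can be pinned to it
    kubeadmConfigPatches:
      - |
        kind: InitConfiguration
        nodeRegistration:
          kubeletExtraArgs:
            node-labels: "ingress-ready=true"
    # forward the host's ports 80/443 to the Ingress controller's hostPorts
    extraPortMappings:
      - containerPort: 80
        hostPort: 80
        protocol: TCP
      - containerPort: 443
        hostPort: 443
        protocol: TCP
```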
Alright, let’s spin up our Kind cluster! 🚀
➜ kind create cluster --image kindest/node:v1.27.3 --config=kind/cluster.yaml
Creating cluster "kind" ...
✓ Ensuring node image (kindest/node:v1.27.3) 🖼
✓ Preparing nodes 📦
✓ Writing configuration 📜
✓ Starting control-plane 🕹️
✓ Installing CNI 🔌
✓ Installing StorageClass 💾
Set kubectl context to "kind-kind"
You can now use your cluster with:
kubectl cluster-info --context kind-kind
Have a question, bug, or feature request? Let us know! https://kind.sigs.k8s.io/#community 🙂
NGINX Ingress Controller is one of the many available Kubernetes Ingress Controllers, which acts as a load balancer and satisfies routing rules specified in Ingress resources, using the NGINX reverse proxy.
NGINX Ingress Controller can be installed via its Helm chart.
helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm install ingress-nginx/ingress-nginx --name-template ingress-nginx --create-namespace -n ingress-nginx --values kind/ingress-nginx-values.yaml --version 4.8.3 --wait
Now, if everything goes according to plan, you should be able to see the ingress-nginx-controller Deployment running.
➜ kubectl get deploy -n ingress-nginx
NAME READY UP-TO-DATE AVAILABLE AGE
ingress-nginx-controller 1/1 1 1 4m35s
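By the way, the kind/ingress-nginx-values.yaml file referenced above isn’t shown in this article. On Kind, the usual tweaks are to bind the controller to the node’s host ports and to schedule it on the node carrying the ingress-ready label; the sketch below uses standard ingress-nginx chart values, though the repository’s file may differ.

```yaml
controller:
  hostPort:
    enabled: true            # bind ports 80/443 directly on the Kind node
  service:
    type: NodePort           # no cloud LoadBalancer available in Kind
  nodeSelector:
    ingress-ready: "true"    # schedule on the node exposing the extraPortMappings
  tolerations:
    - key: node-role.kubernetes.io/control-plane
      operator: Exists
      effect: NoSchedule     # allow running on the control-plane node
```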
Metrics Server is a source of container resource metrics: it collects them from the Kubelets and exposes them in the Kubernetes API server through the Metrics API, for use by the Horizontal Pod Autoscaler and the Vertical Pod Autoscaler.
Metrics Server can be installed via its Helm chart.
helm repo add metrics-server https://kubernetes-sigs.github.io/metrics-server
helm install metrics-server/metrics-server --name-template metrics-server --create-namespace -n metrics-server --values kind/metrics-server-values.yaml --version 3.11.0 --wait
Now, if everything goes according to plan, you should be able to see the metrics-server Deployment running.
➜ kubectl get deploy -n metrics-server
NAME READY UP-TO-DATE AVAILABLE AGE
metrics-server 1/1 1 1 38s
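Similarly, kind/metrics-server-values.yaml is referenced but not shown. On Kind, Metrics Server typically needs to be told to accept the Kubelets’ self-signed serving certificates, so the values file usually boils down to something like this sketch (the repository’s file may differ):

```yaml
args:
  - --kubelet-insecure-tls   # Kind's Kubelets serve self-signed certificates
```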
Before we dive in, let’s quickly remind ourselves of what a Horizontal Pod Autoscaler in Kubernetes actually is:
A HorizontalPodAutoscaler (HPA for short) automatically updates a workload resource (such as a Deployment or StatefulSet), with the aim of automatically scaling the workload to match demand.
Horizontal scaling means that the response to increased load is to deploy more Pods. This is different from vertical scaling, which for Kubernetes would mean assigning more resources (for example: memory or CPU) to the Pods that are already running for the workload.
If the load decreases, and the number of Pods is above the configured minimum, the HorizontalPodAutoscaler instructs the workload resource (the Deployment, StatefulSet, or other similar resource) to scale back down.
Now that we know what an HPA is, let’s get started, shall we? 🧐
A native service is a piece of software that can run on different OSs and CPU architectures without requiring a virtual environment. This is the case for C/C++, Golang, Rust, and more: these languages compile into a binary that is directly executable by the Pod.
This means that native services can utilize all of the CPU and memory available from the Pod they run in, without an intermediary environment.
Let’s try it out with a Golang service!
helm install golang-sample-app/helm-chart --name-template sample-app --create-namespace -n sample-app --wait
If everything goes fine, you should eventually see one Deployment with the READY state.
➜ kubectl get deploy -n sample-app
NAME READY UP-TO-DATE AVAILABLE AGE
sample-app 2/2 2 2 44s
Once the Pods are running, Metrics Server will start collecting their resource metrics from the node’s Kubelet and exposing them through the Metrics API, for use by the Horizontal Pod Autoscaler and the Vertical Pod Autoscaler.
Let’s see what the resource usage for those Pods currently is!
➜ kubectl top pods -n sample-app
NAME CPU(cores) MEMORY(bytes)
sample-app-6bcbfc8b49-j6xmq 1m 1Mi
sample-app-6bcbfc8b49-wtd8g 1m 1Mi
Pretty low, right? 🤔
This is obviously expected since our Go service currently isn’t handling any load.
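One thing worth keeping in mind before we look at the HPA: the utilization percentages it works with are computed against the containers’ resource requests, not their limits. As a purely illustrative sketch (the actual values live in the Helm chart), requests of 10m CPU and 12Mi memory would make the idle usage above (1m / 1Mi) show up as roughly 10% / 8%:

```yaml
resources:
  requests:
    cpu: 10m      # 1m of usage ≈ 10% utilization from the HPA's point of view
    memory: 12Mi  # 1Mi of usage ≈ 8% utilization from the HPA's point of view
```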
Alright, now let’s have a look at the HPA!
➜ kubectl describe hpa -n sample-app
...
Metrics: ( current / target )
resource memory on pods (as a percentage of request): 8% / 50%
resource cpu on pods (as a percentage of request): 10% / 50%
Min replicas: 2
Max replicas: 8
...
As you can see, this HPA is configured to scale the service based on both CPU and memory, with average utilization of 50% each.
This means that as soon as either the CPU or memory utilization breaches the 50% threshold, the HPA will trigger an upscale.
Under minimal load, the HPA will still retain a replica count of 2, while the maximum number of Pods the HPA is allowed to spin up under high load is 8.
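For reference, such a configuration corresponds to an autoscaling/v2 HorizontalPodAutoscaler. The manifest below is a sketch of what the chart’s HPA template roughly renders to (metadata and labels are assumptions, not the chart’s exact output):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: sample-app
  namespace: sample-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: sample-app          # the workload being scaled
  minReplicas: 2
  maxReplicas: 8
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 50   # upscale once average CPU utilization exceeds 50% of requests
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 50   # upscale once average memory utilization exceeds 50% of requests
```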
Note: in a production environment, it is recommended to have a minimum replica count of at least 3, to maintain availability in case of Pod disruption.
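To guard against voluntary disruptions (node drains, upgrades) specifically, the HPA is usually paired with a PodDisruptionBudget. Here is a minimal sketch, assuming the chart labels its Pods with app: sample-app (the actual label may differ):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: sample-app
  namespace: sample-app
spec:
  minAvailable: 1          # always keep at least one Pod running during voluntary disruptions
  selector:
    matchLabels:
      app: sample-app      # assumed Pod label
```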
Now, this is the moment you’ve certainly been waiting for… It’s Load Testing time! 😎
For Load Testing, I really recommend k6 from the Grafana Labs team. It is a dead-simple yet super powerful tool with very extensive documentation.
See for yourself!
k6 run k6/script.js
While the load test is running, I suggest watching the HPA in a separate tab.
kubectl get hpa -n sample-app -w
As the load test progresses and the 2 starting Pods struggle to handle incoming requests, you should see both CPU and memory targets increasing, and ultimately, the replica count reaching its maximum!
REFERENCE TARGETS MINPODS MAXPODS REPLICAS
Deployment/sample-app 16%/50%, 10%/50% 2 8 2
Deployment/sample-app 16%/50%, 15%/50% 2 8 2
Deployment/sample-app 17%/50%, 40%/50% 2 8 2
Deployment/sample-app 18%/50%, 50%/50% 2 8 2
Deployment/sample-app 19%/50%, 60%/50% 2 8 2
Deployment/sample-app 22%/50%, 75%/50% 2 8 3
Deployment/sample-app 27%/50%, 85%/50% 2 8 3
Deployment/sample-app 24%/50%, 80%/50% 2 8 4
Deployment/sample-app 27%/50%, 80%/50% 2 8 5
Deployment/sample-app 22%/50%, 72%/50% 2 8 5
Deployment/sample-app 23%/50%, 70%/50% 2 8 6
Deployment/sample-app 25%/50%, 64%/50% 2 8 7
Deployment/sample-app 24%/50%, 61%/50% 2 8 7
Deployment/sample-app 25%/50%, 61%/50% 2 8 7
Deployment/sample-app 27%/50%, 60%/50% 2 8 8
Deployment/sample-app 28%/50%, 60%/50% 2 8 8
Deployment/sample-app 27%/50%, 57%/50% 2 8 8
When relying on multiple metrics in a single HPA, you can find out which one triggered the upscale or downscale by consulting the Kubernetes events.
➜ kubectl get events -n sample-app
...
New size: 3; reason: cpu resource utilization (percentage of request) above target
Scaled up replica set sample-app-6bcbfc8b49 to 3
New size: 4; reason: cpu resource utilization (percentage of request) above target
Scaled up replica set sample-app-6bcbfc8b49 to 4
New size: 5; reason: cpu resource utilization (percentage of request) above target
Scaled up replica set sample-app-6bcbfc8b49 to 5
New size: 6; reason: cpu resource utilization (percentage of request) above target
Scaled up replica set sample-app-6bcbfc8b49 to 6
New size: 7; reason: cpu resource utilization (percentage of request) above target
Scaled up replica set sample-app-6bcbfc8b49 to 7
New size: 8; reason: cpu resource utilization (percentage of request) above target
Scaled up replica set sample-app-6bcbfc8b49 to 8
...
Now, let’s quickly have a look at the Load Test summary and the result of the http_req_duration metric in particular!
execution: local
script: k6/script.js
output: -
scenarios: (100.00%) 1 scenario, 100 max VUs, 5m30s max duration (incl. graceful stop):
* load: Up to 100.00 iterations/s for 5m0s over 2 stages (maxVUs: 100, gracefulStop: 30s)
✓ status code is 200
✓ node is kind-control-plane
✓ namespace is sample-app
✓ pod is sample-app-*
✓ checks.........................: 100.00% ✓ 60360 ✗ 0
data_received..................: 3.5 MB 12 kB/s
data_sent......................: 1.7 MB 5.8 kB/s
http_req_blocked...............: avg=17.57µs min=1µs med=9µs max=5.48ms p(90)=17µs p(95)=21µs
http_req_connecting............: avg=4.36µs min=0s med=0s max=5.26ms p(90)=0s p(95)=0s
✓ http_req_duration..............: avg=17.63ms min=496µs med=3.16ms max=1.76s p(90)=11.38ms p(95)=51.18ms
{ expected_response:true }...: avg=17.63ms min=496µs med=3.16ms max=1.76s p(90)=11.38ms p(95)=51.18ms
http_req_failed................: 0.00% ✓ 0 ✗ 15090
http_req_receiving.............: avg=107.72µs min=10µs med=78µs max=7.62ms p(90)=164µs p(95)=214µs
http_req_sending...............: avg=53.87µs min=5µs med=35µs max=15.33ms p(90)=73µs p(95)=95µs
http_req_tls_handshaking.......: avg=0s min=0s med=0s max=0s p(90)=0s p(95)=0s
http_req_waiting...............: avg=17.47ms min=423µs med=2.99ms max=1.76s p(90)=11.08ms p(95)=51.02ms
http_reqs......................: 15090 50.29863/s
iteration_duration.............: avg=18.11ms min=626.66µs med=3.64ms max=1.76s p(90)=12.03ms p(95)=51.78ms
iterations.....................: 15090 50.29863/s
vus............................: 0 min=0 max=18
vus_max........................: 100 min=100 max=100
running (5m00.0s), 000/100 VUs, 15090 complete and 0 interrupted iterations
load ✓ [======================================] 000/100 VUs 5m0s 000.65 iters/s
As you can observe, our Golang service has performed very well under heavy load, with a success rate of 100%, a median latency of ~3ms, and a 95th-percentile latency of ~50ms!
We have the HPA to thank for that, as it swiftly and automatically scaled the Deployment from 2 to 8 Pods, based on the Pods’ resource usage!
We definitely would not have had the same results without an HPA… Actually, why don’t you try it yourself? 😉
Just delete the HPA (kubectl delete hpa sample-app -n sample-app), run the load test again (k6 run k6/script.js) and see what happens! (spoiler alert: it’s not pretty 😬)
Once you are done, don’t forget to uninstall the Helm release! (we won’t be needing this one anymore)
helm uninstall sample-app -n sample-app
While native services run as executable binaries, JVM services need an extra environment layer in order to run on various OSs and CPU architectures: the Java Virtual Machine (JVM).
This creates an issue for Pod autoscaling: the JVM pre-allocates more memory from the Pod’s resources than it actually needs at any given time, for its heap and Garbage Collector (GC). This makes memory an unreliable metric for autoscaling a JVM-based service in Kubernetes via Metrics Server.
Thus, in the case of Java or other JVM-based services, when utilizing Metrics Server for HPA, one can only rely on the CPU metric for autoscaling.
Let’s experience it with a Kotlin/JVM service!
helm install kotlin-sample-app/helm-chart --name-template sample-app --create-namespace -n sample-app --wait
If everything goes fine, you should eventually see one Deployment with the READY state.
➜ kubectl get deploy -n sample-app
NAME READY UP-TO-DATE AVAILABLE AGE
sample-app 2/2 2 2 52s
Let’s see what the resource usage for those Pods running a JVM currently is!
➜ kubectl top pods -n sample-app
NAME CPU(cores) MEMORY(bytes)
sample-app-8df8cfcd4-lg9s8 7m 105Mi
sample-app-8df8cfcd4-r9fjh 7m 105Mi
Interesting! As you can see, while being idle, both Pods already consume ~100Mi (~105MB) of memory, which is almost 50% of the Pods’ memory limit! 😱
As previously stated, this is due to the JVM pre-allocating memory for its Garbage Collector (GC).
Alright, now let’s have a look at the HPA!
➜ kubectl describe hpa -n sample-app
...
Metrics: ( current / target )
resource cpu on pods (as a percentage of request): 10% / 50%
Min replicas: 2
Max replicas: 8
...
As announced, this time the HPA only relies on one resource metric: the CPU.
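In manifest terms, the only difference from the Golang chart’s HPA is the metrics list, which now contains a single CPU entry. A sketch (again, metadata is an assumption):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: sample-app
  namespace: sample-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: sample-app
  minReplicas: 2
  maxReplicas: 8
  metrics:
    - type: Resource       # memory is intentionally left out for the JVM service
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 50
```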
Alright, let’s give our favorite Load Testing tool another go! 🚀
k6 run k6/script.js
As previously mentioned, I suggest watching the HPA in a separate tab.
kubectl get hpa -n sample-app -w
As the load test progresses and the 2 starting Pods struggle to handle incoming requests, you should see the CPU target increasing, and ultimately, the replica count reaching its maximum!
REFERENCE TARGETS MINPODS MAXPODS REPLICAS
Deployment/sample-app 10%/50% 2 8 2
Deployment/sample-app 15%/50% 2 8 2
Deployment/sample-app 36%/50% 2 8 2
Deployment/sample-app 64%/50% 2 8 2
Deployment/sample-app 37%/50% 2 8 3
Deployment/sample-app 41%/50% 2 8 3
Deployment/sample-app 51%/50% 2 8 3
Deployment/sample-app 99%/50% 2 8 3
Deployment/sample-app 56%/50% 2 8 6
Deployment/sample-app 50%/50% 2 8 6
Deployment/sample-app 76%/50% 2 8 6
Deployment/sample-app 74%/50% 2 8 8
Deployment/sample-app 61%/50% 2 8 8
Deployment/sample-app 58%/50% 2 8 8
Now, let’s quickly have a look at the Load Test summary and the result of the http_req_duration metric in particular!
execution: local
script: k6/script.js
output: -
scenarios: (100.00%) 1 scenario, 100 max VUs, 5m30s max duration (incl. graceful stop):
* load: Up to 100.00 iterations/s for 5m0s over 2 stages (maxVUs: 100, gracefulStop: 30s)
✓ status code is 200
✓ node is kind-control-plane
✓ namespace is sample-app
✓ pod is sample-app-*
✓ checks.........................: 100.00% ✓ 60360 ✗ 0
data_received..................: 3.3 MB 11 kB/s
data_sent......................: 1.7 MB 5.8 kB/s
http_req_blocked...............: avg=18.56µs min=1µs med=9µs max=2.37ms p(90)=17µs p(95)=20µs
http_req_connecting............: avg=5.52µs min=0s med=0s max=1.65ms p(90)=0s p(95)=0s
✓ http_req_duration..............: avg=13.4ms min=864µs med=3.7ms max=1.96s p(90)=11.43ms p(95)=43.16ms
{ expected_response:true }...: avg=13.4ms min=864µs med=3.7ms max=1.96s p(90)=11.43ms p(95)=43.16ms
http_req_failed................: 0.00% ✓ 0 ✗ 15090
http_req_receiving.............: avg=101.68µs min=10µs med=79µs max=5.31ms p(90)=167µs p(95)=217µs
http_req_sending...............: avg=47.68µs min=4µs med=37µs max=5.87ms p(90)=70µs p(95)=88µs
http_req_tls_handshaking.......: avg=0s min=0s med=0s max=0s p(90)=0s p(95)=0s
http_req_waiting...............: avg=13.25ms min=803µs med=3.54ms max=1.96s p(90)=11.25ms p(95)=43.05ms
http_reqs......................: 15090 50.306331/s
iteration_duration.............: avg=13.84ms min=1.06ms med=4.17ms max=1.96s p(90)=12.02ms p(95)=43.55ms
iterations.....................: 15090 50.306331/s
vus............................: 0 min=0 max=13
vus_max........................: 100 min=100 max=100
running (5m00.0s), 000/100 VUs, 15090 complete and 0 interrupted iterations
load ✓ [======================================] 000/100 VUs 5m0s 000.65 iters/s
As you can observe, our Kotlin/JVM service has also performed very well under heavy load, with a success rate of 100%, a median latency of ~4ms, and a 95th-percentile latency of ~43ms!
Once again, the HPA was able to scale the Deployment from 2 to 8 Pods swiftly and automatically, based on the Pods’ CPU usage alone!
Note: if you keep the Deployment idle for a few minutes, you should see the HPA gradually scaling back down to 2 Pods, due to low CPU usage.
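The pace of that downscale is governed by the HPA’s behavior settings; by default, Kubernetes applies a 300-second downscale stabilization window before removing Pods. If you ever need to tune it, the relevant fragment of the HorizontalPodAutoscaler spec looks like this (example values, not what the chart ships):

```yaml
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300   # wait 5 minutes of consistently low usage before downscaling
      policies:
        - type: Pods
          value: 1                      # remove at most one Pod...
          periodSeconds: 60             # ...per minute
```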
That’s it! You can now stop and delete your Kind cluster.
kind delete cluster
To summarize, using Metrics Server we were able to:
- autoscale a native (Golang) service horizontally, based on both CPU and memory utilization;
- autoscale a JVM (Kotlin) service horizontally, based on CPU utilization alone.
Was it worth it? Did that help you understand how to implement Horizontal Pod Autoscaler in Kubernetes using Metrics Server?
If so, follow me on Twitter, I’ll be happy to answer any of your questions and you’ll be the first one to know when a new article comes out! 👌
See you next month, for Part 2 of my series Horizontal Pod Autoscaler in Kubernetes!
Bye-bye! 👋