Thanks! We'll be in touch in the next 12 hours
Oops! Something went wrong while submitting the form.

Autoscaling in Kubernetes using HPA and VPA

Autoscaling, a key feature of Kubernetes, lets you improve the resource utilization of your cluster by automatically adjusting the application’s resources or replicas depending on the load at that time.

This blog talks about Pod Autoscaling in Kubernetes and how to set up and configure autoscalers to optimize the resource utilization of your application.

Horizontal Pod Autoscaling

What is the Horizontal Pod Autoscaler?

The Horizontal Pod Autoscaler (HPA) scales the number of pods of a replica-set/ deployment/ statefulset based on per-pod metrics received from resource metrics API ( provided by metrics-server, the custom metrics API (, or the external metrics API (

Horizontal Pod Autoscaling
Fig:- Horizontal Pod Autoscaling


Verify that the metrics-server is already deployed and running using the command below, or deploy it using instructions here.


HPA using Multiple Resource Metrics

HPA fetches per-pod resource metrics (like CPU, memory) from the resource metrics API and calculates the current metric value based on the mean values of all targeted pods. It compares the current metric value with the target metric value specified in the HPA spec and produces a ratio used to scale the number of desired replicas.

A. Setup: Create a Deployment and HPA resource

In this blog post, I have used the config below to create a deployment of 3 replicas, with some memory load defined by “--vm-bytes", "850M”.


NOTE: It’s recommended not to use HPA and VPA on the same pods or deployments.


Lets create an HPA resource for this deployment with multiple metric blocks defined. The HPA will consider each metric one-by-one and calculate the desired replica counts based on each of the metrics, and then select the one with the highest replica count.


  • We have defined the minimum number  of replicas HPA can scale down to as 1 and the maximum number that it can scale up to as 10.
  • Target Average Utilization and Target Average Values implies that the HPA should scale the replicas up/down to keep the Current Metric Value equal or closest to Target Metric Value.

B. Understanding the HPA Algorithm


  • HPA calculates pod utilization as total usage of all containers in the pod divided by total request. It looks at all containers individually and returns if container doesn't have request.
  • The calculated  Current Metric Value for memory, i,e., 894188202666m, is higher than the Target Average Value of 500Mi, so the replicas need to be scaled up.
  • The calculated  Current Metric Value for CPU i.e., 36%, is lower than the Target Average Utilization of 50, so  hence the replicas need to be scaled down.
  • Replicas are calculated based on both metrics and the highest replica count selected. So, the replicas are scaled up to 6 in this case.

HPA using Custom metrics

We will use the prometheus-adapter resource to expose custom application metrics to, which are retrieved by HPA. By defining our own metrics through the adapter’s configuration, we can let HPA perform scaling based on our custom metrics.

A. Setup: Install Prometheus Adapter

Create prometheus-adapter.yaml with the content below:



Once the charts are deployed, verify the metrics are exposed at


You can see the metrics value of all the replicas in the output.

B. Understanding Prometheus Adapter Configuration

The adapter considers metrics defined with the parameters below:

1. seriesQuery tells the Prometheus Metric name to the adapter

2. resources tells which Kubernetes resources each metric is associated with or which labels does the metric include, e.g., namespace, pod etc.

3. metricsQuery is the actual Prometheus query that needs to be performed to calculate the actual values.

4. name with which the metric should be exposed to the custom metrics API

For instance, if we want to calculate the rate of container_network_receive_packets_total, we will need to write this query in Prometheus UI:

sum(rate(container_network_receive_packets_total{namespace="autoscale-tester",pod=~"autoscale-tester.*"}[10m])) by (pod)

This query is represented as below in the adapter configuration:

metricsQuery: 'sum(rate(<<.series>>{<<.labelmatchers>>}10m])) by (<<.groupby>>)'</.groupby></.labelmatchers></.series>

C. Create an HPA resource

Now, let's create an HPA resource with the pod metric packets_in using the config below, and then describe the HPA resource.



Here, the current calculated metric value is 18666m. The m represents milli-units. So, for example, 18666m means 18.666 which is what we expect ((33 + 11 + 10 )/3 = 18.666). Since it's less than the target average value (i.e., 50), the HPA scales down the replicas to make the Current Metric Value : Target Metric Value ratio closest to 1. Hence, replicas are scaled down to 2 and later to 1.

Fig.2 Prometheus :container_network_receive_packets_total{namespace=”autoscale-tester}
Fig:- container_network_receive_packets_total

Prometheus: Ratio of avg(container_network_receive_packets_total{namespace=”autoscale-tester}) :Target Average Value
Fig:- Ratio to Target value

Vertical Pod Autoscaling

What is Vertical Pod Autoscaler?

Vertical Pod autoscaling (VPA) ensures that a container’s resources are not under- or over-utilized. It recommends optimized CPU and memory requests/limits values, and can also automatically update them for you so that the cluster resources are efficiently used.

Fig:- Vertical Pod Autoscaling


VPA consists of 3 components:

  • VPA admission controller
    Once you deploy and enable the Vertical Pod Autoscaler in your cluster, every pod submitted to the cluster goes through this webhook, which checks whether a VPA object is referencing it.
  • VPA recommender
    The recommender pulls the current and past resource consumption (CPU and memory) data for each container from metrics-server running in the cluster and provides optimal resource recommendations based on it, so that a container uses only what it needs.
  • VPA updater
    The updater checks at regular intervals if a pod is running within the recommended range. Otherwise, it accepts it for update, and the pod is evicted by the VPA updater to apply resource recommendation.


If you are on Google Cloud Platform, you can simply enable vertical-pod-autoscaling:


To install it manually follow below steps:

  • Verify that the metrics-server deployment is running, or deploy it using instructions here.


  • Also, verify the API below is enabled:


  • Clone the kubernetes/autoscaler GitHub repository, and then deploy the Vertical Pod Autoscaler with the following command.


Verify that the Vertical Pod Autoscaler pods are up and running:


VPA using Resource Metrics

A. Setup: Create a Deployment and VPA resource

Use the same deployment config to create a new deployment with "--vm-bytes", "850M". Then create a VPA resource in Recommendation Mode with updateMode : Off


  • minAllowed is an optional parameter that specifies the minimum CPU request and memory request allowed for the container. 
  • maxAllowed is an optional parameter that specifies the maximum CPU request and memory request allowed for the container.

B. Check the Pod’s Resource Utilization

Check the resource utilization of the pods. Below, you can see only ~50 Mi memory is being used out of 1000Mi and only ~30m CPU out of 1000m. This clearly indicates that the pod resources are underutilized.


If you describe the VPA resource, you can see the Recommendations provided. (It may take some time to show them.)


C. Understand the VPA recommendations

Target: The recommended CPU request and memory request for the container that will be applied to the pod by VPA.

Uncapped Target: The recommended CPU request and memory request for the container if you didn’t configure upper/lower limits in the VPA definition. These values will not be applied to the pod. They’re used only as a status indication.

Lower Bound: The minimum recommended CPU request and memory request for the container. There is a --pod-recommendation-min-memory-mb flag that determines the minimum amount of memory the recommender will set—it defaults to 250MiB.

Upper Bound: The maximum recommended CPU request and memory request for the container.  It helps the VPA updater avoid eviction of pods that are close to the recommended target values. Eventually, the Upper Bound is expected to reach close to target recommendation.


D. VPA processing with Update Mode Off/Auto

Now, if you check the logs of vpa-updater, you can see it's not processing VPA objects as the Update Mode is set as Off.


VPA allows various Update Modes, detailed here.

Let's change the VPA updateMode to “Auto” to see the processing.

As soon as you do that, you can see vpa-updater has started processing objects, and it's terminating all 3 pods.


You can also check the logs of vpa-admission-controller:


NOTE: Ensure that you have more than 1 running replicas. Otherwise, the pods won’t be restarted, and vpa-updater will give you this warning:


Now, describe the new pods created and check that the resources match the Target recommendations:


The Target Recommendation can not go below the minAllowed defined in the VPA spec.

 Prometheus: Memory Usage Ratio
Fig:- Prometheus: Memory Usage Ratio

E. Stress Loading Pods

Let’s recreate the deployment with memory request and limit set to 2000Mi and "--vm-bytes", "500M".

Gradually stress load one of these pods to increase its memory utilization.
You can login to the pod and run stress --vm 1 --vm-bytes 1400M --timeout 120000s.


Prometheus Memory Utilized by each Replica
Fig:- Prometheus memory utilized by each Replica

You will notice that the VPA recommendation is also calculated accordingly and applied to all replicas.


Limits v/s Request
VPA always works with the requests defined for a container and not the limits. So, the VPA recommendations are also applied to the container requests, and it maintains a limit to request ratio specified for all containers.

For example, if the initial container configuration defines a 100m Memory Request and 300m Memory Limit, then when the VPA target recommendation is 150m Memory, the container Memory Request will be updated to 150m and Memory Limit to 450m.

Selective Container Scaling

If you have a pod with multiple containers and you want to opt-out some of them, you can use the "Off" mode to turn off recommendations for a container.

You can also set containerName: "*" to include all containers.



Both the Horizontal Pod Autoscaler and the Vertical Pod Autoscaler serve different purposes and one can be more useful than the other depending on your application’s requirement.

The HPA can be useful when, for example, your application is serving a large number of lightweight (low resource-consuming) requests. In that case, scaling number of replicas can distribute the workload on each of the pod. The VPA, on the other hand, can be useful when your application serves heavyweight requests, which requires higher resources.

Related Articles:

1. A Practical Guide to Deploying Multi-tier Applications on Google Container Engine (GKE)

2. Know Everything About Spinnaker & How to Deploy Using Kubernetes Engine

Did you like the blog? If yes, we're sure you'll also like to work with the people who write them - our best-in-class engineering team.

We're looking for talented developers who are passionate about new emerging technologies. If that's you, get in touch with us.

Explore current openings