One of the benefits of adopting a system like Kubernetes is that it facilitates burstable and scalable workloads. Horizontal application scaling involves adding or removing instances of an application to match demand. The Kubernetes Horizontal Pod Autoscaler enables automated pod scaling based on demand. This is cool; however, it can lead to unpredictable load on the cluster, which may put the cluster into an overcommitted state. Fortunately, if the goal is squeezing every bit of CPU and memory from a cluster, overcommitment may be not only ok but desirable.
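As a quick illustration (not part of this tutorial), attaching an autoscaler to a deployment can be a one-liner; the deployment name web and the thresholds below are hypothetical:

# Hypothetical example: keep between 3 and 10 replicas of the 'web'
# deployment, targeting 50% average CPU utilization.
kubectl autoscale deployment web --cpu-percent=50 --min=3 --max=10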
The following image represents a three-node cluster that runs three applications. Pink is the most critical. Red is burstable and durable; this means that if we need to stop a few instances of red, things will be ok. Blue is non-critical. I have also tried to depict a cluster in a fully maxed-out state: there are no resources available for additional workload.
Imagine now that a scale-out operation is needed on the pink application. This puts the cluster in an overcommitted state, with critical workload requiring scheduling. How can Kubernetes facilitate this critical request in an overcommitted state? One option is to use Pod Priority and Preemption, which allows a priority weight to be added to a scheduling request. In the event of overcommitment, priority is evaluated, and lower-priority workload is evicted (preemption) to make room for the higher-priority workload.
Pod Priority and Preemption tutorial
In this article, we will walk through an end-to-end demonstration of using Pod Priority and Preemption to ensure critical workload has priority access to cluster resources.
In order to complete this tutorial, you need a Kubernetes cluster that consists of three nodes. I’ve included steps for deploying an appropriately sized Azure Kubernetes cluster. If you need an Azure Subscription or would like to read up on additional operational practices for Azure Kubernetes Service, see the following links.
Create an Azure Kubernetes Service Cluster
First things first, ensure you have an appropriately sized Kubernetes cluster for this tutorial (three nodes).
Create a resource group.
az group create --name AKSOperationsDemos --location eastus
Create the cluster. Note that the Azure CLI defaults are suitable for this demo.
az aks create --resource-group AKSOperationsDemos --name AKSOperationsDemos --kubernetes-version 1.11.3
Connect to the cluster as cluster admin.
az aks get-credentials --resource-group AKSOperationsDemos --name AKSOperationsDemos --admin
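Before moving on, it can be worth confirming that all three nodes are up, since the rest of the tutorial depends on the cluster's total CPU capacity:

# Expect three nodes in the Ready state.
kubectl get nodes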
Create a priority class for critical workload
Create an instance of a PriorityClass with a weight of 1000000. This can be used to ensure that high-priority workload is given priority access to cluster resources.
To do so, create a file named pc.yml and copy in the following YAML.
apiVersion: scheduling.k8s.io/v1beta1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000
globalDefault: false
Create the priority class.
kubectl create -f pc.yml
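If you want to verify the result, the new class can be listed alongside any built-in priority classes:

# The high-priority class should appear with a value of 1000000.
kubectl get priorityclass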
Consume all CPU cores
Run some workload to consume all CPU cores in the cluster. In the following example, a deployment consisting of three replicas is started with a CPU request of one core each. This will effectively consume the available CPU resources of the cluster.
Create a file named slam-cpu.yml and copy in the following YAML.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: consume-cpu
spec:
  replicas: 3
  selector:
    matchLabels:
      app: consume-cpu
  template:
    metadata:
      labels:
        app: consume-cpu
    spec:
      containers:
      - name: nepetersv1
        image: neilpeterson/aks-helloworld:v1
        resources:
          requests:
            cpu: 1
            memory: 128Mi
          limits:
            cpu: 1
            memory: 128Mi
Run the deployment.
kubectl create -f slam-cpu.yml
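One way to confirm that the cluster's CPU is now fully reserved is to check the allocated resources section of each node description; the grep below is just a convenience to trim the output:

# CPU requests should be at or near 100% on each node.
kubectl describe nodes | grep -A 5 "Allocated resources"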
Start low-priority workload
Now start another pod without specifying a priority class.
Create a file named pod-no-priority.yml and copy in the following YAML.
apiVersion: v1
kind: Pod
metadata:
  name: pod-no-priority
spec:
  containers:
  - name: pod-no-priority
    image: neilpeterson/aks-helloworld:v1
    resources:
      requests:
        cpu: 1
        memory: 128Mi
      limits:
        cpu: 1
        memory: 256Mi
Run the pod.
kubectl create -f pod-no-priority.yml
At this point, you will find that the new pod cannot be scheduled due to a lack of CPU resources. To see this, list the pods on the cluster and note that pod-no-priority is in a Pending state.
kubectl get pods

NAME                           READY     STATUS              RESTARTS   AGE
consume-cpu-6c8d576684-gf5sk   0/1       ContainerCreating   0          52s
consume-cpu-6c8d576684-mtvmn   0/1       ContainerCreating   0          52s
consume-cpu-6c8d576684-pnkff   0/1       ContainerCreating   0          52s
pod-no-priority                0/1       Pending             0          10s
Return a list of events for the pod to see the actual issue.
kubectl describe pod pod-no-priority
Parsing the output, you should see that the pod cannot be scheduled due to insufficient CPU.
Events:
  Type     Reason            Age                From               Message
  ----     ------            ----               ----               -------
  Warning  FailedScheduling  1s (x18 over 53s)  default-scheduler  0/3 nodes are available: 3 Insufficient cpu.
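If you prefer not to scan the full describe output, the same events can be listed directly; this assumes your kubectl version supports field selectors on events:

# List only the events attached to the pending pod.
kubectl get events --field-selector involvedObject.name=pod-no-priority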
Run high-priority workload
Finally, run one more pod; this time, assign the high-priority class to it.
Create a file named pod-priority.yml and copy in the following YAML. Note that the pod spec includes the priority class created in a previous step.
apiVersion: v1
kind: Pod
metadata:
  name: pod-with-priority
spec:
  containers:
  - name: pod-with-priority
    image: neilpeterson/aks-helloworld:v1
    resources:
      requests:
        cpu: 1
        memory: 128Mi
      limits:
        cpu: 1
        memory: 256Mi
  priorityClassName: high-priority
Run the pod.
kubectl create -f pod-priority.yml
Now return a list of pods. If you do so quickly, you may be able to catch one of the lower-priority pods being terminated.
kubectl get pods

NAME                           READY     STATUS        RESTARTS   AGE
consume-cpu-6c8d576684-gf5sk   1/1       Running       0          7m
consume-cpu-6c8d576684-mtvmn   1/1       Running       0          7m
consume-cpu-6c8d576684-p7tqx   0/1       Pending       0          3s
consume-cpu-6c8d576684-pnkff   1/1       Terminating   0          7m
pod-no-priority                0/1       Pending       0          6m
pod-with-priority              0/1       Pending       0          3s
Once the lower-priority pod has been terminated, the pod with priority is started in its place.
kubectl get pods

NAME                           READY     STATUS    RESTARTS   AGE
consume-cpu-6c8d576684-gf5sk   1/1       Running   0          8m
consume-cpu-6c8d576684-mtvmn   1/1       Running   0          8m
consume-cpu-6c8d576684-p7tqx   0/1       Pending   0          1m
pod-no-priority                0/1       Pending   0          8m
pod-with-priority              1/1       Running   0          1m
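As a final check, the resolved numeric priority is recorded on the pod spec itself by the admission controller and can be read back with a JSONPath query:

# Expect 1000000, the value of the high-priority class.
kubectl get pod pod-with-priority -o jsonpath='{.spec.priority}'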
Very cool indeed. Feel free to contact me on Twitter (@nepeters) or comment below for discussion on the topic.