For pods in a QoS class other than `Guaranteed`, the Memory Manager provides default topology hints
to the Topology Manager.

The following excerpts from pod manifests assign a pod to the `Guaranteed` QoS class.

A pod with integer CPU(s) runs in the `Guaranteed` QoS class when `requests` are equal to `limits`:

```yaml
spec:
  containers:
  - name: nginx
    image: nginx
    resources:
      limits:
        memory: "200Mi"
        cpu: "2"
        example.com/device: "1"
      requests:
        memory: "200Mi"
        cpu: "2"
        example.com/device: "1"
```

Also, a pod sharing CPU(s) runs in the `Guaranteed` QoS class, when `requests` are equal to `limits`.

```yaml
spec:
  containers:
  - name: nginx
    image: nginx
    resources:
      limits:
        memory: "200Mi"
        cpu: "300m"
        example.com/device: "1"
      requests:
        memory: "200Mi"
        cpu: "300m"
        example.com/device: "1"
```

Notice that both CPU and memory requests must be specified, and must equal their limits, for a pod to be placed in the `Guaranteed` QoS class.
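The assigned QoS class can be verified after the pod is created. The `kubectl` one-liner below uses standard JSONPath output; the `jq` variant is a self-contained sketch that reads the same field from a saved status excerpt (the pod name, file path, and sample JSON are illustrative, and `jq` is assumed to be installed):

```shell
# On a live cluster (pod name is illustrative):
#   kubectl get pod nginx -o jsonpath='{.status.qosClass}'
# Self-contained variant: read the same field from a saved status excerpt.
cat > /tmp/pod-status.sample.json <<'EOF'
{"status":{"qosClass":"Guaranteed"}}
EOF
jq -r '.status.qosClass' /tmp/pod-status.sample.json
```

If the value printed is anything other than `Guaranteed`, re-check that every container specifies both CPU and memory, with `requests` equal to `limits`.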

## Troubleshooting

The following means can be used to troubleshoot why a pod could not be deployed or
was rejected on a node:

- pod status - indicates topology affinity errors
- system logs - include valuable information for debugging, e.g., about generated hints
- state file - the dump of internal state of the Memory Manager
  (includes [Node Map and Memory Maps][2])
- starting from v1.22, the [device plugin resource API](#device-plugin-resource-api) can be used
  to retrieve information about the memory reserved for containers

### Pod status (TopologyAffinityError) {#TopologyAffinityError}

This error typically occurs in the following situations:

* a node does not have enough resources available to satisfy the pod's request
* the pod's request is rejected due to particular Topology Manager policy constraints

The error appears in the status of a pod:

```shell
kubectl get pods
```

```none
NAME         READY   STATUS                  RESTARTS   AGE
guaranteed   0/1     TopologyAffinityError   0          113s
```

Use `kubectl describe pod <id>` or `kubectl get events` to obtain a detailed error message:

```none
Warning  TopologyAffinityError  10m   kubelet, dell8  Resources cannot be allocated with Topology locality
```

### System logs

Search the system logs for entries related to the particular pod.

The set of hints that the Memory Manager generated for the pod can be found in the logs,
along with the set of hints generated by the CPU Manager.

Topology Manager merges these hints to calculate a single best hint.
The best hint should be also present in the logs.

The best hint indicates where to allocate all the resources.
Topology Manager tests this hint against its current policy, and based on the verdict,
it either admits the pod to the node or rejects it.

Also, search the logs for occurrences associated with the Memory Manager,
e.g. to find out information about `cgroups` and `cpuset.mems` updates.
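As a sketch, assuming the kubelet runs under systemd, its logs can be dumped with `journalctl` and then filtered for hint-related entries. The sample log lines below are illustrative stand-ins, not verbatim kubelet output:

```shell
# On a node with systemd (assumption):
#   journalctl -u kubelet -o cat > /tmp/kubelet.log
# Illustrative stand-in for a kubelet log excerpt:
cat > /tmp/kubelet.log.sample <<'EOF'
memorymanager.go: TopologyHints for pod "guaranteed", container "guaranteed"
cpumanager.go: TopologyHints generated
topologymanager.go: best TopologyHint selected
EOF
# Filter for entries about generated and merged hints:
grep -i 'topologyhint' /tmp/kubelet.log.sample
```

The same filtering approach works for `cpuset.mems` and `cgroups` updates; adjust the `grep` pattern accordingly.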

### Examine the memory manager state on a node

Let us first deploy a sample `Guaranteed` pod whose specification is as follows:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: guaranteed
spec:
  containers:
  - name: guaranteed
    image: consumer
    imagePullPolicy: Never
    resources:
      limits:
        cpu: "2"
        memory: 150Gi
      requests:
        cpu: "2"
        memory: 150Gi
    command: ["sleep","infinity"]
```

Next, let us log into the node where it was deployed and examine the state file in
`/var/lib/kubelet/memory_manager_state`:

```json
{
   "policyName":"Static",
   "machineState":{
      "0":{
         "numberOfAssignments":1,
