---
title: " Kubernetes Performance Measurements and Roadmap "
date: 2015-09-10
slug: kubernetes-performance-measurements-and
url: /blog/2015/09/Kubernetes-Performance-Measurements-And
author: >
   Wojciech Tyczynski (Google)
---
No matter how flexible and reliable your container orchestration system is, ultimately you have work to get done, and you want it completed quickly. For big workloads, a common answer is to just throw more machines at the problem. After all, more compute = faster, right?


Interestingly, adding more nodes is a little like the [tyranny of the rocket equation][4] - in some systems, adding more machines can actually make your processing slower. However, unlike the rocket equation, we can do better. Kubernetes v1.0 supports clusters of up to 100 nodes, but our goal is to 10x that number by the end of 2015. This blog post will cover where we are and how we intend to achieve the next level of performance.


##### What do we measure?

The first question we need to answer is: “what does it mean that Kubernetes can manage an N-node cluster?” Users expect that it will handle all operations “reasonably quickly,” but we need a precise definition of that. We decided to define performance and scalability goals based on the following two metrics:

1. *“API-responsiveness”*: 99% of all our API calls return in less than 1 second

2. *“Pod startup time”*: 99% of pods (with pre-pulled images) start within 5 seconds


Note that for “pod startup time” we explicitly assume that all images necessary to run the pod are already pre-pulled on the machine where it will run. In our experiments, image pulls show a high degree of variability (network throughput, image size, etc.), and these variations have little to do with Kubernetes’ overall performance.
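
To make these targets concrete, here is a minimal sketch in Go (not the actual test code) of how a latency percentile can be computed from recorded samples using a simple nearest-rank calculation; the sample values are hypothetical:

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// percentile returns the p-th percentile (0 < p <= 100) of the samples,
// using the nearest-rank method on a sorted copy of the input.
func percentile(samples []time.Duration, p float64) time.Duration {
	sorted := append([]time.Duration(nil), samples...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	idx := int(float64(len(sorted))*p/100.0+0.5) - 1
	if idx < 0 {
		idx = 0
	}
	if idx >= len(sorted) {
		idx = len(sorted) - 1
	}
	return sorted[idx]
}

func main() {
	// Hypothetical pod startup latencies collected during a test run.
	samples := []time.Duration{
		900 * time.Millisecond,
		1300 * time.Millisecond,
		1900 * time.Millisecond,
		2500 * time.Millisecond,
	}
	p99 := percentile(samples, 99)
	fmt.Printf("p99 = %v, SLO met: %v\n", p99, p99 < 5*time.Second)
}
```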


We chose these metrics based on our experience spinning up 2 billion containers a week at Google. We explicitly want to measure the latency of user-facing flows, since that’s what customers actually care about.


##### How do we measure?

To monitor performance improvements and detect regressions we set up a continuous testing infrastructure. Every 2-3 hours we create a 100-node cluster from [HEAD][5] and run our scalability tests on it. We use a GCE n1-standard-4 (4 cores, 15GB of RAM) machine as a master and n1-standard-1 (1 core, 3.75GB of RAM) machines for nodes.


In scalability tests, we explicitly focus only on the full-cluster case (a full N-node cluster is a cluster with 30 * N pods running in it), which is the most demanding scenario from a performance point of view. To reproduce what a customer might actually do, we run through the following steps (a simplified sketch follows the list):

* Populate pods and replication controllers to fill the cluster

* Generate some load (create/delete additional pods and/or replication controllers, scale the existing ones, etc.) and record performance metrics

* Stop all running pods and replication controllers

* Scrape the metrics and check whether they match our expectations
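
A rough sketch of that flow in Go is shown below. The `Cluster` interface and its method names are hypothetical stand-ins for the real Kubernetes client used by the e2e tests:

```go
package scale

import "time"

// Cluster is a hypothetical stand-in for the Kubernetes API client used by
// the real e2e scalability tests; the method names here are illustrative.
type Cluster interface {
	CreateRC(name string, replicas int) error           // create a replication controller
	ResizeRC(name string, replicas int) error           // scale an existing RC
	DeleteRC(name string) error                         // delete an RC and its pods
	WaitForRunningPods(selector string, want int) error // block until `want` pods are Running
}

// RunScalabilityTest walks through the steps described above: fill the
// cluster to 30 pods per node, generate some churn, then tear everything down.
func RunScalabilityTest(c Cluster, nodes int) (time.Duration, error) {
	const podsPerNode = 30
	total := nodes * podsPerNode

	// Step 1: populate the cluster with pods and replication controllers.
	start := time.Now()
	if err := c.CreateRC("density", total); err != nil {
		return 0, err
	}
	if err := c.WaitForRunningPods("name=density", total); err != nil {
		return 0, err
	}
	fillTime := time.Since(start)

	// Step 2: generate some load by scaling the RC up and back down.
	if err := c.ResizeRC("density", total+nodes); err != nil {
		return 0, err
	}
	if err := c.ResizeRC("density", total); err != nil {
		return 0, err
	}

	// Step 3: stop all running pods and replication controllers.
	if err := c.DeleteRC("density"); err != nil {
		return 0, err
	}

	// Step 4 (scraping and checking metrics) happens outside this sketch.
	return fillTime, nil
}
```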


It is worth emphasizing that the main parts of the test are done on full clusters (30 pods per node, 100 nodes, i.e. 3,000 pods) - starting a pod in an empty cluster, even one with 100 nodes, will be much faster.


To measure pod startup latency we use very simple pods with just a single container running the “gcr.io/google_containers/pause:go” image, which starts and then sleeps forever. The container is guaranteed to already be pre-pulled on the nodes (we use it as the so-called pod-infra-container).
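
For illustration, a minimal sketch of such a pod spec built with today’s Go client types (the 2015 tests used the pre-split `pkg/api` types) could look like this; the pod name and labels are arbitrary:

```go
package scale

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// pausePod builds the minimal pod used for startup-latency measurements: a
// single container running the pause image, which starts and sleeps forever.
func pausePod(name string) *corev1.Pod {
	return &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{
			Name:   name,
			Labels: map[string]string{"name": "latency-pod"},
		},
		Spec: corev1.PodSpec{
			Containers: []corev1.Container{{
				Name:  "pause",
				Image: "gcr.io/google_containers/pause:go",
			}},
		},
	}
}
```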


##### Performance data

The following table contains the 50th, 90th, and 99th percentiles of pod startup time in 100-node clusters that are 10%, 25%, 50%, and 100% full.


|                 | 10%-full | 25%-full | 50%-full | 100%-full |
| --------------- | -------- | -------- | -------- | --------- |
| 50th percentile | 0.90s    | 1.08s    | 1.33s    | 1.94s     |
| 90th percentile | 1.29s    | 1.49s    | 1.72s    | 2.50s     |

This blog post from 2015 discusses Kubernetes performance measurements and the roadmap for improvement. It outlines the metrics used to measure performance: API responsiveness (99% of calls under 1 second) and pod startup time (99% under 5 seconds with pre-pulled images). The post details the continuous testing infrastructure used to monitor performance, involving the creation of 100-node clusters and running scalability tests. Performance data is presented, showing pod startup time percentiles in clusters with varying degrees of fullness.