Some resources only appear on certain graphs, depending on which operations were actually running at the time (e.g., no namespace was PUT during that run, so namespaces don’t show up there).
As you can see in the results, we are ahead of target for our 100-node cluster: even in a fully-packed cluster, 99th-percentile pod startup time is 14% faster than our 5-second goal. It’s interesting to point out that LISTing pods is significantly slower than any other operation. This makes sense: a full cluster contains 3,000 pods, and each pod is roughly a few kilobytes of data, so every LIST has to process megabytes of data (e.g., 3,000 pods at ~3 KB each is roughly 9 MB).
##### Work done and some future plans
The initial performance work to make 100-node clusters stable enough to run any tests on them involved a lot of small fixes and tuning, including increasing the limit for file descriptors in the apiserver and reusing TCP connections between different requests to etcd (the sketch below shows the connection-reuse idea).
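A minimal sketch of the connection-reuse half of that fix, assuming a plain HTTP client talking to etcd’s v2 keys API on its default local endpoint (the URL and tuning values here are illustrative, not the actual Kubernetes settings): share one `http.Client`, and therefore one pool of keep-alive TCP connections, across requests instead of opening a fresh connection each time.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

// One shared client means one pool of reusable keep-alive connections.
// (Endpoint and tuning values below are illustrative, not Kubernetes defaults.)
var etcdClient = &http.Client{
	Transport: &http.Transport{
		MaxIdleConnsPerHost: 100,
		IdleConnTimeout:     90 * time.Second,
	},
}

func getKey(key string) error {
	resp, err := etcdClient.Get("http://127.0.0.1:2379/v2/keys" + key)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	// Drain the body; an undrained connection cannot go back into the pool.
	_, err = io.Copy(io.Discard, resp.Body)
	return err
}

func main() {
	// Sequential requests reuse the same TCP connection instead of
	// paying a new TCP handshake every time.
	for i := 0; i < 3; i++ {
		if err := getKey("/registry/pods"); err != nil {
			fmt.Println(err)
		}
	}
}
```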
However, building a stable performance test was just step one toward increasing the number of nodes our cluster supports by tenfold. As a result of this work, we have already put significant effort into removing future bottlenecks, including:
* Rewriting controllers to be watch-based: Previously they re-listed objects of a given type every few seconds, which generated a huge load on the apiserver. The watch-based pattern is sketched after this list.
* Using code generators to produce conversion and deep-copy functions: Although the default implementation using Go reflection is very convenient, it proved to be extremely slow, as much as 10X slower than the generated code (compare the two approaches in the second sketch below).
* Adding a cache to the apiserver to avoid deserializing the same data read from etcd multiple times.
* Reducing the frequency of status updates: Given the slowly changing nature of statuses, it makes sense to update pod status only on change and node status only every 10 seconds.
* Implementing watch at the apiserver instead of redirecting requests to etcd: We would prefer to avoid watching the same data in etcd multiple times, since in many cases it was filtered out in the apiserver anyway. The third sketch below combines this fan-out with the cache mentioned above.
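To make the watch-based controller pattern concrete, here is a minimal sketch using client-go’s shared informers. This is an assumption-laden illustration: client-go did not exist in this form at the time, and the kubeconfig path is whatever `clientcmd.RecommendedHomeFile` resolves to. The informer LISTs pods once and then keeps a local cache in sync via WATCH, instead of re-listing every few seconds:

```go
package main

import (
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from the local kubeconfig.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset := kubernetes.NewForConfigOrDie(config)

	// A shared informer LISTs once, then keeps its cache in sync via WATCH.
	// The 30-minute resync period is an arbitrary choice for this sketch.
	factory := informers.NewSharedInformerFactory(clientset, 30*time.Minute)
	podInformer := factory.Core().V1().Pods().Informer()

	podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			pod := obj.(*corev1.Pod)
			fmt.Printf("pod added: %s/%s\n", pod.Namespace, pod.Name)
		},
	})

	stop := make(chan struct{})
	factory.Start(stop)
	cache.WaitForCacheSync(stop, podInformer.HasSynced)
	select {} // react to watch events until killed
}
```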
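The reflection-versus-generated trade-off is easiest to see side by side. The second sketch below is a toy, not Kubernetes code: the `Pod` struct and both copy functions are hypothetical stand-ins. The generated-style `DeepCopy` is straight-line code, while the reflective version walks types at runtime, which is where the slowdown comes from:

```go
package main

import (
	"fmt"
	"reflect"
)

// Pod is a toy stand-in for a real API object.
type Pod struct {
	Name   string
	Labels map[string]string
}

// deepCopyReflect copies a value generically by walking its type with
// reflection, in the spirit of the old reflection-based implementation.
func deepCopyReflect(in interface{}) interface{} {
	src := reflect.ValueOf(in)
	dst := reflect.New(src.Type()).Elem()
	copyValue(dst, src)
	return dst.Interface()
}

func copyValue(dst, src reflect.Value) {
	switch src.Kind() {
	case reflect.Struct:
		for i := 0; i < src.NumField(); i++ {
			copyValue(dst.Field(i), src.Field(i))
		}
	case reflect.Map:
		if src.IsNil() {
			return
		}
		dst.Set(reflect.MakeMapWithSize(src.Type(), src.Len()))
		for _, k := range src.MapKeys() {
			dst.SetMapIndex(k, src.MapIndex(k))
		}
	default: // strings, ints, bools, ...
		dst.Set(src)
	}
}

// DeepCopy is the shape of a generated function: straight-line code with
// no runtime type inspection, which is what makes it so much faster.
func (p *Pod) DeepCopy() *Pod {
	out := &Pod{Name: p.Name}
	if p.Labels != nil {
		out.Labels = make(map[string]string, len(p.Labels))
		for k, v := range p.Labels {
			out.Labels[k] = v
		}
	}
	return out
}

func main() {
	p := &Pod{Name: "nginx", Labels: map[string]string{"app": "web"}}
	fmt.Println(deepCopyReflect(*p).(Pod).Labels) // reflective path
	fmt.Println(p.DeepCopy().Labels)              // generated-style path
}
```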
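Finally, the cache and watch points can be sketched together: the apiserver watches etcd once, keeps the already-deserialized events, and fans them out to many watchers, doing the namespace filtering itself. The `Event` and `WatchCache` types below are hypothetical and heavily simplified relative to a real apiserver:

```go
package main

import (
	"fmt"
	"sync"
)

// Event is a simplified watch event (hypothetical type for illustration).
type Event struct {
	Namespace string
	Object    string
}

// WatchCache fans a single upstream (etcd) watch out to many downstream
// watchers, so the same data is deserialized and watched only once.
type WatchCache struct {
	mu       sync.Mutex
	watchers []chan Event
}

// Watch registers a downstream watcher; filtering happens here,
// in the apiserver, rather than in etcd.
func (c *WatchCache) Watch(namespace string) <-chan Event {
	ch := make(chan Event, 16)
	filtered := make(chan Event, 16)
	c.mu.Lock()
	c.watchers = append(c.watchers, ch)
	c.mu.Unlock()
	go func() {
		for ev := range ch {
			if namespace == "" || ev.Namespace == namespace {
				filtered <- ev
			}
		}
		close(filtered)
	}()
	return filtered
}

// Run consumes the single upstream watch and broadcasts each event.
func (c *WatchCache) Run(upstream <-chan Event) {
	for ev := range upstream {
		c.mu.Lock()
		for _, w := range c.watchers {
			w <- ev // a real implementation must not block on slow watchers
		}
		c.mu.Unlock()
	}
	c.mu.Lock()
	for _, w := range c.watchers {
		close(w)
	}
	c.mu.Unlock()
}

func main() {
	upstream := make(chan Event, 4)
	cache := &WatchCache{}
	defaultNS := cache.Watch("default")
	go cache.Run(upstream)

	upstream <- Event{Namespace: "default", Object: "pod/nginx"}
	upstream <- Event{Namespace: "kube-system", Object: "pod/dns"}
	close(upstream)

	for ev := range defaultNS {
		fmt.Println("got:", ev.Object) // only the "default" namespace event
	}
}
```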
Looking further out to our 1000-node cluster goal, proposed improvements include:
* Moving events out of etcd: They are more like system logs, and are neither part of the system state nor crucial for Kubernetes to work correctly.
* Using better JSON parsers: The default parser implemented in Go is very slow, as it is based on reflection (one possible alternative is sketched after this list).
* Rewriting the scheduler to make it more efficient and concurrent.
* Improving the efficiency of communication between the apiserver and Kubelets: In particular, we plan to reduce the size of the data sent on every node status update (see the PATCH sketch after this list).
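As one illustration of the parser point, a drop-in replacement such as jsoniter (named here purely as an example of a faster, less reflection-bound parser, not as a committed choice) exposes the same `Unmarshal` interface as `encoding/json`; the `NodeStatus` struct is hypothetical:

```go
package main

import (
	"encoding/json"
	"fmt"

	jsoniter "github.com/json-iterator/go"
)

// NodeStatus is a hypothetical struct standing in for a real API object.
type NodeStatus struct {
	Name     string            `json:"name"`
	Capacity map[string]string `json:"capacity"`
}

func main() {
	data := []byte(`{"name":"node-1","capacity":{"cpu":"4","memory":"16Gi"}}`)

	// Standard library: driven by reflection on every call.
	var s1 NodeStatus
	if err := json.Unmarshal(data, &s1); err != nil {
		panic(err)
	}

	// jsoniter: same interface, but it caches per-type decoders
	// instead of re-inspecting the type on every document.
	var s2 NodeStatus
	if err := jsoniter.ConfigCompatibleWithStandardLibrary.Unmarshal(data, &s2); err != nil {
		panic(err)
	}

	fmt.Println(s1.Name, s2.Capacity["cpu"])
}
```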
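And for the node-status point, one way to shrink each update is to PATCH only the fields that changed instead of PUTting the whole Node object. The sketch below uses the modern client-go Patch call (the node name and annotation are invented for illustration):

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset := kubernetes.NewForConfigOrDie(config)

	// Send only the fields that changed, not the whole Node object.
	// The node name and annotation are invented for this sketch.
	patch := []byte(`{"metadata":{"annotations":{"example.com/heartbeat":"2015-09-10T12:00:00Z"}}}`)
	node, err := clientset.CoreV1().Nodes().Patch(
		context.TODO(), "node-1", types.StrategicMergePatchType, patch, metav1.PatchOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Println("patched node:", node.Name)
}
```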
This is by no means an exhaustive list. We will add new items (or remove existing ones) based on the bottlenecks we observe while running our existing scalability tests and newly created ones. If there are particular use cases or scenarios that you’d like to see us address, please join in!
- We have weekly meetings for our Kubernetes Scale Special Interest Group on Thursdays at 11am PST, where we discuss ongoing issues and plans for performance tracking and improvements.
- If you have specific performance or scalability questions before then, please join our scalability special interest group on Slack: https://kubernetes.slack.com/messages/sig-scale
- General questions? Feel free to join our Kubernetes community on Slack: https://kubernetes.slack.com/messages/kubernetes-users/
- Submit a pull request or file an issue! You can do this in our GitHub repository. Everyone is also enthusiastically encouraged to contribute their own experiments (and their results) or improvements to Kubernetes.