- Changing the storage backend in a backward-incompatible way, as is the case for the etcd v2 to v3 migration, is a big change, and thus one for which we needed a strong justification. We found this justification in September, when we determined that we would not be able to scale to 5000-node clusters if we continued to use etcd v2 ([kubernetes/32361](https://github.com/kubernetes/kubernetes/issues/32361) contains some discussion about it). In particular, it was the watch implementation in etcd v2 that didn’t scale: in a 5000-node cluster, we need to be able to send at least 500 watch events per second to a single watcher, which wasn’t possible in etcd v2 (a sketch of the etcd v3 watch API is shown after this list).
- Once we had a strong incentive to actually move to etcd v3, we started testing it thoroughly. As you might expect, we found some issues: there were some minor bugs in Kubernetes, and in addition we requested a performance improvement in etcd v3’s watch implementation (watch was the main bottleneck in etcd v2 for us). This led to the etcd 3.0.10 patch release.
- Once those changes had been made, we were convinced that _new_ Kubernetes clusters would work with etcd v3. But the larger challenge of migrating _existing_ clusters remained. For this, we needed to automate the migration process, thoroughly test the underlying CoreOS etcd upgrade tool, and figure out a contingency plan for rolling back from v3 to v2.
In the end, though, we are confident that the migration should work.
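For illustration, here is a minimal sketch of the single-watcher pattern discussed above, using the etcd v3 Go client. This is not the Kubernetes apiserver code; the endpoint, key prefix, and import path are placeholders and may differ depending on your etcd version:

```go
package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// Hypothetical endpoint; adjust for your cluster.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// A single watch over a whole key prefix, similar in spirit to how the
	// apiserver watches all objects of one resource type. In a 5000-node
	// cluster, a watcher like this must be able to keep up with at least
	// 500 events per second.
	ch := cli.Watch(context.Background(), "/registry/pods/", clientv3.WithPrefix())
	for resp := range ch {
		for _, ev := range resp.Events {
			log.Printf("%s %q", ev.Type, ev.Kv.Key)
		}
	}
}
```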
**Switching storage data format to protobuf**
In the Kubernetes 1.3 release, we enabled [protobufs](https://developers.google.com/protocol-buffers/) as the data format for Kubernetes components to communicate with the API server (in addition to maintaining support for JSON). This gave us a huge performance improvement.
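As a rough sketch of what this looks like from the client side (this is not code from the Kubernetes repository, and details may vary across client-go versions), a client built on client-go can request the protobuf wire format by setting the content-type fields on its `rest.Config`:

```go
package main

import (
	"context"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	// Assumes the program runs inside a cluster; out of cluster you would
	// build the config from a kubeconfig file instead.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}

	// Ask the API server for protobuf responses, falling back to JSON for
	// resources that cannot be served as protobuf.
	cfg.AcceptContentTypes = "application/vnd.kubernetes.protobuf,application/json"
	cfg.ContentType = "application/vnd.kubernetes.protobuf"

	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	nodes, err := client.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("listed %d nodes using the protobuf content type", len(nodes.Items))
}
```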
However, we were still using JSON as the format in which data was stored in etcd, even though technically we were ready to change that. The reason for delaying this migration was related to our plans to migrate to etcd v3. You may be wondering why this change depended on the etcd v3 migration: with etcd v2 we couldn’t really store data in a binary format (to work around this, we were additionally base64-encoding the data), whereas with etcd v3 it just works. So, to simplify the transition to etcd v3 and avoid a non-trivial transformation of the data stored in etcd during it, we decided to wait and switch the storage data format to protobuf only after the migration to the etcd v3 storage backend was done.
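The following sketch illustrates that difference; it is an illustrative example rather than Kubernetes code, and the endpoint, key, and payload are made up. With the v2 API a binary protobuf payload would have to be base64-encoded before being written, while the v3 API stores the raw bytes directly:

```go
package main

import (
	"context"
	"encoding/base64"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// Pretend this is a protobuf-serialized Kubernetes object.
	raw := []byte{0x0a, 0x02, 0x08, 0x01}

	// etcd v2 style: the binary payload is base64-encoded before being
	// written, and has to be decoded again on every read.
	encodedForV2 := base64.StdEncoding.EncodeToString(raw)
	log.Printf("v2 would store: %s", encodedForV2)

	// etcd v3: the value is just bytes, so no extra transformation is needed.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	if _, err := cli.Put(context.Background(), "/registry/pods/default/foo", string(raw)); err != nil {
		log.Fatal(err)
	}
}
```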
**Other optimizations**
We made dozens of optimizations throughout the Kubernetes codebase over the last three releases, including: