The Beta release of PMK currently uses [flannel with the UDP back-end](https://github.com/coreos/flannel) for the network layer. In a Kubernetes cluster, many infrastructure services need to communicate across nodes using a variety of ports (443, 4001, etc.) and protocols (TCP and UDP). Often, customer nodes intentionally or unintentionally block some or all of that traffic, or run existing services that conflict with the required ports, resulting in non-obvious failures. To address this, we try to detect configuration problems early and inform the administrator immediately. PMK runs a “preflight” check on all nodes participating in a cluster before installing the Kubernetes software. This means running small test programs on each node to verify that (1) the required ports are available for binding and listening; and (2) nodes can connect to each other using all required ports and protocols. These checks run in parallel and add less than a couple of seconds to cluster initialization.
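To make the two preflight checks concrete, here is a minimal Python sketch. It is not PMK's actual test program: the function names (`can_bind`, `can_connect`, `preflight`), the one-second timeout and the worker count are all assumptions made for illustration. It also covers only TCP for the node-to-node check, and assumes a test listener is already running on each peer; a UDP reachability test would need a small echo responder on the other side.

```python
import socket
from concurrent.futures import ThreadPoolExecutor


def can_bind(port, proto="tcp", host="0.0.0.0"):
    """Check (1): is the port free for binding (and listening, for TCP)?"""
    kind = socket.SOCK_STREAM if proto == "tcp" else socket.SOCK_DGRAM
    with socket.socket(socket.AF_INET, kind) as s:
        try:
            s.bind((host, port))
            if proto == "tcp":
                s.listen(1)
            return True
        except OSError:
            # Port in use, or we lack permission to bind it.
            return False


def can_connect(peer, port, timeout=1.0):
    """Check (2), TCP only: can this node open a connection to peer:port?"""
    try:
        with socket.create_connection((peer, port), timeout=timeout):
            return True
    except OSError:
        # Refused, filtered by a firewall, or timed out.
        return False


def preflight(ports, peers, host="0.0.0.0"):
    """Run every check in parallel and return the names of the failed ones.

    ports: list of (port, proto) pairs, e.g. [(443, "tcp"), (4001, "tcp")]
    peers: addresses of the other nodes (hypothetical input format)
    """
    checks = []
    for port, proto in ports:
        checks.append((f"bind {proto}/{port}", can_bind, (port, proto, host)))
    for peer in peers:
        for port, proto in ports:
            if proto == "tcp":
                checks.append((f"connect {peer}:{port}", can_connect, (peer, port)))
    with ThreadPoolExecutor(max_workers=16) as pool:
        futures = [(name, pool.submit(fn, *args)) for name, fn, args in checks]
        return [name for name, fut in futures if not fut.result()]
```

Calling `preflight([(443, "tcp"), (4001, "tcp")], peers)` on a node returns an empty list when everything passes; any entry in the result names the port or peer that the administrator should look at before installation continues.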
**Monitoring**
One of the values of a SaaS-managed private cloud is constant monitoring and early detection of problems by the SaaS team. Issues that can be addressed without intervention by the customer are handled automatically, while others trigger proactive communication with the customer via UI alerts, email, or real-time channels. Kubernetes monitoring is a huge topic worthy of its own blog post, so we’ll touch on it only briefly. We broadly classify the problem into layers: (1) hardware & OS, (2) Kubernetes core (e.g. API server, controllers and kubelets), (3) add-ons (e.g. [SkyDNS](https://github.com/skynetservices/skydns) & [ServiceLoadbalancer](https://github.com/kubernetes/contrib/tree/master/service-loadbalancer)) and (4) applications. We are currently focused on layers 1 through 3. A major source of the issues we’ve seen is add-on failures. If either DNS or the ServiceLoadbalancer reverse HTTP proxy (soon to be upgraded to an [Ingress Controller](https://github.com/kubernetes/contrib/tree/master/ingress/controllers)) fails, application services will start failing. One way we detect such failures is by monitoring the components using the Kubernetes API itself, which is proxied into the SaaS controller, allowing the PMK support team to monitor any cluster resource. To detect service failure, one metric we pay attention to is _pod restarts_. A high restart count indicates that a service is continually failing.
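As an illustration of the pod-restart metric, the sketch below sums the `restartCount` of every container in each pod and flags pods above a threshold. It works on a pod list already parsed from JSON, in the shape the Kubernetes API returns for a pod listing (`items[].status.containerStatuses[].restartCount`). The function names and the threshold of 5 are assumptions for this example, not PMK's actual monitoring logic.

```python
def restart_counts(pod_list):
    """Map "namespace/name" to the total restart count across its containers."""
    counts = {}
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        key = f'{meta.get("namespace", "default")}/{meta.get("name", "?")}'
        # containerStatuses can be absent while a pod is still pending.
        statuses = pod.get("status", {}).get("containerStatuses") or []
        counts[key] = sum(cs.get("restartCount", 0) for cs in statuses)
    return counts


def flapping_pods(pod_list, threshold=5):
    """Return pods whose total restarts meet or exceed the (assumed) threshold."""
    counts = restart_counts(pod_list)
    return sorted(key for key, n in counts.items() if n >= threshold)
```

A raw count is a coarse signal: a long-lived pod can accumulate restarts slowly without being unhealthy, so a real monitor would more likely compare counts between two polls and alert on the rate of change.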
**Future topics**
We faced complex challenges in other areas that deserve their own posts: (1) _Authentication and authorization with [Keystone](http://docs.openstack.org/developer/keystone/)_, the identity manager used by Platform9 products; (2) _Software upgrades_, i.e. how to make them brief and non-disruptive to applications; and (3) _Integration with customers’ external load balancers_ (in the absence of good automation APIs).
**Conclusion**
[Platform9 Managed Kubernetes](https://platform9.com/products/docker/) uses a SaaS-managed model to try to hide the complexity of deploying, operating and maintaining bare-metal Kubernetes clusters in customers’ data centers. These requirements led to the development of a unique cluster deployment and management architecture, which in turn led to unique technical challenges. This article gave an overview of some of those challenges and how we solved them. For more information on the motivation behind PMK, feel free to read Madhura Maskasky's [blog post](https://platform9.com/blog/containers-as-a-service-kubernetes-docker/).