Error Handling in PKI Processing

| | Figure1: Block Diagram of the PKI flow | If an application needs a new certificate, the application owner explicitly adds a [Custom Resource Definition](/docs/tasks/access-kubernetes-api/extend-api-custom-resource-definitions/) (CRD) to the application’s Kubernetes config [1]. This CRD specifies parameters for the SSL certificate: _name, common name, and others_. A microservice in the control plane watches CRDs and triggers some processing for SSL certificate generation [2]. Once the certificate is ready, the same control plane service sends it to the API server in a Kubernetes [Secret](/docs/concepts/configuration/secret/) [3]. After that, the application containers access their certificates using Kubernetes [Secret VolumeMounts](/docs/concepts/storage/volumes/#secret) [4]. You can see a working demo of this system in our [example application](https://github.com/box/error-reporting-with-kubernetes-events) on GitHub. The rest of this post covers the error scenarios in this “triggered” processing in the control plane. In particular, we are especially concerned with user input errors. Because the SSL certificate parameters come from the application’s config file in a CRD format, what should happen if there is an error in that CRD specification? Even a typo results in a failure of the SSL certificate creation. The error information is available in the control plane even though the root cause is most probably inside the application’s config file. The application owner does not have access to the control plane’s state or logs. Providing the right diagnosis to the application owner so she can fix the mistake becomes a serious productivity problem at scale. Box’s rapid migration to microservices results in several new deployments every week. Numerous first time users, who do not know every detail of the infrastructure, need to succeed in deploying their services and troubleshooting problems easily. As the owners of the infrastructure, we do not want to be the bottleneck while reading the errors from the control plane logs and passing them on to application owners. If something in an owner’s configuration causes an error somewhere else, owners need a fully empowering diagnosis. This error data must flow automatically without any human involvement.

This section discusses how to handle error scenarios during PKI processing in the control plane, focusing on user input errors in the CRD specification. A typo in the CRD can cause SSL certificate creation to fail. The challenge is providing the application owner with the correct diagnosis, as they lack access to the control plane's state or logs. Box needs to empower users to troubleshoot easily, especially with rapid microservice deployments and numerous first-time users.