In both scenarios, PaddlePaddle jobs are tolerant to spikes and drops in the number of processes. We achieved this by implementing the new design, which introduces a master process in addition to the old PaddlePaddle architecture described in a [previous blog post](https://kubernetes.io/blog/2017/02/run-deep-learning-with-paddlepaddle-on-kubernetes). In the new design, as long as there are three processes left in a job, the job continues. In the extreme case where all processes are killed, the job can be restored and resumed.
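To make the fault-tolerance rule concrete, here is a minimal Go sketch of the decision this design implies. The `jobStatus` type, the `hasCheckpoint` flag, and the `nextAction` helper are hypothetical names for illustration only; this is not the actual Fluid EDL controller code.

```go
package main

import "fmt"

// jobStatus is a hypothetical summary of a PaddlePaddle job's processes;
// the real controller derives this from Kubernetes pod states.
type jobStatus struct {
	name          string
	liveProcesses int  // master + trainers + parameter servers still running
	hasCheckpoint bool // assumption: the master keeps a recoverable snapshot
}

// minProcesses reflects the rule described above: a job keeps running
// as long as at least three of its processes survive.
const minProcesses = 3

// nextAction sketches the decision implied by the design; it is not the
// Fluid EDL implementation.
func nextAction(j jobStatus) string {
	switch {
	case j.liveProcesses >= minProcesses:
		return "continue training with the remaining processes"
	case j.hasCheckpoint:
		return "restore the job from its snapshot and resume"
	default:
		return "restart the job from scratch"
	}
}

func main() {
	jobs := []jobStatus{
		{name: "job-a", liveProcesses: 42, hasCheckpoint: true},
		{name: "job-b", liveProcesses: 0, hasCheckpoint: true},
	}
	for _, j := range jobs {
		fmt.Printf("%s: %s\n", j.name, nextAction(j))
	}
}
```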
We tested Fluid EDL for two use cases: 1) the Kubernetes cluster runs only PaddlePaddle jobs; and 2) the cluster runs PaddlePaddle and Nginx jobs.
In the first test, we started up to 20 PaddlePaddle jobs one by one with a 10-second interval. Each job has 60 trainers and 10 parameter server processes and lasts for hours. We repeated the experiment 20 times: 10 with Fluid EDL turned off and 10 with Fluid EDL turned on. In Figure 1, solid lines correspond to the first 10 experiments and dotted lines to the rest. In the upper part of the figure, we see that the number of pending jobs increases monotonically without EDL. However, when EDL is turned on, resources are evenly distributed to all jobs: Fluid EDL kills some existing processes to make room for new jobs and for jobs that arrive later. In both cases, the cluster is equally utilized (see the lower part of the figure).
![Figure 1. Fluid EDL evenly distributes resources among jobs.](https://1.bp.blogspot.com/-sp_sVZvhMbU/WiYgXMLQKuI/AAAAAAAAAIM/uc_3iT9BZmAtQGiGGSErgueHK71uWMBCACEwYBhgL/s1600/figure-1.png)

_Figure 1. Fluid EDL evenly distributes resources among jobs._
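The even distribution visible in Figure 1 can be illustrated with a small sketch. The Go snippet below assumes a simple split-the-pool-evenly policy over a hypothetical fixed number of trainer slots; it only mimics the effect shown in the figure and is not the actual Fluid EDL autoscaler, which reacts to live cluster and pod state.

```go
package main

import "fmt"

// evenShares splits a fixed pool of trainer slots evenly across running jobs.
// This illustrates why existing trainers are killed as new jobs arrive: each
// job's share shrinks so that every job keeps making progress.
func evenShares(totalSlots, jobs int) []int {
	shares := make([]int, jobs)
	for i := range shares {
		shares[i] = totalSlots / jobs
	}
	// Hand out the remainder one slot at a time so no capacity is wasted.
	for i := 0; i < totalSlots%jobs; i++ {
		shares[i]++
	}
	return shares
}

func main() {
	const clusterSlots = 600 // hypothetical trainer capacity of the cluster
	for jobs := 1; jobs <= 5; jobs++ {
		fmt.Printf("%d job(s): %v trainers per job\n", jobs, evenShares(clusterSlots, jobs))
	}
}
```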