Apache Spark 2.3 with Native Kubernetes Support

---
title: "Apache Spark 2.3 with Native Kubernetes Support"
date: 2018-03-06
slug: apache-spark-23-with-native-kubernetes
url: /blog/2018/03/Apache-Spark-23-With-Native-Kubernetes
author: >
  Anirudh Ramanathan (Google),
  Palak Bhatia (Google)
---

### Kubernetes and Big Data

The open source community has been working over the past year to enable first-class support for data processing, data analytics and machine learning workloads in Kubernetes. New extensibility features in Kubernetes, such as [custom resources][1] and [custom controllers][2], can be used to create deep integrations with individual applications and frameworks.

Traditionally, data processing workloads have been run in dedicated setups like the YARN/Hadoop stack. However, unifying the control plane for all workloads on Kubernetes simplifies cluster management and can improve resource utilization.

"Bloomberg has invested heavily in machine learning and NLP to give our clients a competitive edge when it comes to the news and financial information that powers their investment decisions. By building our Data Science Platform on top of Kubernetes, we're making state-of-the-art data science tools like Spark, TensorFlow, and our sizable GPU footprint accessible to the company's 5,000+ software engineers in a consistent, easy-to-use way." - Steven Bower, Team Lead, Search and Data Science Infrastructure at Bloomberg

### Introducing Apache Spark + Kubernetes

[Apache Spark 2.3][3] with native Kubernetes support combines the best of two prominent open source projects: Apache Spark, a framework for large-scale data processing, and Kubernetes.

Apache Spark is an essential tool for data scientists, offering a robust platform for a variety of applications ranging from large-scale data transformation to analytics to machine learning. Data scientists are adopting containers en masse to improve their workflows by realizing benefits such as packaging of dependencies and creating reproducible artifacts. Given that Kubernetes is the de facto standard for managing containerized environments, it is a natural fit to have support for Kubernetes APIs within Spark.

Starting with Spark 2.3, users can run Spark workloads in an existing Kubernetes 1.7+ cluster and take advantage of Apache Spark's ability to manage distributed data processing tasks. Apache Spark workloads can make direct use of Kubernetes clusters for multi-tenancy and sharing through [Namespaces][4] and [Quotas][5], as well as administrative features such as [Pluggable Authorization][6] and [Logging][7]. Best of all, it requires no changes or new installations on your Kubernetes cluster; simply [create a container image][8] and set up the right [RBAC roles][9] for your Spark Application and you're all set.

Concretely, a native Spark Application in Kubernetes acts as a [custom controller][2], which creates Kubernetes resources in response to requests made by the Spark scheduler. In contrast with [deploying Apache Spark in Standalone Mode][10] in Kubernetes, the native approach offers fine-grained management of Spark Applications, improved elasticity, and seamless integration with logging and monitoring solutions.

The community is also exploring advanced use cases such as managing streaming workloads and leveraging service meshes like [Istio][11].

![](https://1.bp.blogspot.com/-hl4pnOqiH4M/Wp4w9QmzghI/AAAAAAAAAL4/jcWoDOKEp3Y6lCzGxzTOlbvl2Mq1-2YeQCK4BGAYYCw/s1600/Screen%2BShot%2B2018-03-05%2Bat%2B10.10.14%2BPM.png)
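To make the "container image plus RBAC roles" workflow more concrete, here is a minimal Python sketch that assembles a `spark-submit` invocation targeting a Kubernetes cluster. The flags and `spark.kubernetes.*` properties are the ones documented for Spark 2.3. The helper `build_spark_submit`, the API server address, the image name, the `spark` service account, and the jar path are all placeholders assumed for illustration; they do not come from this post. The sketch only builds and prints the command; it does not run it.

```python
import shlex


def build_spark_submit(api_server, image, app_jar, *, namespace="default",
                       service_account="spark", executors=2,
                       main_class="org.apache.spark.examples.SparkPi",
                       name="spark-pi"):
    """Return a spark-submit argument list for Kubernetes cluster mode.

    Hypothetical helper for illustration; every argument value is supplied
    by the caller and nothing here contacts a cluster.
    """
    # Configuration properties documented for Spark 2.3 on Kubernetes.
    conf = {
        "spark.executor.instances": str(executors),
        "spark.kubernetes.namespace": namespace,
        "spark.kubernetes.container.image": image,
        # Service account the driver pod uses to create executor pods;
        # it needs suitable RBAC permissions in the target namespace.
        "spark.kubernetes.authenticate.driver.serviceAccountName": service_account,
    }
    cmd = [
        "spark-submit",
        "--master", f"k8s://{api_server}",  # the k8s:// prefix selects the Kubernetes backend
        "--deploy-mode", "cluster",
        "--name", name,
        "--class", main_class,
    ]
    for key, value in conf.items():
        cmd += ["--conf", f"{key}={value}"]
    # A local:// URI refers to a file already present inside the container image.
    cmd.append(app_jar)
    return cmd


if __name__ == "__main__":
    # All values below are placeholders (assumptions), not real endpoints.
    command = build_spark_submit(
        "https://203.0.113.10:6443",
        "registry.example.com/spark:2.3.0",
        "local:///opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar",
        executors=3,
    )
    print(shlex.join(command))
```

Building the command as a list rather than a single string keeps each argument separate, so it can be passed to `subprocess.run` without shell quoting issues once the placeholder values are replaced with real ones.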

This blog post introduces Apache Spark 2.3, which now includes native Kubernetes support. This integration allows data scientists to run Spark workloads in existing Kubernetes clusters, simplifying cluster management and improving resource utilization. By leveraging Kubernetes' features like Namespaces and Pluggable Authorization, Spark applications can be managed with improved elasticity and seamless integration with logging and monitoring solutions.