For the last few weeks, I’ve been deploying a Spark cluster on Kubernetes (K8s). I’d like to share the challenges, the architecture, and the solution details I’ve discovered along the way.
At Empathy, all code running in production must be cloud-agnostic. As of this publication date, Empathy has overcome its previous dependency on the Spark solution offered by each cloud provider: EMR (the AWS scenario), Dataproc (the GCP scenario), and HDInsight (the Azure scenario).
These cloud providers’ solutions offer a simple way to deploy Spark in the cloud. However, some limitations arise when a company scales up, leading to several key questions:
These are common questions when trying to execute Spark jobs. Solving them with Kubernetes can save effort and provide a better experience.
Running Apache Spark on K8s offers us the following benefits:
The benefits are the same as those of Empathy’s solution for Apache Flink running on Kubernetes, which I explored in a previous article.
Apache Spark is a unified analytics engine for big data processing, particularly handy for distributed workloads. It is widely used for machine learning and remains one of the most broadly adopted technologies in the big data ecosystem.
Spark Submit can be used to submit a Spark Application directly to a Kubernetes cluster. The flow would be as follows:
Spark Submit Flowchart
You can schedule a Spark Application using Spark Submit (vanilla way) or using Spark Operator.
Spark Submit is a script used to submit a Spark Application and launch the application on the Spark cluster. Some nice features include:
The Spark Operator project was developed by Google and is now open source. It uses Kubernetes custom resources for specifying, running, and surfacing the status of Spark Applications. Some nice features include:
The image above shows the main commands of Spark Submit vs Spark Operator.
Empathy’s solution prefers Spark Operator because it allows for faster iterations than Spark Submit, where you have to create custom Kubernetes manifests for each use case.
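For illustration, here is a minimal sketch of a SparkApplication manifest as consumed by the Spark Operator; the namespace, image, and application jar are hypothetical placeholders rather than Empathy’s actual configuration.

```yaml
# Minimal SparkApplication sketch for the Spark Operator.
# The namespace, image and application jar are hypothetical placeholders.
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: spark-pi
  namespace: spark-jobs
spec:
  type: Scala
  mode: cluster
  image: my-registry/spark:3.1.1          # hypothetical image with Spark 3.1.1
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: local:///opt/spark/examples/jars/spark-examples_2.12-3.1.1.jar
  sparkVersion: "3.1.1"
  restartPolicy:
    type: Never
  driver:
    cores: 1
    memory: 512m
    serviceAccount: spark                 # service account allowed to create executor pods
  executor:
    instances: 2
    cores: 1
    memory: 512m
```

Committing a manifest like this to Git and letting the operator reconcile it is what enables the faster iterations: the operator runs spark-submit under the hood and surfaces the application status back on the custom resource.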
To solve the questions posed in the Challenges section, ArgoCD and Argo Workflows, both CNCF projects, can help. For instance, you can schedule your favorite Spark Application workloads from Kubernetes, using ArgoCD to create Argo Workflows that define sequential jobs.
The flowchart would be as follows:
ArgoCD is a GitOps continuous delivery tool for Kubernetes. The main benefits are:
More detailed information can be found in their official documentation.
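As a sketch of how GitOps ties in, an Argo CD Application similar to the following could keep the Spark and workflow manifests in a cluster in sync with a Git repository; the repository URL, path, and namespaces are hypothetical placeholders.

```yaml
# Minimal Argo CD Application sketch: syncs Spark/workflow manifests from Git.
# Repo URL, path and namespaces are hypothetical placeholders.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: spark-workflows
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/spark-gitops.git   # hypothetical repository
    targetRevision: main
    path: workflows
  destination:
    server: https://kubernetes.default.svc
    namespace: spark-jobs
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```

With automated sync enabled, a change merged to the repository is propagated to the cluster without any manual kubectl step, which is the GitOps behaviour referred to above.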
Argo Workflows is a container-native workflow engine for orchestrating jobs on Kubernetes. The main benefits are:
More detailed information can be found in their official documentation.
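To make the sequential-jobs idea concrete, below is a minimal sketch of an Argo Workflow that creates two SparkApplication resources one after the other; the names, namespace, image, and success/failure conditions are assumptions based on the Spark Operator’s status fields, not Empathy’s production workflow.

```yaml
# Minimal Argo Workflow sketch: runs two Spark applications sequentially by
# creating SparkApplication resources and waiting on their reported status.
# Names, namespace, image and conditions are hypothetical placeholders.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: spark-pipeline-
  namespace: spark-jobs
spec:
  entrypoint: pipeline
  templates:
    - name: pipeline
      steps:                                   # each step group runs after the previous one
        - - name: ingest
            template: spark-app
            arguments:
              parameters:
                - {name: app-name, value: spark-ingest}
        - - name: aggregate
            template: spark-app
            arguments:
              parameters:
                - {name: app-name, value: spark-aggregate}
    - name: spark-app
      inputs:
        parameters:
          - name: app-name
      resource:                                # create the CR and wait for its status
        action: create
        successCondition: status.applicationState.state == COMPLETED
        failureCondition: status.applicationState.state == FAILED
        manifest: |
          apiVersion: sparkoperator.k8s.io/v1beta2
          kind: SparkApplication
          metadata:
            name: "{{inputs.parameters.app-name}}"
            namespace: spark-jobs
          spec:
            type: Scala
            mode: cluster
            image: my-registry/spark:3.1.1                 # hypothetical image
            mainClass: com.example.Job                     # hypothetical class
            mainApplicationFile: local:///opt/spark/jars/job.jar
            sparkVersion: "3.1.1"
            restartPolicy:
              type: Never
            driver:
              cores: 1
              memory: 512m
              serviceAccount: spark
            executor:
              instances: 2
              cores: 1
              memory: 512m
```

The second step only starts once the first SparkApplication reports COMPLETED, which is what gives the sequential behaviour described above.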
Once Prometheus scrapes the metrics, some Grafana dashboards are needed. The custom Grafana dashboards for Apache Spark are based on the following community dashboards:
Empathy chooses Spark Operator, ArgoCD, and Argo Workflows to create a Spark Application workflow solution on Kubernetes and uses GitOps to propagate the changes. The setup illustrated in this article has been used in production for about one month, and the feedback is great. Everyone is happy with having a single workflow that works on any cloud provider, removing the need for the individual cloud-provider solutions.
To test it for yourself, follow these hands-on samples and enjoy deploying some Spark Applications from localhost, with all the setup described in this guide: Hands-on Empathy Repo.
I’ve also drawn upon my presentation for Kubernetes Days Spain 2021.
Though the journey was long, we’ve learned a lot along the way. I hope our innovations will help you become more cloud-agnostic too.
Topics:
spark, kubernetes, cloud agnostic, big data, analysis