Run Spark Job on Kubernetes using Spark-on-k8s-Operator

Suchit Gupta
May 5, 2020 · 4 min read


A step-by-step guide to running Spark jobs on Kubernetes using spark-on-k8s-operator.

Versions:

  1. Spark: 2.4.5
  2. Kubernetes: v1.15.5
  3. spark-on-k8s-operator: sparkoperator.k8s.io/v1beta2

Step 1: Install Docker on Mac/Windows

Install from https://docs.docker.com/docker-for-mac/install/

Step 2: Resource allocation

Minimum requirement: allocate 4 CPUs and 8 GB of memory to Docker.

[Screenshot: Docker resource settings for Spark jobs]
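
You can double-check the resources Docker actually picked up from the command line (the exact labels may vary slightly between Docker versions):

docker info | grep -E 'CPUs|Total Memory'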

Step 3: Enable Kubernetes in Docker Desktop.

It might take a few minutes, so be patient.

[Screenshots: enabling Kubernetes, installation in progress, and Kubernetes running]
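
Once Kubernetes shows as running, you can verify that kubectl points at the local cluster and that the node is ready:

kubectl cluster-info
kubectl get nodes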

Step 4: Kubernetes dashboard setup

Follow my other article on setting up the dashboard.

Step 5: Check the installed kubectl version

kubectl version

Step 6: Create a namespace with the name ‘spark-operator’

Create a file named namespaces-spark.yaml with the following content and save the file inside $pwd/files

apiVersion: v1
kind: Namespace
metadata:
  name: spark-operator

Execute the following command:

kubectl create -f files/namespaces-spark.yaml
[Screenshot: Kubernetes dashboard showing the 'spark-operator' namespace]
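
You can also confirm the namespace from the command line:

kubectl get namespace spark-operator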

Step 7: Create a service account named ‘spark’. This account will be used to create and edit pods.

kubectl create serviceaccount spark --namespace=default
[Screenshot: the 'spark' service account in the dashboard]
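
To verify the service account was created:

kubectl get serviceaccount spark -n default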

Step 8: Create an RBAC cluster role binding for the ‘spark’ service account so that it has the permissions to create and edit pods.

kubectl create clusterrolebinding spark-operator-role --clusterrole=cluster-admin --serviceaccount=default:spark --namespace=default
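
This binds the ‘spark’ service account to the built-in cluster-admin role, which is broader than strictly necessary but keeps the demo simple. You can inspect the binding with:

kubectl describe clusterrolebinding spark-operator-role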

Step 9: Install Helm using Homebrew

brew install helm

Check the version

helm version

Step 10: Add helm repo.

A repository is a place where charts can be collected and shared.

helm repo add spark-operator https://googlecloudplatform.github.io/spark-on-k8s-operator
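
After adding the repo, refresh the local chart index and confirm the chart is visible:

helm repo update
helm search repo spark-operator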

Step 11: Install the spark-operator chart

helm install spark-operator/spark-operator --namespace spark-operator --set sparkJobNamespace=default --set webhook.enable=true --generate-name
[Screenshots: helm install output and the Kubernetes dashboard]

Note: In the dashboard, switch the namespace to “spark-operator”.

[Screenshots: the spark-operator pod and deployment in the spark-operator namespace]
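
You can also confirm the Helm release and the operator pod from the command line (the release name is auto-generated because of --generate-name, so yours will differ):

helm list -n spark-operator
kubectl get pods -n spark-operator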

Yay! The infrastructure is ready. Let’s deploy the app.

Spark Application: Now, let’s run a Spark job

Create a file named spark-pi.yaml with the following content and save the file inside $pwd/examples

apiVersion: "sparkoperator.k8s.io/v1beta2"
kind
: SparkApplication
metadata:
name: spark-pi-v1
namespace: default
spec:
type: Scala
mode: cluster
image: "gcr.io/spark-operator/spark:v2.4.5"
imagePullPolicy
: Always
mainClass: org.apache.spark.examples.SparkPi
mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples_2.11-2.4.5.jar"
sparkVersion
: "2.4.5"
restartPolicy
:
type: Never
volumes:
- name: "test-volume"
hostPath
:
path: "/tmp"
type
: Directory
driver:
cores: 1
coreLimit: "1200m"
memory
: "512m"
labels
:
version: 2.4.5
serviceAccount: spark
volumeMounts:
- name: "test-volume"
mountPath
: "/tmp"
executor
:
cores: 1
instances: 1
memory: "512m"
labels
:
version: 2.4.5
volumeMounts:
- name: "test-volume"
mountPath
: "/tmp"

Now execute:

kubectl apply -f examples/spark-pi.yaml
[Screenshots: the Spark driver pod running, then the executor pod coming up and running]
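
You can also follow the application from the command line. The operator typically names the driver pod <application-name>-driver, so for this example it should be spark-pi-v1-driver; once the driver has finished, its logs contain the computed value of Pi:

kubectl get sparkapplications
kubectl describe sparkapplication spark-pi-v1
kubectl logs spark-pi-v1-driver | grep "Pi is roughly"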

Scheduled Spark Application: Now, let’s run a scheduled Spark job

Create a file named spark-schedule-pi.yaml with the following content and save the file inside $pwd/examples

apiVersion: "sparkoperator.k8s.io/v1beta2"
kind
: ScheduledSparkApplication
metadata:
name: spark-pi-scheduled
namespace: default
spec:
schedule: "@every 1m"
concurrencyPolicy
: Allow
template:
type: Scala
mode: cluster
image: "gcr.io/spark-operator/spark:v2.4.5"
imagePullPolicy
: Always
mainClass: org.apache.spark.examples.SparkPi
mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples_2.11-2.4.5.jar"
sparkVersion
: "2.4.5"
restartPolicy
:
type: Never
driver:
cores: 1
coreLimit: "1200m"
memory
: "512m"
labels
:
version: 2.4.5
serviceAccount: spark
executor:
cores: 1
instances: 1
memory: "512m"
labels
:
version: 2.4.5

Now execute:

kubectl apply -f examples/spark-schedule-pi.yaml
The scheduled job is now running.
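
A ScheduledSparkApplication creates a new SparkApplication each time the schedule fires, so runs will keep accumulating every minute. You can watch them, and delete the schedule when you are done:

kubectl get scheduledsparkapplications
kubectl get sparkapplications
kubectl delete scheduledsparkapplication spark-pi-scheduled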

Congratulations! You did it.

In the next article, we will package and deploy our own Spark Scala Code.
