Run Spark Job on Kubernetes using Spark-on-k8s-Operator

Suchit Gupta
May 5, 2020 · 4 min read


A step-by-step guide to running Spark jobs on Kubernetes using spark-on-k8s-operator.

Versions:

  1. Spark: 2.4.5
  2. Kubernetes: v1.15.5
  3. spark-on-k8s-operator: sparkoperator.k8s.io/v1beta2

Step 1: Install Docker on Mac/Windows

Install from https://docs.docker.com/docker-for-mac/install/

Step 2: Resource allocation

Minimum requirement: allocate 4 CPUs and 8 GB of memory to Docker.

[Screenshot: Docker resource settings for Spark jobs]
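
You can double-check the resources Docker actually picked up from the command line (the exact labels may vary slightly between Docker versions):

docker info | grep -E 'CPUs|Total Memory'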

Step 3: Enable Kubernetes in Docker Desktop.

It might take a few minutes, so be patient.

[Screenshots: enabling Kubernetes, installation in progress, and Kubernetes running]
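
Once Kubernetes shows as running, you can verify that kubectl points at the local cluster and that the node is ready:

kubectl cluster-info
kubectl get nodes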

Step 4: Kubernetes dashboard setup

Follow my other article on setting up the dashboard.

Step 5: Check the installed kubectl version

kubectl version

Step 6: Create a namespace with the name ‘spark-operator’

Create a file named namespaces-spark.yaml with the following content and save the file inside $pwd/files

apiVersion: v1
kind: Namespace
metadata:
  name: spark-operator

Execute the following command:

kubectl create -f files/namespaces-spark.yaml
[Screenshot: Kubernetes dashboard showing the 'spark-operator' namespace]
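
You can also confirm the namespace from the command line:

kubectl get namespace spark-operator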

Step 7: Create a service account named ‘spark’. This account will be used to create and edit pods.

kubectl create serviceaccount spark --namespace=default
[Screenshot: the 'spark' service account in the dashboard]
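
To verify the service account was created:

kubectl get serviceaccount spark -n default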

Step 8: Create an RBAC cluster role binding for the ‘spark’ service account so that it has the permissions to create and edit pods.

kubectl create clusterrolebinding spark-operator-role --clusterrole=cluster-admin --serviceaccount=default:spark --namespace=default
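
This binds the ‘spark’ service account to the built-in cluster-admin role, which is broader than strictly necessary but keeps the demo simple. You can inspect the binding with:

kubectl describe clusterrolebinding spark-operator-role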

Step 9: Install Helm using Homebrew

brew install helm

Check the version

helm version

Step 10: Add helm repo.

A repository is a place where charts can be collected and shared.

helm repo add spark-operator https://googlecloudplatform.github.io/spark-on-k8s-operator
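
After adding the repo, refresh the local chart index and confirm the chart is visible:

helm repo update
helm search repo spark-operator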

Step 11: Install the spark-operator chart

helm install spark-operator/spark-operator --namespace spark-operator --set sparkJobNamespace=default --set webhook.enable=true --generate-name
[Screenshots: helm install output and the Kubernetes dashboard]

Note: In the dashboard, switch the namespace to “spark-operator”.

[Screenshots: the spark-operator pod and deployment in the spark-operator namespace]
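
You can also confirm the Helm release and the operator pod from the command line (the release name is auto-generated because of --generate-name, so yours will differ):

helm list -n spark-operator
kubectl get pods -n spark-operator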

Yay! The infrastructure is ready. Let’s deploy the app.

Spark Application: Now, let’s run a Spark job

Create a file named spark-pi.yaml with the following content and save the file inside $pwd/examples

apiVersion: "sparkoperator.k8s.io/v1beta2"
kind
: SparkApplication
metadata:
name: spark-pi-v1
namespace: default
spec:
type: Scala
mode: cluster
image: "gcr.io/spark-operator/spark:v2.4.5"
imagePullPolicy
: Always
mainClass: org.apache.spark.examples.SparkPi
mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples_2.11-2.4.5.jar"
sparkVersion
: "2.4.5"
restartPolicy
:
type: Never
volumes:
- name: "test-volume"
hostPath
:
path: "/tmp"
type
: Directory
driver:
cores: 1
coreLimit: "1200m"
memory
: "512m"
labels
:
version: 2.4.5
serviceAccount: spark
volumeMounts:
- name: "test-volume"
mountPath
: "/tmp"
executor
:
cores: 1
instances: 1
memory: "512m"
labels
:
version: 2.4.5
volumeMounts:
- name: "test-volume"
mountPath
: "/tmp"

Now execute:

kubectl apply -f examples/spark-pi.yaml
[Screenshots: the Spark driver pod running, then the executor pod coming up and running]
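
You can also follow the application from the command line. The operator typically names the driver pod <application-name>-driver, so for this example it should be spark-pi-v1-driver; once the driver has finished, its logs contain the computed value of Pi:

kubectl get sparkapplications
kubectl describe sparkapplication spark-pi-v1
kubectl logs spark-pi-v1-driver | grep "Pi is roughly"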

Scheduled Spark Application: Now, let’s run a scheduled Spark job

Create a file named spark-schedule-pi.yaml with the following content and save the file inside $pwd/examples

apiVersion: "sparkoperator.k8s.io/v1beta2"
kind
: ScheduledSparkApplication
metadata:
name: spark-pi-scheduled
namespace: default
spec:
schedule: "@every 1m"
concurrencyPolicy
: Allow
template:
type: Scala
mode: cluster
image: "gcr.io/spark-operator/spark:v2.4.5"
imagePullPolicy
: Always
mainClass: org.apache.spark.examples.SparkPi
mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples_2.11-2.4.5.jar"
sparkVersion
: "2.4.5"
restartPolicy
:
type: Never
driver:
cores: 1
coreLimit: "1200m"
memory
: "512m"
labels
:
version: 2.4.5
serviceAccount: spark
executor:
cores: 1
instances: 1
memory: "512m"
labels
:
version: 2.4.5

Now execute:

kubectl apply -f examples/spark-schedule-pi.yaml
The scheduled job is now running.
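
A ScheduledSparkApplication creates a new SparkApplication each time the schedule fires, so runs will keep accumulating every minute. You can watch them, and delete the schedule when you are done:

kubectl get scheduledsparkapplications
kubectl get sparkapplications
kubectl delete scheduledsparkapplication spark-pi-scheduled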

Congratulations! You did it.

In the next article, we will package and deploy our own Spark Scala Code.
