Run Spark Job on Kubernetes using Spark-on-k8s-Operator
A step-by-step guide for running Spark jobs on Kubernetes using spark-on-k8s-operator
Versions:
- Spark: 2.4.5
- Kubernetes: v1.15.5
- spark-on-k8s-operator: sparkoperator.k8s.io/v1beta2
Step 1: Install Docker on Mac/Windows
Install from https://docs.docker.com/docker-for-mac/install/
Step 2: Resource allocation
Minimum requirement: allocate at least 4 CPUs and 8 GB of memory to Docker.
Step 3: Enable Kubernetes
Enable Kubernetes in Docker Desktop. It might take some time, so be patient.
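Once the cluster is up, a quick sanity check is to list the nodes; on Docker Desktop you should see a single docker-desktop node:
kubectl get nodes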
Step 4: Kubernetes dashboard setup
Follow my other article on setting up the dashboard.
Step 5: Check the installed version of kubectl
kubectl version
Step 6: Create a namespace with the name ‘spark-operator’
Create a file named namespaces-spark.yaml with the following content and save it inside $pwd/files:
apiVersion: v1
kind: Namespace
metadata:
  name: spark-operator
Execute the following command:
kubectl create -f files/namespaces-spark.yaml
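You can verify that the namespace was created:
kubectl get namespace spark-operator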
Step 7: Create a service account with the name ‘spark’. This account will be used to run/edit pods.
kubectl create serviceaccount spark --namespace=default
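Confirm the service account exists:
kubectl get serviceaccount spark --namespace=default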
Step 8: Create an RBAC binding for the service account ‘spark’ so that it has the permissions required to create/edit pods.
kubectl create clusterrolebinding spark-operator-role --clusterrole=cluster-admin --serviceaccount=default:spark --namespace=default
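Check that the binding is in place:
kubectl get clusterrolebinding spark-operator-role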
Step 9: Install Helm using the command:
brew install helm
Check the version
helm version
Step 10: Add the Helm repo
A repository is a place where charts can be collected and shared.
helm repo add spark-operator https://googlecloudplatform.github.io/spark-on-k8s-operator
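After adding the repo, you can refresh the local chart index and confirm the chart shows up:
helm repo update
helm search repo spark-operator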
Step 11: Install the spark-operator/spark-operator chart
helm install spark-operator/spark-operator --namespace spark-operator --set sparkJobNamespace=default --set webhook.enable=true --generate-name
Note: Switch the namespace to “spark-operator” (for example, in the dashboard) to see the operator’s resources.
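To check that the operator is up and running:
kubectl get pods --namespace spark-operator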
Yay! The infrastructure is ready! Let’s deploy the app.
Spark Application: Now, let’s run a Spark job
Create a file named spark-pi.yaml with the following content and save it inside $pwd/examples:
apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
name: spark-pi-v1
namespace: default
spec:
type: Scala
mode: cluster
image: "gcr.io/spark-operator/spark:v2.4.5"
imagePullPolicy: Always
mainClass: org.apache.spark.examples.SparkPi
mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples_2.11-2.4.5.jar"
sparkVersion: "2.4.5"
restartPolicy:
type: Never
volumes:
- name: "test-volume"
hostPath:
path: "/tmp"
type: Directory
driver:
cores: 1
coreLimit: "1200m"
memory: "512m"
labels:
version: 2.4.5
serviceAccount: spark
volumeMounts:
- name: "test-volume"
mountPath: "/tmp"
executor:
cores: 1
instances: 1
memory: "512m"
labels:
version: 2.4.5
volumeMounts:
- name: "test-volume"
mountPath: "/tmp"
Now execute:
kubectl apply -f examples/spark-pi.yaml
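You can watch the application’s status and, once the driver pod comes up, tail its logs (by default the driver pod is usually named after the application, i.e. spark-pi-v1-driver here):
kubectl get sparkapplications spark-pi-v1 --namespace=default
kubectl describe sparkapplication spark-pi-v1 --namespace=default
kubectl logs spark-pi-v1-driver --namespace=default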
Scheduled Spark Application: Now, let’s run a Spark job on a schedule
Create a file named spark-schedule-pi.yaml with the following content and save it inside $pwd/examples:
apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: ScheduledSparkApplication
metadata:
name: spark-pi-scheduled
namespace: default
spec:
schedule: "@every 1m"
concurrencyPolicy: Allow
template:
type: Scala
mode: cluster
image: "gcr.io/spark-operator/spark:v2.4.5"
imagePullPolicy: Always
mainClass: org.apache.spark.examples.SparkPi
mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples_2.11-2.4.5.jar"
sparkVersion: "2.4.5"
restartPolicy:
type: Never
driver:
cores: 1
coreLimit: "1200m"
memory: "512m"
labels:
version: 2.4.5
serviceAccount: spark
executor:
cores: 1
instances: 1
memory: "512m"
labels:
version: 2.4.5
Now execute:
kubectl apply -f examples/spark-schedule-pi.yaml
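The operator will spawn a new SparkApplication run every minute; you can list both the schedule and the runs it creates:
kubectl get scheduledsparkapplications --namespace=default
kubectl get sparkapplications --namespace=default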
Congratulations! You did it.
In the next article, we will package and deploy our own Spark Scala Code.