Package Spark Scala Code and Deploy it on Kubernetes using Spark-on-k8s-Operator

Suchit Gupta
3 min read · May 5, 2020


A step-by-step guide to packaging your Spark Scala code and deploying it on Kubernetes using spark-on-k8s-operator.

Versions:

  1. Spark: 3.0.0
  2. Scala: 2.12
  3. SBT: 1.3.13
  4. Docker On Mac: 2.2.0.0
  5. Kubernetes: v1.15.5
  6. spark-on-k8s-operator: sparkoperator.k8s.io/v1beta2

Step 1: Set up Kubernetes:

Please follow my earlier post to set up Kubernetes.
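Before moving on, it is worth confirming that the cluster and the spark-on-k8s-operator are up. A quick check could look like the following (the spark-operator namespace is an assumption; adjust it to wherever you installed the operator):

kubectl cluster-info
kubectl get pods -n spark-operator   # the operator pod should be Running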

Step 2: The project can be cloned from GitHub:

The project structure is as below:
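The screenshot of the project tree is not reproduced here, but based on the files referenced in the steps below, the layout looks roughly like this (treat it as a sketch; the repo is the source of truth):

spark-scala-k8-app/
  build.sbt
  Dockerfile
  Dockerfile-app
  project/
    build.properties
    plugins.sbt
  src/
    main/
      scala/
        com/
          AppK8Demo.scala
  examples/
    spark-scala-k8-app.yaml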

Step 3: Set up for packaging:

1. The plugins.sbt file inside the project folder is required for building a flat jar.

addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.10")

2. The build.properties file inside the project folder is required for providing the sbt version.

sbt.version=1.3.13

3. In build.sbt, the assemblyMergeStrategy setting is required to resolve any conflicts during packaging:

assemblyMergeStrategy in assembly := {
  // Concatenate service registrations so every Spark data source stays discoverable
  case "META-INF/services/org.apache.spark.sql.sources.DataSourceRegister" => MergeStrategy.concat
  // Drop other META-INF entries (manifests, signatures) that commonly clash
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  // Concatenate Typesafe config files coming from different dependencies
  case "application.conf" => MergeStrategy.concat
  // For everything else, keep the first copy encountered
  case x => MergeStrategy.first
}
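For context, this setting lives in build.sbt next to the usual project settings. A minimal sketch of the rest of that file is below; the exact dependencies in the repo may differ, and marking Spark as "provided" is an assumption that works here because the operator's base image already ships Spark. The name and version match the assembly jar referenced in Step 6.

// build.sbt (sketch) -- combine with the assemblyMergeStrategy block above
name := "spark-scala-k8-app"
version := "0.1"
scalaVersion := "2.12.10" // any 2.12.x release works with Spark 3.0.0

// Spark is provided by the base image, so it does not need to be in the flat jar
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "3.0.0" % "provided",
  "org.apache.spark" %% "spark-sql" % "3.0.0" % "provided"
)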

Step 4: Build the base Docker image, which has Hadoop, Spark, and SBT

This Docker image provides the required environment for executing your application code.

Dockerfile:

ARG SPARK_IMAGE=gcr.io/spark-operator/spark:v3.0.0
FROM ${SPARK_IMAGE}

ENV SBT_VERSION 1.3.13


# Switch to user root so we can add additional jars, packages and configuration files.
USER root

RUN apt-get -y update && apt-get install -y curl

USER ${spark_uid}

WORKDIR /app

# Install SBT
RUN curl -fsL https://github.com/sbt/sbt/releases/download/v$SBT_VERSION/sbt-$SBT_VERSION.tgz | tar xfz - -C /usr/local
ENV PATH /usr/local/sbt/bin:${PATH}

RUN sbt update

ENTRYPOINT ["/opt/entrypoint.sh"]

Create the Docker image by executing:

docker build -t test/spark-operator:latest .

Step 5: Build the Docker image with the flat jar

This Docker image is built from the image we created in Step 4 and produces the flat jar of your code.

Dockerfile (Dockerfile-app):

FROM test/spark-operator:latest

# Add project files
ADD build.sbt /app/
ADD project/plugins.sbt /app/project/
ADD project/build.properties /app/project/
ADD src/. /app/src/


# Build the project and produce the flat (assembly) jar
RUN sbt clean assembly

ENTRYPOINT ["/opt/entrypoint.sh"]

Create the Docker image by executing:

docker build -f Dockerfile-app -t test/spark-scala-k8-app:latest .
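To sanity-check that the assembly jar ended up where the manifest in Step 6 expects it, you can optionally override the entrypoint and list the target directory (the path assumes the project name and version used in this post):

docker run --rm --entrypoint ls test/spark-scala-k8-app:latest /app/target/scala-2.12/
# should list spark-scala-k8-app-assembly-0.1.jar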

Step 6: Create a YAML file to deploy/run the code on K8s

The file is available at examples/spark-scala-k8-app.yaml.

Important elements in the YAML file:

  1. spec.image: "test/spark-scala-k8-app:latest". Provide the image name from Step 5.
  2. spec.imagePullPolicy: Never. We are pulling the image from the local Docker daemon rather than a registry.
  3. spec.mainClass: com.AppK8Demo. Provide the fully qualified name of your job class. AppK8Demo is a sample job class from my GitHub repo; a minimal sketch of such a class follows the manifest below.

The full manifest:
apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: spark-scala-k8-app
  namespace: default
spec:
  type: Scala
  mode: cluster
  image: "test/spark-scala-k8-app:latest"
  imagePullPolicy: Never
  mainClass: com.AppK8Demo
  mainApplicationFile: "local:///app/target/scala-2.12/spark-scala-k8-app-assembly-0.1.jar"
  sparkVersion: "3.0.0"
  restartPolicy:
    type: Never
  volumes:
    - name: "test-volume"
      hostPath:
        path: "/tmp"
        type: Directory
  driver:
    cores: 1
    coreLimit: "1200m"
    memory: "512m"
    labels:
      version: 3.0.0
    serviceAccount: spark
    volumeMounts:
      - name: "test-volume"
        mountPath: "/tmp"
  executor:
    cores: 1
    instances: 1
    memory: "512m"
    labels:
      version: 3.0.0
    volumeMounts:
      - name: "test-volume"
        mountPath: "/tmp"
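For reference, here is a minimal sketch of what a job class like com.AppK8Demo might look like. The actual class in the repo may do more; this example only starts a SparkSession and runs a trivial query so the driver logs have something to show.

package com

import org.apache.spark.sql.SparkSession

// Hypothetical minimal job class matching spec.mainClass: com.AppK8Demo
object AppK8Demo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("spark-scala-k8-app")
      .getOrCreate()

    // Trivial sanity check: build a small DataFrame and print it to the driver logs
    import spark.implicits._
    val df = Seq(1, 2, 3).toDF("value")
    df.show()

    spark.stop()
  }
}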

Execute the following to deploy/run the job:

kubectl apply -f examples/spark-scala-k8-app.yaml
(Screenshots: Kubernetes Dashboard and application logs)
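If you prefer the command line to the dashboard, something like the following should work; the driver pod name assumes the operator's default <app-name>-driver convention:

kubectl get sparkapplication spark-scala-k8-app   # application status
kubectl logs -f spark-scala-k8-app-driver         # driver (application) logs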

Congratulations! Hope this worked for you too.
