Package Spark Scala Code and Deploy it on Kubernetes using Spark-on-k8s-Operator
A step-by-step guide to packaging your Spark Scala code and deploying it on Kubernetes using spark-on-k8s-operator.
Versions:
- Spark: 3.0.0
- Scala: 2.12
- SBT: 1.3.13
- Docker On Mac: 2.2.0.0
- Kubernetes: v1.15.5
- spark-on-k8s-operator: sparkoperator.k8s.io/v1beta2
Step 1: Set up Kubernetes:
Please follow my earlier post to set up Kubernetes.
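Before moving on, it helps to confirm that the cluster is reachable and that the operator is running. The commands below assume the operator was installed into a namespace named spark-operator; adjust the namespace if your installation differs.
kubectl cluster-info
kubectl get pods -n spark-operator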
Step 2: The project can be cloned from GitHub:
The project structure is as below:
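The tree below is an approximate sketch reconstructed from the paths used later in this guide; the layout in the actual repo may differ slightly.
spark-scala-k8-app/
├── build.sbt
├── Dockerfile
├── Dockerfile-app
├── examples/
│   └── spark-scala-k8-app.yaml
├── project/
│   ├── build.properties
│   └── plugins.sbt
└── src/
    └── main/
        └── scala/
            └── com/
                └── AppK8Demo.scala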
Step 3: Set up for packaging:
1. The plugins.sbt file inside the project folder is required for building a flat jar.
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.10")
2. The build.properties file inside the project folder is required for providing the sbt version.
sbt.version=1.3.13
3. In build.sbt, a merge-strategy task is required to resolve any conflicts during packaging:
assemblyMergeStrategy in assembly := {
  case "META-INF/services/org.apache.spark.sql.sources.DataSourceRegister" => MergeStrategy.concat
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case "application.conf" => MergeStrategy.concat
  case x => MergeStrategy.first
}
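For context, the merge strategy above lives in build.sbt alongside the project metadata and dependencies. A minimal build.sbt that would produce the jar name referenced in Step 6 (spark-scala-k8-app-assembly-0.1.jar) might look like the sketch below; the exact dependency list is an assumption, and the repo may pull in more.
name := "spark-scala-k8-app"
version := "0.1"
scalaVersion := "2.12.12"

// Spark is provided by the base image at runtime, so mark it "provided"
// to keep it out of the flat jar.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "3.0.0" % "provided",
  "org.apache.spark" %% "spark-sql"  % "3.0.0" % "provided"
)

// ...plus the assemblyMergeStrategy task shown above.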
Step 4: Build the base Docker image, which has Hadoop, Spark, and SBT
This Docker image provides the environment required to build and execute your application code.
Dockerfile:
ARG SPARK_IMAGE=gcr.io/spark-operator/spark:v3.0.0
FROM ${SPARK_IMAGE}

ENV SBT_VERSION 1.3.13

# Switch to user root so we can add additional packages, jars, and configuration files.
USER root
RUN apt-get -y update && apt-get install -y curl

# Install SBT (as root, so it can be extracted into /usr/local).
RUN curl -fsL https://github.com/sbt/sbt/releases/download/v$SBT_VERSION/sbt-$SBT_VERSION.tgz | tar xfz - -C /usr/local
ENV PATH /usr/local/sbt/bin:${PATH}

# Warm up the SBT caches so application builds on top of this image are faster.
WORKDIR /app
RUN sbt update

# Switch back to the non-root Spark user (UID 185 is the default in the base image).
ARG spark_uid=185
USER ${spark_uid}

ENTRYPOINT ["/opt/entrypoint.sh"]
Create the Docker image by executing:
docker build -t test/spark-operator:latest .
Step 5: Build the Docker image with the flat jar
This image is built on top of the base image from Step 4 and produces the flat jar for your code.
Dockerfile:
FROM test/spark-operator:latest

# Build as root so SBT can write to /app and reuse the caches from the base image.
USER root

# Add project files
ADD build.sbt /app/
ADD project/plugins.sbt /app/project/
ADD project/build.properties /app/project/
ADD src/. /app/src/

# Build the flat jar
RUN sbt clean assembly

# Run as the non-root Spark user again
ARG spark_uid=185
USER ${spark_uid}
ENTRYPOINT ["/opt/entrypoint.sh"]
Create the Docker image by executing:
docker build -f Dockerfile-app -t test/spark-scala-k8-app:latest .
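Before deploying, you can sanity-check that the assembly jar ended up where the manifest in Step 6 expects it by listing the target directory inside the image; you should see spark-scala-k8-app-assembly-0.1.jar.
docker run --rm --entrypoint ls test/spark-scala-k8-app:latest /app/target/scala-2.12/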
Step 6: Create a YAML file to deploy/run the code on K8s
The file is available in examples/spark-scala-k8-app.yaml.
Important elements in the YAML file:
- spec.image: "test/spark-scala-k8-app:latest". Provide the image name from Step 5.
- spec.imagePullPolicy: Never, since we are using a locally built image.
- spec.mainClass: com.AppK8Demo. Provide the fully qualified name of your job class. AppK8Demo is a sample job class from my GitHub repo (a minimal sketch of such a class follows the manifest below).
apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
name: spark-scala-k8-app
namespace: default
spec:
type: Scala
mode: cluster
image: "test/spark-scala-k8-app:latest"
imagePullPolicy: Never
mainClass: com.AppK8Demo
mainApplicationFile: "local:///app/target/scala-2.12/spark-scala-k8-app-assembly-0.1.jar"
sparkVersion: "3.0.0"
restartPolicy:
type: Never
volumes:
- name: "test-volume"
hostPath:
path: "/tmp"
type: Directory
driver:
cores: 1
coreLimit: "1200m"
memory: "512m"
labels:
version: 3.0.0
serviceAccount: spark
volumeMounts:
- name: "test-volume"
mountPath: "/tmp"
executor:
cores: 1
instances: 1
memory: "512m"
labels:
version: 3.0.0
volumeMounts:
- name: "test-volume"
mountPath: "/tmp"
Execute the following to deploy/run the job:
kubectl apply -f examples/spark-scala-k8-app.yaml
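Once the application is submitted, the operator creates a driver pod for it (by default named spark-scala-k8-app-driver). You can check the application status and follow the driver logs with:
kubectl get sparkapplications spark-scala-k8-app -o yaml
kubectl describe sparkapplication spark-scala-k8-app
kubectl logs -f spark-scala-k8-app-driver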
Congratulations! I hope this worked for you too.