Spark 3.0.0 history server with minIO

Suchit Gupta
3 min read · Oct 22, 2020

A step-by-step guide on setting up a Spark history server backed by minIO in Kubernetes

Set up:

  1. Docker on Mac: 2.2.0
  2. Kubernetes: 1.15.5
  3. macOS: 10.15.6
  4. Spark: 3.0.0 with hadoop 3.2
  5. Spark Google operator
  6. minIO: latest image

Step 1: Enable Kubernetes and set up the Dashboard locally.
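Kubernetes itself is enabled from the Docker Desktop preferences. For the Dashboard, a minimal sketch is below; the manifest version (v2.0.0) is an assumption, so pick the release that matches your cluster.

```shell
# Deploy the Kubernetes Dashboard (version is an assumption, adjust as needed)
kubectl apply -f https://raw.githubusercontent.com/kubernetes/dashboard/v2.0.0/aio/deploy/recommended.yaml

# Start a local proxy; the Dashboard is then reachable at
# http://localhost:8001/api/v1/namespaces/kubernetes-dashboard/services/https:kubernetes-dashboard:/proxy/
kubectl proxy
```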

Step 2: Set up the Spark Google operator
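A sketch of installing the operator via its Helm chart; the release name and namespace here are assumptions, not requirements.

```shell
# Add the spark-on-k8s-operator chart repo and install it
# (release name "my-release" and the namespace are arbitrary choices)
helm repo add spark-operator https://googlecloudplatform.github.io/spark-on-k8s-operator
helm install my-release spark-operator/spark-operator --namespace spark-operator --create-namespace
```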

Step 3: Define minIO in docker-compose and deploy the stack

docker-compose.yaml

version: '3.0'
services:
  minio:
    image: minio/minio
    container_name: minio
    ports:
      - 9000:9000
    environment:
      MINIO_ACCESS_KEY: AKIAIOSFODNN7EXAMPLE
      MINIO_SECRET_KEY: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
    entrypoint: sh
    command: -c 'mkdir -p /export/spark-events/logs && /usr/bin/minio server /export'
    healthcheck:
      test: [ "CMD", "curl", "-f", "http://localhost:9000/minio/health/live" ]
      interval: 30s
      timeout: 20s
      retries: 3

Command to push the stack:

docker stack deploy --compose-file docker-compose.yaml --namespace default --orchestrator kubernetes spark-minio

After the deploy, the minIO UI should be accessible at http://localhost:9000/minio/
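As an optional sanity check, the bucket created by the container entrypoint can be listed with the AWS CLI (assuming it is installed); the credentials are the example keys from docker-compose.yaml above.

```shell
# Point the AWS CLI at the local minIO endpoint using the example credentials
export AWS_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE
export AWS_SECRET_ACCESS_KEY=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
aws --endpoint-url http://localhost:9000 s3 ls
# The spark-events bucket created by the entrypoint should appear in the listing
```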

Step 4: Set up the Spark history server

Dockerfile:

ARG SPARK_IMAGE=gcr.io/spark-operator/spark:v3.0.0-hadoop3
FROM ${SPARK_IMAGE}
# Switch to user root so we can add additional jars, packages and configuration files.
USER root

RUN apt-get -y update && apt-get install -y coreutils

# Remove the bundled httpclient so it does not conflict with the newer version added below
RUN rm -f $SPARK_HOME/jars/httpclient-4.5.6.jar

# Add dependency for hadoop-aws
ADD https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-core/1.11.883/aws-java-sdk-core-1.11.883.jar $SPARK_HOME/jars
ADD https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-dynamodb/1.11.883/aws-java-sdk-dynamodb-1.11.883.jar $SPARK_HOME/jars
ADD https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-s3/1.11.883/aws-java-sdk-s3-1.11.883.jar $SPARK_HOME/jars

ADD https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.2.0/hadoop-aws-3.2.0.jar $SPARK_HOME/jars

ADD https://repo1.maven.org/maven2/org/apache/httpcomponents/httpclient/4.5.9/httpclient-4.5.9.jar $SPARK_HOME/jars

# Required for minIO service discovery by name.
ADD /spark-history/spark-defaults.conf /opt/spark/conf/

ENTRYPOINT ["/opt/entrypoint.sh"]

spark-defaults.conf

spark.hadoop.fs.s3a.path.style.access  true
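This setting matters because with path-style access the bucket name goes into the URL path, so a single DNS name (the minIO service) can serve every bucket. Without it, the S3A client uses virtual-hosted addressing (bucket name as a DNS subdomain), which fails against a single minIO endpoint. A quick illustration against the local instance:

```shell
# Path-style request: the bucket is part of the path, so the single
# minIO hostname handles it. An unauthenticated request will likely
# return an AccessDenied XML response, which still confirms the
# request was routed to the bucket rather than failing DNS lookup.
curl -i http://localhost:9000/spark-events/

# Virtual-hosted style would instead target
# http://spark-events.minio-published:9000/ and need per-bucket DNS.
```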

Build the docker image

docker build -f spark-history/Dockerfile -t spark-history-server:latest .

Create the secret for S3

kubectl create secret generic aws-secrets --from-file=secret/aws-access-key --from-file=secret/aws-secret-key
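The two files referenced by `--from-file` are assumed to hold the same example credentials configured for minIO in docker-compose.yaml; a sketch of creating them:

```shell
# Create the credential files the secret is built from
# (values are the example keys from docker-compose.yaml)
mkdir -p secret
printf '%s' 'AKIAIOSFODNN7EXAMPLE' > secret/aws-access-key
printf '%s' 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY' > secret/aws-secret-key
```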

Add the stable Helm repo

helm repo add stable https://kubernetes-charts.storage.googleapis.com

Install the Helm chart

helm install stable/spark-history-server --namespace default -f examples/spark-history-server-manifest.yaml --generate-name

spark-history-server-manifest.yaml


#s3 config
image:
  repository: spark-history-server
  tag: latest
  pullPolicy: Never
pvc:
  enablePVC: false
  existingClaimName: nfs-pvc
  eventsDir: "/"
nfs:
  enableExampleNFS: false
  pvName: nfs-pv
  pvcName: nfs-pvc
s3:
  enableS3: true
  enableIAM: false
  secret: aws-secrets
  accessKeyName: aws-access-key
  secretKeyName: aws-secret-key
  logDirectory: s3a://spark-events/logs/
  endpoint: http://minio-published:9000

The server should be accessible at http://localhost:18080/
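If the chart's service is not exposed on the host directly, a port-forward gets you there; the service name below is a placeholder, so check the actual name first.

```shell
# Find the history server's service name
kubectl get svc

# Forward it to localhost:18080 (replace the placeholder with the real name)
kubectl port-forward svc/<history-server-service> 18080:18080
```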

Note: I have seen some caching issues with the UI, so I recommend using an incognito window.

Step 5: Submit the Spark job

kubectl apply -f examples/spark-pi.yaml
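For its logs to show up in the history server, the job itself must write event logs to the same s3a location the server reads from. A sketch of the `sparkConf` entries a SparkApplication manifest such as spark-pi.yaml would need; the endpoint, bucket, and credentials mirror the values used above, and this excerpt is an assumption about the manifest rather than its actual contents.

```yaml
# Hypothetical excerpt from a SparkApplication spec: event-log settings
# pointing the job at the bucket the history server reads from.
sparkConf:
  "spark.eventLog.enabled": "true"
  "spark.eventLog.dir": "s3a://spark-events/logs/"
  "spark.hadoop.fs.s3a.endpoint": "http://minio-published:9000"
  "spark.hadoop.fs.s3a.path.style.access": "true"
  "spark.hadoop.fs.s3a.access.key": "AKIAIOSFODNN7EXAMPLE"
  "spark.hadoop.fs.s3a.secret.key": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
```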

The event logs in minIO:

The Spark history server UI:
