Spark 3.0.0 history server with minIO
Oct 22, 2020
A step-by-step guide to setting up a Spark history server backed by minIO on Kubernetes
Setup:
- Docker on Mac: 2.2.0
- Kubernetes: 1.15.5
- macOS: 10.15.6
- Spark: 3.0.0 with hadoop 3.2
- Spark Google operator
- minIO: latest image
Step 1: Enable Kubernetes and set up the Dashboard locally.
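Enabling Kubernetes is done from Docker Desktop (Preferences > Kubernetes > Enable Kubernetes). For the Dashboard, one way is the upstream manifest (v2.0.0 is just an example version):
kubectl apply -f https://raw.githubusercontent.com/kubernetes/dashboard/v2.0.0/aio/deploy/recommended.yaml
kubectl proxy
The UI is then served through the proxy at http://localhost:8001/api/v1/namespaces/kubernetes-dashboard/services/https:kubernetes-dashboard:/proxy/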
Step 2: Set up the Spark Google operator
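The operator's README installs it through Helm; roughly (repo URL and release name per the operator docs at the time, adjust as needed):
helm repo add spark-operator https://googlecloudplatform.github.io/spark-on-k8s-operator
helm install spark-operator spark-operator/spark-operator --namespace spark-operator --create-namespace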
Step 3: Define minIO in docker-compose and push the stack
docker-compose.yaml
version: '3.0'
services:
  minio:
    image: minio/minio
    container_name: minio
    ports:
      - 9000:9000
    environment:
      MINIO_ACCESS_KEY: AKIAIOSFODNN7EXAMPLE
      MINIO_SECRET_KEY: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
    entrypoint: sh
    command: -c 'mkdir -p /export/spark-events/logs && /usr/bin/minio server /export'
    healthcheck:
      test: [ "CMD", "curl", "-f", "http://localhost:9000/minio/health/live" ]
      interval: 30s
      timeout: 20s
      retries: 3
Command to push the stack:
docker stack deploy --compose-file docker-compose.yaml --namespace default --orchestrator kubernetes spark-minio
After the push, the minIO UI should be accessible at http://localhost:9000/minio/
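To verify the deployment, check the pod and hit the same health endpoint the compose healthcheck uses:
kubectl get pods --namespace default
curl -f http://localhost:9000/minio/health/live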
Step 4: Set up the Spark history server
Dockerfile:
ARG SPARK_IMAGE=gcr.io/spark-operator/spark:v3.0.0-hadoop3
FROM ${SPARK_IMAGE}
# Switch to user root so we can add additional jars, packages and configuration files.
USER root
RUN apt-get -y update && apt-get install -y coreutils
# Remove the old httpclient; a newer version is added below
RUN rm -f $SPARK_HOME/jars/httpclient-4.5.6.jar
# Add dependency for hadoop-aws
ADD https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-core/1.11.883/aws-java-sdk-core-1.11.883.jar $SPARK_HOME/jars
ADD https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-dynamodb/1.11.883/aws-java-sdk-dynamodb-1.11.883.jar $SPARK_HOME/jars
ADD https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-s3/1.11.883/aws-java-sdk-s3-1.11.883.jar $SPARK_HOME/jars
ADD https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.2.0/hadoop-aws-3.2.0.jar $SPARK_HOME/jars
ADD https://repo1.maven.org/maven2/org/apache/httpcomponents/httpclient/4.5.9/httpclient-4.5.9.jar $SPARK_HOME/jars
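# Note: ADD from a URL creates the jar with 0600 permissions owned by root;
# if the server runs as a non-root user, something like
# RUN chmod 644 $SPARK_HOME/jars/*.jar may be needed (assumption, not part of the original setup).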
# Required for minIO service discovery by name (spark-defaults.conf enables path-style S3 access)
ADD spark-history/spark-defaults.conf /opt/spark/conf/
ENTRYPOINT ["/opt/entrypoint.sh"]
spark-defaults.conf
spark.hadoop.fs.s3a.path.style.access true
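This setting makes the S3A client use path-style request URLs:
http://minio-published:9000/spark-events/logs/...
instead of the virtual-hosted default:
http://spark-events.minio-published:9000/logs/...
which would require per-bucket DNS entries that the in-cluster minIO service does not have.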
Build the Docker image
docker build -f spark-history/Dockerfile -t spark-history-server:latest .
Create a secret for S3
kubectl create secret generic aws-secrets --from-file=secret/aws-access-key --from-file=secret/aws-secret-key
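The two files hold the minIO credentials from docker-compose.yaml; for example (printf avoids a trailing newline, which would otherwise end up in the secret):
mkdir -p secret
printf 'AKIAIOSFODNN7EXAMPLE' > secret/aws-access-key
printf 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY' > secret/aws-secret-key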
Add the stable Helm repo
helm repo add stable https://kubernetes-charts.storage.googleapis.com
Install the Helm chart
helm install stable/spark-history-server --namespace default -f examples/spark-history-server-manifest.yaml --generate-name
spark-history-server-manifest.yaml
# s3 config
image:
  repository: spark-history-server
  tag: latest
  pullPolicy: Never
pvc:
  enablePVC: false
  existingClaimName: nfs-pvc
  eventsDir: "/"
nfs:
  enableExampleNFS: false
  pvName: nfs-pv
  pvcName: nfs-pvc
s3:
  enableS3: true
  enableIAM: false
  secret: aws-secrets
  accessKeyName: aws-access-key
  secretKeyName: aws-secret-key
  logDirectory: s3a://spark-events/logs/
  endpoint: http://minio-published:9000
The history server should now be accessible at http://localhost:18080/
Note: I have seen some caching issues with the UI, so I recommend using an incognito window.
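If the service is not exposed on localhost, a port-forward works as a fallback (the service name depends on the release name Helm generated; the one below is a placeholder):
kubectl get svc
kubectl port-forward svc/<release-name>-spark-history-server 18080:18080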
Step 5: Submit the Spark job
kubectl apply -f examples/spark-pi.yaml
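For the job's events to reach the history server, the application has to write its event log to the same minIO bucket. A minimal sketch of what examples/spark-pi.yaml could look like (the image reuses the one built above so the hadoop-aws jars are on the classpath; the service account name is an assumption from the operator's quick-start):
apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: spark-pi
  namespace: default
spec:
  type: Scala
  mode: cluster
  image: spark-history-server:latest   # custom image from Step 4, has the S3 jars
  imagePullPolicy: Never
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples_2.12-3.0.0.jar"
  sparkVersion: "3.0.0"
  sparkConf:
    # write event logs where the history server reads them;
    # path-style access comes from the image's spark-defaults.conf
    "spark.eventLog.enabled": "true"
    "spark.eventLog.dir": "s3a://spark-events/logs/"
    "spark.hadoop.fs.s3a.endpoint": "http://minio-published:9000"
    "spark.hadoop.fs.s3a.access.key": "AKIAIOSFODNN7EXAMPLE"
    "spark.hadoop.fs.s3a.secret.key": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
  driver:
    cores: 1
    memory: "512m"
    serviceAccount: spark   # assumption: created during operator setup
  executor:
    cores: 1
    instances: 1
    memory: "512m"
Once the job finishes, its run should appear in the history server UI at http://localhost:18080/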