Get Spark jobs running in Kubernetes with Prometheus monitoring.

A step-by-step guide to monitoring Spark jobs running in K8 via Prometheus

Set up the Prometheus server using the kube-prometheus-stack Helm chart on Docker-On-Mac

Please follow my post to set up the Prometheus server on Docker-On-Mac

Inspect the Prometheus resource created by the kube-prometheus-stack helm chart

A Prometheus resource defines the desired state of a Prometheus deployment.

The Prometheus instance created by kube-prometheus-stack looks for ServiceMonitor objects defined with the label `release: prometheus`.
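For example, a ServiceMonitor that this Prometheus instance would pick up could look like the following (a minimal sketch; the monitor name, target labels, and port name are illustrative assumptions):

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: spark-metrics-monitor        # illustrative name
  labels:
    release: prometheus              # the label the Prometheus instance selects on
spec:
  selector:
    matchLabels:
      app: spark-driver              # assumed label on the Spark driver/metrics service
  endpoints:
    - port: metrics                  # assumed name of the metrics port
      interval: 30s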

Let's check the Prometheus resource. Execute the command below:

kubectl get prometheus -oyaml

The YAML looks like the following…


A step-by-step guide to setting up a Prometheus server on a local K8 cluster

Please go through my earlier post to set up K8 and the K8 Dashboard on Docker-On-Mac

Install the helm chart

We will use the kube-prometheus-stack Helm chart to get the Prometheus server running on Docker-On-Mac.

Add the repo

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts

Update the repo

helm repo update

Install the chart

helm install prometheus prometheus-community/kube-prometheus-stack
Output

Check the pods

kubectl get pods
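To verify that the server is up, you can port-forward the Prometheus service created by the chart and open http://localhost:9090. The exact service name depends on your release name, so list the services first:

kubectl get svc
kubectl port-forward svc/<prometheus-service-name> 9090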


A step-by-step guide to deploying Zeppelin for multiple users in Kubernetes

Objective

Deploy Zeppelin:0.9.0 in K8

Authenticate Zeppelin UI via Apache Shiro JDBC realm

Repo

The code being discussed can be found on GitHub

Steps to meet the objective

Enable Kubernetes

Follow the link below to run K8 on Mac.

Build the docker image

docker build . -t sg/zeppelin:latest

Dockerfile

#https://hub.docker.com/r/apache/zeppelin/dockerfile
ARG ZEPPELIN_IMAGE=apache/zeppelin:0.9.0
FROM ${ZEPPELIN_IMAGE}

WORKDIR ${Z_HOME}

USER root
ADD /shiro.ini ${Z_HOME}/conf/
#Required for local set up. Overriding so as to make imagePullPolicy as Never
ADD /100-interpreter-spec.yaml ${Z_HOME}/k8s/interpreter/

#Once ZEPPELIN-5192 is resolved, we can remove the below command.
ADD /credentials.json ${Z_HOME}/conf/

#Required for starting the Zeppelin server. Without this Shiro won't be able to set the env configuration.
ADD…
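The shiro.ini added above is what wires in the JDBC realm. A minimal sketch of such a realm follows (the realm and data-source class names are from Apache Shiro and the PostgreSQL driver; the connection details, table names, and queries are illustrative assumptions):

[main]
# Assumed PostgreSQL data source; swap in your own driver and connection details
dataSource = org.postgresql.ds.PGSimpleDataSource
dataSource.serverName = postgres
dataSource.databaseName = zeppelin
dataSource.user = zeppelin
dataSource.password = zeppelin

jdbcRealm = org.apache.shiro.realm.jdbc.JdbcRealm
jdbcRealm.dataSource = $dataSource
# Illustrative queries against assumed users/user_roles tables
jdbcRealm.authenticationQuery = SELECT password FROM users WHERE username = ?
jdbcRealm.userRolesQuery = SELECT role FROM user_roles WHERE username = ?

securityManager.realms = $jdbcRealm

[urls]
/** = authc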


A step-by-step guide to setting up a Spark history server backed by minIO in Kubernetes

Set up:

  1. Docker on Mac: 2.2.0
  2. Kubernetes: 1.15.5
  3. macOS: 10.15.6
  4. Spark: 3.0.0 with hadoop 3.2
  5. Spark Google operator
  6. minIO: latest image

Step 1: Enable K8 and set up the Dashboard locally.

Step 2: Set up the Spark Google operator

Step 3: Get minIO into the docker-compose file and push the stack

docker-compose.yaml

version: '3.0'
services:
  minio:
    image: minio/minio
    container_name: minio
    ports:
      - 9000:9000
    environment:
      MINIO_ACCESS_KEY: AKIAIOSFODNN7EXAMPLE
      MINIO_SECRET_KEY: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
    entrypoint: sh
    command: -c 'mkdir -p /export/spark-events && mkdir -p /export/spark-events/logs && /usr/bin/minio server /export'
    healthcheck:
      test: [ "CMD", "curl", "-f", "http://localhost:9000/minio/health/live" ]
      interval: 30s
      timeout: 20s
      retries: 3
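Note for the later steps: both the Spark jobs writing event logs and the history server will talk to this bucket over s3a. A hedged sketch of the Spark properties involved (standard spark / hadoop-aws configuration keys; the endpoint and credentials simply mirror the compose file above):

spark.eventLog.enabled=true
spark.eventLog.dir=s3a://spark-events/logs
spark.history.fs.logDirectory=s3a://spark-events/logs
spark.hadoop.fs.s3a.endpoint=http://minio:9000
spark.hadoop.fs.s3a.access.key=AKIAIOSFODNN7EXAMPLE
spark.hadoop.fs.s3a.secret.key=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
spark.hadoop.fs.s3a.path.style.access=true
spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem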

Command to push the stack:


A step-by-step guide to running your local Spark Scala code on Kubernetes using spark-on-k8s-operator.

In my earlier post, I packaged the application jar inside the Docker image. For local development, it would be easier if we could run the code on K8 without rebuilding the image each time.

One way of doing that is to share your local development path as a volume. In this post, we will see how to do that.
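A hedged sketch of the volume-related part of a SparkApplication spec for the operator (only the fields relevant to volume sharing are shown; the paths, names, and main class are illustrative assumptions):

apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: local-dev-app                      # illustrative name
spec:
  type: Scala
  mode: cluster
  sparkVersion: "3.0.0"
  mainClass: com.example.Main              # illustrative class
  mainApplicationFile: local:///opt/dev/target/scala-2.12/app.jar   # jar produced by your local sbt build
  volumes:
    - name: local-code
      hostPath:
        path: /path/to/your/project        # your local development path
        type: Directory
  driver:
    volumeMounts:
      - name: local-code
        mountPath: /opt/dev
  executor:
    volumeMounts:
      - name: local-code
        mountPath: /opt/dev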

Versions:

  1. Spark: 3.0.0
  2. Scala: 2.12
  3. SBT: 1.3.13
  4. Docker On Mac: 2.2.0.0
  5. Kubernetes: v1.15.5
  6. spark-on-k8s-operator: sparkoperator.k8s.io/v1beta2

Step 1: Set up Kubernetes:

Please follow my earlier…


At most once at the same time in Spring Scheduler.

Spring Scheduler is an easy and efficient way of scheduling tasks. However, one caveat is that with multiple instances it fires the task on every instance. This behavior is not ideal for use cases where the task should run at most once at a time.

I have been using Quartz for scheduling jobs but recently came across ShedLock, which makes scheduling much easier.

Spring Scheduler with ShedLock

Spring Scheduler triggers the task on all the instances; ShedLock then runs a race in which whichever instance wins takes control over the task and…
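A minimal sketch of how that looks in code (the annotations are from Spring and ShedLock; the lock name, durations, cron expression, and task body are illustrative):

import net.javacrumbs.shedlock.spring.annotation.EnableSchedulerLock;
import net.javacrumbs.shedlock.spring.annotation.SchedulerLock;
import org.springframework.context.annotation.Configuration;
import org.springframework.scheduling.annotation.EnableScheduling;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

@Configuration
@EnableScheduling
@EnableSchedulerLock(defaultLockAtMostFor = "10m")
class SchedulingConfig {
    // A LockProvider bean (e.g. a JDBC-based one backed by a shared table) must also be
    // defined so that all instances race for the lock in a common place.
}

@Component
class ReportJob {
    // Spring fires this on every instance; ShedLock lets only the lock winner run the body.
    @Scheduled(cron = "0 */5 * * * *")
    @SchedulerLock(name = "reportJob", lockAtMostFor = "4m", lockAtLeastFor = "1m")
    public void run() {
        System.out.println("Running on at most one instance at a time"); // illustrative task body
    }
}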


Play around with multiple schemas in your Flyway scripts by using Flyway placeholders.

Requirement

In each environment, we need to support two database schemas. The schema names should not be hardcoded and should take the environment name as a suffix.

application.yaml

Our configuration:

spring:
  datasource:
    url: jdbc:postgresql://${database_name}/${service_database_name}
    username: ${database_user}
    password: ${database_password}
  jpa:
    properties:
      hibernate:
        dialect: org.hibernate.dialect.PostgreSQLDialect
        default_schema: test_${spring.profiles.active}
  flyway:
    schemas: test_${spring.profiles.active}
    out-of-order: true
    baseline-on-migrate: true
    placeholder-prefix: $$$
    placeholder-suffix: $$$
    placeholders:
      raw_schema: test_raw_${spring.profiles.active}

Default Flyway Schema: This is where the Flyway scripts will be executed.

flyway:
  schemas: ${database_schema_name}

Second Schema: This is the second schema, which we will use in our Flyway scripts…
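With the $$$ placeholder prefix and suffix configured above, a migration script can reference that second schema through the raw_schema placeholder. A minimal sketch (the file name and table are illustrative):

-- V2__create_raw_events.sql (illustrative)
CREATE TABLE $$$raw_schema$$$.raw_events (
    id BIGSERIAL PRIMARY KEY,
    payload JSONB NOT NULL
);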


Our objective is to quickly set up a data pipeline and move data from MySQL to PostgreSQL.

Debezium

We will use the Debezium MySQL connector to read the MySQL database logs. Debezium will put the change events in Kafka.

Kafka-Connect

We will use the JDBC Connector to read the events from the Kafka topic and put them in PostgreSQL.
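Both connectors are registered with Kafka Connect as JSON configurations. A hedged sketch of the source side (the property names are from the Debezium MySQL connector; host names, credentials, and the database whitelist are illustrative and should match your docker-compose services):

{
  "name": "mysql-source-connector",
  "config": {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "database.hostname": "mysql",
    "database.port": "3306",
    "database.user": "debezium",
    "database.password": "dbz",
    "database.server.id": "184054",
    "database.server.name": "dbserver1",
    "database.whitelist": "inventory",
    "database.history.kafka.bootstrap.servers": "kafka:9092",
    "database.history.kafka.topic": "schema-changes.inventory"
  }
}

The sink side is registered the same way, with io.confluent.connect.jdbc.JdbcSinkConnector and a connection.url pointing at the PostgreSQL container.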

Code

The code can be downloaded from GitHub.

Build the pipeline

We will use docker-compose to set up the following:

  1. MySQL: The source DB.
  2. PostgreSQL: The destination DB
  3. Kafka-Connect (Debezium and JDBC Connector): Debezium for reading the MySQL logs and the JDBC Connector for pushing the changes to PostgreSQL.
  4. Kafka
  5. Zookeeper

Build…


A guide to writing integration tests for a Spark Structured Streaming application

Bird’s eye view of the architecture:

The basic Architecture

The focus of this post will be entirely on the integration test.

For setting up and running the Spark application, please read my earlier post.

The Code

  1. Spark: 2.4.5
  2. Scala: 2.12.8
  3. Test Containers: 0.37.0
  4. Scala-Test: 3.0.1
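Given these versions, a minimal hedged sketch of a test that spins up Kafka with testcontainers-scala and points Spark at it (the trait and container classes are from testcontainers-scala and ScalaTest; the topic name and spec body are illustrative):

import com.dimafeng.testcontainers.{ForAllTestContainer, KafkaContainer}
import org.apache.spark.sql.SparkSession
import org.scalatest.{FlatSpec, Matchers}

class StreamingJobIntegrationSpec extends FlatSpec with Matchers with ForAllTestContainer {

  // One Kafka container shared by the whole suite
  override val container: KafkaContainer = KafkaContainer()

  "the streaming job" should "read records from Kafka" in {
    val spark = SparkSession.builder().master("local[*]").appName("it-test").getOrCreate()

    // Illustrative: wire the job's reader to the container's bootstrap servers
    val stream = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", container.bootstrapServers)
      .option("subscribe", "input-topic") // illustrative topic
      .load()

    stream.isStreaming shouldBe true
    spark.stop()
  }
}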

Project Structure


A quick guide to setting up Debezium and using a Spark Streaming job as a Kafka consumer.

Bird’s eye view of the architecture:

The Architecture

Debezium

Debezium is an open-source distributed platform for change data capture. It provides a set of Kafka connectors which are capable of reading the database binlog files and producing the changes as events in Kafka.

As a CDC tool, it gives you the state of a database row before and after the commit and also tells you what kind of operation it was, i.e. create, update, or delete.

A Debezium Kafka message looks like:

{
"schema":{…

Suchit Gupta
