Spark Operator

The Spark Operator (Kubeflow spark-operator) enables running Apache Spark workloads natively on Kubernetes. It introduces two custom resource types that platform workloads use to define and schedule Spark jobs.

Custom Resources

CRD	Purpose
SparkApplication	A one-shot Spark job that runs once and completes
ScheduledSparkApplication	A cron-scheduled Spark job; the operator manages the lifecycle of each recurring run
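To illustrate how the two resource types relate, here is a minimal sketch that builds both manifests as Python dictionaries and prints them as JSON, which `kubectl apply -f` accepts alongside YAML. It assumes the upstream operator's `sparkoperator.k8s.io/v1beta2` API group and version. The image, file path, names, namespace, service account, schedule, Spark version, and resource sizes are hypothetical placeholders, not the platform's actual values.

```python
import json

# Assumed API group/version of the upstream spark-operator CRDs.
API_VERSION = "sparkoperator.k8s.io/v1beta2"


def spark_application_spec(image, main_file):
    """Return a small SparkApplication spec; all sizes are placeholders."""
    return {
        "type": "Python",
        "mode": "cluster",
        "image": image,
        "mainApplicationFile": main_file,
        "sparkVersion": "3.5.0",  # hypothetical version
        "restartPolicy": {"type": "Never"},
        "driver": {"cores": 1, "memory": "1g", "serviceAccount": "spark"},
        "executor": {"instances": 2, "cores": 1, "memory": "2g"},
    }


def spark_application(name, namespace, spec):
    """One-shot job: the spec sits directly under `spec`."""
    return {
        "apiVersion": API_VERSION,
        "kind": "SparkApplication",
        "metadata": {"name": name, "namespace": namespace},
        "spec": spec,
    }


def scheduled_spark_application(name, namespace, schedule, spec):
    """Recurring job: the same spec is nested under `spec.template`."""
    return {
        "apiVersion": API_VERSION,
        "kind": "ScheduledSparkApplication",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "schedule": schedule,           # standard cron expression
            "concurrencyPolicy": "Forbid",  # skip a run if the previous one is still active
            "template": spec,
        },
    }


if __name__ == "__main__":
    spec = spark_application_spec("example.org/spark-job:latest",
                                  "local:///opt/app/main.py")
    print(json.dumps(spark_application("example-once", "default", spec), indent=2))
    print(json.dumps(
        scheduled_spark_application("example-nightly", "default", "0 2 * * *", spec),
        indent=2))
```

The main structural difference is that a `ScheduledSparkApplication` wraps an ordinary application spec in `template` and adds scheduling fields around it, so one spec can be reused for both a test run and a scheduled production run.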

Platform workloads that use these:

Workload	Type	Bundle
PII scanning (prod)	`ScheduledSparkApplication`	`aws/datahub`
PII scanning (test)	`SparkApplication`	`aws/datahub`

Event Log Storage

An S3 bucket and prefix are provisioned for Spark event logs. Spark applications can write their event logs here, making them available in the Spark History Server for post-run debugging and performance analysis.
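As a sketch of what "writing event logs here" could look like in practice, the helper below produces the two standard Spark properties that turn on event logging and point it at a directory. The bucket and prefix names are hypothetical, and the `s3a://` scheme assumes the Hadoop S3A connector is available in the Spark image; the platform's actual values and scheme may differ.

```python
def event_log_conf(bucket, prefix):
    """Build Spark properties that enable event logging to an S3 location.

    `bucket` and `prefix` are placeholders for the provisioned values.
    The s3a:// scheme is an assumption (Hadoop S3A connector).
    """
    clean_prefix = prefix.strip("/")
    return {
        "spark.eventLog.enabled": "true",
        "spark.eventLog.dir": f"s3a://{bucket}/{clean_prefix}",
    }


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    for key, value in event_log_conf("example-spark-logs", "/spark-events/").items():
        print(f"{key}={value}")
```

In a `SparkApplication`, a dictionary like this would typically be merged into `spec.sparkConf`, whose values are strings. For the logs to appear in the History Server, its `spark.history.fs.logDirectory` setting would need to point at the same location.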

Relationship to JupyterHub

Notebook pods in JupyterHub are bound to spark-cluster-role via RBAC, which allows users to submit SparkApplication resources from notebook code. The resulting Spark jobs run as Kubernetes pods orchestrated by this operator.
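One way a notebook could submit a SparkApplication is through the Kubernetes custom objects API. The sketch below assumes the official `kubernetes` Python client is installed in the notebook image and that the pod's service account is the one granted access by the role binding; neither is confirmed by this page. The application name, namespace, image, and file path are hypothetical.

```python
# Assumed coordinates of the SparkApplication custom resource.
GROUP = "sparkoperator.k8s.io"
VERSION = "v1beta2"
PLURAL = "sparkapplications"


def minimal_manifest(name, namespace, image, main_file):
    """Build a bare-bones SparkApplication body; values are placeholders."""
    return {
        "apiVersion": f"{GROUP}/{VERSION}",
        "kind": "SparkApplication",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "type": "Python",
            "mode": "cluster",
            "image": image,
            "mainApplicationFile": main_file,
            "sparkVersion": "3.5.0",  # hypothetical version
            "driver": {"cores": 1, "memory": "1g"},
            "executor": {"instances": 1, "cores": 1, "memory": "1g"},
        },
    }


def submit_spark_application(manifest):
    """Create the resource using the pod's in-cluster credentials."""
    # Imported lazily so the module loads even without the client installed.
    from kubernetes import client, config

    config.load_incluster_config()  # uses the pod's mounted service account token
    api = client.CustomObjectsApi()
    return api.create_namespaced_custom_object(
        group=GROUP,
        version=VERSION,
        namespace=manifest["metadata"]["namespace"],
        plural=PLURAL,
        body=manifest,
    )
```

From a notebook cell, usage would look like `submit_spark_application(minimal_manifest("example-job", "default", "example.org/spark-job:latest", "local:///opt/app/main.py"))`. If the RBAC binding does not cover the target namespace, the API server would reject the call with a 403 error.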

Go Deeper

JupyterHub — submits Spark jobs from notebook sessions
Datahub — PII scanning runs as a ScheduledSparkApplication
