Spark Operator

The Spark Operator (Kubeflow spark-operator) enables running Apache Spark workloads natively on Kubernetes. It introduces two custom resource types that platform workloads use to define and schedule Spark jobs.

Custom Resources

| CRD | Purpose |
| --- | --- |
| SparkApplication | A one-shot Spark job that runs once and completes |
| ScheduledSparkApplication | A cron-scheduled Spark job; the operator manages the lifecycle of recurring runs |

Platform workloads that use these:

| Workload | Type | Bundle |
| --- | --- | --- |
| PII scanning (prod) | ScheduledSparkApplication | aws/datahub |
| PII scanning (test) | SparkApplication | aws/datahub |
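
As a rough illustration of how the two resource types differ, the sketch below builds both manifests as Python dictionaries. The `sparkoperator.k8s.io/v1beta2` API version and the `schedule`, `concurrencyPolicy`, and `template` fields come from the upstream spark-operator CRDs. Everything else (the image, file path, namespace, service account, resource sizes, Spark version, and cron expression) is a hypothetical placeholder, not this platform's actual configuration.

```python
import json

API_VERSION = "sparkoperator.k8s.io/v1beta2"


def spark_application_spec(image, main_file):
    """Return a SparkApplicationSpec dict that both resource kinds can share."""
    return {
        "type": "Python",
        "mode": "cluster",
        "image": image,                        # hypothetical image name
        "mainApplicationFile": main_file,      # hypothetical script path
        "sparkVersion": "3.5.0",               # assumed version, adjust as needed
        "restartPolicy": {"type": "Never"},
        "driver": {"cores": 1, "memory": "1g", "serviceAccount": "spark"},
        "executor": {"instances": 2, "cores": 1, "memory": "2g"},
    }


def spark_application(name, namespace, spec):
    """One-shot job: the spec sits directly under `spec`."""
    return {
        "apiVersion": API_VERSION,
        "kind": "SparkApplication",
        "metadata": {"name": name, "namespace": namespace},
        "spec": spec,
    }


def scheduled_spark_application(name, namespace, schedule, spec):
    """Recurring job: the same spec is nested under `spec.template`."""
    return {
        "apiVersion": API_VERSION,
        "kind": "ScheduledSparkApplication",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "schedule": schedule,              # standard cron expression
            "concurrencyPolicy": "Forbid",     # skip a run if the previous one is still active
            "template": spec,
        },
    }


if __name__ == "__main__":
    spec = spark_application_spec("example/pii-scan:latest", "local:///app/scan.py")
    one_shot = spark_application("pii-scan-test", "example-ns", spec)
    recurring = scheduled_spark_application("pii-scan-prod", "example-ns", "0 2 * * *", spec)
    print(json.dumps(one_shot, indent=2))
    print(json.dumps(recurring, indent=2))
```

The only structural difference is where the job definition lives: a ScheduledSparkApplication wraps the same spec in `template` and adds the schedule around it.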

Event Log Storage

An S3 bucket and prefix are provisioned for Spark event logs. Spark applications can write their event logs here, making them available in the Spark History Server for post-run debugging and performance analysis.
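
A minimal sketch of how an application might opt in to event logging, assuming the manifest is held as a Python dictionary. `spark.eventLog.enabled` and `spark.eventLog.dir` are standard Spark properties, and `sparkConf` is the operator's field for passing them. The bucket name, the prefix, and the `s3a://` scheme are assumptions; substitute the values actually provisioned for the platform.

```python
import copy


def with_event_logging(manifest, bucket, prefix):
    """Return a copy of the manifest with Spark event logging pointed at S3.

    Works for both kinds: a ScheduledSparkApplication keeps its Spark
    settings under spec.template, a SparkApplication directly under spec.
    """
    updated = copy.deepcopy(manifest)
    is_scheduled = updated.get("kind") == "ScheduledSparkApplication"
    spec = updated["spec"]["template"] if is_scheduled else updated["spec"]

    spark_conf = spec.setdefault("sparkConf", {})
    spark_conf["spark.eventLog.enabled"] = "true"
    # Assumed s3a:// scheme; the bucket and prefix are hypothetical placeholders.
    spark_conf["spark.eventLog.dir"] = f"s3a://{bucket}/{prefix.strip('/')}/"
    return updated


if __name__ == "__main__":
    app = {"kind": "SparkApplication", "metadata": {"name": "demo"}, "spec": {}}
    print(with_event_logging(app, "example-spark-logs", "spark-events"))
```

The History Server has to read from the same location for the logs to show up there, so both sides should take the bucket and prefix from one source.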

Relationship to JupyterHub

Notebook pods in JupyterHub are bound to spark-cluster-role via RBAC, which allows users to submit SparkApplication resources from notebook code. The submitted Spark jobs then run as Kubernetes pods orchestrated by this operator.
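
Submission from a notebook could look roughly like the sketch below. It assumes the official `kubernetes` Python client is installed in the notebook image, which this page does not confirm. `submit_spark_application` and `in_cluster_api` are hypothetical helper names. The API object is passed in as a parameter so the helper can be exercised without a cluster.

```python
GROUP = "sparkoperator.k8s.io"
VERSION = "v1beta2"
PLURAL = "sparkapplications"


def in_cluster_api():
    """Build a CustomObjectsApi from the pod's own service account credentials.

    Imported lazily so the rest of this module loads even where the
    kubernetes client package is not installed.
    """
    from kubernetes import client, config

    config.load_incluster_config()
    return client.CustomObjectsApi()


def submit_spark_application(api, manifest):
    """Create the SparkApplication in the namespace named in its metadata."""
    namespace = manifest["metadata"]["namespace"]
    return api.create_namespaced_custom_object(
        group=GROUP,
        version=VERSION,
        namespace=namespace,
        plural=PLURAL,
        body=manifest,
    )
```

In a notebook cell this would be called as `submit_spark_application(in_cluster_api(), manifest)`. Whether the call is accepted depends on the permissions that spark-cluster-role actually grants to the notebook pod.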

Go Deeper

  • JupyterHub — submits Spark jobs from notebook sessions
  • Datahub — PII scanning runs as a ScheduledSparkApplication