Spark Operator
The Spark Operator (Kubeflow spark-operator) enables running Apache Spark workloads natively on Kubernetes. It introduces two custom resource types that platform workloads use to define and schedule Spark jobs.
Custom Resources
| CRD | Purpose |
|---|---|
| SparkApplication | A one-shot Spark job that runs once and completes |
| ScheduledSparkApplication | A cron-scheduled Spark job; the operator manages the lifecycle of recurring runs |
Platform workloads that use these:
| Workload | Type | Bundle |
|---|---|---|
| PII scanning (prod) | ScheduledSparkApplication | aws/datahub |
| PII scanning (test) | SparkApplication | aws/datahub |
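
To make the difference between the two resource types concrete, the sketch below builds both manifests as Python dictionaries. The `sparkoperator.k8s.io/v1beta2` API version and the `schedule`, `concurrencyPolicy`, and `template` fields follow the upstream operator's CRDs. The job name, namespace, image, script path, Spark version, resource sizes, and cron expression are hypothetical placeholders, not values taken from the aws/datahub bundle.

```python
import copy

# API group/version used by the upstream Kubeflow spark-operator CRDs.
API_VERSION = "sparkoperator.k8s.io/v1beta2"


def spark_application(name, namespace, image, main_file):
    """Build a one-shot SparkApplication manifest (all values are placeholders)."""
    return {
        "apiVersion": API_VERSION,
        "kind": "SparkApplication",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "type": "Python",
            "mode": "cluster",
            "image": image,
            "mainApplicationFile": main_file,
            "sparkVersion": "3.5.0",  # assumed version, adjust to the image in use
            "restartPolicy": {"type": "Never"},
            "driver": {"cores": 1, "memory": "1g"},
            "executor": {"instances": 2, "cores": 1, "memory": "2g"},
        },
    }


def scheduled_spark_application(app, schedule):
    """Wrap an existing SparkApplication spec in a cron-scheduled resource."""
    return {
        "apiVersion": API_VERSION,
        "kind": "ScheduledSparkApplication",
        "metadata": copy.deepcopy(app["metadata"]),
        "spec": {
            "schedule": schedule,           # standard cron expression
            "concurrencyPolicy": "Forbid",  # skip a run if the previous one is still active
            "template": copy.deepcopy(app["spec"]),
        },
    }


# Hypothetical example values.
one_shot = spark_application(
    "pii-scan-test", "datahub", "example.com/spark-py:3.5.0", "local:///opt/app/scan.py"
)
recurring = scheduled_spark_application(one_shot, "0 2 * * *")
```

The scheduled variant reuses the one-shot spec unchanged under `spec.template`, which is why a test job can be promoted to a recurring one without rewriting its Spark settings.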
Event Log Storage
An S3 bucket and prefix are provisioned for Spark event logs. Spark applications that write their event logs to this location make them available to the Spark History Server for post-run debugging and performance analysis.
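
As a rough illustration of how an application might opt in, the sketch below builds the `sparkConf` entries that point Spark's event logging at an S3 location. `spark.eventLog.enabled` and `spark.eventLog.dir` are standard Spark properties. The bucket name and prefix are hypothetical, and the `s3a://` scheme assumes the Hadoop S3A connector is present in the Spark image.

```python
def event_log_conf(bucket, prefix):
    """Return sparkConf entries that send event logs to s3a://<bucket>/<prefix>."""
    log_dir = f"s3a://{bucket}/{prefix.strip('/')}"
    return {
        # sparkConf values are strings in the SparkApplication spec.
        "spark.eventLog.enabled": "true",
        "spark.eventLog.dir": log_dir,
    }


# Hypothetical bucket and prefix; substitute the provisioned values.
conf = event_log_conf("example-spark-logs", "/event-logs/")

# Fragment to merge into a SparkApplication's spec.
spec_fragment = {"sparkConf": conf}
```

For the logs to show up, the History Server's `spark.history.fs.logDirectory` would need to point at the same location.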
Relationship to JupyterHub
Notebook pods in JupyterHub run under a service account that is bound to spark-cluster-role via RBAC, which allows users to submit SparkApplication resources from notebook code. The resulting Spark jobs run as Kubernetes pods orchestrated by this operator.
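
A submission from a notebook cell could look roughly like the sketch below, which uses the official `kubernetes` Python client and its `CustomObjectsApi.create_namespaced_custom_object` call. It assumes the client library is installed in the notebook image and that spark-cluster-role grants `create` on `sparkapplications`. The manifest contents and the `jupyterhub` namespace are hypothetical, and `create_request` and `submit` are illustrative helper names.

```python
GROUP = "sparkoperator.k8s.io"
VERSION = "v1beta2"
PLURAL = "sparkapplications"


def create_request(manifest):
    """Map a SparkApplication manifest onto the keyword arguments of
    CustomObjectsApi.create_namespaced_custom_object."""
    if manifest.get("kind") != "SparkApplication":
        raise ValueError("expected a SparkApplication manifest")
    return {
        "group": GROUP,
        "version": VERSION,
        "namespace": manifest["metadata"]["namespace"],
        "plural": PLURAL,
        "body": manifest,
    }


def submit(manifest):
    """Create the resource using the notebook pod's service account."""
    # Imported lazily so the request builder works without the client installed.
    from kubernetes import client, config

    config.load_incluster_config()  # reads the pod's mounted service account token
    api = client.CustomObjectsApi()
    return api.create_namespaced_custom_object(**create_request(manifest))


# Hypothetical minimal manifest; a real one needs a complete spec.
manifest = {
    "apiVersion": f"{GROUP}/{VERSION}",
    "kind": "SparkApplication",
    "metadata": {"name": "notebook-job", "namespace": "jupyterhub"},
    "spec": {"type": "Python", "mode": "cluster"},
}
request = create_request(manifest)
```

Calling `submit(manifest)` inside a notebook pod would hand the resource to the API server, after which the operator launches the driver and executor pods.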
Go Deeper
- JupyterHub — submits Spark jobs from notebook sessions
- Datahub — PII scanning runs as a ScheduledSparkApplication