Celeborn
Celeborn is a remote shuffle service for distributed compute frameworks. It offloads Spark's shuffle phase — the expensive inter-stage data exchange — from executor pods to a dedicated, persistent shuffle cluster.
Why It Exists
In a standard Spark job, shuffle data (intermediate results exchanged between stages) is written to each executor's local disk and read by executors in the next stage. When executors are Kubernetes pods, local disks are small and ephemeral: when an executor fails, its shuffle data is lost and Spark has to rerun the upstream tasks that produced it.
Celeborn replaces local shuffle with a centralized service:
- Shuffle data is pushed to Celeborn workers during task execution
- Downstream tasks pull from Celeborn rather than the original executor
- Executor failures do not lose shuffle data — it is already in Celeborn
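
The difference between the two models can be illustrated with a toy in-memory simulation. This is only a sketch of the idea, not Celeborn's actual protocol: the class names `LocalShuffle` and `RemoteShuffle`, the executor names, and the record values are all hypothetical.

```python
from collections import defaultdict


class LocalShuffle:
    """Map output lives on the executor that produced it."""

    def __init__(self):
        self.blocks = defaultdict(dict)  # executor -> {partition: [records]}

    def write(self, executor, partition, records):
        self.blocks[executor].setdefault(partition, []).extend(records)

    def lose_executor(self, executor):
        # The pod's local disk disappears together with the pod.
        self.blocks.pop(executor, None)

    def read(self, partition):
        out = []
        for parts in self.blocks.values():
            out.extend(parts.get(partition, []))
        return sorted(out)


class RemoteShuffle:
    """Map output is pushed to a service keyed by partition, not by executor."""

    def __init__(self):
        self.partitions = defaultdict(list)

    def write(self, executor, partition, records):
        self.partitions[partition].extend(records)

    def lose_executor(self, executor):
        # Nothing shuffle-related was stored on the executor.
        pass

    def read(self, partition):
        return sorted(self.partitions[partition])


def run(shuffle):
    """Two executors write partition 0, one dies, then a reducer reads."""
    shuffle.write("exec-1", 0, [1, 2])
    shuffle.write("exec-2", 0, [3])
    shuffle.lose_executor("exec-1")
    return shuffle.read(0)


print(run(LocalShuffle()))   # [3]
print(run(RemoteShuffle()))  # [1, 2, 3]
```

In the local model the reducer sees incomplete data and the missing output has to be recomputed; in the remote model the read is unaffected by the executor loss.
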
Components
| Component | Description |
|---|---|
| Celeborn Master | Coordinates shuffle job registration and worker assignment. Supports HA mode with multiple replicas. |
| Celeborn Worker | Stores shuffle data on persistent volumes provisioned through PVCs. Horizontally scalable. |
Configuration
| Parameter | Purpose |
|---|---|
| master_replicas | Number of master replicas (HA requires ≥ 2) |
| worker_replicas | Number of worker replicas |
| master_heap_memory | JVM heap for master pods |
| worker_heap_memory | JVM heap for worker pods |
| worker_offheap_memory | Off-heap memory for shuffle data buffering |
| worker_disk_size | PVC size per worker for shuffle data |
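
As a rough aid for reasoning about these parameters together, the sketch below derives a few aggregate numbers from them. The field names mirror the table, but the `CelebornSizing` class, the use of whole GiB as units, and the example values are assumptions made for illustration rather than part of any real deployment tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CelebornSizing:
    master_replicas: int
    worker_replicas: int
    master_heap_gib: int     # master_heap_memory, assumed to be in GiB
    worker_heap_gib: int     # worker_heap_memory, assumed to be in GiB
    worker_offheap_gib: int  # worker_offheap_memory, assumed to be in GiB
    worker_disk_gib: int     # worker_disk_size, assumed to be in GiB

    def problems(self):
        """Return a list of human-readable configuration problems."""
        found = []
        if self.master_replicas < 1:
            found.append("master_replicas must be at least 1")
        if self.worker_replicas < 1:
            found.append("worker_replicas must be at least 1")
        return found

    @property
    def ha_enabled(self):
        # Mirrors the table's rule: HA requires at least 2 master replicas.
        return self.master_replicas >= 2

    @property
    def worker_memory_floor_gib(self):
        # A worker pod needs at least heap plus off-heap memory.
        return self.worker_heap_gib + self.worker_offheap_gib

    @property
    def total_shuffle_disk_gib(self):
        # Raw disk available for shuffle data across all workers.
        return self.worker_replicas * self.worker_disk_gib


# Hypothetical example values.
sizing = CelebornSizing(
    master_replicas=3,
    worker_replicas=5,
    master_heap_gib=2,
    worker_heap_gib=4,
    worker_offheap_gib=12,
    worker_disk_gib=200,
)
print(sizing.problems())               # []
print(sizing.ha_enabled)               # True
print(sizing.worker_memory_floor_gib)  # 16
print(sizing.total_shuffle_disk_gib)   # 1000
```

The memory floor is a lower bound for the worker pod's memory request; real pods also need some headroom beyond heap and off-heap.
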
Supported Frameworks
Celeborn supports Apache Spark, Flink, and Hadoop MapReduce as shuffle clients.
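
For Spark, acting as a shuffle client is mostly a matter of job configuration. The sketch below assembles the relevant properties and renders them as `spark-submit` arguments. The property names follow the Celeborn deployment documentation as best understood here and should be checked against the version in use; the helper functions and the master host names are hypothetical.

```python
def celeborn_spark_conf(master_endpoints):
    """Return Spark properties that route shuffle through Celeborn."""
    if not master_endpoints:
        raise ValueError("at least one Celeborn master endpoint is required")
    return {
        # Swap Spark's built-in shuffle manager for the Celeborn client.
        "spark.shuffle.manager": "org.apache.spark.shuffle.celeborn.SparkShuffleManager",
        # Comma-separated host:port list; list every master when running HA.
        "spark.celeborn.master.endpoints": ",".join(master_endpoints),
        # Spark's external shuffle service is turned off in this sketch.
        "spark.shuffle.service.enabled": "false",
    }


def to_submit_args(conf):
    """Flatten the properties into spark-submit `--conf key=value` arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


# Hypothetical in-cluster DNS names; 9097 is assumed to be the master port.
endpoints = [f"celeborn-master-{i}.celeborn-master-svc:9097" for i in range(3)]
conf = celeborn_spark_conf(endpoints)
print(" ".join(to_submit_args(conf)))
```

The Celeborn Spark client jar must also be on the job's classpath for the shuffle manager class to resolve.
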
Go Deeper
- Spark Team — Spark jobs from team namespaces use Celeborn for shuffle
- Spark Operator — manages the Spark jobs that delegate shuffle to Celeborn