Celeborn

Celeborn is a remote shuffle service for distributed compute frameworks. It offloads Spark's shuffle phase — the expensive inter-stage data exchange — from executor pods to a dedicated, persistent shuffle cluster.

Why It Exists

In a standard Spark job, shuffle data (intermediate results between stages) is written to each executor's local disk and read by executors in the next stage. When executors are Kubernetes pods, local disks are small and ephemeral: if an executor fails, its shuffle data is lost with it, and the upstream tasks that produced that data must be re-run before the next stage can continue.

Celeborn replaces local shuffle with a centralized service:

  • Shuffle data is pushed to Celeborn workers during task execution
  • Downstream tasks pull from Celeborn rather than the original executor
  • Executor failures do not lose shuffle data — it is already in Celeborn
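
The difference between the two models can be shown with a toy simulation. This is only an illustrative sketch, not Celeborn's implementation: the class names, the executor names, and the in-memory dictionaries standing in for disks and workers are all assumptions made for the example.

```python
from collections import defaultdict


class LocalShuffle:
    """Toy model: each executor keeps its map output on its own local disk."""

    def __init__(self):
        # executor name -> {partition id: [records]}
        self.disks = defaultdict(lambda: defaultdict(list))

    def write(self, executor, partition, records):
        self.disks[executor][partition].extend(records)

    def fail(self, executor):
        # The pod's ephemeral disk disappears together with the pod.
        self.disks.pop(executor, None)

    def read(self, partition):
        out = []
        for partitions in self.disks.values():
            out.extend(partitions.get(partition, []))
        return sorted(out)


class RemoteShuffle:
    """Toy model: map output is pushed to a shuffle worker as it is produced."""

    def __init__(self):
        # partition id -> [records], held by the (hypothetical) shuffle worker
        self.worker_store = defaultdict(list)

    def write(self, executor, partition, records):
        # The executor only acts as a sender; nothing is kept on its disk.
        self.worker_store[partition].extend(records)

    def fail(self, executor):
        # No shuffle data lived on the executor, so nothing is lost.
        pass

    def read(self, partition):
        return sorted(self.worker_store[partition])


def run(shuffle):
    """Two executors write to partition 0, then one fails before the read."""
    shuffle.write("exec-1", 0, ["a", "b"])
    shuffle.write("exec-2", 0, ["c"])
    shuffle.fail("exec-1")
    return shuffle.read(0)


if __name__ == "__main__":
    print("local :", run(LocalShuffle()))
    print("remote:", run(RemoteShuffle()))
```

In the local model the downstream read sees only the surviving executor's records; in the remote model it sees everything that was pushed before the failure.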

Components

Component        Description
Celeborn Master  Coordinates shuffle job registration and worker assignment. Supports HA mode with multiple replicas.
Celeborn Worker  Stores shuffle data on local persistent volumes (PVCs). Horizontally scalable.

Configuration

Parameter              Purpose
master_replicas        Number of master replicas (HA requires ≥ 2)
worker_replicas        Number of worker replicas
master_heap_memory     JVM heap for master pods
worker_heap_memory     JVM heap for worker pods
worker_offheap_memory  Off-heap memory for shuffle data buffering
worker_disk_size       PVC size per worker for shuffle data
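
As a rough sizing aid, the parameters above can be combined into a few derived figures. The sketch below is an assumption-laden illustration: it treats every memory and disk value as a plain integer in GiB, which may not match the format the real configuration expects, and the example values are invented. The `summarize` helper is hypothetical.

```python
def summarize(cfg):
    """Derive simple capacity figures from a config dict (all sizes in GiB)."""
    return {
        # Per the parameter table, HA needs at least two master replicas.
        "ha": cfg["master_replicas"] >= 2,
        # Each worker has its own PVC, so capacity scales with worker count.
        "total_shuffle_disk_gib": cfg["worker_replicas"] * cfg["worker_disk_size"],
        # Heap plus off-heap is a lower bound on memory needed per worker pod.
        "memory_per_worker_gib": cfg["worker_heap_memory"] + cfg["worker_offheap_memory"],
    }


# Invented example values, for illustration only.
example = {
    "master_replicas": 3,
    "worker_replicas": 4,
    "master_heap_memory": 2,
    "worker_heap_memory": 4,
    "worker_offheap_memory": 8,
    "worker_disk_size": 200,
}

if __name__ == "__main__":
    print(summarize(example))
```

With these values, four workers with 200 GiB each give 800 GiB of shuffle storage, and each worker needs at least 12 GiB of memory before any pod-level overhead.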

Supported Frameworks

Celeborn supports Apache Spark, Flink, and Hadoop MapReduce as shuffle clients.
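
For Spark, acting as a shuffle client mostly comes down to a handful of Spark properties. The sketch below builds them as a dictionary and renders them as `spark-submit` arguments. The property names follow the upstream Celeborn documentation as best understood here and should be checked against the deployed Celeborn version; the master host names are hypothetical, and 9097 is assumed to be the master port.

```python
def celeborn_spark_conf(master_endpoints):
    """Return the Spark properties that route shuffle through Celeborn."""
    if not master_endpoints:
        raise ValueError("at least one Celeborn master endpoint is required")
    return {
        # Replace Spark's built-in shuffle manager with the Celeborn client.
        "spark.shuffle.manager": "org.apache.spark.shuffle.celeborn.SparkShuffleManager",
        # Comma-separated host:port list of Celeborn masters.
        "spark.celeborn.master.endpoints": ",".join(master_endpoints),
        # The external shuffle service is not used when shuffle is remote.
        "spark.shuffle.service.enabled": "false",
    }


def to_submit_args(conf):
    """Render a properties dict as a flat list of --conf arguments."""
    args = []
    for key in sorted(conf):
        args += ["--conf", f"{key}={conf[key]}"]
    return args


if __name__ == "__main__":
    # Hypothetical in-cluster DNS names for three master replicas.
    endpoints = [f"celeborn-master-{i}.celeborn-master-svc:9097" for i in range(3)]
    print(" ".join(to_submit_args(celeborn_spark_conf(endpoints))))
```

Listing every master endpoint lets the client fail over to another master when running in HA mode.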

Go Deeper

  • Spark Team — Spark jobs from team namespaces use Celeborn for shuffle
  • Spark Operator — manages the Spark jobs that delegate shuffle to Celeborn