
JupyterHub + Enterprise Gateway Orchestration

This page describes the target platform behavior for moving JupyterHub + Jupyter Enterprise Gateway management from Terraform IaC into Control Plane orchestration.

Goal

Make JupyterHub notebook profiles and Enterprise Gateway kernel application specs first-class platform configuration managed by the Control Plane.

In the target model, platform operators define:

  • Notebook profiles (what users can pick in JupyterHub)
  • Kernel app specs (what Enterprise Gateway launches)

The platform then renders and applies:

  • JupyterHub singleuser.profileList values
  • Kubernetes ConfigMap data for Enterprise Gateway SparkApplication specs

Current State

The current setup is Terraform-driven and lives at this path (relative to the sparqd-infra-master root):

aws/aws-client-workspace/5-workspace-apps/jupyterhub

Current source branch:

sandbox-oqullus/0.0.1

Today, Terraform manages all of the following:

  • JupyterHub values (including singleuser.profileList)
  • Enterprise Gateway deployment
  • ConfigMap eg-sparkapp-specs populated from sparkapp-specs/*.json

This works, but every profile or kernel change requires an IaC rollout instead of flowing through the platform catalog and stack lifecycle.

Target State

The Control Plane becomes the source of truth for runtime notebook/kernel options.

Control Boundary

Control Plane does not create or manage SparkApplication CRDs directly.

Control Plane owns:

  • NotebookProfile definitions
  • KernelAppSpec definitions
  • Rendering KernelAppSpec into ConfigMap payloads used by Enterprise Gateway

Enterprise Gateway owns:

  • Reading kernel spec payload from the configured ConfigMap key
  • Creating runtime SparkApplication CRs when a notebook kernel is launched

Spark Operator owns:

  • Reconciling SparkApplication CRs
  • Spawning and managing Spark driver/executor pods

Architecture Visual

Control Plane Managed Objects

  1. NotebookProfile
  • UI-facing profile definition (name, slug, description)
  • Notebook runtime shape (image, cpu/memory, node placement, tolerations)
  • Kernel routing settings (local kernel or Enterprise Gateway mode)
  2. KernelAppSpec
  • Named kernel workload template used by Enterprise Gateway
  • Includes execution settings needed for SparkApplication launch behavior
  • Versioned and reusable across multiple notebook profiles
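
The two managed objects could be modeled roughly as follows. This is a minimal sketch: every field name, the (spec_id, version) reference tuple, and the example image and placement values are assumptions for illustration, not the platform's actual schema.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class KernelAppSpec:
    """Named, versioned kernel workload template (field names are assumptions)."""
    spec_id: str
    version: int
    slug: str
    spark_app: dict  # payload Enterprise Gateway uses to launch the kernel


@dataclass(frozen=True)
class NotebookProfile:
    """UI-facing JupyterHub profile (field names are assumptions)."""
    name: str
    slug: str
    description: str
    image: str
    cpu_limit: float
    mem_limit: str
    placement_id: str
    # None selects a local kernel; otherwise (spec_id, version) selects
    # Enterprise Gateway mode with that kernel spec.
    kernel_spec_ref: Optional[Tuple[str, int]] = None

    @property
    def uses_gateway(self) -> bool:
        return self.kernel_spec_ref is not None


# One spec reused by two profiles of different sizes, plus a local-kernel profile.
spark = KernelAppSpec("spark-default", 1, "spark-default", {"sparkConf": {}})
ref = (spark.spec_id, spark.version)
profiles = [
    NotebookProfile("Spark small", "spark-small", "2 CPU", "example/notebook:1", 2, "4Gi", "jupyterhub-arm", ref),
    NotebookProfile("Spark large", "spark-large", "8 CPU", "example/notebook:1", 8, "16Gi", "jupyterhub-arm", ref),
    NotebookProfile("Python only", "python", "local kernel", "example/notebook:1", 1, "2Gi", "jupyterhub-arm"),
]
```

Keeping the kernel spec as a separate object, referenced by ID and version, is what lets one spec back several profiles.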

Rendering Contract

At deploy/update time, the platform renders two artifacts from these objects:

  1. JupyterHub Helm values fragment
  • singleuser.profileList is generated from NotebookProfile records
  • Enterprise Gateway-enabled profiles include:
    • JUPYTER_GATEWAY_URL
    • JUPYTER_GATEWAY_AUTH_TOKEN
    • KERNEL_IMAGE
    • KERNEL_NAMESPACE
    • KERNEL_SERVICE_ACCOUNT_NAME
    • KERNEL_SPARKAPP_SPEC_CONFIG_MAP
    • KERNEL_SPARKAPP_SPEC_CONFIG_KEY
  2. Kernel spec ConfigMap
  • ConfigMap (for example eg-sparkapp-specs) is generated from KernelAppSpec records
  • Each spec is rendered under a stable key (for example <kernel-spec-slug>.json)
  • Profile entries reference the key via KERNEL_SPARKAPP_SPEC_CONFIG_KEY
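
The rendering step can be sketched as two pure functions, one per artifact. The input record keys (gateway_url, kernel_spec_slug, and so on) are hypothetical. The output uses the display_name / slug / description / kubespawner_override layout that KubeSpawner profile lists accept. JUPYTER_GATEWAY_AUTH_TOKEN is deliberately left out of this sketch because the page states that sensitive values are resolved from cluster secrets at execution time; how it is injected is not shown here.

```python
import json


def config_key(spec_slug: str) -> str:
    """Stable ConfigMap key for a kernel spec, for example 'spark-default.json'."""
    return f"{spec_slug}.json"


def render_profile_list(profiles: list, config_map_name: str) -> list:
    """Build a singleuser.profileList fragment from profile records."""
    rendered = []
    for p in profiles:
        env = {}
        if p.get("kernel_spec_slug"):  # Enterprise Gateway-enabled profile
            env = {
                "JUPYTER_GATEWAY_URL": p["gateway_url"],
                "KERNEL_IMAGE": p["kernel_image"],
                "KERNEL_NAMESPACE": p["kernel_namespace"],
                "KERNEL_SERVICE_ACCOUNT_NAME": p["kernel_service_account"],
                "KERNEL_SPARKAPP_SPEC_CONFIG_MAP": config_map_name,
                "KERNEL_SPARKAPP_SPEC_CONFIG_KEY": config_key(p["kernel_spec_slug"]),
            }
        rendered.append({
            "display_name": p["name"],
            "slug": p["slug"],
            "description": p["description"],
            "kubespawner_override": {"image": p["image"], "environment": env},
        })
    return rendered


def render_config_map_data(specs: list) -> dict:
    """Build ConfigMap data: one JSON document per kernel spec slug."""
    # sort_keys keeps the output byte-stable, which simplifies later diffing.
    return {config_key(s["slug"]): json.dumps(s["spark_app"], sort_keys=True) for s in specs}
```

Deriving both the ConfigMap key and the profile's KERNEL_SPARKAPP_SPEC_CONFIG_KEY from the same config_key helper keeps the two artifacts from drifting apart.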

End-to-End Flow

Lifecycle Behavior

  • Adding a profile updates JupyterHub profile options without requiring a Terraform run.
  • Updating a kernel spec updates ConfigMap content and is picked up by Enterprise Gateway-backed launches.
  • Disabling a profile removes it from user selection while preserving historical stack state.
  • Version rollback restores both profile definitions and kernel spec mappings together.

Validation Rules

Before applying, the Control Plane should reject invalid configurations:

  • Profile references a missing KernelAppSpec
  • Duplicate profile slug
  • Duplicate ConfigMap key output
  • Enterprise Gateway profile missing required env mapping fields
  • Invalid resource limits/requests shape
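
A pre-apply check covering the five rules above might look like the sketch below. The record layout (slug, kernel_spec_slug, env, cpu_limit, mem_limit) and the memory-quantity pattern are assumptions; a real implementation would validate against the platform's own schema.

```python
import re
from collections import Counter

REQUIRED_GATEWAY_ENV = (
    "JUPYTER_GATEWAY_URL", "JUPYTER_GATEWAY_AUTH_TOKEN", "KERNEL_IMAGE",
    "KERNEL_NAMESPACE", "KERNEL_SERVICE_ACCOUNT_NAME",
    "KERNEL_SPARKAPP_SPEC_CONFIG_MAP", "KERNEL_SPARKAPP_SPEC_CONFIG_KEY",
)
# Simplified quantity pattern (assumption): integer plus optional unit suffix.
_MEMORY = re.compile(r"^\d+(Ki|Mi|Gi|Ti|K|M|G|T)?$")


def _duplicates(values):
    return sorted(v for v, n in Counter(values).items() if n > 1)


def validate_config(profiles: list, specs: list) -> list:
    """Return a list of human-readable errors; an empty list means valid."""
    errors = [f"duplicate profile slug: {s}" for s in _duplicates(p["slug"] for p in profiles)]
    errors += [f"duplicate ConfigMap key: {k}" for k in _duplicates(f"{s['slug']}.json" for s in specs)]
    known = {s["slug"] for s in specs}
    for p in profiles:
        ref = p.get("kernel_spec_slug")
        if ref is not None:  # Enterprise Gateway mode
            if ref not in known:
                errors.append(f"{p['slug']}: missing KernelAppSpec {ref}")
            missing = [k for k in REQUIRED_GATEWAY_ENV if not p.get("env", {}).get(k)]
            if missing:
                errors.append(f"{p['slug']}: missing env fields {missing}")
        cpu, mem = p.get("cpu_limit"), p.get("mem_limit")
        if isinstance(cpu, bool) or not isinstance(cpu, (int, float)) or cpu <= 0:
            errors.append(f"{p['slug']}: cpu_limit must be a positive number")
        if not isinstance(mem, str) or not _MEMORY.match(mem):
            errors.append(f"{p['slug']}: mem_limit must look like '4Gi'")
    return errors
```

Collecting all errors rather than stopping at the first one lets an operator fix a rejected configuration in a single pass.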

KernelAppSpec Management Rules

To prevent platform breakage, KernelAppSpec must be validated with a policy model, not treated as free-form JSON.

1. Field Governance Tiers

Classify every field into one of three tiers:

  • PlatformRequired: must exist and cannot be overridden by tenant profile-level input.
  • PlatformDefault: defaulted by platform, tenant may override within guardrails.
  • TenantConfigurable: tenant can set freely within schema/type constraints.

2. Mandatory Platform Contract

The following must be enforced as PlatformRequired for Spark kernels:

  • Driver and executor env entries for Infisical integration:
    • INFISICAL_CLIENT_ID
    • INFISICAL_CLIENT_SECRET
    • INFISICAL_ENVIRONMENT_SLUG
    • INFISICAL_PROJECT_ID
    • INFISICAL_URL
  • Spark conf integration for managed data plane:
    • spark.hadoop.hive.metastore.uris
    • spark.sql.catalogImplementation=hive
    • spark.sql.warehouse.dir
  • Spark event and storage wiring used by platform observability/runtime:
    • spark.eventLog.enabled=true
    • spark.eventLog.dir
  • Kubernetes identity and execution boundary:
    • driver.serviceAccount
    • executor.serviceAccount
  • Node placement governance:
    • driver.nodeSelector, executor.nodeSelector, and tolerations must resolve from platform-managed placement catalog entries
    • Raw arbitrary node selector keys/values are not allowed in tenant-managed spec input
  • Image provenance:
    • image, driver.image, executor.image, and required init container images must come from Quantdata-approved registries only
    • Allowed sources include Quantdata public ECR and Quantdata private registry paths
  • Volume permission and local-dir mount contract:
    • driver.initContainers includes the platform volume-permissions container
    • executor.initContainers includes the platform volume-permissions container
    • Required PVC mount keys for driver/executor local dir remain present and immutable except size fields
    • spark.local.dir remains bound to the platform-managed mount path

These keys must be platform-managed values, not tenant-managed values.

If any required key is missing, empty, changed to a disallowed value, or supplied from a non-platform-managed source, Control Plane must reject the spec version.
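
Part of this contract can be expressed as a presence-and-value check. The sketch below assumes the spec is a dict with driver, executor, and sparkConf sections and that env entries are name/value (or name/valueFrom) objects. It covers only the Infisical, Spark conf, and service account items; verifying that a value actually came from a platform-managed source needs render-time provenance and is not shown.

```python
INFISICAL_ENV = (
    "INFISICAL_CLIENT_ID", "INFISICAL_CLIENT_SECRET", "INFISICAL_ENVIRONMENT_SLUG",
    "INFISICAL_PROJECT_ID", "INFISICAL_URL",
)
# None means "any non-empty value"; a string means "must equal exactly this".
REQUIRED_SPARK_CONF = {
    "spark.hadoop.hive.metastore.uris": None,
    "spark.sql.catalogImplementation": "hive",
    "spark.sql.warehouse.dir": None,
    "spark.eventLog.enabled": "true",
    "spark.eventLog.dir": None,
}


def contract_violations(spec: dict) -> list:
    """Return PlatformRequired violations; an empty list means the check passed."""
    problems = []
    for role in ("driver", "executor"):
        section = spec.get(role) or {}
        present = {
            e.get("name") for e in section.get("env") or []
            if e.get("value") or e.get("valueFrom")
        }
        for name in INFISICAL_ENV:
            if name not in present:
                problems.append(f"{role}.env: {name} missing or empty")
        if not section.get("serviceAccount"):
            problems.append(f"{role}.serviceAccount missing or empty")
    conf = spec.get("sparkConf") or {}
    for key, fixed in REQUIRED_SPARK_CONF.items():
        value = conf.get(key)
        if not value:
            problems.append(f"sparkConf: {key} missing or empty")
        elif fixed is not None and value != fixed:
            problems.append(f"sparkConf: {key} must be {fixed!r}")
    return problems
```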

3. Policy Enforcement Strategy

Use a two-layer model on create/update:

  1. Schema validation
  • JSON schema/type checks, required sections (driver, executor, sparkConf).
  2. Policy validation
  • Evaluate required keys and value constraints.
  • Evaluate disallowed keys (for example unsafe hostPath volumes, privileged container settings, forbidden namespaces).
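
The layering can be sketched as two functions run in order, where the policy layer only ever sees structurally valid input. The locations checked for hostPath volumes (a top-level volumes list) and privileged settings (a securityContext under driver and executor) are assumptions about the spec layout, and the returned status strings simply reuse the version states named on this page.

```python
REQUIRED_SECTIONS = ("driver", "executor", "sparkConf")


def schema_errors(spec) -> list:
    """Layer 1: structural checks only."""
    if not isinstance(spec, dict):
        return ["spec must be an object"]
    return [
        f"{name}: required object section"
        for name in REQUIRED_SECTIONS
        if not isinstance(spec.get(name), dict)
    ]


def policy_errors(spec: dict) -> list:
    """Layer 2: disallowed settings. Assumes schema_errors(spec) was empty."""
    errors = []
    for vol in spec.get("volumes", []):
        if "hostPath" in vol:
            errors.append(f"volumes.{vol.get('name', '?')}: hostPath is not allowed")
    for role in ("driver", "executor"):
        if spec[role].get("securityContext", {}).get("privileged"):
            errors.append(f"{role}.securityContext.privileged is not allowed")
    return errors


def validate_version(spec):
    """Return (status, errors); the version stays Draft unless both layers pass."""
    errors = schema_errors(spec)
    if errors:
        return "Draft", errors  # the policy layer never runs on malformed input
    errors = policy_errors(spec)
    return ("Validated", []) if not errors else ("Draft", errors)
```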

Only validated versions can become Active.

4. Merge and Override Rules

When rendering the final ConfigMap payload:

  1. Start from platform base spec (immutable baseline).
  2. Apply tenant/workspace overrides only for allowed fields.
  3. Re-apply and lock PlatformRequired keys last.

This guarantees that even if override payload includes conflicting values, required platform behavior wins.
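
A minimal sketch of the three-step merge, assuming fields are addressed by tuple paths (tuples rather than dotted strings, because Spark conf keys themselves contain dots). The function name and the shape of the override and required maps are illustrative.

```python
import copy


def _set(doc: dict, path: tuple, value) -> None:
    """Set doc[path[0]][path[1]]... = value, creating intermediate dicts."""
    for part in path[:-1]:
        doc = doc.setdefault(part, {})
    doc[path[-1]] = value


def render_spec(base: dict, overrides: dict, allowed_paths: set, required: dict) -> dict:
    # Step 1: start from the platform base spec without mutating it.
    result = copy.deepcopy(base)
    # Step 2: apply tenant/workspace overrides only for allowed fields.
    for path, value in overrides.items():
        if path in allowed_paths:
            _set(result, path, value)
    # Step 3: re-apply PlatformRequired keys last so they always win.
    for path, value in required.items():
        _set(result, path, value)
    return result
```

Even if an allowlist is misconfigured to include a required path, step 3 still overwrites the tenant value. This sketch silently ignores disallowed overrides; rejecting them outright during validation, as described elsewhere on this page, is complementary.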

For storage overrides specifically:

  • Tenants may change only size-related values (for example *.options.sizeLimit).
  • Tenants may not change claim names, storage class, mount path, mount mode, or permission init container behavior.
  • Supported PVC pattern is only Spark conf PVC wiring:
    • spark.kubernetes.driver.volumes.persistentVolumeClaim.*
    • spark.kubernetes.executor.volumes.persistentVolumeClaim.*
  • Alternative volume injection methods in KernelAppSpec are not supported (for example arbitrary pod-volume templates, hostPath, or non-platform PVC wiring).
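
The storage rule reduces to a key filter: among Spark conf overrides that touch volumes, only PVC sizeLimit options pass. The sketch below is one possible check; the function name is hypothetical, and it inspects only keys containing ".volumes.", leaving other Spark conf keys to other rules.

```python
import re

# Matches spark.kubernetes.{driver,executor}.volumes.persistentVolumeClaim.<name>.options.sizeLimit
_SIZE_KEY = re.compile(
    r"^spark\.kubernetes\.(driver|executor)\.volumes"
    r"\.persistentVolumeClaim\.[^.]+\.options\.sizeLimit$"
)


def rejected_storage_overrides(overrides: dict) -> list:
    """Return volume-related override keys that tenants may not set."""
    return sorted(
        key for key in overrides
        if ".volumes." in key and not _SIZE_KEY.match(key)
    )
```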

For managed Spark paths specifically:

  • spark.sql.warehouse.dir is platform-managed and non-overridable.
  • spark.eventLog.dir is platform-managed and non-overridable.
  • spark.hadoop.hive.metastore.uris is platform-managed and non-overridable.
  • Control Plane resolves these values from workspace managed endpoints/buckets/paths during render.

For image overrides specifically:

  • Tenant-provided image values are allowed only when they match the registry allowlist policy.
  • Any image outside approved Quantdata registries is rejected.
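
A registry allowlist check can be as small as a prefix match. The two prefixes below are placeholders, not the real Quantdata registry paths. Each one ends with a slash so that a lookalike host such as registry.quantdata.example.evil.io cannot pass.

```python
# Hypothetical approved prefixes; substitute the real registry paths.
APPROVED_PREFIXES = (
    "public.ecr.aws/quantdata-example/",
    "registry.quantdata.example/",
)


def image_allowed(image: str) -> bool:
    """True only when the image reference starts with an approved registry path."""
    return image.startswith(APPROVED_PREFIXES)
```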

5. Versioning and Promotion

  • Draft: editable, not deployable.
  • Validated: passed schema + policy checks.
  • Active: referenced by one or more NotebookProfiles.
  • Deprecated: blocked for new profile bindings, still readable for rollback/history.

Notebook profiles should reference a stable KernelAppSpec ID + version, not raw inline JSON.
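
The four states can be modeled as a small state machine. The transition table below is an assumption (the page names the states but not every allowed move), as is the rule that both Validated and Active versions accept new profile bindings.

```python
from enum import Enum


class SpecStatus(Enum):
    DRAFT = "Draft"
    VALIDATED = "Validated"
    ACTIVE = "Active"
    DEPRECATED = "Deprecated"


# Assumed transitions; adjust to the platform's actual promotion policy.
TRANSITIONS = {
    SpecStatus.DRAFT: {SpecStatus.VALIDATED},
    SpecStatus.VALIDATED: {SpecStatus.ACTIVE, SpecStatus.DEPRECATED},
    SpecStatus.ACTIVE: {SpecStatus.DEPRECATED},
    SpecStatus.DEPRECATED: set(),
}


def advance(current: SpecStatus, target: SpecStatus) -> SpecStatus:
    """Return the new status, or raise if the move is not allowed."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"cannot move {current.value} -> {target.value}")
    return target


def can_bind_new_profile(status: SpecStatus) -> bool:
    """Draft is not deployable and Deprecated is blocked for new bindings."""
    return status in {SpecStatus.VALIDATED, SpecStatus.ACTIVE}
```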

5a. Node Placement Catalog

Control Plane should maintain a placement catalog (for example spark-graviton4-mem-instance, jupyterhub-arm) that defines:

  • allowed nodeSelector key/value sets
  • required tolerations
  • architecture/cloud compatibility constraints

NotebookProfile and KernelAppSpec should reference placement IDs from this catalog. During render, Control Plane expands the placement ID into concrete selector/toleration values.
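
Placement expansion is a catalog lookup that fails closed on unknown IDs. In the sketch below the catalog contents are illustrative: kubernetes.io/arch is a standard node label, while the example.com/... label and taint keys are made up.

```python
import copy

# Hypothetical catalog entry keyed by placement ID.
PLACEMENTS = {
    "jupyterhub-arm": {
        "node_selector": {"kubernetes.io/arch": "arm64", "example.com/pool": "jupyterhub"},
        "tolerations": [
            {"key": "example.com/dedicated", "operator": "Equal",
             "value": "jupyterhub", "effect": "NoSchedule"},
        ],
    },
}


def expand_placement(placement_id: str, catalog: dict = PLACEMENTS) -> dict:
    """Resolve a placement ID into concrete selector/toleration values."""
    try:
        entry = catalog[placement_id]
    except KeyError:
        raise ValueError(f"unknown placement ID: {placement_id}") from None
    # Return a copy so callers cannot mutate the shared catalog.
    return copy.deepcopy(entry)
```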

6. Runtime Drift Protection

Control Plane should continuously compare desired rendered ConfigMap content vs. cluster state.

  • Drift detected: mark workspace app Degraded.
  • Auto-reconcile by reapplying desired ConfigMap via Agent.
  • Repeated drift: raise policy/security event for investigation.
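
Drift detection compares two key/value maps: the rendered ConfigMap data and what the cluster reports. The sketch below keeps that comparison pure and separates it from the decision about what to do. The action names and the repeat threshold of 3 are assumptions.

```python
def config_map_drift(desired: dict, actual: dict) -> dict:
    """Classify differences between desired and observed ConfigMap data."""
    return {
        "missing": sorted(desired.keys() - actual.keys()),
        "unexpected": sorted(actual.keys() - desired.keys()),
        "changed": sorted(
            k for k in desired.keys() & actual.keys() if desired[k] != actual[k]
        ),
    }


def plan_actions(drift: dict, recent_drift_count: int, repeat_threshold: int = 3) -> list:
    """Map a drift report to follow-up actions (action names are illustrative)."""
    if not any(drift.values()):
        return []
    actions = ["mark_degraded", "reapply_config_map"]
    # Count the current detection together with earlier ones in the window.
    if recent_drift_count + 1 >= repeat_threshold:
        actions.append("raise_security_event")
    return actions
```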

7. Practical Rule for Your Example

For your provided spec, enforce these as hard checks:

  • Reject if any Infisical env var is removed from driver.env or executor.env.
  • Reject if spark.hadoop.hive.metastore.uris is missing or not platform-managed.
  • Reject if spark.sql.warehouse.dir is missing, not platform-managed, or changed from the platform-managed value.
  • Reject if spark.eventLog.dir is missing, not platform-managed, or changed from the platform-managed value.
  • Reject if driver.serviceAccount / executor.serviceAccount do not match allowed service accounts for the workspace.
  • Reject if any runtime image does not match approved Quantdata registry prefixes.
  • Reject if node selectors or tolerations do not match a platform catalog placement entry.
  • Reject if the volume-permissions init container is removed or altered beyond allowed fields.
  • Reject if driver/executor PVC mount structure is changed (except sizeLimit values).
  • Reject if spec uses non-supported volume methods outside the approved Spark PVC keys.
  • Reject if KERNEL_SPARKAPP_SPEC_CONFIG_KEY in NotebookProfile points to a non-validated spec version.

Image Catalog Governance

Control Plane should own a centralized image catalog and KernelAppSpec should reference catalog IDs, not arbitrary image strings.

Catalog Object

Each catalog entry should include:

  • imageId (stable logical ID, for example spark-aws-core-3.5.5)
  • registry/repository
  • tag and/or immutable digest
  • visibility (public, private)
  • status (active, deprecated, blocked)
  • constraints (cloud, architecture, workspace tier, kernel type)

Spec Reference Model

KernelAppSpec references image entries:

  • runtimeImageRef
  • driverImageRef
  • executorImageRef
  • initImageRefs[]

During render, Control Plane resolves refs to concrete image values (prefer digest-pinned form).

Enforcement

  • Reject specs that use raw image URLs when *ImageRef is required by policy.
  • Reject references to deprecated or blocked images for new activations.
  • Reject references violating constraints (for example amd64 image on arm64-only profile policy).
  • Allow existing running workloads to continue during deprecation windows, but block new promotions.
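
Resolution and enforcement for new activations can be combined in one lookup. The sketch below assumes a catalog keyed by imageId and a single target architecture per resolution; the entry fields mirror the catalog object described above, but their exact names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ImageEntry:
    image_id: str
    repository: str            # registry/repository path
    tag: str
    digest: Optional[str]      # for example "sha256:..."; preferred when present
    status: str                # candidate | active | deprecated | blocked
    architectures: tuple       # for example ("arm64",)


def resolve_image(catalog: dict, image_ref: str, arch: str) -> str:
    """Resolve a catalog ID to a concrete image string for a NEW activation."""
    entry = catalog.get(image_ref)
    if entry is None:
        raise ValueError(f"unknown image ref: {image_ref}")
    if entry.status != "active":
        raise ValueError(f"{image_ref} is {entry.status}; only active images can be activated")
    if arch not in entry.architectures:
        raise ValueError(f"{image_ref} does not support {arch}")
    if entry.digest:
        return f"{entry.repository}@{entry.digest}"
    return f"{entry.repository}:{entry.tag}"
```

This function gates new activations only. Letting existing workloads continue during a deprecation window would be handled by not re-resolving their already-rendered image values.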

Promotion Workflow

Suggested image lifecycle:

  1. candidate: available for internal validation only
  2. active: allowed for production KernelAppSpec activation
  3. deprecated: no new bindings, existing bindings allowed temporarily
  4. blocked: disallowed for all new launches and updates

Operational Benefit

This prevents broken or untrusted image drift and allows platform-wide upgrades/rollbacks by changing catalog policy rather than editing every KernelAppSpec.

Failure Scenarios

  • ConfigMap apply fails: keep previous active profile/kernel config and mark update failed.
  • JupyterHub upgrade fails: profile change is not activated; workspace app remains on previous release.
  • Enterprise Gateway unavailable: depending on platform policy, Enterprise Gateway profiles are either hidden until the gateway health check passes or kept visible and marked unavailable in the UI.

Security Model

  • Sensitive values (for example gateway auth token) are resolved from cluster secrets at execution time, not stored as plaintext in Control Plane state.
  • Control Plane stores references and rendered structure, while secret material remains tenant-side.
  • All updates follow the existing Agent pull-execution and mTLS authentication model.

Rollout Plan

  1. Mirror mode
  • Keep existing Terraform path as source.
  • Add Control Plane read-only projection to render equivalent profile/config outputs for comparison.
  2. Control Plane authority mode
  • Control Plane becomes write source for profileList and kernel spec ConfigMap.
  • Terraform stops managing those specific fields/artifacts.
  3. Full migration
  • JupyterHub + Enterprise Gateway stack behavior is managed through catalog specs and workspace stack lifecycle only.

Out of Scope

  • Replacing the underlying JupyterHub or Enterprise Gateway Helm charts
  • Changing Spark operator behavior outside kernel spec payload definitions
  • Redesigning tenant IAM/IRSA model
