JupyterHub + Enterprise Gateway Orchestration
This page describes the target platform behavior for moving JupyterHub + Jupyter Enterprise Gateway management from Terraform IaC into Control Plane orchestration.
Goal
Make JupyterHub notebook profiles and Enterprise Gateway kernel application specs first-class platform configuration managed by the Control Plane.
In the target model, platform operators define:
- Notebook profiles (what users can pick in JupyterHub)
- Kernel app specs (what Enterprise Gateway launches)
The platform then renders and applies:
- JupyterHub singleuser.profileList values
- Kubernetes ConfigMap data for Enterprise Gateway SparkApplication specs
Current State
The current setup is Terraform-driven and lives at the following path (relative to the sparqd-infra-master root):
aws/aws-client-workspace/5-workspace-apps/jupyterhub
Current source branch:
sandbox-oqullus/0.0.1
Today, Terraform manages both:
- JupyterHub values (including singleuser.profileList)
- Enterprise Gateway deployment
- ConfigMap eg-sparkapp-specs populated from sparkapp-specs/*.json
This works, but profile/kernel changes are coupled to IaC rollout instead of the platform catalog and stack lifecycle.
Target State
The Control Plane becomes the source of truth for runtime notebook/kernel options.
Control Boundary
Control Plane does not create or manage SparkApplication custom resources (CRs) directly.
Control Plane owns:
- NotebookProfile definitions
- KernelAppSpec definitions
- Rendering KernelAppSpec into ConfigMap payloads used by Enterprise Gateway
Enterprise Gateway owns:
- Reading kernel spec payload from the configured ConfigMap key
- Creating runtime SparkApplication CRs when a notebook kernel is launched
Spark Operator owns:
- Reconciling SparkApplication CRs
- Spawning and managing Spark driver/executor pods
Architecture Visual
Control Plane Managed Objects
NotebookProfile
- UI-facing profile definition (name, slug, description)
- Notebook runtime shape (image, cpu/memory, node placement, tolerations)
- Kernel routing settings (local kernel or Enterprise Gateway mode)
KernelAppSpec
- Named kernel workload template used by Enterprise Gateway
- Includes execution settings needed for SparkApplication launch behavior
- Versioned and reusable across multiple notebook profiles
Rendering Contract
At deploy/update time, the platform renders two artifacts from these objects:
- JupyterHub Helm values fragment
  - singleuser.profileList is generated from NotebookProfile records
  - Enterprise Gateway-enabled profiles include:
    - JUPYTER_GATEWAY_URL
    - JUPYTER_GATEWAY_AUTH_TOKEN
    - KERNEL_IMAGE
    - KERNEL_NAMESPACE
    - KERNEL_SERVICE_ACCOUNT_NAME
    - KERNEL_SPARKAPP_SPEC_CONFIG_MAP
    - KERNEL_SPARKAPP_SPEC_CONFIG_KEY
- Kernel spec ConfigMap
  - ConfigMap (for example eg-sparkapp-specs) is generated from KernelAppSpec records
  - Each spec is rendered under a stable key (for example <kernel-spec-slug>.json)
  - Profile entries reference the key via KERNEL_SPARKAPP_SPEC_CONFIG_KEY
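The rendering contract above can be sketched roughly as follows. This is a minimal illustration, not the platform's implementation: the NotebookProfile and KernelAppSpec field names, the helper functions, and the CONFIG_MAP_NAME constant are assumptions made for the example. The profile entry shape (display_name, slug, kubespawner_override) follows the KubeSpawner profile list format, and only the two ConfigMap-related environment variables are rendered; the remaining gateway variables would come from workspace configuration.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class KernelAppSpec:
    slug: str
    body: dict  # SparkApplication template consumed by Enterprise Gateway


@dataclass(frozen=True)
class NotebookProfile:
    slug: str
    display_name: str
    image: str
    kernel_spec_slug: Optional[str] = None  # None means a local-kernel profile


CONFIG_MAP_NAME = "eg-sparkapp-specs"  # example name used in the text above


def config_key(spec_slug: str) -> str:
    """Stable ConfigMap key for a kernel spec: <kernel-spec-slug>.json."""
    return f"{spec_slug}.json"


def render_config_map_data(specs: list) -> dict:
    """Render every KernelAppSpec as a JSON string under its stable key."""
    return {config_key(s.slug): json.dumps(s.body, sort_keys=True) for s in specs}


def render_profile_list(profiles: list) -> list:
    """Render singleuser.profileList entries from NotebookProfile records."""
    entries = []
    for p in profiles:
        override = {"image": p.image}
        if p.kernel_spec_slug is not None:
            # Gateway-enabled profiles point at the rendered ConfigMap key.
            override["environment"] = {
                "KERNEL_SPARKAPP_SPEC_CONFIG_MAP": CONFIG_MAP_NAME,
                "KERNEL_SPARKAPP_SPEC_CONFIG_KEY": config_key(p.kernel_spec_slug),
            }
        entries.append(
            {"display_name": p.display_name, "slug": p.slug, "kubespawner_override": override}
        )
    return entries
```

Deriving both artifacts from the same slug through one helper keeps the profile reference and the ConfigMap key from diverging.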
End-to-End Flow
Lifecycle Behavior
- Adding a profile updates JupyterHub profile options without requiring a Terraform run.
- Updating a kernel spec updates ConfigMap content and is picked up by Enterprise Gateway-backed launches.
- Disabling a profile removes it from user selection while preserving historical stack state.
- Version rollback restores both profile definitions and kernel spec mappings together.
Validation Rules
Before applying, the Control Plane should reject invalid configurations:
- Profile references a missing KernelAppSpec
- Duplicate profile slug
- Duplicate ConfigMap key output
- Enterprise Gateway profile missing required env mapping fields
- Invalid resource limits/requests shape
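A minimal sketch of how the first four rules might be checked before apply. The dictionary shapes (slug, kernel_spec, env) and the function names are hypothetical; the resource limits/requests shape check is omitted because its schema is not defined on this page.

```python
from collections import Counter

# Env mapping fields listed in the Rendering Contract for gateway-enabled profiles.
REQUIRED_GATEWAY_ENV = (
    "JUPYTER_GATEWAY_URL",
    "JUPYTER_GATEWAY_AUTH_TOKEN",
    "KERNEL_IMAGE",
    "KERNEL_NAMESPACE",
    "KERNEL_SERVICE_ACCOUNT_NAME",
    "KERNEL_SPARKAPP_SPEC_CONFIG_MAP",
    "KERNEL_SPARKAPP_SPEC_CONFIG_KEY",
)


def duplicates(values):
    """Return the values that occur more than once, sorted."""
    return sorted(v for v, n in Counter(values).items() if n > 1)


def validate(profiles, spec_keys):
    """Collect validation errors.

    profiles:  list of dicts with 'slug' and optional 'kernel_spec' and 'env'
    spec_keys: mapping of kernel spec slug to its rendered ConfigMap key
    """
    errors = [f"duplicate profile slug: {s}" for s in duplicates(p["slug"] for p in profiles)]
    errors += [f"duplicate ConfigMap key: {k}" for k in duplicates(spec_keys.values())]
    for p in profiles:
        ref = p.get("kernel_spec")
        if ref is None:
            continue  # local-kernel profile, gateway checks do not apply
        if ref not in spec_keys:
            errors.append(f"profile {p['slug']}: missing KernelAppSpec {ref}")
        env = p.get("env", {})
        missing = [k for k in REQUIRED_GATEWAY_ENV if not env.get(k)]
        if missing:
            errors.append(f"profile {p['slug']}: missing env {', '.join(missing)}")
    return errors
```

Returning the full error list, rather than failing on the first problem, lets an operator fix a configuration in one pass.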
KernelAppSpec Management Rules
To prevent platform breakage, KernelAppSpec must be validated with a policy model, not treated as free-form JSON.
1. Field Governance Tiers
Classify every field into one of three tiers:
- PlatformRequired: must exist and cannot be overridden by tenant profile-level input.
- PlatformDefault: defaulted by platform, tenant may override within guardrails.
- TenantConfigurable: tenant can set freely within schema/type constraints.
2. Mandatory Platform Contract
The following must be enforced as PlatformRequired for Spark kernels:
- Driver and executor env entries for Infisical integration:
  - INFISICAL_CLIENT_ID
  - INFISICAL_CLIENT_SECRET
  - INFISICAL_ENVIRONMENT_SLUG
  - INFISICAL_PROJECT_ID
  - INFISICAL_URL
- Spark conf integration for managed data plane:
  - spark.hadoop.hive.metastore.uris
  - spark.sql.catalogImplementation=hive
  - spark.sql.warehouse.dir
- Spark event and storage wiring used by platform observability/runtime:
  - spark.eventLog.enabled=true
  - spark.eventLog.dir
- Kubernetes identity and execution boundary:
  - driver.serviceAccount
  - executor.serviceAccount
- Node placement governance:
  - driver.nodeSelector, executor.nodeSelector, and tolerations must resolve from platform-managed placement catalog entries
  - Raw arbitrary node selector keys/values are not allowed in tenant-managed spec input
- Image provenance:
  - image, driver.image, executor.image, and required init container images must come from Quantdata-approved registries only
  - Allowed sources include Quantdata public ECR and Quantdata private registry paths
- Volume permission and local-dir mount contract:
  - driver.initContainers includes the platform volume-permissions container
  - executor.initContainers includes the platform volume-permissions container
  - Required PVC mount keys for driver/executor local dir remain present and immutable except size fields
  - spark.local.dir remains bound to the platform-managed mount path
These keys must be platform-managed values, not tenant-managed values.
If any required key is missing, empty, changed to a disallowed value, or supplied from a non-platform-managed source, Control Plane must reject the spec version.
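As an illustration, the presence and fixed-value parts of this contract could be checked along these lines. The spec layout (driver.env and executor.env as lists of name/value entries, a flat sparkConf mapping) and the function name are assumptions; checks for node placement, images, and volumes are left out of this sketch.

```python
INFISICAL_ENV = (
    "INFISICAL_CLIENT_ID",
    "INFISICAL_CLIENT_SECRET",
    "INFISICAL_ENVIRONMENT_SLUG",
    "INFISICAL_PROJECT_ID",
    "INFISICAL_URL",
)
# Keys that must be present and non-empty (values are resolved by the platform).
REQUIRED_CONF = (
    "spark.hadoop.hive.metastore.uris",
    "spark.sql.warehouse.dir",
    "spark.eventLog.dir",
)
# Keys whose value is fixed by the contract.
PINNED_CONF = {
    "spark.sql.catalogImplementation": "hive",
    "spark.eventLog.enabled": "true",
}


def contract_violations(spec):
    """Return a list of PlatformRequired violations found in a spec dict."""
    problems = []
    for role in ("driver", "executor"):
        section = spec.get(role, {})
        present = {entry.get("name") for entry in section.get("env", [])}
        for name in INFISICAL_ENV:
            if name not in present:
                problems.append(f"{role}.env is missing {name}")
        if not section.get("serviceAccount"):
            problems.append(f"{role}.serviceAccount is missing")
    conf = spec.get("sparkConf", {})
    for key in REQUIRED_CONF:
        if not conf.get(key):
            problems.append(f"sparkConf is missing {key}")
    for key, expected in PINNED_CONF.items():
        if conf.get(key) != expected:
            problems.append(f"sparkConf {key} must be {expected}")
    return problems
```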
3. Policy Enforcement Strategy
Use a two-layer model on create/update:
- Schema validation
  - JSON schema/type checks, required sections (driver, executor, sparkConf).
- Policy validation
  - Evaluate required keys and value constraints.
  - Evaluate disallowed keys (for example unsafe hostPath volumes, privileged container settings, forbidden namespaces).
Only validated versions can become Active.
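One way to picture the two layers is a small pipeline where policy checks run only on schema-valid input. The spec shape assumed here (a top-level volumes list and a per-role securityContext) and all function names are hypothetical, and a real schema layer would likely use a full JSON Schema validator rather than these type checks.

```python
REQUIRED_SECTIONS = ("driver", "executor", "sparkConf")


def schema_errors(spec):
    """Layer 1: structural checks only."""
    if not isinstance(spec, dict):
        return ["spec must be an object"]
    return [
        f"section {name} must be an object"
        for name in REQUIRED_SECTIONS
        if not isinstance(spec.get(name), dict)
    ]


def policy_errors(spec):
    """Layer 2: disallowed settings. Assumes the spec already passed layer 1."""
    errors = []
    for volume in spec.get("volumes", []):
        if "hostPath" in volume:
            errors.append(f"hostPath volume not allowed: {volume.get('name')}")
    for role in ("driver", "executor"):
        if spec[role].get("securityContext", {}).get("privileged"):
            errors.append(f"{role} must not run privileged")
    return errors


def validate_version(spec):
    """Return (status, errors); a version leaves Draft only with no errors."""
    errors = schema_errors(spec)
    if not errors:
        errors = policy_errors(spec)
    return ("Validated" if not errors else "Draft", errors)
```

Running schema validation first means the policy layer can index into driver and executor without defensive checks.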
4. Merge and Override Rules
When rendering the final ConfigMap payload:
- Start from platform base spec (immutable baseline).
- Apply tenant/workspace overrides only for allowed fields.
- Re-apply and lock PlatformRequired keys last.
This guarantees that even if override payload includes conflicting values, required platform behavior wins.
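The three merge steps can be sketched over flat key/value maps. The function name and the flattened key representation are assumptions for illustration; a real KernelAppSpec is nested, but the ordering argument is the same.

```python
def render_spec(base, overrides, allowed_keys, platform_required):
    """Merge a spec in the order described above.

    base:              platform baseline (not mutated)
    overrides:         tenant/workspace input, possibly containing forbidden keys
    allowed_keys:      keys tenants are permitted to override
    platform_required: PlatformRequired values, applied last so they always win
    """
    merged = dict(base)
    for key, value in overrides.items():
        if key in allowed_keys:
            merged[key] = value  # step 2: only allowed fields are applied
    merged.update(platform_required)  # step 3: re-apply and lock required keys
    return merged
```

For example, a tenant override of spark.eventLog.dir is dropped in step 2 if the key is not in allowed_keys, and would be overwritten in step 3 even if it were.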
For storage overrides specifically:
- Tenants may change only size-related values (for example *.options.sizeLimit).
- Tenants may not change claim names, storage class, mount path, mount mode, or permission init container behavior.
- Supported PVC pattern is only Spark conf PVC wiring:
  - spark.kubernetes.driver.volumes.persistentVolumeClaim.*
  - spark.kubernetes.executor.volumes.persistentVolumeClaim.*
- Alternative volume injection methods in KernelAppSpec are not supported (for example arbitrary pod-volume templates, hostPath, or non-platform PVC wiring).
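A rough sketch of the storage override rule, assuming overrides arrive as flat Spark conf keys. The function name and error strings are hypothetical; the key prefixes and the .options.sizeLimit suffix are the ones named above.

```python
PVC_PREFIXES = (
    "spark.kubernetes.driver.volumes.persistentVolumeClaim.",
    "spark.kubernetes.executor.volumes.persistentVolumeClaim.",
)
SIZE_SUFFIX = ".options.sizeLimit"


def storage_override_errors(base_conf, override_conf):
    """Check tenant volume overrides against the platform base Spark conf."""
    errors = []
    for key, value in override_conf.items():
        if ".volumes." not in key:
            continue  # not a volume setting; other rules cover it
        if not key.startswith(PVC_PREFIXES):
            errors.append(f"unsupported volume method: {key}")
        elif key not in base_conf:
            errors.append(f"non-platform PVC wiring: {key}")
        elif not key.endswith(SIZE_SUFFIX) and value != base_conf[key]:
            errors.append(f"immutable PVC field changed: {key}")
    return errors
```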
For managed Spark paths specifically:
- spark.sql.warehouse.dir is platform-managed and non-overridable.
- spark.eventLog.dir is platform-managed and non-overridable.
- spark.hadoop.hive.metastore.uris is platform-managed and non-overridable.
- Control Plane resolves these values from workspace managed endpoints/buckets/paths during render.
For image overrides specifically:
- Tenant-provided image values are allowed only when they match the registry allowlist policy.
- Any image outside approved Quantdata registries is rejected.
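The registry allowlist check could look something like the sketch below. The two registry prefixes are placeholders invented for the example, not the real Quantdata registry paths, and the function names are hypothetical.

```python
# Placeholder prefixes for illustration only; the real allowlist is platform policy.
APPROVED_PREFIXES = (
    "public.ecr.aws/quantdata-example/",
    "registry.quantdata.example/",
)


def is_approved(image: str) -> bool:
    """True when the image reference starts with an approved registry prefix."""
    return image.startswith(APPROVED_PREFIXES)


def rejected_images(images: dict) -> dict:
    """Return the subset of {field: image} entries that fail the allowlist."""
    return {field: ref for field, ref in images.items() if not is_approved(ref)}
```

Ending each prefix with a slash matters: without it, a host such as registry.quantdata.example.attacker.io would pass a naive prefix match.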
5. Versioning and Promotion
- Draft: editable, not deployable.
- Validated: passed schema + policy checks.
- Active: referenced by one or more NotebookProfiles.
- Deprecated: blocked for new profile bindings, still readable for rollback/history.
Notebook profiles should reference a stable KernelAppSpec ID + version, not raw inline JSON.
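The version states above suggest a small state machine. The sketch below assumes a forward-only transition table (Draft to Validated to Active to Deprecated); the page does not define the allowed transitions, so the table and the helper names are assumptions.

```python
from enum import Enum


class SpecStatus(Enum):
    DRAFT = "Draft"
    VALIDATED = "Validated"
    ACTIVE = "Active"
    DEPRECATED = "Deprecated"


# Assumed forward-only promotion path; adjust if rollback transitions are needed.
ALLOWED_TRANSITIONS = {
    SpecStatus.DRAFT: {SpecStatus.VALIDATED},
    SpecStatus.VALIDATED: {SpecStatus.ACTIVE},
    SpecStatus.ACTIVE: {SpecStatus.DEPRECATED},
    SpecStatus.DEPRECATED: set(),
}


def transition(current: SpecStatus, target: SpecStatus) -> SpecStatus:
    """Return the new status, or raise if the promotion is not allowed."""
    if target not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"cannot move from {current.value} to {target.value}")
    return target


def can_bind(status: SpecStatus) -> bool:
    """New profile bindings are limited to versions that passed validation."""
    return status in (SpecStatus.VALIDATED, SpecStatus.ACTIVE)
```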
5a. Node Placement Catalog
Control Plane should maintain a placement catalog (for example spark-graviton4-mem-instance, jupyterhub-arm) that defines:
- allowed nodeSelector key/value sets
- required tolerations
- architecture/cloud compatibility constraints
NotebookProfile and KernelAppSpec should reference placement IDs from this catalog. During render, Control Plane expands the placement ID into concrete selector/toleration values.
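A minimal sketch of placement expansion at render time. The placement ID comes from the example above, but the selector and toleration values under it are invented for illustration, as is the function name.

```python
import copy

# Catalog contents are hypothetical; only the placement ID appears in this page.
PLACEMENT_CATALOG = {
    "spark-graviton4-mem-instance": {
        "nodeSelector": {"kubernetes.io/arch": "arm64", "workload": "spark"},
        "tolerations": [
            {"key": "dedicated", "operator": "Equal", "value": "spark", "effect": "NoSchedule"}
        ],
    },
}


def expand_placement(placement_id: str) -> dict:
    """Resolve a placement ID to concrete nodeSelector and tolerations."""
    try:
        entry = PLACEMENT_CATALOG[placement_id]
    except KeyError:
        raise ValueError(f"unknown placement: {placement_id}") from None
    # Deep copy so callers cannot mutate the shared catalog entry.
    return copy.deepcopy(entry)
```

Because specs carry only the ID, changing a node group label means editing one catalog entry rather than every spec.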
6. Runtime Drift Protection
Control Plane should continuously compare desired rendered ConfigMap content vs. cluster state.
- Drift detected: mark workspace app Degraded.
- Auto-reconcile by reapplying desired ConfigMap via Agent.
- Repeated drift: raise policy/security event for investigation.
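Drift comparison can be done by fingerprinting the desired and observed ConfigMap data. The sketch below is one possible approach; the action names and the alert_after threshold are assumptions, and fetching the observed data from the cluster is out of scope here.

```python
import hashlib
import json


def fingerprint(data: dict) -> str:
    """Hash ConfigMap data in a canonical form so key order does not matter."""
    canonical = json.dumps(data, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def reconcile_action(desired: dict, observed: dict, prior_drift_count: int, alert_after: int = 3) -> str:
    """Decide what to do for one comparison cycle.

    prior_drift_count is the number of consecutive drifts seen before this cycle.
    """
    if fingerprint(desired) == fingerprint(observed):
        return "in_sync"
    if prior_drift_count + 1 >= alert_after:
        return "reapply_and_alert"  # repeated drift: also raise a security event
    return "reapply"  # mark Degraded and reapply the desired ConfigMap
```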
7. Practical Rule for Your Example
For your provided spec, enforce these as hard checks:
- Reject if any Infisical env var is removed from driver.env or executor.env.
- Reject if spark.hadoop.hive.metastore.uris is missing.
- Reject if spark.hadoop.hive.metastore.uris is not platform-managed.
- Reject if spark.sql.warehouse.dir is missing.
- Reject if spark.sql.warehouse.dir is changed from platform-managed value.
- Reject if spark.sql.warehouse.dir is not platform-managed.
- Reject if spark.eventLog.dir is missing.
- Reject if spark.eventLog.dir is changed from platform-managed value.
- Reject if spark.eventLog.dir is not platform-managed.
- Reject if driver.serviceAccount / executor.serviceAccount do not match allowed service accounts for the workspace.
- Reject if any runtime image does not match approved Quantdata registry prefixes.
- Reject if node selectors or tolerations do not match a platform catalog placement entry.
- Reject if the volume-permissions init container is removed or altered beyond allowed fields.
- Reject if driver/executor PVC mount structure is changed (except sizeLimit values).
- Reject if spec uses non-supported volume methods outside the approved Spark PVC keys.
- Reject if KERNEL_SPARKAPP_SPEC_CONFIG_KEY in NotebookProfile points to a non-validated spec version.
Image Catalog Governance
Control Plane should own a centralized image catalog and KernelAppSpec should reference catalog IDs, not arbitrary image strings.
Catalog Object
Each catalog entry should include:
- imageId (stable logical ID, for example spark-aws-core-3.5.5)
- registry / repository
- tag and/or immutable digest
- visibility (public, private)
- status (active, deprecated, blocked)
- constraints (cloud, architecture, workspace tier, kernel type)
Spec Reference Model
KernelAppSpec references image entries:
- runtimeImageRef
- driverImageRef
- executorImageRef
- initImageRefs[]
During render, Control Plane resolves refs to concrete image values (prefer digest-pinned form).
Enforcement
- Reject specs that use raw image URLs when *ImageRef is required by policy.
- Reject references to deprecated or blocked images for new activations.
- Reject references violating constraints (for example amd64 image on arm64-only profile policy).
- Allow existing running workloads to continue during deprecation windows, but block new promotions.
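The reference model and enforcement rules might combine into a resolver like the one below. The ImageEntry fields loosely mirror the catalog object described earlier, but the exact field names, the architectures tuple, and the resolve_image function are assumptions; only the architecture constraint is modeled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ImageEntry:
    image_id: str
    repository: str  # registry/repository path
    tag: Optional[str] = None
    digest: Optional[str] = None  # for example "sha256:..."
    status: str = "active"
    architectures: tuple = ("amd64", "arm64")


def resolve_image(catalog: dict, image_id: str, arch: str) -> str:
    """Resolve a catalog reference to a concrete image for a new activation."""
    entry = catalog.get(image_id)
    if entry is None:
        raise ValueError(f"unknown image reference: {image_id}")
    if entry.status != "active":
        raise ValueError(f"image {image_id} is {entry.status}")
    if arch not in entry.architectures:
        raise ValueError(f"image {image_id} does not support {arch}")
    if entry.digest:
        return f"{entry.repository}@{entry.digest}"  # digest-pinned form preferred
    if entry.tag:
        return f"{entry.repository}:{entry.tag}"
    raise ValueError(f"image {image_id} has neither tag nor digest")
```

This check applies to new activations only; workloads already running on a deprecated image are not touched by it.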
Promotion Workflow
Suggested image lifecycle:
- candidate: available for internal validation only
- active: allowed for production KernelAppSpec activation
- deprecated: no new bindings, existing bindings allowed temporarily
- blocked: disallowed for all new launches and updates
Operational Benefit
This prevents broken or untrusted image drift and allows platform-wide upgrades/rollbacks by changing catalog policy rather than editing every KernelAppSpec.
Failure Scenarios
- ConfigMap apply fails: keep previous active profile/kernel config and mark update failed.
- JupyterHub upgrade fails: profile change is not activated; workspace app remains on previous release.
- Enterprise Gateway unavailable: depending on platform policy, Enterprise Gateway profiles are either hidden until the gateway health check passes or shown in the UI as unavailable.
Security Model
- Sensitive values (for example gateway auth token) are resolved from cluster secrets at execution time, not stored as plaintext in Control Plane state.
- Control Plane stores references and rendered structure, while secret material remains tenant-side.
- All updates follow the existing Agent pull-execution and mTLS authentication model.
Rollout Plan
- Mirror mode
  - Keep existing Terraform path as source.
  - Add Control Plane read-only projection to render equivalent profile/config outputs for comparison.
- Control Plane authority mode
  - Control Plane becomes write source for profileList and kernel spec ConfigMap.
  - Terraform stops managing those specific fields/artifacts.
- Full migration
  - JupyterHub + Enterprise Gateway stack behavior is managed through catalog specs and workspace stack lifecycle only.
Out of Scope
- Replacing the underlying JupyterHub or Enterprise Gateway Helm charts
- Changing Spark operator behavior outside kernel spec payload definitions
- Redesigning tenant IAM/IRSA model