Capability-Centric Platform — Overview

The platform is a collection of data infrastructure stacks (Datahub, Airflow, Trino, JupyterHub, etc.). As the stack count grew, a structural problem emerged: the compose model required operators to think in terms of stack topology — which stacks exist, what they depend on, how to wire them together. This is the wrong abstraction for the intended user.

This document covers the problem, the decision, and the mental model that drives everything else in this ADR set. For what changes in platform-stacks see Bundle Structure. For what changes in the control plane see Control Plane.


The Problem

Integration logic lived inside app bundles

Cross-stack wiring was implemented as post-deploy phases inside the owning app's bundle:

  • aws/datahub post-deploy: seeds a PAT, registers Airflow as a lineage source, registers Trino as a data source
  • aws/ranger post-deploy: seeds default policies

This created two problems:

  1. Redeploying the app reruns the seeds. The Datahub bundle had no way to distinguish "first install" from "upgrade". The only protection against reruns was Kubernetes Job semantics (ttlSecondsAfterFinished), a safeguard that is implicit and fragile.

  2. The owning app had to know about every app it connects to. The Datahub bundle contained Airflow-specific logic. If a workspace deployed Airflow without Datahub, or Datahub without Airflow, the bundle still carried dead configuration for the missing stack.

The compose was a wiring diagram, not a product surface

The compose file (compose/aws/delta-spark.yaml) was a list of stack slugs and dependency edges. Adding a new integration meant editing the compose YAML — an operator concern, not a user concern. There was no way to represent "this workspace uses data lineage" without knowing which two stacks data lineage connects.


The Decision

Users configure capabilities, not stacks. The platform is responsible for knowing which stacks a capability requires and how to wire them.

A user enabling "Data Lineage" does not need to know that this requires Datahub, Airflow, and a PAT seed job. They see a toggle. The platform handles the rest.

This is the same model VS Code uses for extensions: an extension (capability) has its own settings panel. The user enables it. The editor handles loading, dependency resolution, and lifecycle. The user never edits a wiring file.


Three-Layer Model

Every stack in the platform belongs to one of three layers:

Layer 1 — Infrastructure
Shared cluster-level resources with no user-facing features.
Examples: karpenter, spark-operator, observability, kafka

Layer 2 — Applications
Self-contained services that expose a user-facing feature.
No knowledge of other Layer 2 apps.
Examples: datahub, airflow, jupyterhub, trino, superset

Layer 3 — Capabilities
Cross-stack wiring that delivers a user-visible outcome.
Depends on two or more Layer 2 apps being deployed.
Contains no Helm releases — only Jobs, SparkApplications, API calls.
Examples: data-lineage, pii-scanning, query-federation

Layer 2 apps are fully independent. A workspace can deploy Airflow without Datahub and vice versa. When both are present and the operator enables "Data Lineage", the Layer 3 capability stack deploys and wires them.
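
The independence rule for Layer 2 can be checked mechanically. The following is a minimal Python sketch, not the platform's actual validation code: the Stack dataclass, its requires field, and the layer_violations function are hypothetical names introduced for illustration, and it assumes each stack declares its layer and the stacks it references.

```python
from dataclasses import dataclass
from enum import IntEnum


class Layer(IntEnum):
    INFRASTRUCTURE = 1
    APPLICATION = 2
    CAPABILITY = 3


@dataclass(frozen=True)
class Stack:
    # Hypothetical model of a stack: its name, layer, and referenced stacks.
    name: str
    layer: Layer
    requires: frozenset = frozenset()


def layer_violations(stacks):
    """Return messages for Layer 2 apps that reference other Layer 2 apps."""
    by_name = {s.name: s for s in stacks}
    problems = []
    for stack in stacks:
        for dep in sorted(stack.requires):
            target = by_name.get(dep)
            if target is None:
                problems.append(f"{stack.name}: unknown dependency {dep}")
            elif (stack.layer == Layer.APPLICATION
                  and target.layer == Layer.APPLICATION):
                problems.append(
                    f"{stack.name}: Layer 2 app must not reference "
                    f"Layer 2 app {dep}"
                )
    return problems
```

Under this sketch, a data-lineage capability that requires datahub and airflow passes, while a datahub app that lists airflow in requires is reported as a violation.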


Capability Naming

Layer 3 bundles are named by the user-facing outcome, not by the stacks they connect.

Do                     Don't
aws/data-lineage       aws/datahub-airflow-integration
aws/pii-scanning       aws/datahub-spark-integration
aws/query-federation   aws/trino-hive-integration
aws/ranger-policies    aws/ranger-post-deploy

This matters because the same capability may connect different stacks in different compose configurations. data-lineage might wire Datahub to Airflow in one workspace and Datahub to a different orchestrator in another. The name should survive that variation.


Two Orthogonal Dimensions

A platform configuration is described by two independent axes. Confusing them is the root cause of poorly named compose files and bloated bundles.

Dimension 1 — Architecture Variant (the compute model)

Represented by the compose kind. Defines which Layer 1 and Layer 2 stacks are present. Different variants use fundamentally different compute engines and cannot be derived from each other by toggling optional members.

Compose kind              Compute model                    Key stacks
aws/spark-platform        Distributed Spark, Delta Lake    Karpenter (large pools), Spark operator, Hive Metastore, Trino, S3
aws/serverless-platform   Serverless, embedded analytics   Karpenter (minimal), DuckDB/MotherDuck, S3, JupyterHub

You cannot reach a serverless platform by disabling optional members in the Spark platform — the underlying infrastructure is different. These are separate compose kinds.

Dimension 2 — Capabilities (what features are enabled)

Represented by optional members within a compose kind. These are the Layer 3 bundles that wire Layer 2 apps together. Any compose kind that shares the same Layer 2 apps can offer the same capabilities.

Capability          What it does                                     Required stacks
Data Lineage        Airflow pipeline lineage in Datahub              datahub + airflow
PII Scanning        Scheduled PII detection across datasets          datahub + spark-operator
Catalog Ingestion   Hive and Trino source registration in Datahub    datahub + hive-metastore + trino
Ranger Policies     Default authorisation policy bootstrap           ranger

Why these are orthogonal

The same capability (data-lineage) can exist in any compose kind that includes both Datahub and Airflow. The architecture variant determines which stacks are available; the capability layer determines which cross-stack wiring is active. Neither axis implies the other.

                                  Capabilities (optional members)
                                  ────────────────────────────────────────────────────►
                                  none            data-lineage   pii-scan    full

Architecture     spark-platform   [ variant A ]   [ A + L ]      [ A + P ]   [ A + all ]
Variant          serverless       [ variant B ]   [ B + L ]      n/a         [ B + all ]
(compose kind)

pii-scanning is not available in serverless-platform because that variant has no Spark operator — the dependency is simply absent, and the optional member is omitted from that compose kind entirely.
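
The availability matrix above follows from a simple subset test. Below is a minimal Python sketch under stated assumptions: the requirement sets are taken from the capability table, but the stack sets per compose kind are illustrative guesses (the "Key stacks" column is not an exhaustive list), and available_capabilities is a hypothetical helper, not a platform API.

```python
# Required stacks per capability, as listed in the capability table.
CAPABILITY_REQUIREMENTS = {
    "data-lineage": {"datahub", "airflow"},
    "pii-scanning": {"datahub", "spark-operator"},
}

# ASSUMPTION: illustrative stack sets only; the real compose kinds may differ.
COMPOSE_KIND_STACKS = {
    "aws/spark-platform": {
        "karpenter", "spark-operator", "hive-metastore",
        "trino", "datahub", "airflow",
    },
    "aws/serverless-platform": {
        "karpenter", "jupyterhub", "datahub", "airflow",
    },
}


def available_capabilities(compose_kind):
    """Return capabilities whose required stacks all exist in the kind."""
    stacks = COMPOSE_KIND_STACKS[compose_kind]
    return sorted(
        name
        for name, required in CAPABILITY_REQUIREMENTS.items()
        if required <= stacks  # subset test: every dependency is present
    )
```

With these assumed stack sets, the Spark variant offers both capabilities, while the serverless variant offers only data-lineage because spark-operator is missing from its set.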


Compose as a Capability Declaration

The compose file is a declaration of which capabilities are enabled for an architecture variant, not a wiring diagram.

# Before: wiring diagram
members:
  - name: datahub
    stackTemplateSlug: aws/datahub
  - name: airflow
    stackTemplateSlug: aws/airflow
    dependsOn: [storages]
  # (no explicit data lineage — it was buried in datahub's post-deploy)

# After: capability declaration
members:
  - name: datahub
    stackTemplateSlug: aws/datahub
  - name: airflow
    stackTemplateSlug: aws/airflow
    dependsOn: [storages]
  - name: pipeline-lineage
    stackTemplateSlug: aws/pipeline-lineage
    optional: true
    enabled: true
    label: "Pipeline Lineage"
    description: "Tracks Airflow DAG runs as lineage events in Datahub."
    group: "Data Lineage"
    dependsOn: [datahub, airflow]

  - name: job-lineage
    stackTemplateSlug: aws/job-lineage
    optional: true
    enabled: true
    label: "Job Lineage"
    description: "Captures Spark job read/write lineage via OpenLineage."
    group: "Data Lineage"
    dependsOn: [datahub, spark-operator]

  - name: query-lineage
    stackTemplateSlug: aws/query-lineage
    optional: true
    enabled: false
    label: "Query Lineage"
    description: "Captures SQL column-level lineage from Trino via OpenLineage."
    group: "Data Lineage"
    dependsOn: [datahub, trino]

A compose kind that does not include datahub simply omits the Data Lineage members. Airflow is unmodified. Datahub is unmodified. The capability is absent because its dependencies are absent, so no conditional logic is required inside either app bundle.
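
One way to read such a declaration is as input to a small resolver. The sketch below is an assumption about how a control plane might interpret the fields, not the actual implementation: it treats non-optional members as always active, and activates an optional member only when it is enabled and every name in its dependsOn is already active. The active_members function is hypothetical; the member dictionaries mirror the YAML keys shown above.

```python
def active_members(members):
    """Return the names of members that would deploy, in declaration order.

    ASSUMPTION: required members always deploy; optional members deploy
    only when enabled and all of their dependsOn entries are active.
    """
    active = {m["name"] for m in members if not m.get("optional", False)}
    pending = [
        m for m in members
        if m.get("optional", False) and m.get("enabled", False)
    ]
    progressed = True
    while progressed:  # repeat so optional members can depend on each other
        progressed = False
        for member in list(pending):
            if set(member.get("dependsOn", [])).issubset(active):
                active.add(member["name"])
                pending.remove(member)
                progressed = True
    return [m["name"] for m in members if m["name"] in active]


members = [
    {"name": "datahub"},
    {"name": "airflow"},
    {"name": "pipeline-lineage", "optional": True, "enabled": True,
     "dependsOn": ["datahub", "airflow"]},
    # spark-operator is not declared here, so job-lineage stays inactive.
    {"name": "job-lineage", "optional": True, "enabled": True,
     "dependsOn": ["datahub", "spark-operator"]},
    {"name": "query-lineage", "optional": True, "enabled": False,
     "dependsOn": ["datahub"]},
]
```

In this example only datahub, airflow, and pipeline-lineage resolve as active: job-lineage is enabled but lacks a dependency, and query-lineage is switched off. Neither app bundle needs to know why.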


Naming Conventions

Compose kinds — name by compute model, not technology stack

The technology inside the compose is an implementation detail. The name should describe the compute model the user is choosing.

Do                        Don't
aws/spark-platform        aws/delta-spark
aws/serverless-platform   aws/duckdb-serverless

Layer 3 bundles — name by user-facing outcome

Do                     Don't
aws/data-lineage       aws/datahub-airflow-integration
aws/pii-scanning       aws/datahub-spark-integration
aws/query-federation   aws/trino-hive-integration
aws/ranger-policies    aws/ranger-post-deploy


Summary

Before                                 After
Integration logic inside app bundles   Integration logic in dedicated Layer 3 bundles
Compose = stack wiring diagram         Compose = architecture variant + enabled capabilities
One dimension: which stacks            Two dimensions: compute model × feature set
Operator configures connections        User picks a platform, enables features
Bundle named by stacks it connects     Bundle named by outcome it delivers
Seeds rerun on every app redeploy      Seeds isolated to their own lifecycle