Capability-Centric Platform — Bundle Structure
This document covers what changes in platform-stacks to support the three-layer model. For the motivation see Overview. For control plane changes see Control Plane.
What Moves Out of Existing Bundles
aws/datahub
The post-deploy phase is removed entirely from the datahub bundle. It contained:
| Resource | Moves to |
|---|---|
| datahub_seed_scripts ConfigMap | aws/data-lineage or aws/data-catalog-ingestion |
| PAT seed Job | aws/data-lineage (PAT is consumed by Airflow) |
| Ingestion pipeline seed Jobs (Hive, Trino sources) | aws/data-catalog-ingestion |
| pii-scanning phase (ScheduledSparkApplication) | aws/pii-scanning |
What stays in aws/datahub: namespaces, secrets, oauth-client, helm releases, IRSA module, ranger prerequisites chart.
aws/ranger
The post-deploy phase is removed. It contained:
| Resource | Moves to |
|---|---|
| Ranger policy seed ConfigMap + Job | aws/ranger-policies |
What stays in aws/ranger: namespace, KubeBlocks DB, admin secret, helm release.
Layer 3 Bundle Catalogue
| Bundle slug | User-facing capability | Depends on (Layer 2) |
|---|---|---|
| aws/pipeline-lineage | Airflow DAG lineage in Datahub | datahub, airflow |
| aws/job-lineage | Spark job read/write lineage via OpenLineage | datahub, spark-operator |
| aws/query-lineage | Trino SQL column-level lineage via OpenLineage | datahub, trino |
| aws/pii-scanning | Scheduled PII detection across datasets | datahub, spark-operator |
| aws/data-catalog-ingestion | Hive and Trino source registration in Datahub | datahub, hive-metastore, trino |
| aws/ranger-fgac-spark | Query-level authorization via Ranger for Trino + Hive | ranger, trino |
Additional capabilities can be added as the platform grows. Each follows the same structural rules below.
Layer 3 Bundle Structure
A Layer 3 bundle has no Helm releases. It contains only transient workloads: Kubernetes Jobs, SparkApplications, ConfigMaps holding scripts, and any RBAC needed to run them.
aws/data-lineage/
└── bundle.yaml
    ├── inputs
    │   ├── cluster              (from compose, auto-wired)
    │   ├── datahub_gms_url      (from deps.datahub.outputs.gms_url)
    │   └── airflow_url          (from deps.airflow.outputs.url)
    ├── phases
    │   └── integration
    │       ├── seed_scripts     (k8s-manifest: ConfigMap)
    │       └── lineage_setup    (k8s-manifest: Job)
    └── outputs
        └── datahub_pat_secret   (name of the K8s secret holding the PAT)
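The input wiring shown in the tree above could be resolved along these lines. This is only a sketch in Python of how a `deps.<member>.outputs.<key>` reference might be looked up; the `resolve_ref` helper, the shape of the `deps` mapping, and the placeholder URLs are assumptions made for illustration, not the platform's actual resolver.

```python
def resolve_ref(ref, deps):
    """Resolve a 'deps.<member>.outputs.<key>' reference against dependency outputs."""
    parts = ref.split(".")
    if len(parts) != 4 or parts[0] != "deps" or parts[2] != "outputs":
        raise ValueError(f"not a dependency output reference: {ref!r}")
    _, member, _, key = parts
    try:
        return deps[member]["outputs"][key]
    except KeyError:
        raise KeyError(f"unresolved reference: {ref!r}") from None


# Hypothetical outputs published by the Layer 2 members (URLs are placeholders).
deps = {
    "datahub": {"outputs": {"gms_url": "http://datahub-gms.example:8080"}},
    "airflow": {"outputs": {"url": "http://airflow.example:8080"}},
}

# The two dependency-derived inputs of the example bundle.
inputs = {
    "datahub_gms_url": resolve_ref("deps.datahub.outputs.gms_url", deps),
    "airflow_url": resolve_ref("deps.airflow.outputs.url", deps),
}
```

Failing loudly on an unresolved reference, rather than passing an empty value to a Job, keeps a missing Layer 2 output from surfacing later as an obscure runtime error.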
Rules for Layer 3 bundles
- No helm-release resources. All resources are k8s-manifest.
- Inputs come from dependency outputs, not from the compose schema directly. Use deps.<member>.outputs.*.
- Jobs must be idempotent. Either use ttlSecondsAfterFinished for cleanup, or write the job script to check-before-act (query the API before seeding).
- No shared namespaces. Deploy into the namespace of the primary app being configured (e.g. datahub namespace for data-lineage jobs).
- Outputs are optional but should surface anything a downstream capability might need (e.g. the PAT secret name so another Layer 3 bundle can reference it).
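The check-before-act rule can be illustrated with a short Python sketch. It is not the actual seed script: `FakeCatalogClient`, `seed_sources`, and the source definitions are hypothetical stand-ins for whatever API a given Job talks to.

```python
from dataclasses import dataclass, field


@dataclass
class FakeCatalogClient:
    """Hypothetical in-memory stand-in for the API a seed Job would call."""
    sources: dict = field(default_factory=dict)

    def get_source(self, name):
        return self.sources.get(name)

    def create_source(self, name, config):
        self.sources[name] = config


def seed_sources(client, desired):
    """Create only the sources that do not exist yet; return the names created."""
    created = []
    for name, config in desired.items():
        if client.get_source(name) is None:  # check before act
            client.create_source(name, config)
            created.append(name)
    return created


# Illustrative desired state; the real seed data lives in the ConfigMap.
DESIRED = {"hive": {"type": "hive"}, "trino": {"type": "trino"}}

client = FakeCatalogClient()
first_run = seed_sources(client, DESIRED)   # creates both sources
second_run = seed_sources(client, DESIRED)  # a re-run finds nothing to do
```

Because the second run creates nothing, the Job can be re-applied on every deploy without duplicating state in the target application.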
Updated Compose Structure
With Layer 3 bundles extracted, the compose file explicitly declares which capabilities are active for an architecture variant. Here is the updated spark-platform compose structure (abbreviated):
compose:
  members:
    # Layer 1: Infrastructure
    - name: storages
      stackTemplateSlug: aws/storages
    - name: karpenter
      stackTemplateSlug: aws/karpenter
    - name: spark-operator
      stackTemplateSlug: aws/spark-operator
      dependsOn: [observability]
    # Layer 2: Applications
    - name: datahub
      stackTemplateSlug: aws/datahub
      dependsOn: [...]
    - name: airflow
      stackTemplateSlug: aws/airflow
      dependsOn: [storages]
    # Layer 3: Capabilities
    - name: pipeline-lineage
      stackTemplateSlug: aws/pipeline-lineage
      optional: true
      enabled: true
      label: "Pipeline Lineage"
      description: "Tracks Airflow DAG runs as lineage events in Datahub."
      group: "Data Lineage"
      dependsOn: [datahub, airflow]
    - name: job-lineage
      stackTemplateSlug: aws/job-lineage
      optional: true
      enabled: true
      label: "Job Lineage"
      description: "Captures Spark job read/write lineage via OpenLineage."
      group: "Data Lineage"
      dependsOn: [datahub, spark-operator]
    - name: query-lineage
      stackTemplateSlug: aws/query-lineage
      optional: true
      enabled: false
      label: "Query Lineage"
      description: "Captures SQL column-level lineage from Trino via OpenLineage."
      group: "Data Lineage"
      dependsOn: [datahub, trino]
    - name: pii-scanning
      stackTemplateSlug: aws/pii-scanning
      optional: true
      enabled: false
      label: "PII Scanning"
      description: "Scheduled Spark jobs that detect PII fields across registered datasets."
      dependsOn: [datahub, spark-operator]
A compose variant that does not include datahub simply omits the Layer 3 entries that depend on it. No conditional logic is needed inside aws/airflow or aws/datahub.
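A lint-style check can catch a variant that drops datahub but forgets to drop a capability that depends on it. The Python sketch below assumes the compose members have already been parsed into dictionaries and that a member without an `enabled` key counts as enabled; both the `missing_dependencies` helper and that default are assumptions, not documented platform behaviour.

```python
def missing_dependencies(members):
    """Map each enabled member to the dependencies absent from the compose variant."""
    # Assumption: a member with no 'enabled' key is treated as enabled.
    active = {m["name"] for m in members if m.get("enabled", True)}
    problems = {}
    for m in members:
        if m["name"] not in active:
            continue  # disabled members are not deployed, so nothing to check
        missing = [d for d in m.get("dependsOn", []) if d not in active]
        if missing:
            problems[m["name"]] = missing
    return problems


# A variant without datahub that correctly omits the dependent capabilities.
without_datahub = [
    {"name": "storages"},
    {"name": "airflow", "dependsOn": ["storages"]},
]

# The same variant with a capability entry left in by mistake.
stale_entry = without_datahub + [
    {"name": "pipeline-lineage", "enabled": True, "dependsOn": ["datahub", "airflow"]},
]

clean = missing_dependencies(without_datahub)
broken = missing_dependencies(stale_entry)
```

Here `clean` is empty, while `broken` reports that pipeline-lineage is missing datahub.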
Dependency Graph (After Refactor)
storages ──────────────────────────────────────────┐
karpenter → compute-profiles → spark-operator      │
                                     │             │
                                jupyterhub      airflow
                                                   │
hive-metastore → ranger                         datahub
      └→ trino → superset                          │
           │                                       │
[data-catalog-ingestion] ◄─────────────────────────┘  (Layer 3)
[pipeline-lineage] ◄────── airflow + datahub
[job-lineage] ◄─────────── spark-operator + datahub
[query-lineage] ◄───────── trino + datahub
[pii-scanning] ◄────────── spark-operator + datahub
[ranger-policies] ◄─────── ranger
Layer 3 nodes are shown in brackets. They sit at the leaves of the graph: they consume outputs from Layer 2, and no Layer 1 or Layer 2 bundle depends on them.
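The leaf property can be checked mechanically. The Python sketch below uses the standard-library `graphlib` module on a subset of the edges from the abbreviated compose file; datahub's own dependencies are elided in the source, so they are left out here, and the `non_leaf_capabilities` helper is a hypothetical name.

```python
from graphlib import TopologicalSorter

# member -> set of members it depends on (subset taken from the compose example)
DEPENDS_ON = {
    "airflow": {"storages"},
    "pipeline-lineage": {"datahub", "airflow"},
    "job-lineage": {"datahub", "spark-operator"},
    "query-lineage": {"datahub", "trino"},
    "pii-scanning": {"datahub", "spark-operator"},
}
LAYER_3 = {"pipeline-lineage", "job-lineage", "query-lineage", "pii-scanning"}


def non_leaf_capabilities(depends_on, layer3):
    """Return the Layer 3 members that some other member depends on."""
    depended_on = set().union(*depends_on.values())
    return sorted(layer3 & depended_on)


violations = non_leaf_capabilities(DEPENDS_ON, LAYER_3)

# static_order() yields dependencies before dependents, so every Layer 3
# member appears after the Layer 2 members it consumes.
deploy_order = list(TopologicalSorter(DEPENDS_ON).static_order())
```

An empty `violations` list means no member in this subset depends on a capability, which is what allows a Layer 3 entry to be toggled without touching the rest of the graph.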