Capability-Centric Platform — Bundle Structure

This document covers what changes in platform-stacks to support the three-layer model. For the motivation see Overview. For control plane changes see Control Plane.


What Moves Out of Existing Bundles

aws/datahub

The post-deploy phase is removed entirely from the datahub bundle. It contained:

| Resource | Moves to |
|---|---|
| datahub_seed_scripts ConfigMap | aws/data-lineage or aws/data-catalog-ingestion |
| PAT seed Job | aws/data-lineage (PAT is consumed by Airflow) |
| Ingestion pipeline seed Jobs (Hive, Trino sources) | aws/data-catalog-ingestion |
| pii-scanning phase (ScheduledSparkApplication) | aws/pii-scanning |

What stays in aws/datahub: namespaces, secrets, oauth-client, helm releases, IRSA module, ranger prerequisites chart.

aws/ranger

The post-deploy phase is removed. It contained:

| Resource | Moves to |
|---|---|
| Ranger policy seed ConfigMap + Job | aws/ranger-policies |

What stays in aws/ranger: namespace, KubeBlocks DB, admin secret, helm release.


Layer 3 Bundle Catalogue

| Bundle slug | User-facing capability | Depends on (Layer 2) |
|---|---|---|
| aws/pipeline-lineage | Airflow DAG lineage in Datahub | datahub, airflow |
| aws/job-lineage | Spark job read/write lineage via OpenLineage | datahub, spark-operator |
| aws/query-lineage | Trino SQL column-level lineage via OpenLineage | datahub, trino |
| aws/pii-scanning | Scheduled PII detection across datasets | datahub, spark-operator |
| aws/data-catalog-ingestion | Hive and Trino source registration in Datahub | datahub, hive-metastore, trino |
| aws/ranger-fgac-spark | Query-level authorization via Ranger for Trino + Hive | ranger, trino |

Additional capabilities can be added as the platform grows. Each follows the same structural rules below.


Layer 3 Bundle Structure

A Layer 3 bundle has no Helm releases. It contains only transient workloads and the resources they need to run: Kubernetes Jobs, SparkApplications, ConfigMaps holding scripts, and any RBAC.

aws/data-lineage/
└── bundle.yaml
    ├── inputs
    │   ├── cluster             (from compose, auto-wired)
    │   ├── datahub_gms_url     (from deps.datahub.outputs.gms_url)
    │   └── airflow_url         (from deps.airflow.outputs.url)
    ├── phases
    │   └── integration
    │       ├── seed_scripts    (k8s-manifest: ConfigMap)
    │       └── lineage_setup   (k8s-manifest: Job)
    └── outputs
        └── datahub_pat_secret  (name of the K8s secret holding the PAT)
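
As a rough illustration, the layout above could translate into a bundle.yaml along the following lines. This is only a sketch: the source shows the tree, not the file, so every key name here (`source`, `type`) and the Secret name `datahub-pat` are assumptions rather than the actual bundle schema. Only the input, phase, resource, and output names come from the tree.

```yaml
# Illustrative bundle.yaml for aws/data-lineage.
# ASSUMPTION: key names such as "source" and "type" are hypothetical;
# check the real bundle schema before copying anything from here.
inputs:
  cluster:
    source: compose                        # auto-wired from the compose file
  datahub_gms_url:
    source: deps.datahub.outputs.gms_url   # output of the datahub member
  airflow_url:
    source: deps.airflow.outputs.url       # output of the airflow member

phases:
  integration:
    seed_scripts:
      type: k8s-manifest                   # renders a ConfigMap holding the scripts
    lineage_setup:
      type: k8s-manifest                   # renders the Job that runs them

outputs:
  datahub_pat_secret: datahub-pat          # hypothetical Secret name holding the PAT
```

The point of the sketch is the wiring direction: inputs come from dependency outputs, the single phase holds only k8s-manifest resources, and the output exposes a name rather than the secret value itself.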

Rules for Layer 3 bundles

  1. No helm-release resources. All resources are k8s-manifest.
  2. Inputs come from dependency outputs, not from the compose schema directly. Use deps.<member>.outputs.*.
  3. Jobs must be idempotent. Either set ttlSecondsAfterFinished so that finished Jobs are cleaned up and can be recreated on the next apply, or write the job script to check before acting (query the API before seeding).
  4. No new shared namespaces. Deploy into the namespace of the primary app being configured (e.g. the datahub namespace for data-lineage jobs).
  5. Outputs are optional but should surface anything a downstream capability might need (e.g. the PAT secret name so another Layer 3 bundle can reference it).
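
Rule 3 can be sketched with a pair of plain Kubernetes manifests. The example below is a minimal sketch under stated assumptions, not the platform's actual seed job: the resource names, the `curlimages/curl` image, the placeholder URLs, and the payload are all hypothetical, and the two environment variables stand in for whatever API the real script queries. Only `ttlSecondsAfterFinished`, `backoffLimit`, and the ConfigMap volume mount are standard Kubernetes fields.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: seed-scripts                # hypothetical name
  namespace: datahub                # rule 4: namespace of the app being configured
data:
  seed.sh: |
    #!/bin/sh
    set -eu
    # Check before act: if the lookup succeeds, the resource exists, so stop.
    # Note: with -f any HTTP error (including a 5xx) counts as "not found"
    # and falls through to seeding; a real script may want to tell them apart.
    if curl -fsS "$CHECK_URL" > /dev/null; then
      echo "already seeded, nothing to do"
      exit 0
    fi
    curl -fsS -X POST -H "Content-Type: application/json" \
      --data @/scripts/payload.json "$SEED_URL"
  payload.json: |
    {"placeholder": true}
---
apiVersion: batch/v1
kind: Job
metadata:
  name: lineage-setup               # hypothetical name
  namespace: datahub
spec:
  ttlSecondsAfterFinished: 300      # finished Job is deleted after 5 minutes
  backoffLimit: 2                   # retry a failed pod at most twice
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: seed
          image: curlimages/curl    # assumption: any image with sh and curl works
          command: ["sh", "/scripts/seed.sh"]
          env:
            - name: CHECK_URL       # placeholder, not a real Datahub endpoint
              value: "http://datahub.example.internal/placeholder-check"
            - name: SEED_URL        # placeholder, not a real Datahub endpoint
              value: "http://datahub.example.internal/placeholder-seed"
          volumeMounts:
            - name: scripts
              mountPath: /scripts
              readOnly: true
      volumes:
        - name: scripts
          configMap:
            name: seed-scripts
```

The two mechanisms do different jobs. The TTL removes the finished Job object so that a later apply can create it again, while the check in the script is what makes a second run harmless.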

Updated Compose Structure

With Layer 3 bundles extracted, the compose file explicitly declares which capabilities are active for an architecture variant. Here is the updated spark-platform compose structure (abbreviated):

compose:
  members:
    # Layer 1: Infrastructure
    - name: storages
      stackTemplateSlug: aws/storages
    - name: karpenter
      stackTemplateSlug: aws/karpenter
    - name: spark-operator
      stackTemplateSlug: aws/spark-operator
      dependsOn: [observability]

    # Layer 2: Applications
    - name: datahub
      stackTemplateSlug: aws/datahub
      dependsOn: [...]
    - name: airflow
      stackTemplateSlug: aws/airflow
      dependsOn: [storages]

    # Layer 3: Capabilities
    - name: pipeline-lineage
      stackTemplateSlug: aws/pipeline-lineage
      optional: true
      enabled: true
      label: "Pipeline Lineage"
      description: "Tracks Airflow DAG runs as lineage events in Datahub."
      group: "Data Lineage"
      dependsOn: [datahub, airflow]

    - name: job-lineage
      stackTemplateSlug: aws/job-lineage
      optional: true
      enabled: true
      label: "Job Lineage"
      description: "Captures Spark job read/write lineage via OpenLineage."
      group: "Data Lineage"
      dependsOn: [datahub, spark-operator]

    - name: query-lineage
      stackTemplateSlug: aws/query-lineage
      optional: true
      enabled: false
      label: "Query Lineage"
      description: "Captures SQL column-level lineage from Trino via OpenLineage."
      group: "Data Lineage"
      dependsOn: [datahub, trino]

    - name: pii-scanning
      stackTemplateSlug: aws/pii-scanning
      optional: true
      enabled: false
      label: "PII Scanning"
      description: "Scheduled Spark jobs that detect PII fields across registered datasets."
      dependsOn: [datahub, spark-operator]

A compose variant that does not include datahub simply omits the Layer 3 entries that depend on it. No conditional logic is needed inside aws/airflow or aws/datahub.
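
For illustration, a variant without datahub might look like the sketch below. It uses only fields that appear in the compose structure above. The choice of members is an assumption made for the example and does not correspond to a real architecture variant.

```yaml
# Hypothetical compose variant with no datahub member.
compose:
  members:
    # Layer 1: Infrastructure
    - name: storages
      stackTemplateSlug: aws/storages

    # Layer 2: Applications
    - name: airflow
      stackTemplateSlug: aws/airflow
      dependsOn: [storages]

    # Layer 3: Capabilities
    # pipeline-lineage, job-lineage, query-lineage and pii-scanning all list
    # datahub in dependsOn, so none of them is declared in this variant.
```

The aws/airflow entry is identical to the one in the full compose file. Lineage support is switched on or off purely by the presence of the Layer 3 member.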


Dependency Graph (After Refactor)

storages ──────────────────────────────────────────┐
karpenter → compute-profiles → spark-operator      │
                                     │             │
                                jupyterhub      airflow

hive-metastore → ranger                         datahub
    └→ trino → superset                            │
         │                                         │
    [data-catalog-ingestion] ◄─────────────────────┘   (Layer 3)
    [pipeline-lineage]       ◄── airflow + datahub
    [job-lineage]            ◄── spark-operator + datahub
    [query-lineage]          ◄── trino + datahub
    [pii-scanning]           ◄── spark-operator + datahub
    [ranger-policies]        ◄── ranger

Layer 3 nodes are shown in brackets. They sit at the leaves of the graph: they consume outputs from Layer 2, and no Layer 1 or Layer 2 bundle depends on them.