Capability-Centric Platform — Bundle Structure

This document describes the changes to platform-stacks that support the three-layer model. For the motivation, see Overview. For control plane changes, see Control Plane.

What Moves Out of Existing Bundles

`aws/datahub`

The post-deploy phase is removed entirely from the datahub bundle. It contained:

Resource	Moves to
`datahub_seed_scripts` ConfigMap	`aws/data-lineage` or `aws/data-catalog-ingestion`
PAT seed Job	`aws/data-lineage` (PAT is consumed by Airflow)
Ingestion pipeline seed Jobs (Hive, Trino sources)	`aws/data-catalog-ingestion`
`pii-scanning` phase (ScheduledSparkApplication)	`aws/pii-scanning`

What stays in aws/datahub: namespaces, secrets, oauth-client, helm releases, IRSA module, ranger prerequisites chart.

`aws/ranger`

The post-deploy phase is removed. It contained:

Resource	Moves to
Ranger policy seed ConfigMap + Job	`aws/ranger-policies`

What stays in aws/ranger: namespace, KubeBlocks DB, admin secret, helm release.

Layer 3 Bundle Catalogue

Bundle slug	User-facing capability	Depends on (Layer 2)
`aws/pipeline-lineage`	Airflow DAG lineage in Datahub	datahub, airflow
`aws/job-lineage`	Spark job read/write lineage via OpenLineage	datahub, spark-operator
`aws/query-lineage`	Trino SQL column-level lineage via OpenLineage	datahub, trino
`aws/pii-scanning`	Scheduled PII detection across datasets	datahub, spark-operator
`aws/data-catalog-ingestion`	Hive and Trino source registration in Datahub	datahub, hive-metastore, trino
`aws/ranger-fgac-spark`	Query-level authorization via Ranger for Trino + Hive	ranger, trino

Additional capabilities can be added as the platform grows. Each follows the same structural rules below.

Layer 3 Bundle Structure

A Layer 3 bundle has no Helm releases. It contains only transient workloads and the objects that support them: Kubernetes Jobs, SparkApplications, ConfigMaps holding scripts, and any RBAC needed to run them.

aws/data-lineage/
└── bundle.yaml
    ├── inputs
    │   ├── cluster            (from compose, auto-wired)
    │   ├── datahub_gms_url    (from deps.datahub.outputs.gms_url)
    │   └── airflow_url        (from deps.airflow.outputs.url)
    ├── phases
    │   └── integration
    │       ├── seed_scripts    (k8s-manifest: ConfigMap)
    │       └── lineage_setup   (k8s-manifest: Job)
    └── outputs
        └── datahub_pat_secret  (name of the K8s secret holding the PAT)
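
Written out as YAML, the tree above might look roughly like the following. This is a sketch only: the source shows the bundle as a tree, so the key names `from`, `resources`, `type`, `manifest` and `value`, the file paths, and the Secret name are assumptions rather than the actual `bundle.yaml` schema.

```yaml
# Hypothetical bundle.yaml for aws/data-lineage.
# Only the input, phase, resource and output names come from the tree above;
# the schema keys (from, resources, type, manifest, value) are assumed.
inputs:
  cluster: {}                                   # auto-wired from compose
  datahub_gms_url:
    from: deps.datahub.outputs.gms_url
  airflow_url:
    from: deps.airflow.outputs.url

phases:
  integration:
    resources:
      seed_scripts:
        type: k8s-manifest                      # ConfigMap holding the scripts
        manifest: manifests/seed-scripts.yaml   # hypothetical path
      lineage_setup:
        type: k8s-manifest                      # Job that runs the scripts
        manifest: manifests/lineage-setup.yaml  # hypothetical path

outputs:
  datahub_pat_secret:
    value: datahub-pat                          # hypothetical Secret name
```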

Rules for Layer 3 bundles

No helm-release resources. All resources are k8s-manifest.
Inputs come from dependency outputs, not from the compose schema directly. Use deps.<member>.outputs.*.
Jobs must be idempotent. Write the job script to check-before-act (query the API before seeding) so that a re-run is a no-op. `ttlSecondsAfterFinished` can additionally remove finished Job objects, but cleanup alone does not make a re-run safe.
No shared namespaces. Each bundle deploys into the namespace of the primary app it configures (e.g. the `datahub` namespace for data-lineage jobs).
Outputs are optional but should surface anything a downstream capability might need (e.g. the PAT secret name so another Layer 3 bundle can reference it).
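
As an illustration of the manifest-only and idempotency rules above, a seed Job and its script ConfigMap could be sketched as follows. The resource names, the image, the URL, and the three script bodies are hypothetical placeholders; only `ttlSecondsAfterFinished`, the check-before-act pattern, and the use of the primary app's namespace come from the rules.

```yaml
# Sketch of an idempotent seed Job. All names, the image, and the script
# bodies are hypothetical placeholders, not the platform's actual manifests.
apiVersion: v1
kind: ConfigMap
metadata:
  name: lineage-seed-scripts
  namespace: datahub            # namespace of the primary app being configured
data:
  run.sh: |
    #!/bin/sh
    set -eu
    # Check-before-act: skip seeding when the target state already exists.
    if sh /scripts/is-seeded.sh; then
      echo "already seeded, nothing to do"
      exit 0
    fi
    sh /scripts/seed.sh
  is-seeded.sh: |
    #!/bin/sh
    # Placeholder: query the API at $DATAHUB_GMS_URL and exit 0 if seeded.
    exit 1
  seed.sh: |
    #!/bin/sh
    # Placeholder: perform the actual seeding here.
    echo "seeding against $DATAHUB_GMS_URL"
---
apiVersion: batch/v1
kind: Job
metadata:
  name: lineage-setup
  namespace: datahub
spec:
  ttlSecondsAfterFinished: 600  # delete the finished Job after 10 minutes
  backoffLimit: 3
  template:
    spec:
      restartPolicy: OnFailure
      containers:
        - name: seed
          image: busybox:1.36   # placeholder image
          command: ["/bin/sh", "/scripts/run.sh"]
          env:
            - name: DATAHUB_GMS_URL
              # Placeholder; would be wired from deps.datahub.outputs.gms_url
              value: "http://example.invalid:8080"
          volumeMounts:
            - name: scripts
              mountPath: /scripts
      volumes:
        - name: scripts
          configMap:
            name: lineage-seed-scripts
```

The TTL only removes the finished Job object. The early exit in `run.sh` is what makes a second run harmless.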

Updated Compose Structure

With Layer 3 bundles extracted, the compose file explicitly declares which capabilities are active for an architecture variant. Here is the updated spark-platform compose structure (abbreviated):

compose:
  members:
    # Layer 1 — Infrastructure
    - name: storages
      stackTemplateSlug: aws/storages
    - name: karpenter
      stackTemplateSlug: aws/karpenter
    - name: spark-operator
      stackTemplateSlug: aws/spark-operator
      dependsOn: [observability]

    # Layer 2 — Applications
    - name: datahub
      stackTemplateSlug: aws/datahub
      dependsOn: [...]
    - name: airflow
      stackTemplateSlug: aws/airflow
      dependsOn: [storages]

    # Layer 3 — Capabilities
    - name: pipeline-lineage
      stackTemplateSlug: aws/pipeline-lineage
      optional: true
      enabled: true
      label: "Pipeline Lineage"
      description: "Tracks Airflow DAG runs as lineage events in Datahub."
      group: "Data Lineage"
      dependsOn: [datahub, airflow]

    - name: job-lineage
      stackTemplateSlug: aws/job-lineage
      optional: true
      enabled: true
      label: "Job Lineage"
      description: "Captures Spark job read/write lineage via OpenLineage."
      group: "Data Lineage"
      dependsOn: [datahub, spark-operator]

    - name: query-lineage
      stackTemplateSlug: aws/query-lineage
      optional: true
      enabled: false
      label: "Query Lineage"
      description: "Captures SQL column-level lineage from Trino via OpenLineage."
      group: "Data Lineage"
      dependsOn: [datahub, trino]

    - name: pii-scanning
      stackTemplateSlug: aws/pii-scanning
      optional: true
      enabled: false
      label: "PII Scanning"
      description: "Scheduled Spark jobs that detect PII fields across registered datasets."
      dependsOn: [datahub, spark-operator]

A compose variant that does not include datahub simply omits the Layer 3 entries that depend on it. No conditional logic is needed inside aws/airflow or aws/datahub.
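
For illustration, a variant without datahub might look like the sketch below. It reuses only the keys and members from the spark-platform example and is abbreviated in the same way; the variant itself is hypothetical.

```yaml
# Hypothetical compose variant without datahub (abbreviated).
compose:
  members:
    # Layer 1: Infrastructure
    - name: storages
      stackTemplateSlug: aws/storages
    - name: karpenter
      stackTemplateSlug: aws/karpenter
    - name: spark-operator
      stackTemplateSlug: aws/spark-operator
      dependsOn: [observability]

    # Layer 2: Applications
    - name: airflow
      stackTemplateSlug: aws/airflow
      dependsOn: [storages]

    # Layer 3: Capabilities
    # pipeline-lineage, job-lineage, query-lineage and pii-scanning all list
    # datahub in dependsOn, so none of them appear in this variant.
```

The `aws/airflow` member is declared exactly as in the full variant. The only difference is which Layer 3 members are listed.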

Dependency Graph (After Refactor)

storages ──────────────────────────────────────────┐
karpenter → compute-profiles → spark-operator      │
                                    │               │
                              jupyterhub         airflow
                                                    │
hive-metastore → ranger                          datahub
              └→ trino → superset                   │
                              │                     │
                         [data-catalog-ingestion] ◄─┘ (Layer 3)
                         [pipeline-lineage] ◄──────airflow + datahub
                         [job-lineage] ◄───────────spark-operator + datahub
                         [query-lineage] ◄──────────trino + datahub
                         [pii-scanning] ◄──────────spark-operator + datahub
                         [ranger-policies] ◄───────ranger

Layer 3 nodes are shown in brackets. They sit at the leaves of the graph: they consume outputs from Layer 2, and no Layer 1 or Layer 2 bundle depends on them.
