
Example — Spark Intelligence Workspace

A concrete end-to-end view of a Spark Intelligence SKU workspace: what the operator defines, what the user sees, and what gets deployed.


Compose Kind (operator-defined)

kind: aws/spark-platform
version: 1.0.0

compose:
  members:
    # Always included: omitted here for brevity, see aws/spark-platform

    # Optional Layer 2: trino is on by default, ranger and datahub are off
    - name: trino
      stackTemplateSlug: aws/trino
      optional: true
      enabled: true
      label: "Query Engine"
      dependsOn: [hive-metastore]

    - name: ranger
      stackTemplateSlug: aws/ranger
      optional: true
      enabled: false
      label: "Authorization Engine"
      group: "Access Control"
      dependsOn: [hive-metastore]

    - name: datahub
      stackTemplateSlug: aws/datahub
      optional: true
      enabled: false
      label: "Data Catalog"
      group: "Data Intelligence"
      dependsOn: [...]

    # Optional Layer 3: capabilities, all off by default
    - name: fine-grained-access-control
      stackTemplateSlug: aws/ranger-fgac-spark
      optional: true
      enabled: false
      label: "Fine-Grained Access Control"
      description: "Query-level authorization via Ranger for Trino and Hive."
      group: "Access Control"
      dependsOn: [ranger, trino]

    - name: pipeline-lineage
      stackTemplateSlug: aws/pipeline-lineage
      optional: true
      enabled: false
      label: "Pipeline Lineage"
      description: "Tracks Airflow DAG runs as lineage events in Datahub."
      group: "Data Lineage"
      dependsOn: [datahub, airflow]

    - name: job-lineage
      stackTemplateSlug: aws/job-lineage
      optional: true
      enabled: false
      label: "Job Lineage"
      description: "Captures Spark job read/write lineage via OpenLineage."
      group: "Data Lineage"
      dependsOn: [datahub, spark-operator]

    - name: query-lineage
      stackTemplateSlug: aws/query-lineage
      optional: true
      enabled: false
      label: "Query Lineage"
      description: "Captures SQL column-level lineage from Trino via OpenLineage."
      group: "Data Lineage"
      dependsOn: [datahub, trino]

    - name: data-catalog-ingestion
      stackTemplateSlug: aws/data-catalog-ingestion
      optional: true
      enabled: false
      label: "Catalog Ingestion"
      description: "Registers Hive and Trino as sources in Datahub."
      group: "Data Intelligence"
      dependsOn: [datahub, hive-metastore, trino]

    - name: pii-scanning
      stackTemplateSlug: aws/pii-scanning
      optional: true
      enabled: false
      label: "PII Scanning"
      description: "Scheduled Spark jobs that detect PII across registered datasets."
      group: "Data Intelligence"
      dependsOn: [datahub, spark-operator]

User Settings Panel

The UI renders optional members under headings taken from each member's group field. A user provisioning a Spark Intelligence workspace makes the following selections:

Access Control
  [✓] Authorization Engine
  [✓] Fine-Grained Access Control
      Query-level authorization via Ranger for Trino and Hive.

Data Lineage
  [✓] Pipeline Lineage
      Tracks Airflow DAG runs as lineage events in Datahub.
  [✓] Job Lineage
      Captures Spark job read/write lineage via OpenLineage.
  [ ] Query Lineage
      Captures SQL column-level lineage from Trino via OpenLineage.

Data Intelligence
  [✓] Data Catalog
  [✓] Catalog Ingestion
      Registers Hive and Trino as sources in Datahub.
  [✓] PII Scanning
      Scheduled Spark jobs that detect PII across registered datasets.

This produces enabledFeatures: [ranger, fine-grained-access-control, datahub, pipeline-lineage, job-lineage, data-catalog-ingestion, pii-scanning] in the compose spec.
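The grouping and the resulting enabledFeatures list can be sketched in a few lines of Python. This is only an illustration, not the product's implementation: the MEMBERS list, the function names, and the idea that the panel passes back a set of checked names are all assumptions. Only the name, label, and group values come from the spec above. trino is left out because its spec entry has no group field, and the list order is chosen to match the enabledFeatures line.

```python
from collections import defaultdict

# Hypothetical in-memory copy of the grouped optional members from the spec.
# Only name, label, and group are taken from the example; the shape is assumed.
MEMBERS = [
    {"name": "ranger", "label": "Authorization Engine", "group": "Access Control"},
    {"name": "fine-grained-access-control", "label": "Fine-Grained Access Control", "group": "Access Control"},
    {"name": "datahub", "label": "Data Catalog", "group": "Data Intelligence"},
    {"name": "pipeline-lineage", "label": "Pipeline Lineage", "group": "Data Lineage"},
    {"name": "job-lineage", "label": "Job Lineage", "group": "Data Lineage"},
    {"name": "query-lineage", "label": "Query Lineage", "group": "Data Lineage"},
    {"name": "data-catalog-ingestion", "label": "Catalog Ingestion", "group": "Data Intelligence"},
    {"name": "pii-scanning", "label": "PII Scanning", "group": "Data Intelligence"},
]


def group_members(members):
    """Bucket members by their group field, keeping list order within a group."""
    groups = defaultdict(list)
    for member in members:
        groups[member["group"]].append(member)
    return dict(groups)


def enabled_features(members, checked):
    """Return the names of the checked members, in list order."""
    return [m["name"] for m in members if m["name"] in checked]


# Everything is checked in the example workspace except Query Lineage.
checked = {m["name"] for m in MEMBERS} - {"query-lineage"}

panel = group_members(MEMBERS)
features = enabled_features(MEMBERS, checked)

for group, members in panel.items():
    print(group)
    for m in members:
        mark = "✓" if m["name"] in checked else " "
        print(f"  [{mark}] {m['label']}")
print("enabledFeatures:", features)
```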


Resulting Deployment (topological order)

storages, karpenter, kafka
  → compute-profile, observability
  → spark-operator, hive-metastore
  → spark-team, jupyterhub, airflow     ← always included
  → trino                               ← optional, on by default
  → ranger                              ← optional, enabled by user
  → datahub                             ← optional, enabled by user
  → superset, bff
  → fine-grained-access-control         ← Layer 3, depends on ranger + trino
  → pipeline-lineage                    ← Layer 3, depends on datahub + airflow
  → job-lineage                         ← Layer 3, depends on datahub + spark-operator
  → data-catalog-ingestion              ← Layer 3, depends on datahub + hive-metastore + trino
  → pii-scanning                        ← Layer 3, depends on datahub + spark-operator
# query-lineage omitted: not enabled in this workspace
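An order like this can be derived from the dependsOn edges with a standard topological sort. The sketch below uses Python's standard-library graphlib and is not the platform's actual resolver. It makes two assumptions: datahub's dependencies are elided in the spec ([...]), so they are left empty here, and always-included members appear only as dependency targets because their own entries are omitted from the example. DEPENDS_ON and deployment_order are illustrative names.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# dependsOn edges copied from the compose spec for the members enabled in this
# workspace. query-lineage is disabled, so it has no entry.
DEPENDS_ON = {
    "trino": {"hive-metastore"},
    "ranger": {"hive-metastore"},
    "datahub": set(),  # assumption: the spec elides these dependencies ([...])
    "fine-grained-access-control": {"ranger", "trino"},
    "pipeline-lineage": {"datahub", "airflow"},
    "job-lineage": {"datahub", "spark-operator"},
    "data-catalog-ingestion": {"datahub", "hive-metastore", "trino"},
    "pii-scanning": {"datahub", "spark-operator"},
}


def deployment_order(depends_on):
    """Return one valid order in which every member follows its dependencies."""
    return list(TopologicalSorter(depends_on).static_order())


order = deployment_order(DEPENDS_ON)
position = {name: index for index, name in enumerate(order)}

for name in order:
    print(name)
```

Members with no edge between them may come out in a different relative order than in the listing. Any order that respects every dependsOn edge is a valid deployment order.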

Layer 3 stacks sit at the leaves of the dependency graph: nothing depends on them. Disabling a capability therefore removes only its own leaf node, and no upstream stack is affected.
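The leaf property can be checked directly against the dependsOn edges in the spec. The following is a minimal sketch with illustrative names (DEPENDS_ON, LAYER_3, dependents_of). As in the spec, datahub's own dependencies are elided, so they are left empty.

```python
# dependsOn edges for the optional members, copied from the compose spec.
DEPENDS_ON = {
    "trino": {"hive-metastore"},
    "ranger": {"hive-metastore"},
    "datahub": set(),  # assumption: the spec elides these dependencies ([...])
    "fine-grained-access-control": {"ranger", "trino"},
    "pipeline-lineage": {"datahub", "airflow"},
    "job-lineage": {"datahub", "spark-operator"},
    "query-lineage": {"datahub", "trino"},
    "data-catalog-ingestion": {"datahub", "hive-metastore", "trino"},
    "pii-scanning": {"datahub", "spark-operator"},
}

LAYER_3 = [
    "fine-grained-access-control",
    "pipeline-lineage",
    "job-lineage",
    "query-lineage",
    "data-catalog-ingestion",
    "pii-scanning",
]


def dependents_of(depends_on, name):
    """Return the members that list `name` in their dependsOn, sorted by name."""
    return sorted(m for m, deps in depends_on.items() if name in deps)


# Each Layer 3 member should have no dependents, which makes it a leaf.
leaf_report = {name: dependents_of(DEPENDS_ON, name) for name in LAYER_3}
for name, dependents in leaf_report.items():
    print(f"{name}: {dependents}")

# A Layer 2 member such as ranger does have dependents, so it is not a leaf.
print("ranger:", dependents_of(DEPENDS_ON, "ranger"))
```

Running the same check on a Layer 2 member shows the contrast: ranger has a dependent, so disabling it would affect a Layer 3 capability as well.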