Observability
The observability stack provides metrics collection, log aggregation, and dashboards for the tenant cluster. It is a prerequisite for most application stacks — they depend on it for scrape targets and log shipping.
Components
| Component | Role |
|---|---|
| Prometheus | Scrapes metrics from all cluster workloads — application pods, Kafka, Spark, Trino, and more |
| Grafana | Dashboard UI for metrics. Pre-built dashboards for platform components are included. |
| Loki | Log aggregation backend — stores logs from all pods for querying via Grafana |
| Alloy | OpenTelemetry-based log collector — ships pod logs to Loki |
| Promtail | Secondary log shipper (legacy support) |
The stack is deployed via the kube-prometheus-stack Helm chart (Prometheus, Grafana, and Alertmanager), with separate Helm releases for Loki and Alloy.
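Logs stored in Loki are normally queried through Grafana, but for scripted checks it can help to call Loki's HTTP API directly. The following is a minimal sketch, not the platform's own tooling: it only builds a request URL for Loki's `/loki/api/v1/query_range` endpoint. The base URL, the `kafka` namespace label, and the helper name `build_loki_query_url` are illustrative assumptions; substitute the values used in your cluster.

```python
from urllib.parse import urlencode


def build_loki_query_url(base_url: str, logql: str, start_ns: int, end_ns: int, limit: int = 100) -> str:
    """Build a URL for Loki's range-query endpoint.

    start_ns and end_ns are Unix timestamps in nanoseconds.
    """
    params = {
        "query": logql,          # LogQL expression
        "start": str(start_ns),  # start of the time range
        "end": str(end_ns),      # end of the time range
        "limit": str(limit),     # maximum number of log lines to return
    }
    return f"{base_url.rstrip('/')}/loki/api/v1/query_range?{urlencode(params)}"


# Hypothetical Loki address and namespace label, for illustration only.
url = build_loki_query_url(
    "http://loki.example.internal:3100",
    '{namespace="kafka"} |= "error"',
    start_ns=1_700_000_000_000_000_000,
    end_ns=1_700_000_300_000_000_000,
)
print(url)
```

The resulting URL can be fetched with any HTTP client. Authentication and tenant headers, if your Loki deployment requires them, are out of scope for this sketch.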
Storage
Long-term retention is backed by three S3 buckets provisioned by the storages stack. The log archive bucket holds metrics and logs exported for retention beyond the in-cluster Loki retention window.
Sizing
The observability stack supports four capacity sizes:
| Size | Intended use |
|---|---|
| dev | Budget / single-node; minimal resource footprint |
| small | Staging environments |
| medium | Standard production |
| large | High-volume production |
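If the size is passed in from a deployment script, validating it early avoids a failed rollout caused by a typo. The sketch below assumes only the four size names listed above. The `ObservabilitySize` enum and the `parse_size` helper are hypothetical names, not part of the platform.

```python
from enum import Enum


class ObservabilitySize(str, Enum):
    """The four capacity sizes listed in the Sizing table."""

    DEV = "dev"
    SMALL = "small"
    MEDIUM = "medium"
    LARGE = "large"


def parse_size(value: str) -> ObservabilitySize:
    """Normalize a user-supplied size name and reject unknown values."""
    normalized = value.strip().lower()
    try:
        return ObservabilitySize(normalized)
    except ValueError:
        allowed = ", ".join(size.value for size in ObservabilitySize)
        raise ValueError(
            f"unknown observability size {value!r}; expected one of: {allowed}"
        ) from None


# Whitespace and case are normalized before lookup.
size = parse_size(" Medium ")
print(size.value)
```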