Observability

The observability stack provides metrics collection, log aggregation, and dashboards for the tenant cluster. It is a prerequisite for most application stacks, which rely on it to scrape their metrics and ship their logs.

Components

Component	Role
Prometheus	Scrapes metrics from all cluster workloads — application pods, Kafka, Spark, Trino, and more
Grafana	Dashboard UI for metrics. Pre-built dashboards for platform components are included.
Loki	Log aggregation backend — stores logs from all pods for querying via Grafana
Alloy	OpenTelemetry-based log collector — ships pod logs to Loki
Promtail	Secondary log shipper (legacy support)

The stack is deployed via the kube-prometheus-stack Helm chart (Prometheus, Grafana, and Alertmanager) plus separate Helm releases for Loki and Alloy.
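As a rough illustration of how a client could reach the two backends directly instead of through Grafana, the sketch below builds request URLs for the Prometheus instant-query endpoint and the Loki range-query endpoint. The service hostnames and the `observability` namespace are assumptions made for this example; the actual addresses depend on how the Helm releases are named in a given cluster. Only the URL construction is shown, so no request is sent.

```python
from urllib.parse import urlencode

# Hypothetical in-cluster addresses; real service names depend on the Helm release names.
PROMETHEUS_URL = "http://prometheus.observability.svc:9090"
LOKI_URL = "http://loki.observability.svc:3100"


def prometheus_instant_query_url(promql: str, base_url: str = PROMETHEUS_URL) -> str:
    """Build a URL for the Prometheus instant-query endpoint (/api/v1/query)."""
    return f"{base_url}/api/v1/query?{urlencode({'query': promql})}"


def loki_range_query_url(
    logql: str,
    start_ns: int,
    end_ns: int,
    limit: int = 100,
    base_url: str = LOKI_URL,
) -> str:
    """Build a URL for the Loki range-query endpoint (/loki/api/v1/query_range).

    start_ns and end_ns are Unix timestamps in nanoseconds.
    """
    params = {"query": logql, "start": start_ns, "end": end_ns, "limit": limit}
    return f"{base_url}/loki/api/v1/query_range?{urlencode(params)}"


if __name__ == "__main__":
    # "up" is the built-in Prometheus metric reporting whether each scrape target is reachable.
    print(prometheus_instant_query_url("up"))
    # The namespace label value here is illustrative only.
    print(loki_range_query_url('{namespace="kafka"}', 0, 1_000_000_000))
```

In practice most users run these queries through Grafana, which already has both backends available as data sources.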

Storage

Three S3 buckets from the storages stack back long-term retention. The log archive bucket stores exported metrics and logs beyond the in-cluster Loki retention window.

Sizing

The observability stack supports four capacity sizes:

Size	Intended use
`dev`	Budget / single-node — minimal resource footprint
`small`	Staging environments
`medium`	Standard production
`large`	High-volume production
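Tooling that accepts a size setting may want to reject unknown values early. The following is a minimal sketch of such a check, using only the four size names and intended uses from the table above; the `ObservabilitySize` enum and the `parse_size` helper are hypothetical names introduced for this example, not part of the platform.

```python
from enum import Enum


class ObservabilitySize(str, Enum):
    """The four capacity sizes listed in the sizing table."""
    DEV = "dev"
    SMALL = "small"
    MEDIUM = "medium"
    LARGE = "large"


# Intended use per size, copied from the sizing table.
INTENDED_USE = {
    ObservabilitySize.DEV: "Budget / single-node, minimal resource footprint",
    ObservabilitySize.SMALL: "Staging environments",
    ObservabilitySize.MEDIUM: "Standard production",
    ObservabilitySize.LARGE: "High-volume production",
}


def parse_size(value: str) -> ObservabilitySize:
    """Normalize a user-supplied size name and fail with a clear message if unknown."""
    try:
        return ObservabilitySize(value.strip().lower())
    except ValueError:
        allowed = ", ".join(size.value for size in ObservabilitySize)
        raise ValueError(f"unknown size {value!r}; expected one of: {allowed}") from None


if __name__ == "__main__":
    size = parse_size("medium")
    print(size.value, "->", INTENDED_USE[size])
```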

Go Deeper

Storages — the S3 buckets used for log and metrics archiving
Kafka — Kafka JMX metrics are scraped by Prometheus
