Hive Metastore
The Hive Metastore (HMS) is the shared table catalog for the tenant cluster. It is the authoritative store of schema definitions — databases, tables, and partitions — and is used by both Trino and Spark as their catalog backend.
Role in the Stack
HMS decouples schema metadata from query engines. Trino reads table and column definitions from HMS to plan and execute queries; Spark uses HMS for the same purpose when running jobs against Delta tables. The data itself is stored in S3. HMS stores only the metadata that describes where that data lives and how it is organized.
           ┌─────────────────────────────┐
           │       Hive Metastore        │
           │ (schema / partition store)  │
           └──────────┬──────────────────┘
                      │ Thrift (port 9083)
           ┌──────────┴──────────┐
           │                     │
         Trino              Spark jobs
    (query engine)          (ETL / ML)
           │                     │
           └──────────┬──────────┘
                      │ S3A
              Warehouse bucket
             (actual data files)
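
Because both engines reach HMS over Thrift on port 9083, a quick TCP check is often the first step when a query engine cannot see its tables. The following is a minimal sketch using only the Python standard library; the function name `hms_reachable` and the hostname in the usage note are hypothetical, and a successful connection only shows that the port accepts connections, not that the Thrift service behind it is healthy.

```python
import socket


def hms_reachable(host: str, port: int = 9083, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to the HMS Thrift port can be opened.

    This only checks that something is listening on the port; it does not
    speak the Thrift protocol or validate the metastore itself.
    """
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False
```

For example, `hms_reachable("hive-metastore.example.svc")` (a placeholder service name) would return `False` if the pod is down or a network policy blocks port 9083.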
Components
| Component | Description |
|---|---|
| Hive Metastore | Custom image (hive-metastore:3.0.0) running the Thrift server on port 9083. |
| PostgreSQL | Backing database for schema and partition metadata (provisioned via KubeBlocks). |
Storage
HMS uses two storage systems:
| Storage | Purpose |
|---|---|
| PostgreSQL | Stores the metadata: database names, table names, column definitions, partition specs, SerDe configs |
| S3 warehouse bucket | Stores the actual data files (Parquet, Delta). HMS records the S3 path for each table; query engines read and write the files directly |
The HMS pod runs with an IRSA (IAM Roles for Service Accounts) role that grants s3:GetObject, s3:PutObject, s3:ListBucket, and s3:DeleteObject on the warehouse bucket. The warehouse root is s3a://<bucket>/warehouse.
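
To make the permission set concrete, here is a sketch of an IAM policy document carrying the four actions listed above, built as a Python dictionary. It is an illustration, not the deployed policy: the function name `warehouse_policy`, the `Sid` values, the `aws` ARN partition, and the choice to scope object actions to the whole bucket rather than a prefix are all assumptions. The one structural point it relies on is standard IAM behavior: `s3:ListBucket` applies to the bucket ARN, while the object actions apply to object ARNs under it.

```python
import json

# Object-level actions named in the text; these apply to object ARNs.
OBJECT_ACTIONS = ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"]


def warehouse_policy(bucket: str) -> dict:
    """Build an illustrative IAM policy for the HMS warehouse bucket."""
    bucket_arn = f"arn:aws:s3:::{bucket}"  # assumes the standard "aws" partition
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # s3:ListBucket is a bucket-level action.
                "Sid": "ListWarehouseBucket",
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [bucket_arn],
            },
            {
                # Get/Put/Delete are object-level actions.
                "Sid": "ReadWriteWarehouseObjects",
                "Effect": "Allow",
                "Action": OBJECT_ACTIONS,
                "Resource": [f"{bucket_arn}/*"],
            },
        ],
    }


def warehouse_policy_json(bucket: str) -> str:
    """Serialize the policy for review or for passing to other tooling."""
    return json.dumps(warehouse_policy(bucket), indent=2)
```

Calling `warehouse_policy_json("my-warehouse-bucket")` with a placeholder bucket name yields a JSON document that can be compared against the role actually attached to the HMS service account.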
Sizing
HMS can run with more than one replica for higher availability. The recommended replica count for each size is:
| Size | Replicas |
|---|---|
| Small | 1 |
| Medium | 2 |
| Large | 4 |
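
The sizing table can be expressed as a small lookup, which may be handy when generating deployment values from a size label. This is only a sketch: the names `REPLICAS_BY_SIZE` and `hms_replicas` are hypothetical, and how the resulting number is fed into the deployment is not covered here.

```python
# Recommended replica counts, copied from the sizing table above.
REPLICAS_BY_SIZE = {"small": 1, "medium": 2, "large": 4}


def hms_replicas(size: str) -> int:
    """Return the recommended HMS replica count for a size label.

    The lookup is case-insensitive and ignores surrounding whitespace.
    """
    key = size.strip().lower()
    if key not in REPLICAS_BY_SIZE:
        valid = ", ".join(sorted(REPLICAS_BY_SIZE))
        raise ValueError(f"unknown size {size!r}; expected one of: {valid}")
    return REPLICAS_BY_SIZE[key]
```

Note that the Small size runs a single replica, so it has no redundancy if that pod is restarted.
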
Consumers
| Consumer | How it uses HMS |
|---|---|
| Trino | Reads table and column definitions via Thrift to plan queries against the delta catalog |
| Spark | Reads and writes table definitions when running ETL or ML jobs with spark.sql.catalogImplementation=hive |
| DataHub | The sys-hive-metastore ingestion pipeline reads HMS's PostgreSQL database directly to populate DataHub with table and schema metadata |
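
As an illustration of how the two Thrift consumers might be pointed at HMS, the sketch below generates a Spark configuration map and the contents of a Trino catalog properties file. Only `spark.sql.catalogImplementation=hive`, port 9083, the `delta` catalog name, and the `s3a://<bucket>/warehouse` root come from this page. The remaining keys (`spark.hadoop.hive.metastore.uris`, `spark.sql.warehouse.dir`, `connector.name=delta_lake`, `hive.metastore.uri`) are standard Spark and Trino settings that are assumed, not confirmed, to match this deployment. The function names and the hostname used in the usage note are hypothetical.

```python
def spark_hms_conf(metastore_host: str, bucket: str, port: int = 9083) -> dict:
    """Spark settings that point a session at HMS (assumed, not confirmed)."""
    return {
        # From this page: Spark uses the Hive catalog implementation.
        "spark.sql.catalogImplementation": "hive",
        # Assumed: Thrift endpoint passed through to the Hive client.
        "spark.hadoop.hive.metastore.uris": f"thrift://{metastore_host}:{port}",
        # Assumed: warehouse root matching the path given in Storage.
        "spark.sql.warehouse.dir": f"s3a://{bucket}/warehouse",
    }


def trino_delta_catalog_properties(metastore_host: str, port: int = 9083) -> str:
    """Contents of a Trino catalog file such as delta.properties (assumed)."""
    lines = [
        "connector.name=delta_lake",
        f"hive.metastore.uri=thrift://{metastore_host}:{port}",
    ]
    return "\n".join(lines) + "\n"
```

Each entry returned by `spark_hms_conf("hive-metastore.example.svc", "my-bucket")` can be applied with `SparkSession.builder.config(key, value)`, and the string from `trino_delta_catalog_properties(...)` corresponds to a catalog file named after the `delta` catalog.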