Hive Metastore

The Hive Metastore (HMS) is the shared table catalog for the tenant cluster. It is the authoritative store of schema definitions — databases, tables, and partitions — and is used by both Trino and Spark as their catalog backend.

Role in the Stack

HMS decouples schema metadata from query engines. Trino reads table and column definitions from HMS to plan and execute queries; Spark does the same when running jobs against Delta tables. The data itself is stored in S3; HMS stores only the metadata that describes where and how that data is organized.

          ┌─────────────────────────────┐
          │      Hive Metastore         │
          │  (schema / partition store) │
          └──────────┬──────────────────┘
                     │ Thrift (port 9083)
          ┌──────────┴──────────┐
          │                     │
        Trino               Spark jobs
     (query engine)       (ETL / ML)
          │                     │
          └──────────┬──────────┘
                     │ S3A
              Warehouse bucket
              (actual data files)
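
Because both engines reach HMS over Thrift on port 9083, a plain TCP check is often enough to tell whether the service is reachable from a client pod. The following is a minimal sketch, not a full Thrift handshake; the function name `hms_reachable` is hypothetical, and the hostname you pass in depends on your deployment and is not defined in this document.

```python
import socket


def hms_reachable(host: str, port: int = 9083, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds.

    This only confirms that something is listening on the port. It does not
    verify that the listener is a working Hive Metastore Thrift server.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False
```

Calling `hms_reachable("<hms-host>")` from a Trino or Spark pod returns `False` when the port is closed, the name does not resolve, or the connection times out.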

Components

Component	Description
Hive Metastore	Custom image (`hive-metastore:3.0.0`) running the Thrift server on port 9083.
PostgreSQL	Backing database for schema and partition metadata (provisioned via KubeBlocks).

Storage

HMS uses two storage systems:

Storage	Purpose
PostgreSQL	Stores the metadata: database names, table names, column definitions, partition specs, and SerDe configs.
S3 warehouse bucket	Stores the actual data files (Parquet, Delta). HMS records the S3 path for each table; query engines read and write the files directly in S3.

The HMS pod has an IRSA role granting it `s3:GetObject`, `s3:PutObject`, `s3:ListBucket`, and `s3:DeleteObject` on the warehouse bucket. The warehouse root is `s3a://<bucket>/warehouse`.
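
As an illustration of what such a grant can look like, the sketch below builds an IAM policy document with the four actions listed above. It is an assumption about the policy shape, not the policy actually deployed: the function name `warehouse_policy` and the statement IDs are hypothetical, and the object-level actions are scoped to the whole bucket here, whereas a real deployment may restrict them to a prefix. In IAM, `s3:ListBucket` applies to the bucket ARN, while the object actions apply to object ARNs (`<bucket>/*`), which is why the policy uses two statements.

```python
import json


def warehouse_policy(bucket: str) -> dict:
    """Build an IAM policy document granting the four S3 actions on `bucket`.

    Assumption: access is granted on the whole bucket, not just a prefix.
    """
    bucket_arn = f"arn:aws:s3:::{bucket}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # ListBucket is a bucket-level action, so it targets the bucket ARN.
                "Sid": "ListWarehouseBucket",
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [bucket_arn],
            },
            {
                # Get/Put/Delete are object-level actions, so they target object ARNs.
                "Sid": "ReadWriteWarehouseObjects",
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
                "Resource": [f"{bucket_arn}/*"],
            },
        ],
    }


def warehouse_policy_json(bucket: str) -> str:
    """Render the policy as indented JSON, ready to paste into an IAM role."""
    return json.dumps(warehouse_policy(bucket), indent=2)
```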

Sizing

HMS can run multiple replicas for higher availability. The recommended replica count for each cluster size is:

Size	Replicas
Small	1
Medium	2
Large	4

Consumers

Consumer	How it uses HMS
Trino	Reads table and column definitions via Thrift to plan queries against the `delta` catalog
Spark	Reads and writes table definitions when running ETL or ML jobs with `spark.sql.catalogImplementation=hive`
Datahub	The `sys-hive-metastore` ingestion pipeline reads HMS's PostgreSQL database directly to populate Datahub with table and schema metadata
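
To make the Trino and Spark rows concrete, the sketch below generates the client-side settings that point each engine at HMS. It is written under assumptions: the helper names are hypothetical, the HMS hostname is a placeholder you supply, and the Trino `delta` catalog is assumed to use the `delta_lake` connector. Only `spark.sql.catalogImplementation=hive` and port 9083 come from this page; the actual cluster configuration may include more properties.

```python
def spark_hms_conf(hms_host: str, port: int = 9083) -> dict:
    """Spark properties that select the Hive catalog and point it at HMS."""
    return {
        "spark.sql.catalogImplementation": "hive",
        # The "spark.hadoop." prefix passes hive.metastore.uris to the Hive client.
        "spark.hadoop.hive.metastore.uris": f"thrift://{hms_host}:{port}",
    }


def trino_delta_catalog(hms_host: str, port: int = 9083) -> str:
    """Contents of a Trino catalog file (for example, delta.properties).

    Assumption: the `delta` catalog is backed by Trino's delta_lake connector.
    """
    lines = [
        "connector.name=delta_lake",
        f"hive.metastore.uri=thrift://{hms_host}:{port}",
    ]
    return "\n".join(lines) + "\n"
```

The Spark dictionary can be passed as `--conf key=value` pairs to `spark-submit`, and the Trino string corresponds to the contents of a catalog properties file.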

Go Deeper

Datahub — the sys-hive-metastore ingestion pipeline that reads HMS metadata into Datahub
Catalog — the UI view of HMS-derived metadata
