Hive Metastore

The Hive Metastore (HMS) is the shared table catalog for the tenant cluster. It is the authoritative store of schema definitions — databases, tables, and partitions — and is used by both Trino and Spark as their catalog backend.

Role in the Stack

HMS decouples schema metadata from query engines. Trino reads table and column definitions from HMS to plan and execute queries; Spark reads and writes the same definitions when running jobs against Delta tables. The data itself is stored in S3; HMS stores only the metadata that describes where that data lives and how it is organized.

        ┌─────────────────────────────┐
        │       Hive Metastore        │
        │ (schema / partition store)  │
        └──────────┬──────────────────┘
                   │ Thrift (port 9083)
        ┌──────────┴──────────┐
        │                     │
      Trino              Spark jobs
 (query engine)          (ETL / ML)
        │                     │
        └──────────┬──────────┘
                   │ S3A
           Warehouse bucket
          (actual data files)
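
Both engines in the diagram above reach HMS over Thrift on port 9083, so a basic TCP check is often the first step when a catalog lookup fails. The following is a minimal sketch, not part of the platform tooling: the host name is a hypothetical placeholder, and the check only confirms that the port accepts connections, not that the Thrift service behind it is healthy.

```python
import socket

# Hypothetical service name (assumption); substitute the host your cluster exposes HMS on.
HMS_HOST = "hive-metastore.example.internal"
HMS_THRIFT_PORT = 9083  # Thrift port shown in the diagram


def thrift_port_reachable(host: str, port: int = HMS_THRIFT_PORT, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False
```

Calling `thrift_port_reachable(HMS_HOST)` from a pod in the same network as Trino or Spark separates network-level problems from catalog-level ones.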

Components

| Component | Description |
| --- | --- |
| Hive Metastore | Custom image (hive-metastore:3.0.0) running the Thrift server on port 9083. |
| PostgreSQL | Backing database for schema and partition metadata (provisioned via KubeBlocks). |

Storage

HMS uses two storage systems:

| Storage | Purpose |
| --- | --- |
| PostgreSQL | Stores the metadata: database names, table names, column definitions, partition specs, SerDe configs |
| S3 warehouse bucket | Stores the actual data files (Parquet, Delta). HMS knows the S3 path for each table; query engines read/write directly |

The HMS pod has an IRSA role granting it s3:GetObject, s3:PutObject, s3:ListBucket, and s3:DeleteObject on the warehouse bucket. The warehouse root is s3a://<bucket>/warehouse.
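
To make the permission set concrete, the sketch below builds an IAM policy document containing the four actions listed above. It is an illustration under stated assumptions rather than the policy the platform actually provisions: the bucket name is hypothetical, and the real role may scope resources more narrowly (for example, to a key prefix). The split into two statements reflects how S3 permissions work: `s3:ListBucket` applies to the bucket ARN, while the object actions apply to keys inside it.

```python
import json

# Hypothetical bucket name (assumption); the source only gives s3a://<bucket>/warehouse.
WAREHOUSE_BUCKET = "example-warehouse-bucket"


def warehouse_policy(bucket: str) -> dict:
    """Build an IAM policy document granting the four S3 actions on the bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # ListBucket is a bucket-level action, so it targets the bucket ARN.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket}"],
            },
            {
                # Object-level actions target keys inside the bucket.
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
                "Resource": [f"arn:aws:s3:::{bucket}/*"],
            },
        ],
    }


def warehouse_root(bucket: str) -> str:
    """Return the warehouse root in the s3a://<bucket>/warehouse form."""
    return f"s3a://{bucket}/warehouse"


print(json.dumps(warehouse_policy(WAREHOUSE_BUCKET), indent=2))
print(warehouse_root(WAREHOUSE_BUCKET))
```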

Sizing

HMS can run with more than one replica for higher availability. The recommended replica count for each deployment size is:

| Size | Replicas |
| --- | --- |
| Small | 1 |
| Medium | 2 |
| Large | 4 |

Consumers

| Consumer | How it uses HMS |
| --- | --- |
| Trino | Reads table and column definitions via Thrift to plan queries against the delta catalog |
| Spark | Reads and writes table definitions when running ETL or ML jobs with spark.sql.catalogImplementation=hive |
| Datahub | The sys-hive-metastore ingestion pipeline reads HMS's PostgreSQL database directly to populate Datahub with table and schema metadata |
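
The sketch below shows what the client-side settings for the two Thrift consumers might look like. Only `spark.sql.catalogImplementation=hive` and port 9083 come from this page. The host name and bucket are hypothetical, and the sketch assumes that the delta catalog uses Trino's Delta Lake connector and that Spark receives the metastore URI through `spark.hadoop.hive.metastore.uris`; the platform's actual configuration may differ.

```python
# Hypothetical endpoint and bucket (assumptions); replace with values from your deployment.
HMS_URI = "thrift://hive-metastore.example.internal:9083"
WAREHOUSE_DIR = "s3a://example-warehouse-bucket/warehouse"


def spark_conf(hms_uri: str, warehouse_dir: str) -> dict:
    """Spark settings that select the Hive catalog and point it at HMS."""
    return {
        "spark.sql.catalogImplementation": "hive",
        "spark.hadoop.hive.metastore.uris": hms_uri,
        "spark.sql.warehouse.dir": warehouse_dir,
    }


def spark_submit_args(conf: dict) -> list:
    """Render a settings dict as spark-submit --conf arguments."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


def trino_delta_catalog(hms_uri: str) -> str:
    """Contents of a Trino catalog properties file for the Delta Lake connector."""
    lines = ["connector.name=delta_lake", f"hive.metastore.uri={hms_uri}"]
    return "\n".join(lines) + "\n"


print(" ".join(spark_submit_args(spark_conf(HMS_URI, WAREHOUSE_DIR))))
print(trino_delta_catalog(HMS_URI), end="")
```

Datahub is absent from the sketch because, as the table notes, its ingestion pipeline reads the PostgreSQL database directly rather than going through Thrift.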

Go Deeper

  • Datahub — the sys-hive-metastore ingestion pipeline that reads HMS metadata into Datahub
  • Catalog — the UI view of HMS-derived metadata