Datahub

Datahub is the metadata catalog deployed into each tenant cluster. It stores the schema, lineage, ownership, and column-level tags for all data assets in the workspace. The Cogrion Catalog UI reads from and writes to Datahub via the BFF API.

Components

| Component | Description |
| --- | --- |
| datahub-gms | Generalized Metadata Service (GMS), the core backend. Serves the GraphQL and REST APIs and stores all metadata entities. |
| datahub-frontend | The Datahub web UI. Accessible directly, but typically consumed via the Cogrion UI through the BFF. |
| acryl-datahub-actions | Event-driven actions framework. Runs background reactions to metadata changes (notifications, propagation, etc.). |
| datahub-mce-consumer | Reads inbound Metadata Change Events (MCE) from Kafka and applies them to GMS. |
| datahub-mae-consumer | Processes Metadata Audit Events (MAE), the downstream effects of committed metadata changes. |
| datahub-ingestion-cron | Runs scheduled ingestion jobs that pull schema and table metadata from connected sources. |

Backing infrastructure (deployed by the same bundle):

| Component | Description |
| --- | --- |
| MySQL | Datahub's primary relational store for metadata entities (provisioned via KubeBlocks). |
| Elasticsearch | Powers Datahub's search index and graph queries. |
| Schema Registry | Kafka Schema Registry for Avro-encoded Datahub event streams. Kafka itself is the shared platform cluster; the Schema Registry runs as a separate pod. |

Authentication

Datahub is configured with OIDC using the workspace's Keycloak realm. The datahub_user client role is created as part of the Keycloak OAuth client provisioned by the bundle. Keycloak realm roles are mapped to Datahub access as follows:

| Keycloak Realm Role | Datahub Access |
| --- | --- |
| platform_admin | datahub_user |
| data_engineer | datahub_user |
| ml_engineer | datahub_user |

Just-in-time (JIT) provisioning is enabled: users are created in Datahub on first login, with no pre-provisioning required.
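
The role mapping in this section can be read as a plain lookup. The sketch below only mirrors the table; the helper name `resolve_datahub_access` and the idea of resolving access from a list of realm roles are illustrative assumptions, not the bundle's actual implementation.

```python
# Illustrative only: mirrors the role table in this section as a lookup.
# Keycloak realm role -> Datahub access, as listed in the documentation.
REALM_ROLE_TO_DATAHUB_ACCESS = {
    "platform_admin": "datahub_user",
    "data_engineer": "datahub_user",
    "ml_engineer": "datahub_user",
}


def resolve_datahub_access(realm_roles):
    """Return the set of Datahub roles granted by the given realm roles.

    Hypothetical helper: realm roles that are not in the table grant nothing.
    """
    return {
        REALM_ROLE_TO_DATAHUB_ACCESS[role]
        for role in realm_roles
        if role in REALM_ROLE_TO_DATAHUB_ACCESS
    }


# "offline_access" stands in for any realm role that is not in the table.
print(resolve_datahub_access(["data_engineer", "offline_access"]))
```

A user holding none of the three listed realm roles resolves to an empty set in this sketch, which corresponds to having no Datahub access.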

Metadata Ingestion

Two system ingestion pipelines are seeded automatically at deploy time. They run daily at midnight and can also be triggered manually from the Datahub UI.

| Pipeline | Source | What it ingests |
| --- | --- | --- |
| sys-hive-metastore | Hive Metastore (PostgreSQL) | Database, schema, and table definitions from the Hive Metastore |
| sys-trino | Trino | Tables, views, and column definitions from the delta catalog |

These pipelines populate the Dataset entities in Datahub that the Cogrion Catalog UI displays.
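
For orientation, a pipeline like sys-trino can be pictured as a source/sink recipe plus a cron schedule. The sketch below is an assumption-laden illustration, not the seeded configuration: the host names are placeholders, `build_trino_recipe` is a hypothetical helper, and the field names follow the general shape of Datahub ingestion recipes and should be checked against the Trino source documentation for the deployed version.

```python
import json

# Cron expression for "daily at midnight", matching the schedule described above.
DAILY_AT_MIDNIGHT = "0 0 * * *"


def build_trino_recipe(host_port, gms_url, catalog="delta"):
    """Build a recipe-shaped dict for a Trino source (placeholder values only)."""
    return {
        "source": {
            "type": "trino",
            "config": {
                "host_port": host_port,  # placeholder Trino endpoint
                "database": catalog,     # assumed key for the Trino catalog to scan
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": gms_url},  # placeholder GMS address
        },
    }


recipe = build_trino_recipe("trino.example.internal:8080", "http://datahub-gms:8080")
print(DAILY_AT_MIDNIGHT)
print(json.dumps(recipe, indent=2))
```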

Ranger Tag Sync

The datahub-ranger-tag-sync service bridges Datahub and Ranger. When a column tag is created or modified in Datahub, the change is reflected in Ranger's tag store automatically — no manual step required.
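
A consumer of this kind first has to decide which change events are worth forwarding. The sketch below shows one way that filter might look. It assumes events have already been decoded into dicts with `entityType` and `aspectName` keys, and it assumes column tags arrive through the `schemaMetadata` and `editableSchemaMetadata` aspects; the function name and the aspect list are illustrative and do not describe the service's actual code.

```python
# Aspects assumed to carry column-level tags (an assumption for this sketch).
COLUMN_TAG_ASPECTS = {"schemaMetadata", "editableSchemaMetadata"}


def is_column_tag_candidate(event):
    """Return True if a decoded change-log event may contain column tag changes.

    `event` is a plain dict; real events are Avro-encoded and must be decoded first.
    """
    return (
        event.get("entityType") == "dataset"
        and event.get("aspectName") in COLUMN_TAG_ASPECTS
    )


events = [
    {"entityType": "dataset", "aspectName": "editableSchemaMetadata"},
    {"entityType": "dataset", "aspectName": "ownership"},
    {"entityType": "corpuser", "aspectName": "corpUserInfo"},
]
# Only the first event passes the filter.
print([is_column_tag_candidate(e) for e in events])
```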

How It Works

| Configuration | Value |
| --- | --- |
| Kafka topic | MetadataChangeLog_Versioned_v1 |
| Consumer group | datahub-tag-sync |
| Target Ranger service | trino |
| Rate limit | 120 events/minute |
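
To make the rate limit concrete, here is a minimal sliding-window limiter capped at 120 events per 60 seconds. How the service actually enforces its limit (delaying, dropping, or batching events) is not stated here, so the class name and behaviour below are assumptions chosen only to illustrate the number.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Allow at most `max_events` within any `window_seconds` span."""

    def __init__(self, max_events=120, window_seconds=60.0, clock=time.monotonic):
        self.max_events = max_events
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self._timestamps = deque()

    def try_acquire(self):
        """Record one event and return True, or return False if the window is full."""
        now = self.clock()
        # Drop timestamps that have aged out of the window.
        while self._timestamps and now - self._timestamps[0] >= self.window_seconds:
            self._timestamps.popleft()
        if len(self._timestamps) < self.max_events:
            self._timestamps.append(now)
            return True
        return False


limiter = SlidingWindowLimiter()
accepted = sum(limiter.try_acquire() for _ in range(130))
print(accepted)
```

In this sketch a caller that receives `False` would wait and retry rather than discard the event, since a dropped tag change would leave Ranger out of sync.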

Once a tag is in Ranger's tag store, it can be used in tag-based column restriction and masking policies via the Data Access Management UI.

Lineage

Table lineage is populated by the Datahub Airflow plugin installed in the Airflow deployment. When a DAG runs a pipeline that reads from or writes to tables, the plugin emits lineage events to Datahub GMS. Lineage appears in the Cogrion Catalog UI without any manual step.

The sys-trino and sys-hive-metastore ingestion pipelines populate Dataset entities (schemas, tables, columns) but do not emit lineage — lineage is a runtime signal from Airflow, not a structural signal from ingestion.
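
Datahub identifies each dataset by a URN, and table lineage is a directed edge from an upstream dataset URN to a downstream one. The sketch below builds such URNs and a simplified edge. The table names are made up, `lineage_edge` is a hypothetical helper, and the dict it returns is a stand-in for illustration rather than the format the Airflow plugin emits.

```python
def dataset_urn(platform, name, env="PROD"):
    """Build a Datahub dataset URN for a platform, qualified table name, and environment."""
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"


def lineage_edge(upstream, downstream):
    """Represent 'downstream is derived from upstream' as a plain dict (illustrative)."""
    return {"upstream": upstream, "downstream": downstream}


# Hypothetical tables read and written by one DAG task.
source = dataset_urn("trino", "delta.sales.orders_raw")
target = dataset_urn("trino", "delta.sales.orders_clean")
print(lineage_edge(source, target))
```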

PII Scanning

A Spark application (pii-scan-prod) runs on a cron schedule and scans the column values of data tables for PII. Detected PII entity types are written back to Datahub as structured properties (PII Scanning) on the relevant Dataset entities.

Supported PII entity types include: CREDIT_CARD, EMAIL_ADDRESS, PHONE_NUMBER, PERSON, LOCATION, IP_ADDRESS, and others.

The PII scanner uses the Hive Metastore to discover tables and writes results to the system.pii_scan_results Delta table in addition to Datahub.
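
The detection method used by the scanner is not described here. As a rough illustration of what classifying sampled column values into these entity types can involve, the sketch below uses simple standard-library checks for three of the listed types. The function names, the 0.5 threshold, and the regex are all assumptions; types such as PERSON and LOCATION generally need more than pattern matching.

```python
import ipaddress
import re

EMAIL_RE = re.compile(r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$")


def luhn_valid(digits):
    """Check a digit string against the Luhn checksum used by card numbers."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def classify_value(value):
    """Return a PII entity type for one value, or None if nothing matches."""
    value = value.strip()
    if EMAIL_RE.match(value):
        return "EMAIL_ADDRESS"
    try:
        ipaddress.ip_address(value)
        return "IP_ADDRESS"
    except ValueError:
        pass
    digits = re.sub(r"[ -]", "", value)
    if digits.isdigit() and 13 <= len(digits) <= 19 and luhn_valid(digits):
        return "CREDIT_CARD"
    return None


def scan_column(values, threshold=0.5):
    """Return entity types found in at least `threshold` of the sampled values."""
    counts = {}
    for v in values:
        label = classify_value(v)
        if label:
            counts[label] = counts.get(label, 0) + 1
    n = len(values)
    return sorted(label for label, c in counts.items() if n and c / n >= threshold)


print(scan_column(["alice@example.com", "bob@example.org", "n/a"]))
```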
