Datahub

Datahub is the metadata catalog deployed into each tenant cluster. It stores the schema, lineage, ownership, and column-level tags for all data assets in the workspace. The Cogrion Catalog UI reads from and writes to Datahub through the backend-for-frontend (BFF) API.

Components

Component	Description
datahub-gms	Graph Metadata Service — the core backend. Serves the GraphQL and REST APIs, stores all metadata entities.
datahub-frontend	The Datahub web UI. Accessible directly but typically consumed via the Cogrion UI through the BFF.
acryl-datahub-actions	Event-driven actions framework. Runs background reactions to metadata changes (notifications, propagation, etc.).
datahub-mce-consumer	Processes inbound Metadata Change Events (MCE) from the Kafka queue into GMS.
datahub-mae-consumer	Processes Metadata Audit Events (MAE) — downstream effects of committed metadata changes.
datahub-ingestion-cron	Runs scheduled ingestion jobs that pull schema and table metadata from connected sources.

Backing infrastructure (deployed by the same bundle):

Component	Description
MySQL	Datahub's primary relational store for metadata entities (provisioned via KubeBlocks).
Elasticsearch	Powers Datahub's search index and graph queries.
Schema Registry	Kafka Schema Registry for Avro-encoded Datahub event streams. Datahub uses the shared platform Kafka but runs its own Schema Registry pod.

Authentication

Datahub is configured with OIDC using the workspace's Keycloak realm. The datahub_user client role is created as part of the Keycloak OAuth client provisioned by the bundle. Keycloak realm roles are mapped to Datahub access as follows:

Keycloak Realm Role	Datahub Access
`platform_admin`	`datahub_user`
`data_engineer`	`datahub_user`
`ml_engineer`	`datahub_user`

JIT provisioning is enabled — users are created in Datahub on first login without pre-provisioning.
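
The role mapping above can be sketched as a small lookup. This is only an illustration of the table, not the actual Keycloak or Datahub configuration. The function names are hypothetical, and the sketch assumes that a user holding none of the listed realm roles receives no Datahub access, which the table implies but does not state.

```python
# Realm roles from the table above, each mapped to the Datahub client role it grants.
REALM_ROLE_TO_DATAHUB_ROLE = {
    "platform_admin": "datahub_user",
    "data_engineer": "datahub_user",
    "ml_engineer": "datahub_user",
}


def datahub_roles_for(realm_roles):
    """Return the set of Datahub client roles granted by the given realm roles."""
    return {
        REALM_ROLE_TO_DATAHUB_ROLE[role]
        for role in realm_roles
        if role in REALM_ROLE_TO_DATAHUB_ROLE
    }


def has_datahub_access(realm_roles):
    """Assumption: access requires the datahub_user client role."""
    return "datahub_user" in datahub_roles_for(realm_roles)
```

For example, `has_datahub_access(["data_engineer"])` evaluates to true, while a user whose only realm role is outside the table gets no Datahub role under this assumption.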

Metadata Ingestion

Two system ingestion pipelines are seeded automatically at deploy time. They run daily at midnight and can also be triggered manually from the Datahub UI.

Pipeline	Source	What it ingests
`sys-hive-metastore`	Hive Metastore (PostgreSQL)	Database, schema, and table definitions from the Hive Metastore
`sys-trino`	Trino	Tables, views, and column definitions from the `delta` catalog

These pipelines populate the Dataset entities in Datahub that the Cogrion Catalog UI displays.
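
For orientation, a recipe for a pipeline like `sys-trino` might look roughly like the sketch below. The seeded recipe itself is not shown on this page, so treat the structure as an assumption based on the general Datahub recipe format (a `source` and a `sink`). The host names are hypothetical placeholders, and the cron expression is the standard five-field form of "daily at midnight" with no time zone implied.

```python
# Standard five-field cron: minute hour day-of-month month day-of-week.
DAILY_AT_MIDNIGHT = "0 0 * * *"


def build_trino_recipe(trino_host_port, gms_url):
    """Build a recipe-shaped dict that ingests the `delta` catalog from Trino.

    Both arguments are supplied by the caller; this page does not document
    the real endpoints.
    """
    return {
        "source": {
            "type": "trino",
            "config": {
                "host_port": trino_host_port,
                "database": "delta",  # the Trino catalog named in the table above
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": gms_url},
        },
    }


# Hypothetical endpoints, for illustration only.
recipe = build_trino_recipe("trino.example.internal:8080", "http://datahub-gms.example.internal:8080")
```

Check the field names against the Datahub version deployed in your cluster before reusing them.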

Ranger Tag Sync

The datahub-ranger-tag-sync service bridges Datahub and Ranger. When a column tag is created or modified in Datahub, the change is reflected in Ranger's tag store automatically — no manual step required.

How It Works

The service consumes metadata change events from Kafka and applies the resulting tag changes to the target Ranger service, using the following configuration:

Configuration	Value
Kafka topic	`MetadataChangeLog_Versioned_v1`
Consumer group	`datahub-tag-sync`
Target Ranger service	`trino`
Rate limit	120 events/minute

Once a tag is in Ranger's tag store, it can be used in tag-based column restriction and masking policies via the Data Access Management UI.
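
This page does not say how the service enforces the 120 events/minute limit. One common way to implement such a cap is a sliding-window limiter, sketched below. The class and method names are hypothetical, and the real service may use a different technique.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Allow at most `max_events` within any rolling `window_seconds` span."""

    def __init__(self, max_events=120, window_seconds=60.0, clock=time.monotonic):
        self.max_events = max_events
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self._timestamps = deque()

    def try_acquire(self):
        """Return True and record the event if it fits in the window, else False."""
        now = self.clock()
        # Drop timestamps that have aged out of the rolling window.
        while self._timestamps and now - self._timestamps[0] >= self.window_seconds:
            self._timestamps.popleft()
        if len(self._timestamps) < self.max_events:
            self._timestamps.append(now)
            return True
        return False
```

A consumer loop would call `try_acquire()` before forwarding each tag change to Ranger and wait briefly when it returns false.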

Lineage

Table lineage is populated by the Datahub Airflow plugin installed in the Airflow deployment. When a DAG runs a pipeline that reads from or writes to tables, the plugin emits lineage events to Datahub GMS. Lineage appears in the Cogrion Catalog UI without any manual step.

The sys-trino and sys-hive-metastore ingestion pipelines populate Dataset entities (schemas, tables, columns) but do not emit lineage — lineage is a runtime signal from Airflow, not a structural signal from ingestion.
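
Lineage in Datahub connects Dataset entities, which are identified by URNs. The sketch below builds dataset URNs in Datahub's standard format and pairs them into an upstream/downstream edge. The platform name, table names, and `PROD` environment are hypothetical examples. The edge dict is only an illustration, not the wire format the Airflow plugin emits.

```python
def dataset_urn(platform, name, env="PROD"):
    """Build a Datahub dataset URN from a platform, a dataset name, and an environment."""
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"


def lineage_edge(upstream_urn, downstream_urn):
    """Illustrative record of one table-level lineage relationship."""
    return {"upstream": upstream_urn, "downstream": downstream_urn}


# Hypothetical tables: a DAG task reads raw orders and writes a cleaned copy.
edge = lineage_edge(
    dataset_urn("trino", "delta.raw.orders"),
    dataset_urn("trino", "delta.curated.orders"),
)
```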

PII Scanning

A Spark application (pii-scan-prod) runs on a cron schedule and scans data tables for column values that contain PII. Detected PII entities are written back to Datahub as structured properties (PII Scanning) on the relevant Dataset entities.

Supported PII entity types include: CREDIT_CARD, EMAIL_ADDRESS, PHONE_NUMBER, PERSON, LOCATION, IP_ADDRESS, and others.

The PII scanner uses the Hive Metastore to discover tables and writes results to the system.pii_scan_results Delta table in addition to Datahub.
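
To make the idea of column-value scanning concrete, here is a deliberately simplified sketch that checks sampled values against regular expressions for two of the entity types listed above. It is not the scanner's actual detection logic, which this page does not describe. The patterns and function name are assumptions chosen for illustration, and types such as PERSON or LOCATION generally need more than regular expressions.

```python
import re

# Simplified patterns, for illustration only.
PATTERNS = {
    "EMAIL_ADDRESS": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "IP_ADDRESS": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}


def detect_pii_types(values):
    """Return the sorted PII entity types found in a sample of column values."""
    found = set()
    for value in values:
        for entity_type, pattern in PATTERNS.items():
            if pattern.search(str(value)):
                found.add(entity_type)
    return sorted(found)
```

A column whose sample yields a non-empty result would be the kind of column that ends up with a PII Scanning structured property on its Dataset entity.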

Go Deeper

Catalog — the Cogrion UI view of Datahub metadata
Column Tagging — how tags flow from the Catalog UI to Ranger
Data Access Management — creating policies using Datahub tags
