Datahub
Datahub is the metadata catalog deployed into each tenant cluster. It stores the schema, lineage, ownership, and column-level tags for all data assets in the workspace. The Cogrion Catalog UI reads from and writes to Datahub via the BFF API.
Components
| Component | Description |
|---|---|
| datahub-gms | Graph Metadata Service — the core backend. Serves the GraphQL and REST APIs, stores all metadata entities. |
| datahub-frontend | The Datahub web UI. Accessible directly but typically consumed via the Cogrion UI through the BFF. |
| acryl-datahub-actions | Event-driven actions framework. Runs background reactions to metadata changes (notifications, propagation, etc.). |
| datahub-mce-consumer | Consumes inbound Metadata Change Events (MCE) from Kafka and applies them to GMS. |
| datahub-mae-consumer | Processes Metadata Audit Events (MAE) — downstream effects of committed metadata changes. |
| datahub-ingestion-cron | Runs scheduled ingestion jobs that pull schema and table metadata from connected sources. |
Backing infrastructure (deployed by the same bundle):
| Component | Description |
|---|---|
| MySQL | Datahub's primary relational store for metadata entities (provisioned via KubeBlocks). |
| Elasticsearch | Powers Datahub's search index and graph queries. |
| Schema Registry | Kafka Schema Registry for Avro-encoded Datahub event streams (shared platform Kafka, separate Schema Registry pod). |
Authentication
Datahub is configured with OIDC using the workspace's Keycloak realm. The datahub_user client role is created as part of the Keycloak OAuth client provisioned by the bundle. Keycloak realm roles are mapped to Datahub access as follows:
| Keycloak Realm Role | Datahub Access |
|---|---|
| platform_admin | datahub_user |
| data_engineer | datahub_user |
| ml_engineer | datahub_user |
JIT provisioning is enabled — users are created in Datahub on first login without pre-provisioning.
Metadata Ingestion
Two system ingestion pipelines are seeded automatically at deploy time. They run daily at midnight and can also be triggered manually from the Datahub UI.
| Pipeline | Source | What it ingests |
|---|---|---|
| sys-hive-metastore | Hive Metastore (PostgreSQL) | Database, schema, and table definitions from the Hive Metastore |
| sys-trino | Trino | Tables, views, and column definitions from the delta catalog |
These pipelines populate the Dataset entities in Datahub that the Cogrion Catalog UI displays.
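
For orientation, a Datahub ingestion recipe pairs one source with one sink, and "daily at midnight" corresponds to the cron expression `0 0 * * *`. The sketch below builds a recipe-shaped dictionary for a Trino source. The host names, the GMS URL, and the helper `build_trino_recipe` are hypothetical, and the `host_port`, `database`, and `server` keys reflect the Datahub Trino source and REST sink options as understood here; the actual configuration of the seeded sys-trino pipeline is not shown in this document and may differ.

```python
import json

# Cron expression for "daily at midnight" (minute 0, hour 0, every day).
DAILY_AT_MIDNIGHT = "0 0 * * *"


def build_trino_recipe(host_port: str, catalog: str, gms_url: str) -> dict:
    """Return a recipe-shaped dict with one source and one sink.

    All values are placeholders supplied by the caller; nothing here is
    read from the real sys-trino pipeline.
    """
    return {
        "source": {
            "type": "trino",
            # Assumption: "database" carries the Trino catalog name.
            "config": {"host_port": host_port, "database": catalog},
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": gms_url},
        },
    }


recipe = build_trino_recipe(
    host_port="trino.example.internal:8080",    # hypothetical host
    catalog="delta",                             # catalog named in the table above
    gms_url="http://datahub-gms.example:8080",   # hypothetical GMS URL
)
print(json.dumps(recipe, indent=2))
```

Keeping the recipe as plain data makes it easy to diff a hand-written pipeline against the seeded ones before triggering it from the Datahub UI.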
Ranger Tag Sync
The datahub-ranger-tag-sync service bridges Datahub and Ranger. When a column tag is created or modified in Datahub, the change is reflected in Ranger's tag store automatically — no manual step required.
How It Works
| Configuration | Value |
|---|---|
| Kafka topic | MetadataChangeLog_Versioned_v1 |
| Consumer group | datahub-tag-sync |
| Target Ranger service | trino |
| Rate limit | 120 events/minute |
Once a tag is in Ranger's tag store, it can be used in tag-based column restriction and masking policies via the Data Access Management UI.
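
The 120 events/minute rate limit in the configuration above can be pictured as a sliding-window limiter. This document does not describe how datahub-ranger-tag-sync enforces the limit, so the sketch below is only one plausible approach; the class name `SlidingWindowLimiter` and the injectable clock are assumptions made for illustration.

```python
import time
from collections import deque
from typing import Callable, Deque


class SlidingWindowLimiter:
    """Allow at most `max_events` within any `window_seconds` span."""

    def __init__(
        self,
        max_events: int = 120,
        window_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_events = max_events
        self.window_seconds = window_seconds
        self._clock = clock
        self._timestamps: Deque[float] = deque()

    def try_acquire(self) -> bool:
        """Record an event and return True if under the limit, else False."""
        now = self._clock()
        # Drop timestamps that have aged out of the window.
        while self._timestamps and now - self._timestamps[0] >= self.window_seconds:
            self._timestamps.popleft()
        if len(self._timestamps) < self.max_events:
            self._timestamps.append(now)
            return True
        return False


# Demonstration with a fake clock so the example runs instantly.
fake_now = [0.0]
limiter = SlidingWindowLimiter(clock=lambda: fake_now[0])

accepted = sum(limiter.try_acquire() for _ in range(150))
print(accepted)  # prints 120: the remaining 30 events are rejected

fake_now[0] = 60.0  # one full window later
print(limiter.try_acquire())  # prints True: the window has emptied
```

A consumer built this way would typically pause and retry a rejected event rather than drop it, so that a burst of tag edits is delayed but still reaches Ranger.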
Lineage
Table lineage is populated by the Datahub Airflow plugin installed in the Airflow deployment. When a DAG runs a pipeline that reads from or writes to tables, the plugin emits lineage events to Datahub GMS. Lineage appears in the Cogrion Catalog UI without any manual step.
The sys-trino and sys-hive-metastore ingestion pipelines populate Dataset entities (schemas, tables, columns) but do not emit lineage — lineage is a runtime signal from Airflow, not a structural signal from ingestion.
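
To make the runtime signal concrete, the sketch below shows the shape of the information a lineage event carries: Dataset URNs for the tables a task reads and writes, joined into upstream/downstream edges. The URN layout is Datahub's standard dataset URN format. The platform name `trino`, the table names, and both helper functions are hypothetical, and the plugin's real event payload is not reproduced here.

```python
from typing import Dict, List


def dataset_urn(platform: str, name: str, env: str = "PROD") -> str:
    """Build a Datahub dataset URN from platform, qualified name, and env."""
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"


def lineage_edges(inlets: List[str], outlets: List[str]) -> List[Dict[str, str]]:
    """Treat every table a task writes as downstream of every table it reads."""
    return [
        {"upstream": inlet, "downstream": outlet}
        for outlet in outlets
        for inlet in inlets
    ]


# Hypothetical task: reads two raw tables and writes one curated table.
inlets = [
    dataset_urn("trino", "delta.raw.orders"),
    dataset_urn("trino", "delta.raw.customers"),
]
outlets = [dataset_urn("trino", "delta.curated.orders_enriched")]

for edge in lineage_edges(inlets, outlets):
    print(f"{edge['upstream']} -> {edge['downstream']}")
```

Because the edges reference Dataset URNs, lineage only lines up with catalog entries when the ingestion pipelines have already created Datasets under matching names.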
PII Scanning
A Spark application (pii-scan-prod) runs on a cron schedule and scans the column values of data tables for PII. Detected PII entities are written back to Datahub as structured properties (PII Scanning) on the relevant Dataset entities.
Supported PII entity types include: CREDIT_CARD, EMAIL_ADDRESS, PHONE_NUMBER, PERSON, LOCATION, IP_ADDRESS, and others.
The PII scanner uses the Hive Metastore to discover tables and writes results to the system.pii_scan_results Delta table in addition to Datahub.
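
As a rough illustration of column-level scanning, the sketch below labels a column with an entity type when enough of its sampled values match a pattern. The detection method used by pii-scan-prod is not described in this document; the regular expressions, the `min_ratio` threshold, and the function `detect_pii` are assumptions for illustration, and only two of the listed entity types are covered.

```python
import re
from typing import Dict, Iterable, List

# Illustrative patterns only; a production scanner would need far more care.
OCTET = r"(?:25[0-5]|2[0-4]\d|1?\d?\d)"
PATTERNS = {
    "EMAIL_ADDRESS": re.compile(r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$"),
    "IP_ADDRESS": re.compile(rf"^(?:{OCTET}\.){{3}}{OCTET}$"),
}


def detect_pii(
    columns: Dict[str, Iterable[str]], min_ratio: float = 0.5
) -> Dict[str, List[str]]:
    """Map each column to the entity types matched by >= min_ratio of its values."""
    findings: Dict[str, List[str]] = {}
    for column, values in columns.items():
        sample = [v for v in values if v]  # ignore empty values
        if not sample:
            continue
        matched = [
            entity
            for entity, pattern in PATTERNS.items()
            if sum(bool(pattern.match(v)) for v in sample) / len(sample) >= min_ratio
        ]
        if matched:
            findings[column] = matched
    return findings


# Hypothetical sampled values from one table.
sample_columns = {
    "contact": ["a@example.com", "b@example.org", "n/a"],
    "client_ip": ["10.0.0.7", "192.168.1.20"],
    "amount": ["12.50", "7.00"],
}
print(detect_pii(sample_columns))
# prints {'contact': ['EMAIL_ADDRESS'], 'client_ip': ['IP_ADDRESS']}
```

Scoring by the share of matching values, rather than by a single hit, keeps one stray email address in a free-text column from marking the whole column as PII.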
Go Deeper
- Catalog — the Cogrion UI view of Datahub metadata
- Column Tagging — how tags flow from the Catalog UI to Ranger
- Data Access Management — creating policies using Datahub tags