Open Source Components | Platform Documentation

📄️DataHub

DataHub is the metadata catalog deployed into each tenant cluster. It stores schemas, lineage, ownership, and column-level tags for all data assets in the workspace. The Cogrion Catalog UI reads from and writes to DataHub via the BFF API.

📄️Ranger

Apache Ranger is the policy enforcement engine deployed into each tenant cluster. Every query that reaches Trino is checked against Ranger before execution — access is denied by default if no matching policy exists.
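
The deny-by-default behavior can be illustrated with a small, self-contained sketch. This is a toy model in plain Python, not Ranger's API or policy format: the `Policy` class, the `is_allowed` function, the `analyst` user, and the `lake.sales.*` resource pattern are all hypothetical names chosen for illustration. Real Ranger policies carry more detail, such as groups, roles, and explicit deny rules.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass(frozen=True)
class Policy:
    """Hypothetical, simplified allow policy (not Ranger's real schema)."""
    user: str              # principal the policy applies to
    resource: str          # glob pattern over "catalog.schema.table"
    actions: frozenset     # permitted actions, e.g. {"select"}


def is_allowed(policies, user, resource, action):
    """Return True only if some policy explicitly grants the action."""
    for policy in policies:
        if (
            policy.user == user
            and fnmatchcase(resource, policy.resource)
            and action in policy.actions
        ):
            return True
    # No matching policy was found, so access is denied by default.
    return False


# Illustrative policy set: one user may read tables in one schema.
policies = [Policy("analyst", "lake.sales.*", frozenset({"select"}))]

print(is_allowed(policies, "analyst", "lake.sales.orders", "select"))
print(is_allowed(policies, "analyst", "lake.hr.salaries", "select"))
```

The first call matches the only policy and is permitted. The second has no matching policy, so the function falls through to the default denial.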

📄️Hive Metastore

The Hive Metastore (HMS) is the shared table catalog for the tenant cluster. It is the authoritative store of schema definitions — databases, tables, and partitions — and is used by both Trino and Spark as their catalog backend.

📄️Trino

Trino is the SQL query engine for the tenant cluster. It executes queries against Delta Lake tables stored in S3, using Hive Metastore for table definitions, and enforces access policies through the Ranger plugin at query time.

📄️Superset

Apache Superset is the SQL and dashboard layer for tenant users. It provides SQL Lab for interactive queries, a chart builder, and shareable dashboards. All queries are routed through Trino and are subject to the same Ranger access policies.

📄️Airflow

Apache Airflow is the workflow orchestration engine. Data engineers author DAGs that define pipelines — ingestion, transformation, and ML training jobs — and Airflow schedules and executes them on Kubernetes.
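
To show what "a DAG that defines a pipeline" means in practice, here is a minimal sketch of dependency ordering using only the Python standard library. It deliberately does not use Airflow's own API. The task names (`ingest`, `transform`, `train_model`) and the `execution_order` helper are hypothetical, and a real DAG would also declare operators, schedules, and retries.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
pipeline = {
    "ingest": set(),
    "transform": {"ingest"},
    "train_model": {"transform"},
}


def execution_order(dag):
    """Return the tasks in an order that respects every dependency."""
    return list(TopologicalSorter(dag).static_order())


order = execution_order(pipeline)
print(order)
```

Because each task here depends on exactly one predecessor, only one valid order exists: ingestion, then transformation, then training. A scheduler such as Airflow applies the same idea, and can additionally run independent tasks in parallel.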

📄️JupyterHub

JupyterHub provides multi-user notebook servers for data scientists and ML engineers. Each user gets their own isolated notebook pod with configurable compute resources, S3 access, and Spark connectivity.

📄️MLflow

MLflow is the ML experiment tracking and model registry. Data scientists use it to log training runs, compare metrics, register models, and promote them through staging to production.