Airflow

Apache Airflow is the workflow orchestration engine. Data engineers author DAGs that define pipelines — ingestion, transformation, and ML training jobs — and Airflow schedules and executes them on Kubernetes.

Executor

Airflow uses the KubernetesExecutor. Each task runs in its own Kubernetes pod, which is created on demand and deleted when the task completes. There are no persistent workers, so task compute scales to zero between runs.

DAG Delivery

DAGs are loaded via gitSync, a sidecar container that polls a configured Git repository at a regular interval and syncs DAG files into the volume the Airflow scheduler reads. Deploying a DAG is therefore a git push, with no manual file copying.

| Config | Description |
| --- | --- |
| dag_git_repo | Git repo URL |
| dag_git_branch | Branch to track |
| dag_git_sub_path | Subpath within the repo (optional) |
| dag_git_ssh_key | Base64-encoded SSH private key for private repos |
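
As a minimal sketch of preparing a value for dag_git_ssh_key, the helper below Base64-encodes a private key into a single-line string and decodes it again for a sanity check. The function names are hypothetical, and the sketch assumes the platform expects standard, unwrapped Base64 of the whole key file, which the table above does not spell out.

```python
import base64


def encode_ssh_key(private_key_pem: bytes) -> str:
    """Return the key as a single-line Base64 string (hypothetical helper)."""
    # b64encode emits no line breaks, so the result fits on one config line.
    return base64.b64encode(private_key_pem).decode("ascii")


def decode_ssh_key(encoded: str) -> bytes:
    """Inverse of encode_ssh_key, useful for checking a value before deploying."""
    # validate=True rejects characters outside the Base64 alphabet.
    return base64.b64decode(encoded, validate=True)


if __name__ == "__main__":
    # Placeholder content only; a real key would be read from a file on disk.
    sample = (
        b"-----BEGIN OPENSSH PRIVATE KEY-----\n"
        b"(placeholder)\n"
        b"-----END OPENSSH PRIVATE KEY-----\n"
    )
    encoded = encode_ssh_key(sample)
    assert decode_ssh_key(encoded) == sample
    print(encoded)
```

Round-tripping the value before setting it is a cheap way to catch a truncated or line-wrapped key, which would otherwise surface only as a failed sync from the private repo.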

Authentication

Airflow uses Keycloak OIDC. Realm roles are mapped to Airflow roles:

| Keycloak Realm Role | Airflow Role |
| --- | --- |
| platform_admin | Admin |
| data_engineer | Op (via workflow_editor) |
| workflow_viewer | Viewer |
| workflow_admin | Admin |
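
The mapping above can be sketched in Python as a dictionary shaped like Flask AppBuilder's AUTH_ROLES_MAPPING setting, plus a small resolver that mimics how a user's realm roles could turn into Airflow roles. This is an illustration, not the platform's actual configuration: it assumes data_engineer is a composite realm role that includes workflow_editor, and that workflow_editor itself maps to Op. COMPOSITE_ROLES and resolve_airflow_roles are hypothetical names.

```python
# Shaped like Flask AppBuilder's AUTH_ROLES_MAPPING (provider role -> Airflow roles).
AUTH_ROLES_MAPPING = {
    "platform_admin": ["Admin"],
    "workflow_admin": ["Admin"],
    "workflow_editor": ["Op"],
    "workflow_viewer": ["Viewer"],
}

# Assumption: data_engineer is a composite realm role that includes workflow_editor.
COMPOSITE_ROLES = {"data_engineer": ["workflow_editor"]}


def resolve_airflow_roles(realm_roles):
    """Return the set of Airflow roles implied by a user's Keycloak realm roles."""
    expanded = set(realm_roles)
    for role in realm_roles:
        # Expand composite roles one level deep.
        expanded.update(COMPOSITE_ROLES.get(role, []))

    airflow_roles = set()
    for role in expanded:
        # Realm roles without a mapping grant nothing.
        airflow_roles.update(AUTH_ROLES_MAPPING.get(role, []))
    return airflow_roles


if __name__ == "__main__":
    print(resolve_airflow_roles(["data_engineer"]))
```

In a real deployment Keycloak would normally expand composite roles in the token itself, so the explicit expansion step here only makes the "via workflow_editor" relationship visible.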

Storage

| Store | Purpose |
| --- | --- |
| PostgreSQL (KubeBlocks) | Airflow metadata: DAG state, task history, connections, variables |
| S3 log bucket | Remote task logs, stored in S3 so they persist after a pod is destroyed |

Airflow runs with an IRSA (IAM Roles for Service Accounts) role that grants read/write access to its S3 log bucket and read access to AWS Secrets Manager (for Datahub and other service connections seeded at deploy time).
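
For orientation, remote task logging in Airflow is controlled by settings in its [logging] section, which can be supplied as environment variables. The sketch below builds those variables for an S3 bucket. The bucket name, the "logs" prefix, and the helper name are hypothetical, and whether this platform sets the values through environment variables or another mechanism is an assumption.

```python
def remote_logging_env(bucket: str, prefix: str = "logs") -> dict:
    """Build Airflow environment variables that enable S3 remote logging.

    The bucket and prefix are placeholders; credentials are assumed to come
    from the pod's IRSA role rather than from static keys.
    """
    if not bucket:
        raise ValueError("bucket must be non-empty")
    return {
        # [logging] remote_logging
        "AIRFLOW__LOGGING__REMOTE_LOGGING": "True",
        # [logging] remote_base_log_folder
        "AIRFLOW__LOGGING__REMOTE_BASE_LOG_FOLDER": f"s3://{bucket}/{prefix.strip('/')}",
    }


if __name__ == "__main__":
    for name, value in remote_logging_env("example-airflow-logs").items():
        print(f"{name}={value}")
```

Because each task pod is deleted on completion, these settings are what let the Airflow UI show logs for finished tasks.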

Datahub Integration

Airflow connects to Datahub via a datahub_rest_default connection seeded into AWS Secrets Manager at Datahub deploy time. When the Datahub Airflow plugin is installed, DAG runs automatically emit table lineage events to Datahub GMS — making lineage visible in the Cogrion Catalog without any manual step.
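
To make the seeded connection concrete, the sketch below builds a secret name and JSON payload in the shape Airflow's AWS Secrets Manager backend can read. Several details are assumptions rather than facts from this page: the "airflow/connections" prefix is that backend's default and may be configured differently here, the "datahub-rest" connection type follows the Datahub plugin's documented convention and should be checked against the installed version, and the GMS URL and helper name are placeholders.

```python
import json


def datahub_connection_secret(gms_url, token=None, connections_prefix="airflow/connections"):
    """Return (secret_name, secret_string) for the datahub_rest_default connection.

    Hypothetical helper: the prefix and payload keys are assumptions about how
    the connection is stored, not a description of the deploy tooling.
    """
    name = f"{connections_prefix}/datahub_rest_default"
    payload = {"conn_type": "datahub-rest", "host": gms_url}
    if token:
        # An access token, if GMS requires one, travels in the password field.
        payload["password"] = token
    return name, json.dumps(payload, sort_keys=True)


if __name__ == "__main__":
    # Placeholder URL for illustration only.
    secret_name, secret_string = datahub_connection_secret("http://datahub-gms:8080")
    print(secret_name)
    print(secret_string)
```

Storing the connection in Secrets Manager rather than in the metadata database means Airflow only needs the read access its IRSA role already grants.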
