
Regional Deployment — Architecture Overview

Cogrion runs as a multi-region SaaS control plane using a BYOC (Bring Your Own Cloud) model. Each region is a fully self-contained deployment. Client workspaces run on their own AWS infrastructure, connected to the nearest Cogrion region via DNS delegation and cross-account IAM.

This document covers domain conventions, environment structure, component layout, DNS and TLS ownership, database strategy, and workspace provisioning. For how the infrastructure is provisioned see Infrastructure (Terraform). For how workloads are deployed and promoted see GitOps (ArgoCD + Helm).


Environments

| Environment | Purpose | Cluster |
| --- | --- | --- |
| dev | Shared development cluster, used by all engineers | One cluster, sgp-1 only |
| prod | Production, multi-region | sgp-1 now, frankfurt-1 when needed |

Local development runs against a local stack (Docker Compose) or points at the shared dev cluster for integration testing. No per-engineer cloud clusters.


Region Naming Convention

Regions use geographic city names, not cloud provider codes.

| Region ID | Location | Status |
| --- | --- | --- |
| sgp-1 | Singapore 1 | Live |
| frankfurt-1 | Frankfurt 1 | Add when EU tenant requires it |

Pattern: {city}-{index} — leaves room for a second Singapore cluster (sgp-2) without changing the convention.
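
As an illustration of the pattern, a region ID can be validated with a small parser. This is a minimal sketch: the function name, the lowercase-only city label, and the rule that indexes start at 1 are assumptions, not part of the platform.

```typescript
// Hypothetical helper: parses a region ID of the form {city}-{index}.
interface RegionId {
  city: string;
  index: number;
}

function parseRegionId(id: string): RegionId | null {
  // Assumption: city labels are lowercase letters and indexes start at 1.
  const match = /^([a-z]+)-([1-9][0-9]*)$/.exec(id);
  if (match === null) {
    return null;
  }
  return { city: match[1], index: Number(match[2]) };
}
```

Under these assumptions, sgp-1 and sgp-2 both parse, while a cloud provider code such as ap-southeast-1 is rejected.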


Domain Structure

URLs in the platform have two distinct tiers: global (stable, single entry points for clients) and regional (where compute actually runs). Only global URLs appear in client-facing documentation.

Tiers

.
├── Global: served from Cloudflare, never change as regions are added
│   ├── app.cogrion.com            ← Dashboard UI (Cloudflare Pages)
│   ├── auth.cogrion.com           ← Auth entry point (CF Worker → regional Keycloak)
│   └── cplane.cogrion.com/lookup  ← Tenant → region lookup (CF Worker + KV)
│
└── Regional: one set per environment per region
    ├── {service}.{region}.cogrion.com                ← Cogrion-owned services (prod)
    ├── {service}.{env}.{region}.cogrion.com          ← Cogrion-owned services (non-prod)
    ├── {service}.w-{id}.{region}.cogrion.com         ← Per-workspace services (prod)
    └── {service}.w-{id}.{env}.{region}.cogrion.com   ← Per-workspace services (non-prod)
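
The regional patterns above can be expressed as two small helpers. This is a sketch only; the function and type names are hypothetical, and it assumes prod is the one environment that omits the environment label.

```typescript
type Env = "dev" | "prod";

// Prod omits the environment label; other environments insert it before the region.
function envRegionSuffix(env: Env, region: string): string {
  return env === "prod" ? `${region}.cogrion.com` : `${env}.${region}.cogrion.com`;
}

// Cogrion-owned services: {service}.{region} or {service}.{env}.{region}
function regionalHost(service: string, env: Env, region: string): string {
  return `${service}.${envRegionSuffix(env, region)}`;
}

// Per-workspace services: {service}.w-{id}.{region} or {service}.w-{id}.{env}.{region}
function workspaceHost(service: string, workspaceId: string, env: Env, region: string): string {
  return `${service}.w-${workspaceId}.${envRegionSuffix(env, region)}`;
}
```

For example, regionalHost("cplane", "prod", "sgp-1") yields cplane.sgp-1.cogrion.com, and workspaceHost("bff", "xxxx", "dev", "sgp-1") yields bff.w-xxxx.dev.sgp-1.cogrion.com.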

Global URLs

These are the only URLs that appear in client-facing docs. They never change regardless of how many regions are added.

| Service | Dev | Prod | Served by |
| --- | --- | --- | --- |
| Dashboard UI | app.dev.cogrion.com | app.cogrion.com | Cloudflare Pages |
| Auth (Keycloak entry) | auth.dev.cogrion.com | auth.cogrion.com | CF Worker → regional |
| Tenant lookup | cplane.cogrion.com/lookup | cplane.cogrion.com/lookup | CF Worker + KV |

auth.cogrion.com is a Cloudflare Worker that proxies to the correct regional Keycloak (auth.sgp-1.cogrion.com) based on a tenant→region lookup. The UI's OIDC config always points here — it never needs to know which region it's talking to.

After login the UI reads the cogrion_region claim from the JWT and constructs the regional API base URL itself (https://cplane.{region}.cogrion.com). No extra network call needed.
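
A minimal sketch of that step, assuming the JWT payload has already been decoded into an object by the OIDC client. The function name and the format check on the claim are illustrative assumptions.

```typescript
// Decoded JWT payload; only the claim used here is modelled.
interface CogrionClaims {
  cogrion_region?: unknown;
}

// Builds the regional API base URL from the cogrion_region claim (prod form).
function regionalApiBaseUrl(claims: CogrionClaims): string {
  const region = claims.cogrion_region;
  // Assumption: reject anything that does not look like {city}-{index},
  // so a malformed claim cannot redirect API calls to an arbitrary host.
  if (typeof region !== "string" || !/^[a-z]+-[0-9]+$/.test(region)) {
    throw new Error("JWT is missing a valid cogrion_region claim");
  }
  return `https://cplane.${region}.cogrion.com`;
}
```

With a claim of sgp-1 this returns https://cplane.sgp-1.cogrion.com, which matches the regional control plane URL listed in the next section.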

Regional URLs (Cogrion cluster)

Internal URLs used by the UI at runtime and by platform operators. Not in client-facing docs.

| Service | Dev | Prod (sgp-1) |
| --- | --- | --- |
| Control plane API | cplane.dev.sgp-1.cogrion.com | cplane.sgp-1.cogrion.com |
| Keycloak (direct) | auth.dev.sgp-1.cogrion.com | auth.sgp-1.cogrion.com |
| Temporal UI | temporal.dev.sgp-1.cogrion.com | temporal.sgp-1.cogrion.com |
| Grafana | grafana.dev.sgp-1.cogrion.com | grafana.sgp-1.cogrion.com |

Per-workspace (client) URLs

Client workspaces get a delegated subdomain zone. Cogrion owns the NS record; the client's Route53 manages everything beneath it.

| Service | Dev | Prod (sgp-1) |
| --- | --- | --- |
| BFF | bff.w-xxxx.dev.sgp-1.cogrion.com | bff.w-xxxx.sgp-1.cogrion.com |
| App (if hosted) | app.w-xxxx.dev.sgp-1.cogrion.com | app.w-xxxx.sgp-1.cogrion.com |
| Delegated zone root | *.w-xxxx.dev.sgp-1.cogrion.com | *.w-xxxx.sgp-1.cogrion.com |

Component Map

Cogrion cluster (per region, per environment)

Every component below runs inside a single Kubernetes cluster. Dev and prod are separate clusters; they do not share any resources.

Cluster: sgp-1 (prod)
├── Control plane app
├── Keycloak
├── Temporal (server + workers)
├── OpenBao (PKI backend — cluster agent mTLS certificates)
├── Observability stack
│ ├── Prometheus
│ ├── Grafana
│ └── Loki (or CloudWatch exporter)
├── Ingress (AWS ALB via aws-load-balancer-controller)
└── cert-manager (TLS for *.sgp-1.cogrion.com)

Client cluster (per workspace, BYOC)

Each client runs its own AWS account with its own EKS cluster. Cogrion provisions the DNS delegation and ACM certificate via cross-account IAM at onboarding time. Everything else in the cluster is owned and operated by the client.

Client cluster: acme (their AWS account)
├── BFF (deployed via Cogrion Helm chart)
├── Client application
├── Client database (RDS or otherwise)
├── ALB (TLS termination, ACM cert)
└── Route53 (delegated zone: *.w-acme.sgp-1.cogrion.com)

Shared global layer (not a cluster)

| Component | Where | Purpose |
| --- | --- | --- |
| Cloudflare DNS | cogrion.com zone | Parent DNS, NS delegation records |
| Cloudflare Pages | app.cogrion.com | Dashboard UI (static bundle, global CDN) |
| Cloudflare Worker (auth) | auth.cogrion.com | Proxies login traffic to the correct regional Keycloak |
| Cloudflare Worker + KV | cplane.cogrion.com/lookup | Tenant → region routing lookup |
| S3 buckets | Per region, per env | Artifacts, agent outputs, exports |

DNS Ownership Boundaries

cogrion.com                                   ← Cloudflare (Cogrion owns)
├── app.cogrion.com                           ← Cloudflare Pages: Dashboard UI
├── auth.cogrion.com                          ← Cloudflare Worker: proxies to regional Keycloak
├── cplane.cogrion.com                        ← Cloudflare Worker: tenant lookup (/lookup)
└── sgp-1.cogrion.com                         ← Route53 zone (Cogrion owns)
    ├── cplane.sgp-1.cogrion.com              ← Route53 → ALB (Cogrion cluster)
    ├── auth.sgp-1.cogrion.com                ← Route53 → ALB (Cogrion cluster)
    ├── temporal.sgp-1.cogrion.com            ← Route53 → ALB (Cogrion cluster)
    ├── grafana.sgp-1.cogrion.com             ← Route53 → ALB (Cogrion cluster)
    └── *.w-xxxx.sgp-1.cogrion.com            ← Route53: NS delegated to client Route53
        ├── bff.w-xxxx.sgp-1.cogrion.com      ← Client Route53 → Client ALB → Client EKS
        └── app.w-xxxx.sgp-1.cogrion.com      ← Client Route53 → Client ALB → Client EKS

TLS ownership:

  • app.cogrion.com and auth.cogrion.com: managed by Cloudflare automatically.
  • *.sgp-1.cogrion.com: cert-manager in the Cogrion cluster, Let's Encrypt or ACM.
  • *.w-xxxx.sgp-1.cogrion.com: ACM in the client account, DNS-validated via the client's Route53 and auto-renewed by AWS.


Database Strategy

Per region, fully independent

Each regional cluster has its own database. No cross-region replication. Tenant data never leaves the region it was provisioned in.

sgp-1 DB (RDS)
├── tenants ← global provisioning metadata (all regions)
├── principals ← global user identity (all regions)
├── workspaces ← sgp-1 tenants only
└── workspace_data ← sgp-1 tenant data only

frankfurt-1 DB (RDS, when live)
├── workspaces ← frankfurt-1 tenants only
└── workspace_data ← frankfurt-1 tenant data only
(no tenants — calls sgp-1 control plane for provisioning metadata)
(no principals — calls sgp-1 control plane for principal lookup)

Why tenants table stays in sgp-1

The tenants table is provisioning metadata only — workspace ID, assigned region, delegated zone, status. It is not tenant business data. Keeping it in the primary region (sgp-1) avoids distributed state until there is a real operational reason to split it.

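When frankfurt-1 is live, its control plane calls back to sgp-1 for tenant metadata only during provisioning operations (low frequency, not in the data path). Principal lookups are a separate case and do sit on the request path; see the next section.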

Why principals table stays in sgp-1

The principals table tracks user identity — email, external Keycloak ID, system roles. Every authenticated request resolves a principal by email against the local database. If principals were per-region, the same user would get a different uid in each region they have access to, breaking cross-region audit trails and making IAM policy reasoning inconsistent.

Keeping principals in sgp-1 alongside tenants gives a single source of identity truth. Regional control planes call back to sgp-1 to resolve a principal by email on first encounter, then cache the result locally for the lifetime of the request. This call is on the authenticated request path, so it must be fast — sgp-1 must be treated as a hard dependency for auth in all regions.

Accepted limitation: if sgp-1 is unavailable, authenticated requests to all other regional control planes will fail. This is the same limitation that already applies to provisioning operations.

S3 buckets

One set of buckets per region per environment. Buckets are never shared across regions.

cogrion-dev-sgp-1-artifacts
cogrion-dev-sgp-1-exports
cogrion-prod-sgp-1-artifacts
cogrion-prod-sgp-1-exports

Naming convention: cogrion-{env}-{region}-{purpose}
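
The convention is simple enough to generate rather than type by hand. A small sketch, with hypothetical function names and only the two purposes listed above:

```typescript
// The two bucket purposes that appear in the list above.
const BUCKET_PURPOSES = ["artifacts", "exports"] as const;

// cogrion-{env}-{region}-{purpose}
function bucketName(env: string, region: string, purpose: string): string {
  return `cogrion-${env}-${region}-${purpose}`;
}

// All buckets for one environment in one region.
function regionBuckets(env: string, region: string): string[] {
  return BUCKET_PURPOSES.map((purpose) => bucketName(env, region, purpose));
}
```

Calling regionBuckets("dev", "sgp-1") reproduces the first two names in the list.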


Authentication — Multi-Region Principal Resolution

Every authenticated request to the control plane runs through authenticationMiddleware, which must produce a req.principal before the route handler runs. req.principal contains two distinct pieces of information that come from different places:

  • Identity (uid, email, kind, externalId, system_roles): lives in the principals table in sgp-1.
  • Account memberships (which accounts the principal belongs to and with what roles): live in the account_members table in the local regional database, because accounts are per-region.

Same-region flow (sgp-1)

No cross-region call. Identity and account memberships are both in the local database.

Cross-region flow (frankfurt-1)

frankfurt-1's control plane has no principals table. On every authenticated request it calls back to sgp-1 to resolve identity, then queries its own database for local account memberships.

What needs to be built

Two changes are required when frankfurt-1 goes live:

1. New internal endpoint on the control plane

GET /internal/principals?email={email}
  • Returns { uid, externalId, kind, systemRoles } — identity only, no account memberships.
  • Auth: Authorization: Bearer {token} checked against INTER_REGION_SERVICE_TOKEN (shared secret injected via Helm values, not a user JWT).
  • Not exposed via the public ingress — only reachable within the Cogrion cluster network or over a private VPC peering link.
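
A framework-agnostic sketch of how such a handler could behave. The status codes, error bodies, and the findByEmail dependency are assumptions for illustration; only the response fields and the bearer-token check come from the description above.

```typescript
interface PrincipalIdentity {
  uid: string;
  externalId: string;
  kind: string;
  systemRoles: string[];
}

interface LookupResult {
  status: number;
  body: PrincipalIdentity | { error: string };
}

// Hypothetical data-access function backed by the principals table.
type FindByEmail = (email: string) => Promise<PrincipalIdentity | null>;

async function handlePrincipalLookup(
  authorizationHeader: string | undefined,
  email: string | undefined,
  serviceToken: string, // value of INTER_REGION_SERVICE_TOKEN
  findByEmail: FindByEmail,
): Promise<LookupResult> {
  // Plain comparison keeps the sketch short; a real handler would likely
  // prefer a constant-time comparison for the shared secret.
  if (serviceToken === "" || authorizationHeader !== `Bearer ${serviceToken}`) {
    return { status: 401, body: { error: "unauthorized" } };
  }
  if (email === undefined || email === "") {
    return { status: 400, body: { error: "email is required" } };
  }
  const principal = await findByEmail(email);
  if (principal === null) {
    return { status: 404, body: { error: "principal not found" } };
  }
  // Identity only: account memberships are deliberately not included.
  const { uid, externalId, kind, systemRoles } = principal;
  return { status: 200, body: { uid, externalId, kind, systemRoles } };
}
```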

2. Auth middleware split for non-primary regions

The middleware detects which mode it is in via PRIMARY_CPLANE_API_URL:

| Env var | Value on sgp-1 | Value on frankfurt-1 |
| --- | --- | --- |
| PRIMARY_CPLANE_API_URL | (unset) | https://cplane.sgp-1.cogrion.com |
| INTER_REGION_SERVICE_TOKEN | <shared secret> | <same shared secret> |

When PRIMARY_CPLANE_API_URL is set, the middleware:

  1. Calls GET {PRIMARY_CPLANE_API_URL}/internal/principals?email={email} to get identity + system roles.
  2. Queries the local DB for account memberships by principalUid.
  3. Merges both into req.principal — same shape as today, same downstream code, no route handler changes needed.
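
The three steps can be sketched as a single resolver with its I/O injected. All type and function names here are hypothetical, and the exact shape of req.principal is assumed rather than taken from the codebase.

```typescript
interface Identity {
  uid: string;
  externalId: string;
  kind: string;
  systemRoles: string[];
}

interface Membership {
  accountId: string;
  roles: string[];
}

interface Principal extends Identity {
  email: string;
  memberships: Membership[];
}

interface ResolverDeps {
  primaryCplaneApiUrl: string | undefined; // PRIMARY_CPLANE_API_URL
  // Assumed to perform the GET with the INTER_REGION_SERVICE_TOKEN bearer header.
  fetchRemoteIdentity: (url: string) => Promise<Identity>;
  findLocalIdentity: (email: string) => Promise<Identity>;
  findLocalMemberships: (principalUid: string) => Promise<Membership[]>;
}

async function resolvePrincipal(email: string, deps: ResolverDeps): Promise<Principal> {
  // Step 1: identity comes from sgp-1 when a primary URL is configured,
  // otherwise from the local principals table.
  const identity = deps.primaryCplaneApiUrl
    ? await deps.fetchRemoteIdentity(
        `${deps.primaryCplaneApiUrl}/internal/principals?email=${encodeURIComponent(email)}`,
      )
    : await deps.findLocalIdentity(email);
  // Step 2: account memberships always come from the local regional database.
  const memberships = await deps.findLocalMemberships(identity.uid);
  // Step 3: merge into one object so route handlers see the same shape in every region.
  return { ...identity, email, memberships };
}
```

Injecting the lookups keeps the branch on PRIMARY_CPLANE_API_URL testable without a network or database, which is one way to keep the same middleware code path in both regions.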

JWT verification is already cross-region

The auth middleware fetches JWKS dynamically from {iss}/protocol/openid-connect/certs — the iss claim in the JWT always points to the Keycloak that issued it. A frankfurt-1 JWT has iss: auth.frankfurt-1.cogrion.com; the middleware fetches frankfurt-1's JWKS and verifies there. No config change needed for JWT verification when a new region is added.
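
Because the JWKS location is derived from a claim inside the token, it is worth checking that the issuer is a Cogrion Keycloak host before fetching anything. The sketch below assumes iss is a full https URL, possibly with a realm path, and the allow-list pattern is an illustrative assumption rather than the existing middleware's behaviour.

```typescript
// Assumption: trusted issuers are https://auth.{region}.cogrion.com or
// https://auth.{env}.{region}.cogrion.com, optionally followed by a path.
const TRUSTED_ISSUER = /^https:\/\/auth\.(?:[a-z]+\.)?[a-z]+-[0-9]+\.cogrion\.com(?:\/.*)?$/;

// Derives the JWKS URL from the iss claim, refusing issuers outside the pattern.
function jwksUrlForIssuer(iss: string): string {
  if (!TRUSTED_ISSUER.test(iss)) {
    throw new Error(`untrusted issuer: ${iss}`);
  }
  // Strip trailing slashes so the joined URL has exactly one separator.
  return `${iss.replace(/\/+$/, "")}/protocol/openid-connect/certs`;
}
```

A pattern-based check keeps the "no config change for a new region" property, since any host of the form auth.{region}.cogrion.com is accepted without listing regions explicitly.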


Cluster Agent Authentication (mTLS)

Cluster agents — processes running inside client Kubernetes clusters — authenticate to the control plane using mutual TLS, not JWTs. OpenBao's PKI backend issues the agent certificates. The control plane ingress uses a dedicated ALB (cplane-alb) with mTLS passthrough mode to support this.

This is a separate authentication path from the Keycloak JWT flow above and applies only to machine-to-machine calls from cluster agents.

TODO: link to detailed mTLS / OpenBao PKI doc


Replicated vs Shared

| Component | Mode | Owner |
| --- | --- | --- |
| Dashboard UI | Shared (Cloudflare Pages) | Cogrion |
| Auth Worker (auth.cogrion.com) | Shared (Cloudflare Worker) | Cogrion |
| Tenant → region lookup | Shared (Cloudflare KV) | Cogrion |
| DNS zone (cogrion.com) | Shared (Cloudflare) | Cogrion |
| Control plane app | Replicated per region | Cogrion |
| Keycloak | Replicated per region | Cogrion |
| Temporal | Replicated per region | Cogrion |
| Database (RDS) | Replicated per region, independent | Cogrion |
| S3 buckets | Replicated per region | Cogrion |
| Observability stack | Replicated per region | Cogrion |
| ALB (Cogrion cluster) | Replicated per region | Cogrion |
| Client ALB | Per workspace | Client |
| Client Route53 zone | Per workspace (delegated) | Client (DNS parent: Cogrion) |
| Client ACM cert | Per workspace | Client (domain: Cogrion) |
| Client EKS / k8s | Per workspace | Client |
| BFF | Per workspace, on client cluster | Client (Helm chart: Cogrion) |

Cluster Configuration

Each cluster deployment is driven by a region+environment specific Helm values file. The application code reads environment variables only — it has no knowledge of which region it is in. The key variables injected into every pod are:

| Variable | Example (prod-sgp-1) | Purpose |
| --- | --- | --- |
| BASE_DOMAIN | cplane.sgp-1.cogrion.com | Control plane API / Keycloak redirect URIs |
| WORKSPACE_DOMAIN | sgp-1.cogrion.com | Suffix for per-workspace URLs and delegated zones |
| AUTH_DOMAIN | auth.sgp-1.cogrion.com | Keycloak endpoint |
| TEMPORAL_DOMAIN | temporal.sgp-1.cogrion.com | Temporal UI |
| REGION | sgp-1 | Region identifier |
| ENVIRONMENT | prod | Environment tag |

For the full values file structure, Helm chart layout, and how variables are wired through ArgoCD see GitOps (ArgoCD + Helm).
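
On the application side, reading these variables could look like the following sketch. The loader name and the fail-fast behaviour are assumptions; the environment is passed in as a plain object (process.env in Node) so the function stays testable.

```typescript
// The six variables from the table above.
const REQUIRED_VARS = [
  "BASE_DOMAIN",
  "WORKSPACE_DOMAIN",
  "AUTH_DOMAIN",
  "TEMPORAL_DOMAIN",
  "REGION",
  "ENVIRONMENT",
] as const;

type VarName = (typeof REQUIRED_VARS)[number];
type ClusterConfig = Record<VarName, string>;

// Reads the cluster configuration and fails fast if anything is missing or empty.
function loadClusterConfig(env: Record<string, string | undefined>): ClusterConfig {
  const missing = REQUIRED_VARS.filter((name) => !env[name]);
  if (missing.length > 0) {
    throw new Error(`missing required variables: ${missing.join(", ")}`);
  }
  const config = {} as ClusterConfig;
  for (const name of REQUIRED_VARS) {
    config[name] = env[name] as string;
  }
  return config;
}
```

Failing at startup rather than on first use means a values file with a missing entry surfaces as a crashed pod during rollout instead of a runtime error later.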


Workspace Provisioning Flow

When a new workspace is created, the control plane runs the following sequence via cross-account IAM:

1. Assume client IAM role (sts:AssumeRole)
2. Create Route53 hosted zone in client account
     zone: w-xxxx.{env.}sgp-1.cogrion.com
3. Write NS record in Cogrion Route53
     w-xxxx.{env.}sgp-1.cogrion.com → client nameservers
4. Request ACM certificate in client account
     domain: *.w-xxxx.{env.}sgp-1.cogrion.com
5. Write DNS validation CNAME in client Route53
6. Wait for ACM validation (async, ~2 min)
7. Write workspace record to tenants DB
     { workspace_id, region, delegated_zone, status: active }
8. Write entry to Cloudflare KV
     w-xxxx → { region: sgp-1, bff_url: bff.w-xxxx.sgp-1.cogrion.com }
9. Create Keycloak client in regional realm
     redirect_uris: https://*.w-xxxx.sgp-1.cogrion.com/*

{env.} is empty in prod and is the environment label followed by a dot (for example dev.) elsewhere. The NS record in step 3 is written at the zone name itself, since Route53 does not allow wildcard NS records.
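
Most of the values in this sequence derive from the workspace ID, environment, and region. A sketch of that derivation follows; the function and field names are hypothetical, and applying the environment label to the KV entry and redirect URI in non-prod is an assumption, since the examples above show only the prod form.

```typescript
interface ProvisioningNames {
  delegatedZone: string;        // steps 2, 3 and 7
  certificateDomain: string;    // step 4
  kvKey: string;                // step 8
  kvValue: { region: string; bff_url: string };
  keycloakRedirectUri: string;  // step 9
}

function provisioningNames(
  workspaceId: string,
  env: "dev" | "prod",
  region: string,
): ProvisioningNames {
  // {env.} is empty in prod and "dev." in the dev environment.
  const envLabel = env === "prod" ? "" : `${env}.`;
  const delegatedZone = `w-${workspaceId}.${envLabel}${region}.cogrion.com`;
  return {
    delegatedZone,
    certificateDomain: `*.${delegatedZone}`,
    kvKey: `w-${workspaceId}`,
    kvValue: { region, bff_url: `bff.${delegatedZone}` },
    keycloakRedirectUri: `https://*.${delegatedZone}/*`,
  };
}
```

Deriving every name from one function keeps the zone, certificate, KV entry, and redirect URI consistent with each other across the nine steps.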

BFF startup sequence (client cluster)

1. BFF pod starts
2. Pull config from control plane
     GET https://cplane.sgp-1.cogrion.com/api/workspaces/w-xxxx/config
     → Keycloak endpoints, feature flags, Temporal address, etc.
3. Register BFF with control plane
     POST https://cplane.sgp-1.cogrion.com/api/workspaces/w-xxxx/register
     { bff_url: "https://bff.w-xxxx.sgp-1.cogrion.com", public_ip: "..." }
4. Control plane updates workspace record, marks BFF as healthy

Adding a New Region

When a client requires EU data residency, spin up frankfurt-1. The high-level sequence is:

  1. Terraform: copy envs/prod-sgp-1/ to envs/prod-frankfurt-1/, update tfvars and backend.tf, run apply. This provisions VPC, EKS, RDS, Route53 zone, S3 buckets, and IAM roles. See Infrastructure (Terraform) → Adding a New Region.
  2. GitOps: copy argocd/apps/prod-sgp-1/ and values/prod-sgp-1/, update all domain and bucket values, bootstrap ArgoCD into the new cluster. See GitOps → Bootstrap: New Cluster.
  3. DNS: add NS records for frankfurt-1.cogrion.com in the Cloudflare cogrion.com zone, delegating to the new Route53 zone.
  4. Routing: assign new EU tenants to frankfurt-1 at signup (Cloudflare KV lookup).

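No Helm chart changes. The only application changes are the one-time principal-resolution additions described under "What needs to be built"; regions added after frankfurt-1 need none.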


Local Development

Engineers run the full stack locally via Docker Compose. The local stack does not use regional domains.

Local:
http://localhost:3000 control plane
http://localhost:8080 Keycloak
http://localhost:8088 Temporal UI
http://localhost:3001 BFF (mock workspace)

For integration testing against the shared dev cluster, point BASE_DOMAIN at cplane.dev.sgp-1.cogrion.com in your local .env. Do not create real workspace records against dev without coordinating with the team, because dev shares a single Keycloak realm and database.


Alternatives: Pure AWS Global Layer

The current design uses Cloudflare for three global components: DNS, the auth proxy worker, and the tenant→region KV lookup. This section documents AWS-native equivalents for each — relevant if Cloudflare is not available, not preferred, or needs to be exited incrementally. Each component can be replaced independently.

KV lookup (cplane.cogrion.com/lookup)

This is the highest-priority replacement target: it sits on the auth path and is written to on every workspace provision.

Option 1: API Gateway + Lambda + DynamoDB (recommended migration bridge)

Route53 alias → API Gateway → Lambda (lookup handler) → DynamoDB (tenant-region-map)
  • DynamoDB table: tenant_id (PK) → { region, bff_url }. Same shape as CF KV.
  • Write path: control plane writes to DynamoDB at provision time instead of CF KV (step 8 in the provisioning flow).
  • Read path: Lambda returns the same JSON shape the CF Worker returns today — no client changes needed.
  • Latency: ~2–5 ms DynamoDB read vs sub-millisecond CF KV edge read. Acceptable for a login-path lookup that is cached by the caller.
  • DNS: cplane.cogrion.com Route53 alias pointing to the API Gateway regional endpoint, or a CloudFront distribution in front of it for caching.
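
A sketch of the lookup handler's core logic, with the DynamoDB read hidden behind a small store interface so the example does not depend on the AWS SDK. The interface, status codes, and error bodies are assumptions, and the response is assumed to mirror the stored { region, bff_url } value.

```typescript
interface TenantRoute {
  region: string;
  bff_url: string;
}

// Hypothetical abstraction over the tenant-region-map table (or CF KV).
interface TenantRouteStore {
  get(tenantId: string): Promise<TenantRoute | undefined>;
}

interface LookupResponse {
  statusCode: number;
  body: string;
}

async function lookupTenant(
  tenantId: string | undefined,
  store: TenantRouteStore,
): Promise<LookupResponse> {
  if (!tenantId) {
    return { statusCode: 400, body: JSON.stringify({ error: "tenant id required" }) };
  }
  const route = await store.get(tenantId);
  if (route === undefined) {
    return { statusCode: 404, body: JSON.stringify({ error: "unknown tenant" }) };
  }
  // Return only the two documented fields, even if the stored item has more.
  return {
    statusCode: 200,
    body: JSON.stringify({ region: route.region, bff_url: route.bff_url }),
  };
}
```

Keeping the store behind an interface means the same handler logic can sit in front of CF KV during migration and DynamoDB afterwards.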

Option 2: CloudFront + Lambda@Edge + DynamoDB Global Tables

Closer to the CF Worker + KV model in behavior (edge compute + edge data). Use this if lookup latency is a hard constraint.

CloudFront (cplane.cogrion.com) → Lambda@Edge (origin-request) → DynamoDB Global Table
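  • DynamoDB Global Tables replicate the tenant-region-map table to each AWS region configured as a replica, so a Lambda@Edge invocation can read from a replica near the edge location that served the request.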
  • Adds operational complexity (Lambda@Edge cold starts, Global Table replication lag on writes).
  • Not needed unless you have measured lookup latency as a problem with Option 1.

Auth proxy (auth.cogrion.com)

The CF Worker only fires on OIDC login redirects — it is not on the per-request hot path. Latency here is less critical than for the KV lookup.

Option 1: API Gateway + Lambda (simplest)

Route53 alias → API Gateway → Lambda (proxy handler → auth.{region}.cogrion.com)
  • Lambda reads tenant→region from the same DynamoDB table, constructs the target Keycloak URL, and proxies the request with http-proxy or node-fetch.
  • Stateless, no edge deployment required.
  • Cold starts add ~100–200 ms to the first request in a login flow; provisioned concurrency removes this if needed.

Option 2: CloudFront + Lambda@Edge

CloudFront (auth.cogrion.com) → Lambda@Edge (origin-request) → regional Keycloak origin
  • Same lookup store as above. Lambda@Edge rewrites the origin host to auth.{region}.cogrion.com before CloudFront forwards.
  • Use this if you want consistent edge behavior for both auth. and cplane. subdomains, or if you have strict latency requirements for login redirects.

DNS and static UI

| Component | Cloudflare | AWS equivalent |
| --- | --- | --- |
| cogrion.com zone | Cloudflare DNS | Route53 hosted zone (transfer registrar NS) |
| app.cogrion.com | Cloudflare Pages | S3 + CloudFront distribution |
| NS delegation records | Cloudflare DNS records | Route53 records in cogrion.com zone |

DNS transfer is the last step — it must happen after all worker/page replacements are live, so the cutover is a single NS change at the registrar.

Incremental migration path

Phases 1–3 are independent and can be done in any order. Phase 4 (DNS transfer) must come last. All four are required to fully exit Cloudflare.

| Phase | Change | Cloudflare still needed? |
| --- | --- | --- |
| 1 | Deploy API GW + Lambda + DynamoDB for cplane.cogrion.com/lookup. Update provisioning code to write DynamoDB instead of CF KV. | Yes (DNS only) |
| 2 | Deploy API GW + Lambda for auth.cogrion.com. | Yes (DNS only) |
| 3 | Deploy S3 + CloudFront for app.cogrion.com. | Yes (DNS only) |
| 4 | Transfer cogrion.com zone to Route53. Update registrar NS. Decommission Cloudflare account. | No |

Phases 1–3 each leave Cloudflare managing DNS while replacing the compute. Phase 4 is a single registrar NS change; once propagated, Cloudflare is fully removed.

Trade-off summary

| Concern | Cloudflare (current) | Pure AWS |
| --- | --- | --- |
| KV lookup latency | Sub-ms, 300 PoPs | ~2–5 ms DynamoDB; lower network latency with Global Tables + Lambda@Edge, though each lookup is still a DynamoDB read |
| Auth proxy | CF Worker (edge) | API GW + Lambda or CloudFront + Lambda@Edge |
| Static UI | CF Pages (edge CDN, zero config) | S3 + CloudFront (more config, same capability) |
| DNS | Cloudflare DNS | Route53 (equivalent TTL propagation) |
| Vendor surface | Cloudflare for global layer | AWS-only; aligns with existing BYOC AWS posture |
| Ops complexity | Low: CF Workers/Pages are simple to deploy | Higher: Lambda@Edge, DynamoDB Global Tables, CloudFront configs |
| Cost | CF Workers free tier is generous at current scale | API GW + Lambda + DynamoDB costs scale with requests; low at current scale |
| Migration risk | None (current design) | Low per phase; Phase 4 (DNS transfer) is the only risky step |

Recommendation: If Cloudflare is acceptable, the current design is operationally simpler. Phase 1 (KV lookup → DynamoDB) is the best entry point if an incremental migration is required — it is the most isolated component, has no user-visible surface, and unblocks full AWS alignment. Defer Phase 4 (DNS transfer) until Phases 1–3 are stable in production.


Summary

  • One Helm chart, multiple values files — one per environment per region
  • Region is always in the domain: {service}.{env?}.{region}.cogrion.com
  • Client workspaces get a delegated zone: *.w-{id}.{env?}.{region}.cogrion.com
  • TLS for Cogrion services: cert-manager in cluster
  • TLS for client workspaces: ACM in client account, auto-renewed
  • Database is per region, never replicated — tenant data stays in its region
  • S3 buckets are per region per environment, never shared
  • The only shared global state: Cloudflare DNS zone + KV lookup (tenant → region), plus the tenants and principals tables held in sgp-1
  • Adding a region = new values file + new cluster + new Route53 zone, nothing else