Regional Deployment — Architecture Overview
Cogrion runs as a multi-region SaaS control plane using a BYOC (Bring Your Own Cloud) model. Each region is a fully self-contained deployment. Client workspaces run on their own AWS infrastructure, connected to the nearest Cogrion region via DNS delegation and cross-account IAM.
This document covers domain conventions, environment structure, component layout, DNS and TLS ownership, database strategy, and workspace provisioning. For how the infrastructure is provisioned see Infrastructure (Terraform). For how workloads are deployed and promoted see GitOps (ArgoCD + Helm).
Environments
| Environment | Purpose | Cluster |
|---|---|---|
| dev | Shared development cluster, used by all engineers | One cluster, sgp-1 only |
| prod | Production, multi-region | sgp-1 now, frankfurt-1 when needed |
Local development runs against a local stack (Docker Compose) or points at the shared dev cluster for integration testing. No per-engineer cloud clusters.
Region Naming Convention
Regions use geographic city names, not cloud provider codes.
| Region ID | Location | Status |
|---|---|---|
| sgp-1 | Singapore 1 | Live |
| frankfurt-1 | Frankfurt 1 | Add when an EU tenant requires it |
Pattern: {city}-{index} — leaves room for a second Singapore cluster (sgp-2) without changing the convention.
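The pattern can be checked mechanically. The following is a minimal sketch, assuming region IDs are a lowercase city label followed by a positive integer index; the helper names (`parse_region_id`, `next_region_id`) are hypothetical and not part of the platform.

```python
import re
from typing import NamedTuple

# Assumption: the city part is lowercase letters only, the index starts at 1.
_REGION_ID = re.compile(r"(?P<city>[a-z]+)-(?P<index>[1-9][0-9]*)")


class RegionId(NamedTuple):
    city: str
    index: int


def parse_region_id(value: str) -> RegionId:
    """Split a region ID such as 'sgp-1' into its city and index parts."""
    match = _REGION_ID.fullmatch(value)
    if match is None:
        raise ValueError(f"not a valid region ID: {value!r}")
    return RegionId(match["city"], int(match["index"]))


def next_region_id(existing: RegionId) -> str:
    """Return the ID a second cluster in the same city would get."""
    return f"{existing.city}-{existing.index + 1}"
```

A cloud provider code such as ap-southeast-1 does not match the pattern and is rejected, which is the point of the convention.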
Domain Structure
URLs in the platform have two distinct tiers: global (stable, single entry points for clients) and regional (where compute actually runs). Only global URLs appear in client-facing documentation.
Tiers
.
├── Global: served from Cloudflare, never change as regions are added
│   ├── app.cogrion.com              ← Dashboard UI (Cloudflare Pages)
│   ├── auth.cogrion.com             ← Auth entry point (CF Worker → regional Keycloak)
│   └── cplane.cogrion.com/lookup    ← Tenant → region lookup (CF Worker + KV)
│
└── Regional: one set per environment per region
    ├── {service}.{region}.cogrion.com                ← Cogrion-owned services (prod)
    ├── {service}.{env}.{region}.cogrion.com          ← Cogrion-owned services (non-prod)
    ├── {service}.w-{id}.{region}.cogrion.com         ← Per-workspace services (prod)
    └── {service}.w-{id}.{env}.{region}.cogrion.com   ← Per-workspace services (non-prod)
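The four regional patterns differ only in whether the workspace label and the environment label are present. A small sketch of that rule, assuming "prod" is the only environment that omits its label; `regional_host` is a hypothetical helper, not an existing function.

```python
from typing import Optional

ROOT_DOMAIN = "cogrion.com"


def regional_host(service: str, region: str, env: str = "prod",
                  workspace_id: Optional[str] = None) -> str:
    """Build a regional hostname following the tiers above.

    workspace_id is the bare id (the 'xxxx' in 'w-xxxx'); None means a
    Cogrion-owned service rather than a per-workspace one.
    """
    labels = [service]
    if workspace_id is not None:
        labels.append(f"w-{workspace_id}")
    if env != "prod":  # assumption: prod hostnames carry no env label
        labels.append(env)
    labels.extend([region, ROOT_DOMAIN])
    return ".".join(labels)
```

For example, `regional_host("bff", "sgp-1", env="dev", workspace_id="xxxx")` yields `bff.w-xxxx.dev.sgp-1.cogrion.com`.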
Global URLs
These are the only URLs that appear in client-facing docs. They never change regardless of how many regions are added.
| Service | Dev | Prod | Served by |
|---|---|---|---|
| Dashboard UI | app.dev.cogrion.com | app.cogrion.com | Cloudflare Pages |
| Auth (Keycloak entry) | auth.dev.cogrion.com | auth.cogrion.com | CF Worker → regional |
| Tenant lookup | cplane.cogrion.com/lookup | cplane.cogrion.com/lookup | CF Worker + KV |
auth.cogrion.com is a Cloudflare Worker that proxies to the correct regional Keycloak (auth.sgp-1.cogrion.com) based on a tenant→region lookup. The UI's OIDC config always points here — it never needs to know which region it's talking to.
After login the UI reads the cogrion_region claim from the JWT and constructs the regional API base URL itself (https://cplane.{region}.cogrion.com). No extra network call needed.
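As an illustration of that step, the sketch below reads the claim from a JWT payload and builds the base URL. It is written in Python purely for illustration (the UI would do the equivalent in the browser), it assumes the token has already been verified elsewhere, and the function names are hypothetical.

```python
import base64
import json


def region_from_jwt(token: str) -> str:
    """Return the cogrion_region claim from a JWT payload.

    This only decodes the payload; it does NOT verify the signature.
    """
    payload_b64 = token.split(".")[1]
    # JWT segments are base64url without padding; restore it before decoding.
    padded = payload_b64 + "=" * (-len(payload_b64) % 4)
    claims = json.loads(base64.urlsafe_b64decode(padded))
    return claims["cogrion_region"]


def api_base_url(region: str) -> str:
    """Regional control plane base URL, as described in the text."""
    return f"https://cplane.{region}.cogrion.com"
```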
Regional URLs (Cogrion cluster)
Internal URLs used by the UI at runtime and by platform operators. Not in client-facing docs.
| Service | Dev | Prod (sgp-1) |
|---|---|---|
| Control plane API | cplane.dev.sgp-1.cogrion.com | cplane.sgp-1.cogrion.com |
| Keycloak (direct) | auth.dev.sgp-1.cogrion.com | auth.sgp-1.cogrion.com |
| Temporal UI | temporal.dev.sgp-1.cogrion.com | temporal.sgp-1.cogrion.com |
| Grafana | grafana.dev.sgp-1.cogrion.com | grafana.sgp-1.cogrion.com |
Per-workspace (client) URLs
Client workspaces get a delegated subdomain zone. Cogrion owns the NS record; the client's Route53 manages everything beneath it.
| Service | Dev | Prod (sgp-1) |
|---|---|---|
| BFF | bff.w-xxxx.dev.sgp-1.cogrion.com | bff.w-xxxx.sgp-1.cogrion.com |
| App (if hosted) | app.w-xxxx.dev.sgp-1.cogrion.com | app.w-xxxx.sgp-1.cogrion.com |
| Delegated zone root | *.w-xxxx.dev.sgp-1.cogrion.com | *.w-xxxx.sgp-1.cogrion.com |
Component Map
Cogrion cluster (per region, per environment)
Every component below runs inside a single Kubernetes cluster. Dev and prod are separate clusters; they do not share any resources.
Cluster: sgp-1 (prod)
├── Control plane app
├── Keycloak
├── Temporal (server + workers)
├── OpenBao (PKI backend — cluster agent mTLS certificates)
├── Observability stack
│ ├── Prometheus
│ ├── Grafana
│ └── Loki (or CloudWatch exporter)
├── Ingress (AWS ALB via aws-load-balancer-controller)
└── cert-manager (TLS for *.sgp-1.cogrion.com)
Client cluster (per workspace, BYOC)
Client runs their own AWS account with their own EKS cluster. Cogrion provisions the DNS delegation and ACM cert via cross-account IAM at onboarding time. Everything else in the cluster is owned and operated by the client.
Client cluster: acme (their AWS account)
├── BFF (deployed via Cogrion Helm chart)
├── Client application
├── Client database (RDS or otherwise)
├── ALB (TLS termination, ACM cert)
└── Route53 (delegated zone: *.w-acme.sgp-1.cogrion.com)
Shared global layer (not a cluster)
| Component | Where | Purpose |
|---|---|---|
| Cloudflare DNS | cogrion.com zone | Parent DNS, NS delegation records |
| Cloudflare Pages | app.cogrion.com | Dashboard UI — static bundle, global CDN |
| Cloudflare Worker (auth) | auth.cogrion.com | Proxies login traffic to the correct regional Keycloak |
| Cloudflare Worker + KV | cplane.cogrion.com/lookup | Tenant → region routing lookup |
| S3 buckets | Per region, per env | Artifacts, agent outputs, exports |
DNS Ownership Boundaries
cogrion.com                                  ← Cloudflare (Cogrion owns)
├── app.cogrion.com                          ← Cloudflare Pages, Dashboard UI
├── auth.cogrion.com                         ← Cloudflare Worker, proxies to regional Keycloak
├── cplane.cogrion.com                       ← Cloudflare Worker, tenant lookup (/lookup)
└── sgp-1.cogrion.com                        ← Route53 zone (Cogrion owns)
    ├── cplane.sgp-1.cogrion.com             ← Route53 → ALB (Cogrion cluster)
    ├── auth.sgp-1.cogrion.com               ← Route53 → ALB (Cogrion cluster)
    ├── temporal.sgp-1.cogrion.com           ← Route53 → ALB (Cogrion cluster)
    ├── grafana.sgp-1.cogrion.com            ← Route53 → ALB (Cogrion cluster)
    └── *.w-xxxx.sgp-1.cogrion.com           ← Route53, NS delegated to client Route53
        ├── bff.w-xxxx.sgp-1.cogrion.com     ← Client Route53 → Client ALB → Client EKS
        └── app.w-xxxx.sgp-1.cogrion.com     ← Client Route53 → Client ALB → Client EKS
TLS for app.cogrion.com and auth.cogrion.com — managed by Cloudflare automatically.
TLS for *.sgp-1.cogrion.com — cert-manager in Cogrion cluster, Let's Encrypt or ACM.
TLS for *.w-xxxx.sgp-1.cogrion.com — ACM in client account, DNS validated via client Route53. Auto-renewed by AWS.
Database Strategy
Per region, fully independent
Each regional cluster has its own database. No cross-region replication. Tenant data never leaves the region it was provisioned in.
sgp-1 DB (RDS)
├── tenants ← global provisioning metadata (all regions)
├── principals ← global user identity (all regions)
├── workspaces ← sgp-1 tenants only
└── workspace_data ← sgp-1 tenant data only
frankfurt-1 DB (RDS, when live)
├── workspaces ← frankfurt-1 tenants only
└── workspace_data ← frankfurt-1 tenant data only
(no tenants — calls sgp-1 control plane for provisioning metadata)
(no principals — calls sgp-1 control plane for principal lookup)
Why tenants table stays in sgp-1
The tenants table is provisioning metadata only — workspace ID, assigned region, delegated zone, status. It is not tenant business data. Keeping it in the primary region (sgp-1) avoids distributed state until there is a real operational reason to split it.
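When frankfurt-1 is live, its control plane reads tenant metadata from sgp-1 only during provisioning operations, which are low frequency and not in the data path. Principal resolution is a separate, per-request dependency on sgp-1; see below.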
Why principals table stays in sgp-1
The principals table tracks user identity — email, external Keycloak ID, system roles. Every authenticated request resolves a principal by email against the local database. If principals were per-region, the same user would get a different uid in each region they have access to, breaking cross-region audit trails and making IAM policy reasoning inconsistent.
Keeping principals in sgp-1 alongside tenants gives a single source of identity truth. Regional control planes call back to sgp-1 to resolve a principal by email on each authenticated request and cache the result only for the lifetime of that request. Because this call is on the authenticated request path, it must be fast, and sgp-1 must be treated as a hard dependency for auth in all regions.
Accepted limitation: if sgp-1 is unavailable, authenticated requests to all other regional control planes will fail. This is the same limitation that already applies to provisioning operations.
S3 buckets
One set of buckets per region per environment. Buckets are never shared across regions.
cogrion-dev-sgp-1-artifacts
cogrion-dev-sgp-1-exports
cogrion-prod-sgp-1-artifacts
cogrion-prod-sgp-1-exports
Naming convention: cogrion-{env}-{region}-{purpose}
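A sketch of the convention as a helper, with a basic check against the S3 naming rules (3 to 63 characters; lowercase letters, digits and hyphens). `bucket_name` is a hypothetical function used only for illustration.

```python
import re

# S3 also permits dots in bucket names; this convention does not use them.
_BUCKET_NAME = re.compile(r"[a-z0-9][a-z0-9-]{1,61}[a-z0-9]")


def bucket_name(env: str, region: str, purpose: str) -> str:
    """Compose cogrion-{env}-{region}-{purpose} and validate the result."""
    name = f"cogrion-{env}-{region}-{purpose}"
    if _BUCKET_NAME.fullmatch(name) is None:
        raise ValueError(f"invalid S3 bucket name: {name!r}")
    return name
```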
Authentication — Multi-Region Principal Resolution
Every authenticated request to the control plane runs through authenticationMiddleware, which must produce a req.principal before the route handler runs. req.principal contains two distinct pieces of information that come from different places:
- Identity (uid, email, kind, externalId, system_roles): lives in the principals table in sgp-1.
- Account memberships (which accounts the principal belongs to, and with what roles): lives in the account_members table in the local regional database, because accounts are per-region.
Same-region flow (sgp-1)
No cross-region call. Identity and account memberships are both in the local database.
Cross-region flow (frankfurt-1)
frankfurt-1's control plane has no principals table. On every authenticated request it calls back to sgp-1 to resolve identity, then queries its own database for local account memberships.
What needs to be built
Two changes are required when frankfurt-1 goes live:
1. New internal endpoint on the control plane
GET /internal/principals?email={email}
- Returns { uid, externalId, kind, systemRoles }: identity only, no account memberships.
- Auth: Authorization: Bearer {token}, checked against INTER_REGION_SERVICE_TOKEN (a shared secret injected via Helm values, not a user JWT).
- Not exposed via the public ingress; only reachable within the Cogrion cluster network or over a private VPC peering link.
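One way the shared-secret check on this endpoint could look is sketched below. It assumes the token arrives in a standard Bearer header and uses a constant-time comparison; `is_authorized` is a hypothetical name, and the real control plane may implement this differently.

```python
import hmac
from typing import Optional


def is_authorized(authorization_header: Optional[str], expected_token: str) -> bool:
    """Check an 'Authorization: Bearer <token>' header against the shared secret."""
    if not authorization_header or not expected_token:
        # A missing header or an unset secret must never authorize a call.
        return False
    scheme, _, presented = authorization_header.partition(" ")
    if scheme.lower() != "bearer":
        return False
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(presented.encode(), expected_token.encode())
```

Rejecting the request when the expected token is empty guards against a deployment where INTER_REGION_SERVICE_TOKEN was never injected.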
2. Auth middleware split for non-primary regions
The middleware detects which mode it is in via PRIMARY_CPLANE_API_URL:
| Env var | Value on sgp-1 | Value on frankfurt-1 |
|---|---|---|
| PRIMARY_CPLANE_API_URL | (unset) | https://cplane.sgp-1.cogrion.com |
| INTER_REGION_SERVICE_TOKEN | <shared secret> | <same shared secret> |
When PRIMARY_CPLANE_API_URL is set, the middleware:
- Calls GET {PRIMARY_CPLANE_API_URL}/internal/principals?email={email} to get identity and system roles.
- Queries the local DB for account memberships by principalUid.
- Merges both into req.principal: same shape as today, same downstream code, no route handler changes needed.
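The mode switch and the merge can be sketched as a single function. This is an illustration under stated assumptions, not the actual middleware: the data access functions are injected as callables so the sketch stays self-contained, and the `Principal` shape and all function names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Principal:
    uid: str
    email: str
    kind: str
    external_id: str
    system_roles: List[str]
    account_roles: Dict[str, List[str]] = field(default_factory=dict)


def resolve_principal(
    email: str,
    primary_url: Optional[str],
    fetch_remote_identity: Callable[[str, str], dict],
    load_local_identity: Callable[[str], dict],
    load_memberships: Callable[[str], Dict[str, List[str]]],
) -> Principal:
    """Build the principal from identity plus local account memberships."""
    if primary_url:
        # Non-primary region: identity comes from the primary control plane.
        identity = fetch_remote_identity(primary_url, email)
    else:
        # Primary region: identity is in the local principals table.
        identity = load_local_identity(email)
    # Memberships are always local, because accounts are per-region.
    memberships = load_memberships(identity["uid"])
    return Principal(
        uid=identity["uid"],
        email=email,
        kind=identity["kind"],
        external_id=identity["externalId"],
        system_roles=identity["systemRoles"],
        account_roles=memberships,
    )
```

Because both branches return the same shape, route handlers do not need to know which mode produced the principal.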
JWT verification is already cross-region
The auth middleware fetches JWKS dynamically from {iss}/protocol/openid-connect/certs, where iss is the issuer claim in the JWT and always points to the Keycloak that issued it. A frankfurt-1 JWT has iss: auth.frankfurt-1.cogrion.com, so the middleware fetches frankfurt-1's JWKS and verifies the token's signature against those keys. No config change is needed for JWT verification when a new region is added.
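A sketch of deriving the JWKS URL from the issuer follows. Two things here are assumptions rather than part of the design above: the issuer is taken to be a full https URL (Keycloak issuers typically look like https://{host}/realms/{realm}), and the host is checked against an allowlist before any keys are fetched, an added precaution so that a token cannot point the middleware at an arbitrary server. The function name is hypothetical.

```python
from typing import Set
from urllib.parse import urlsplit


def jwks_url(issuer: str, trusted_hosts: Set[str]) -> str:
    """Return the JWKS endpoint for an issuer, refusing untrusted hosts."""
    parts = urlsplit(issuer)
    if parts.scheme != "https" or parts.hostname not in trusted_hosts:
        raise ValueError(f"untrusted issuer: {issuer!r}")
    return issuer.rstrip("/") + "/protocol/openid-connect/certs"
```

With an allowlist like this, adding a region would mean adding its auth host to the trusted set, which is a configuration change the text above does not call for; whether that trade-off is wanted is a design decision.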
Cluster Agent Authentication (mTLS)
Cluster agents — processes running inside client Kubernetes clusters — authenticate to the control plane using mutual TLS, not JWTs. OpenBao's PKI backend issues the agent certificates. The control plane ingress uses a dedicated ALB (cplane-alb) with mTLS passthrough mode to support this.
This is a separate authentication path from the Keycloak JWT flow above and applies only to machine-to-machine calls from cluster agents.
TODO: link to detailed mTLS / OpenBao PKI doc
Replicated vs Shared
| Component | Mode | Owner |
|---|---|---|
| Dashboard UI | Shared — Cloudflare Pages | Cogrion |
| Auth Worker (auth.cogrion.com) | Shared — Cloudflare Worker | Cogrion |
| Tenant → region lookup | Shared — Cloudflare KV | Cogrion |
| DNS zone (cogrion.com) | Shared — Cloudflare | Cogrion |
| Control plane app | Replicated per region | Cogrion |
| Keycloak | Replicated per region | Cogrion |
| Temporal | Replicated per region | Cogrion |
| Database (RDS) | Replicated per region, independent | Cogrion |
| S3 buckets | Replicated per region | Cogrion |
| Observability stack | Replicated per region | Cogrion |
| ALB (Cogrion cluster) | Replicated per region | Cogrion |
| Client ALB | Per workspace | Client |
| Client Route53 zone | Per workspace (delegated) | Client (DNS parent: Cogrion) |
| Client ACM cert | Per workspace | Client (domain: Cogrion) |
| Client EKS / k8s | Per workspace | Client |
| BFF | Per workspace, on client cluster | Client (Helm chart: Cogrion) |
Cluster Configuration
Each cluster deployment is driven by a Helm values file specific to its region and environment. The application code reads environment variables only; no region is hard-coded. The key variables injected into every pod are:
| Variable | Example (prod-sgp-1) | Purpose |
|---|---|---|
| BASE_DOMAIN | cplane.sgp-1.cogrion.com | Control plane API / Keycloak redirect URIs |
| WORKSPACE_DOMAIN | sgp-1.cogrion.com | Suffix for per-workspace URLs and delegated zones |
| AUTH_DOMAIN | auth.sgp-1.cogrion.com | Keycloak endpoint |
| TEMPORAL_DOMAIN | temporal.sgp-1.cogrion.com | Temporal UI |
| REGION | sgp-1 | Region identifier |
| ENVIRONMENT | prod | Environment tag |
For the full values file structure, Helm chart layout, and how variables are wired through ArgoCD see GitOps (ArgoCD + Helm).
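To illustrate how an application might consume these variables, here is a minimal sketch that loads them into a typed object and fails fast when one is missing. The `ClusterConfig` class and `load_config` function are hypothetical; only the variable names come from the table.

```python
import os
from dataclasses import dataclass
from typing import Mapping

REQUIRED_VARS = (
    "BASE_DOMAIN", "WORKSPACE_DOMAIN", "AUTH_DOMAIN",
    "TEMPORAL_DOMAIN", "REGION", "ENVIRONMENT",
)


@dataclass(frozen=True)
class ClusterConfig:
    base_domain: str
    workspace_domain: str
    auth_domain: str
    temporal_domain: str
    region: str
    environment: str


def load_config(env: Mapping[str, str] = os.environ) -> ClusterConfig:
    """Read the injected variables, raising if any is unset or empty."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return ClusterConfig(**{name.lower(): env[name] for name in REQUIRED_VARS})
```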
Workspace Provisioning Flow
When a new workspace is created, the control plane runs the following sequence via cross-account IAM:
1. Assume client IAM role (sts:AssumeRole)
2. Create Route53 hosted zone in client account
   zone: w-xxxx.{env.}sgp-1.cogrion.com
3. Write NS record in Cogrion Route53
   w-xxxx.{env.}sgp-1.cogrion.com → client nameservers
4. Request ACM certificate in client account
   domain: *.w-xxxx.{env.}sgp-1.cogrion.com
5. Write DNS validation CNAME in client Route53
6. Wait for ACM validation (async, ~2 min)
7. Write workspace record to tenants DB
   { workspace_id, region, delegated_zone, status: active }
8. Write entry to Cloudflare KV
   w-xxxx → { region: sgp-1, bff_url: bff.w-xxxx.sgp-1.cogrion.com }
9. Create Keycloak client in regional realm
   redirect_uris: https://bff.w-xxxx.sgp-1.cogrion.com/*, https://app.w-xxxx.sgp-1.cogrion.com/*
   (Keycloak accepts a wildcard only at the end of a redirect URI, so each workspace host is listed explicitly)
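Most of the values written during provisioning derive from three inputs: workspace ID, region and environment. The sketch below computes them in one place. It is illustrative only: `workspace_names` is a hypothetical helper, the workspace ID is assumed to include its "w-" prefix, and "prod" is assumed to be the only environment without a label.

```python
from typing import Any, Dict


def workspace_names(workspace_id: str, region: str, env: str = "prod") -> Dict[str, Any]:
    """Derive the zone, certificate domain, KV entry and tenant record."""
    env_label = "" if env == "prod" else f"{env}."
    zone = f"{workspace_id}.{env_label}{region}.cogrion.com"
    return {
        "delegated_zone": zone,              # steps 2 and 3
        "certificate_domain": f"*.{zone}",   # step 4
        "tenant_record": {                   # step 7
            "workspace_id": workspace_id,
            "region": region,
            "delegated_zone": zone,
            "status": "active",
        },
        "kv_key": workspace_id,              # step 8
        "kv_value": {"region": region, "bff_url": f"bff.{zone}"},
    }
```

Deriving every name from the same zone string keeps the Route53 zone, the certificate and the KV entry from drifting apart.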
BFF startup sequence (client cluster)
1. BFF pod starts
2. Pull config from control plane
   GET https://cplane.sgp-1.cogrion.com/api/workspaces/w-xxxx/config
   → Keycloak endpoints, feature flags, Temporal address, etc.
3. Register BFF with control plane
   POST https://cplane.sgp-1.cogrion.com/api/workspaces/w-xxxx/register
   { bff_url: "https://bff.w-xxxx.sgp-1.cogrion.com", public_ip: "..." }
4. Control plane updates workspace record, marks BFF as healthy
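The two calls the BFF makes at startup can be described as data before any HTTP client is involved. The sketch below only builds the method, URL and body for each call; the function name is hypothetical, and the IP address in the usage note is a documentation placeholder.

```python
from typing import List, Optional, Tuple

Request = Tuple[str, str, Optional[dict]]


def startup_requests(cplane_base: str, workspace_id: str,
                     bff_url: str, public_ip: str) -> List[Request]:
    """Return the config pull and the registration call, in order."""
    base = f"{cplane_base.rstrip('/')}/api/workspaces/{workspace_id}"
    return [
        ("GET", f"{base}/config", None),
        ("POST", f"{base}/register", {"bff_url": bff_url, "public_ip": public_ip}),
    ]
```

Called as `startup_requests("https://cplane.sgp-1.cogrion.com", "w-xxxx", "https://bff.w-xxxx.sgp-1.cogrion.com", "203.0.113.10")`, it returns the GET for the config followed by the POST that registers the BFF.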
Adding a New Region
When a client requires EU data residency, spin up frankfurt-1. The high-level sequence is:
- Terraform: copy envs/prod-sgp-1/ → envs/prod-frankfurt-1/, update tfvars and backend.tf, run apply. This provisions VPC, EKS, RDS, Route53 zone, S3 buckets, and IAM roles. See Infrastructure (Terraform) → Adding a New Region.
- GitOps: copy argocd/apps/prod-sgp-1/ and values/prod-sgp-1/, update all domain and bucket values, bootstrap ArgoCD into the new cluster. See GitOps → Bootstrap: New Cluster.
- DNS: add NS records for frankfurt-1.cogrion.com in Cloudflare DNS, delegating to the new Route53 zone.
- Routing: assign new EU tenants to frankfurt-1 at signup (Cloudflare KV lookup).
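No Helm chart changes, and no application code changes beyond the one-time principal resolution work described under Authentication (Multi-Region Principal Resolution), which is needed when the first non-primary region goes live.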
Local Development
Engineers run the full stack locally via Docker Compose. The local stack does not use regional domains.
Local:
http://localhost:3000 control plane
http://localhost:8080 Keycloak
http://localhost:8088 Temporal UI
http://localhost:3001 BFF (mock workspace)
For integration testing against the shared dev cluster, point BASE_DOMAIN at cplane.dev.sgp-1.cogrion.com in your local .env. Do not create real workspace records against dev without coordinating with the team, because dev shares a single Keycloak realm and database.
Alternatives: Pure AWS Global Layer
The current design uses Cloudflare for three global components: DNS, the auth proxy worker, and the tenant→region KV lookup. This section documents AWS-native equivalents for each — relevant if Cloudflare is not available, not preferred, or needs to be exited incrementally. Each component can be replaced independently.
KV lookup (cplane.cogrion.com/lookup)
This is the highest-priority replacement target: it sits on the auth path and is written to on every workspace provision.
Option 1: API Gateway + Lambda + DynamoDB (recommended migration bridge)
Route53 alias → API Gateway → Lambda (lookup handler) → DynamoDB (tenant-region-map)
- DynamoDB table: tenant_id (PK) → { region, bff_url }. Same shape as CF KV.
- Write path: control plane writes to DynamoDB at provision time instead of CF KV (step 8 in the provisioning flow).
- Read path: Lambda returns the same JSON shape the CF Worker returns today, so no client changes are needed.
- Latency: ~2–5 ms DynamoDB read vs sub-millisecond CF KV edge read. Acceptable for a login-path lookup that is cached by the caller.
- DNS: cplane.cogrion.com as a Route53 alias pointing to the API Gateway regional endpoint, or to a CloudFront distribution in front of it for caching.
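A sketch of the lookup handler for Option 1, in the shape of an API Gateway proxy integration. The table read is injected as a callable so the sketch runs without AWS; in a real Lambda it would wrap a DynamoDB read. The query parameter name `tenant`, the error bodies and the factory function are all assumptions, since the existing CF Worker's request format is not specified here.

```python
import json
from typing import Any, Callable, Dict, Optional

GetItem = Callable[[str], Optional[Dict[str, str]]]


def make_lookup_handler(get_item: GetItem):
    """Build a Lambda-style handler around a tenant → record lookup."""

    def handler(event: Dict[str, Any], context: Any = None) -> Dict[str, Any]:
        # API Gateway sends None here when the request has no query string.
        params = event.get("queryStringParameters") or {}
        tenant_id = params.get("tenant")  # hypothetical parameter name
        if not tenant_id:
            return {"statusCode": 400, "body": json.dumps({"error": "missing tenant"})}
        item = get_item(tenant_id)
        if item is None:
            return {"statusCode": 404, "body": json.dumps({"error": "unknown tenant"})}
        return {
            "statusCode": 200,
            "headers": {"Content-Type": "application/json"},
            "body": json.dumps({"region": item["region"], "bff_url": item["bff_url"]}),
        }

    return handler
```

Keeping the data access behind a callable also makes it straightforward to point the same handler at a different store later.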
Option 2: CloudFront + Lambda@Edge + DynamoDB Global Tables
Closer to the CF Worker + KV model in behavior (edge compute + edge data). Use this if lookup latency is a hard constraint.
CloudFront (cplane.cogrion.com) → Lambda@Edge (origin-request) → DynamoDB Global Table
- DynamoDB Global Tables replicate the tenant-region-map table to each AWS region you select, so lookups can be served from a replica near the caller.
- Adds operational complexity (Lambda@Edge cold starts, Global Table replication lag on writes).
- Not needed unless you have measured lookup latency as a problem with Option 1.
Auth proxy (auth.cogrion.com)
The CF Worker only fires on OIDC login redirects — it is not on the per-request hot path. Latency here is less critical than for the KV lookup.
Option 1: API Gateway + Lambda (simplest)
Route53 alias → API Gateway → Lambda (proxy handler → auth.{region}.cogrion.com)
- Lambda reads tenant→region from the same DynamoDB table, constructs the target Keycloak URL, and proxies the request with http-proxy or node-fetch.
- Stateless, no edge deployment required.
- Cold starts add ~100–200 ms to the first request in a login flow; provisioned concurrency removes this if needed.
Option 2: CloudFront + Lambda@Edge
CloudFront (auth.cogrion.com) → Lambda@Edge (origin-request) → regional Keycloak origin
- Same lookup store as above. Lambda@Edge rewrites the origin host to auth.{region}.cogrion.com before CloudFront forwards the request. This has to run on the origin-request trigger, because a viewer-request function cannot change the origin.
- Use this if you want consistent edge behavior for both the auth. and cplane. subdomains, or if you have strict latency requirements for login redirects.
DNS and static UI
| Component | Cloudflare | AWS equivalent |
|---|---|---|
| cogrion.com zone | Cloudflare DNS | Route53 hosted zone (transfer registrar NS) |
| app.cogrion.com | Cloudflare Pages | S3 + CloudFront distribution |
| NS delegation records | Cloudflare DNS records | Route53 records in cogrion.com zone |
DNS transfer is the last step — it must happen after all worker/page replacements are live, so the cutover is a single NS change at the registrar.
Incremental migration path
Phases 1–3 are independent of each other and can be done in any order. Phase 4 must come last, and all four are required to fully exit Cloudflare.
| Phase | Change | Cloudflare still needed? |
|---|---|---|
| 1 | Deploy API GW + Lambda + DynamoDB for cplane.cogrion.com/lookup. Update provisioning code to write DynamoDB instead of CF KV. | Yes (DNS only) |
| 2 | Deploy API GW + Lambda for auth.cogrion.com. | Yes (DNS only) |
| 3 | Deploy S3 + CloudFront for app.cogrion.com. | Yes (DNS only) |
| 4 | Transfer cogrion.com zone to Route53. Update registrar NS. Decommission Cloudflare account. | No |
Phases 1–3 each leave Cloudflare managing DNS while replacing the compute. Phase 4 is a single registrar NS change; once propagated, Cloudflare is fully removed.
Trade-off summary
| Concern | Cloudflare (current) | Pure AWS |
|---|---|---|
| KV lookup latency | Sub-ms, 300 PoPs | ~2–5 ms DynamoDB from a single region; Global Tables + Lambda@Edge shorten the network hop but each lookup is still a DynamoDB read |
| Auth proxy | CF Worker (edge) | API GW + Lambda or CloudFront + Lambda@Edge |
| Static UI | CF Pages (edge CDN, zero config) | S3 + CloudFront (more config, same capability) |
| DNS | Cloudflare DNS | Route53 (equivalent TTL propagation) |
| Vendor surface | Cloudflare for global layer | AWS-only — aligns with existing BYOC AWS posture |
| Ops complexity | Low — CF Workers/Pages are simple to deploy | Higher — Lambda@Edge, DynamoDB Global Tables, CloudFront configs |
| Cost | CF Workers free tier is generous at current scale | API GW + Lambda + DynamoDB costs scale with requests; low at current scale |
| Migration risk | None — current design | Low per phase; Phase 4 (DNS transfer) is the only risky step |
Recommendation: If Cloudflare is acceptable, the current design is operationally simpler. Phase 1 (KV lookup → DynamoDB) is the best entry point if an incremental migration is required — it is the most isolated component, has no user-visible surface, and unblocks full AWS alignment. Defer Phase 4 (DNS transfer) until Phases 1–3 are stable in production.
Summary
- One Helm chart, multiple values files — one per environment per region
- Region is always in the domain: {service}.{env?}.{region}.cogrion.com
- Client workspaces get a delegated zone: *.w-{id}.{env?}.{region}.cogrion.com
- TLS for Cogrion services: cert-manager in cluster
- TLS for client workspaces: ACM in client account, auto-renewed
- Database is per region, never replicated — tenant data stays in its region
- S3 buckets are per region per environment, never shared
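- The only shared global state: Cloudflare DNS zone + KV lookup (tenant → region), plus the tenants and principals tables held in sgp-1
- Adding a region = new values file + new cluster + new Route53 zone; the only code work is the one-time principal resolution change for the first non-primary region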