Regional Deployment — Architecture Overview

Cogrion runs as a multi-region SaaS control plane using a BYOC (Bring Your Own Cloud) model. Each region is a fully self-contained deployment. Client workspaces run on the client's own AWS infrastructure and connect to the nearest Cogrion region via DNS delegation and cross-account IAM.

This document covers domain conventions, environment structure, component layout, DNS and TLS ownership, database strategy, and workspace provisioning. For how the infrastructure is provisioned, see Infrastructure (Terraform). For how workloads are deployed and promoted, see GitOps (ArgoCD + Helm).

Environments

Environment	Purpose	Cluster
`dev`	Shared development cluster, used by all engineers	One cluster, sgp-1 only
`prod`	Production, multi-region	sgp-1 now, frankfurt-1 when needed

Local development runs against a local stack (Docker Compose) or points at the shared dev cluster for integration testing. No per-engineer cloud clusters.

Region Naming Convention

Regions use geographic city names, not cloud provider codes.

Region ID	Location	Status
`sgp-1`	Singapore 1	Live
`frankfurt-1`	Frankfurt 1	Add when an EU tenant requires it

Pattern: {city}-{index} — leaves room for a second Singapore cluster (sgp-2) without changing the convention.
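
As a rough illustration of the convention, a region ID can be validated and split with a small helper. The function name, and the rule that the city part is lowercase letters only, are assumptions made for this sketch and are not defined by the platform.

```typescript
// Hypothetical helper: splits a region ID such as "sgp-1" into its parts.
// Assumes the city part is lowercase letters and the index is a positive integer.
interface RegionId {
  city: string;
  index: number;
}

function parseRegionId(id: string): RegionId | null {
  const match = /^([a-z]+)-([1-9][0-9]*)$/.exec(id);
  if (match === null) {
    return null; // does not follow {city}-{index}
  }
  return { city: match[1], index: Number(match[2]) };
}
```

Under these assumptions a cloud provider code such as ap-southeast-1 is rejected, while sgp-2 parses without any change to the convention.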

Domain Structure

URLs in the platform have two distinct tiers: global (stable, single entry points for clients) and regional (where compute actually runs). Only global URLs appear in client-facing documentation.

Tiers

.
├── Global — served from Cloudflare, never change as regions are added:
│   ├── app.cogrion.com                              ← Dashboard UI (Cloudflare Pages)
│   ├── auth.cogrion.com                             ← Auth entry point (CF Worker → regional Keycloak)
│   └── cplane.cogrion.com/lookup                    ← Tenant → region lookup (CF Worker + KV)
│
└── Regional — one set per environment per region:
    ├── {service}.{region}.cogrion.com               ← Cogrion-owned services (prod)
    ├── {service}.{env}.{region}.cogrion.com         ← Cogrion-owned services (non-prod)
    ├── {service}.w-{id}.{region}.cogrion.com        ← Per-workspace services (prod)
    └── {service}.w-{id}.{env}.{region}.cogrion.com  ← Per-workspace services (non-prod)
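
The four regional patterns above can be sketched as a single hostname builder. The function and field names are hypothetical; the only rule taken from the text is that prod omits the environment label and every other environment inserts it before the region.

```typescript
// Hypothetical helper mirroring the regional patterns above.
// "prod" is omitted from the hostname; any other env is inserted before the region.
interface RegionalHostInput {
  service: string;      // e.g. "cplane", "bff"
  region: string;       // e.g. "sgp-1"
  env: string;          // e.g. "prod", "dev"
  workspaceId?: string; // e.g. "xxxx" for w-xxxx; omit for Cogrion-owned services
}

function regionalHost(input: RegionalHostInput): string {
  const labels: string[] = [input.service];
  if (input.workspaceId !== undefined) {
    labels.push(`w-${input.workspaceId}`);
  }
  if (input.env !== "prod") {
    labels.push(input.env);
  }
  labels.push(input.region, "cogrion.com");
  return labels.join(".");
}
```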

Global URLs

These are the only URLs that appear in client-facing docs. They never change regardless of how many regions are added.

Service	Dev	Prod	Served by
Dashboard UI	`app.dev.cogrion.com`	`app.cogrion.com`	Cloudflare Pages
Auth (Keycloak entry)	`auth.dev.cogrion.com`	`auth.cogrion.com`	CF Worker → regional
Tenant lookup	`cplane.cogrion.com/lookup`	`cplane.cogrion.com/lookup`	CF Worker + KV

auth.cogrion.com is a Cloudflare Worker that proxies to the correct regional Keycloak (auth.sgp-1.cogrion.com) based on a tenant→region lookup. The UI's OIDC config always points here — it never needs to know which region it's talking to.

After login the UI reads the cogrion_region claim from the JWT and constructs the regional API base URL itself (https://cplane.{region}.cogrion.com). No extra network call needed.
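
A minimal sketch of that step, assuming the JWT payload has already been decoded and verified elsewhere. The DecodedClaims type and function name are hypothetical; only the cogrion_region claim and the URL pattern come from the text above, and the pattern shown is the prod form.

```typescript
// Hypothetical shape of the decoded JWT payload; only cogrion_region is taken from the doc.
interface DecodedClaims {
  cogrion_region?: string;
}

// Builds the regional control plane base URL (prod pattern) from the region claim.
function regionalApiBaseUrl(claims: DecodedClaims): string {
  const region = claims.cogrion_region;
  if (region === undefined || region === "") {
    throw new Error("JWT is missing the cogrion_region claim");
  }
  return `https://cplane.${region}.cogrion.com`;
}
```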

Regional URLs (Cogrion cluster)

Internal URLs used by the UI at runtime and by platform operators. Not in client-facing docs.

Service	Dev	Prod (sgp-1)
Control plane API	`cplane.dev.sgp-1.cogrion.com`	`cplane.sgp-1.cogrion.com`
Keycloak (direct)	`auth.dev.sgp-1.cogrion.com`	`auth.sgp-1.cogrion.com`
Temporal UI	`temporal.dev.sgp-1.cogrion.com`	`temporal.sgp-1.cogrion.com`
Grafana	`grafana.dev.sgp-1.cogrion.com`	`grafana.sgp-1.cogrion.com`

Per-workspace (client) URLs

Client workspaces get a delegated subdomain zone. Cogrion owns the NS record; the client's Route53 manages everything beneath it.

Service	Dev	Prod (sgp-1)
BFF	`bff.w-xxxx.dev.sgp-1.cogrion.com`	`bff.w-xxxx.sgp-1.cogrion.com`
App (if hosted)	`app.w-xxxx.dev.sgp-1.cogrion.com`	`app.w-xxxx.sgp-1.cogrion.com`
Delegated zone root	`*.w-xxxx.dev.sgp-1.cogrion.com`	`*.w-xxxx.sgp-1.cogrion.com`

Component Map

Cogrion cluster (per region, per environment)

Every component below runs inside a single Kubernetes cluster. Dev and prod are separate clusters; they do not share any resources.

Cluster: sgp-1 (prod)
  ├── Control plane app
  ├── Keycloak
  ├── Temporal (server + workers)
  ├── OpenBao (PKI backend — cluster agent mTLS certificates)
  ├── Observability stack
  │     ├── Prometheus
  │     ├── Grafana
  │     └── Loki (or CloudWatch exporter)
  ├── Ingress (AWS ALB via aws-load-balancer-controller)
  └── cert-manager (TLS for *.sgp-1.cogrion.com)

Client cluster (per workspace, BYOC)

Each client runs its own AWS account with its own EKS cluster. Cogrion provisions the DNS delegation and ACM cert via cross-account IAM at onboarding time. Everything else in the cluster is owned and operated by the client.

Client cluster: acme (their AWS account)
  ├── BFF (deployed via Cogrion Helm chart)
  ├── Client application
  ├── Client database (RDS or otherwise)
  ├── ALB (TLS termination, ACM cert)
  └── Route53 (delegated zone: *.w-acme.sgp-1.cogrion.com)

Shared global layer (not a cluster)

Component	Where	Purpose
Cloudflare DNS	`cogrion.com` zone	Parent DNS, NS delegation records
Cloudflare Pages	`app.cogrion.com`	Dashboard UI — static bundle, global CDN
Cloudflare Worker (auth)	`auth.cogrion.com`	Proxies login traffic to the correct regional Keycloak
Cloudflare Worker + KV	`cplane.cogrion.com/lookup`	Tenant → region routing lookup
S3 buckets	Per region, per env	Artifacts, agent outputs, exports

DNS Ownership Boundaries

cogrion.com                                  ← Cloudflare — Cogrion owns
├── app.cogrion.com                          ← Cloudflare Pages — Dashboard UI
├── auth.cogrion.com                         ← Cloudflare Worker — proxies to regional Keycloak
├── cplane.cogrion.com                       ← Cloudflare Worker — tenant lookup (/lookup)
└── sgp-1.cogrion.com                        ← Route53 zone — Cogrion owns
    ├── cplane.sgp-1.cogrion.com             ← Route53 → ALB (Cogrion cluster)
    ├── auth.sgp-1.cogrion.com               ← Route53 → ALB (Cogrion cluster)
    ├── temporal.sgp-1.cogrion.com           ← Route53 → ALB (Cogrion cluster)
    ├── grafana.sgp-1.cogrion.com            ← Route53 → ALB (Cogrion cluster)
    └── *.w-xxxx.sgp-1.cogrion.com           ← Route53 — NS delegated to client Route53
        ├── bff.w-xxxx.sgp-1.cogrion.com     ← Client Route53 → Client ALB → Client EKS
        └── app.w-xxxx.sgp-1.cogrion.com     ← Client Route53 → Client ALB → Client EKS

TLS for app.cogrion.com and auth.cogrion.com: managed by Cloudflare automatically.
TLS for *.sgp-1.cogrion.com: cert-manager in the Cogrion cluster, using Let's Encrypt or ACM.
TLS for *.w-xxxx.sgp-1.cogrion.com: ACM in the client account, DNS-validated via the client's Route53 and auto-renewed by AWS.

Database Strategy

Per region, fully independent

Each regional cluster has its own database. No cross-region replication. Tenant data never leaves the region it was provisioned in.

sgp-1 DB (RDS)
├── tenants          ← global provisioning metadata (all regions)
├── principals       ← global user identity (all regions)
├── workspaces       ← sgp-1 tenants only
└── workspace_data   ← sgp-1 tenant data only

frankfurt-1 DB (RDS, when live)
├── workspaces       ← frankfurt-1 tenants only
└── workspace_data   ← frankfurt-1 tenant data only
                        (no tenants — calls sgp-1 control plane for provisioning metadata)
                        (no principals — calls sgp-1 control plane for principal lookup)

Why tenants table stays in sgp-1

The tenants table is provisioning metadata only — workspace ID, assigned region, delegated zone, status. It is not tenant business data. Keeping it in the primary region (sgp-1) avoids distributed state until there is a real operational reason to split it.

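When frankfurt-1 is live, its control plane reads tenant metadata from sgp-1 only during provisioning operations (low frequency, not in the data path). Principal lookups are a separate, per-request dependency, described in the next section.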

Why principals table stays in sgp-1

The principals table tracks user identity — email, external Keycloak ID, system roles. Every authenticated request resolves a principal by email against the local database. If principals were per-region, the same user would get a different uid in each region they have access to, breaking cross-region audit trails and making IAM policy reasoning inconsistent.

Keeping principals in sgp-1 alongside tenants gives a single source of identity truth. Regional control planes call back to sgp-1 to resolve a principal by email the first time a request needs it, then reuse the result for the rest of that request. Because this call is on the authenticated request path, it must be fast, and sgp-1 must be treated as a hard dependency for auth in all regions.

Accepted limitation: if sgp-1 is unavailable, authenticated requests to all other regional control planes will fail. This is the same limitation that already applies to provisioning operations.

S3 buckets

One set of buckets per region per environment. Buckets are never shared across regions.

cogrion-dev-sgp-1-artifacts
cogrion-dev-sgp-1-exports
cogrion-prod-sgp-1-artifacts
cogrion-prod-sgp-1-exports

Naming convention: cogrion-{env}-{region}-{purpose}
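
The naming convention can be expressed as a one-line builder. This is a sketch: the function name and the BucketPurpose type are hypothetical, and the length check reflects the general S3 limit of 63 characters for bucket names rather than anything specific to this platform.

```typescript
// The two purposes that appear in the bucket list above.
type BucketPurpose = "artifacts" | "exports";

// Builds cogrion-{env}-{region}-{purpose} and rejects names S3 would not accept by length.
function bucketName(env: string, region: string, purpose: BucketPurpose): string {
  const name = `cogrion-${env}-${region}-${purpose}`;
  if (name.length > 63) {
    throw new Error(`bucket name exceeds 63 characters: ${name}`);
  }
  return name;
}
```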

Authentication — Multi-Region Principal Resolution

Every authenticated request to the control plane runs through authenticationMiddleware, which must produce a req.principal before the route handler runs. req.principal contains two distinct pieces of information that come from different places:

Identity — uid, email, kind, externalId, system_roles — lives in the principals table in sgp-1.
Account memberships — which accounts the principal belongs to and with what roles — lives in the account_members table in the local regional database, because accounts are per-region.

Same-region flow (sgp-1)

No cross-region call. Identity and account memberships are both in the local database.

Cross-region flow (frankfurt-1)

frankfurt-1's control plane has no principals table. On every authenticated request it calls back to sgp-1 to resolve identity, then queries its own database for local account memberships.

What needs to be built

Two changes are required when frankfurt-1 goes live:

1. New internal endpoint on the control plane

GET /internal/principals?email={email}

Returns { uid, externalId, kind, systemRoles } — identity only, no account memberships.
Auth: Authorization: Bearer {token} checked against INTER_REGION_SERVICE_TOKEN (shared secret injected via Helm values, not a user JWT).
Not exposed via the public ingress — only reachable within the Cogrion cluster network or over a private VPC peering link.
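
The endpoint contract above could look roughly like the following framework-independent handler. Everything except the response fields and the bearer-token check is an assumption: the function names, the status codes, and the findByEmail callback are hypothetical, and tokensMatch is a plain-TypeScript approximation of a constant-time comparison (in Node, crypto.timingSafeEqual would be the usual choice).

```typescript
// Identity-only response shape described above.
interface PrincipalIdentity {
  uid: string;
  externalId: string;
  kind: string;
  systemRoles: string[];
}

interface HandlerResult {
  status: number; // status codes are assumptions of this sketch
  body: PrincipalIdentity | { error: string };
}

// Compares two strings without returning early on the first mismatching character.
function tokensMatch(presented: string, expected: string): boolean {
  let diff = presented.length ^ expected.length;
  for (let i = 0; i < presented.length; i++) {
    diff |= presented.charCodeAt(i) ^ expected.charCodeAt(i % Math.max(expected.length, 1));
  }
  return diff === 0;
}

function handlePrincipalLookup(
  authorization: string | undefined,
  expectedToken: string, // value of INTER_REGION_SERVICE_TOKEN
  email: string | undefined,
  findByEmail: (email: string) => PrincipalIdentity | undefined
): HandlerResult {
  const prefix = "Bearer ";
  const presented =
    authorization !== undefined && authorization.indexOf(prefix) === 0
      ? authorization.slice(prefix.length)
      : "";
  // An empty expected token is treated as misconfiguration and never matches.
  if (expectedToken === "" || !tokensMatch(presented, expectedToken)) {
    return { status: 401, body: { error: "unauthorized" } };
  }
  if (email === undefined || email === "") {
    return { status: 400, body: { error: "email is required" } };
  }
  const principal = findByEmail(email);
  if (principal === undefined) {
    return { status: 404, body: { error: "principal not found" } };
  }
  return { status: 200, body: principal };
}
```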

2. Auth middleware split for non-primary regions

The middleware detects which mode it is in via PRIMARY_CPLANE_API_URL:

Env var	Value on sgp-1	Value on frankfurt-1
`PRIMARY_CPLANE_API_URL`	(unset)	`https://cplane.sgp-1.cogrion.com`
`INTER_REGION_SERVICE_TOKEN`	`<shared secret>`	`<same shared secret>`

When PRIMARY_CPLANE_API_URL is set, the middleware:

Calls GET {PRIMARY_CPLANE_API_URL}/internal/principals?email={email} to get identity + system roles.
Queries the local DB for account memberships by principalUid.
Merges both into req.principal — same shape as today, same downstream code, no route handler changes needed.
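
The three steps above can be sketched with injected dependencies so the same function serves both modes. The type names, the memberships field, and the dependency callbacks are hypothetical; the actual shape of req.principal is not specified here, so treat this as an outline of the control flow only.

```typescript
interface Identity {
  uid: string;
  email: string;
  kind: string;
  externalId: string;
  systemRoles: string[];
}

interface AccountMembership {
  accountId: string;
  roles: string[];
}

interface Principal extends Identity {
  memberships: AccountMembership[];
}

interface PrincipalDeps {
  primaryCplaneApiUrl: string | undefined; // PRIMARY_CPLANE_API_URL; unset on sgp-1
  fetchIdentityFromPrimary: (url: string) => Promise<Identity>;
  findIdentityLocally: (email: string) => Promise<Identity>;
  findMemberships: (principalUid: string) => Promise<AccountMembership[]>;
}

// Builds the internal lookup URL, tolerating a trailing slash on the base URL.
function principalLookupUrl(primaryUrl: string, email: string): string {
  const base = primaryUrl.replace(/\/+$/, "");
  return `${base}/internal/principals?email=${encodeURIComponent(email)}`;
}

function mergePrincipal(identity: Identity, memberships: AccountMembership[]): Principal {
  return { ...identity, memberships };
}

async function resolvePrincipal(deps: PrincipalDeps, email: string): Promise<Principal> {
  // Non-primary regions resolve identity remotely; the primary reads its own table.
  const identity = deps.primaryCplaneApiUrl
    ? await deps.fetchIdentityFromPrimary(principalLookupUrl(deps.primaryCplaneApiUrl, email))
    : await deps.findIdentityLocally(email);
  // Account memberships always come from the local regional database.
  const memberships = await deps.findMemberships(identity.uid);
  return mergePrincipal(identity, memberships);
}
```

Because both branches end in the same merge, route handlers see an identical principal whether or not the cross-region call happened.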

JWT verification is already cross-region

The auth middleware fetches JWKS dynamically from {iss}/protocol/openid-connect/certs, where the iss claim in the JWT always identifies the Keycloak that issued it. A frankfurt-1 JWT carries an iss under https://auth.frankfurt-1.cogrion.com, so the middleware fetches frankfurt-1's JWKS and verifies the token against it. No config change is needed for JWT verification when a new region is added.
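
A sketch of deriving the JWKS URL from the issuer. The host pattern check is an assumption added for this example (the text above does not say how issuers are constrained); it accepts any auth.{region}.cogrion.com or auth.{env}.{region}.cogrion.com host so that new regions need no config change. The realm path used in the usage example is also hypothetical.

```typescript
// Assumed trust rule: the issuer host must be a regional Keycloak hostname.
const ISSUER_HOST_PATTERN = /^auth\.(?:[a-z0-9]+\.)?[a-z]+-[1-9][0-9]*\.cogrion\.com$/;

function jwksUrlForIssuer(iss: string): string {
  // Capture the host portion of an https issuer URL.
  const match = /^https:\/\/([^\/]+)(\/.*)?$/.exec(iss);
  if (match === null || !ISSUER_HOST_PATTERN.test(match[1])) {
    throw new Error(`untrusted issuer: ${iss}`);
  }
  return `${iss.replace(/\/+$/, "")}/protocol/openid-connect/certs`;
}
```

For example, jwksUrlForIssuer("https://auth.frankfurt-1.cogrion.com/realms/cogrion") yields the certs endpoint under that realm, while an issuer on any other host is rejected before a network call is made.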

Cluster Agent Authentication (mTLS)

Cluster agents — processes running inside client Kubernetes clusters — authenticate to the control plane using mutual TLS, not JWTs. OpenBao's PKI backend issues the agent certificates. The control plane ingress uses a dedicated ALB (cplane-alb) with mTLS passthrough mode to support this.

This is a separate authentication path from the Keycloak JWT flow above and applies only to machine-to-machine calls from cluster agents.

TODO: link to detailed mTLS / OpenBao PKI doc

Replicated vs Shared

Component	Mode	Owner
Dashboard UI	Shared — Cloudflare Pages	Cogrion
Auth Worker (`auth.cogrion.com`)	Shared — Cloudflare Worker	Cogrion
Tenant → region lookup	Shared — Cloudflare KV	Cogrion
DNS zone (`cogrion.com`)	Shared — Cloudflare	Cogrion
Control plane app	Replicated per region	Cogrion
Keycloak	Replicated per region	Cogrion
Temporal	Replicated per region	Cogrion
Database (RDS)	Replicated per region, independent	Cogrion
S3 buckets	Replicated per region	Cogrion
Observability stack	Replicated per region	Cogrion
ALB (Cogrion cluster)	Replicated per region	Cogrion
Client ALB	Per workspace	Client
Client Route53 zone	Per workspace (delegated)	Client (DNS parent: Cogrion)
Client ACM cert	Per workspace	Client (domain: Cogrion)
Client EKS / k8s	Per workspace	Client
BFF	Per workspace, on client cluster	Client (Helm chart: Cogrion)

Cluster Configuration

Each cluster deployment is driven by a region+environment specific Helm values file. The application code reads environment variables only — it has no knowledge of which region it is in. The key variables injected into every pod are:

Variable	Example (prod-sgp-1)	Purpose
`BASE_DOMAIN`	`cplane.sgp-1.cogrion.com`	Control plane API / Keycloak redirect URIs
`WORKSPACE_DOMAIN`	`sgp-1.cogrion.com`	Suffix for per-workspace URLs and delegated zones
`AUTH_DOMAIN`	`auth.sgp-1.cogrion.com`	Keycloak endpoint
`TEMPORAL_DOMAIN`	`temporal.sgp-1.cogrion.com`	Temporal UI
`REGION`	`sgp-1`	Region identifier
`ENVIRONMENT`	`prod`	Environment tag
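
As an illustration of reading these variables without embedding any region knowledge in code, a loader might look like this. The ClusterConfig type and function names are hypothetical; the variable names are the ones in the table.

```typescript
// Typed view of the variables in the table above.
interface ClusterConfig {
  baseDomain: string;
  workspaceDomain: string;
  authDomain: string;
  temporalDomain: string;
  region: string;
  environment: string;
}

type Env = Record<string, string | undefined>;

// Fails fast at startup if a variable is unset or empty.
function requireVar(env: Env, name: string): string {
  const value = env[name];
  if (value === undefined || value === "") {
    throw new Error(`missing environment variable: ${name}`);
  }
  return value;
}

function loadClusterConfig(env: Env): ClusterConfig {
  return {
    baseDomain: requireVar(env, "BASE_DOMAIN"),
    workspaceDomain: requireVar(env, "WORKSPACE_DOMAIN"),
    authDomain: requireVar(env, "AUTH_DOMAIN"),
    temporalDomain: requireVar(env, "TEMPORAL_DOMAIN"),
    region: requireVar(env, "REGION"),
    environment: requireVar(env, "ENVIRONMENT"),
  };
}
```

In a Node process this would typically be called once at startup with process.env.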

For the full values file structure, Helm chart layout, and how variables are wired through ArgoCD see GitOps (ArgoCD + Helm).

Workspace Provisioning Flow

When a new workspace is created, the control plane runs the following sequence via cross-account IAM:

1. Assume client IAM role (sts:AssumeRole)
2. Create Route53 hosted zone in client account
     zone: w-xxxx.{env.}sgp-1.cogrion.com
3. Write NS record in Cogrion Route53
     w-xxxx.{env.}sgp-1.cogrion.com → client nameservers
4. Request ACM certificate in client account
     domain: *.w-xxxx.{env.}sgp-1.cogrion.com
5. Write DNS validation CNAME in client Route53
6. Wait for ACM validation (async, ~2 min)
7. Write workspace record to tenants DB
     { workspace_id, region, delegated_zone, status: active }
8. Write entry to Cloudflare KV
     w-xxxx → { region: sgp-1, bff_url: bff.w-xxxx.sgp-1.cogrion.com }
9. Create Keycloak client in regional realm
     redirect_uris: https://*.w-xxxx.sgp-1.cogrion.com/*
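
The names used across steps 2 to 9 all derive from the workspace ID, environment, and region. A sketch of that derivation follows; the function and field names are hypothetical, and the AWS, Cloudflare, and Keycloak calls themselves are deliberately left out.

```typescript
interface ProvisioningNames {
  delegatedZone: string;       // hosted zone in the client account; also the NS record name (steps 2, 3)
  certificateDomain: string;   // ACM wildcard domain (step 4)
  bffUrl: string;              // value stored in the KV entry (step 8)
  keycloakRedirectUri: string; // redirect URI pattern (step 9)
}

function provisioningNames(workspaceId: string, env: string, region: string): ProvisioningNames {
  // prod omits the environment label; other environments insert it before the region.
  const envLabel = env === "prod" ? "" : `${env}.`;
  const delegatedZone = `w-${workspaceId}.${envLabel}${region}.cogrion.com`;
  return {
    delegatedZone,
    certificateDomain: `*.${delegatedZone}`,
    bffUrl: `bff.${delegatedZone}`,
    keycloakRedirectUri: `https://*.${delegatedZone}/*`,
  };
}
```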

BFF startup sequence (client cluster)

1. BFF pod starts
2. Pull config from control plane
     GET https://cplane.sgp-1.cogrion.com/api/workspaces/w-xxxx/config
     → Keycloak endpoints, feature flags, Temporal address, etc.
3. Register BFF with control plane
     POST https://cplane.sgp-1.cogrion.com/api/workspaces/w-xxxx/register
     { bff_url: "https://bff.w-xxxx.sgp-1.cogrion.com", public_ip: "..." }
4. Control plane updates workspace record, marks BFF as healthy

Adding a New Region

When a client requires EU data residency, spin up frankfurt-1. The high-level sequence is:

Terraform — copy envs/prod-sgp-1/ → envs/prod-frankfurt-1/, update tfvars and backend.tf, run apply. This provisions VPC, EKS, RDS, Route53 zone, S3 buckets, and IAM roles. See Infrastructure (Terraform) → Adding a New Region.
GitOps — copy argocd/apps/prod-sgp-1/ and values/prod-sgp-1/, update all domain and bucket values, bootstrap ArgoCD into the new cluster. See GitOps → Bootstrap: New Cluster.
DNS: add NS records in Cloudflare DNS for frankfurt-1.cogrion.com, delegating the subdomain to the new Route53 zone.
Routing — assign new EU tenants to frankfurt-1 at signup (Cloudflare KV lookup).

No application code changes. No Helm chart changes.

Local Development

Engineers run the full stack locally via Docker Compose. The local stack does not use regional domains.

Local:
  http://localhost:3000     control plane
  http://localhost:8080     Keycloak
  http://localhost:8088     Temporal UI
  http://localhost:3001     BFF (mock workspace)

For integration testing against the shared dev cluster, point BASE_DOMAIN at cplane.dev.sgp-1.cogrion.com in your local .env. Do not create real workspace records against dev without coordinating with the team: dev shares a single Keycloak realm and database.

Alternatives: Pure AWS Global Layer

The current design uses Cloudflare for three global components: DNS, the auth proxy worker, and the tenant→region KV lookup. This section documents AWS-native equivalents for each — relevant if Cloudflare is not available, not preferred, or needs to be exited incrementally. Each component can be replaced independently.

KV lookup (`cplane.cogrion.com/lookup`)

This is the highest-priority replacement target: it sits on the auth path and is written to on every workspace provision.

Option 1: API Gateway + Lambda + DynamoDB (recommended migration bridge)

Route53 alias → API Gateway → Lambda (lookup handler) → DynamoDB (tenant-region-map)

DynamoDB table: tenant_id (PK) → { region, bff_url }. Same shape as CF KV.
Write path: control plane writes to DynamoDB at provision time instead of CF KV (step 8 in the provisioning flow).
Read path: Lambda returns the same JSON shape the CF Worker returns today — no client changes needed.
Latency: ~2–5 ms DynamoDB read vs sub-millisecond CF KV edge read. Acceptable for a login-path lookup that is cached by the caller.
DNS: cplane.cogrion.com Route53 alias pointing to the API Gateway regional endpoint, or a CloudFront distribution in front of it for caching.
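
A sketch of the Lambda lookup handler for Option 1, with the DynamoDB read injected as a callback so the example stays self-contained. The response body shape is an assumption (it mirrors the KV value from the provisioning flow, since the CF Worker's actual response is not shown here), and the function names and status codes are hypothetical.

```typescript
// Item stored per tenant, same shape as the CF KV value.
interface TenantRegionItem {
  region: string;
  bff_url: string;
}

// Subset of the API Gateway proxy response format used here.
interface ApiGatewayResponse {
  statusCode: number;
  headers: Record<string, string>;
  body: string;
}

const JSON_HEADERS: Record<string, string> = { "content-type": "application/json" };

function toLookupResponse(item: TenantRegionItem | undefined): ApiGatewayResponse {
  if (item === undefined) {
    return { statusCode: 404, headers: JSON_HEADERS, body: JSON.stringify({ error: "tenant not found" }) };
  }
  return {
    statusCode: 200,
    headers: JSON_HEADERS,
    body: JSON.stringify({ region: item.region, bff_url: item.bff_url }),
  };
}

// getItem stands in for a DynamoDB read keyed on tenant_id.
async function lookupHandler(
  tenantId: string | undefined,
  getItem: (tenantId: string) => Promise<TenantRegionItem | undefined>
): Promise<ApiGatewayResponse> {
  if (tenantId === undefined || tenantId === "") {
    return { statusCode: 400, headers: JSON_HEADERS, body: JSON.stringify({ error: "tenant id is required" }) };
  }
  return toLookupResponse(await getItem(tenantId));
}
```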

Option 2: CloudFront + Lambda@Edge + DynamoDB Global Tables

Closer to the CF Worker + KV model in behavior (edge compute + edge data). Use this if lookup latency is a hard constraint.

CloudFront (cplane.cogrion.com) → Lambda@Edge (origin-request) → DynamoDB Global Table

DynamoDB Global Tables replicate the tenant-region-map to each AWS region you select, so Lambda@Edge can read from a replica close to where it executes.
Adds operational complexity (Lambda@Edge cold starts, Global Table replication lag on writes).
Not needed unless you have measured lookup latency as a problem with Option 1.

Auth proxy (`auth.cogrion.com`)

The CF Worker only fires on OIDC login redirects — it is not on the per-request hot path. Latency here is less critical than for the KV lookup.

Option 1: API Gateway + Lambda (simplest)

Route53 alias → API Gateway → Lambda (proxy handler → auth.{region}.cogrion.com)

Lambda reads tenant→region from the same DynamoDB table, constructs the target Keycloak URL, and proxies the request with http-proxy or node-fetch.
Stateless, no edge deployment required.
Cold starts add ~100–200 ms to the first request in a login flow; provisioned concurrency removes this if needed.

Option 2: CloudFront + Lambda@Edge

CloudFront (auth.cogrion.com) → Lambda@Edge (origin-request) → regional Keycloak origin

Same lookup store as above. Lambda@Edge rewrites the origin host to auth.{region}.cogrion.com before CloudFront forwards.
Use this if you want consistent edge behavior for both auth. and cplane. subdomains, or if you have strict latency requirements for login redirects.
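
The origin rewrite in Option 2 could be sketched as below. CloudFront only lets a function change the origin in an origin-request trigger, so the example assumes that event type. The types are a minimal local subset of the CloudFront request object, and the function name is hypothetical; the tenant-to-region lookup is assumed to have happened already.

```typescript
// Minimal subset of the CloudFront origin-request event's request object.
interface CfHeader {
  key?: string;
  value: string;
}

interface CfRequest {
  uri: string;
  headers: Record<string, CfHeader[]>;
  origin?: { custom?: { domainName: string; [field: string]: unknown } };
}

// Points the request at the regional Keycloak and keeps the Host header in sync.
function routeToRegionalKeycloak(request: CfRequest, region: string): CfRequest {
  if (request.origin === undefined || request.origin.custom === undefined) {
    throw new Error("expected a custom origin on an origin-request event");
  }
  const target = `auth.${region}.cogrion.com`;
  request.origin.custom.domainName = target;
  // The Host header must match the new origin, otherwise the origin may reject the request.
  request.headers["host"] = [{ key: "Host", value: target }];
  return request;
}
```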

DNS and static UI

Component	Cloudflare	AWS equivalent
`cogrion.com` zone	Cloudflare DNS	Route53 hosted zone (transfer registrar NS)
`app.cogrion.com`	Cloudflare Pages	S3 + CloudFront distribution
NS delegation records	Cloudflare DNS records	Route53 records in `cogrion.com` zone

DNS transfer is the last step — it must happen after all worker/page replacements are live, so the cutover is a single NS change at the registrar.

Incremental migration path

Phases 1–3 are independent and can be done in any order. Phase 4 (DNS transfer) must come last. All four are required to fully exit Cloudflare.

Phase	Change	Cloudflare still needed?
1	Deploy API GW + Lambda + DynamoDB for `cplane.cogrion.com/lookup`. Update provisioning code to write DynamoDB instead of CF KV.	Yes (DNS only)
2	Deploy API GW + Lambda for `auth.cogrion.com`.	Yes (DNS only)
3	Deploy S3 + CloudFront for `app.cogrion.com`.	Yes (DNS only)
4	Transfer `cogrion.com` zone to Route53. Update registrar NS. Decommission Cloudflare account.	No

Phases 1–3 each leave Cloudflare managing DNS while replacing the compute. Phase 4 is a single registrar NS change; once propagated, Cloudflare is fully removed.

Trade-off summary

Concern	Cloudflare (current)	Pure AWS
KV lookup latency	Sub-ms, 300 PoPs	~2–5 ms DynamoDB read; Global Tables + Lambda@Edge shorten the network path to the data
Auth proxy	CF Worker (edge)	API GW + Lambda or CloudFront + Lambda@Edge
Static UI	CF Pages (edge CDN, zero config)	S3 + CloudFront (more config, same capability)
DNS	Cloudflare DNS	Route53 (equivalent TTL propagation)
Vendor surface	Cloudflare for global layer	AWS-only — aligns with existing BYOC AWS posture
Ops complexity	Low — CF Workers/Pages are simple to deploy	Higher — Lambda@Edge, DynamoDB Global Tables, CloudFront configs
Cost	CF Workers free tier is generous at current scale	API GW + Lambda + DynamoDB costs scale with requests; low at current scale
Migration risk	None — current design	Low per phase; Phase 4 (DNS transfer) is the only risky step

Recommendation: If Cloudflare is acceptable, the current design is operationally simpler. Phase 1 (KV lookup → DynamoDB) is the best entry point if an incremental migration is required — it is the most isolated component, has no user-visible surface, and unblocks full AWS alignment. Defer Phase 4 (DNS transfer) until Phases 1–3 are stable in production.

Summary

One Helm chart, multiple values files — one per environment per region
Region is always in the domain: {service}.{env?}.{region}.cogrion.com
Client workspaces get a delegated zone: *.w-{id}.{env?}.{region}.cogrion.com
TLS for Cogrion services: cert-manager in cluster
TLS for client workspaces: ACM in client account, auto-renewed
Database is per region, never replicated — tenant data stays in its region
S3 buckets are per region per environment, never shared
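The only shared global state at the edge: Cloudflare DNS zone + KV lookup (tenant → region); the tenants and principals tables in sgp-1 are the only cross-region database state
Adding a region = new values file + new cluster + new Route53 zone (delegated from Cloudflare DNS), with no application code changes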
