Infrastructure as Code (Terraform)
Infrastructure is managed via Terraform in a dedicated cogrion-terraform repository. It provisions the AWS-level primitives that the Kubernetes/Helm layer (in cogrion-infra) runs on top of.
| Layer | Repo | Tool |
|---|---|---|
| AWS infrastructure | cogrion-terraform | Terraform |
| Kubernetes workloads | cogrion-infra | Helm + ArgoCD |
| Application code | cogrion-app | GitHub Actions → GHCR |
Terraform does not manage anything inside Kubernetes. ArgoCD does not manage anything in AWS. The boundary is clean: Terraform outputs a cluster endpoint and IAM roles; ArgoCD takes it from there. For the Kubernetes and deployment layer see GitOps (ArgoCD + Helm).
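As a rough illustration of that handoff, a root module's outputs.tf might expose values along these lines. The output names and the module attributes they read (cluster_endpoint, irsa_role_arns) are assumptions; the source does not show what the modules actually export.

```hcl
# envs/prod-sgp-1/outputs.tf (sketch; module attribute names are hypothetical)

output "cluster_endpoint" {
  description = "EKS API server endpoint handed to the GitOps layer"
  value       = module.cluster.cluster_endpoint # assumed module output
}

output "irsa_role_arns" {
  description = "Map of service account name to IAM role ARN, for use in Helm values"
  value       = module.iam.irsa_role_arns # assumed module output
}
```

Nothing in this file references Kubernetes objects; the outputs are the only surface the Kubernetes layer needs to read.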
Repository: cogrion-terraform
cogrion-terraform/
├── modules/                      # reusable, no provider or backend config
│   ├── cluster/                  # EKS cluster + node groups
│   ├── networking/               # VPC, subnets, NAT, security groups
│   ├── dns/                      # Route53 zones and delegation records
│   ├── database/                 # RDS, parameter groups, subnet groups
│   ├── storage/                  # S3 buckets, bucket policies
│   ├── iam/                      # roles, policies, IRSA
│   ├── acm/                      # certificates for Cogrion-owned domains
│   └── workspace-provisioner/    # per-workspace: Route53 delegation + ACM
│
├── envs/                         # root modules, one per cluster
│   ├── dev-sgp-1/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   ├── outputs.tf
│   │   ├── terraform.tfvars      # non-secret values
│   │   └── backend.tf
│   └── prod-sgp-1/
│       ├── main.tf
│       ├── variables.tf
│       ├── outputs.tf
│       ├── terraform.tfvars
│       └── backend.tf
│
├── cicd/                         # CodePipeline + CodeBuild definitions
│   ├── pipeline.tf
│   ├── codebuild.tf
│   └── iam.tf
│
└── global/                       # resources that exist once across all regions
    ├── main.tf                   # Cloudflare DNS zone, CF Worker, KV namespace
    ├── state-backend/            # S3 bucket + DynamoDB for Terraform state itself
    │   └── main.tf
    └── cicd-bootstrap/           # CodePipeline IAM + S3 artifact bucket
        └── main.tf
AWS Account Structure
Each deployment tier gets its own AWS account. Accounts are never shared across environments or between Cogrion infrastructure and client workspaces.
| Account | Purpose | Example alias |
|---|---|---|
| cogrion-dev-sgp-1 | Cogrion dev/staging cluster, sgp-1 | cogrion-dev-sg1 |
| cogrion-dev-fra-1 | Cogrion dev/staging cluster, fra-1 (add when needed) | cogrion-dev-fra1 |
| cogrion-prod-sgp-1 | Cogrion production cluster, sgp-1 | cogrion-prod-sg1 |
| cogrion-prod-fra-1 | Cogrion production cluster, fra-1 | cogrion-prod-fra1 |
| cogrion-global | Shared global resources: Terraform state, CI/CD pipelines, ECR | cogrion-global |
| client-{workspace-id} | Client BYOC account, one per workspace, owned by the client | acme-prod |
Why not one prod account spanning multiple regions?
The main reason is data residency. IAM is global per account — an IAM role or policy in a single account can access resources in any region unless you add explicit aws:RequestedRegion conditions to every single policy. One misconfigured policy and a Singapore operator can read Frankfurt data. Separate accounts make that structurally impossible: Singapore credentials don't exist in the Frankfurt account.
The secondary reasons:
- Regulatory compliance. GDPR and similar frameworks often require that EU data not be accessible from accounts that also hold APAC resources, even at the admin level. Auditors want a hard boundary, not a policy condition.
- Blast radius. A terraform destroy gone wrong, or an accidental S3 bucket policy change, is contained to one region's account.
- Account-level services are global. CloudTrail, AWS Config, IAM, and Cost Explorer all operate at the account level. A shared account means Frankfurt audit logs sit alongside Singapore audit logs in the same trail, complicating data residency reporting.
The trade-off is slightly more accounts to manage. In practice it's low overhead — accounts are cheap, and the separation pays for itself the first time a customer asks "can you prove our EU data never left the EU?"
Naming pattern: cogrion-{environment}-{region_id} for Cogrion-owned accounts. The cogrion-global account is the exception — it holds resources that exist once (S3 state bucket, DynamoDB lock table, CodePipeline, ECR registry).
Terraform state lives in cogrion-global. Each regional env assumes a cross-account role into its own account to apply. Client accounts are never touched by Cogrion Terraform directly — only by the workspace-provisioner module via sts:AssumeRole at runtime.
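The cross-account apply described above could be wired through the provider's assume_role block. This is a minimal sketch, not the repository's actual configuration: the target_account_id variable and the role name cogrion-terraform-apply are placeholders introduced for illustration.

```hcl
# Sketch: provider block for a regional env, run with cogrion-global credentials.
# var.region, var.environment and var.region_id are the variables declared in variables.tf.

variable "target_account_id" {
  type        = string
  description = "Account ID of the regional account, e.g. cogrion-prod-sgp-1 (hypothetical variable)"
}

provider "aws" {
  region = var.region

  # Resources are created in the regional account, not in cogrion-global.
  assume_role {
    role_arn     = "arn:aws:iam::${var.target_account_id}:role/cogrion-terraform-apply" # hypothetical role name
    session_name = "terraform-${var.environment}-${var.region_id}"
  }
}
```

The S3 backend is configured separately from the provider, so under this arrangement state reads and writes would still use the caller's cogrion-global credentials while resources land in the regional account.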
State Backend
Each environment gets its own state file. Never share state between environments or regions.
Backend configuration
envs/prod-sgp-1/backend.tf
terraform {
backend "s3" {
bucket = "cogrion-terraform-state"
key = "prod/sgp-1/terraform.tfstate"
region = "ap-southeast-1"
dynamodb_table = "cogrion-terraform-locks"
encrypt = true
}
}
envs/dev-sgp-1/backend.tf
terraform {
backend "s3" {
bucket = "cogrion-terraform-state"
key = "dev/sgp-1/terraform.tfstate"
region = "ap-southeast-1"
dynamodb_table = "cogrion-terraform-locks"
encrypt = true
}
}
State key convention: {env}/{region}/terraform.tfstate
Bootstrap (one-time, manual)
The state bucket and lock table cannot manage themselves. Bootstrap them once:
cd global/state-backend
terraform init -backend=false
terraform apply
# Creates: cogrion-terraform-state (S3) + cogrion-terraform-locks (DynamoDB)
After this, all other environments use S3 backend normally.
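For reference, global/state-backend/main.tf might look roughly like the following. The bucket and table names come from the backend configuration above; the versioning, encryption algorithm, and billing mode are assumptions about reasonable defaults rather than confirmed settings.

```hcl
# global/state-backend/main.tf (sketch; applied once with local state)

provider "aws" {
  region = "ap-southeast-1"
}

resource "aws_s3_bucket" "state" {
  bucket = "cogrion-terraform-state"
}

# Versioning allows recovery of an earlier state file after a bad apply.
resource "aws_s3_bucket_versioning" "state" {
  bucket = aws_s3_bucket.state.id

  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_server_side_encryption_configuration" "state" {
  bucket = aws_s3_bucket.state.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "aws:kms" # assumption; AES256 would also work
    }
  }
}

resource "aws_s3_bucket_public_access_block" "state" {
  bucket                  = aws_s3_bucket.state.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

# The S3 backend expects a string partition key named exactly "LockID".
resource "aws_dynamodb_table" "locks" {
  name         = "cogrion-terraform-locks"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
}
```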
Variables
Pattern
Each environment root module (envs/{env}-{region}/) has:
- variables.tf: variable declarations with descriptions and types
- terraform.tfvars: non-secret values committed to git
- Secrets: pulled from AWS Secrets Manager at apply time, never in tfvars
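Reading a secret at apply time can be done with the aws_secretsmanager_secret_version data source. The sketch below assumes a JSON-encoded secret; the secret name and the api_token key are hypothetical.

```hcl
# Sketch: look up a secret during plan/apply instead of committing it to tfvars.

data "aws_secretsmanager_secret_version" "example" {
  # Hypothetical secret name following the environment/region convention.
  secret_id = "cogrion/${var.environment}/${var.region_id}/example"
}

locals {
  # Assumes the secret value is a JSON object such as {"api_token": "..."}.
  example_secret = jsondecode(data.aws_secretsmanager_secret_version.example.secret_string)
}

output "example_token" {
  value     = local.example_secret["api_token"]
  sensitive = true # keeps the value out of CLI output
}
```

Values read through a data source are still recorded in the state file, which is one reason the state bucket has encryption enabled.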
variables.tf (shared pattern across all envs)
variable "region" {
type = string
description = "AWS region for this deployment"
}
variable "environment" {
type = string
description = "Environment name: dev or prod"
validation {
condition = contains(["dev", "prod"], var.environment)
error_message = "Must be dev or prod."
}
}
variable "region_id" {
type = string
description = "Cogrion region identifier e.g. sgp-1, fra-1"
}
variable "base_domain" {
type = string
description = "Control plane domain e.g. cplane.sgp-1.cogrion.com"
}
variable "workspace_domain" {
type = string
description = "Workspace subdomain suffix e.g. sgp-1.cogrion.com"
}
variable "eks_node_instance_type" {
type = string
default = "t3.medium"
}
variable "eks_node_min" {
type = number
default = 1
}
variable "eks_node_max" {
type = number
default = 5
}
variable "eks_node_desired" {
type = number
default = 2
}
variable "db_instance_class" {
type = string
default = "db.t3.medium"
}
variable "db_name" {
type = string
default = "cogrion"
}
variable "artifact_bucket" {
type = string
description = "S3 bucket for artifacts e.g. cogrion-prod-sgp-1-artifacts"
}
variable "exports_bucket" {
type = string
description = "S3 bucket for exports e.g. cogrion-prod-sgp-1-exports"
}
terraform.tfvars — prod-sgp-1
region = "ap-southeast-1"
environment = "prod"
region_id = "sgp-1"
base_domain = "cplane.sgp-1.cogrion.com"
workspace_domain = "sgp-1.cogrion.com"
eks_node_instance_type = "t3.large"
eks_node_min = 2
eks_node_max = 10
eks_node_desired = 3
db_instance_class = "db.t3.large"
db_name = "cogrion"
artifact_bucket = "cogrion-prod-sgp-1-artifacts"
exports_bucket = "cogrion-prod-sgp-1-exports"
terraform.tfvars — dev-sgp-1
region = "ap-southeast-1"
environment = "dev"
region_id = "sgp-1"
base_domain = "cplane.dev.sgp-1.cogrion.com"
workspace_domain = "dev.sgp-1.cogrion.com"
eks_node_instance_type = "t3.medium"
eks_node_min = 1
eks_node_max = 3
eks_node_desired = 1
db_instance_class = "db.t3.small"
db_name = "cogrion"
artifact_bucket = "cogrion-dev-sgp-1-artifacts"
exports_bucket = "cogrion-dev-sgp-1-exports"
Modules
Each concern (networking, cluster, DNS, database, storage, IAM) is its own module rather than a flat pile of .tf files for two reasons:
- Reuse without copy-paste. dev-sgp-1 and prod-sgp-1 call the same modules with different variables. A fix or addition lands once and both environments pick it up on the next apply.
- Blast radius. A change to modules/storage can only affect storage resources. A flat file structure makes it easy to accidentally couple unrelated resources, and Terraform's dependency graph becomes harder to reason about as the file count grows.
Modules contain no provider or backend configuration — those live in envs/ only. This keeps modules portable and testable in isolation.
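In practice this usually means a module declares which providers it needs without configuring them. A possible versions.tf for one module is sketched below; the file name and the exact version constraints are assumptions.

```hcl
# modules/storage/versions.tf (sketch)

terraform {
  required_version = ">= 1.7"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = ">= 5.0" # a minimum only; the root module pins the exact range
    }
  }
}

# Deliberately absent: provider "aws" { ... } and backend "s3" { ... }.
# The module inherits the provider configured by whichever env calls it.
```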
modules/networking
Provisions VPC, public/private subnets across AZs, NAT gateway, internet gateway, security groups.
module "networking" {
source = "../../modules/networking"
environment = var.environment
region_id = var.region_id
vpc_cidr = "10.0.0.0/16"
}
modules/cluster
EKS cluster, managed node groups, OIDC provider for IRSA, aws-load-balancer-controller IAM role.
module "cluster" {
source = "../../modules/cluster"
environment = var.environment
region_id = var.region_id
vpc_id = module.networking.vpc_id
private_subnets = module.networking.private_subnet_ids
node_instance_type = var.eks_node_instance_type
node_min = var.eks_node_min
node_max = var.eks_node_max
node_desired = var.eks_node_desired
}
modules/dns
Route53 hosted zone for the regional Cogrion domain, ACM wildcard cert.
module "dns" {
source = "../../modules/dns"
environment = var.environment
region_id = var.region_id
base_domain = var.base_domain
workspace_domain = var.workspace_domain
# e.g. creates zone: sgp-1.cogrion.com
# creates cert: *.sgp-1.cogrion.com
}
modules/database
RDS Postgres, subnet group, parameter group, Secrets Manager entry for the connection string.
module "database" {
source = "../../modules/database"
environment = var.environment
region_id = var.region_id
vpc_id = module.networking.vpc_id
private_subnets = module.networking.private_subnet_ids
db_name = var.db_name
instance_class = var.db_instance_class
}
Outputs the Secrets Manager ARN for the connection string. The Helm chart reads it via External Secrets Operator — Terraform never passes the password to Kubernetes directly.
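A minimal sketch of that output, assuming the module creates the secret itself. The resource name, secret name, and output name are hypothetical.

```hcl
# modules/database (sketch; only the secret and its output are shown)

variable "environment" { type = string }
variable "region_id" { type = string }

resource "aws_secretsmanager_secret" "db_connection" {
  name = "cogrion-${var.environment}-${var.region_id}-db-connection" # hypothetical name
}

output "connection_secret_arn" {
  description = "ARN of the secret holding the connection string"
  value       = aws_secretsmanager_secret.db_connection.arn
}
```

Only the ARN crosses the module boundary, so the password itself never appears in an output or in Helm values.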
modules/storage
S3 buckets with versioning, encryption, lifecycle rules, and bucket policies.
module "storage" {
source = "../../modules/storage"
environment = var.environment
region_id = var.region_id
artifact_bucket = var.artifact_bucket
exports_bucket = var.exports_bucket
}
modules/iam
IRSA roles for control plane pods, Temporal workers, External Secrets Operator, aws-load-balancer-controller.
module "iam" {
source = "../../modules/iam"
environment = var.environment
region_id = var.region_id
oidc_provider = module.cluster.oidc_provider
artifact_bucket = var.artifact_bucket
exports_bucket = var.exports_bucket
}
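An IRSA role is an IAM role whose trust policy allows a specific Kubernetes service account to assume it through the cluster's OIDC provider. The sketch below shows one such role for External Secrets Operator. The variable names, the namespace and service account (external-secrets/external-secrets), and the role name are assumptions; the shape of the module's real oidc_provider input is not shown in the source.

```hcl
# modules/iam (sketch of a single IRSA role)

variable "environment" { type = string }
variable "region_id" { type = string }
variable "oidc_provider_arn" { type = string } # hypothetical input
variable "oidc_provider_url" { type = string } # issuer URL without "https://"

data "aws_iam_policy_document" "eso_trust" {
  statement {
    effect  = "Allow"
    actions = ["sts:AssumeRoleWithWebIdentity"]

    principals {
      type        = "Federated"
      identifiers = [var.oidc_provider_arn]
    }

    # Restrict the role to one service account in one namespace.
    condition {
      test     = "StringEquals"
      variable = "${var.oidc_provider_url}:sub"
      values   = ["system:serviceaccount:external-secrets:external-secrets"]
    }

    condition {
      test     = "StringEquals"
      variable = "${var.oidc_provider_url}:aud"
      values   = ["sts.amazonaws.com"]
    }
  }
}

resource "aws_iam_role" "eso" {
  name               = "cogrion-${var.environment}-${var.region_id}-eso"
  assume_role_policy = data.aws_iam_policy_document.eso_trust.json
}
```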
modules/workspace-provisioner
Called by the control plane at runtime (not at cluster bootstrap time) for each new workspace. Creates Route53 delegation NS record and requests ACM cert in the client account via cross-account IAM.
This module is invoked by the control plane application code using the Terraform SDK or via a CodeBuild project triggered by the provisioning workflow — not as part of the cluster bootstrap apply.
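One way the cross-account part could be wired is with an aliased provider that assumes a role in the client account and is passed into the module. This is a sketch under several assumptions: the calling configuration's location, the input names (workspace_id, client_role_arn), and the aws.client alias are all hypothetical, and the module would need to declare configuration_aliases = [aws.client] in its required_providers for the providers map to be accepted.

```hcl
# Sketch: a per-workspace root configuration that calls workspace-provisioner.

variable "region" { type = string }
variable "workspace_id" { type = string }    # hypothetical input
variable "client_role_arn" { type = string } # role the client created for Cogrion

# Default provider: the Cogrion regional account.
provider "aws" {
  region = var.region
}

# Aliased provider: the client's BYOC account, reached via sts:AssumeRole.
provider "aws" {
  alias  = "client"
  region = var.region

  assume_role {
    role_arn     = var.client_role_arn
    session_name = "cogrion-workspace-${var.workspace_id}"
  }
}

module "workspace" {
  source = "../../modules/workspace-provisioner" # path is an assumption

  providers = {
    aws        = aws
    aws.client = aws.client
  }

  workspace_id = var.workspace_id
}
```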
Root Module
(envs/prod-sgp-1/main.tf)
terraform {
required_version = ">= 1.7"
required_providers {
aws = {
source = "hashicorp/aws"
version = "~> 5.0"
}
}
}
provider "aws" {
region = var.region
default_tags {
tags = {
Environment = var.environment
Region = var.region_id
ManagedBy = "terraform"
Repo = "cogrion-terraform"
}
}
}
module "networking" {
source = "../../modules/networking"
environment = var.environment
region_id = var.region_id
vpc_cidr = "10.0.0.0/16"
}
module "cluster" {
source = "../../modules/cluster"
environment = var.environment
region_id = var.region_id
vpc_id = module.networking.vpc_id
private_subnets = module.networking.private_subnet_ids
node_instance_type = var.eks_node_instance_type
node_min = var.eks_node_min
node_max = var.eks_node_max
node_desired = var.eks_node_desired
}
module "dns" {
source = "../../modules/dns"
environment = var.environment
region_id = var.region_id
base_domain = var.base_domain
workspace_domain = var.workspace_domain
}
module "database" {
source = "../../modules/database"
environment = var.environment
region_id = var.region_id
vpc_id = module.networking.vpc_id
private_subnets = module.networking.private_subnet_ids
db_name = var.db_name
instance_class = var.db_instance_class
}
module "storage" {
source = "../../modules/storage"
environment = var.environment
region_id = var.region_id
artifact_bucket = var.artifact_bucket
exports_bucket = var.exports_bucket
}
module "iam" {
source = "../../modules/iam"
environment = var.environment
region_id = var.region_id
oidc_provider = module.cluster.oidc_provider
artifact_bucket = var.artifact_bucket
exports_bucket = var.exports_bucket
}
Automation: CodePipeline + CodeBuild
Architecture
GitHub (cogrion-terraform)
└── push to main
└── CodePipeline triggered (via CodeStar connection)
├── Stage: Source — pull repo
├── Stage: Plan — terraform plan, post summary
├── Stage: Approval — manual gate (prod only)
└── Stage: Apply — terraform apply
Dev and prod have separate pipelines. Dev has no manual approval gate. Prod requires explicit approval before apply.
CodeBuild project (shared buildspec)
cicd/buildspec.yml
version: 0.2
env:
  variables:
    TF_VERSION: "1.7.5"
  parameter-store:
    AWS_ACCOUNT_ID: /cogrion/cicd/aws-account-id
phases:
  install:
    commands:
      - curl -Lo terraform.zip https://releases.hashicorp.com/terraform/${TF_VERSION}/terraform_${TF_VERSION}_linux_amd64.zip
      - unzip terraform.zip && mv terraform /usr/local/bin/
      - terraform version
  pre_build:
    commands:
      - cd envs/${ENV_NAME}
      - terraform init -input=false
  build:
    commands:
      - |
        if [ "$ACTION" = "plan" ]; then
          terraform plan -var-file=terraform.tfvars -out=tfplan -input=false
          terraform show -no-color tfplan > plan-output.txt
          cat plan-output.txt
        fi
      - |
        if [ "$ACTION" = "apply" ]; then
          terraform apply -auto-approve -var-file=terraform.tfvars -input=false
        fi
artifacts:
  files:
    - envs/${ENV_NAME}/tfplan
    - envs/${ENV_NAME}/plan-output.txt
Environment variables ENV_NAME (e.g. prod-sgp-1) and ACTION (plan or apply) are supplied by each pipeline action, which overrides the environment of the shared CodeBuild project.
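The shared project referenced by the pipeline as aws_codebuild_project.terraform could be defined roughly as follows. The project name, compute size, build image, and the aws_iam_role.codebuild reference are assumptions; only the resource address and the buildspec path come from this document.

```hcl
# cicd/codebuild.tf (sketch)

resource "aws_codebuild_project" "terraform" {
  name         = "cogrion-terraform"        # hypothetical name
  service_role = aws_iam_role.codebuild.arn # assumed to be defined in cicd/iam.tf

  # Source and artifacts are both handed over by CodePipeline.
  artifacts {
    type = "CODEPIPELINE"
  }

  environment {
    compute_type = "BUILD_GENERAL1_SMALL"
    image        = "aws/codebuild/standard:7.0"
    type         = "LINUX_CONTAINER"

    # Safe defaults; a pipeline action's EnvironmentVariables take precedence.
    environment_variable {
      name  = "ENV_NAME"
      value = "dev-sgp-1"
    }

    environment_variable {
      name  = "ACTION"
      value = "plan"
    }
  }

  source {
    type      = "CODEPIPELINE"
    buildspec = "cicd/buildspec.yml"
  }
}
```

Defaulting ACTION to plan means a build started by hand, outside the pipeline, cannot apply anything unless the variable is overridden explicitly.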
CodePipeline definition (prod)
cicd/pipeline.tf
resource "aws_codepipeline" "prod_singapore_1" {
name = "cogrion-terraform-prod-sgp-1"
role_arn = aws_iam_role.codepipeline.arn
artifact_store {
location = aws_s3_bucket.pipeline_artifacts.bucket
type = "S3"
}
stage {
name = "Source"
action {
name = "GitHub"
category = "Source"
owner = "AWS"
provider = "CodeStarSourceConnection"
version = "1"
output_artifacts = ["source"]
configuration = {
ConnectionArn = aws_codestarconnections_connection.github.arn
FullRepositoryId = "cogrion/cogrion-terraform"
BranchName = "main"
}
}
}
stage {
name = "Plan"
action {
name = "Plan"
category = "Build"
owner = "AWS"
provider = "CodeBuild"
version = "1"
input_artifacts = ["source"]
output_artifacts = ["plan"]
configuration = {
ProjectName = aws_codebuild_project.terraform.name
EnvironmentVariables = jsonencode([
{ name = "ENV_NAME", value = "prod-sgp-1" },
{ name = "ACTION", value = "plan" }
])
}
}
}
stage {
name = "Approval"
action {
name = "ApproveApply"
category = "Approval"
owner = "AWS"
provider = "Manual"
version = "1"
configuration = {
NotificationArn = aws_sns_topic.approvals.arn
CustomData = "Review the plan output in CodeBuild logs before approving."
}
}
}
stage {
name = "Apply"
action {
name = "Apply"
category = "Build"
owner = "AWS"
provider = "CodeBuild"
version = "1"
input_artifacts = ["source"]
configuration = {
ProjectName = aws_codebuild_project.terraform.name
EnvironmentVariables = jsonencode([
{ name = "ENV_NAME", value = "prod-sgp-1" },
{ name = "ACTION", value = "apply" }
])
}
}
}
}
Dev pipeline is identical minus the Approval stage.
CodeBuild IAM role
The CodeBuild execution role needs:
# Terraform state access
s3:GetObject, s3:PutObject, s3:DeleteObject → cogrion-terraform-state/*
dynamodb:GetItem, dynamodb:PutItem, ... → cogrion-terraform-locks
# Resources Terraform manages
eks:*, ec2:*, rds:*, route53:*, acm:*,
iam:*, s3:*, secretsmanager:*, cloudwatch:*
# Scoped to Cogrion-tagged resources where possible
Use a condition on aws:RequestedRegion to restrict to ap-southeast-1 for the sgp-1 pipeline. Frankfurt pipeline gets its own role scoped to eu-central-1.
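A sketch of the region condition for the sgp-1 role follows. It splits the permissions into two statements because IAM and Route 53 are global services whose API calls are served from us-east-1, so applying the same aws:RequestedRegion condition to them would likely block those calls. The policy name and statement IDs are hypothetical, and the wildcard actions simply mirror the list above rather than a recommended least-privilege set.

```hcl
# cicd/iam.tf (sketch: region-scoped permissions for the sgp-1 pipeline)

data "aws_iam_policy_document" "terraform_sgp_1" {
  statement {
    sid       = "RegionalServices"
    effect    = "Allow"
    actions   = ["eks:*", "ec2:*", "rds:*", "acm:*", "s3:*", "secretsmanager:*", "cloudwatch:*"]
    resources = ["*"]

    # Requests to any other region are not matched by this Allow.
    condition {
      test     = "StringEquals"
      variable = "aws:RequestedRegion"
      values   = ["ap-southeast-1"]
    }
  }

  statement {
    sid       = "GlobalServices"
    effect    = "Allow"
    actions   = ["iam:*", "route53:*"]
    resources = ["*"]
  }
}

resource "aws_iam_policy" "terraform_sgp_1" {
  name   = "cogrion-terraform-sgp-1" # hypothetical name
  policy = data.aws_iam_policy_document.terraform_sgp_1.json
}
```

The Frankfurt role would use the same document with eu-central-1 in the condition.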
Adding a New Region (prod-fra-1)
This covers the Terraform layer only. Once terraform apply completes, continue with the GitOps bootstrap in GitOps → Bootstrap: New Cluster.
# 1. Create env directory
cp -r envs/prod-sgp-1 envs/prod-fra-1
# 2. Update backend.tf
# key: "prod/fra-1/terraform.tfstate"
# region: leave as "ap-southeast-1" (this is the region of the shared state bucket, not the deployment region)
# 3. Update terraform.tfvars
# region = "eu-central-1"
# region_id = "fra-1"
# base_domain = "cplane.fra-1.cogrion.com"
# workspace_domain = "fra-1.cogrion.com"
# artifact_bucket = "cogrion-prod-fra-1-artifacts"
# exports_bucket = "cogrion-prod-fra-1-exports"
# 4. Update provider region in main.tf if hardcoded,
# or pass var.region (recommended)
# 5. Add CodePipeline for prod-fra-1 in cicd/pipeline.tf
# 6. Commit, push — pipeline runs plan, team approves, apply runs
No module changes. No variable additions. Values only.
CLI Workflow (current, until pipeline is live)
cd envs/prod-sgp-1
# First time
terraform init
# Preview
terraform plan -var-file=terraform.tfvars -out=tfplan
# Apply
terraform apply tfplan
# Targeted apply (use sparingly)
terraform apply -target=module.dns -var-file=terraform.tfvars
Always run plan before apply. Always use -var-file=terraform.tfvars — never pass -var flags inline as they are not tracked.
State is in S3 so multiple engineers share the same state automatically. DynamoDB prevents concurrent applies.
Conventions
Naming
All AWS resources follow: cogrion-{environment}-{region_id}-{purpose}
cogrion-prod-sgp-1-cluster
cogrion-prod-sgp-1-vpc
cogrion-prod-sgp-1-db
cogrion-prod-sgp-1-artifacts
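The name pattern is a convention that each module applies when it names resources. Tagging is enforced separately via default_tags on the AWS provider: every resource gets Environment, Region, ManagedBy, and Repo tags automatically.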
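Inside a module, the name pattern can be built once as a local and reused. This is a minimal sketch; the name_prefix local and the vpc_cidr variable default are illustrative rather than taken from the repository.

```hcl
# Sketch: deriving resource names from the shared prefix.

variable "environment" { type = string }
variable "region_id" { type = string }

variable "vpc_cidr" {
  type    = string
  default = "10.0.0.0/16"
}

locals {
  # environment = "prod", region_id = "sgp-1" gives "cogrion-prod-sgp-1"
  name_prefix = "cogrion-${var.environment}-${var.region_id}"
}

resource "aws_vpc" "this" {
  cidr_block = var.vpc_cidr

  tags = {
    Name = "${local.name_prefix}-vpc" # cogrion-prod-sgp-1-vpc for the values above
  }
}
```

The Name tag set here is merged with the provider's default_tags, so the resource carries both the conventional name and the ownership tags.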
What Terraform owns vs ArgoCD owns
| Terraform | ArgoCD |
|---|---|
| VPC, subnets, NAT | Kubernetes Deployments |
| EKS cluster + node groups | Helm releases |
| RDS instance | ConfigMaps, Secrets (via ESO) |
| S3 buckets | Ingress objects |
| Route53 zones + records | cert-manager certificates |
| ACM certificates | ArgoCD Applications |
| IAM roles + policies | — |
| CodePipeline + CodeBuild | — |
Never manage Kubernetes resources from Terraform and ArgoCD simultaneously. Pick one owner per resource type and stick to it.
Summary
- One envs/{env}-{region}/ root module per cluster; variables drive all differences
- Shared reusable modules in modules/, with no environment logic inside them
- State isolated per environment per region: {env}/{region}/terraform.tfstate
- terraform.tfvars committed to git; secrets never in tfvars, always in Secrets Manager
- CodePipeline + CodeBuild automates plan → approve → apply, with separate pipelines per cluster
- Prod pipeline has a manual approval gate between plan and apply
- Adding a region = copy one env directory, update tfvars, add pipeline definition
- Terraform owns AWS primitives; ArgoCD owns everything inside Kubernetes