
Karpenter — AWS

The AWS Karpenter bundle deploys the karpenter-provider-aws controller and relies on two CRD types: EC2NodeClass (defines the EC2 node template) and NodePool (defines scheduling constraints and resource limits per workload class).

Bundle source: stacks/aws/karpenter


Infrastructure Components

The bundle deploys two resource groups in order:

infra group (runs first via tofu-module):

  • Creates an IAM role with EC2, pricing API, and SQS permissions
  • Associates the role to the karpenter service account via Pod Identity
  • Provisions an SQS queue for EC2 Spot interruption notifications
  • Installs the Karpenter Helm chart with the role ARN and queue name injected (see the values sketch below)

kubernetes group (runs after infra):

  • Applies a default EC2NodeClass (node template)
  • Applies all NodePool definitions inline via the control plane UI
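
For orientation, here is a minimal sketch of the kind of Helm values the infra group ends up passing to the karpenter chart. The queue name is an assumption about the naming convention; the actual wiring happens inside the tofu-module:

settings:
  clusterName: "{{ cluster_name }}"
  interruptionQueue: "{{ cluster_name }}-karpenter" # assumed name of the SQS interruption queue created above
serviceAccount:
  name: karpenter # the IAM role is attached via Pod Identity, so no IRSA annotation is needed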

EC2NodeClass

The karpenter-nodeclass-default is the shared node template referenced by all NodePools.

apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: karpenter-nodeclass-default
spec:
  amiFamily: AL2023 # Amazon Linux 2023
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        deleteOnTermination: true
        encrypted: true
        volumeSize: 100Gi
        volumeType: gp3
  detailedMonitoring: true
  metadataOptions:
    httpTokens: required # IMDSv2 enforced
    httpPutResponseHopLimit: 2
  role: "{{ cluster_name }}-karpenter-node-role"
  securityGroupSelectorTerms:
    - tags:
        Name: "{{ cluster_name }}-node" # matches cluster node SG
  subnetSelectorTerms:
    - tags:
        Name: "{{ ext_account_id }}-private*" # private subnets only

Key points:

  • Subnet and security group selection is tag-based — no hardcoded IDs
  • {{ cluster_name }} and {{ ext_account_id }} are resolved by the cluster agent at apply time
  • IMDSv2 is enforced (httpTokens: required)
  • All nodes get encrypted gp3 volumes

NodePools

airflow-worker

Source — used by Airflow KubernetesExecutor pods and Spark driver/executor pods.

  • Instance family: m5.large (general compute)
  • Architecture: arm64
  • Capacity: Spot + On-Demand
  • Consolidation: WhenEmpty after 2m
  • Limits: 2000 CPU · 8000Gi memory
  • Taint: airflow-worker: NoSchedule
  • Node expiry: 720h
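
A sketch of how these settings could map onto a Karpenter v1 NodePool manifest; the exact requirements and labels in the stack source may differ:

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: airflow-worker
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: karpenter-nodeclass-default
      requirements:
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: ["m5"]
        - key: kubernetes.io/arch
          operator: In
          values: ["arm64"] # must be compatible with the selected instance families
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
      taints:
        - key: airflow-worker
          effect: NoSchedule
      expireAfter: 720h # node expiry
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 2m
  limits:
    cpu: "2000"
    memory: 8000Gi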

trino-xsmall

Source — used by Trino coordinator and worker pods.

  • Instance family: r8g.large (memory-optimized)
  • Architecture: arm64
  • Capacity: On-Demand + Spot
  • Consolidation: WhenEmptyOrUnderutilized after 5m
  • Limits: 1000 CPU · 4000Gi memory
  • Taint: trino-xsmall: NoSchedule
  • Weight: 50 (lower priority than the airflow pool)
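
The trino-xsmall pool mostly differs in its disruption policy and weight. A fragment showing those fields (when several NodePools match, Karpenter prefers the one with the higher weight):

spec:
  weight: 50 # per the table above, ranked below the airflow pool
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 5m
  limits:
    cpu: "1000"
    memory: 4000Gi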

jupyterhub-small

Source — used by JupyterHub single-user notebook servers.

  • Instance family: t3.large / xlarge / 2xlarge (burstable)
  • Architecture: amd64
  • Capacity: Spot + On-Demand
  • Consolidation: WhenEmptyOrUnderutilized after 5m
  • Limits: 1600 CPU · 16000Gi memory
  • Taint: jupyterhub-small: NoSchedule
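
For jupyterhub-small the notable part is the requirements block, which spans several burstable instance types on amd64. An illustrative fragment:

requirements:
  - key: node.kubernetes.io/instance-type
    operator: In
    values: ["t3.large", "t3.xlarge", "t3.2xlarge"]
  - key: kubernetes.io/arch
    operator: In
    values: ["amd64"]
  - key: karpenter.sh/capacity-type
    operator: In
    values: ["spot", "on-demand"]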

Deploying via bundle.yaml

The bundle.yaml registers the stack in the platform catalog, and deployment follows the standard stack workflow.

The bundle takes a single required input: the target cluster resource. All other values (cluster name, account ID) are resolved from the cluster resource at deploy time.
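
Purely as an illustration (the field names here are hypothetical, not the platform's actual bundle schema), a registration could look roughly like:

name: aws-karpenter # hypothetical bundle name
source: stacks/aws/karpenter
inputs:
  cluster: # the single required input: the target cluster resource
    type: resource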


Adding a New NodePool

  1. Define a NodePool manifest referencing karpenter-nodeclass-default
  2. Add a taint matching the workload type (e.g. spark-worker: NoSchedule)
  3. Update the karpenter-resources manifest in the catalog via the control plane UI
  4. Re-apply the stack to the target workspace
  5. Configure the workload's pod template with the matching nodeSelector and toleration (see the example below)
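
A sketch of steps 1, 2, and 5 for a hypothetical spark-worker pool; all names and values here are illustrative:

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: spark-worker
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: karpenter-nodeclass-default
      taints:
        - key: spark-worker
          effect: NoSchedule
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 2m

# Matching fragment for the workload's pod template:
nodeSelector:
  karpenter.sh/nodepool: spark-worker # nodes carry this label from the owning NodePool
tolerations:
  - key: spark-worker
    operator: Exists
    effect: NoSchedule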