Karpenter — AliCloud

Karpenter on AliCloud uses karpenter-provider-alibabacloud, the provider maintained by CloudPilot AI. The API surface mirrors the AWS provider: the same NodePool CRD is used, but the node template CRD is ECSNodeClass instead of EC2NodeClass.

Bundle source: stacks/alicloud/karpenter


Differences from AWS

| Concept | AWS | AliCloud |
| --- | --- | --- |
| Node template CRD | EC2NodeClass | ECSNodeClass |
| IAM / identity | IAM Role (Pod Identity) | RAM Role |
| Spot interruption | SQS queue | Cloud Monitor events |
| Subnet selection | Tag-based (subnetSelectorTerms) | VSwitch ID list (vSwitchSelectorTerms) |
| Security groups | Tag-based (securityGroupSelectorTerms) | Security group ID list |
| AMI family | AL2023, Bottlerocket, etc. | AliyunLinux, ContainerOS |
| Instance category label | karpenter.k8s.aws/instance-category | karpenter.k8s.alibabacloud.com/ecs-instance-category |

Infrastructure Components

infra group (runs first via tofu-module):

  • Creates a RAM role with ECS and pricing API permissions
  • Associates the role with the karpenter service account via RRSA (RAM Roles for Service Accounts)
  • Installs the Karpenter Helm chart with the role ARN injected
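
The RRSA association in the steps above can be pictured as a label on the namespace plus an annotation on the service account. This is only a sketch: it assumes the ack-pod-identity-webhook component is installed on the ACK cluster, and the namespace and role name are hypothetical placeholders rather than values taken from the bundle.

```yaml
# Assumption: ack-pod-identity-webhook performs the RRSA credential injection.
apiVersion: v1
kind: Namespace
metadata:
  name: karpenter                      # hypothetical namespace
  labels:
    # Quoted so YAML does not parse it as a boolean
    pod-identity.alibabacloud.com/injection: "on"
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: karpenter
  namespace: karpenter
  annotations:
    # RAM role the controller pods assume; the role name is a placeholder
    pod-identity.alibabacloud.com/role-name: "<cluster_name>-karpenter-controller-role"
```

In the bundle itself the Helm chart install is described as handling this wiring, so the manifest is illustrative rather than something to apply by hand.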

kubernetes group (runs after infra):

  • Applies a default ECSNodeClass (node template)
  • Applies all NodePool definitions inline via the control plane UI

ECSNodeClass

The ECSNodeClass is the AliCloud equivalent of the AWS EC2NodeClass. It selects the VSwitches and security groups that nodes launch into (by tag in the default template below) and specifies the RAM role that nodes assume.

apiVersion: karpenter.k8s.alibabacloud.com/v1alpha1
kind: ECSNodeClass
metadata:
  name: karpenter-nodeclass-default
spec:
  amiFamily: AliyunLinux
  systemDisk:
    size: 100
    category: cloud_essd
    encrypted: true
    deleteWithInstance: true
  vSwitchSelectorTerms:
    - tags:
        Name: "{{ cluster_name }}-private*" # private VSwitches only
  securityGroupSelectorTerms:
    - tags:
        Name: "{{ cluster_name }}-node"
  ramRoleName: "{{ cluster_name }}-karpenter-node-role"
  userData: ""

Key points:

  • VSwitch and security group selection is tag-based — same pattern as AWS
  • {{ cluster_name }} is resolved by the cluster agent at apply time
  • System disk uses cloud_essd (roughly comparable to AWS gp3) with encryption enabled
  • No SQS queue is needed: AliCloud interruption handling is built into the provider via Cloud Monitor events

NodePools

NodePool specs are identical in structure to AWS. Only the instance-selection label keys and their values differ.

airflow-worker

Used by Airflow KubernetesExecutor pods and Spark driver/executor pods.

| Setting | Value |
| --- | --- |
| Instance family | ecs.g7.large (general compute, equiv. to m5) |
| Architecture | amd64 |
| Capacity | Spot + On-Demand |
| Consolidation | WhenEmpty after 2m |
| Taint | airflow-worker: NoSchedule |

requirements:
  - key: karpenter.k8s.alibabacloud.com/ecs-instance-category
    operator: In
    values: [g]
  - key: karpenter.k8s.alibabacloud.com/ecs-instance-family
    operator: In
    values: [ecs.g7]
  - key: karpenter.sh/capacity-type
    operator: In
    values: [spot, on-demand]
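
To show how the settings and requirements fit together, here is a sketch of a complete airflow-worker NodePool. It assumes the Karpenter core v1 NodePool API and that the nodeClassRef group matches the ECSNodeClass API group shown earlier. The actual manifest in the bundle may differ.

```yaml
apiVersion: karpenter.sh/v1            # assumption: Karpenter core v1 API
kind: NodePool
metadata:
  name: airflow-worker
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.alibabacloud.com   # assumed to match the ECSNodeClass group
        kind: ECSNodeClass
        name: karpenter-nodeclass-default
      taints:
        - key: airflow-worker
          effect: NoSchedule
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
        - key: karpenter.k8s.alibabacloud.com/ecs-instance-family
          operator: In
          values: ["ecs.g7"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
  disruption:
    consolidationPolicy: WhenEmpty     # only empty nodes are removed
    consolidateAfter: 2m
```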

trino-xsmall

Used by Trino coordinator and worker pods.

| Setting | Value |
| --- | --- |
| Instance family | ecs.r7.large (memory-optimized, equiv. to r8g) |
| Architecture | amd64 |
| Capacity | On-Demand + Spot |
| Consolidation | WhenEmptyOrUnderutilized after 5m |
| Taint | trino-xsmall: NoSchedule |

requirements:
  - key: karpenter.k8s.alibabacloud.com/ecs-instance-category
    operator: In
    values: [r]
  - key: karpenter.k8s.alibabacloud.com/ecs-instance-family
    operator: In
    values: [ecs.r7]
  - key: karpenter.sh/capacity-type
    operator: In
    values: [on-demand, spot]
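
Because this pool is tainted and offers both capacity types, a workload has to tolerate the taint, and it can narrow the capacity type itself. The sketch below shows a pod that lands on trino-xsmall nodes and restricts itself to on-demand capacity. Pinning the coordinator to on-demand is an assumption made for illustration, not something the bundle is stated to do, and the pod name and image are placeholders.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: trino-coordinator-example      # hypothetical name
spec:
  nodeSelector:
    karpenter.sh/nodepool: trino-xsmall        # label Karpenter sets on nodes it provisions
    karpenter.sh/capacity-type: on-demand      # optional: avoid spot for this pod
  tolerations:
    - key: trino-xsmall
      operator: Exists
      effect: NoSchedule
  containers:
    - name: trino
      image: trinodb/trino             # placeholder image reference
```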

jupyterhub-small

Used by JupyterHub single-user notebook servers.

| Setting | Value |
| --- | --- |
| Instance family | ecs.t6.large–2xlarge (burstable, equiv. to t3) |
| Architecture | amd64 |
| Capacity | Spot + On-Demand |
| Consolidation | WhenEmptyOrUnderutilized after 5m |
| Taint | jupyterhub-small: NoSchedule |

requirements:
  - key: karpenter.k8s.alibabacloud.com/ecs-instance-category
    operator: In
    values: [t]
  - key: karpenter.k8s.alibabacloud.com/ecs-instance-family
    operator: In
    values: [ecs.t6]
  - key: karpenter.sh/capacity-type
    operator: In
    values: [spot, on-demand]
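
Notebook pods need a matching toleration to schedule onto this pool. The fragment below sketches how that might look, assuming JupyterHub is deployed with the Zero to JupyterHub Helm chart. That is an assumption, since the source does not say how JupyterHub is installed.

```yaml
# Helm values fragment for the Zero to JupyterHub chart (assumed deployment method)
singleuser:
  nodeSelector:
    karpenter.sh/nodepool: jupyterhub-small    # label Karpenter sets on nodes it provisions
  extraTolerations:
    - key: jupyterhub-small
      operator: Exists
      effect: NoSchedule
```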

Deploying via bundle.yaml

Deployment follows the same stack workflow as AWS. The bundle targets a cluster resource (an ACK cluster) and additionally needs the AliCloud-specific inputs listed below.

AliCloud-specific inputs required at deploy time:

| Input | Description |
| --- | --- |
| cluster | Target ACK cluster resource |
| region | AliCloud region (e.g. cn-hangzhou) |
| vswitch_ids | List of private VSwitch IDs for node placement |