Description
Which component are you using?:
Cluster-autoscaler
What version of the component are you using?:
Component version: 1.32.2
What k8s version are you using (kubectl version)?:
kubectl version Output:
$ kubectl version
Server Version: v1.32.1
What environment is this in?:
OCI
What did you expect to happen?:
When a node pool is marked status: Unhealthy because OCI reports Out of Host Capacity, the cluster autoscaler should stop trying to schedule Pending pods on that node pool and fall back to a different node pool with lower priority.
What happened instead?:
The cluster autoscaler keeps trying to schedule the Pending pod onto a template node for an upcoming node that is never actually created. As seen in the logs:
Pod can be moved to template-node-for--upcoming-0
The same node pool is set to Unhealthy in the cluster-autoscaler-status ConfigMap because OCI is out of capacity, and the autoscaler cannot remove the upcoming node since OCI has not yet assigned it an instance ID:
Found 1 instances with errorCode OutOfResource.InternalError in nodeGroup
Deleting 1 from node group because of create errors
Error while trying to delete nodes from: Node doesn't have an instance id so it can't be deleted.
How to reproduce it (as minimally and precisely as possible):
1. In an OKE cluster, create two node pools with different instance types, where one node pool's shape is out of host capacity on OCI.
2. Configure the cluster autoscaler so that the out-of-capacity node pool has the higher priority (a sketch of the assumed priority configuration follows these steps).
3. Create new pods that need to be scheduled on these node pools, then wait for the cluster autoscaler to mark the out-of-capacity node pool as Unhealthy without scaling up the lower-priority node pool.
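For reference, the setup above presumably relies on the priority expander; below is a minimal sketch of a cluster-autoscaler-priority-expander ConfigMap expressing it. The node pool name patterns are placeholders, and higher keys mean higher priority.

apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-autoscaler-priority-expander
  namespace: kube-system
data:
  priorities: |-
    # Higher value = higher priority: the out-of-capacity node pool is tried first
    20:
      - .*out-of-capacity-pool.*
    # Lower-priority fallback node pool the autoscaler should switch to
    10:
      - .*fallback-pool.*

This assumes the cluster autoscaler is started with --expander=priority so the ConfigMap is consulted during scale-up.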
Anything else we need to know?:
Not sure if it changes the behaviour, but the node pool is trying to scale up from 0 nodes.