
[cluster-autoscaler] Add node annotations for scale-down blocking reasons #8710

@SarCode

Description

Which component are you using?:

/area cluster-autoscaler
/label area/core-autoscaler

Is your feature request designed to solve a problem? If so describe the problem this feature should solve.:

Currently, when cluster-autoscaler cannot remove a node during scale-down, the reasons are tracked internally (thanks to PR #7307) but are not exposed as annotations on the actual Kubernetes node objects. This creates a significant user experience gap where users can see nodes that should be removed but aren't, with no clear indication why.

Current Problems:

  • Users have no direct way to understand why their nodes cannot be scaled down
  • Debugging requires access to cluster-autoscaler logs or Prometheus metrics, which may not be readily available to end users
  • No easy way to programmatically check why a specific node cannot be scaled down
  • Difficult to build automation or alerts around specific blocking reasons
  • Operations teams struggle to quickly identify scaling bottlenecks

Background:
PR #7307 "Modify scale down set processor to add reasons to unremovable nodes" (merged October 2024, available in cluster-autoscaler-1.34.0+) successfully added infrastructure to track detailed reasons why nodes cannot be removed. These reasons are currently exposed via:

  • Prometheus metrics (UpdateUnremovableNodesCount)
  • Status APIs (UnremovableNodes())
  • Internal logging

However, the reasons are not written as annotations to the actual node objects.

Describe the solution you'd like.:

Add annotations to node objects when they cannot be removed during scale-down, indicating the specific blocking reason.

Proposed Annotation Format:

# Primary annotation with the reason
cluster-autoscaler.kubernetes.io/scale-down-disabled-reason: "NodeGroupMinSizeReached"

# Optional: Human-readable description  
cluster-autoscaler.kubernetes.io/scale-down-disabled-reason-description: "Node group has reached minimum size (1)"

# Optional: Timestamp when this reason was last updated
cluster-autoscaler.kubernetes.io/scale-down-disabled-reason-timestamp: "2025-10-30T10:00:00Z"

Supported Reason Values (based on simulator.UnremovableReason; a mapping sketch follows this list):

  • ScaleDownDisabledAnnotation - Node has scale-down disabled annotation
  • NotUnneededLongEnough - Node hasn't been unneeded for sufficient time
  • NotUnreadyLongEnough - Unready node hasn't been unready for sufficient time
  • NodeGroupMinSizeReached - Node group is at minimum size
  • MinimalResourceLimitExceeded - Removing node would exceed resource limits
  • NotAutoscaled - Node is not part of an autoscaled group
  • UnexpectedError - Internal error occurred
  • AtomicScaleDownFailed - Atomic scale-down operation failed
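
For illustration, a rough sketch of a mapping from these reason values to the proposed annotation string. It assumes the constant names above match the simulator package's UnremovableReason values and that the helper lives in a file that already imports that package (as core/scaledown/unneeded/nodes.go does); reasonString itself is hypothetical, not existing cluster-autoscaler code:

// Hypothetical helper mapping simulator.UnremovableReason values to the
// annotation strings proposed in this issue; constant names follow the list above.
func reasonString(r simulator.UnremovableReason) string {
    switch r {
    case simulator.ScaleDownDisabledAnnotation:
        return "ScaleDownDisabledAnnotation"
    case simulator.NotUnneededLongEnough:
        return "NotUnneededLongEnough"
    case simulator.NotUnreadyLongEnough:
        return "NotUnreadyLongEnough"
    case simulator.NodeGroupMinSizeReached:
        return "NodeGroupMinSizeReached"
    case simulator.MinimalResourceLimitExceeded:
        return "MinimalResourceLimitExceeded"
    case simulator.NotAutoscaled:
        return "NotAutoscaled"
    case simulator.UnexpectedError:
        return "UnexpectedError"
    case simulator.AtomicScaleDownFailed:
        return "AtomicScaleDownFailed"
    default:
        return "Unknown"
    }
}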

Implementation Location:
The annotation logic should be added in core/scaledown/unneeded/nodes.go in the RemovableAt() method, right after the unremovable reason is determined:

if r := n.unremovableReason(context, scaleDownContext, v, ts, nodeGroupSize); r != simulator.NoReason {
    unremovable = append(unremovable, simulator.UnremovableNode{Node: v.ntbr.Node, Reason: r})
    // ADD ANNOTATION LOGIC HERE
    continue
}
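
For illustration, a minimal sketch of what the annotation write at the "ADD ANNOTATION LOGIC HERE" point, plus the cleanup once a node becomes removable again, could look like using a client-go JSON merge patch. The helper names, package placement, and wiring are hypothetical; the real change would have to go through the autoscaler's existing client and context plumbing, and both calls would sit behind the proposed --annotate-unremovable-nodes flag. The reason string could come from a mapping like the reasonString sketch above:

package unneeded // hypothetical placement next to core/scaledown/unneeded/nodes.go

import (
    "context"
    "fmt"
    "time"

    apiv1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/types"
    "k8s.io/client-go/kubernetes"
)

const (
    reasonAnnotation    = "cluster-autoscaler.kubernetes.io/scale-down-disabled-reason"
    timestampAnnotation = "cluster-autoscaler.kubernetes.io/scale-down-disabled-reason-timestamp"
)

// annotateUnremovableNode writes the blocking reason and a timestamp onto the
// node object via a JSON merge patch (hypothetical helper, not existing CA code).
func annotateUnremovableNode(ctx context.Context, client kubernetes.Interface, node *apiv1.Node, reason string) error {
    patch := []byte(fmt.Sprintf(
        `{"metadata":{"annotations":{%q:%q,%q:%q}}}`,
        reasonAnnotation, reason,
        timestampAnnotation, time.Now().UTC().Format(time.RFC3339),
    ))
    _, err := client.CoreV1().Nodes().Patch(ctx, node.Name, types.MergePatchType, patch, metav1.PatchOptions{})
    return err
}

// clearUnremovableAnnotations removes the annotations once the node is removable
// again, so stale reasons do not linger on the node object.
func clearUnremovableAnnotations(ctx context.Context, client kubernetes.Interface, node *apiv1.Node) error {
    patch := []byte(fmt.Sprintf(
        `{"metadata":{"annotations":{%q:null,%q:null}}}`,
        reasonAnnotation, timestampAnnotation,
    ))
    _, err := client.CoreV1().Nodes().Patch(ctx, node.Name, types.MergePatchType, patch, metav1.PatchOptions{})
    return err
}

A merge patch keeps the write small and avoids conflicts with other controllers that update node metadata; setting the keys to null removes them again, which covers the automatic-cleanup option listed under Configuration Options below.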

Configuration Options (a flag sketch follows this list):

  • --annotate-unremovable-nodes flag (default: false for backward compatibility)
  • Configurable annotation prefix
  • Option to include human-readable descriptions
  • Automatic cleanup when nodes become removable again
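
For illustration, a rough sketch of how these options might be declared with standard Go flag parsing. Only --annotate-unremovable-nodes is named in this proposal; the other flag names and defaults are hypothetical placeholders:

package main // standalone sketch, not the autoscaler's actual flag wiring

import (
    "flag"
    "fmt"
)

// Hypothetical flag declarations for the options above.
var (
    annotateUnremovableNodes = flag.Bool("annotate-unremovable-nodes", false,
        "If true, write scale-down blocking reasons as annotations on node objects.")
    unremovableAnnotationPrefix = flag.String("unremovable-annotation-prefix", "cluster-autoscaler.kubernetes.io",
        "Prefix to use for the scale-down blocking reason annotations (hypothetical flag name).")
    annotateUnremovableDescriptions = flag.Bool("annotate-unremovable-descriptions", false,
        "If true, also write a human-readable description annotation (hypothetical flag name).")
)

func main() {
    flag.Parse()
    fmt.Println(*annotateUnremovableNodes, *unremovableAnnotationPrefix, *annotateUnremovableDescriptions)
}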

Describe any alternative solutions you've considered.:

  1. Status API Only (current state): Requires special access and tooling
  2. Metrics Only: Good for monitoring but not user-friendly for troubleshooting
  3. Events: Could use Kubernetes events, but they're ephemeral and harder to query
  4. Custom Resources: More complex than annotations and overkill for this use case
  5. Enhanced Logging: Still requires log access and parsing

Additional context.:

Example Use Cases:

Troubleshooting:

# Check why a node cannot be scaled down
kubectl get node my-node -o jsonpath='{.metadata.annotations.cluster-autoscaler\.kubernetes\.io/scale-down-disabled-reason}'

# List all nodes with scale-down issues  
kubectl get nodes -o json | jq '.items[] | select(.metadata.annotations."cluster-autoscaler.kubernetes.io/scale-down-disabled-reason" != null) | {name: .metadata.name, reason: .metadata.annotations."cluster-autoscaler.kubernetes.io/scale-down-disabled-reason"}'

Monitoring/Alerting:

# Alert when nodes are blocked by minimum size constraints
- alert: NodeGroupMinSizeBlocking
  expr: count by (reason) (kube_node_annotations{annotation_cluster_autoscaler_kubernetes_io_scale_down_disabled_reason="NodeGroupMinSizeReached"}) > 0

Benefits:

  1. Improved User Experience: Clear visibility into why nodes cannot be scaled down
  2. Better Debugging: No need to access logs or metrics for basic troubleshooting
  3. Automation Friendly: Enables kubectl/automation scripts to check scale-down status
  4. Operational Visibility: Operations teams can quickly identify scaling bottlenecks
  5. Consistent with Kubernetes Patterns: Uses standard annotation approach for metadata

Backward Compatibility:

The annotations would be opt-in behind the proposed --annotate-unremovable-nodes flag (default: false), so existing deployments see no change in behavior unless the flag is enabled.

This feature would significantly improve the user experience for cluster-autoscaler users while building upon the excellent foundation provided by PR #7307.
