Which component are you using?:
/area cluster-autoscaler
/label area/core-autoscaler
Is your feature request designed to solve a problem? If so describe the problem this feature should solve.:
Currently, when cluster-autoscaler cannot remove a node during scale-down, the reasons are tracked internally (thanks to PR #7307) but are not exposed as annotations on the actual Kubernetes node objects. This creates a significant user-experience gap: users can see nodes that should be removed but aren't, with no clear indication of why.
Current Problems:
- Users have no direct way to understand why their nodes cannot be scaled down
- Debugging requires access to cluster-autoscaler logs or Prometheus metrics, which may not be readily available to end users
- No easy way to programmatically check why a specific node cannot be scaled down
- Difficult to build automation or alerts around specific blocking reasons
- Operations teams struggle to quickly identify scaling bottlenecks
Background:
PR #7307 "Modify scale down set processor to add reasons to unremovable nodes" (merged October 2024, available in cluster-autoscaler-1.34.0+) successfully added infrastructure to track detailed reasons why nodes cannot be removed. These reasons are currently exposed via:
- Prometheus metrics (UpdateUnremovableNodesCount)
- Status APIs (UnremovableNodes())
- Internal logging
However, the reasons are not written as annotations to the actual node objects.
Describe the solution you'd like.:
Add annotations to node objects when they cannot be removed during scale-down, indicating the specific blocking reason.
Proposed Annotation Format:
```yaml
# Primary annotation with the reason
cluster-autoscaler.kubernetes.io/scale-down-disabled-reason: "NodeGroupMinSizeReached"
# Optional: Human-readable description
cluster-autoscaler.kubernetes.io/scale-down-disabled-reason-description: "Node group has reached minimum size (1)"
# Optional: Timestamp when this reason was last updated
cluster-autoscaler.kubernetes.io/scale-down-disabled-reason-timestamp: "2025-10-30T10:00:00Z"
```
Supported Reason Values (based on simulator.UnremovableReason):
- ScaleDownDisabledAnnotation - Node has scale-down disabled annotation
- NotUnneededLongEnough - Node hasn't been unneeded for sufficient time
- NotUnreadyLongEnough - Unready node hasn't been unready for sufficient time
- NodeGroupMinSizeReached - Node group is at minimum size
- MinimalResourceLimitExceeded - Removing node would exceed resource limits
- NotAutoscaled - Node is not part of an autoscaled group
- UnexpectedError - Internal error occurred
- AtomicScaleDownFailed - Atomic scale-down operation failed
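For illustration only, here is a minimal Go sketch of how the proposed annotation keys and values could be assembled. The package name, constants, reasonDescriptions map, and buildAnnotations helper are hypothetical and not part of any existing cluster-autoscaler API:

```go
package annotations

import (
	"fmt"
	"time"
)

// Hypothetical annotation keys matching the proposed format above.
const (
	ReasonAnnotation      = "cluster-autoscaler.kubernetes.io/scale-down-disabled-reason"
	DescriptionAnnotation = "cluster-autoscaler.kubernetes.io/scale-down-disabled-reason-description"
	TimestampAnnotation   = "cluster-autoscaler.kubernetes.io/scale-down-disabled-reason-timestamp"
)

// Hypothetical human-readable descriptions for a few of the reason values listed above.
var reasonDescriptions = map[string]string{
	"ScaleDownDisabledAnnotation":  "Node has the scale-down disabled annotation",
	"NotUnneededLongEnough":        "Node hasn't been unneeded for sufficient time",
	"NodeGroupMinSizeReached":      "Node group is at minimum size",
	"MinimalResourceLimitExceeded": "Removing the node would exceed resource limits",
}

// buildAnnotations assembles the annotation values for a node that cannot be removed.
func buildAnnotations(reason string, now time.Time) map[string]string {
	desc, ok := reasonDescriptions[reason]
	if !ok {
		desc = fmt.Sprintf("Node cannot be scaled down: %s", reason)
	}
	return map[string]string{
		ReasonAnnotation:      reason,
		DescriptionAnnotation: desc,
		TimestampAnnotation:   now.UTC().Format(time.RFC3339),
	}
}
```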
Implementation Location:
The annotation logic should be added in core/scaledown/unneeded/nodes.go in the RemovableAt() method, right after the unremovable reason is determined:
```go
if r := n.unremovableReason(context, scaleDownContext, v, ts, nodeGroupSize); r != simulator.NoReason {
	unremovable = append(unremovable, simulator.UnremovableNode{Node: v.ntbr.Node, Reason: r})
	// ADD ANNOTATION LOGIC HERE
	continue
}
```
Configuration Options:
- --annotate-unremovable-nodes flag (default: false for backward compatibility)
- Configurable annotation prefix
- Option to include human-readable descriptions
- Automatic cleanup when nodes become removable again
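To support the automatic-cleanup option above, a hypothetical removeAnnotations helper could clear the proposed keys once a node becomes removable again; setting an annotation to null in a strategic-merge patch deletes it:

```go
package annotations

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// removeAnnotations clears the proposed annotations once a node becomes removable
// again; a null value in a strategic-merge patch deletes that annotation key.
func removeAnnotations(ctx context.Context, client kubernetes.Interface, nodeName string) error {
	patch := []byte(`{"metadata":{"annotations":{` +
		`"cluster-autoscaler.kubernetes.io/scale-down-disabled-reason":null,` +
		`"cluster-autoscaler.kubernetes.io/scale-down-disabled-reason-description":null,` +
		`"cluster-autoscaler.kubernetes.io/scale-down-disabled-reason-timestamp":null}}}`)
	_, err := client.CoreV1().Nodes().Patch(ctx, nodeName, types.StrategicMergePatchType, patch, metav1.PatchOptions{})
	return err
}
```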
Describe any alternative solutions you've considered.:
- Status API Only (current state): Requires special access and tooling
- Metrics Only: Good for monitoring but not user-friendly for troubleshooting
- Events: Could use Kubernetes events, but they're ephemeral and harder to query
- Custom Resources: More complex than annotations and overkill for this use case
- Enhanced Logging: Still requires log access and parsing
Additional context.:
Example Use Cases:
Troubleshooting:
```bash
# Check why a node cannot be scaled down
kubectl get node my-node -o jsonpath='{.metadata.annotations.cluster-autoscaler\.kubernetes\.io/scale-down-disabled-reason}'

# List all nodes with scale-down issues
kubectl get nodes -o json | jq '.items[] | select(.metadata.annotations."cluster-autoscaler.kubernetes.io/scale-down-disabled-reason" != null) | {name: .metadata.name, reason: .metadata.annotations."cluster-autoscaler.kubernetes.io/scale-down-disabled-reason"}'
```
Monitoring/Alerting:
```yaml
# Alert when nodes are blocked by minimum size constraints
- alert: NodeGroupMinSizeBlocking
  expr: count by (reason) (kube_node_annotations{annotation_cluster_autoscaler_kubernetes_io_scale_down_disabled_reason="NodeGroupMinSizeReached"}) > 0
```
Benefits:
- Improved User Experience: Clear visibility into why nodes cannot be scaled down
- Better Debugging: No need to access logs or metrics for basic troubleshooting
- Automation Friendly: Enables kubectl/automation scripts to check scale-down status
- Operational Visibility: Operations teams can quickly identify scaling bottlenecks
- Consistent with Kubernetes Patterns: Uses standard annotation approach for metadata
Backward Compatibility:
- Feature would be opt-in via feature flag to avoid breaking existing deployments
- No impact on existing functionality when disabled
- All existing functionality from PR #7307 (Modify scale down set processor to add reasons to unremovable nodes) remains unchanged
This feature would significantly improve the user experience for cluster-autoscaler users while building upon the excellent foundation provided by PR #7307.