Pod eviction timeout is too aggressive on certain configurations

In scale-down, there's a timeout on [initiating eviction of a pod](https://github.com/kubernetes/autoscaler/blob/48dfe75fafdf3d6d5075bede976c9ec6443a6bbb/cluster-autoscaler/core/scaledown/actuation/drain.go#L239). It's controlled by the `max-pod-eviction-time` flag, which defaults to 2 minutes.

In some scenarios this is too aggressive. Recreating a pod protected by a PDB can take much longer than that, especially if a termination grace period, startup probe, or readiness probe is configured.

The user can increase that timeout with the flag, but that's not perfect either: eviction can also fail due to [other issues](https://kubernetes.io/docs/concepts/scheduling-eviction/api-eviction/#how-api-initiated-eviction-works), like misconfiguration of the PDB or the workload, and in those cases a longer timeout only delays giving up.

Ideally, there would be a smarter mechanism for evicting pods. Possible improvements include:
- having an overall node drain timeout instead of a timeout for a single pod eviction
- differentiating between 429 and 500 errors from the eviction API, retrying only on 429
- making the timeout dynamic based on the workload's termination grace period, readiness probe's `initialDelaySeconds`, and possibly other configurations

/kind feature
/area cluster-autoscaler

Pod eviction timeout is too aggressive on certain configurations #8701
