Pod eviction timeout is too aggressive on certain configurations #8701

@norbertcyran

In scale-down, there's a timeout on initiating the eviction of a pod. It's controlled by the max-pod-eviction-time flag, which defaults to 2 minutes.

In some scenarios this is too aggressive. Recreating a pod protected by a PDB can take much longer than that, especially if a termination grace period, startup probe, or readiness probe is configured.

The user can just increase that timeout with the flag, but that's not perfect either. Eviction can also fail due to other issues, like misconfiguration of the PDB or the workload, and in those cases a longer timeout only delays the failure.

Ideally, there would be a smarter mechanism for evicting pods. Possible improvements could include:

  • having an overall node drain timeout instead of a timeout for a single pod eviction
  • differentiating between 429 and 500 errors from the eviction API, retrying only on 429
  • making the timeout dynamic based on the workload's termination grace period, readiness probe's initialDelaySeconds, and possibly other configurations

/kind feature
/area cluster-autoscaler
