-
Notifications
You must be signed in to change notification settings - Fork 1.2k
Description
Thank you @alvaroaleman for adding ReconciliationTimeout. My understanding is that it acts a guardrail to prevent a single reconcile from blocking a worker indefinitely to prevent HOL blocking.
I was trying to enable this flag for some of our internal metrics but realized that we don’t currently surface when the timeout guardrail is breached. If a reconcile exits because ReconciliationTimeout fired, the existing metrics treat it like a normal return.
This issue is to propose a metric that makes those cases visible to alert / have visibility on when the timeout guardrail is exceeded.
Metric Semantics
controller_runtime_reconcile_timeouts{controller="<name>"}
- Incremented once when the context created from ReconciliationTimeout expires
- Labelled with controller as the only facet
Controllers may add their own context deadlines inside Reconcile(). This metric should only increment when the wrapper’s timeout fires, not when user code cancels early to distinguish when this timeout guardrail has been breached.
Handwavey idea to be added to pkg/internal/controller/controller.go in Reconcile method.
var res ctrl.Result
var err error
done := make(chan struct{})
go func() {
res, err = r.Reconcile(ctx, req)
close(done)
}()
select {
case <-done:
return res, err
case <-ctx.Done():
// Only count when the outer deadline elapsed, not on user supplied cancellation
if errors.Is(ctx.Err(), context.DeadlineExceeded) {
metrics.ReconcileTimeouts.WithLabelValues(r.name).Inc()
}
return ctrl.Result{}, ctx.Err()
}LMK if this makes sense and I am happy to add a patch for it!