Fix HPA race condition by reading deployment replicas instead of HPA status #4214
base: main
Conversation
…status

When HPA is enabled, read the current Deployment.Spec.Replicas directly instead of HPA.Status.DesiredReplicas, which is eventually consistent and lags behind deployment changes. This prevents the controller from overwriting the HPA's replica count with stale values, eliminating pod churn and connection drops.

Fixes the race condition where HPA scales down → NGF reads stale HPA status → NGF overwrites the deployment with the old replica count → pods restart.
Codecov Report: ✅ All modified and coverable lines are covered by tests.

Additional details and impacted files:

@@            Coverage Diff             @@
##             main    #4214      +/-   ##
==========================================
+ Coverage   86.02%   86.04%   +0.02%
==========================================
  Files         131      131
  Lines       14111    14120       +9
  Branches       35       35
==========================================
+ Hits        12139    12150      +11
+ Misses       1770     1769       -1
+ Partials      202      201       -1
func (p *NginxProvisioner) determineReplicas(
	objectMeta metav1.ObjectMeta,
	deploymentCfg ngfAPIv1alpha2.DeploymentSpec,
) *int32 {
	replicas := deploymentCfg.Replicas

	if !isAutoscalingEnabled(&deploymentCfg) {
		return replicas
	}

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	hpa := &autoscalingv2.HorizontalPodAutoscaler{}
	err := p.k8sClient.Get(ctx, types.NamespacedName{
		Namespace: objectMeta.Namespace,
		Name:      objectMeta.Name,
	}, hpa)
	if err != nil {
		return replicas
	}

	existingDeployment := &appsv1.Deployment{}
	err = p.k8sClient.Get(ctx, types.NamespacedName{
		Namespace: objectMeta.Namespace,
		Name:      objectMeta.Name,
	}, existingDeployment)

	if err == nil && existingDeployment.Spec.Replicas != nil {
		replicas = existingDeployment.Spec.Replicas
	}

	return replicas
}
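
For orientation, a minimal usage sketch showing where this helper might be called when building the desired Deployment. buildNginxDeployment is hypothetical and not taken from the NGF source; it only illustrates that the Deployment's Replicas field is populated from determineReplicas.

// Hypothetical call site (sketch only): the desired Deployment takes its
// replica count from determineReplicas, so an HPA-managed deployment keeps
// whatever replica count the HPA last wrote to the cluster.
func (p *NginxProvisioner) buildNginxDeployment(
	objectMeta metav1.ObjectMeta,
	deploymentCfg ngfAPIv1alpha2.DeploymentSpec,
) *appsv1.Deployment {
	return &appsv1.Deployment{
		ObjectMeta: objectMeta,
		Spec: appsv1.DeploymentSpec{
			Replicas: p.determineReplicas(objectMeta, deploymentCfg),
			// Selector, pod template, etc. omitted from this sketch.
		},
	}
}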
|
|
I like this approach, but I was wondering if there is a way to not set spec.replicas at all when HPA is enabled (or set it initially to the config value, and once an HPA is detected, stop letting the controller manage scaling for the deployment) and document that, since reconciliation is what leads us to the race issue. I'd like to understand the limitations of that approach. What I was thinking is: don't patch replicas on the deployment when HPA is enabled.
The tricky part would be the HPA's deletion from the cluster: the controller would need to detect it and fall back.
And I'm assuming that if an HPA was once enabled and in use and is then deleted, we patch the deployment with that information and default back to the replica count in the config?
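
A rough sketch of the alternative being suggested here, under stated assumptions: hpaExists is a hypothetical helper that checks for the HPA by name, and the provisioner applies the Deployment with a strategic merge patch or server-side apply, where a nil Replicas leaves the field unmanaged (with a full Update, a nil Replicas would instead default back to 1, which is one of the limitations worth understanding).

// Hypothetical variant (sketch only): never manage spec.replicas while an HPA
// owns scaling. If the HPA is later deleted, the next reconcile sees
// hpaExists == false and falls back to the replica count from the config.
func (p *NginxProvisioner) desiredReplicas(
	ctx context.Context,
	objectMeta metav1.ObjectMeta,
	deploymentCfg ngfAPIv1alpha2.DeploymentSpec,
) *int32 {
	if isAutoscalingEnabled(&deploymentCfg) && p.hpaExists(ctx, objectMeta) {
		return nil // leave the field to the HPA
	}
	return deploymentCfg.Replicas
}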
// HPA Replicas Management Strategy:
//
// When an HPA is managing a deployment, we must read the current deployment's replicas
// from the cluster and use that value, rather than trying to set our own value or read
// from HPA.Status.DesiredReplicas (which is eventually consistent and stale).
//
// Why we can't use HPA.Status.DesiredReplicas:
// - HPA.Status updates lag behind Deployment.Spec.Replicas changes
// - When HPA scales down: HPA writes Deployment.Spec → then updates its own Status
// - If we read Status during this window, we get the OLD value and overwrite HPA's new value
// - This creates a race condition causing pod churn
//
// Our approach:
// - When HPA exists: Read current deployment replicas from the cluster and use that
// - When HPA doesn't exist yet: Set replicas for initial deployment creation
// - When HPA exists but the Deployment doesn't exist yet: Set replicas for initial deployment creation
// - When HPA is disabled: Set replicas normally.
nit: I feel like this is more of a function implementation detail and should go inside the function.
I like this as a good overview before jumping into the details, since it's all so intertwined. Outside the function makes more sense to me.
Have you verified that the situation described in the bug report is resolved?
Yeah, we should also try to bottleneck the reconciliation/patch logic and see how that goes (whether stability is eventually reached), since this was a bigger pain point for prod environments.
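
One way to "bottleneck" reconciliation as suggested above would be an event filter that ignores Deployment updates whose only difference is spec.replicas, so HPA-driven scaling does not immediately trigger a re-patch. This is an illustrative controller-runtime sketch, not NGF's actual event filtering:

import (
	appsv1 "k8s.io/api/apps/v1"
	"k8s.io/apimachinery/pkg/api/equality"
	"sigs.k8s.io/controller-runtime/pkg/event"
	"sigs.k8s.io/controller-runtime/pkg/predicate"
)

// ignoreReplicaOnlyChanges skips Deployment update events whose only change is
// spec.replicas (e.g. the HPA scaling), so the controller does not reconcile
// and re-patch the deployment in response to scaling alone.
func ignoreReplicaOnlyChanges() predicate.Predicate {
	return predicate.Funcs{
		UpdateFunc: func(e event.UpdateEvent) bool {
			oldDep, okOld := e.ObjectOld.(*appsv1.Deployment)
			newDep, okNew := e.ObjectNew.(*appsv1.Deployment)
			if !okOld || !okNew {
				return true // not a Deployment; let other predicates decide
			}
			oldSpec := oldDep.Spec.DeepCopy()
			newSpec := newDep.Spec.DeepCopy()
			oldSpec.Replicas = nil
			newSpec.Replicas = nil
			// Reconcile only if something other than replicas changed.
			return !equality.Semantic.DeepEqual(oldSpec, newSpec)
		},
	}
}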
Proposed changes
Problem: When autoscaling.enable: true is configured in the Helm chart, the NGF controller updates the deployment and modifies the spec.replicas field in conflict with the HPA. This causes the deployment to scale up and down in the same second, resulting in constant pod churn and preventing the HPA from scaling up or down consistently.
Solution: When HPA is enabled, read the current Deployment.Spec.Replicas directly instead of HPA.Status.DesiredReplicas, which is eventually consistent and lags behind deployment changes. This prevents the controller from overwriting HPA's replica count with stale values, eliminating pod churn and connection drops.
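
For contrast, a reconstructed sketch of the pre-fix behavior described above (based on this description only, not the exact original code): the replica count came from the HPA's status, which is eventually consistent and lags the deployment mid-scale.

// Reconstructed pre-fix logic (sketch): using HPA.Status.DesiredReplicas as
// the source of truth reads a stale value while the HPA is mid-scale, and the
// subsequent patch undoes the HPA's change.
func staleReplicasFromHPAStatus(
	ctx context.Context,
	c client.Client,
	nsName types.NamespacedName,
	fallback *int32,
) *int32 {
	hpa := &autoscalingv2.HorizontalPodAutoscaler{}
	if err := c.Get(ctx, nsName, hpa); err != nil {
		return fallback
	}
	desired := hpa.Status.DesiredReplicas // lags Deployment.Spec.Replicas
	return &desired
}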
Testing: Unit and local testing
Please focus on (optional): If there are any specific areas where you would like reviewers to focus their attention or provide specific feedback, add them here.
Closes #4007
Checklist
Before creating a PR, run through this checklist and mark each as complete.
Release notes
If this PR introduces a change that affects users and needs to be mentioned in the release notes,
please add a brief note that summarizes the change.