Conversation

@ciarams87
Contributor

Proposed changes

Problem: When autoscaling.enable: true is configured in the Helm chart, the NGF controller updates the deployment and modifies the spec.replicas field in conflict with the HPA. This causes the deployment to scale up and down in the same second, resulting in constant pod churn and preventing the HPA from scaling up or down consistently.

Solution: When HPA is enabled, read the current Deployment.Spec.Replicas directly instead of HPA.Status.DesiredReplicas, which is eventually consistent and lags behind deployment changes. This prevents the controller from overwriting HPA's replica count with stale values, eliminating pod churn and connection drops.
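
As a minimal sketch of the new read path (not the PR's exact code; the controller-runtime client wiring and names here are assumptions), the idea is that Deployment.Spec.Replicas is the value the HPA actually wrote, while HPA.Status.DesiredReplicas may still be stale:

package sketch

import (
    "context"

    appsv1 "k8s.io/api/apps/v1"
    "k8s.io/apimachinery/pkg/types"
    "sigs.k8s.io/controller-runtime/pkg/client"
)

// currentReplicas returns the live replica count from the Deployment when it
// exists, falling back to the configured value otherwise.
func currentReplicas(ctx context.Context, c client.Client, key types.NamespacedName, configured *int32) *int32 {
    dep := &appsv1.Deployment{}
    if err := c.Get(ctx, key, dep); err != nil || dep.Spec.Replicas == nil {
        // Deployment not created yet (or unreadable): use the configured value.
        return configured
    }
    // Use whatever the HPA (or anything else) has scaled the Deployment to.
    return dep.Spec.Replicas
}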

Testing: Unit and local testing

Please focus on (optional): If you have any specific areas where you would like reviewers to focus their attention or provide specific feedback, add them here.

Closes #4007

Checklist

Before creating a PR, run through this checklist and mark each as complete.

  • I have read the CONTRIBUTING doc
  • I have added tests that prove my fix is effective or that my feature works
  • I have checked that all unit tests pass after adding my changes
  • I have updated necessary documentation
  • I have rebased my branch onto main
  • I will ensure my PR is targeting the main branch and pulling from my branch from my own fork

Release notes

If this PR introduces a change that affects users and needs to be mentioned in the release notes,
please add a brief note that summarizes the change.


…status

When HPA is enabled, read the current Deployment.Spec.Replicas directly
instead of HPA.Status.DesiredReplicas, which is eventually consistent and
lags behind deployment changes. This prevents the controller from
overwriting HPA's replica count with stale values, eliminating pod churn
and connection drops.

Fixes race condition where HPA scales down → NGF reads stale HPA status
→ NGF overwrites deployment with old replica count → pods restart.
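
A toy, self-contained illustration of that ordering, with made-up replica counts (this is not NGF code):

package main

import "fmt"

func main() {
    hpaWrote := int32(2)    // the HPA has just written 2 to Deployment.Spec.Replicas
    staleStatus := int32(4) // HPA.Status.DesiredReplicas has not caught up yet

    deploymentReplicas := hpaWrote

    // Old behavior: the controller reconciled from the stale status value,
    // reverting the HPA's scale-down and restarting pods.
    deploymentReplicas = staleStatus

    fmt.Printf("deployment back at %d replicas, not the %d the HPA wanted\n",
        deploymentReplicas, hpaWrote)
}
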
@ciarams87 requested a review from a team as a code owner on November 5, 2025, 09:56
@github-actions bot added the bug (Something isn't working) label on Nov 5, 2025
@codecov

codecov bot commented Nov 5, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 86.04%. Comparing base (dd60012) to head (c25cdb5).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4214      +/-   ##
==========================================
+ Coverage   86.02%   86.04%   +0.02%     
==========================================
  Files         131      131              
  Lines       14111    14120       +9     
  Branches       35       35              
==========================================
+ Hits        12139    12150      +11     
+ Misses       1770     1769       -1     
+ Partials      202      201       -1     


Comment on lines +717 to +751
func (p *NginxProvisioner) determineReplicas(
    objectMeta metav1.ObjectMeta,
    deploymentCfg ngfAPIv1alpha2.DeploymentSpec,
) *int32 {
    replicas := deploymentCfg.Replicas

    if !isAutoscalingEnabled(&deploymentCfg) {
        return replicas
    }

    ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
    defer cancel()

    hpa := &autoscalingv2.HorizontalPodAutoscaler{}
    err := p.k8sClient.Get(ctx, types.NamespacedName{
        Namespace: objectMeta.Namespace,
        Name:      objectMeta.Name,
    }, hpa)
    if err != nil {
        return replicas
    }

    existingDeployment := &appsv1.Deployment{}
    err = p.k8sClient.Get(ctx, types.NamespacedName{
        Namespace: objectMeta.Namespace,
        Name:      objectMeta.Name,
    }, existingDeployment)

    if err == nil && existingDeployment.Spec.Replicas != nil {
        replicas = existingDeployment.Spec.Replicas
    }

    return replicas
}

Contributor

I like this approach, but I was wondering if there is a way to not set spec.replicas at all when HPA is enabled (or set it initially to the config value, and once an HPA is detected, stop letting the controller handle scaling for the deployment) and document that, since reconciliation is what leads us to the race issue. I wanted to understand the limitations of that approach. What I was thinking is: don't patch replicas on the deployment when HPA is enabled.

The tricky part could be the HPA being deleted from the cluster, and the controller noticing that and falling back.

And I am assuming that if an HPA was enabled and in use but then gets deleted, we patch the deployment accordingly and default back to the replica count from the config?
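
For illustration, a rough sketch of that alternative (hypothetical, not what this PR does) could reuse the types and the isAutoscalingEnabled helper from the quoted diff:

// Hypothetical alternative: never manage spec.replicas once autoscaling is on.
// Caveats raised in this thread: the controller would need to notice HPA deletion
// and fall back to the configured count, and an unset replicas field defaults to 1
// on the initial create.
func replicasToApply(deploymentCfg ngfAPIv1alpha2.DeploymentSpec) *int32 {
    if isAutoscalingEnabled(&deploymentCfg) {
        return nil // leave the field unset so the HPA keeps sole ownership
    }
    return deploymentCfg.Replicas
}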

@github-project-automation bot moved this from 🆕 New to 🏗 In Progress in NGINX Gateway Fabric on Nov 5, 2025
Comment on lines +700 to +716
// HPA Replicas Management Strategy:
//
// When an HPA is managing a deployment, we must read the current deployment's replicas
// from the cluster and use that value, rather than trying to set our own value or read
// from HPA.Status.DesiredReplicas (which is eventually consistent and stale).
//
// Why we can't use HPA.Status.DesiredReplicas:
// - HPA.Status updates lag behind Deployment.Spec.Replicas changes
// - When HPA scales down: HPA writes Deployment.Spec → then updates its own Status
// - If we read Status during this window, we get the OLD value and overwrite HPA's new value
// - This creates a race condition causing pod churn
//
// Our approach:
// - When HPA exists: Read current deployment replicas from cluster and use that
// - When HPA doesn't exist yet: Set replicas for initial deployment creation
// - When HPA exists but Deployment doesn't exist yet: Set replicas for initial deployment creation
// - When HPA is disabled: Set replicas normally.
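
For reference, the decision matrix above can be restated as standalone logic (pickReplicas is a hypothetical helper; the real decision lives in determineReplicas as quoted earlier):

func pickReplicas(autoscalingEnabled, hpaExists bool, liveReplicas, configured *int32) *int32 {
    switch {
    case !autoscalingEnabled:
        return configured // the controller owns scaling
    case hpaExists && liveReplicas != nil:
        return liveReplicas // carry forward whatever the HPA last set
    default:
        return configured // HPA or Deployment not created yet: seed the initial count
    }
}
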
Contributor

nit: I feel like this comment is more about function implementation details and should go inside the function

Contributor

@salonichf5 commented Nov 5, 2025

I like this as a good overview before jumping into the details, since it's all so intertwined. Outside the function makes more sense to me.

Contributor

@bjee19 left a comment

Have you verified that the situation described in the bug report was resolved?

@salonichf5
Contributor

Have you verified that the situation described in the bug report was resolved?

Yeah, we should try to bottleneck the reconciliation/patch logic and see how that goes (whether stability is eventually reached), since this was a bigger pain point for prod environments.


Labels

bug Something isn't working

Projects

Status: 🏗 In Progress

Development

Successfully merging this pull request may close these issues.

HPA and NGF Controller Conflicting

4 participants