Skip to content

Fix the VPA e2e situation #8705

@adrianmoisey

Description

@adrianmoisey

/area vertical-pod-autoscaler

While digging into how the feature gates work with our e2e tests, I discovered a few issues with the VPA's e2e tests, namely:

  1. Not all tests are run on presubmit (PR creation), only the "full-vpa" suite is run
  2. Feature gates are manually configured to be enabled or disabled
  3. Tests are run in series
  4. Some tests wait for absence of anything to determine if they passed (it sets up a scenario, expects nothing to happen, so sleeps for 3 minutes and if nothing has happened, it considers that a pass).

I'd like to fix these problems. This issue is a placeholder for that work. The plan is:

  1. Guard the feature gates correctly such that they don't run by default, but can all be run with a simple flag change (most of this is done already in Switch e2e test with feature flag to use framework.WithFeatureGate #8684)
    1. ...
  2. Configure all of our e2e test suites to run on presubmit (caveat: this will be slow, we could wait up to 1h30 for the actuation tests to pass)
    1. Run all the VPA e2e tests on presubmit test-infra#35823
  3. Configure parallelism for the tests. In local testing I managed to get the 1h30m actuation tests down to 10m. We may need to mark some tests as "[Serial]" if they can't be run in parallel, but my initial testing seemed to show that this was a valuable change
    1. Configure VPA for parallel e2e tests #8715
    2. Increase worker count for VPA actuation jobs test-infra#35832
    3. Guard a few more e2e tests with WithSerial() #8719
  4. Configure e2e tests to run slow tests first
    1. Mark slow e2e tests as slow #8738
  5. Setup a second set of tests (duplicating the existing ones) that will enable all feature gates. So we will end up with 5 suites x 2 (master/presubmit) x 2 (feature gate enabled/disabled), which is a lot
    1. I may look at combining the suites (in some cases we can't run them all together, but we could potentially decrease the number of suites we have5. Fix the "tests that wait for nothing to happen". The idea is that instead of waiting for nothing, the test can pass when happens. For example: one test waits 3 minutes to ensure that a Pod from a Deployment with 1 replica is never evicted. If the VPA updater set a condition on the VPA indicating that it can't be evicted, the test look for that. I plan to figure this out, make an AEP and go through API review
      There may be other improvements that will happen along the way, but I think with these in place, e2e would be much nicer to use.

I guess something else I'd like to do is switch the e2e test manifest apply to using Helm. There are currently some nasty bash scripting hacks to modify the manifests, which is what Helm is designed to solve. That may or may not be in scope of this issue.

/cc maxcao13 omerap12 kamarabbas99


Side note of a few other things to cleanup:

  1. Modify the run_if_changed in the e2e tests to be more specific (ie: vertical-pod-autoscaler/pkg)
  2. Remove print statements from tests

Notes for later:

  1. How to run the tests locally with feature gates
  2. How to change the number of test runners locally
  3. How to run in serial

Metadata

Metadata

Assignees

Labels

area/vertical-pod-autoscalerkind/cleanupCategorizes issue or PR as related to cleaning up code, process, or technical debt.triage/acceptedIndicates an issue or PR is ready to be actively worked on.

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions