Fix the VPA e2e situation

/area vertical-pod-autoscaler

While digging into how the feature gates work with our e2e tests, I discovered a few issues with the VPA's e2e tests, namely:
1. Not all tests are run on presubmit (PR creation); only the "full-vpa" suite is run
2. Feature gates are manually configured to be enabled or disabled
3. Tests are run in series
4. Some tests wait for the absence of any change to decide whether they passed: the test sets up a scenario, expects nothing to happen, sleeps for 3 minutes, and counts it as a pass if nothing has happened


I'd like to fix these problems. This issue is a placeholder for that work. The plan is:
1. Guard the feature-gated tests correctly so that they don't run by default, but can all be run with a simple flag change (most of this is done already in https://github.com/kubernetes/autoscaler/pull/8684)
    1. ...
2. Configure all of our e2e test suites to run on presubmit (caveat: this will be slow; we could wait up to 1h30m for the actuation tests to pass)
    1. https://github.com/kubernetes/test-infra/pull/35823
3. Configure parallelism for the tests. In local testing I managed to get the 1h30m actuation tests down to 10m. We may need to mark some tests as "[Serial]" if they can't be run in parallel, but my initial testing suggests that this is a valuable change
     1. https://github.com/kubernetes/autoscaler/pull/8715
     2. https://github.com/kubernetes/test-infra/pull/35832
     3. https://github.com/kubernetes/autoscaler/pull/8719
4. Configure e2e tests to run slow tests first
    1. https://github.com/kubernetes/autoscaler/pull/8738
5. Set up a second set of tests (duplicating the existing ones) that enables all feature gates. We will end up with 5 suites x 2 (master/presubmit) x 2 (feature gates enabled/disabled) = 20 suite configurations, which is a lot
    1. I may look at combining the suites (in some cases we can't run them all together, but we could potentially decrease the number of suites we have)
6. Fix the "tests that wait for nothing to happen". The idea is that instead of waiting for nothing, the test can pass when something specific happens. For example: one test waits 3 minutes to ensure that a Pod from a Deployment with 1 replica is never evicted. If the VPA updater set a condition on the VPA indicating that the Pod can't be evicted, the test could look for that. I plan to figure this out, make an AEP and go through API review

There may be other improvements that will happen along the way, but I think with these in place, e2e would be much nicer to use.

Something else I'd like to do is switch the e2e tests to applying their manifests with Helm. There are currently some nasty bash scripting hacks to modify the manifests, which is the problem Helm is designed to solve. That may or may not be in scope for this issue.

/cc maxcao13 omerap12 kamarabbas99

---
Side note of a few other things to clean up:
1. Modify the run_if_changed in the e2e tests to be more specific (i.e. `vertical-pod-autoscaler/pkg`)
2. Remove print statements from tests
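
A rough sketch of why a narrower `run_if_changed` helps, assuming the value is treated as a regular expression matched against each changed file path. `shouldRun`, the file paths, and the "broad" pattern are all hypothetical and exist only for contrast; the narrow pattern follows the example in item 1.

```go
package main

import (
	"fmt"
	"regexp"
)

// shouldRun reports whether any changed path matches the pattern.
// This mimics a run_if_changed style trigger; it is not the real implementation.
func shouldRun(pattern string, changedPaths []string) (bool, error) {
	re, err := regexp.Compile(pattern)
	if err != nil {
		return false, err
	}
	for _, p := range changedPaths {
		if re.MatchString(p) {
			return true, nil
		}
	}
	return false, nil
}

func main() {
	// Hypothetical PRs: one touches only docs, the other touches code.
	prs := map[string][]string{
		"docs-only": {"vertical-pod-autoscaler/docs/faq.md"},
		"code":      {"vertical-pod-autoscaler/pkg/updater/main.go"},
	}
	// Assumed patterns: a broad one for contrast and a narrower one.
	patterns := []string{`^vertical-pod-autoscaler/`, `^vertical-pod-autoscaler/pkg/`}

	for name, paths := range prs {
		for _, pattern := range patterns {
			run, err := shouldRun(pattern, paths)
			if err != nil {
				fmt.Println("bad pattern:", err)
				continue
			}
			fmt.Printf("%s PR, pattern %q: run e2e = %v\n", name, pattern, run)
		}
	}
}
```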
---
Notes for later:
1. How to run the tests locally with feature gates
2. How to change the number of test runners locally
3. How to run in serial 

Fix the VPA e2e situation #8705
