-
Notifications
You must be signed in to change notification settings - Fork 4.2k
Open
Labels
area/vertical-pod-autoscalerkind/cleanupCategorizes issue or PR as related to cleaning up code, process, or technical debt.Categorizes issue or PR as related to cleaning up code, process, or technical debt.triage/acceptedIndicates an issue or PR is ready to be actively worked on.Indicates an issue or PR is ready to be actively worked on.
Description
/area vertical-pod-autoscaler
While digging into how the feature gates work with our e2e tests, I discovered a few issues with the VPA's e2e tests, namely:
- Not all tests are run on presubmit (PR creation), only the "full-vpa" suite is run
- Feature gates are manually configured to be enabled or disabled
- Tests are run in series
- Some tests wait for absence of anything to determine if they passed (it sets up a scenario, expects nothing to happen, so sleeps for 3 minutes and if nothing has happened, it considers that a pass).
I'd like to fix these problems. This issue is a placeholder for that work. The plan is:
- Guard the feature gates correctly such that they don't run by default, but can all be run with a simple flag change (most of this is done already in Switch e2e test with feature flag to use framework.WithFeatureGate #8684)
- ...
- Configure all of our e2e test suites to run on presubmit (caveat: this will be slow, we could wait up to 1h30 for the actuation tests to pass)
- Configure parallelism for the tests. In local testing I managed to get the 1h30m actuation tests down to 10m. We may need to mark some tests as "[Serial]" if they can't be run in parallel, but my initial testing seemed to show that this was a valuable change
- Configure e2e tests to run slow tests first
- Setup a second set of tests (duplicating the existing ones) that will enable all feature gates. So we will end up with 5 suites x 2 (master/presubmit) x 2 (feature gate enabled/disabled), which is a lot
- I may look at combining the suites (in some cases we can't run them all together, but we could potentially decrease the number of suites we have5. Fix the "tests that wait for nothing to happen". The idea is that instead of waiting for nothing, the test can pass when happens. For example: one test waits 3 minutes to ensure that a Pod from a Deployment with 1 replica is never evicted. If the VPA updater set a condition on the VPA indicating that it can't be evicted, the test look for that. I plan to figure this out, make an AEP and go through API review
There may be other improvements that will happen along the way, but I think with these in place, e2e would be much nicer to use.
- I may look at combining the suites (in some cases we can't run them all together, but we could potentially decrease the number of suites we have5. Fix the "tests that wait for nothing to happen". The idea is that instead of waiting for nothing, the test can pass when happens. For example: one test waits 3 minutes to ensure that a Pod from a Deployment with 1 replica is never evicted. If the VPA updater set a condition on the VPA indicating that it can't be evicted, the test look for that. I plan to figure this out, make an AEP and go through API review
I guess something else I'd like to do is switch the e2e test manifest apply to using Helm. There are currently some nasty bash scripting hacks to modify the manifests, which is what Helm is designed to solve. That may or may not be in scope of this issue.
/cc maxcao13 omerap12 kamarabbas99
Side note of a few other things to cleanup:
- Modify the run_if_changed in the e2e tests to be more specific (ie:
vertical-pod-autoscaler/pkg) - Remove print statements from tests
Notes for later:
- How to run the tests locally with feature gates
- How to change the number of test runners locally
- How to run in serial
maxcao13, iamzili and omerap12maxcao13
Metadata
Metadata
Assignees
Labels
area/vertical-pod-autoscalerkind/cleanupCategorizes issue or PR as related to cleaning up code, process, or technical debt.Categorizes issue or PR as related to cleaning up code, process, or technical debt.triage/acceptedIndicates an issue or PR is ready to be actively worked on.Indicates an issue or PR is ready to be actively worked on.