Parametrize DRA test config #3654
/assign @alaypatel07
{{$LOAD_TEST_THROUGHPUT := DefaultParam .CL2_LOAD_TEST_THROUGHPUT 10}}
{{$STEADY_STATE_QPS := DefaultParam .CL2_STEADY_STATE_QPS 5}}
{{$RESOURCE_SLICES_PER_NODE := DefaultParam .CL2_RESOURCE_SLICES_PER_NODE 1}}
{{$UPSIZE_THRESHOLD := DefaultParam .CL2_UPSIZE_THRESHOLD "10m"}}
This value has no effect apart from printing the threshold value. Why is the change needed?
These thresholds define the acceptable limits for pod startup times.
The initial values in this PT were too stretched for 100 nodes, but at larger scale I observed that they need to be higher. Even for 100 nodes in GKE, 5 seconds is not enough:
E1102 21:24:30.207908 25461 clusterloader.go:258] Errors: [measurement call PodStartupLatency - FastFillPodStartupLatency error: pod startup: too high latency 99th percentile: got 8.792346318s expected: 5s]
I've been using this test with different drivers and observed that the time needed for different measurements may vary.
# Node resource configuration
{{$gpusPerNode := DefaultParam .CL2_GPUS_PER_NODE 8}}
{{$resourceSlicesPerNode := DefaultParam .CL2_RESOURCE_SLICES_PER_NODE 1}}
{{$workerNodeCount := MultiplyInt $resourceSlicesPerNode .Nodes}}
nit: should this be named $totalResourceSliceCount?
OK, done.
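For readers unfamiliar with how these template lines evaluate, the sketch below renders the renamed variable with Go's `text/template`. The `defaultParam` and `multiplyInt` functions are simplified stand-ins written for this example, not ClusterLoader2's own template functions, and the sketch assumes all parameters arrive as Go `int` values.

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// defaultParam is a simplified stand-in for DefaultParam:
// it returns def when the parameter is unset (nil).
func defaultParam(param, def interface{}) interface{} {
	if param == nil {
		return def
	}
	return param
}

// multiplyInt is a simplified stand-in for MultiplyInt.
// Assumption: both operands are plain ints.
func multiplyInt(a, b interface{}) (int, error) {
	x, okA := a.(int)
	y, okB := b.(int)
	if !okA || !okB {
		return 0, fmt.Errorf("multiplyInt: want ints, got %T and %T", a, b)
	}
	return x * y, nil
}

const snippet = `{{$resourceSlicesPerNode := DefaultParam .CL2_RESOURCE_SLICES_PER_NODE 1}}` +
	`{{$totalResourceSliceCount := MultiplyInt $resourceSlicesPerNode .Nodes}}` +
	`totalResourceSliceCount: {{$totalResourceSliceCount}}`

// render executes the snippet against the given parameters.
func render(params map[string]interface{}) (string, error) {
	funcs := template.FuncMap{"DefaultParam": defaultParam, "MultiplyInt": multiplyInt}
	tmpl, err := template.New("config").Funcs(funcs).Parse(snippet)
	if err != nil {
		return "", err
	}
	var out bytes.Buffer
	if err := tmpl.Execute(&out, params); err != nil {
		return "", err
	}
	return out.String(), nil
}

func main() {
	for _, params := range []map[string]interface{}{
		{"Nodes": 100}, // default: 1 slice per node
		{"Nodes": 100, "CL2_RESOURCE_SLICES_PER_NODE": 2}, // override
	} {
		out, err := render(params)
		if err != nil {
			panic(err)
		}
		fmt.Println(out)
	}
}
```

The product counts resource slices across the cluster rather than worker nodes, which is why the rename suggested above reads more accurately.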
{{$namespaces := DivideInt .Nodes $NODES_PER_NAMESPACE}}

# dra
{{$draNamespace := DefaultParam .CL2_DRA_NAMESPACE "dra-example-driver"}}
What's the point of exposing these as env variables?
Good point. I removed it and made the namespace name in dra.go generic.
@alaypatel07 I answered your comments. PTAL.
}

func isResourceSlicesPublished(config *dependency.Config, namespace string) (bool, error) {
	// Get a list of all nodes
Not sure why the changes in this file are needed.
I noticed that I hadn't deleted the commented-out code in the previous PR :/
What type of PR is this?
/kind cleanup
What this PR does / why we need it:
This PR parametrizes the DRA test logic to simplify running the test with different drivers and at different scales.
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer: