Skip to content

Conversation

@emerbe
Copy link
Contributor

@emerbe emerbe commented Oct 23, 2025

What type of PR is this?

/kind cleanup

What this PR does / why we need it:

This PR modifies DRA test logic a bit so it's parametrized to simplify running test with different drivers and with different scale.

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

@k8s-ci-robot k8s-ci-robot added kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Oct 23, 2025
@k8s-ci-robot
Copy link
Contributor

Hi @emerbe. Thanks for your PR.

I'm waiting for a github.com member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: emerbe
Once this PR has been reviewed and has the lgtm label, please assign marseel for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Oct 23, 2025
@emerbe
Copy link
Contributor Author

emerbe commented Oct 23, 2025

/assign @alaypatel07

@emerbe emerbe force-pushed the update-dra-test-config branch 3 times, most recently from 65b6445 to 93485a2 Compare October 24, 2025 17:40
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 30, 2025
@emerbe emerbe force-pushed the update-dra-test-config branch from 93485a2 to e231a1a Compare October 30, 2025 15:53
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 30, 2025
@emerbe emerbe force-pushed the update-dra-test-config branch 3 times, most recently from 7128f8c to e7621d2 Compare October 30, 2025 16:02
{{$LOAD_TEST_THROUGHPUT := DefaultParam .CL2_LOAD_TEST_THROUGHPUT 10}}
{{$STEADY_STATE_QPS := DefaultParam .CL2_STEADY_STATE_QPS 5}}
{{$RESOURCE_SLICES_PER_NODE := DefaultParam .CL2_RESOURCE_SLICES_PER_NODE 1}}
{{$UPSIZE_THRESHOLD := DefaultParam .CL2_UPSIZE_THRESHOLD "10m"}}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This value does not have any impact except from printing the threshold value. Why is the change needed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those thresholds define the acceptable limits for the pod startup times.

The initial values in this PT were too stretched for 100 nodes but in bigger scale I observed that those need to be higher. Even for 100 nodes in GKE 5 seconds is not enough:

E1102 21:24:30.207908 25461 clusterloader.go:258] Errors: [measurement call PodStartupLatency - FastFillPodStartupLatency error: pod startup: too high latency 99th percentile: got 8.792346318s expected: 5s]

I've been using this test with different than drivers and observed that times needed for different measurements may vary.

# Node resource configuration
{{$gpusPerNode := DefaultParam .CL2_GPUS_PER_NODE 8}}
{{$resourceSlicesPerNode := DefaultParam .CL2_RESOURCE_SLICES_PER_NODE 1}}
{{$workerNodeCount := MultiplyInt $resourceSlicesPerNode .Nodes}}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: should this be named $totalResourceSliceCount?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, done.

{{$namespaces := DivideInt .Nodes $NODES_PER_NAMESPACE}}

# dra
{{$draNamespace := DefaultParam .CL2_DRA_NAMESPACE "dra-example-driver"}}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

whats the point of exposing these as env variables?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I removed it and made namepsace name in dra.go generic

@emerbe emerbe force-pushed the update-dra-test-config branch from e7621d2 to d405c3e Compare November 2, 2025 21:46
Copy link
Contributor Author

@emerbe emerbe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alaypatel07 answered your comments. PTAL

{{$namespaces := DivideInt .Nodes $NODES_PER_NAMESPACE}}

# dra
{{$draNamespace := DefaultParam .CL2_DRA_NAMESPACE "dra-example-driver"}}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I removed it and made namepsace name in dra.go generic

# Node resource configuration
{{$gpusPerNode := DefaultParam .CL2_GPUS_PER_NODE 8}}
{{$resourceSlicesPerNode := DefaultParam .CL2_RESOURCE_SLICES_PER_NODE 1}}
{{$workerNodeCount := MultiplyInt $resourceSlicesPerNode .Nodes}}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, done.

{{$LOAD_TEST_THROUGHPUT := DefaultParam .CL2_LOAD_TEST_THROUGHPUT 10}}
{{$STEADY_STATE_QPS := DefaultParam .CL2_STEADY_STATE_QPS 5}}
{{$RESOURCE_SLICES_PER_NODE := DefaultParam .CL2_RESOURCE_SLICES_PER_NODE 1}}
{{$UPSIZE_THRESHOLD := DefaultParam .CL2_UPSIZE_THRESHOLD "10m"}}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those thresholds define the acceptable limits for the pod startup times.

The initial values in this PT were too stretched for 100 nodes but in bigger scale I observed that those need to be higher. Even for 100 nodes in GKE 5 seconds is not enough:

E1102 21:24:30.207908 25461 clusterloader.go:258] Errors: [measurement call PodStartupLatency - FastFillPodStartupLatency error: pod startup: too high latency 99th percentile: got 8.792346318s expected: 5s]

I've been using this test with different than drivers and observed that times needed for different measurements may vary.

}

func isResourceSlicesPublished(config *dependency.Config, namespace string) (bool, error) {
// Get a list of all nodes
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure why changes in this file are needed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've noticed that I haven't deleted commented code in the previous PR :/

@emerbe emerbe force-pushed the update-dra-test-config branch from d405c3e to cba8b3d Compare November 5, 2025 12:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants