Commit a8afab3

tmshort and claude committed
✨ Add e2e profiling toolchain for heap and CPU analysis
Add comprehensive profiling infrastructure to collect, analyze, and compare heap and CPU profiles during e2e test execution.

**Features:**
- Automated heap and CPU profile collection from operator-controller and catalogd
- Real-time profile capture every 15 seconds during test execution
- CPU profiling with 10-second sampling windows
- Multi-component profiling with separate analysis for each component
- Prometheus alert tracking integrated with profiling reports
- Side-by-side comparison of different test runs
- Claude Code integration via /e2e-profile command

**Tooling:**
- `collect-profiles.sh`: Port-forward to pprof endpoints and collect heap/CPU dumps
- `analyze-profiles.sh`: Generate detailed analysis with top allocators, growth patterns, and CPU hotspots
- `compare-profiles.sh`: Compare two test runs to identify regressions
- `run-profiled-test.sh`: Orchestrate full profiled test runs
- `e2e-profile.sh`: Main entry point with subcommands (run/analyze/compare)

**Usage:**
```bash
./hack/tools/e2e-profiling/e2e-profile.sh run baseline test-experimental-e2e
./hack/tools/e2e-profiling/e2e-profile.sh analyze baseline
./hack/tools/e2e-profiling/e2e-profile.sh compare baseline optimized
```

**Integration:**
- Claude Code command: `/e2e-profile` for interactive use
- Automatic cleanup of empty profiles from cluster teardown
- Prometheus alert extraction from e2e test summaries
- Detailed markdown reports with memory growth, CPU usage analysis, and recommendations

**Key Implementation Details:**
- Fixed deployment wait to poll until deployments are created before checking availability
- Fixed PID tracking to only wait for CPU profiling jobs, not port-forward processes
- 10-second CPU profiling works correctly with proper wait handling
- Port-forwards remain stable throughout entire test duration

This tooling was essential for identifying memory optimization opportunities and validating that alert thresholds are correctly calibrated.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
1 parent 0cbb85c commit a8afab3

14 files changed, +2691 -2 lines changed

.claude/commands/e2e-profile.md

Lines changed: 213 additions & 0 deletions
@@ -0,0 +1,213 @@
---
description: Profile memory and CPU usage during e2e tests and analyze results
---

# E2E Profiling Plugin

Analyze memory and CPU usage during e2e tests by collecting pprof heap and CPU profiles and generating comprehensive analysis reports.

## Commands

### /e2e-profile run [test-name] [test-target]

Run an e2e test with continuous memory and CPU profiling:

1. Start the specified e2e test (defaults to `make test-experimental-e2e`)
2. Wait for the operator-controller pod to be ready
3. Collect heap and CPU profiles every 15 seconds to `./e2e-profiles/[test-name]/`
4. Continue until the test completes or is interrupted
5. Generate a summary report with memory and CPU analysis

**Test Targets:**
- `test-e2e` - Standard e2e tests
- `test-experimental-e2e` - Experimental e2e tests (default)
- `test-extension-developer-e2e` - Extension developer e2e tests
- `test-upgrade-e2e` - Upgrade e2e tests
- `test-upgrade-experimental-e2e` - Upgrade experimental e2e tests

**Examples:**
```
/e2e-profile run baseline
/e2e-profile run baseline test-e2e
/e2e-profile run with-caching test-experimental-e2e
/e2e-profile run upgrade-test test-upgrade-e2e
```

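The collection steps above could be sketched roughly as follows. This is an illustrative outline, not the contents of `collect-profiles.sh`: the function names `next_heap_path` and `collect_heap_loop` are hypothetical, and the sketch assumes that a port-forward is already listening on `localhost:6060` and that the component serves Go's standard `/debug/pprof/heap` path.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for illustration only; the real script may differ.

# Print the path of the next sequentially numbered heap profile in a directory.
next_heap_path() {
    local dir="$1" n=0
    while [ -e "${dir}/heap${n}.pprof" ]; do
        n=$((n + 1))
    done
    printf '%s/heap%d.pprof\n' "$dir" "$n"
}

# Fetch a heap profile at a fixed interval until the given PID exits.
# Assumes a port-forward to the pprof endpoint is already running.
collect_heap_loop() {
    local dir="$1" test_pid="$2" interval="${3:-15}" out
    mkdir -p "$dir"
    while kill -0 "$test_pid" 2>/dev/null; do
        out="$(next_heap_path "$dir")"
        # Drop the file if the download failed so no partial profile is kept.
        curl -sf -o "$out" "http://localhost:6060/debug/pprof/heap" || rm -f "$out"
        sleep "$interval"
    done
}
```

A caller would start the test in the background, pass its PID to `collect_heap_loop`, and let the loop end on its own when the test process exits.
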
### /e2e-profile analyze [test-name]

Analyze collected heap profiles for a specific test run:

1. Load all heap profiles from `./e2e-profiles/[test-name]/`
2. Analyze memory growth patterns
3. Identify top allocators
4. Find OpenAPI, JSON, and other hotspots
5. Generate detailed markdown report

**Example:**
```
/e2e-profile analyze baseline
```

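As a rough sketch of the "identify top allocators" step, the snippet below picks the highest-numbered heap profile in a run directory and prints its top in-use allocations with `go tool pprof`. The helper names are hypothetical and this is not the logic of `analyze-profiles.sh`. It selects the file by comparing the numeric suffix, because a plain lexical sort would place `heap9.pprof` after `heap10.pprof`.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for illustration only.

# Print the heap profile with the highest numeric suffix, or fail if none exist.
latest_heap_profile() {
    local dir="$1" f n max=-1
    for f in "$dir"/heap*.pprof; do
        [ -e "$f" ] || continue
        n="${f##*/heap}"    # strip the directory and the "heap" prefix
        n="${n%.pprof}"     # strip the extension
        case "$n" in
            ''|*[!0-9]*) continue ;;   # skip names without a numeric suffix
        esac
        if [ "$n" -gt "$max" ]; then
            max="$n"
        fi
    done
    if [ "$max" -lt 0 ]; then
        return 1
    fi
    printf '%s/heap%s.pprof\n' "$dir" "$max"
}

# Show the largest in-use allocations for one profile (requires the Go toolchain).
top_allocators() {
    go tool pprof -top -sample_index=inuse_space "$1" 2>/dev/null | head -n 25
}
```

For example, `top_allocators "$(latest_heap_profile ./e2e-profiles/baseline)"` would summarize the last snapshot of the `baseline` run.
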
### /e2e-profile compare [test1] [test2]

Compare two test runs to measure the impact of changes:

1. Load profiles from both test runs
2. Compare peak memory usage
3. Compare memory growth rates
4. Identify differences in allocation patterns
5. Generate side-by-side comparison report with charts

**Example:**
```
/e2e-profile compare baseline with-caching
```

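One way to picture the comparison is a percentage change between two peak values plus a profile diff using pprof's `-base` flag. The sketch below is an assumption about how such a comparison could be done, not the behavior of `compare-profiles.sh`. Both function names are hypothetical, and the peak values are assumed to be supplied by the caller in the same unit.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for illustration only.

# Print the signed percentage change from the first value to the second.
percent_change() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        if (a == 0) { print "n/a"; exit }
        printf "%+.1f\n", (b - a) * 100 / a
    }'
}

# Show allocation differences between two heap profiles (requires the Go toolchain).
# Positive values mean the second profile holds more memory than the first.
diff_heap_profiles() {
    go tool pprof -top -sample_index=inuse_space -base "$1" "$2"
}
```

For instance, `percent_change 200 150` reports a 25% reduction, which is the kind of figure a comparison report could place next to the pprof diff.
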
### /e2e-profile collect

Manually collect a single heap profile from the running operator-controller pod:

1. Find the operator-controller pod
2. Set up port forwarding to pprof endpoint
3. Download heap profile
4. Save to `./e2e-profiles/manual/heap-[timestamp].pprof`

**Example:**
```
/e2e-profile collect
```

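The manual collection steps might look like the sketch below. Several details are assumptions and are marked as such: the timestamp format, the fixed two-second wait for the tunnel, and the default deployment name are all illustrative, and both function names are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for illustration only.

# Build the output path; the timestamp format here is an assumption.
manual_heap_path() {
    local base="${1:-./e2e-profiles}"
    printf '%s/manual/heap-%s.pprof\n' "$base" "$(date +%Y%m%d-%H%M%S)"
}

# Port-forward, download one heap profile, then stop the port-forward.
collect_once() {
    local ns="${E2E_PROFILE_NAMESPACE:-olmv1-system}"
    # Assumed resource name; pass the real target as the first argument.
    local target="${1:-deployment/operator-controller-controller-manager}"
    local out pf_pid rc
    out="$(manual_heap_path "${E2E_PROFILE_DIR:-./e2e-profiles}")"
    mkdir -p "$(dirname "$out")"

    kubectl port-forward -n "$ns" "$target" 6060:6060 >/dev/null 2>&1 &
    pf_pid=$!
    sleep 2   # crude wait for the tunnel; a real script would poll the port

    curl -sf -o "$out" "http://localhost:6060/debug/pprof/heap"
    rc=$?
    kill "$pf_pid" 2>/dev/null
    if [ "$rc" -eq 0 ]; then
        echo "$out"
    fi
    return "$rc"
}
```

Stopping the port-forward inside the function keeps port 6060 free for the next collection.
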
## Task Breakdown

When you invoke this command, I will:

1. **Setup Phase**
   - Create `./e2e-profiles/[test-name]` directory
   - Verify `make test-experimental-e2e` is available
   - Check kubectl access to the cluster

2. **Collection Phase**
   - Start the e2e test in background
   - Monitor for pod readiness
   - Set up port forwarding to pprof endpoint (port 6060)
   - Collect heap profiles every 15 seconds
   - Save profiles with sequential naming (heap0.pprof, heap1.pprof, ...)

3. **Monitoring Phase**
   - Track test progress
   - Monitor profile file sizes for growth patterns
   - Detect if test crashes or completes

4. **Analysis Phase**
   - Use `go tool pprof` to analyze profiles
   - Extract key metrics:
     - Peak memory usage
     - Memory growth over time
     - Top allocators
     - OpenAPI-related allocations
     - JSON deserialization overhead
     - Informer/cache allocations

5. **Reporting Phase**
   - Generate markdown report with:
     - Executive summary
     - Memory timeline chart
     - Top allocators table
     - Allocation breakdown
     - Recommendations for optimization

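The "monitor for pod readiness" step relates to a fix described in the commit message: poll until the deployment exists before checking availability. A minimal sketch of that idea follows, with hypothetical function names and illustrative retry counts and timeouts. It relies on the fact that `kubectl wait` returns an error when the named resource has not been created yet.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for illustration only.

# Retry a command until it succeeds or the attempt budget runs out.
# Usage: poll_until <attempts> <delay-seconds> <command> [args...]
poll_until() {
    local attempts="$1" delay="$2" i=0
    shift 2
    while [ "$i" -lt "$attempts" ]; do
        if "$@"; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}

# Wait for a deployment to exist, then for it to report Available.
wait_for_deployment() {
    local ns="$1" name="$2"
    poll_until 60 5 kubectl get deployment "$name" -n "$ns" -o name \
        >/dev/null 2>&1 || return 1
    kubectl wait --for=condition=Available "deployment/$name" \
        -n "$ns" --timeout=300s
}
```

Separating the existence check from the availability check avoids a race where the test target has not yet created the deployment when profiling starts.
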
## Configuration

The plugin uses these defaults (customizable via environment variables):

```bash
# Namespace where operator-controller runs
E2E_PROFILE_NAMESPACE=olmv1-system

# Collection interval in seconds
E2E_PROFILE_INTERVAL=15

# CPU sampling duration in seconds
E2E_PROFILE_CPU_DURATION=10

# Output directory base
E2E_PROFILE_DIR=./e2e-profiles

# Default test target
E2E_PROFILE_TEST_TARGET=test-experimental-e2e
```

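A minimal sketch of how a script could honor these variables, assuming it uses standard shell parameter expansion so that a value exported by the caller wins over the documented default. The function name `load_profile_config` is hypothetical, and the actual scripts may read their configuration differently.

```bash
#!/usr/bin/env bash
# Hypothetical helper: assign the documented default only when the
# variable is unset or empty.
load_profile_config() {
    : "${E2E_PROFILE_NAMESPACE:=olmv1-system}"
    : "${E2E_PROFILE_INTERVAL:=15}"
    : "${E2E_PROFILE_CPU_DURATION:=10}"
    : "${E2E_PROFILE_DIR:=./e2e-profiles}"
    : "${E2E_PROFILE_TEST_TARGET:=test-experimental-e2e}"
}
```

With this pattern, running a script as `E2E_PROFILE_INTERVAL=30 ./hack/tools/e2e-profiling/e2e-profile.sh run baseline` would change only the interval and leave the other defaults in place.
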
## Output Structure

```
e2e-profiles/
├── baseline/
│   ├── heap0.pprof
│   ├── heap1.pprof
│   ├── ...
│   ├── heap23.pprof
│   ├── test.log
│   └── analysis.md
├── with-caching/
│   ├── heap0.pprof
│   ├── ...
│   └── analysis.md
└── comparisons/
    └── baseline-vs-with-caching.md
```

## Tool Location

The memory profiling scripts are located at:
```
hack/tools/e2e-profiling/
├── e2e-profile.sh              # Main entry point
├── run-profiled-test.sh        # Run test with profiling
├── collect-profiles.sh         # Collect heap profiles
├── analyze-profiles.sh         # Generate analysis
├── compare-profiles.sh         # Compare two runs
├── README.md                   # Full documentation
└── USAGE_EXAMPLES.md           # Real-world examples
```

You can run them directly:
```bash
./hack/tools/e2e-profiling/e2e-profile.sh run baseline
./hack/tools/e2e-profiling/e2e-profile.sh analyze baseline
./hack/tools/e2e-profiling/e2e-profile.sh compare baseline optimized
```

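Since `e2e-profile.sh` is described as the main entry point with subcommands, its dispatch could be pictured as a simple `case` statement. The mapping below from subcommand to script is an assumption inferred from the file names above, and both function names are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of subcommand dispatch; the mapping is assumed.

# Print the script that would handle a subcommand, or fail for unknown ones.
script_for_subcommand() {
    case "$1" in
        run)     echo "run-profiled-test.sh" ;;
        analyze) echo "analyze-profiles.sh" ;;
        compare) echo "compare-profiles.sh" ;;
        collect) echo "collect-profiles.sh" ;;
        *)       return 1 ;;
    esac
}

# Run the matching script with the remaining arguments.
dispatch() {
    local tools_dir="./hack/tools/e2e-profiling" script
    script="$(script_for_subcommand "${1:-}")" || {
        echo "usage: e2e-profile.sh {run|analyze|compare|collect} [args...]" >&2
        return 2
    }
    shift
    "$tools_dir/$script" "$@"
}
```

Keeping the lookup in its own function makes the mapping easy to check without invoking any of the underlying scripts.
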
## Requirements

- kubectl with access to the cluster
- go tool pprof
- make (for running tests)
- curl (for fetching profiles)
- Port 6060 available for forwarding

## Example Workflow

```bash
# 1. Run baseline test with profiling
/e2e-profile run baseline

# 2. Make code changes (e.g., add caching)
# ... edit code ...

# 3. Run new test with profiling
/e2e-profile run with-caching

# 4. Compare results
/e2e-profile compare baseline with-caching

# 5. Review the comparison report
# Opens: e2e-profiles/comparisons/baseline-vs-with-caching.md
```

## Notes

- The test will run until completion or manual interruption (Ctrl+C)
- Each heap profile is ~11-150KB depending on memory usage
- Analysis requires all heap files to be present
- Port forwarding runs in background and auto-cleans on exit
- Reports are generated in markdown format for easy viewing
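
Because analysis expects every heap file to be usable, and the commit message mentions automatic cleanup of empty profiles left by cluster teardown, a cleanup step might look like the sketch below. The function name is hypothetical and the real scripts may implement this differently.

```bash
#!/usr/bin/env bash
# Hypothetical helper: delete zero-byte .pprof files under a run directory.
prune_empty_profiles() {
    local dir="$1"
    find "$dir" -type f -name '*.pprof' -size 0 -exec rm -f {} +
}
```

Running `prune_empty_profiles ./e2e-profiles/baseline` before analysis would remove snapshots that were written after the pprof endpoint had already gone away.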

.gitignore

Lines changed: 10 additions & 2 deletions
@@ -38,8 +38,13 @@ vendor/
 \#*\#
 .\#*
 
-# AI temp files files
-.claude/
+# AI temp/local files
+.claude/settings.local.json
+.claude/history/
+.claude/cache/
+.claude/logs/
+.claude/.session*
+.claude/*.log
 
 # documentation website asset folder
 site
@@ -50,3 +55,6 @@ site
 
 # Temporary files and directories
 /test/regression/convert/testdata/tmp/*
+
+# E2E profiling artifacts
+e2e-profiles/
