Commit afb80ec

tmshort and claude committed
✨ Add memory profiling plugin for e2e tests
Add comprehensive memory profiling and analysis tooling to help identify and optimize memory usage during e2e test execution.

Features:
- Automatic heap profile collection during test runs
- Support for all e2e test types (test-e2e, test-experimental-e2e, test-extension-developer-e2e, test-upgrade-e2e, test-upgrade-experimental-e2e)
- Detailed analysis of memory allocators and growth patterns
- Side-by-side comparison of test runs
- Integration with Claude Code via /memory-profile command

Usage:
  ./hack/tools/memory-profiling/memory-profile.sh run <test-name> [test-target]
  ./hack/tools/memory-profiling/memory-profile.sh analyze <test-name>
  ./hack/tools/memory-profiling/memory-profile.sh compare <test1> <test2>

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

✨ Add Prometheus alerts tracking to memory profiling

Enhance memory profiling plugin to capture and analyze Prometheus alerts during e2e test execution:

1. Set E2E_SUMMARY_OUTPUT environment variable during test runs
   - Captures prometheus alerts, test failures, and other metrics
   - Saved to e2e-summary.json in the profile directory
2. Updated analyze-profiles.sh to extract and display alerts
   - Parses e2e-summary.json using jq (if available)
   - Shows alert names, severities, and descriptions
   - Includes test failures in the analysis report
3. Updated compare-profiles.sh to compare alerts between runs
   - Shows alert counts for both tests
   - Lists alerts detected in each test
   - Helps identify if optimizations introduced new alerts

This allows correlating memory usage with system health metrics, making it easier to identify if memory optimizations have any negative side effects on system stability.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

🐛 Fix: Use absolute path for E2E_SUMMARY_OUTPUT

The e2e tests change directories during execution, causing the relative path to e2e-summary.json to fail. Convert OUTPUT_DIR to an absolute path before passing it to E2E_SUMMARY_OUTPUT.

This fixes the error: 'failed to write e2e test summary output: no such file or directory'

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

♻️ Rename e2e-summary.json to e2e-summary.md

The E2E test summary file contains Markdown content (with mermaid charts), not JSON. This updates the file extension and parsing logic throughout the memory profiling scripts to correctly handle the Markdown format.

Changes:
- run-profiled-test.sh: Output to e2e-summary.md instead of .json
- analyze-profiles.sh: Parse Markdown Alerts section instead of JSON
- compare-profiles.sh: Compare Markdown alert sections between tests

✨ Add multi-component memory profiling support

Refactored memory profiling tools to simultaneously analyze both operator-controller and catalogd components:

**collect-profiles.sh:**
- Hardcoded dual-component collection (operator-controller + catalogd)
- Separate port forwards: localhost:6060 → operator-controller:6060, localhost:6061 → catalogd:6060
- Creates component subdirectories for organized profile storage
- Collects profiles from both components simultaneously

**analyze-profiles.sh:**
- Removed backward compatibility for single-component analysis
- Now requires both component directories to exist
- Added analyze_component() function with stderr redirects to prevent log pollution in captured output
- Generates combined analysis.md with sections for each component
- Executive Summary shows peak memory for both components
- Prometheus Alerts section remains test-wide (not per-component)

**run-profiled-test.sh:**
- Updated progress display to show both components: "operator-controller: X, catalogd: Y"
- Enhanced summary output for dual-component results

Directory structure:

memory-profiles/
└── test-name/
    ├── operator-controller/
    │   └── heap*.pprof
    ├── catalogd/
    │   └── heap*.pprof
    ├── analysis.md
    └── e2e-summary.md

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

🐛 Clean output directory before profiling runs

Previously, run-profiled-test.sh would create the output directory with mkdir -p but not clean up existing files. This caused old profile files to remain and potentially confuse analysis when mixing single-component and multi-component data.

Now the script:
- Warns if the output directory already exists
- Removes the entire directory to start fresh
- Creates a clean directory for the new profiling run

This ensures each test run has clean, isolated data.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

🐛 Fix Prometheus alert parsing to include level-3 headers

The sed pattern '/^##/' was incorrectly matching both level-2 (##) and level-3 (###) markdown headers, causing it to stop at "### Firing Alerts" instead of continuing to "## Performance". This resulted in empty alert sections even when alerts were present.

Changed pattern to '/^## /' (with space) to match only level-2 headers.

**Impact:**
- analyze-profiles.sh now correctly extracts and displays all alerts
- compare-profiles.sh now correctly detects alerts in both test runs

**Example:**
Before: "No Prometheus alerts detected" (incorrect)
After: Shows both pending alerts:
- operator-controller-memory-growth: 132.4kB/sec
- operator-controller-memory-usage: 107.9MB

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

♻️ Delete empty heap profiles after collection

Empty heap profiles (0 bytes) are created when the cluster is torn down and the pod is killed during the final profile collection attempt. These empty files skew analysis results by showing memory as 0K at the end, incorrectly suggesting all memory was freed.

Now collect-profiles.sh deletes any empty heap*.pprof files at the end of collection, ensuring analysis only processes valid profiles captured during actual test execution.

**Impact:**
- Profile counts now reflect actual valid profiles (e.g., 25 instead of 26)
- Memory growth tables no longer show misleading "0K" final entry
- Peak memory correctly identified from real execution data

**Example:**
Before: heap24 (160K) → heap25 (0K) - misleading
After: heap24 (160K) as actual peak

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
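The header-matching fix described in the "Fix Prometheus alert parsing" message can be illustrated with a small sketch. Only the `/^##/` and `/^## /` patterns and the `### Firing Alerts` and `## Performance` headings come from the commit message; the sample summary text, the `## Alerts` heading name, and the `extract_alerts` helper are assumptions made for illustration, not the code in analyze-profiles.sh.

```bash
#!/usr/bin/env bash
# Hypothetical sample of an e2e summary; the real file layout may differ.
summary="$(mktemp)"
cat > "$summary" <<'EOF'
# E2E Summary
## Alerts
### Firing Alerts
- operator-controller-memory-growth: 132.4kB/sec
- operator-controller-memory-usage: 107.9MB
## Performance
(charts)
EOF

# Print the body of the assumed "## Alerts" section from file $1.
# The sed range ends at the next line matching the pattern in $2;
# lines that match that pattern (the headers) are not printed.
extract_alerts() {
  sed -n '/^## Alerts/,/'"$2"'/{/'"$2"'/!p;}' "$1"
}

buggy="$(extract_alerts "$summary" '^##')"   # also matches "### ..." headers
fixed="$(extract_alerts "$summary" '^## ')"  # matches level-2 headers only

echo "buggy pattern: ${buggy:-<empty>}"
echo "fixed pattern:"
echo "$fixed"
rm -f "$summary"
```

With the shorter pattern the range closes on the first level-3 header, so nothing between the headers is captured; adding the trailing space lets the range run to the next level-2 header.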
1 parent 1f6612c commit afb80ec

16 files changed: +2588 -3 lines changed

.claude/commands/memory-profile.md

Lines changed: 216 additions & 0 deletions
@@ -0,0 +1,216 @@
---
description: Profile memory usage during e2e tests and analyze results
---

# Memory Profiling Plugin

Analyze memory usage during e2e tests by collecting pprof heap profiles and generating comprehensive analysis reports.

## Commands

### /memory-profile run [test-name] [test-target]

Run an e2e test with continuous memory profiling:

1. Start the specified e2e test (defaults to `make test-experimental-e2e`)
2. Wait for the operator-controller pod to be ready
3. Collect heap profiles every 15 seconds to `./memory-profiles/[test-name]/`
4. Continue until the test completes or is interrupted
5. Generate a summary report

**Test Targets:**
- `test-e2e` - Standard e2e tests
- `test-experimental-e2e` - Experimental e2e tests (default)
- `test-extension-developer-e2e` - Extension developer e2e tests
- `test-upgrade-e2e` - Upgrade e2e tests
- `test-upgrade-experimental-e2e` - Upgrade experimental e2e tests

**Examples:**
```
/memory-profile run baseline
/memory-profile run baseline test-e2e
/memory-profile run with-caching test-experimental-e2e
/memory-profile run upgrade-test test-upgrade-e2e
```
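
The defaulting of `[test-target]` described above could be handled roughly as follows. This is a minimal sketch: `resolve_target` is a hypothetical helper name, and the actual scripts may validate targets differently or not at all.

```bash
#!/usr/bin/env bash
# Resolve the make target for a profiled run, falling back to the
# documented default. resolve_target is a hypothetical helper.
resolve_target() {
  local target="${1:-test-experimental-e2e}"
  case "$target" in
    test-e2e|test-experimental-e2e|test-extension-developer-e2e|\
    test-upgrade-e2e|test-upgrade-experimental-e2e)
      printf '%s\n' "$target" ;;
    *)
      echo "unknown test target: $target" >&2
      return 1 ;;
  esac
}
```

A caller might then run `make "$(resolve_target "$2")"`, so that an omitted second argument selects `test-experimental-e2e`.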

### /memory-profile analyze [test-name]

Analyze collected heap profiles for a specific test run:

1. Load all heap profiles from `./memory-profiles/[test-name]/`
2. Analyze memory growth patterns
3. Identify top allocators
4. Find OpenAPI, JSON, and other hotspots
5. Generate detailed markdown report

**Example:**
```
/memory-profile analyze baseline
```
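
As a rough sketch of the first steps above, the profiles can be listed in numeric order (so `heap2` precedes `heap10`) and passed to `go tool pprof` one at a time. The helper names `list_profiles` and `analyze_top` are hypothetical, and the choice of `inuse_space` as the sample index is an assumption about what the analysis reports.

```bash
#!/usr/bin/env bash
# List heapN.pprof files in a directory, ordered by their numeric index.
list_profiles() {
  local dir="$1" f n
  for f in "$dir"/heap*.pprof; do
    [ -e "$f" ] || continue          # glob matched nothing
    n="${f##*/heap}"                 # strip path and "heap" prefix
    n="${n%.pprof}"                  # strip extension, leaving the index
    printf '%s\t%s\n' "$n" "$f"
  done | sort -n | cut -f2-
}

# Print the top in-use allocators for each profile (requires the Go toolchain).
analyze_top() {
  list_profiles "$1" | while IFS= read -r profile; do
    echo "=== ${profile##*/} ==="
    go tool pprof -top -sample_index=inuse_space "$profile" 2>/dev/null | head -n 15
  done
}
```

For example, `analyze_top ./memory-profiles/baseline` would print one short table per profile in collection order.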

### /memory-profile compare [test1] [test2]

Compare two test runs to measure the impact of changes:

1. Load profiles from both test runs
2. Compare peak memory usage
3. Compare memory growth rates
4. Identify differences in allocation patterns
5. Generate side-by-side comparison report with charts

**Example:**
```
/memory-profile compare baseline with-caching
```
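
The peak-memory comparison can be sketched with a small percent-change helper, plus `go tool pprof -base` to diff two profiles. The function names here are hypothetical, and the report path simply mirrors the `comparisons/` layout shown in this document.

```bash
#!/usr/bin/env bash
# Percent change from a baseline value ($1) to a new value ($2), same units.
percent_change() {
  awk -v old="$1" -v new="$2" 'BEGIN {
    if (old == 0) { print "n/a"; exit }
    printf "%+.1f%%\n", (new - old) / old * 100
  }'
}

# Path of the comparison report for two named runs.
comparison_report_path() {
  printf '%s/comparisons/%s-vs-%s.md\n' \
    "${MEMORY_PROFILE_DIR:-./memory-profiles}" "$1" "$2"
}

# Show allocation differences between two profiles (requires the Go toolchain).
compare_top() {
  go tool pprof -top -sample_index=inuse_space -base "$1" "$2"
}
```

For instance, `percent_change 160 120` prints `-25.0%`, meaning the second run peaked 25% lower than the first.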

### /memory-profile collect

Manually collect a single heap profile from the running operator-controller pod:

1. Find the operator-controller pod
2. Set up port forwarding to pprof endpoint
3. Download heap profile
4. Save to `./memory-profiles/manual/heap-[timestamp].pprof`

**Example:**
```
/memory-profile collect
```
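
The four steps above might look roughly like this. It is a sketch under assumptions: the timestamp format, the fixed two-second wait, and the `/debug/pprof/heap` path (the standard Go `net/http/pprof` route) are not confirmed by this document, and `manual_profile_path` and `collect_once` are hypothetical names.

```bash
#!/usr/bin/env bash
NAMESPACE="${MEMORY_PROFILE_NAMESPACE:-olmv1-system}"
POD_LABEL="${MEMORY_PROFILE_POD_LABEL:-app.kubernetes.io/name=operator-controller}"
PPROF_PORT="${MEMORY_PROFILE_PPROF_PORT:-6060}"
OUT_BASE="${MEMORY_PROFILE_DIR:-./memory-profiles}"

# Build the output path; the timestamp format is an assumption.
manual_profile_path() {
  printf '%s/manual/heap-%s.pprof\n' "$OUT_BASE" "$(date +%Y%m%d-%H%M%S)"
}

# Find the pod, forward the pprof port, fetch one heap profile, clean up.
collect_once() {
  local pod out pf_pid rc
  pod="$(kubectl get pods -n "$NAMESPACE" -l "$POD_LABEL" \
        -o jsonpath='{.items[0].metadata.name}')" || return 1
  out="$(manual_profile_path)"
  mkdir -p "$(dirname "$out")"

  kubectl port-forward -n "$NAMESPACE" "pod/$pod" \
    "$PPROF_PORT:$PPROF_PORT" >/dev/null 2>&1 &
  pf_pid=$!
  sleep 2   # crude wait for the tunnel; a real script would poll instead
  curl -fsS -o "$out" "http://localhost:$PPROF_PORT/debug/pprof/heap"
  rc=$?
  kill "$pf_pid" 2>/dev/null
  [ "$rc" -eq 0 ] && echo "saved $out"
  return "$rc"
}
```

Calling `collect_once` against a running cluster would leave a single timestamped profile under `memory-profiles/manual/`.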

## Task Breakdown

When you invoke this command, I will:

1. **Setup Phase**
   - Create `./memory-profiles/[test-name]` directory
   - Verify `make test-experimental-e2e` is available
   - Check kubectl access to the cluster

2. **Collection Phase**
   - Start the e2e test in background
   - Monitor for pod readiness
   - Set up port forwarding to pprof endpoint (port 6060)
   - Collect heap profiles every 15 seconds
   - Save profiles with sequential naming (heap0.pprof, heap1.pprof, ...)

3. **Monitoring Phase**
   - Track test progress
   - Monitor profile file sizes for growth patterns
   - Detect if test crashes or completes

4. **Analysis Phase**
   - Use `go tool pprof` to analyze profiles
   - Extract key metrics:
     - Peak memory usage
     - Memory growth over time
     - Top allocators
     - OpenAPI-related allocations
     - JSON deserialization overhead
     - Informer/cache allocations

5. **Reporting Phase**
   - Generate markdown report with:
     - Executive summary
     - Memory timeline chart
     - Top allocators table
     - Allocation breakdown
     - Recommendations for optimization
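
The collection phase above, with its sequential naming and fixed interval, can be sketched as a loop that runs while the test process is alive. This assumes port forwarding is already in place and that the heap endpoint is served at `/debug/pprof/heap`; `next_profile_path` and `collect_loop` are hypothetical names.

```bash
#!/usr/bin/env bash
# Next free sequential profile path (heap0.pprof, heap1.pprof, ...) in a directory.
next_profile_path() {
  local dir="$1" n=0
  while [ -e "$dir/heap$n.pprof" ]; do n=$((n + 1)); done
  printf '%s/heap%d.pprof\n' "$dir" "$n"
}

# Collect a profile every interval until the test process (PID $2) exits.
collect_loop() {
  local dir="$1" test_pid="$2" interval="${MEMORY_PROFILE_INTERVAL:-15}"
  mkdir -p "$dir"
  while kill -0 "$test_pid" 2>/dev/null; do
    curl -fsS -o "$(next_profile_path "$dir")" \
      "http://localhost:${MEMORY_PROFILE_PPROF_PORT:-6060}/debug/pprof/heap" \
      || echo "profile fetch failed; will retry" >&2
    sleep "$interval"
  done
}
```

One possible invocation is `make test-experimental-e2e > test.log 2>&1 &` followed by `collect_loop ./memory-profiles/baseline $!`.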

## Configuration

The plugin uses these defaults (customizable via environment variables):

```bash
# Namespace where operator-controller runs
MEMORY_PROFILE_NAMESPACE=olmv1-system

# Deployment name to monitor
MEMORY_PROFILE_DEPLOYMENT=operator-controller-controller-manager

# Label selector for pod
MEMORY_PROFILE_POD_LABEL="app.kubernetes.io/name=operator-controller"

# Pprof endpoint port
MEMORY_PROFILE_PPROF_PORT=6060

# Collection interval in seconds
MEMORY_PROFILE_INTERVAL=15

# Output directory base
MEMORY_PROFILE_DIR=./memory-profiles
```
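
One common way for a script to honor such overrides is the `:=` parameter expansion, which assigns the default only when the variable is unset or empty. This is a sketch of that pattern, not necessarily how the plugin's scripts read their settings.

```bash
#!/usr/bin/env bash
# Assign each setting only if the caller has not already provided a value.
: "${MEMORY_PROFILE_NAMESPACE:=olmv1-system}"
: "${MEMORY_PROFILE_DEPLOYMENT:=operator-controller-controller-manager}"
: "${MEMORY_PROFILE_POD_LABEL:=app.kubernetes.io/name=operator-controller}"
: "${MEMORY_PROFILE_PPROF_PORT:=6060}"
: "${MEMORY_PROFILE_INTERVAL:=15}"
: "${MEMORY_PROFILE_DIR:=./memory-profiles}"

echo "profiling $MEMORY_PROFILE_DEPLOYMENT in $MEMORY_PROFILE_NAMESPACE every ${MEMORY_PROFILE_INTERVAL}s"
```

With this pattern, running a script as `MEMORY_PROFILE_INTERVAL=5 ./script.sh` would shorten the interval while leaving the other defaults untouched.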

## Output Structure

```
memory-profiles/
├── baseline/
│   ├── heap0.pprof
│   ├── heap1.pprof
│   ├── ...
│   ├── heap23.pprof
│   ├── test.log
│   └── analysis.md
├── with-caching/
│   ├── heap0.pprof
│   ├── ...
│   └── analysis.md
└── comparisons/
    └── baseline-vs-with-caching.md
```

## Tool Location

The memory profiling scripts are located at:
```
hack/tools/memory-profiling/
├── memory-profile.sh         # Main entry point
├── run-profiled-test.sh      # Run test with profiling
├── collect-profiles.sh       # Collect heap profiles
├── analyze-profiles.sh       # Generate analysis
├── compare-profiles.sh       # Compare two runs
├── README.md                 # Full documentation
└── USAGE_EXAMPLES.md         # Real-world examples
```

You can run them directly:
```bash
./hack/tools/memory-profiling/memory-profile.sh run baseline
./hack/tools/memory-profiling/memory-profile.sh analyze baseline
./hack/tools/memory-profiling/memory-profile.sh compare baseline optimized
```

## Requirements

- kubectl with access to the cluster
- go tool pprof
- make (for running tests)
- curl (for fetching profiles)
- Port 6060 available for forwarding

## Example Workflow

```bash
# 1. Run baseline test with profiling
/memory-profile run baseline

# 2. Make code changes (e.g., add caching)
# ... edit code ...

# 3. Run new test with profiling
/memory-profile run with-caching

# 4. Compare results
/memory-profile compare baseline with-caching

# 5. Review the comparison report
# Opens: memory-profiles/comparisons/baseline-vs-with-caching.md
```

## Notes

- The test will run until completion or manual interruption (Ctrl+C)
- Each heap profile is ~11-150KB depending on memory usage
- Analysis requires all heap files to be present
- Port forwarding runs in background and auto-cleans on exit
- Reports are generated in markdown format for easy viewing
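
Related to the notes on profile size and completeness: the commit message says collect-profiles.sh deletes zero-byte `heap*.pprof` files left behind when the pod is killed mid-collection. A minimal sketch of that cleanup follows; `prune_empty_profiles` is a hypothetical name and the real script may implement the step differently.

```bash
#!/usr/bin/env bash
# Remove zero-byte heap profiles from a directory and print how many
# non-empty profiles remain.
prune_empty_profiles() {
  local dir="$1" f kept=0
  for f in "$dir"/heap*.pprof; do
    [ -e "$f" ] || continue        # glob matched nothing
    if [ -s "$f" ]; then           # -s: file exists and is non-empty
      kept=$((kept + 1))
    else
      rm -f "$f"
    fi
  done
  echo "$kept"
}
```

Running it once after collection, for example `prune_empty_profiles ./memory-profiles/baseline/operator-controller`, keeps a trailing empty file from appearing as a misleading "0K" final entry in the analysis.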

.claude/settings.local.json

Lines changed: 18 additions & 0 deletions
@@ -0,0 +1,18 @@
{
  "permissions": {
    "allow": [
      "Read(//home/tshort/experimental-e2e-testing/**)",
      "Bash(go tool pprof:*)",
      "Bash(for i in 0 5 10 15 20 23)",
      "Bash(do echo \"=== heap$i.pprof ===\")",
      "Bash(done)",
      "Bash(awk:*)",
      "Bash(go doc:*)",
      "Bash(go list:*)",
      "Read(//home/tshort/go/pkg/mod/k8s.io/**)",
      "Bash(go build:*)"
    ],
    "deny": [],
    "ask": []
  }
}

.gitignore

Lines changed: 3 additions & 3 deletions
@@ -38,9 +38,6 @@ vendor/
 \#*\#
 .\#*
 
-# AI temp files files
-.claude/
-
 # documentation website asset folder
 site
 
@@ -50,3 +47,6 @@ site
 
 # Temporary files and directories
 /test/regression/convert/testdata/tmp/*
+
+# Memory profiling artifacts
+memory-profiles/
Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
# Ignore generated analysis files in example directories
memory-profiles/*/analysis.md
memory-profiles/comparisons/
