
Commit eb1ecbb

tmshort and claude committed
✨ Add e2e profiling toolchain for heap and CPU analysis
Add comprehensive profiling infrastructure to collect, analyze, and compare heap and CPU profiles during e2e test execution.

**Two Profiling Workflows:**

1. **Start/Stop Workflow (Recommended)**
   - Start profiling in background with `make start-profiling`
   - Run ANY test command (make test-e2e, make test-experimental-e2e, etc.)
   - Stop and analyze with `make stop-profiling`
   - Handles cluster teardown gracefully (auto-stops after 3 consecutive failures)
   - Works with tests that tear down clusters (like test-e2e)

2. **Automated Workflow**
   - Run integrated test with `./hack/tools/e2e-profiling/e2e-profile.sh run <name>`
   - Automatically handles profiling lifecycle
   - Best for scripted/automated profiling runs

**Features:**
- Automated heap and CPU profile collection from operator-controller and catalogd
- Real-time profile capture every 10 seconds during test execution
- CPU profiling with 10-second sampling windows running in parallel
- Configurable profile modes: both (default), heap-only, or CPU-only
- Multi-component profiling with separate analysis for each component
- Prometheus alert tracking integrated with profiling reports
- Side-by-side comparison of different test runs
- Graceful cluster teardown detection and auto-stop

**Tooling:**
- `start-profiling.sh`: Start background profiling session
- `stop-profiling.sh`: Stop profiling, cleanup, and analyze
- `common.sh`: Shared library with logging, colors, config, and utilities
- `collect-profiles.sh`: Profile collection loop (used by start/run workflows)
- `analyze-profiles.sh`: Generate detailed analysis with top allocators, growth patterns, and CPU hotspots
- `compare-profiles.sh`: Compare two test runs to identify regressions
- `run-profiled-test.sh`: Orchestrate full profiled test runs (automated workflow)
- `e2e-profile.sh`: Main entry point with subcommands (run/analyze/compare)

**Architecture Improvements:**
- **Shared common library**: All scripts source `common.sh` for consistent logging, colors, and utilities
- **Deployment-based port-forwarding**: Uses `deployment/` references instead of pod names for automatic failover
- **Background execution**: Profiling runs in background using nohup, allowing any test command
- **Intelligent retry logic**: 30-second timeout with 2-second intervals, tests components independently
- **Robust cleanup (EXIT trap)**: Gracefully terminates processes, force-kills if stuck, removes empty profiles
- **Multi-component support**: Profiles operator-controller and catalogd simultaneously in separate directories
- **Cluster teardown detection**: Tracks consecutive failures, auto-stops after 3 failures when cluster is torn down

**Usage:**

Start/Stop Workflow:
```bash
# Start profiling
make PROFILE_NAME=baseline start-profiling

# Run your tests (any command!)
make test-e2e                    # Works! Handles cluster teardown
make test-experimental-e2e       # Works!
go test ./test/e2e/...           # Works!

# Stop and analyze
make stop-profiling
```

Automated Workflow:
```bash
# Run with both heap and CPU profiling (default)
./hack/tools/e2e-profiling/e2e-profile.sh run baseline test-experimental-e2e

# Run with heap-only profiling (reduced overhead)
E2E_PROFILE_MODE=heap ./hack/tools/e2e-profiling/e2e-profile.sh run memory-test

# Analyze results
./hack/tools/e2e-profiling/e2e-profile.sh analyze baseline

# Compare two runs
./hack/tools/e2e-profiling/e2e-profile.sh compare baseline optimized
```

**Configuration:**

Set the `E2E_PROFILE_MODE` environment variable:
- `both` (default): Collect both heap and CPU profiles
- `heap`: Collect only heap profiles (reduces overhead by ~3%)
- `cpu`: Collect only CPU profiles

**Integration:**
- Automatic cleanup of empty profiles from cluster teardown
- Prometheus alert extraction from e2e test summaries
- Detailed markdown reports with memory growth, CPU usage analysis, and recommendations
- Claude Code slash command integration (`/e2e-profile start/stop/run/analyze/compare`)

**Key Implementation Details:**
- Background profiling: Entire collection runs in nohup with exported environment variables
- Fixed interval timing: INTERVAL now includes CPU profiling time rather than adding to it
- Deployment wait: Polls until deployments are created before checking availability
- Component name sanitization: Hyphens converted to underscores for valid bash variable names
- PID tracking for both background process and port-forward cleanup
- Consecutive failure tracking: 3 consecutive failures trigger a graceful auto-stop
- Silent error handling: curl errors suppressed when cluster is being torn down
- 10-second intervals accurately maintained across all profiling modes
- Port-forwards remain stable throughout entire test duration and survive pod restarts
- Conditional profile collection based on PROFILE_MODE setting
- Cleanup runs on EXIT/INT/TERM with graceful shutdown (2.5s timeout) and force-kill
- Code deduplication: Common functions extracted to shared library

**Code Quality:**
- Reduced duplication: Shared common library for logging and utilities
- Improved reliability: Deployment-based port-forwarding survives pod restarts
- Better error handling: Clear timeout messages, automatic retry, robust cleanup
- Flexible workflows: Start/stop for interactive use, automated for CI/CD
- Enhanced documentation: Architecture guide, troubleshooting, workflow examples, and slash commands

**Testing:**

Verified end-to-end with `make test-e2e`:
- Collected 32 heap + 31 CPU profiles per component
- Auto-detected cluster teardown and stopped gracefully
- Generated comprehensive analysis showing peak memory (24MB operator-controller, 16MB catalogd)
- All tests passed with proper cleanup

This tooling was essential for identifying memory optimization opportunities and validating that alert thresholds are correctly calibrated.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Todd Short <tshort@redhat.com>
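Three of the implementation details named in the commit message (hyphen-to-underscore sanitization, an interval that includes collection time, and auto-stop after 3 consecutive failures) can be illustrated with a minimal sketch. The function names `sanitize_component`, `sleep_remaining`, and `collect_loop` are hypothetical and are not taken from the commit's scripts. The collector command is passed in as an argument so the loop can be tried without a cluster.

```bash
#!/usr/bin/env bash
# Illustrative sketch only; all function names here are assumptions,
# not the contents of collect-profiles.sh.

MAX_FAILURES=3   # consecutive failures before auto-stop (per the commit message)

# Convert hyphens to underscores so a component name can be used
# as part of a bash variable name.
sanitize_component() {
  printf '%s\n' "${1//-/_}"
}

# Sleep only for what is left of the interval after the work started at
# $2, so collection time counts toward the interval instead of being
# added on top of it. Resolution is whole seconds (date +%s).
sleep_remaining() {
  local interval=$1 start=$2 now elapsed
  now=$(date +%s)
  elapsed=$(( now - start ))
  if (( elapsed < interval )); then
    sleep $(( interval - elapsed ))
  fi
}

# Run the given command once per interval until it has failed
# MAX_FAILURES times in a row. Any success resets the counter.
collect_loop() {
  local interval=$1; shift
  local failures=0 start
  while (( failures < MAX_FAILURES )); do
    start=$(date +%s)
    if "$@" 2>/dev/null; then
      failures=0
    else
      failures=$(( failures + 1 ))
    fi
    sleep_remaining "$interval" "$start"
  done
  echo "auto-stop after ${failures} consecutive failures"
}
```

For example, `collect_loop 10 curl -sf http://localhost:6060/debug/pprof/heap -o heap.pprof` would poll every 10 seconds and stop once the endpoint has been unreachable three times in a row. The URL and port are assumptions for illustration.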
1 parent 18142b3 commit eb1ecbb

17 files changed: +2732 -2 lines changed

.gitignore

Lines changed: 10 additions & 2 deletions
@@ -38,8 +38,13 @@ vendor/
 \#*\#
 .\#*
 
-# AI temp files files
-.claude/
+# AI temp/local files
+.claude/settings.local.json
+.claude/history/
+.claude/cache/
+.claude/logs/
+.claude/.session*
+.claude/*.log
 
 # documentation website asset folder
 site
@@ -50,3 +55,6 @@ site
 
 # Temporary files and directories
 /test/regression/convert/testdata/tmp/*
+
+# E2E profiling artifacts
+e2e-profiles/

Makefile

Lines changed: 8 additions & 0 deletions
@@ -334,6 +334,14 @@ test-upgrade-experimental-e2e: $(TEST_UPGRADE_E2E_TASKS) #HELP Run upgrade e2e t
 e2e-coverage:
 	COVERAGE_NAME=$(COVERAGE_NAME) ./hack/test/e2e-coverage.sh
 
+.PHONY: start-profiling
+start-profiling: #HELP Start profiling in background. Run your tests, then use 'make stop-profiling'. Use PROFILE_NAME=<name> to specify output name.
+	./hack/tools/e2e-profiling/start-profiling.sh $(PROFILE_NAME)
+
+.PHONY: stop-profiling
+stop-profiling: #HELP Stop profiling and generate analysis report
+	./hack/tools/e2e-profiling/stop-profiling.sh
+
 #SECTION KIND Cluster Operations
 
 .PHONY: kind-load
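Because the `start-profiling` recipe passes `$(PROFILE_NAME)` unquoted, an unset `PROFILE_NAME` expands to nothing and the script receives no argument at all. The sketch below shows one way a script could cope with that. The helper name `resolve_profile_name` and the fallback value `default` are assumptions for illustration, not the behavior of `start-profiling.sh`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: pick the profile name from the first argument.
# The "default" fallback is assumed for illustration only.
resolve_profile_name() {
  printf '%s\n' "${1:-default}"
}

# Left unquoted on purpose, mirroring the Makefile recipe: an empty
# value expands to zero arguments rather than one empty argument.
PROFILE_NAME=""
resolve_profile_name $PROFILE_NAME      # prints: default

PROFILE_NAME="baseline"
resolve_profile_name $PROFILE_NAME      # prints: baseline
```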
