From eb1ecbb44ac1d0d62f0e7d71c9ee517f70a0e537 Mon Sep 17 00:00:00 2001 From: Todd Short Date: Tue, 4 Nov 2025 15:48:56 -0500 Subject: [PATCH] =?UTF-8?q?=E2=9C=A8=20Add=20e2e=20profiling=20toolchain?= =?UTF-8?q?=20for=20heap=20and=20CPU=20analysis?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Add comprehensive profiling infrastructure to collect, analyze, and compare heap and CPU profiles during e2e test execution. **Two Profiling Workflows:** 1. **Start/Stop Workflow (Recommended)** - Start profiling in background with `make start-profiling` - Run ANY test command (make test-e2e, make test-experimental-e2e, etc.) - Stop and analyze with `make stop-profiling` - Handles cluster teardown gracefully (auto-stops after 3 consecutive failures) - Works with tests that tear down clusters (like test-e2e) 2. **Automated Workflow** - Run integrated test with `./hack/tools/e2e-profiling/e2e-profile.sh run ` - Automatically handles profiling lifecycle - Best for scripted/automated profiling runs **Features:** - Automated heap and CPU profile collection from operator-controller and catalogd - Real-time profile capture every 10 seconds during test execution - CPU profiling with 10-second sampling windows running in parallel - Configurable profile modes: both (default), heap-only, or CPU-only - Multi-component profiling with separate analysis for each component - Prometheus alert tracking integrated with profiling reports - Side-by-side comparison of different test runs - Graceful cluster teardown detection and auto-stop **Tooling:** - `start-profiling.sh`: Start background profiling session - `stop-profiling.sh`: Stop profiling, cleanup, and analyze - `common.sh`: Shared library with logging, colors, config, and utilities - `collect-profiles.sh`: Profile collection loop (used by start/run workflows) - `analyze-profiles.sh`: Generate detailed analysis with top allocators, growth patterns, and CPU hotspots - `compare-profiles.sh`: Compare two test runs to identify regressions - `run-profiled-test.sh`: Orchestrate full profiled test runs (automated workflow) - `e2e-profile.sh`: Main entry point with subcommands (run/analyze/compare) **Architecture Improvements:** - **Shared common library**: All scripts source `common.sh` for consistent logging, colors, and utilities - **Deployment-based port-forwarding**: Uses `deployment/` references instead of pod names for automatic failover - **Background execution**: Profiling runs in background using nohup, allowing any test command - **Intelligent retry logic**: 30-second timeout with 2-second intervals, tests components independently - **Robust cleanup (EXIT trap)**: Gracefully terminates processes, force-kills if stuck, removes empty profiles - **Multi-component support**: Profiles operator-controller and catalogd simultaneously in separate directories - **Cluster teardown detection**: Tracks consecutive failures, auto-stops after 3 failures when cluster is torn down **Usage:** Start/Stop Workflow: ```bash # Start profiling make PROFILE_NAME=baseline start-profiling # Run your tests (any command!) make test-e2e # Works! Handles cluster teardown make test-experimental-e2e # Works! go test ./test/e2e/... # Works! 
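
# Profiles accumulate under e2e-profiles/baseline/ while the tests run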
# Stop and analyze
make stop-profiling
```

Automated Workflow:
```bash
# Run with both heap and CPU profiling (default)
./hack/tools/e2e-profiling/e2e-profile.sh run baseline test-experimental-e2e

# Run with heap-only profiling (reduced overhead)
E2E_PROFILE_MODE=heap ./hack/tools/e2e-profiling/e2e-profile.sh run memory-test

# Analyze results
./hack/tools/e2e-profiling/e2e-profile.sh analyze baseline

# Compare two runs
./hack/tools/e2e-profiling/e2e-profile.sh compare baseline optimized
```

**Configuration:**

Set `E2E_PROFILE_MODE` environment variable:
- `both` (default): Collect both heap and CPU profiles
- `heap`: Collect only heap profiles (reduces overhead by ~3%)
- `cpu`: Collect only CPU profiles

**Integration:**
- Automatic cleanup of empty profiles from cluster teardown
- Prometheus alert extraction from e2e test summaries
- Detailed markdown reports with memory growth, CPU usage analysis, and recommendations
- Claude Code slash command integration (`/e2e-profile start/stop/run/analyze/compare`)

**Key Implementation Details:**
- Background profiling: Entire collection runs in nohup with exported environment variables
- Fixed interval timing: INTERVAL now includes CPU profiling time rather than adding to it
- Deployment wait polls until deployments are created before checking availability
- Component name sanitization: Hyphens converted to underscores for valid bash variable names
- PID tracking for both background process and port-forward cleanup
- Consecutive failure tracking: 3 failures trigger a graceful auto-stop
- Silent error handling: curl errors suppressed when cluster is being torn down
- 10-second intervals accurately maintained across all profiling modes
- Port-forwards remain stable throughout entire test duration and survive pod restarts
- Conditional profile collection based on PROFILE_MODE setting
- Cleanup runs on EXIT/INT/TERM with graceful shutdown (2.5s timeout) and force-kill
- Code deduplication: Common functions extracted to shared library

**Code Quality:**
- Reduced duplication: Shared common library for logging and utilities
- Improved reliability: Deployment-based port-forwarding survives pod restarts
- Better error handling: Clear timeout messages, automatic retry, robust cleanup
- Flexible workflows: Start/stop for interactive use, automated for CI/CD
- Enhanced documentation: Architecture guide, troubleshooting, workflow examples, and slash commands

**Testing:**

Verified end-to-end with `make test-e2e`:
- Collected 32 heap + 31 CPU profiles per component
- Auto-detected cluster teardown and stopped gracefully
- Generated comprehensive analysis showing peak memory (24MB operator-controller, 16MB catalogd)
- All tests passed with proper cleanup

This tooling was essential for identifying memory optimization opportunities and validating that alert thresholds are correctly calibrated.
🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Signed-off-by: Todd Short --- .gitignore | 12 +- Makefile | 8 + hack/tools/e2e-profiling/README.md | 759 ++++++++++++++++++ hack/tools/e2e-profiling/analyze-profiles.sh | 391 +++++++++ hack/tools/e2e-profiling/collect-profiles.sh | 69 ++ hack/tools/e2e-profiling/common.sh | 81 ++ hack/tools/e2e-profiling/compare-profiles.sh | 535 ++++++++++++ hack/tools/e2e-profiling/e2e-profile.sh | 113 +++ hack/tools/e2e-profiling/run-profiled-test.sh | 208 +++++ hack/tools/e2e-profiling/start-profiling.sh | 347 ++++++++ hack/tools/e2e-profiling/stop-profiling.sh | 197 +++++ helm/e2e.yaml | 2 + ...mv1-system-catalogd-controller-manager.yml | 3 + ...operator-controller-controller-manager.yml | 3 + helm/olmv1/values.yaml | 2 + manifests/experimental-e2e.yaml | 2 + manifests/standard-e2e.yaml | 2 + 17 files changed, 2732 insertions(+), 2 deletions(-) create mode 100644 hack/tools/e2e-profiling/README.md create mode 100755 hack/tools/e2e-profiling/analyze-profiles.sh create mode 100755 hack/tools/e2e-profiling/collect-profiles.sh create mode 100755 hack/tools/e2e-profiling/common.sh create mode 100755 hack/tools/e2e-profiling/compare-profiles.sh create mode 100755 hack/tools/e2e-profiling/e2e-profile.sh create mode 100755 hack/tools/e2e-profiling/run-profiled-test.sh create mode 100755 hack/tools/e2e-profiling/start-profiling.sh create mode 100755 hack/tools/e2e-profiling/stop-profiling.sh diff --git a/.gitignore b/.gitignore index abd509dafb..3eb851804d 100644 --- a/.gitignore +++ b/.gitignore @@ -38,8 +38,13 @@ vendor/ \#*\# .\#* -# AI temp files files -.claude/ +# AI temp/local files +.claude/settings.local.json +.claude/history/ +.claude/cache/ +.claude/logs/ +.claude/.session* +.claude/*.log # documentation website asset folder site @@ -50,3 +55,6 @@ site # Temporary files and directories /test/regression/convert/testdata/tmp/* + +# E2E profiling artifacts +e2e-profiles/ diff --git a/Makefile b/Makefile index cf7b1d6508..b4df31e4e0 100644 --- a/Makefile +++ b/Makefile @@ -334,6 +334,14 @@ test-upgrade-experimental-e2e: $(TEST_UPGRADE_E2E_TASKS) #HELP Run upgrade e2e t e2e-coverage: COVERAGE_NAME=$(COVERAGE_NAME) ./hack/test/e2e-coverage.sh +.PHONY: start-profiling +start-profiling: #HELP Start profiling in background. Run your tests, then use 'make stop-profiling'. Use PROFILE_NAME= to specify output name. + ./hack/tools/e2e-profiling/start-profiling.sh $(PROFILE_NAME) + +.PHONY: stop-profiling +stop-profiling: #HELP Stop profiling and generate analysis report + ./hack/tools/e2e-profiling/stop-profiling.sh + #SECTION KIND Cluster Operations .PHONY: kind-load diff --git a/hack/tools/e2e-profiling/README.md b/hack/tools/e2e-profiling/README.md new file mode 100644 index 0000000000..4d2780a125 --- /dev/null +++ b/hack/tools/e2e-profiling/README.md @@ -0,0 +1,759 @@ +# E2E Profiling Tools + +Automated e2e profiling and analysis tools for operator-controller e2e tests. 
+ +## Overview + +This plugin helps you: +- **Run e2e tests** with automatic profiling +- **Collect heap and CPU profiles** at regular intervals during test execution +- **Analyze memory usage** patterns and identify allocators +- **Analyze CPU performance** bottlenecks and hotspots +- **Compare test runs** to measure optimization impact +- **Generate reports** with actionable insights + +## Quick Start + +### Option 1: Simple Start/Stop Workflow (Recommended) + +```bash +# Start profiling in the background (uses timestamp as name) +make start-profiling + +# Or specify a custom name +make PROFILE_NAME=baseline start-profiling + +# Run your tests (any command, including full test targets) +make test-e2e # Works! Profiler handles cluster teardown +make test-experimental-e2e # Works! +go test ./test/e2e/... # Works! + +# Stop profiling and view analysis +make stop-profiling +``` + +This workflow: +- Starts port-forwards and profile collection in the background +- Waits for cluster to be ready (if not already running) +- Collects profiles during test execution +- Automatically detects and handles cluster teardown +- Gracefully stops when cluster is torn down +- Lets you run any test command you want + +### Option 2: Automated Test Runner + +```bash +# Run baseline test (automated start/stop) +./hack/tools/e2e-profiling/e2e-profile.sh run baseline + +# Make code changes... + +# Run optimized test +./hack/tools/e2e-profiling/e2e-profile.sh run with-caching + +# Compare results +./hack/tools/e2e-profiling/e2e-profile.sh compare baseline with-caching +``` + +### 3. View Reports + +```bash +# Individual analysis +cat e2e-profiles/baseline/analysis.md + +# Comparison +cat e2e-profiles/comparisons/baseline-vs-with-caching.md +``` + +## Commands + +### `run [test-target]` + +Run an e2e test with continuous e2e profiling. + +```bash +# Run with default test (test-experimental-e2e) +./hack/tools/e2e-profiling/e2e-profile.sh run my-test + +# Run with specific test target +./hack/tools/e2e-profiling/e2e-profile.sh run my-test test-e2e +./hack/tools/e2e-profiling/e2e-profile.sh run my-test test-upgrade-e2e +``` + +**Test Targets:** +- `test-e2e` - Standard e2e tests +- `test-experimental-e2e` - Experimental e2e tests (default) +- `test-extension-developer-e2e` - Extension developer e2e tests +- `test-upgrade-e2e` - Upgrade e2e tests +- `test-upgrade-experimental-e2e` - Upgrade experimental e2e tests + +**What it does:** +1. Starts the specified make test target in the background +2. Waits for operator-controller and catalogd deployments to be ready +3. Establishes port-forwards to deployment endpoints (survives pod restarts) +4. Retries connection to pprof endpoints (30s timeout, 2s intervals) +5. Collects heap and CPU profiles every 10 seconds +6. Continues until test completes or is interrupted (Ctrl+C) +7. Automatically cleans up port-forwards and empty profile files +8. Automatically analyzes results and generates report + +**Output:** +- `e2e-profiles/my-test/operator-controller/heap*.pprof` - Heap profile snapshots +- `e2e-profiles/my-test/operator-controller/cpu*.pprof` - CPU profile snapshots +- `e2e-profiles/my-test/catalogd/heap*.pprof` - Catalogd heap profiles +- `e2e-profiles/my-test/catalogd/cpu*.pprof` - Catalogd CPU profiles +- `e2e-profiles/my-test/test.log` - Test output +- `e2e-profiles/my-test/collection.log` - Collection log +- `e2e-profiles/my-test/analysis.md` - Automated analysis + +### `analyze ` + +Analyze previously collected profiles. 
+ +```bash +./hack/tools/e2e-profiling/e2e-profile.sh analyze my-test +``` + +**What it analyzes:** +- Peak memory usage +- Memory growth patterns +- Top allocators +- OpenAPI-specific allocations +- JSON deserialization overhead +- Dynamic client operations + +**Output:** +- `e2e-profiles/my-test/analysis.md` - Detailed report + +### `compare ` + +Compare two test runs side-by-side. + +```bash +./hack/tools/e2e-profiling/e2e-profile.sh compare baseline optimized +``` + +**What it compares:** +- Peak memory usage +- File size progression +- Top allocators +- OpenAPI allocations +- JSON operations +- Informer/cache usage + +**Output:** +- `e2e-profiles/comparisons/baseline-vs-optimized.md` - Comparison report + +### `start-profiling.sh [output-name]` + +Start profiling in daemon mode (background collection). + +```bash +# Start profiling (uses timestamp as name) +./hack/tools/e2e-profiling/start-profiling.sh + +# Or specify a custom name +./hack/tools/e2e-profiling/start-profiling.sh my-test + +# Or use make target +make start-profiling +``` + +**What it does:** +1. Sets up port-forwards to operator-controller and catalogd +2. Starts profile collection in the background +3. Saves state to `.profiling-state` file +4. Continues collecting until you run `stop-profiling.sh` + +**Use case:** When you want to profile any test command, not just make targets + +### `stop-profiling.sh [--no-analyze]` + +Stop profiling daemon and cleanup. + +```bash +# Stop profiling and run analysis +./hack/tools/e2e-profiling/stop-profiling.sh + +# Or use make target +make stop-profiling + +# Stop without analysis +./hack/tools/e2e-profiling/stop-profiling.sh --no-analyze +``` + +**What it does:** +1. Stops background collection process +2. Kills port-forward processes +3. Cleans up empty profiles +4. Optionally runs analysis (default: yes) + +**Output:** +- `e2e-profiles/[output-name]/*/heap*.pprof` - Collected profiles +- `e2e-profiles/[output-name]/analysis.md` - Analysis report + +### `collect` + +Manually collect a single heap profile. + +```bash +./hack/tools/e2e-profiling/e2e-profile.sh collect +``` + +**Use case:** Quick snapshot during manual testing + +**Output:** +- `e2e-profiles/manual/heap-[timestamp].pprof` + +## Configuration + +Set environment variables to customize behavior: + +```bash +# Namespace where operator-controller runs +export E2E_PROFILE_NAMESPACE=olmv1-system + +# Collection interval in seconds (time between profile snapshots) +# Note: This is the total time including CPU profiling +export E2E_PROFILE_INTERVAL=10 + +# CPU sampling duration in seconds +# Note: CPU profiles are collected in parallel with heap profiles +export E2E_PROFILE_CPU_DURATION=10 + +# Profile collection mode (both, heap, cpu) +# both: Collect both heap and CPU profiles (default) +# heap: Collect only heap profiles (reduces overhead) +# cpu: Collect only CPU profiles +export E2E_PROFILE_MODE=both + +# Output directory +export E2E_PROFILE_DIR=./e2e-profiles + +# Default test target (if not specified on command line) +export E2E_PROFILE_TEST_TARGET=test-experimental-e2e +``` + +**Important:** If `E2E_PROFILE_CPU_DURATION` is set to a value greater than or equal to `E2E_PROFILE_INTERVAL`, CPU profiling will continuously run with no gap between samples. For example: +- `INTERVAL=10, CPU_DURATION=10`: CPU profiles continuously, heap snapshots every 10s +- `INTERVAL=20, CPU_DURATION=10`: 10s CPU sample, 10s idle, heap every 20s +- `INTERVAL=5, CPU_DURATION=10`: Warning - CPU profiling takes longer than interval! 
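+
+As an illustration only (the real logic lives in `collect-profiles.sh`), a collection cycle can keep the interval fixed by fetching the heap snapshot and the CPU sample in parallel and sleeping only for whatever time remains; the pprof endpoints match the ones used elsewhere in this document, and `collect_cycle` is a hypothetical helper:
+
+```bash
+INTERVAL="${E2E_PROFILE_INTERVAL:-10}"
+CPU_DURATION="${E2E_PROFILE_CPU_DURATION:-10}"
+
+collect_cycle() {
+  local i="$1" start=$SECONDS
+  # Heap snapshot and CPU sample run in parallel, so one cycle costs
+  # roughly max(CPU_DURATION, heap fetch time), not their sum
+  curl -s "http://localhost:6060/debug/pprof/heap" -o "heap${i}.pprof" &
+  curl -s "http://localhost:6060/debug/pprof/profile?seconds=${CPU_DURATION}" -o "cpu${i}.pprof" &
+  wait
+  # Sleep only for the remainder of the interval; with CPU_DURATION >= INTERVAL
+  # there is no gap between samples
+  local remaining=$((INTERVAL - (SECONDS - start)))
+  [ "${remaining}" -gt 0 ] && sleep "${remaining}" || true
+}
+```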
+ +## Output Structure + +``` +e2e-profiles/ +├── baseline/ +│ ├── operator-controller/ +│ │ ├── heap0.pprof # Initial heap snapshot +│ │ ├── heap1.pprof # +10s +│ │ ├── cpu0.pprof # Initial CPU profile +│ │ ├── cpu1.pprof # +10s +│ │ └── ... +│ ├── catalogd/ +│ │ ├── heap0.pprof +│ │ ├── cpu0.pprof +│ │ └── ... +│ ├── test.log # Test output +│ ├── collection.log # Collection log +│ └── analysis.md # Analysis report +├── with-caching/ +│ └── ... +├── manual/ +│ └── heap-20251028-113000.pprof +└── comparisons/ + └── baseline-vs-with-caching.md +``` + +## Integration with Claude Code + +Use the `/e2e-profile` slash command in Claude Code: + +``` +/e2e-profile run baseline +``` + +Claude Code will: +1. Execute the profiling script +2. Monitor progress +3. Analyze results +4. Present findings +5. Suggest optimizations + +## Real-World Results + +Using this tooling on the OpenAPI caching optimization revealed: + +**Memory Reduction:** +- Peak usage: 49.6 MB → 41.2 MB (-16.9%) +- OpenAPI allocations: 13 MB → 3.5 MB (-73%) +- Test duration: +2 snapshots (improved stability) + +**Insights Discovered:** +- Repeated schema fetching was the #1 memory consumer +- JSON unmarshaling happened multiple times per schema +- Caching eliminated 73% of OpenAPI-related allocations +- Secondary benefits in JSON decoding (-73%) and reflection (-50%) + +## Architecture + +### Script Organization + +The profiling tools use a modular architecture with shared components: + +- **`common.sh`**: Shared library containing: + - Logging functions (`log_info`, `log_success`, `log_error`, `log_warn`) + - Color definitions for terminal output + - Default configuration values + - Utility functions (`to_absolute_path`, `get_script_dir`) + +- **`e2e-profile.sh`**: Main entry point and command dispatcher +- **`run-profiled-test.sh`**: Orchestrates test execution and profiling +- **`collect-profiles.sh`**: Handles profile collection from deployments +- **`analyze-profiles.sh`**: Generates analysis reports +- **`compare-profiles.sh`**: Creates comparison reports + +### Key Implementation Features + +**Deployment-Based Port-Forwarding** +- Uses `kubectl port-forward deployment/` instead of pod references +- Automatically follows pod restarts and replacements +- More reliable for long-running profiling sessions + +**Intelligent Connection Retry** +- 30-second timeout with 2-second retry intervals +- Tests each component independently +- Exits early when all components connect successfully +- Clear timeout messages if connection fails + +**Robust Cleanup (EXIT Trap)** +- Automatically runs on script exit, interruption (Ctrl+C), or termination +- Gracefully terminates port-forward processes with 2.5s timeout +- Force-kills stuck processes if needed +- Removes empty profile files created during errors or cluster teardown +- Preserves original exit code for proper automation + +**Multi-Component Support** +- Profiles both operator-controller and catalogd simultaneously +- Separate profile directories for each component +- Parallel profile collection with synchronized timing + +## Requirements + +- **kubectl**: Access to Kubernetes cluster +- **go**: For `go tool pprof` +- **make**: For running e2e tests +- **curl**: For fetching profiles +- **bash**: Version 4.0+ (requires associative arrays) + +## Troubleshooting + +### No profiles collected + +**Problem:** `collection.log` shows connection errors + +**Solution:** +1. Check deployment is ready: `kubectl get deployment -n olmv1-system operator-controller-controller-manager` +2. 
Verify pprof is enabled (port 6060) +3. Review retry log in `collection.log` - shows connection attempts every 2 seconds for 30 seconds +4. Test port forwarding manually: `kubectl port-forward -n olmv1-system deployment/operator-controller-controller-manager 6060:6060` +5. Check if pprof endpoint responds: `curl http://localhost:6060/debug/pprof/` + +**Note:** The tool automatically retries connections for 30 seconds. If it fails, check the deployment logs for startup errors. + +### Test exits early + +**Problem:** `test.log` shows test failure before profiling starts + +**Solution:** +1. Run test manually first: `make test-experimental-e2e` +2. Fix test issues before profiling +3. Increase initialization wait time + +### Analysis fails + +**Problem:** `analyze` command errors on pprof + +**Solution:** +1. Ensure all heap files are valid: `file e2e-profiles/*/heap*.pprof` +2. Check for empty files: `find memory-profiles -name "*.pprof" -size 0` +3. Verify go tool pprof works: `go tool pprof --help` + +### Comparison shows no difference + +**Problem:** Both tests show identical memory usage + +**Solution:** +1. Verify code changes were built and deployed +2. Check test is actually using new code +3. Ensure both tests ran under same conditions + +### Port-forward dies during collection + +**Problem:** Port-forward process terminates unexpectedly + +**Solution:** +- Using deployment-based port-forwarding: `kubectl` automatically reconnects if pods restart +- Check `collection.log` for connection retry messages +- Verify deployment is stable: `kubectl get deployment -n olmv1-system` +- If persistent, check cluster network or kubectl version + +### Interrupted profiling leaves processes running + +**Problem:** After Ctrl+C, port-forward processes still running + +**Solution:** +- The enhanced cleanup (EXIT trap) should handle this automatically +- If processes persist, check: `ps aux | grep "kubectl port-forward"` +- Manual cleanup: `pkill -f "kubectl port-forward.*olmv1-system"` +- Empty profile files are automatically removed by the cleanup handler + +### Debug Port Forwarding Issues + +```bash +# Test manual port forward to operator-controller +kubectl port-forward -n olmv1-system \ + deployment/operator-controller-controller-manager \ + 6060:6060 & + +# Test pprof endpoint +curl http://localhost:6060/debug/pprof/ + +# If that works, try collecting manually +curl http://localhost:6060/debug/pprof/heap > test.pprof +go tool pprof -top test.pprof +``` + +### Verify Test is Using New Code + +```bash +# Check image in deployment +kubectl get deployment -n olmv1-system operator-controller-controller-manager -o jsonpath='{.spec.template.spec.containers[0].image}' + +# Check pod is running new image +kubectl get pod -n olmv1-system -l app.kubernetes.io/name=operator-controller -o jsonpath='{.items[0].spec.containers[0].image}' +``` + +### Clean Up After Failed Test + +```bash +# Kill port-forwards +pkill -f "kubectl port-forward.*6060" + +# Clean up partial results +rm -rf e2e-profiles/failed-test + +# Check for hung processes +ps aux | grep -E "(e2e-profile|collect-profiles)" +``` + +## Examples + +### Example 1: Simple Start/Stop Workflow + +```bash +# Start profiling with custom name +make PROFILE_NAME=my-test start-profiling + +# Run any test command +make test-e2e + +# Stop and analyze +make stop-profiling + +# Review results +cat e2e-profiles/my-test/analysis.md +``` + +### Example 2: Baseline Measurement (Automated Runner) + +```bash +# Measure current memory usage 
+./hack/tools/e2e-profiling/e2e-profile.sh run baseline + +# Review results +cat e2e-profiles/baseline/analysis.md +``` + +### Example 3: Test Optimization with Start/Stop + +```bash +# Profile baseline +make PROFILE_NAME=baseline start-profiling +make test-e2e +make stop-profiling + +# Make code changes +# ... implement caching ... + +# Profile optimized version +make PROFILE_NAME=optimized start-profiling +make test-e2e +make stop-profiling + +# Compare +./hack/tools/e2e-profiling/e2e-profile.sh compare baseline optimized +``` + +### Example 4: Test Optimization (Automated Runner) + +```bash +# Run baseline +./hack/tools/e2e-profiling/e2e-profile.sh run before-optimization + +# Make code changes +# ... implement caching ... + +# Rebuild and redeploy +make docker-build docker-push deploy + +# Run optimized test +./hack/tools/e2e-profiling/e2e-profile.sh run after-optimization + +# Compare +./hack/tools/e2e-profiling/e2e-profile.sh compare before-optimization after-optimization +``` + +### Example 5: Heap-Only or CPU-Only Profiling + +```bash +# Collect only heap profiles (reduced overhead for memory analysis) +E2E_PROFILE_MODE=heap ./hack/tools/e2e-profiling/e2e-profile.sh run memory-only + +# Collect only CPU profiles (for performance analysis) +E2E_PROFILE_MODE=cpu ./hack/tools/e2e-profiling/e2e-profile.sh run cpu-only + +# Default: collect both +./hack/tools/e2e-profiling/e2e-profile.sh run both-profiles + +# Or with start/stop workflow +E2E_PROFILE_MODE=heap make PROFILE_NAME=memory-test start-profiling +make test-e2e +make stop-profiling +``` + +**Why use heap-only mode:** +- Reduces profiling overhead from ~5-6% to ~2-3% +- More accurate memory measurements (no CPU profiling interference) +- Faster collection cycles +- When you only need memory analysis + +**Why use CPU-only mode:** +- Focus on performance bottlenecks +- No heap profiling allocation hooks +- When memory is not a concern + +### Example 6: Testing Different Test Suites + +```bash +# Test standard e2e +./hack/tools/e2e-profiling/e2e-profile.sh run standard-e2e test-e2e + +# Test extension developer e2e +./hack/tools/e2e-profiling/e2e-profile.sh run extension-dev test-extension-developer-e2e + +# Test upgrade scenarios +./hack/tools/e2e-profiling/e2e-profile.sh run upgrade test-upgrade-e2e + +# Compare different test suites +./hack/tools/e2e-profiling/e2e-profile.sh compare standard-e2e extension-dev +``` + +### Example 7: Multiple Optimization Attempts + +```bash +# Try different approaches +./hack/tools/e2e-profiling/e2e-profile.sh run attempt1-caching +./hack/tools/e2e-profiling/e2e-profile.sh run attempt2-pooling +./hack/tools/e2e-profiling/e2e-profile.sh run attempt3-both + +# Compare all against baseline +./hack/tools/e2e-profiling/e2e-profile.sh compare baseline attempt1-caching +./hack/tools/e2e-profiling/e2e-profile.sh compare baseline attempt2-pooling +./hack/tools/e2e-profiling/e2e-profile.sh compare baseline attempt3-both +``` + +## Best Practices + +1. **Always run baseline first**: Establish a reference point before making changes + +2. **Use descriptive names**: Use test names that describe what changed + - ✅ `baseline`, `with-openapi-cache`, `with-informer-limit` + - ❌ `test1`, `test2`, `final` + +3. **Run multiple times**: Memory patterns can vary, run 2-3 times for consistency + +4. **Review raw profiles**: Use `go tool pprof` interactively for deep dives + ```bash + cd e2e-profiles/baseline + go tool pprof heap23.pprof + ``` + +5. 
**Keep test conditions consistent**: Same cluster, same data, same duration + +6. **Document changes**: Add notes to analysis.md about what changed + +## Tips and Tricks + +### Quick Peak Finding + +```bash +# Find the profile with highest memory usage +ls -lSh e2e-profiles/my-test/operator-controller/heap*.pprof | head -1 +``` + +### Track Memory Growth Rate + +```bash +# Show file size progression +for f in e2e-profiles/my-test/operator-controller/heap*.pprof; do + echo "$(basename $f): $(stat -c%s $f) bytes" +done | column -t +``` + +### Extract Metrics for Graphing + +```bash +# Create CSV of memory over time +echo "snapshot,bytes" > memory-over-time.csv +for f in e2e-profiles/my-test/operator-controller/heap*.pprof; do + num=$(basename "$f" | sed 's/heap\([0-9]*\).pprof/\1/') + size=$(stat -c%s "$f") + echo "$num,$size" >> memory-over-time.csv +done +``` + +### Alert on Memory Threshold + +```bash +# Check if any profile exceeds threshold +THRESHOLD=$((100 * 1024 * 1024)) # 100 MB + +for f in e2e-profiles/my-test/operator-controller/heap*.pprof; do + size=$(stat -c%s "$f") + if [ $size -gt $THRESHOLD ]; then + echo "WARNING: $(basename $f) exceeds threshold: $size bytes" + fi +done +``` + +### Generate Summary Report + +```bash +# Quick summary of all tests +for test in e2e-profiles/*/; do + if [ -f "$test/analysis.md" ]; then + test_name=$(basename "$test") + peak=$(grep "Peak Memory Usage:" "$test/analysis.md" || echo "N/A") + echo "$test_name: $peak" + fi +done +``` + +## Learning Resources + +### Understanding pprof +- Profiles are gzip-compressed protobuf files +- Two main metrics: `inuse_space` (default) and `alloc_space` +- Flat = allocations in this function +- Cum = allocations in this + called functions + +### Reading Reports +- **Top allocators** = where memory is being allocated +- **Growth analysis** = what changed between snapshots +- **Negative growth** = memory was freed +- **Zero flat, high cum** = memory allocated in child functions + +### Common Patterns +- High `json.Unmarshal` → Consider caching or typed structs +- High `dynamic.List` → Add pagination or field selectors +- High `openapi` calls → Implement caching +- High `Informer` → Deduplicate informers + +## Advanced Usage + +### Interactive Analysis + +```bash +cd e2e-profiles/my-test/operator-controller + +# Top allocators +go tool pprof -top heap23.pprof + +# Call graph (requires graphviz) +go tool pprof -pdf heap23.pprof > analysis.pdf + +# Interactive mode +go tool pprof heap23.pprof +# Use commands: top, list, web, etc. 
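+
+# Allocation totals instead of live heap (alloc_space vs. the default inuse_space);
+# heap23.pprof is just the example snapshot used above
+go tool pprof -sample_index=alloc_space -top heap23.pprof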
+ +# Compare two profiles +go tool pprof -base=heap0.pprof -top heap23.pprof +``` + +### Focus on Specific Patterns + +```bash +cd e2e-profiles/my-test/operator-controller + +# Analyze just OpenAPI allocations +go tool pprof -text heap23.pprof | grep -i openapi > openapi-allocations.txt + +# Analyze just JSON operations +go tool pprof -text heap23.pprof | grep -iE "(json|unmarshal)" > json-allocations.txt + +# Analyze informer overhead +go tool pprof -text heap23.pprof | grep -iE "(informer|cache|watch)" > informer-allocations.txt + +# Find all allocations over 1MB +go tool pprof -text heap23.pprof | awk '$1 ~ /[0-9]+kB/ && $1+0 > 1024' +``` + +### Custom Collection Interval + +```bash +# Collect every 5 seconds +E2E_PROFILE_INTERVAL=5 ./hack/tools/e2e-profiling/e2e-profile.sh run quick-test + +# Collect every 60 seconds +E2E_PROFILE_INTERVAL=60 ./hack/tools/e2e-profiling/e2e-profile.sh run long-test +``` + +### Multiple Namespaces + +```bash +# Profile different namespace +E2E_PROFILE_NAMESPACE=my-namespace \ +./hack/tools/e2e-profiling/e2e-profile.sh run my-controller-test +``` + +## Contributing + +Improvements welcome! Key areas: + +**Completed:** +- [x] Add CPU profiling support +- [x] Add separate heap-only and CPU-only modes +- [x] Deployment-based port-forwarding (survives pod restarts) +- [x] Intelligent connection retry with timeout +- [x] Robust cleanup with graceful shutdown +- [x] Shared common library for code reuse +- [x] Automatic cleanup of empty profile files + +**Future Enhancements:** +- [ ] Add goroutine profiling +- [ ] Support multiple pods (replicas) +- [ ] Add real-time dashboard +- [ ] Support different output formats (JSON, CSV) +- [ ] Add mutex profiling +- [ ] Support custom component configurations + +## License + +See main repository license. + +## See Also + +- [Go pprof documentation](https://pkg.go.dev/net/http/pprof) +- [Profiling Go Programs](https://go.dev/blog/pprof) +- [Kubernetes kubectl port-forward](https://kubernetes.io/docs/tasks/access-application-cluster/port-forward-access-application-cluster/) diff --git a/hack/tools/e2e-profiling/analyze-profiles.sh b/hack/tools/e2e-profiling/analyze-profiles.sh new file mode 100755 index 0000000000..8ce7de6483 --- /dev/null +++ b/hack/tools/e2e-profiling/analyze-profiles.sh @@ -0,0 +1,391 @@ +#!/bin/bash +# +# Analyze collected heap profiles and generate report +# Supports both single-component and multi-component analysis +# +# Usage: analyze-profiles.sh +# + +set -euo pipefail + +# Source common functions +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" +source "${SCRIPT_DIR}/common.sh" + +# Configuration +TEST_NAME="${1:-default}" +OUTPUT_DIR="${E2E_PROFILE_DIR}/${TEST_NAME}" +# Convert to absolute path to avoid issues with cd +OUTPUT_DIR="$(to_absolute_path "${OUTPUT_DIR}")" +REPORT_FILE="${OUTPUT_DIR}/analysis.md" + +# Function to analyze a single component's profiles +# Arguments: component_name component_dir +# Returns: 0 on success, 1 on error +# Appends analysis sections to REPORT_FILE +# Outputs peak memory total to stdout (for capture) +analyze_component() { + local component_name="$1" + local component_dir="$2" + + log_info "Analyzing ${component_name}..." 
>&2 + + # Check if profiles exist for this component + local profile_count=$(find -L "${component_dir}" -maxdepth 1 -name "heap*.pprof" -type f 2>/dev/null | wc -l) + if [ "${profile_count}" -eq 0 ]; then + log_error "No heap profiles found for ${component_name} in ${component_dir}" >&2 + return 1 + fi + + log_info " Found ${profile_count} profiles for ${component_name}" >&2 + + # Find the largest profile (peak memory) + local peak_profile=$(ls -lS "${component_dir}"/heap*.pprof 2>/dev/null | head -1 | awk '{print $NF}') + local peak_name=$(basename "${peak_profile}") + local peak_size=$(du -h "${peak_profile}" | cut -f1) + + log_info " Peak profile: ${peak_name} (${peak_size})" >&2 + + # Get baseline profile + local baseline_profile="${component_dir}/heap0.pprof" + + # Extract peak memory stats + log_info " Extracting peak memory statistics..." >&2 + local peak_stats=$(cd "${component_dir}" && go tool pprof -top "${peak_name}" 2>/dev/null | head -6) + local peak_total=$(echo "${peak_stats}" | grep "^Showing" | awk '{print $7, $8}') + + # Write component header + cat >> "${REPORT_FILE}" << EOF + +## ${component_name} Analysis + +**Profiles Collected:** ${profile_count} +**Peak Profile:** ${peak_name} (${peak_size}) +**Peak Memory Usage:** ${peak_total} + +### Memory Growth + +| Snapshot | File Size | Growth from Previous | +|----------|-----------|---------------------| +EOF + + # Add file sizes with growth + local prev_size=0 + for f in $(ls "${component_dir}"/heap*.pprof 2>/dev/null | sort -V); do + local name=$(basename "$f") + local size=$(stat -c%s "$f") + local size_kb=$((size / 1024)) + + if [ $prev_size -eq 0 ]; then + local growth="baseline" + else + local growth_kb=$((size_kb - prev_size)) + if [ $growth_kb -gt 0 ]; then + growth="+${growth_kb}K" + elif [ $growth_kb -lt 0 ]; then + growth="${growth_kb}K" + else + growth="0" + fi + fi + + echo "| ${name} | ${size_kb}K | ${growth} |" >> "${REPORT_FILE}" + prev_size=$size_kb + done + + # Top allocators from peak + log_info " Extracting top allocators..." >&2 + cat >> "${REPORT_FILE}" << 'EOF' + +### Top Memory Allocators (Peak Profile) + +``` +EOF + + cd "${component_dir}" && go tool pprof -top "${peak_name}" 2>/dev/null | head -20 >> "${REPORT_FILE}" + + cat >> "${REPORT_FILE}" << 'EOF' +``` + +EOF + + # OpenAPI-specific analysis + log_info " Analyzing OpenAPI allocations..." >&2 + cat >> "${REPORT_FILE}" << 'EOF' + +### OpenAPI-Related Allocations + +``` +EOF + + cd "${component_dir}" && go tool pprof -text "${peak_name}" 2>/dev/null | grep -i openapi | head -20 >> "${REPORT_FILE}" || echo "No OpenAPI allocations found" >> "${REPORT_FILE}" + + cat >> "${REPORT_FILE}" << 'EOF' +``` + +EOF + + # Growth analysis (baseline to peak) + if [ -f "${baseline_profile}" ]; then + log_info " Analyzing growth from baseline to peak..." 
>&2 + cat >> "${REPORT_FILE}" << 'EOF' + +### Memory Growth Analysis (Baseline to Peak) + +#### Top Growth Contributors + +``` +EOF + + cd "${component_dir}" && go tool pprof -base="${baseline_profile}" -top "${peak_profile}" 2>/dev/null | head -20 >> "${REPORT_FILE}" + + cat >> "${REPORT_FILE}" << 'EOF' +``` + +#### OpenAPI Growth + +``` +EOF + + cd "${component_dir}" && go tool pprof -base="${baseline_profile}" -text "${peak_profile}" 2>/dev/null | grep -i openapi | head -20 >> "${REPORT_FILE}" || echo "No OpenAPI growth detected" >> "${REPORT_FILE}" + + cat >> "${REPORT_FILE}" << 'EOF' +``` + +#### JSON Deserialization Growth + +``` +EOF + + cd "${component_dir}" && go tool pprof -base="${baseline_profile}" -text "${peak_profile}" 2>/dev/null | grep -iE "(json|unmarshal|decode)" | head -20 >> "${REPORT_FILE}" || echo "No JSON deserialization growth detected" >> "${REPORT_FILE}" + + cat >> "${REPORT_FILE}" << 'EOF' +``` + +#### Dynamic Client Growth + +``` +EOF + + cd "${component_dir}" && go tool pprof -base="${baseline_profile}" -text "${peak_profile}" 2>/dev/null | grep -iE "(dynamic|List|Informer)" | head -20 >> "${REPORT_FILE}" || echo "No dynamic client growth detected" >> "${REPORT_FILE}" + + cat >> "${REPORT_FILE}" << 'EOF' +``` +EOF + fi + + # Return peak memory for summary + echo "${peak_total}" +} + +# Function to analyze a single component's CPU profiles +# Arguments: component_name component_dir +# Returns: 0 on success, 1 on skip (no profiles), 2 on error +# Appends analysis sections to REPORT_FILE +analyze_cpu_component() { + local component_name="$1" + local component_dir="$2" + + log_info "Analyzing ${component_name} CPU profiles..." >&2 + + # Check if CPU profiles exist for this component + local cpu_profile_count=$(find -L "${component_dir}" -maxdepth 1 -name "cpu*.pprof" -type f 2>/dev/null | wc -l) + if [ "${cpu_profile_count}" -eq 0 ]; then + log_info " No CPU profiles found for ${component_name} (skipping CPU analysis)" >&2 + return 1 + fi + + log_info " Found ${cpu_profile_count} CPU profiles for ${component_name}" >&2 + + # Find the largest CPU profile (typically represents most activity) + local peak_cpu_profile=$(ls -lS "${component_dir}"/cpu*.pprof 2>/dev/null | head -1 | awk '{print $NF}') + local peak_cpu_name=$(basename "${peak_cpu_profile}") + local peak_cpu_size=$(du -h "${peak_cpu_profile}" | cut -f1) + + log_info " Peak CPU profile: ${peak_cpu_name} (${peak_cpu_size})" >&2 + + # Extract CPU stats + log_info " Extracting CPU profile statistics..." 
>&2 + local cpu_stats=$(cd "${component_dir}" && go tool pprof -top "${peak_cpu_name}" 2>/dev/null | head -6) + local cpu_total=$(echo "${cpu_stats}" | grep "^Showing" | awk '{print $7, $8}') + + # Write CPU analysis header + cat >> "${REPORT_FILE}" << EOF + +### CPU Profile Analysis + +**CPU Profiles Collected:** ${cpu_profile_count} +**Peak CPU Profile:** ${peak_cpu_name} (${peak_cpu_size}) +**Total CPU Time:** ${cpu_total} + +#### Top CPU Consumers (Peak Profile) + +\`\`\` +EOF + + cd "${component_dir}" && go tool pprof -top "${peak_cpu_name}" 2>/dev/null | head -20 >> "${REPORT_FILE}" + + cat >> "${REPORT_FILE}" << 'EOF' +``` + +#### CPU-Intensive Functions + +\`\`\` +EOF + + # Look for controller reconciliation and other hot paths + cd "${component_dir}" && go tool pprof -text "${peak_cpu_name}" 2>/dev/null | grep -iE "(Reconcile|sync|watch|cache|list)" | head -20 >> "${REPORT_FILE}" || echo "No reconciliation functions found in top CPU consumers" >> "${REPORT_FILE}" + + cat >> "${REPORT_FILE}" << 'EOF' +``` + +#### JSON/Serialization CPU Usage + +\`\`\` +EOF + + cd "${component_dir}" && go tool pprof -text "${peak_cpu_name}" 2>/dev/null | grep -iE "(json|unmarshal|decode|marshal|encode)" | head -15 >> "${REPORT_FILE}" || echo "No significant JSON/serialization CPU usage detected" >> "${REPORT_FILE}" + + cat >> "${REPORT_FILE}" << 'EOF' +``` +EOF + + return 0 +} + +# Check if output directory exists +if [ ! -d "${OUTPUT_DIR}" ]; then + log_error "Directory not found: ${OUTPUT_DIR}" + exit 1 +fi + +# Check for required component directories +if [ ! -d "${OUTPUT_DIR}/operator-controller" ] || [ ! -d "${OUTPUT_DIR}/catalogd" ]; then + log_error "Expected component directories not found!" + log_error "Directory must contain: operator-controller/ and catalogd/ subdirectories" + log_error "Found in ${OUTPUT_DIR}:" + ls -la "${OUTPUT_DIR}" >&2 + exit 1 +fi + +COMPONENTS=("operator-controller" "catalogd") + +# Generate report header +log_info "Generating analysis report..." + +cat > "${REPORT_FILE}" << EOF +# Memory Profile Analysis + +**Test Name:** ${TEST_NAME} +**Date:** $(date '+%Y-%m-%d %H:%M:%S') + +--- + +## Executive Summary + +EOF + +declare -A PEAK_MEMORY + +# Analyze each component +for component in "${COMPONENTS[@]}"; do + component_dir="${OUTPUT_DIR}/${component}" + peak_mem=$(analyze_component "${component}" "${component_dir}") + PEAK_MEMORY[$component]="${peak_mem}" + + # Analyze CPU profiles if available + analyze_cpu_component "${component}" "${component_dir}" || true + + echo "" >> "${REPORT_FILE}" + echo "---" >> "${REPORT_FILE}" +done + +# Insert executive summary after the header +# Create a temporary file with the summary +TEMP_SUMMARY=$(mktemp) +for component in "${COMPONENTS[@]}"; do + echo "- **${component}**: ${PEAK_MEMORY[$component]}" >> "${TEMP_SUMMARY}" +done +echo "" >> "${TEMP_SUMMARY}" + +# Insert the summary after "## Executive Summary" line +awk '/^## Executive Summary/ {print; system("cat '"${TEMP_SUMMARY}"'"); next} 1' "${REPORT_FILE}" > "${REPORT_FILE}.tmp" +mv "${REPORT_FILE}.tmp" "${REPORT_FILE}" +rm "${TEMP_SUMMARY}" + +# Prometheus Alerts Analysis (applies to entire test run, not per-component) +if [ -f "${OUTPUT_DIR}/e2e-summary.md" ]; then + log_info "Analyzing Prometheus alerts..." 
+ cat >> "${REPORT_FILE}" << 'EOF' + +--- + +## Prometheus Alerts + +EOF + + # Extract the Alerts section from the markdown file + # Look for "## Alerts" section and extract until next ## (level-2 header) section + # Note: Use '/^## /' with space to match level-2 headers only, not level-3 (###) + ALERTS_SECTION=$(sed -n '/^## Alerts/,/^## /p' "${OUTPUT_DIR}/e2e-summary.md" | sed '$d' | tail -n +2) + + if [ -n "${ALERTS_SECTION}" ] && [ "${ALERTS_SECTION}" != "None." ]; then + cat >> "${REPORT_FILE}" << EOF +${ALERTS_SECTION} + +Full E2E test summary available at: \`e2e-summary.md\` + +EOF + else + cat >> "${REPORT_FILE}" << 'EOF' +No Prometheus alerts detected during test execution. + +Full E2E test summary available at: `e2e-summary.md` + +EOF + fi +else + cat >> "${REPORT_FILE}" << 'EOF' + +--- + +## Prometheus Alerts + +E2E summary not available. Set `E2E_SUMMARY_OUTPUT` environment variable when running tests to capture alerts. + +EOF +fi + +# Recommendations +cat >> "${REPORT_FILE}" << 'EOF' + +--- + +## Recommendations + +Based on the analysis above, consider: + +1. **OpenAPI Schema Caching**: If OpenAPI allocations are significant, implement caching +2. **Informer Optimization**: Review and deduplicate informer creation +3. **List Operation Limits**: Add pagination or field selectors to reduce list overhead +4. **JSON Optimization**: Consider using typed clients instead of unstructured where possible + +EOF + +log_success "Analysis complete!" +log_info "Report saved to: ${REPORT_FILE}" + +# Display summary +echo "" +echo "=== Quick Summary ===" +echo "Test: ${TEST_NAME}" +for component in "${COMPONENTS[@]}"; do + component_dir="${OUTPUT_DIR}/${component}" + profile_count=$(find -L "${component_dir}" -maxdepth 1 -name "heap*.pprof" -type f 2>/dev/null | wc -l) + peak_profile=$(ls -lS "${component_dir}"/heap*.pprof 2>/dev/null | head -1 | awk '{print $NF}') + peak_name=$(basename "${peak_profile}") + peak_size=$(du -h "${peak_profile}" | cut -f1) + echo "${component}: ${profile_count} profiles, peak ${peak_name} (${peak_size})" +done +echo "" +echo "Full report: ${REPORT_FILE}" diff --git a/hack/tools/e2e-profiling/collect-profiles.sh b/hack/tools/e2e-profiling/collect-profiles.sh new file mode 100755 index 0000000000..57cceeb840 --- /dev/null +++ b/hack/tools/e2e-profiling/collect-profiles.sh @@ -0,0 +1,69 @@ +#!/bin/bash +# +# Collect heap profiles from operator-controller and catalogd during e2e test +# +# Usage: collect-profiles.sh +# +# This script is a wrapper around start-profiling.sh and stop-profiling.sh +# that provides the original collect-profiles.sh interface for backward compatibility. +# + +set -euo pipefail + +# Source common functions +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" +source "${SCRIPT_DIR}/common.sh" + +# Configuration +TEST_NAME="${1:-default}" +STATE_FILE="${E2E_PROFILE_DIR}/.profiling-state" + +# Cleanup function - calls stop-profiling.sh +cleanup() { + local exit_code=$? + + # Check if profiling is actually running + if [ -f "${STATE_FILE}" ]; then + log_info "Stopping profiling..." + # Call stop-profiling.sh without analysis (run-profiled-test.sh handles that) + "${SCRIPT_DIR}/stop-profiling.sh" --no-analyze || true + fi + + exit $exit_code +} + +trap cleanup EXIT INT TERM + +log_info "Starting profile collection for test: ${TEST_NAME}" + +# Start profiling using start-profiling.sh +if ! 
"${SCRIPT_DIR}/start-profiling.sh" "${TEST_NAME}"; then + log_error "Failed to start profiling" + exit 1 +fi + +log_success "Profile collection started" +log_info "Collecting profiles until interrupted..." +log_info "Press Ctrl+C to stop" + +# Read the state file to get collection PID +if [ -f "${STATE_FILE}" ]; then + source "${STATE_FILE}" + + # Wait for collection process to finish (or be interrupted) + # The collection process runs indefinitely until killed + if [ -n "${COLLECTION_PID:-}" ]; then + # Monitor the collection process + while kill -0 "${COLLECTION_PID}" 2>/dev/null; do + sleep 5 + done + + log_warn "Collection process exited unexpectedly" + else + log_error "No collection PID found in state file" + exit 1 + fi +else + log_error "State file not found after starting profiling" + exit 1 +fi diff --git a/hack/tools/e2e-profiling/common.sh b/hack/tools/e2e-profiling/common.sh new file mode 100755 index 0000000000..09c48a5004 --- /dev/null +++ b/hack/tools/e2e-profiling/common.sh @@ -0,0 +1,81 @@ +#!/bin/bash +# +# Common functions and variables for e2e profiling scripts +# +# This file should be sourced by other scripts in this directory. +# Usage: source "$(dirname "${BASH_SOURCE[0]}")/common.sh" +# + +# Prevent multiple sourcing +if [ -n "${E2E_PROFILING_COMMON_LOADED:-}" ]; then + return 0 +fi +E2E_PROFILING_COMMON_LOADED=1 + +# Color codes for output +RED='\033[0;31m' +GREEN='\033[0;32m' +BLUE='\033[0;34m' +YELLOW='\033[1;33m' +NC='\033[0m' # No Color + +# Logging functions +log_info() { + echo -e "${BLUE}[INFO]${NC} $*" +} + +log_success() { + echo -e "${GREEN}[SUCCESS]${NC} $*" +} + +log_error() { + echo -e "${RED}[ERROR]${NC} $*" +} + +log_warn() { + echo -e "${YELLOW}[WARN]${NC} $*" +} + +# Get the directory containing the profiling scripts +# This should work regardless of which script sources this file +get_script_dir() { + local source="${BASH_SOURCE[0]}" + while [ -h "$source" ]; do + local dir="$(cd -P "$(dirname "$source")" && pwd)" + source="$(readlink "$source")" + [[ $source != /* ]] && source="$dir/$source" + done + cd -P "$(dirname "$source")" && pwd +} + +# Convert a path to absolute path +# Arguments: path +# Returns: absolute path (stdout) +to_absolute_path() { + local path="$1" + if [ -d "$path" ]; then + (cd "$path" && pwd) + elif [ -e "$path" ]; then + local dir=$(dirname "$path") + local base=$(basename "$path") + echo "$(cd "$dir" && pwd)/$base" + else + # Path doesn't exist yet - try to resolve parent directory + local parent=$(dirname "$path") + local base=$(basename "$path") + if [ -d "$parent" ]; then + echo "$(cd "$parent" && pwd)/$base" + else + echo "$path" + fi + fi +} + +# Default configuration values +# These can be overridden by environment variables +E2E_PROFILE_NAMESPACE="${E2E_PROFILE_NAMESPACE:-olmv1-system}" +E2E_PROFILE_INTERVAL="${E2E_PROFILE_INTERVAL:-10}" +E2E_PROFILE_DIR="${E2E_PROFILE_DIR:-./e2e-profiles}" +E2E_PROFILE_CPU_DURATION="${E2E_PROFILE_CPU_DURATION:-10}" +E2E_PROFILE_MODE="${E2E_PROFILE_MODE:-both}" +E2E_PROFILE_TEST_TARGET="${E2E_PROFILE_TEST_TARGET:-test-experimental-e2e}" diff --git a/hack/tools/e2e-profiling/compare-profiles.sh b/hack/tools/e2e-profiling/compare-profiles.sh new file mode 100755 index 0000000000..a7068b07c0 --- /dev/null +++ b/hack/tools/e2e-profiling/compare-profiles.sh @@ -0,0 +1,535 @@ +#!/bin/bash +# +# Compare two sets of heap profiles +# +# Usage: compare-profiles.sh +# + +set -euo pipefail + +# Source common functions +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" +source 
"${SCRIPT_DIR}/common.sh" + +# Configuration +TEST1="${1:-}" +TEST2="${2:-}" +BASE_DIR="${E2E_PROFILE_DIR}" +# Convert to absolute paths +BASE_DIR="$(to_absolute_path "${BASE_DIR}")" +TEST1_DIR="${BASE_DIR}/${TEST1}" +TEST2_DIR="${BASE_DIR}/${TEST2}" +COMPARE_DIR="${BASE_DIR}/comparisons" +REPORT_FILE="${COMPARE_DIR}/${TEST1}-vs-${TEST2}.md" + +# Validate inputs +if [ -z "${TEST1}" ] || [ -z "${TEST2}" ]; then + log_error "Usage: $0 " + exit 1 +fi + +if [ ! -d "${TEST1_DIR}" ]; then + log_error "Test directory not found: ${TEST1_DIR}" + exit 1 +fi + +if [ ! -d "${TEST2_DIR}" ]; then + log_error "Test directory not found: ${TEST2_DIR}" + exit 1 +fi + +# Create comparison directory +mkdir -p "${COMPARE_DIR}" + +log_info "Comparing ${TEST1} vs ${TEST2}..." + +# Find peak profiles for each test (look in operator-controller subdirectory) +PEAK1=$(ls -lS "${TEST1_DIR}"/operator-controller/heap*.pprof 2>/dev/null | head -1 | awk '{print $NF}') +PEAK2=$(ls -lS "${TEST2_DIR}"/operator-controller/heap*.pprof 2>/dev/null | head -1 | awk '{print $NF}') + +# Find peak profiles for catalogd +PEAK1_CATALOGD=$(ls -lS "${TEST1_DIR}"/catalogd/heap*.pprof 2>/dev/null | head -1 | awk '{print $NF}') +PEAK2_CATALOGD=$(ls -lS "${TEST2_DIR}"/catalogd/heap*.pprof 2>/dev/null | head -1 | awk '{print $NF}') + +PEAK1_NAME=$(basename "${PEAK1}") +PEAK2_NAME=$(basename "${PEAK2}") +PEAK1_SIZE=$(du -h "${PEAK1}" | cut -f1) +PEAK2_SIZE=$(du -h "${PEAK2}" | cut -f1) + +PEAK1_CATALOGD_NAME=$(basename "${PEAK1_CATALOGD}") +PEAK2_CATALOGD_NAME=$(basename "${PEAK2_CATALOGD}") +PEAK1_CATALOGD_SIZE=$(du -h "${PEAK1_CATALOGD}" | cut -f1) +PEAK2_CATALOGD_SIZE=$(du -h "${PEAK2_CATALOGD}" | cut -f1) + +# Count profiles for both components +COUNT1=$(find "${TEST1_DIR}/operator-controller" -name "heap*.pprof" -type f 2>/dev/null | wc -l) +COUNT2=$(find "${TEST2_DIR}/operator-controller" -name "heap*.pprof" -type f 2>/dev/null | wc -l) +COUNT1_CATALOGD=$(find "${TEST1_DIR}/catalogd" -name "heap*.pprof" -type f 2>/dev/null | wc -l) +COUNT2_CATALOGD=$(find "${TEST2_DIR}/catalogd" -name "heap*.pprof" -type f 2>/dev/null | wc -l) + +log_info "Test 1 operator-controller: ${PEAK1_NAME} (${PEAK1_SIZE}) - ${COUNT1} profiles" +log_info "Test 2 operator-controller: ${PEAK2_NAME} (${PEAK2_SIZE}) - ${COUNT2} profiles" +log_info "Test 1 catalogd: ${PEAK1_CATALOGD_NAME} (${PEAK1_CATALOGD_SIZE}) - ${COUNT1_CATALOGD} profiles" +log_info "Test 2 catalogd: ${PEAK2_CATALOGD_NAME} (${PEAK2_CATALOGD_SIZE}) - ${COUNT2_CATALOGD} profiles" + +# Generate comparison report +log_info "Generating comparison report..." 
+ +cat > "${REPORT_FILE}" << EOF +# Memory Profile Comparison: ${TEST1} vs ${TEST2} + +**Date:** $(date '+%Y-%m-%d %H:%M:%S') + +--- + +## Overview + +### operator-controller + +| Metric | ${TEST1} | ${TEST2} | Change | +|--------|----------|----------|--------| +| Profiles Collected | ${COUNT1} | ${COUNT2} | $((COUNT2 - COUNT1)) | +| Peak Profile | ${PEAK1_NAME} | ${PEAK2_NAME} | - | +| Peak File Size | ${PEAK1_SIZE} | ${PEAK2_SIZE} | - | + +### catalogd + +| Metric | ${TEST1} | ${TEST2} | Change | +|--------|----------|----------|--------| +| Profiles Collected | ${COUNT1_CATALOGD} | ${COUNT2_CATALOGD} | $((COUNT2_CATALOGD - COUNT1_CATALOGD)) | +| Peak Profile | ${PEAK1_CATALOGD_NAME} | ${PEAK2_CATALOGD_NAME} | - | +| Peak File Size | ${PEAK1_CATALOGD_SIZE} | ${PEAK2_CATALOGD_SIZE} | - | + +--- + +## Peak Memory Comparison (operator-controller) + +### ${TEST1} (Baseline) + +\`\`\` +EOF + +cd "${TEST1_DIR}/operator-controller" && go tool pprof -top "${PEAK1_NAME}" 2>/dev/null | head -20 >> "${REPORT_FILE}" + +cat >> "${REPORT_FILE}" << EOF +\`\`\` + +### ${TEST2} (Optimized) + +\`\`\` +EOF + +cd "${TEST2_DIR}/operator-controller" && go tool pprof -top "${PEAK2_NAME}" 2>/dev/null | head -20 >> "${REPORT_FILE}" + +cat >> "${REPORT_FILE}" << EOF +\`\`\` + +--- + +## operator-controller File Size Timeline + +| Snapshot | ${TEST1} | ${TEST2} | Difference | +|----------|----------|----------|------------| +EOF + +# Get list of all heap numbers from both tests (operator-controller) +ALL_NUMS=$( + (ls "${TEST1_DIR}"/operator-controller/heap*.pprof 2>/dev/null | sed 's/.*heap\([0-9]*\)\.pprof/\1/'; + ls "${TEST2_DIR}"/operator-controller/heap*.pprof 2>/dev/null | sed 's/.*heap\([0-9]*\)\.pprof/\1/') \ + | sort -n | uniq +) + +for num in ${ALL_NUMS}; do + f1="${TEST1_DIR}/operator-controller/heap${num}.pprof" + f2="${TEST2_DIR}/operator-controller/heap${num}.pprof" + + if [ -f "${f1}" ]; then + size1_bytes=$(stat -c%s "${f1}") + size1_kb=$((size1_bytes / 1024)) + size1="${size1_kb}K" + else + size1="-" + size1_kb=0 + fi + + if [ -f "${f2}" ]; then + size2_bytes=$(stat -c%s "${f2}") + size2_kb=$((size2_bytes / 1024)) + size2="${size2_kb}K" + else + size2="-" + size2_kb=0 + fi + + if [ "${size1}" != "-" ] && [ "${size2}" != "-" ]; then + diff_kb=$((size2_kb - size1_kb)) + if [ $diff_kb -gt 0 ]; then + diff="+${diff_kb}K" + elif [ $diff_kb -lt 0 ]; then + diff="${diff_kb}K" + else + diff="0" + fi + else + diff="-" + fi + + echo "| heap${num}.pprof | ${size1} | ${size2} | ${diff} |" >> "${REPORT_FILE}" +done + +cat >> "${REPORT_FILE}" << 'EOF' + +--- + +## catalogd File Size Timeline + +| Snapshot | TEST1_NAME | TEST2_NAME | Difference | +|----------|----------|----------|------------| +EOF + +# Get list of all catalogd heap numbers from both tests +ALL_NUMS_CATALOGD=$( + (ls "${TEST1_DIR}"/catalogd/heap*.pprof 2>/dev/null | sed 's/.*heap\([0-9]*\)\.pprof/\1/'; + ls "${TEST2_DIR}"/catalogd/heap*.pprof 2>/dev/null | sed 's/.*heap\([0-9]*\)\.pprof/\1/') \ + | sort -n | uniq +) + +for num in ${ALL_NUMS_CATALOGD}; do + f1="${TEST1_DIR}/catalogd/heap${num}.pprof" + f2="${TEST2_DIR}/catalogd/heap${num}.pprof" + + if [ -f "${f1}" ]; then + size1_bytes=$(stat -c%s "${f1}") + size1_kb=$((size1_bytes / 1024)) + size1="${size1_kb}K" + else + size1="-" + size1_kb=0 + fi + + if [ -f "${f2}" ]; then + size2_bytes=$(stat -c%s "${f2}") + size2_kb=$((size2_bytes / 1024)) + size2="${size2_kb}K" + else + size2="-" + size2_kb=0 + fi + + if [ "${size1}" != "-" ] && [ "${size2}" != "-" ]; then + diff_kb=$((size2_kb - 
size1_kb)) + if [ $diff_kb -gt 0 ]; then + diff="+${diff_kb}K" + elif [ $diff_kb -lt 0 ]; then + diff="${diff_kb}K" + else + diff="0" + fi + else + diff="-" + fi + + echo "| heap${num}.pprof | ${size1} | ${size2} | ${diff} |" >> "${REPORT_FILE}" +done + +cat >> "${REPORT_FILE}" << 'EOF' + +--- + +## operator-controller Analysis + +### OpenAPI Allocations Comparison + +#### TEST1_NAME (Baseline) + +``` +EOF + +cd "${TEST1_DIR}/operator-controller" && go tool pprof -text "${PEAK1_NAME}" 2>/dev/null | grep -i openapi | head -30 >> "${REPORT_FILE}" || echo "No OpenAPI allocations found" >> "${REPORT_FILE}" + +cat >> "${REPORT_FILE}" << 'EOF' +``` + +#### TEST2_NAME (Optimized) + +``` +EOF + +cd "${TEST2_DIR}/operator-controller" && go tool pprof -text "${PEAK2_NAME}" 2>/dev/null | grep -i openapi | head -30 >> "${REPORT_FILE}" || echo "No OpenAPI allocations found" >> "${REPORT_FILE}" + +cat >> "${REPORT_FILE}" << 'EOF' +``` + +### Growth Analysis Comparison + +#### TEST1_NAME Growth (heap0 to peak) + +``` +EOF + +BASELINE1="${TEST1_DIR}/operator-controller/heap0.pprof" +if [ -f "${BASELINE1}" ]; then + cd "${TEST1_DIR}/operator-controller" && go tool pprof -base="heap0.pprof" -top "${PEAK1}" 2>/dev/null | head -20 >> "${REPORT_FILE}" +else + echo "Baseline not available" >> "${REPORT_FILE}" +fi + +cat >> "${REPORT_FILE}" << 'EOF' +``` + +#### TEST2_NAME Growth (heap0 to peak) + +``` +EOF + +BASELINE2="${TEST2_DIR}/operator-controller/heap0.pprof" +if [ -f "${BASELINE2}" ]; then + cd "${TEST2_DIR}/operator-controller" && go tool pprof -base="heap0.pprof" -top "${PEAK2}" 2>/dev/null | head -20 >> "${REPORT_FILE}" +else + echo "Baseline not available" >> "${REPORT_FILE}" +fi + +cat >> "${REPORT_FILE}" << 'EOF' +``` + +### JSON Deserialization Comparison + +#### TEST1_NAME + +``` +EOF + +cd "${TEST1_DIR}/operator-controller" && go tool pprof -text "${PEAK1_NAME}" 2>/dev/null | grep -iE "(json|unmarshal|decode)" | head -20 >> "${REPORT_FILE}" || echo "No JSON operations found" >> "${REPORT_FILE}" + +cat >> "${REPORT_FILE}" << 'EOF' +``` + +#### TEST2_NAME + +``` +EOF + +cd "${TEST2_DIR}/operator-controller" && go tool pprof -text "${PEAK2_NAME}" 2>/dev/null | grep -iE "(json|unmarshal|decode)" | head -20 >> "${REPORT_FILE}" || echo "No JSON operations found" >> "${REPORT_FILE}" + +cat >> "${REPORT_FILE}" << 'EOF' +``` + +### Dynamic Client / Informer Comparison + +#### TEST1_NAME + +``` +EOF + +cd "${TEST1_DIR}/operator-controller" && go tool pprof -text "${PEAK1_NAME}" 2>/dev/null | grep -iE "(dynamic|List|Informer|cache)" | head -20 >> "${REPORT_FILE}" || echo "No dynamic client operations found" >> "${REPORT_FILE}" + +cat >> "${REPORT_FILE}" << 'EOF' +``` + +#### TEST2_NAME + +``` +EOF + +cd "${TEST2_DIR}/operator-controller" && go tool pprof -text "${PEAK2_NAME}" 2>/dev/null | grep -iE "(dynamic|List|Informer|cache)" | head -20 >> "${REPORT_FILE}" || echo "No dynamic client operations found" >> "${REPORT_FILE}" + +cat >> "${REPORT_FILE}" << 'EOF' +``` + +--- + +## catalogd Analysis + +### Peak Memory Comparison + +#### TEST1_NAME (Baseline) + +``` +EOF + +PEAK1_CATALOGD_NAME=$(basename "${PEAK1_CATALOGD}") +cd "${TEST1_DIR}/catalogd" && go tool pprof -top "${PEAK1_CATALOGD_NAME}" 2>/dev/null | head -20 >> "${REPORT_FILE}" + +cat >> "${REPORT_FILE}" << 'EOF' +``` + +#### TEST2_NAME (Optimized) + +``` +EOF + +PEAK2_CATALOGD_NAME=$(basename "${PEAK2_CATALOGD}") +cd "${TEST2_DIR}/catalogd" && go tool pprof -top "${PEAK2_CATALOGD_NAME}" 2>/dev/null | head -20 >> "${REPORT_FILE}" + +cat >> 
"${REPORT_FILE}" << 'EOF' +``` + +### Growth Analysis + +#### TEST1_NAME Growth (heap0 to peak) + +``` +EOF + +BASELINE1_CATALOGD="${TEST1_DIR}/catalogd/heap0.pprof" +if [ -f "${BASELINE1_CATALOGD}" ]; then + cd "${TEST1_DIR}/catalogd" && go tool pprof -base="heap0.pprof" -top "${PEAK1_CATALOGD}" 2>/dev/null | head -20 >> "${REPORT_FILE}" +else + echo "Baseline not available" >> "${REPORT_FILE}" +fi + +cat >> "${REPORT_FILE}" << 'EOF' +``` + +#### TEST2_NAME Growth (heap0 to peak) + +``` +EOF + +BASELINE2_CATALOGD="${TEST2_DIR}/catalogd/heap0.pprof" +if [ -f "${BASELINE2_CATALOGD}" ]; then + cd "${TEST2_DIR}/catalogd" && go tool pprof -base="heap0.pprof" -top "${PEAK2_CATALOGD}" 2>/dev/null | head -20 >> "${REPORT_FILE}" +else + echo "Baseline not available" >> "${REPORT_FILE}" +fi + +cat >> "${REPORT_FILE}" << 'EOF' +``` + +--- + +## Prometheus Alerts Comparison + +EOF + +# Compare prometheus alerts if available +if [ -f "${TEST1_DIR}/e2e-summary.md" ] || [ -f "${TEST2_DIR}/e2e-summary.md" ]; then + # Test 1 alerts + if [ -f "${TEST1_DIR}/e2e-summary.md" ]; then + # Use '/^## /' with space to match level-2 headers only, not level-3 (###) + ALERTS1_SECTION=$(sed -n '/^## Alerts/,/^## /p' "${TEST1_DIR}/e2e-summary.md" | sed '$d' | tail -n +2) + if [ -n "${ALERTS1_SECTION}" ] && [ "${ALERTS1_SECTION}" != "None." ]; then + ALERTS1="Present" + else + ALERTS1="None" + fi + else + ALERTS1="N/A" + fi + + # Test 2 alerts + if [ -f "${TEST2_DIR}/e2e-summary.md" ]; then + # Use '/^## /' with space to match level-2 headers only, not level-3 (###) + ALERTS2_SECTION=$(sed -n '/^## Alerts/,/^## /p' "${TEST2_DIR}/e2e-summary.md" | sed '$d' | tail -n +2) + if [ -n "${ALERTS2_SECTION}" ] && [ "${ALERTS2_SECTION}" != "None." ]; then + ALERTS2="Present" + else + ALERTS2="None" + fi + else + ALERTS2="N/A" + fi + + cat >> "${REPORT_FILE}" << EOF +### Alert Summary + +| Metric | ${TEST1} | ${TEST2} | +|--------|----------|----------| +| Alerts | ${ALERTS1} | ${ALERTS2} | + +EOF + + if [ "${ALERTS1}" = "Present" ] || [ "${ALERTS2}" = "Present" ]; then + cat >> "${REPORT_FILE}" << 'EOF' +### Alert Details + +EOF + + if [ "${ALERTS1}" = "Present" ]; then + cat >> "${REPORT_FILE}" << EOF +**${TEST1}:** +${ALERTS1_SECTION} + +EOF + fi + + if [ "${ALERTS2}" = "Present" ]; then + cat >> "${REPORT_FILE}" << EOF +**${TEST2}:** +${ALERTS2_SECTION} + +EOF + fi + else + cat >> "${REPORT_FILE}" << 'EOF' +No alerts detected in either test. + +EOF + fi +else + cat >> "${REPORT_FILE}" << 'EOF' +E2E summary not available for comparison. + +EOF +fi + +cat >> "${REPORT_FILE}" << 'EOF' + +--- + +## Key Findings + +**Memory Impact:** +- Test duration change: DURATION_CHANGE +- Peak profile size change: SIZE_CHANGE + +**Recommendations:** +1. Review the allocation differences above +2. Look for patterns in eliminated allocations +3. Check if optimization goals were met +4. Identify remaining high allocators + +--- + +## Next Steps + +Based on this comparison, consider: + +1. If memory usage improved: Document the change and create PR +2. If memory usage increased: Investigate unexpected allocations +3. 
If no change: Review whether optimization was correctly applied + +EOF + +# Replace placeholders +sed -i "s/TEST1_NAME/${TEST1}/g" "${REPORT_FILE}" +sed -i "s/TEST2_NAME/${TEST2}/g" "${REPORT_FILE}" + +# Calculate some statistics +DURATION_CHANGE="$((COUNT2 - COUNT1)) snapshots" +sed -i "s/DURATION_CHANGE/${DURATION_CHANGE}/g" "${REPORT_FILE}" + +PEAK1_BYTES=$(stat -c%s "${PEAK1}") +PEAK2_BYTES=$(stat -c%s "${PEAK2}") +DIFF_KB=$(((PEAK2_BYTES - PEAK1_BYTES) / 1024)) +if [ $DIFF_KB -gt 0 ]; then + SIZE_CHANGE="+${DIFF_KB}K (+$(( (DIFF_KB * 100) / (PEAK1_BYTES / 1024) ))%)" +elif [ $DIFF_KB -lt 0 ]; then + SIZE_CHANGE="${DIFF_KB}K ($(( (DIFF_KB * 100) / (PEAK1_BYTES / 1024) ))%)" +else + SIZE_CHANGE="No change" +fi +sed -i "s/SIZE_CHANGE/${SIZE_CHANGE}/g" "${REPORT_FILE}" + +log_success "Comparison complete!" +log_info "Report saved to: ${REPORT_FILE}" + +# Display summary +echo "" +echo "=== Comparison Summary ===" +echo "Test 1: ${TEST1} (${COUNT1} profiles, peak: ${PEAK1_SIZE})" +echo "Test 2: ${TEST2} (${COUNT2} profiles, peak: ${PEAK2_SIZE})" +echo "Duration change: ${DURATION_CHANGE}" +echo "Peak size change: ${SIZE_CHANGE}" +echo "" +echo "Full report: ${REPORT_FILE}" diff --git a/hack/tools/e2e-profiling/e2e-profile.sh b/hack/tools/e2e-profiling/e2e-profile.sh new file mode 100755 index 0000000000..87be1b6a15 --- /dev/null +++ b/hack/tools/e2e-profiling/e2e-profile.sh @@ -0,0 +1,113 @@ +#!/bin/bash +# +# E2E profiling wrapper script +# Main entry point for the e2e profiling plugin +# +# Usage: +# ./e2e-profile.sh run +# ./e2e-profile.sh analyze +# ./e2e-profile.sh compare +# ./e2e-profile.sh collect +# + +set -euo pipefail + +# Source common functions +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" +source "${SCRIPT_DIR}/common.sh" + +usage() { + cat << EOF +E2E Profiling Tool + +USAGE: + $0 [arguments] + +COMMANDS: + run [test-target] Run e2e test with heap and CPU profiling + analyze Analyze collected profiles + compare Compare two test runs + collect Manually collect a single profile + help Show this help message + +TEST TARGETS (for 'run' command): + test-e2e Standard e2e tests + test-experimental-e2e Experimental e2e tests (default) + test-extension-developer-e2e Extension developer e2e tests + test-upgrade-e2e Upgrade e2e tests + test-upgrade-experimental-e2e Upgrade experimental e2e tests + +EXAMPLES: + $0 run baseline + $0 run baseline test-e2e + $0 run with-caching test-experimental-e2e + $0 compare baseline with-caching + $0 analyze baseline + +ENVIRONMENT VARIABLES: + E2E_PROFILE_NAMESPACE Namespace (default: olmv1-system) + E2E_PROFILE_INTERVAL Collection interval in seconds (default: 15) + E2E_PROFILE_CPU_DURATION CPU sampling duration in seconds (default: 10) + E2E_PROFILE_DIR Output directory (default: ./e2e-profiles) + E2E_PROFILE_TEST_TARGET Default test target (default: test-experimental-e2e) + +EOF +} + +# Parse command +COMMAND="${1:-}" + +case "${COMMAND}" in + run) + TEST_NAME="${2:-}" + TEST_TARGET="${3:-}" + if [ -z "${TEST_NAME}" ]; then + log_error "Test name required" + echo "Usage: $0 run [test-target]" + exit 1 + fi + exec "${SCRIPT_DIR}/run-profiled-test.sh" "${TEST_NAME}" "${TEST_TARGET}" + ;; + + analyze) + TEST_NAME="${2:-}" + if [ -z "${TEST_NAME}" ]; then + log_error "Test name required" + echo "Usage: $0 analyze " + exit 1 + fi + exec "${SCRIPT_DIR}/analyze-profiles.sh" "${TEST_NAME}" + ;; + + compare) + TEST1="${2:-}" + TEST2="${3:-}" + if [ -z "${TEST1}" ] || [ -z "${TEST2}" ]; then + log_error "Two test names required" + echo "Usage: 
$0 compare " + exit 1 + fi + exec "${SCRIPT_DIR}/compare-profiles.sh" "${TEST1}" "${TEST2}" + ;; + + collect) + exec "${SCRIPT_DIR}/collect-profiles.sh" "manual" + ;; + + help|--help|-h) + usage + exit 0 + ;; + + "") + log_error "No command specified" + usage + exit 1 + ;; + + *) + log_error "Unknown command: ${COMMAND}" + usage + exit 1 + ;; +esac diff --git a/hack/tools/e2e-profiling/run-profiled-test.sh b/hack/tools/e2e-profiling/run-profiled-test.sh new file mode 100755 index 0000000000..b28dd9e9a5 --- /dev/null +++ b/hack/tools/e2e-profiling/run-profiled-test.sh @@ -0,0 +1,208 @@ +#!/bin/bash +# +# Run e2e test with memory profiling +# +# Usage: run-profiled-test.sh [test-target] +# +# test-target options: +# - test-e2e (standard e2e) +# - test-experimental-e2e (experimental e2e) [default] +# - test-extension-developer-e2e (extension developer e2e) +# - test-upgrade-e2e (upgrade e2e) +# - test-upgrade-experimental-e2e (upgrade experimental e2e) +# + +set -euo pipefail + +# Source common functions +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" +source "${SCRIPT_DIR}/common.sh" + +# Configuration +TEST_NAME="${1:-default}" +TEST_TARGET="${2:-${E2E_PROFILE_TEST_TARGET}}" +OUTPUT_DIR="${E2E_PROFILE_DIR}/${TEST_NAME}" + +# Get absolute path for OUTPUT_DIR +# This is needed because E2E tests may change directories +OUTPUT_DIR_ABS="$(to_absolute_path "${OUTPUT_DIR}")" + +# Create output directory (clean up if it exists) +if [ -d "${OUTPUT_DIR}" ]; then + log_warn "Output directory already exists: ${OUTPUT_DIR}" + log_warn "Removing old profiles to start fresh..." + rm -rf "${OUTPUT_DIR}" +fi +mkdir -p "${OUTPUT_DIR}" + +# Update OUTPUT_DIR_ABS now that directory exists +OUTPUT_DIR_ABS="$(cd "${OUTPUT_DIR}" && pwd)" + +# PIDs to track +TEST_PID="" +COLLECT_PID="" + +# Cleanup function +cleanup() { + log_info "Cleaning up..." + + if [ -n "${COLLECT_PID}" ] && kill -0 "${COLLECT_PID}" 2>/dev/null; then + log_info "Stopping profile collection (PID: ${COLLECT_PID})" + kill "${COLLECT_PID}" 2>/dev/null || true + wait "${COLLECT_PID}" 2>/dev/null || true + fi + + if [ -n "${TEST_PID}" ] && kill -0 "${TEST_PID}" 2>/dev/null; then + log_warn "Test is still running (PID: ${TEST_PID})" + log_warn "Leaving test running. Use 'kill ${TEST_PID}' to stop it." + fi +} + +trap cleanup EXIT INT TERM + +# Check if make target exists +if ! make -n "${TEST_TARGET}" >/dev/null 2>&1; then + log_error "Make target '${TEST_TARGET}' not found" + log_error "Ensure you're in the project root directory" + log_error "Available e2e targets: test-e2e, test-experimental-e2e, test-extension-developer-e2e, test-upgrade-e2e, test-upgrade-experimental-e2e" + exit 1 +fi + +log_info "Starting profiled test: ${TEST_NAME}" +log_info "Test target: ${TEST_TARGET}" +log_info "Output directory: ${OUTPUT_DIR}" + +# Start the e2e test +log_info "Starting e2e test (${TEST_TARGET})..." +# Set E2E_SUMMARY_OUTPUT to capture prometheus alerts and other test metrics +# Use absolute path because e2e tests may change directories +E2E_SUMMARY_OUTPUT="${OUTPUT_DIR_ABS}/e2e-summary.md" make "${TEST_TARGET}" > "${OUTPUT_DIR}/test.log" 2>&1 & +TEST_PID=$! +log_info "Test started (PID: ${TEST_PID})" + +# Give the test some time to start +log_info "Waiting for test to initialize (30 seconds)..." +sleep 30 + +# Check if test is still running +if ! kill -0 "${TEST_PID}" 2>/dev/null; then + # Capture the exit code of the test process + wait "${TEST_PID}" + TEST_EXIT_CODE=$? + log_error "Test exited early with exit code ${TEST_EXIT_CODE}!" 
+ log_error "Check ${OUTPUT_DIR}/test.log for details" + exit "${TEST_EXIT_CODE}" +fi + +# Start profile collection +log_info "Starting profile collection..." +"${SCRIPT_DIR}/collect-profiles.sh" "${TEST_NAME}" > "${OUTPUT_DIR}/collection.log" 2>&1 & +COLLECT_PID=$! +log_info "Profile collection started (PID: ${COLLECT_PID})" + +# Monitor both processes +log_info "Monitoring test and collection..." +log_info "Test PID: ${TEST_PID}" +log_info "Collection PID: ${COLLECT_PID}" +log_info "" +log_info "Press Ctrl+C to stop collection (test will continue)" +log_info "" + +# Wait for either process to finish +while true; do + # Check if test finished + if ! kill -0 "${TEST_PID}" 2>/dev/null; then + log_info "Test completed" + TEST_EXIT=$? + + # Give collection a few more seconds + log_info "Collecting final profiles..." + sleep 30 + + # Stop collection + if [ -n "${COLLECT_PID}" ] && kill -0 "${COLLECT_PID}" 2>/dev/null; then + kill "${COLLECT_PID}" 2>/dev/null || true + fi + + break + fi + + # Check if collection stopped + if ! kill -0 "${COLLECT_PID}" 2>/dev/null; then + log_warn "Profile collection stopped" + log_info "Test is still running (PID: ${TEST_PID})" + log_info "Waiting for test to complete..." + + # Wait for test to finish + wait "${TEST_PID}" 2>/dev/null || true + TEST_EXIT=$? + break + fi + + # Display progress + if [ -d "${OUTPUT_DIR}/operator-controller" ] && [ -d "${OUTPUT_DIR}/catalogd" ]; then + # Multi-component progress + OC_COUNT=$(find -L "${OUTPUT_DIR}/operator-controller" -maxdepth 1 -name "heap*.pprof" -type f 2>/dev/null | wc -l) + CAT_COUNT=$(find -L "${OUTPUT_DIR}/catalogd" -maxdepth 1 -name "heap*.pprof" -type f 2>/dev/null | wc -l) + echo -ne "\r${BLUE}[PROGRESS]${NC} operator-controller: ${OC_COUNT}, catalogd: ${CAT_COUNT} " + else + # Single-component progress + PROFILE_COUNT=$(find -L "${OUTPUT_DIR}" -maxdepth 1 -name "heap*.pprof" -type f 2>/dev/null | wc -l) + if [ "${PROFILE_COUNT}" -gt 0 ]; then + LATEST=$(ls -t "${OUTPUT_DIR}"/heap*.pprof 2>/dev/null | head -1) + LATEST_SIZE=$(du -h "${LATEST}" 2>/dev/null | cut -f1 || echo "?") + echo -ne "\r${BLUE}[PROGRESS]${NC} Profiles: ${PROFILE_COUNT}, Latest: $(basename "${LATEST}") (${LATEST_SIZE}) " + fi + fi + + sleep 5 +done + +echo "" # New line after progress + +# Count collected profiles +if [ -d "${OUTPUT_DIR}/operator-controller" ] && [ -d "${OUTPUT_DIR}/catalogd" ]; then + OC_FINAL=$(find -L "${OUTPUT_DIR}/operator-controller" -maxdepth 1 -name "heap*.pprof" -type f 2>/dev/null | wc -l) + CAT_FINAL=$(find -L "${OUTPUT_DIR}/catalogd" -maxdepth 1 -name "heap*.pprof" -type f 2>/dev/null | wc -l) + FINAL_COUNT=$((OC_FINAL + CAT_FINAL)) + log_success "Profiling complete!" + log_info "Collected ${OC_FINAL} operator-controller profiles and ${CAT_FINAL} catalogd profiles" + log_info "Profiles saved to: ${OUTPUT_DIR}" +else + FINAL_COUNT=$(find -L "${OUTPUT_DIR}" -maxdepth 1 -name "heap*.pprof" -type f 2>/dev/null | wc -l) + log_success "Profiling complete!" + log_info "Collected ${FINAL_COUNT} profiles" + log_info "Profiles saved to: ${OUTPUT_DIR}" +fi + +# Run analysis +log_info "Running analysis..." +if "${SCRIPT_DIR}/analyze-profiles.sh" "${TEST_NAME}"; then + log_success "Analysis complete!" 
+else + log_error "Analysis failed" +fi + +# Display summary +echo "" +echo "=== Test Summary ===" +echo "Test: ${TEST_NAME}" +if [ -d "${OUTPUT_DIR}/operator-controller" ] && [ -d "${OUTPUT_DIR}/catalogd" ]; then + echo "operator-controller: ${OC_FINAL} profiles" + echo "catalogd: ${CAT_FINAL} profiles" +else + echo "Profiles: ${FINAL_COUNT}" +fi +echo "Output: ${OUTPUT_DIR}" +echo "Test Log: ${OUTPUT_DIR}/test.log" +echo "Collection Log: ${OUTPUT_DIR}/collection.log" +echo "Analysis: ${OUTPUT_DIR}/analysis.md" +echo "" + +if [ "${FINAL_COUNT}" -eq 0 ]; then + log_error "No profiles collected!" + log_error "Check collection.log for errors" + exit 1 +fi + +log_success "All done! Review the analysis in ${OUTPUT_DIR}/analysis.md" diff --git a/hack/tools/e2e-profiling/start-profiling.sh b/hack/tools/e2e-profiling/start-profiling.sh new file mode 100755 index 0000000000..5b61fc022a --- /dev/null +++ b/hack/tools/e2e-profiling/start-profiling.sh @@ -0,0 +1,347 @@ +#!/bin/bash +# +# Start profiling in daemon mode +# +# Usage: start-profiling.sh [output-name] +# +# Starts port-forwarding and profile collection in the background. +# Run your test commands, then use stop-profiling.sh to finish. +# +# Example: +# ./start-profiling.sh my-test +# make test-e2e +# ./stop-profiling.sh +# + +set -euo pipefail + +# Source common functions +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" +source "${SCRIPT_DIR}/common.sh" + +# Configuration +OUTPUT_NAME="${1:-$(date +%Y%m%d-%H%M%S)}" +NAMESPACE="${E2E_PROFILE_NAMESPACE}" +INTERVAL="${E2E_PROFILE_INTERVAL}" +OUTPUT_DIR="${E2E_PROFILE_DIR}/${OUTPUT_NAME}" +CPU_PROFILE_DURATION="${E2E_PROFILE_CPU_DURATION}" +PROFILE_MODE="${E2E_PROFILE_MODE}" +STATE_FILE="${E2E_PROFILE_DIR}/.profiling-state" + +# Check if already running +if [ -f "${STATE_FILE}" ]; then + log_error "Profiling is already running!" + log_error "State file exists: ${STATE_FILE}" + log_error "Run 'stop-profiling.sh' first to stop the current session." + exit 1 +fi + +# Validate profile mode +case "${PROFILE_MODE}" in + both|heap|cpu) + ;; + *) + log_error "Invalid E2E_PROFILE_MODE: ${PROFILE_MODE}" + log_error "Valid options: both, heap, cpu" + exit 1 + ;; +esac + +# Create output directory +mkdir -p "${OUTPUT_DIR}" +mkdir -p "$(dirname "${STATE_FILE}")" + +log_info "Starting profiling session: ${OUTPUT_NAME}" +log_info "Output directory: ${OUTPUT_DIR}" +log_info "Profile mode: ${PROFILE_MODE}" +log_info "Interval: ${INTERVAL}s, CPU duration: ${CPU_PROFILE_DURATION}s" + +# Export variables for background script +export PROFILE_SCRIPT_DIR="${SCRIPT_DIR}" +export PROFILE_OUTPUT_NAME="${OUTPUT_NAME}" +export PROFILE_NAMESPACE="${NAMESPACE}" +export PROFILE_INTERVAL="${INTERVAL}" +export PROFILE_OUTPUT_DIR="${OUTPUT_DIR}" +export PROFILE_CPU_DURATION="${CPU_PROFILE_DURATION}" +export PROFILE_MODE_VAR="${PROFILE_MODE}" +export PROFILE_STATE_FILE="${STATE_FILE}" + +# Start everything in background +nohup bash -c ' +set -euo pipefail + +# Source common functions +source "${PROFILE_SCRIPT_DIR}/common.sh" + +# Component configurations +declare -A COMPONENTS=( + ["operator-controller"]="deployment=operator-controller-controller-manager;port=6060;local_port=6060" + ["catalogd"]="deployment=catalogd-controller-manager;port=6060;local_port=6061" +) + +log_info "Background process started, waiting for cluster components..." + +# Wait for namespace to exist +log_info "Waiting for namespace ${PROFILE_NAMESPACE} to exist..." 
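+# The profiling session may be started before the test target has created the
+# cluster, so poll for the namespace (up to 5 minutes, checking every 5s)
+# before attempting any port-forwards.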
+TIMEOUT=300 +ELAPSED=0 +NAMESPACE_CHECK_INTERVAL=5 +while [ $ELAPSED -lt $TIMEOUT ]; do + if kubectl get namespace "${PROFILE_NAMESPACE}" >/dev/null 2>&1; then + log_success "Namespace ${PROFILE_NAMESPACE} exists" + break + fi + sleep $NAMESPACE_CHECK_INTERVAL + ELAPSED=$((ELAPSED + NAMESPACE_CHECK_INTERVAL)) +done + +if [ $ELAPSED -ge $TIMEOUT ]; then + log_error "Namespace ${PROFILE_NAMESPACE} did not appear within ${TIMEOUT} seconds" + rm -f "${PROFILE_STATE_FILE}" + exit 1 +fi + +# Set up port forwarding for all components +declare -A PF_PIDS +for component in "${!COMPONENTS[@]}"; do + # Parse component configuration + IFS=";" read -ra CONFIG <<< "${COMPONENTS[$component]}" + DEPLOYMENT="" + PPROF_PORT="" + LOCAL_PORT="" + + for item in "${CONFIG[@]}"; do + key="${item%%=*}" + value="${item#*=}" + case "$key" in + deployment) DEPLOYMENT="$value" ;; + port) PPROF_PORT="$value" ;; + local_port) LOCAL_PORT="$value" ;; + esac + done + + log_info "Setting up ${component}..." + + # Wait for deployment to exist + log_info "Waiting for deployment ${DEPLOYMENT} to be created..." + TIMEOUT=300 + ELAPSED=0 + while ! kubectl get deployment -n "${PROFILE_NAMESPACE}" "${DEPLOYMENT}" &> /dev/null; do + if [ $ELAPSED -ge $TIMEOUT ]; then + log_error "Deployment ${DEPLOYMENT} was not created within ${TIMEOUT} seconds" + rm -f "${PROFILE_STATE_FILE}" + exit 1 + fi + sleep 2 + ELAPSED=$((ELAPSED + 2)) + done + log_success "Deployment ${DEPLOYMENT} exists" + + # Wait for deployment to become available + log_info "Waiting for deployment ${DEPLOYMENT} to become available..." + if ! kubectl wait --for=condition=Available -n "${PROFILE_NAMESPACE}" deployment "${DEPLOYMENT}" --timeout=300s; then + log_error "Deployment ${DEPLOYMENT} did not become available" + rm -f "${PROFILE_STATE_FILE}" + exit 1 + fi + log_success "Deployment ${DEPLOYMENT} is available" + + # Create component output directory + mkdir -p "${PROFILE_OUTPUT_DIR}/${component}" + + # Set up port forwarding + log_info "Setting up port forwarding to deployment/${DEPLOYMENT}:${PPROF_PORT} -> localhost:${LOCAL_PORT}..." + kubectl port-forward -n "${PROFILE_NAMESPACE}" "deployment/${DEPLOYMENT}" "${LOCAL_PORT}:${PPROF_PORT}" > "${PROFILE_OUTPUT_DIR}/${component}/port-forward.log" 2>&1 & + PF_PIDS[$component]=$! + log_info "Port-forward started (PID: ${PF_PIDS[$component]})" +done + +# Wait for port forwards to be ready +log_info "Waiting for port forwards to initialize..." 
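+# Probe each component's pprof endpoint until it responds. Components are
+# tracked independently, so ones that are already reachable are skipped on
+# subsequent retries; give up after RETRY_TIMEOUT seconds, checking every
+# RETRY_INTERVAL seconds.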
+RETRY_TIMEOUT=30 +RETRY_INTERVAL=2 +ELAPSED=0 + +declare -A READY_COMPONENTS + +while [ ${ELAPSED} -lt ${RETRY_TIMEOUT} ]; do + ALL_READY=true + + for component in "${!COMPONENTS[@]}"; do + if [ "${READY_COMPONENTS[$component]:-}" = "true" ]; then + continue + fi + + IFS=";" read -ra CONFIG <<< "${COMPONENTS[$component]}" + LOCAL_PORT="" + for item in "${CONFIG[@]}"; do + key="${item%%=*}" + value="${item#*=}" + if [ "$key" = "local_port" ]; then + LOCAL_PORT="$value" + break + fi + done + + if curl -s --max-time 2 "http://localhost:${LOCAL_PORT}/debug/pprof/" > /dev/null 2>&1; then + READY_COMPONENTS[$component]="true" + log_success "Connected to ${component} pprof endpoint" + else + ALL_READY=false + fi + done + + if [ "${ALL_READY}" = "true" ]; then + break + fi + + sleep ${RETRY_INTERVAL} + ELAPSED=$((ELAPSED + RETRY_INTERVAL)) +done + +# Verify all components are ready +for component in "${!COMPONENTS[@]}"; do + if [ "${READY_COMPONENTS[$component]:-}" != "true" ]; then + log_error "Failed to connect to ${component} pprof endpoint after ${RETRY_TIMEOUT}s" + log_error "Check ${PROFILE_OUTPUT_DIR}/${component}/port-forward.log for details" + + # Clean up port forwards + for comp in "${!PF_PIDS[@]}"; do + kill "${PF_PIDS[$comp]}" 2>/dev/null || true + done + rm -f "${PROFILE_STATE_FILE}" + exit 1 + fi +done + +log_success "All port forwards ready" + +# Save state with PF PIDs +cat > "${PROFILE_STATE_FILE}" <> "${PROFILE_STATE_FILE}" +done + +# Start profile collection loop +log_info "Starting profile collection (interval: ${PROFILE_INTERVAL}s)" + +n=0 +consecutive_failures=0 +max_consecutive_failures=3 + +while true; do + iteration_start=$(date +%s) + cpu_pids=() + iteration_success=false + + for component in "${!COMPONENTS[@]}"; do + IFS=";" read -ra CONFIG <<< "${COMPONENTS[$component]}" + LOCAL_PORT="" + for item in "${CONFIG[@]}"; do + key="${item%%=*}" + value="${item#*=}" + if [ "$key" = "local_port" ]; then + LOCAL_PORT="$value" + break + fi + done + + # Collect heap profile (if enabled) + if [ "${PROFILE_MODE_VAR}" = "both" ] || [ "${PROFILE_MODE_VAR}" = "heap" ]; then + HEAP_FILE="${PROFILE_OUTPUT_DIR}/${component}/heap${n}.pprof" + if curl -s --max-time 10 "http://localhost:${LOCAL_PORT}/debug/pprof/heap" > "${HEAP_FILE}" 2>/dev/null; then + if [ -s "${HEAP_FILE}" ]; then + SIZE=$(du -h "${HEAP_FILE}" | cut -f1) + log_success "Collected ${component}/heap${n}.pprof (${SIZE})" + iteration_success=true + else + rm "${HEAP_FILE}" 2>/dev/null || true + fi + else + # Silently ignore curl failures - cluster may be down/restarting + rm "${HEAP_FILE}" 2>/dev/null || true + fi + fi + + # Collect CPU profile (in background, if enabled) + if [ "${PROFILE_MODE_VAR}" = "both" ] || [ "${PROFILE_MODE_VAR}" = "cpu" ]; then + CPU_FILE="${PROFILE_OUTPUT_DIR}/${component}/cpu${n}.pprof" + ( + if curl -s --max-time $((PROFILE_CPU_DURATION + 5)) "http://localhost:${LOCAL_PORT}/debug/pprof/profile?seconds=${PROFILE_CPU_DURATION}" > "${CPU_FILE}" 2>/dev/null; then + if [ -s "${CPU_FILE}" ]; then + SIZE=$(du -h "${CPU_FILE}" | cut -f1) + log_success "Collected ${component}/cpu${n}.pprof (${SIZE})" + else + rm "${CPU_FILE}" 2>/dev/null || true + fi + else + # Silently ignore curl failures + rm "${CPU_FILE}" 2>/dev/null || true + fi + ) & + cpu_pids+=($!) 
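+            # The CPU endpoint blocks for PROFILE_CPU_DURATION seconds, so the
+            # request above runs in a background subshell; its PID is recorded
+            # and waited on after the component loop, so the sampling time
+            # counts toward the collection interval instead of being added to it.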
+ fi + done + + n=$((n + 1)) + + # Wait for CPU profiling to complete + for pid in "${cpu_pids[@]}"; do + wait "$pid" 2>/dev/null || true + done + + # Track consecutive failures to detect cluster teardown + if [ "${iteration_success}" = "true" ]; then + consecutive_failures=0 + else + consecutive_failures=$((consecutive_failures + 1)) + if [ ${consecutive_failures} -ge ${max_consecutive_failures} ]; then + log_info "Detected ${max_consecutive_failures} consecutive collection failures - cluster may be down. Stopping collection." + break + fi + fi + + # Maintain consistent interval + iteration_end=$(date +%s) + elapsed=$((iteration_end - iteration_start)) + sleep_time=$((PROFILE_INTERVAL - elapsed)) + + if [ $sleep_time -gt 0 ]; then + sleep "${sleep_time}" + fi +done + +log_info "Profile collection ended" +' > "${OUTPUT_DIR}/startup.log" 2>&1 & + +# Get the PID of the background process +BACKGROUND_PID=$! + +# Wait a moment for the background process to initialize +sleep 1 + +# Create initial state file with background PID +cat > "${STATE_FILE}" </dev/null; then + log_info "Stopping profiling process (PID: ${PID_TO_STOP})" + kill "${PID_TO_STOP}" 2>/dev/null || true + + # Wait for graceful shutdown + wait_count=0 + while kill -0 "${PID_TO_STOP}" 2>/dev/null && [ $wait_count -lt 5 ]; do + sleep 0.5 + wait_count=$((wait_count + 1)) + done + + # Force kill if still running + if kill -0 "${PID_TO_STOP}" 2>/dev/null; then + log_warn "Force killing profiling process" + kill -9 "${PID_TO_STOP}" 2>/dev/null || true + fi + + log_success "Profiling stopped" + else + log_info "Profiling process already stopped (PID: ${PID_TO_STOP})" + fi +else + log_info "No profiling process to stop" +fi + +# Stop port-forward processes +log_info "Stopping port-forward processes..." +pf_stopped=0 +pf_already_stopped=0 +while IFS='=' read -r key value; do + if [[ $key == PF_PID_* ]]; then + component="${key#PF_PID_}" + if [ -n "$value" ]; then + if kill -0 "$value" 2>/dev/null; then + log_info "Stopping port-forward for ${component} (PID: ${value})" + kill "$value" 2>/dev/null || true + + # Wait for graceful shutdown + wait_count=0 + while kill -0 "$value" 2>/dev/null && [ $wait_count -lt 5 ]; do + sleep 0.5 + wait_count=$((wait_count + 1)) + done + + # Force kill if still running + if kill -0 "$value" 2>/dev/null; then + log_warn "Force killing port-forward for ${component}" + kill -9 "$value" 2>/dev/null || true + fi + pf_stopped=$((pf_stopped + 1)) + else + # Already stopped - likely due to cluster teardown + pf_already_stopped=$((pf_already_stopped + 1)) + fi + fi + fi +done < "${STATE_FILE}" + +if [ $pf_stopped -gt 0 ]; then + log_success "Stopped ${pf_stopped} port-forward(s)" +fi +if [ $pf_already_stopped -gt 0 ]; then + log_info "${pf_already_stopped} port-forward(s) already stopped" +fi + +# Clean up empty profile files +log_info "Cleaning up empty profile files..." 
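+# Failed or interrupted collections (typically during cluster teardown) leave
+# zero-byte .pprof files behind; remove them so the analysis tooling does not
+# trip over empty profiles.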
+if [ -d "${OUTPUT_DIR}" ]; then + for component_dir in "${OUTPUT_DIR}"/*/; do + if [ -d "${component_dir}" ]; then + component_name=$(basename "${component_dir}") + + # Clean up empty heap profiles + empty_heap_count=$(find "${component_dir}" -name "heap*.pprof" -type f -size 0 2>/dev/null | wc -l) + if [ "${empty_heap_count}" -gt 0 ]; then + log_info " Removing ${empty_heap_count} empty heap profiles from ${component_name}" + find "${component_dir}" -name "heap*.pprof" -type f -size 0 -delete + fi + + # Clean up empty CPU profiles + empty_cpu_count=$(find "${component_dir}" -name "cpu*.pprof" -type f -size 0 2>/dev/null | wc -l) + if [ "${empty_cpu_count}" -gt 0 ]; then + log_info " Removing ${empty_cpu_count} empty CPU profiles from ${component_name}" + find "${component_dir}" -name "cpu*.pprof" -type f -size 0 -delete + fi + fi + done +fi + +# Count collected profiles +if [ -d "${OUTPUT_DIR}" ]; then + for component_dir in "${OUTPUT_DIR}"/*/; do + if [ -d "${component_dir}" ]; then + component_name=$(basename "${component_dir}") + heap_count=$(find "${component_dir}" -maxdepth 1 -name "heap*.pprof" -type f 2>/dev/null | wc -l) + cpu_count=$(find "${component_dir}" -maxdepth 1 -name "cpu*.pprof" -type f 2>/dev/null | wc -l) + + if [ "${heap_count}" -gt 0 ] || [ "${cpu_count}" -gt 0 ]; then + log_info " ${component_name}: ${heap_count} heap profiles, ${cpu_count} CPU profiles" + fi + fi + done +fi + +# Remove state file +rm -f "${STATE_FILE}" +log_success "Profiling session stopped" + +# Run analysis if requested +if [ "${RUN_ANALYSIS}" = true ]; then + log_info "Running analysis..." + if "${SCRIPT_DIR}/analyze-profiles.sh" "${OUTPUT_NAME}"; then + log_success "Analysis complete!" + echo "" + echo "=== Results ===" + echo "Profiles: ${OUTPUT_DIR}" + echo "Analysis: ${OUTPUT_DIR}/analysis.md" + echo "" + log_info "View analysis:" + log_info " cat ${OUTPUT_DIR}/analysis.md" + else + log_error "Analysis failed" + echo "" + echo "=== Results ===" + echo "Profiles: ${OUTPUT_DIR}" + echo "" + log_info "You can run analysis manually:" + log_info " ./hack/tools/e2e-profiling/analyze-profiles.sh ${OUTPUT_NAME}" + fi +else + log_info "Skipping analysis (use --analyze to enable)" + echo "" + echo "=== Results ===" + echo "Profiles: ${OUTPUT_DIR}" + echo "" + log_info "Run analysis manually:" + log_info " ./hack/tools/e2e-profiling/analyze-profiles.sh ${OUTPUT_NAME}" +fi diff --git a/helm/e2e.yaml b/helm/e2e.yaml index 11d51ddad9..eebf3265bc 100644 --- a/helm/e2e.yaml +++ b/helm/e2e.yaml @@ -6,3 +6,5 @@ options: e2e: enabled: true + profiling: + enabled: true diff --git a/helm/olmv1/templates/deployment-olmv1-system-catalogd-controller-manager.yml b/helm/olmv1/templates/deployment-olmv1-system-catalogd-controller-manager.yml index b3df12139c..092cb7a24e 100644 --- a/helm/olmv1/templates/deployment-olmv1-system-catalogd-controller-manager.yml +++ b/helm/olmv1/templates/deployment-olmv1-system-catalogd-controller-manager.yml @@ -44,6 +44,9 @@ spec: - --leader-elect {{- end }} - --metrics-bind-address=:7443 + {{- if .Values.options.profiling.enabled }} + - --pprof-bind-address=:6060 + {{- end }} - --external-address=catalogd-service.{{ .Values.namespaces.olmv1.name }}.svc {{- range .Values.options.catalogd.features.enabled }} - --feature-gates={{- . 
-}}=true diff --git a/helm/olmv1/templates/deployment-olmv1-system-operator-controller-controller-manager.yml b/helm/olmv1/templates/deployment-olmv1-system-operator-controller-controller-manager.yml index 9ec405a3e0..249610244d 100644 --- a/helm/olmv1/templates/deployment-olmv1-system-operator-controller-controller-manager.yml +++ b/helm/olmv1/templates/deployment-olmv1-system-operator-controller-controller-manager.yml @@ -41,6 +41,9 @@ spec: - args: - --health-probe-bind-address=:8081 - --metrics-bind-address=:8443 + {{- if .Values.options.profiling.enabled }} + - --pprof-bind-address=:6060 + {{- end }} {{- if not .Values.options.tilt.enabled }} - --leader-elect {{- end }} diff --git a/helm/olmv1/values.yaml b/helm/olmv1/values.yaml index 0704f43ef3..5ab9d76721 100644 --- a/helm/olmv1/values.yaml +++ b/helm/olmv1/values.yaml @@ -24,6 +24,8 @@ options: enabled: false e2e: enabled: false + profiling: + enabled: false tilt: enabled: false openshift: diff --git a/manifests/experimental-e2e.yaml b/manifests/experimental-e2e.yaml index 1efa8b8d99..db03c11a8d 100644 --- a/manifests/experimental-e2e.yaml +++ b/manifests/experimental-e2e.yaml @@ -2037,6 +2037,7 @@ spec: - args: - --leader-elect - --metrics-bind-address=:7443 + - --pprof-bind-address=:6060 - --external-address=catalogd-service.olmv1-system.svc - --feature-gates=APIV1MetasHandler=true - --tls-cert=/var/certs/tls.crt @@ -2187,6 +2188,7 @@ spec: - args: - --health-probe-bind-address=:8081 - --metrics-bind-address=:8443 + - --pprof-bind-address=:6060 - --leader-elect - --feature-gates=SingleOwnNamespaceInstallSupport=true - --feature-gates=PreflightPermissions=true diff --git a/manifests/standard-e2e.yaml b/manifests/standard-e2e.yaml index 783beec515..5c95907841 100644 --- a/manifests/standard-e2e.yaml +++ b/manifests/standard-e2e.yaml @@ -1784,6 +1784,7 @@ spec: - args: - --leader-elect - --metrics-bind-address=:7443 + - --pprof-bind-address=:6060 - --external-address=catalogd-service.olmv1-system.svc - --tls-cert=/var/certs/tls.crt - --tls-key=/var/certs/tls.key @@ -1933,6 +1934,7 @@ spec: - args: - --health-probe-bind-address=:8081 - --metrics-bind-address=:8443 + - --pprof-bind-address=:6060 - --leader-elect - --tls-cert=/var/certs/tls.crt - --tls-key=/var/certs/tls.key