---
description: Profile memory and CPU usage during e2e tests and analyze results
---

# E2E Profiling Plugin

Analyze memory and CPU usage during e2e tests by collecting pprof heap and CPU profiles and generating comprehensive analysis reports.

## Commands

### /e2e-profile start [test-name]

Start profiling in background mode (recommended workflow):

1. Start port-forwards to operator-controller and catalogd
2. Begin collecting heap and CPU profiles every 10 seconds
3. Run in background, allowing you to run any test command
4. Auto-detect cluster teardown and stop gracefully
5. Use `/e2e-profile stop` to finish and analyze

**Examples:**
```
/e2e-profile start baseline
# Then run: make test-e2e
# Then run: /e2e-profile stop
```

This workflow:
- Works with ANY test command (make test-e2e, make test-experimental-e2e, custom commands)
- Handles cluster teardown gracefully (test-e2e tears down cluster)
- Auto-stops after 3 consecutive collection failures
- Lets you run tests your way
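
Under the hood, `start` amounts to a pair of background port-forwards plus the collection loop described under Task Breakdown below. A minimal sketch of the port-forward half, assuming the deployments are named `operator-controller-controller-manager` and `catalogd-controller-manager`, that both serve pprof on port 6060, and that catalogd is forwarded to local port 6061 (the actual logic lives in `hack/tools/e2e-profiling/start-profiling.sh`):

```bash
# Sketch only -- deployment names, the catalogd local port, and the PID file
# path are assumptions; see start-profiling.sh for the real implementation.
NS="${E2E_PROFILE_NAMESPACE:-olmv1-system}"
PID_FILE=/tmp/e2e-profile-port-forwards.pid

# Forward each component's pprof port in the background and remember the PIDs
# so a later `stop` can clean them up.
kubectl -n "$NS" port-forward deploy/operator-controller-controller-manager 6060:6060 &
echo $! >> "$PID_FILE"
kubectl -n "$NS" port-forward deploy/catalogd-controller-manager 6061:6060 &
echo $! >> "$PID_FILE"
```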

### /e2e-profile stop

Stop background profiling session and generate analysis:

1. Stop profile collection process
2. Kill port-forward processes (or detect they're already stopped)
3. Clean up empty profile files
4. Generate comprehensive analysis report

**Example:**
```
/e2e-profile stop
```

### /e2e-profile run [test-name] [test-target]

Run an e2e test with continuous memory and CPU profiling (automated workflow):

1. Start the specified e2e test (defaults to `make test-experimental-e2e`)
2. Wait for the operator-controller pod to be ready
3. Collect heap and CPU profiles every 10 seconds to `./e2e-profiles/[test-name]/`
4. Continue until the test completes or is interrupted
5. Generate a summary report with memory and CPU analysis

**Test Targets:**
- `test-e2e` - Standard e2e tests
- `test-experimental-e2e` - Experimental e2e tests (default)
- `test-extension-developer-e2e` - Extension developer e2e tests
- `test-upgrade-e2e` - Upgrade e2e tests
- `test-upgrade-experimental-e2e` - Upgrade experimental e2e tests

**Examples:**
```
/e2e-profile run baseline
/e2e-profile run baseline test-e2e
/e2e-profile run with-caching test-experimental-e2e
/e2e-profile run upgrade-test test-upgrade-e2e
```

### /e2e-profile analyze [test-name]

Analyze collected heap profiles for a specific test run:

1. Load all heap profiles from `./e2e-profiles/[test-name]/`
2. Analyze memory growth patterns
3. Identify top allocators
4. Find OpenAPI, JSON, and other hotspots
5. Generate detailed markdown report

**Example:**
```
/e2e-profile analyze baseline
```
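
The analysis report is generated for you, but the same numbers can be pulled by hand with `go tool pprof` when you want to dig into a single snapshot (the profile path below is illustrative):

```bash
# Top allocators by in-use memory in one snapshot.
go tool pprof -sample_index=inuse_space -top -nodecount=15 \
  e2e-profiles/baseline/operator-controller/heap5.pprof

# Cumulative allocations, useful for spotting churn such as JSON decoding.
go tool pprof -sample_index=alloc_space -top -nodecount=15 \
  e2e-profiles/baseline/operator-controller/heap5.pprof

# Interactive flame graph in the browser.
go tool pprof -http=:8080 e2e-profiles/baseline/operator-controller/heap5.pprof
```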

### /e2e-profile compare [test1] [test2]

Compare two test runs to measure the impact of changes:

1. Load profiles from both test runs
2. Compare peak memory usage
3. Compare memory growth rates
4. Identify differences in allocation patterns
5. Generate side-by-side comparison report with charts

**Example:**
```
/e2e-profile compare baseline with-caching
```
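
For a quick manual spot-check between two runs, pprof's `-diff_base` flag reports what grew relative to the baseline (file names below are illustrative):

```bash
# Show which allocations grew in the second run relative to the baseline.
go tool pprof -diff_base=e2e-profiles/baseline/operator-controller/heap10.pprof \
  -top e2e-profiles/with-caching/operator-controller/heap10.pprof
```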

### /e2e-profile collect

Manually collect a single heap profile from the running operator-controller pod:

1. Find the operator-controller pod
2. Set up port forwarding to pprof endpoint
3. Download heap profile
4. Save to `./e2e-profiles/manual/heap-[timestamp].pprof`

**Example:**
```
/e2e-profile collect
```
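
Conceptually this is a port-forward plus a curl against the standard Go `net/http/pprof` endpoint. A minimal sketch, assuming the deployment is named `operator-controller-controller-manager` and serves pprof on port 6060:

```bash
# Sketch only -- the deployment name is an assumption.
NS="${E2E_PROFILE_NAMESPACE:-olmv1-system}"
kubectl -n "$NS" port-forward deploy/operator-controller-controller-manager 6060:6060 &
PF_PID=$!
sleep 2  # give the port-forward a moment to come up

mkdir -p e2e-profiles/manual
TS=$(date +%Y%m%d-%H%M%S)
curl -s http://localhost:6060/debug/pprof/heap -o "e2e-profiles/manual/heap-${TS}.pprof"

kill "$PF_PID"
```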

## Task Breakdown

When you invoke this command, I will:

1. **Setup Phase**
   - Create `./e2e-profiles/[test-name]` directory
   - Verify `make test-experimental-e2e` is available
   - Check kubectl access to the cluster
2. **Collection Phase**
   - Start the e2e test in background
   - Monitor for pod readiness
   - Set up port forwarding to pprof endpoint (port 6060)
   - Collect heap and CPU profiles every 10 seconds (see the collection-loop sketch after this list)
   - Save profiles with sequential naming (heap0.pprof, heap1.pprof, ...)
3. **Monitoring Phase**
   - Track test progress
   - Monitor profile file sizes for growth patterns
   - Detect if test crashes or completes
4. **Analysis Phase**
   - Use `go tool pprof` to analyze profiles
   - Extract key metrics:
     - Peak memory usage
     - Memory growth over time
     - Top allocators
     - OpenAPI-related allocations
     - JSON deserialization overhead
     - Informer/cache allocations
5. **Reporting Phase**
   - Generate markdown report with:
     - Executive summary
     - Memory timeline chart
     - Top allocators table
     - Allocation breakdown
     - Recommendations for optimization
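
For reference, the collection loop in step 2 is essentially a small polling loop. A simplified sketch for a single component (the real loop lives in `collect-profiles.sh` and also covers catalogd, the heap/cpu/both modes, and logging; the test name `baseline` is illustrative):

```bash
# Simplified sketch of the collection loop for one component. Assumes a
# port-forward is already serving pprof on localhost:6060.
OUT="${E2E_PROFILE_DIR:-./e2e-profiles}/baseline/operator-controller"
INTERVAL="${E2E_PROFILE_INTERVAL:-10}"
CPU_SECS="${E2E_PROFILE_CPU_DURATION:-10}"
mkdir -p "$OUT"

i=0
failures=0
while [ "$failures" -lt 3 ]; do   # three misses in a row => cluster is likely gone
  if curl -sf http://localhost:6060/debug/pprof/heap -o "$OUT/heap${i}.pprof" &&
     curl -sf "http://localhost:6060/debug/pprof/profile?seconds=${CPU_SECS}" -o "$OUT/cpu${i}.pprof"; then
    failures=0
  else
    failures=$((failures + 1))
  fi
  i=$((i + 1))
  # The CPU fetch itself blocks for CPU_SECS, so a full cycle takes a bit
  # longer than INTERVAL in this sketch.
  sleep "$INTERVAL"
done
```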

## Configuration

The plugin uses these defaults (customizable via environment variables):

```bash
# Namespace where operator-controller runs
E2E_PROFILE_NAMESPACE=olmv1-system

# Collection interval in seconds
E2E_PROFILE_INTERVAL=10

# CPU sampling duration in seconds
E2E_PROFILE_CPU_DURATION=10

# Profile collection mode (both, heap, cpu)
E2E_PROFILE_MODE=both

# Output directory base
E2E_PROFILE_DIR=./e2e-profiles

# Default test target
E2E_PROFILE_TEST_TARGET=test-experimental-e2e
```

**Profile Modes:**
- `both` (default): Collect both heap and CPU profiles
- `heap`: Collect only heap profiles (reduces overhead by ~3%)
- `cpu`: Collect only CPU profiles
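
Because these are ordinary environment variables, they can be set inline when calling the scripts directly, for example:

```bash
# Collect only heap profiles, every 5 seconds, into a custom directory.
E2E_PROFILE_MODE=heap E2E_PROFILE_INTERVAL=5 E2E_PROFILE_DIR=./profiles \
  ./hack/tools/e2e-profiling/e2e-profile.sh run baseline test-e2e
```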

## Output Structure

```
e2e-profiles/
├── baseline/
│   ├── operator-controller/
│   │   ├── heap0.pprof
│   │   ├── heap1.pprof
│   │   ├── cpu0.pprof
│   │   ├── cpu1.pprof
│   │   └── ...
│   ├── catalogd/
│   │   ├── heap0.pprof
│   │   ├── cpu0.pprof
│   │   └── ...
│   ├── test.log
│   ├── collection.log
│   └── analysis.md
├── with-caching/
│   └── ...
└── comparisons/
    └── baseline-vs-with-caching.md
```

## Tool Location

The profiling scripts are located at:
```
hack/tools/e2e-profiling/
├── e2e-profile.sh          # Main entry point
├── start-profiling.sh      # Start background profiling
├── stop-profiling.sh       # Stop profiling and analyze
├── run-profiled-test.sh    # Run test with profiling (automated)
├── collect-profiles.sh     # Profile collection loop
├── analyze-profiles.sh     # Generate analysis reports
├── compare-profiles.sh     # Compare two runs
├── common.sh               # Shared utilities
└── README.md               # Full documentation
```

You can run them directly:
```bash
# Start/Stop workflow
make start-profiling # or ./hack/tools/e2e-profiling/start-profiling.sh
make test-e2e
make stop-profiling # or ./hack/tools/e2e-profiling/stop-profiling.sh

# Automated workflow
./hack/tools/e2e-profiling/e2e-profile.sh run baseline
./hack/tools/e2e-profiling/e2e-profile.sh analyze baseline
./hack/tools/e2e-profiling/e2e-profile.sh compare baseline optimized
```

## Requirements

- kubectl with access to the cluster
- go tool pprof
- make (for running tests)
- curl (for fetching profiles)
- Port 6060 available for forwarding
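
A quick pre-flight check along these lines can catch a missing requirement before a long test run (a sketch; adjust the namespace if yours differs, and skip the `lsof` line if it is not installed):

```bash
command -v kubectl >/dev/null || echo "kubectl not found"
command -v curl >/dev/null || echo "curl not found"
command -v make >/dev/null || echo "make not found"
go tool -n pprof >/dev/null 2>&1 || echo "go tool pprof not available"
kubectl -n olmv1-system get pods >/dev/null || echo "no cluster access"
if lsof -i :6060 >/dev/null 2>&1; then echo "port 6060 already in use"; fi
```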

## Example Workflows

### Recommended: Start/Stop Workflow

```bash
# 1. Start profiling in background
/e2e-profile start baseline

# 2. Run your test (any command!)
make test-e2e # Works! Handles cluster teardown
make test-experimental-e2e # Works!
go test ./test/e2e/... # Works!

# 3. Stop profiling and get analysis
/e2e-profile stop

# 4. Make code changes and test again
# ... edit code ...
/e2e-profile start optimized
make test-e2e
/e2e-profile stop

# 5. Compare results
/e2e-profile compare baseline optimized
```

### Alternative: Automated Workflow

```bash
# 1. Run baseline test with profiling (automated)
/e2e-profile run baseline

# 2. Make code changes (e.g., add caching)
# ... edit code ...

# 3. Run new test with profiling
/e2e-profile run with-caching

# 4. Compare results
/e2e-profile compare baseline with-caching

# 5. Review the comparison report
# Opens: e2e-profiles/comparisons/baseline-vs-with-caching.md
```

## Notes

**Start/Stop Workflow:**
- Profiler runs in background, letting you run any test command
- Auto-detects cluster teardown after 3 consecutive collection failures
- Port-forwards and collection process stop gracefully
- Works with test-e2e (which tears down cluster), test-experimental-e2e, and custom commands

**Automated Workflow:**
- Test will run until completion or manual interruption (Ctrl+C)
- Automatically handles profiling setup and teardown

**General:**
- Each heap profile is ~11-150KB depending on memory usage
- Each CPU profile is ~4-40KB depending on activity
- Analysis requires all profile files to be present
- Port forwarding targets deployments rather than pods, so it survives pod restarts
- Reports are generated in markdown format for easy viewing
- Empty profile files are automatically cleaned up