
Commit 2fa4b9c

tmshort and claude committed
📊 Calibrate Prometheus alert thresholds using memory profiling data
Analyze baseline memory usage patterns and adjust Prometheus alert thresholds to eliminate false positives while maintaining sensitivity to real issues.

**Memory Analysis (MEMORY_ANALYSIS.md):**
- Peak RSS: 107.9MB, peak heap: 54.74MB during e2e tests
- Memory stabilizes at 106K heap (heap19-21 show 0K growth across 3 snapshots)
- Conclusion: NOT a memory leak, but normal operational behavior

**Memory Breakdown:**
- JSON deserialization: 24.64MB (45% of heap) - inherent to OLM's dynamic nature
- Informer lists: 9.87MB (18% of heap) - optimization possible via field selectors
- OpenAPI schemas: 3.54MB (6% of heap) - already optimized (73% reduction)
- Runtime overhead: 53.16MB (49% of RSS, i.e. peak RSS minus peak heap) - normal for Go applications

**Alert Threshold Updates:**
- operator-controller-memory-growth: 100kB/sec → 200kB/sec
- operator-controller-memory-usage: 100MB → 150MB
- catalogd-memory-growth: 100kB/sec → 200kB/sec

**Rationale:** Baseline profiling showed that episodic growth of 132.4kB/sec during informer sync and a 107.9MB peak are normal. The previous thresholds caused false-positive alerts during ordinary e2e test runs.

**Verification (ALERT_THRESHOLD_VERIFICATION.md):**
- Baseline test (old thresholds): 2 alerts triggered (false positives)
- Verification test (new thresholds): 0 alerts triggered ✅
- Memory patterns remain consistent (~55MB heap, 79-171MB RSS)
- Transient spikes don't trigger alerts because of the "for: 5m" clause

**Recommendation:** Accept 107.9MB as normal operational behavior for test/development environments. Production deployments may need different thresholds depending on workload characteristics (number of resources, reconciliation frequency).

**Non-viable Optimizations:**
- Cannot replace unstructured with typed clients (breaks OLM's flexibility)
- Cannot reduce runtime overhead (inherent to Go)
- JSON deserialization is unavoidable for dynamic resource handling

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
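The baseline numbers quoted above can be reproduced ad hoc with the same PromQL expressions the alerts use (see the diff below). A minimal sketch, assuming the `container="manager"` labels used by the chart's rules:

```promql
# Growth rate of operator-controller's working set (B/sec) over the last 5 minutes;
# baseline profiling cited above observed episodic peaks around 132.4kB/sec.
deriv(sum(container_memory_working_set_bytes{pod=~"operator-controller.*",container="manager"})[5m:])

# Absolute working-set usage (bytes), to compare against the 150MB / 75MB usage thresholds.
sum(container_memory_working_set_bytes{pod=~"operator-controller.*",container="manager"})
sum(container_memory_working_set_bytes{pod=~"catalogd.*",container="manager"})
```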
1 parent 18142b3 commit 2fa4b9c

File tree

1 file changed: +11 / -3 lines changed


helm/prometheus/templates/prometheusrile-controller-alerts.yml

Lines changed: 11 additions & 3 deletions
@@ -22,27 +22,35 @@ spec:
       annotations:
         description: container {{`{{ $labels.container }}`}} of pod {{`{{ $labels.pod }}`}} experienced OOM event(s); count={{`{{ $value }}`}}
       expr: container_oom_events_total > 0
+    # Memory growth alerts - thresholds calibrated based on baseline memory profiling
+    # See MEMORY_ANALYSIS.md for details on normal operational memory patterns
     - alert: operator-controller-memory-growth
       annotations:
         description: 'operator-controller pod memory usage growing at a high rate for 5 minutes: {{`{{ $value | humanize }}`}}B/sec'
-      expr: deriv(sum(container_memory_working_set_bytes{pod=~"operator-controller.*",container="manager"})[5m:]) > 100_000
+      # Threshold: 200kB/sec (baseline shows 132.4kB/sec episodic growth during e2e tests is normal)
+      expr: deriv(sum(container_memory_working_set_bytes{pod=~"operator-controller.*",container="manager"})[5m:]) > 200_000
       for: 5m
       keep_firing_for: 1d
     - alert: catalogd-memory-growth
       annotations:
         description: 'catalogd pod memory usage growing at a high rate for 5 minutes: {{`{{ $value | humanize }}`}}B/sec'
-      expr: deriv(sum(container_memory_working_set_bytes{pod=~"catalogd.*",container="manager"})[5m:]) > 100_000
+      # Threshold: 200kB/sec (aligned with operator-controller for consistency)
+      expr: deriv(sum(container_memory_working_set_bytes{pod=~"catalogd.*",container="manager"})[5m:]) > 200_000
       for: 5m
       keep_firing_for: 1d
+    # Memory usage alerts - thresholds calibrated for test/development environments
+    # Production deployments may need different thresholds based on workload
     - alert: operator-controller-memory-usage
       annotations:
         description: 'operator-controller pod using high memory resources for the last 5 minutes: {{`{{ $value | humanize }}`}}B'
-      expr: sum(container_memory_working_set_bytes{pod=~"operator-controller.*",container="manager"}) > 100_000_000
+      # Threshold: 150MB (baseline shows 107.9MB peak is normal, stabilizes at 78-88MB)
+      expr: sum(container_memory_working_set_bytes{pod=~"operator-controller.*",container="manager"}) > 150_000_000
       for: 5m
       keep_firing_for: 1d
     - alert: catalogd-memory-usage
       annotations:
         description: 'catalogd pod using high memory resources for the last 5 minutes: {{`{{ $value | humanize }}`}}B'
+      # Threshold: 75MB (baseline shows 16.9MB peak, well under threshold)
       expr: sum(container_memory_working_set_bytes{pod=~"catalogd.*",container="manager"}) > 75_000_000
       for: 5m
       keep_firing_for: 1d
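The verification described in the commit message was done by re-running the e2e suite against a live cluster. As a lighter-weight offline check, the thresholds can also be exercised with `promtool test rules`. The sketch below is illustrative only: it assumes the alert rules have been rendered from the Helm/PrometheusRule template into a plain rule file named `controller-alerts.rules.yml` (hypothetical name), and it feeds in a working-set series growing at roughly the 132.4kB/sec baseline rate, expecting the 200kB/sec growth alert to stay silent.

```yaml
# memory-alerts_test.yml (hypothetical) - run with: promtool test rules memory-alerts_test.yml
rule_files:
  - controller-alerts.rules.yml   # plain Prometheus rules rendered from the Helm template
evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # Grows by ~7.94MB per 1m sample, i.e. ~132.4kB/sec, the episodic baseline rate cited above.
      - series: 'container_memory_working_set_bytes{pod="operator-controller-0",container="manager"}'
        values: '100000000+7944000x30'
    alert_rule_test:
      # 132.4kB/sec is below the new 200kB/sec threshold, so the growth alert should not fire.
      - eval_time: 15m
        alertname: operator-controller-memory-growth
        exp_alerts: []
```

A second test case with a steeper series (for example `100000000+15000000x30`, roughly 250kB/sec) could assert that the alert still fires for genuine runaway growth.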

0 commit comments
