📊 Calibrate Prometheus alert thresholds using memory profiling data
Analyze baseline memory usage patterns and adjust Prometheus alert thresholds
to eliminate false positives while maintaining sensitivity to real issues.
The new thresholds are based on memory profiling done against BoxcutterRuntime,
which has an increased memory load.
**Memory Analysis:**
- Peak RSS: 107.9MB, Peak Heap: 54.74MB during e2e tests
- Memory stabilizes at 106K heap (heap19-21 show 0K growth for 3 snapshots)
- Conclusion: NOT a memory leak, but normal operational behavior
**Memory Breakdown:**
- JSON Deserialization: 24.64MB (45% of heap) - inherent to OLM's dynamic nature
- Informer Lists: 9.87MB (18% of heap) - optimization possible via field selectors
- OpenAPI Schemas: 3.54MB (6% of heap) - already optimized (73% reduction)
- Runtime Overhead: 53.16MB (49% of RSS) - normal for Go applications
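A quick check of the arithmetic in the breakdown, as a minimal sketch. It assumes runtime overhead is peak RSS minus peak heap; that is an inference from the figures, not something the profile states. The helper name `percent` is illustrative only.

```python
# Figures copied from the Memory Analysis and Memory Breakdown sections (MB).
PEAK_RSS_MB = 107.9
PEAK_HEAP_MB = 54.74

# Heap consumers listed in the breakdown.
HEAP_COMPONENTS_MB = {
    "JSON Deserialization": 24.64,
    "Informer Lists": 9.87,
    "OpenAPI Schemas": 3.54,
}


def percent(part_mb: float, whole_mb: float) -> int:
    """Return part_mb as a whole-number percentage of whole_mb."""
    return round(100 * part_mb / whole_mb)


# Assumption: runtime overhead is everything in RSS that is not Go heap.
runtime_overhead_mb = round(PEAK_RSS_MB - PEAK_HEAP_MB, 2)

if __name__ == "__main__":
    for name, size_mb in HEAP_COMPONENTS_MB.items():
        print(f"{name}: {percent(size_mb, PEAK_HEAP_MB)}% of peak heap")
    print(f"Runtime Overhead: {runtime_overhead_mb}MB, "
          f"{percent(runtime_overhead_mb, PEAK_RSS_MB)}% of peak RSS")
```

Under that assumption the three heap components are measured against the 54.74MB peak heap and the runtime overhead against the 107.9MB peak RSS, which is why the four percentages do not sum to 100.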
**Alert Threshold Updates:**
- operator-controller-memory-growth: 100kB/sec → 200kB/sec
- operator-controller-memory-usage: 100MB → 150MB
- catalogd-memory-growth: 100kB/sec → 200kB/sec
**Rationale:**
Baseline profiling showed that episodic growth of 132.4kB/sec during informer
sync and peak usage of 107.9MB are both normal. The previous thresholds caused
false-positive alerts during normal e2e test execution.
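The effect of the threshold change on the baseline figures can be sketched as follows. This is an illustration, not the actual rule logic: it assumes each alert condition is a plain "observed value greater than threshold" comparison, and the `Threshold` type and `exceeds` helper are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    """Old and new limit for one alert, in the alert's own unit."""
    old: float
    new: float


# Values from the Alert Threshold Updates section.
MEMORY_GROWTH_KB_PER_SEC = Threshold(old=100.0, new=200.0)
MEMORY_USAGE_MB = Threshold(old=100.0, new=150.0)

# Baseline observations from the Rationale section.
OBSERVED_GROWTH_KB_PER_SEC = 132.4
OBSERVED_PEAK_USAGE_MB = 107.9


def exceeds(observed: float, limit: float) -> bool:
    """Assumed alert condition: strictly greater than the limit."""
    return observed > limit


if __name__ == "__main__":
    checks = [
        ("memory-growth", OBSERVED_GROWTH_KB_PER_SEC, MEMORY_GROWTH_KB_PER_SEC),
        ("memory-usage", OBSERVED_PEAK_USAGE_MB, MEMORY_USAGE_MB),
    ]
    for name, observed, threshold in checks:
        print(name,
              "old:", exceeds(observed, threshold.old),
              "new:", exceeds(observed, threshold.new))
```

Both baseline values sit above the old limits and below the new ones, which matches the two false positives seen with the old thresholds and none with the new.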
**Verification:**
- Baseline test (old thresholds): 2 alerts triggered (false positives)
- Verification test (new thresholds): 0 alerts triggered ✅
- Memory patterns remain consistent (~55MB heap, 79-171MB RSS)
- Transient spikes don't trigger alerts due to "for: 5m" clause
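Why the "for: 5m" clause suppresses transient spikes can be shown with a simplified model. This is a sketch under stated assumptions, not Prometheus itself: it assumes the rule is evaluated at a fixed interval and fires only once the condition has held at every evaluation for the full duration. The `fires` function and the sample series are made up for illustration.

```python
from typing import Optional, Sequence


def fires(samples: Sequence[float], threshold: float,
          for_seconds: int, interval_seconds: int) -> bool:
    """Return True if samples stay above threshold continuously
    for at least for_seconds, given one sample per evaluation."""
    pending_since: Optional[int] = None
    for index, value in enumerate(samples):
        now = index * interval_seconds
        if value > threshold:
            if pending_since is None:
                pending_since = now  # condition just became true: pending
            if now - pending_since >= for_seconds:
                return True  # held for the whole "for" duration: firing
        else:
            pending_since = None  # condition cleared: reset the timer
    return False


# Hypothetical growth-rate series in kB/sec, one sample per minute.
TRANSIENT_SPIKE = [50, 250, 250, 50, 50, 50, 50]
SUSTAINED_GROWTH = [250, 250, 250, 250, 250, 250]

if __name__ == "__main__":
    for label, series in [("spike", TRANSIENT_SPIKE),
                          ("sustained", SUSTAINED_GROWTH)]:
        print(label, fires(series, threshold=200,
                           for_seconds=300, interval_seconds=60))
```

In this model a two-minute spike above 200kB/sec resets before the five minutes elapse, while growth sustained across the whole window fires.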
**Recommendation:**
Accept a 107.9MB peak RSS as normal operational behavior for test/development
environments. Production deployments may need different thresholds based
on workload characteristics (number of resources, reconciliation frequency).
**Non-viable Optimizations:**
- Cannot replace unstructured with typed clients (breaks OLM flexibility)
- Cannot reduce runtime overhead (inherent to Go)
- JSON deserialization is unavoidable for dynamic resource handling
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Todd Short <tshort@redhat.com>