-
Notifications
You must be signed in to change notification settings - Fork 112
add feature to configurate promhttp error handling #411
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
add feature to configurate promhttp error handling #411
Conversation
Signed-off-by: Nikita Popov <nikita.popov@semrush.com>
b5a9fc6 to
9bfc60d
Compare
|
cc: @SuperQ @kgeckhart |
|
I'm not sure we should do this, if there are errors, we don't want to silently ignore them. They should be fixed. |
|
First of all, promlib has that option, and it looks convenient to have it here. And user is able to define how he wants to handle errors. Google API has many errors, which couldn't be fixed as simple, as their support is not an example of the fast or customer-faced approach, we have a lot of examples when they deployed some descriptors in BETA stage and exporter has started to fail. So this setting is a good way to save collecting metrics which can be collected. The problem we are trying to solve is similar to this Also this exporter has metrics to identify such errors, so users still noticed about them only if provided setting is As i mentioned, the default behaviour is still the same. P.S. Now we are using forked version with this setting and everything goes fine, we haven't lose any metrics we expected, the exporter returns it's own metrics in any case and we notified if any problems happened |
|
The problem is that the linked error is an actual problem. By ignoring errors like that you're not going to get the data you expect. This is a correctness problem that I don't think we want to paper over.
Are you sure about that? Because the linked issue is very likely to be a real problem. |
This is a very bad idea. Partial data is very difficult to track and debug. This is why the default behavior is the way it is. As well as when Prometheus polls targets it treats any non-200 status code or timeout as "no data". This violates "fail fast" principal and is not something we are likely to want to put in an official exporter without some serious discussion. |
Yes. In the Obviously that we can split
I completely agree with that. But as i said above, for us it's not okay to have "no data" when we can have "some non-broken data".
Usually those metrics are broken on Google side and they in some |
|
@SuperQ, I completely agree with your reasoning. However, in some configurations, teams add a metric that is ALPHA/BETA that knowingly could be problematic. And don't want to panic the whole stackdriver exporter for the sake of its behavior. Of course, that fail-silent (only with log message) should not be default, but an opt-in flag as this PR implements it. This PR will not change anything for the users that already use the exporter as is, just would enable the configuration for whoever wants to accept the risk. Issues such as this would not be raised. |
|
Sorry I'm a bit late on this one. @SuperQ I do agree with the folks in this PR that this exporter is a special case where it's safe to allow continue on error as there are many factors outside of the control of the exporter. We have the error logger hooked up as well so the errors won't go completely unnoticed. I also noticed it's been the default behavior in node_exporter for quite some time now. I didn't track back why in history but thought it was interesting. |
|
In the node_exporter we expose individual collector scrape success status metrics. The node_exporter also existed from a very early time where we hadn't discovered all the pitfalls of soft failures and developed policies around them. So, having partial response in the node_exporter is more of a historical artifact rather than a specific design choice. |
|
I think having a feature flag here is probably fine. As long as we strongly word the documentation about the impact it can have. Also, I don't remember if this exporter has existing meta-metrics for individual feature failures or not. |
👍 @nnikitos95 I know it's been awhile but is this something you would be willing to do while resolving the conflicts?
We do expose scrape errors vs failing the whole scrape when there's an issue collecting from GCP, https://github.com/prometheus-community/stackdriver_exporter/blob/master/collectors/monitoring_collector.go#L235-L249, which seems like it fits the definition? |
Motivation
The current implementation uses PromHTTP's default error handling, which may not suit all deployment scenarios. By allowing customization, users can:
ContinueOnError) in high-availability environments.Default Behavior