add feature to configurate promhttp error handling #411

nnikitos95 · 2025-02-10T12:41:21Z

Motivation

The current implementation uses PromHTTP's default error handling, which may not suit all deployment scenarios. By allowing customization, users can:

Opt for more lenient error handling (e.g., ContinueOnError) in high-availability environments.
Ensure better alignment with their operational requirements.

Default Behavior

Maintains backward compatibility by retaining the current default error handling behavior unless explicitly configured.
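
As an illustration of the opt-in behavior described above, here is a minimal sketch of mapping a flag value to an error handling mode. It uses only the standard library: `errorMode`, `parseErrorMode`, and the accepted strings are hypothetical names for this example, not the flag or code added by this PR. In a real exporter the result would be assigned to `promhttp.HandlerOpts.ErrorHandling`, whose zero value is `promhttp.HTTPErrorOnError`.

```go
package main

import (
	"fmt"
	"strings"
)

// errorMode is a hypothetical local stand-in for promhttp.HandlerErrorHandling.
type errorMode int

const (
	httpErrorOnError errorMode = iota // mirrors promhttp.HTTPErrorOnError (the default)
	continueOnError                   // mirrors promhttp.ContinueOnError
)

// parseErrorMode maps a flag value to a mode.
// An empty value keeps the default, so existing deployments are unaffected.
func parseErrorMode(v string) (errorMode, error) {
	switch strings.ToLower(strings.TrimSpace(v)) {
	case "", "httperroronerror":
		return httpErrorOnError, nil
	case "continueonerror":
		return continueOnError, nil
	default:
		return httpErrorOnError, fmt.Errorf("unknown error handling mode %q", v)
	}
}

func main() {
	for _, v := range []string{"", "ContinueOnError", "bogus"} {
		mode, err := parseErrorMode(v)
		fmt.Printf("%q -> mode=%d err=%v\n", v, mode, err)
	}
}
```

Rejecting unknown values instead of silently falling back keeps a typo in the flag from changing the failure behavior unnoticed.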

Signed-off-by: Nikita Popov <nikita.popov@semrush.com>

nnikitos95 · 2025-02-11T09:47:21Z

cc: @SuperQ @kgeckhart

SuperQ · 2025-03-10T12:56:38Z

I'm not sure we should do this. If there are errors, we don't want to silently ignore them. They should be fixed.

nnikitos95 · 2025-03-10T13:46:54Z

First of all, promhttp already has that option, and it would be convenient to expose it here so that users can decide how they want to handle errors.

The Google API produces many errors that can't be fixed easily, since their support is neither fast nor customer-focused. We have seen many cases where they deployed descriptors in a BETA stage and the exporter started to fail. So this setting is a good way to keep collecting the metrics that can still be collected.

The problem we are trying to solve is similar to this

Also, this exporter has metrics that identify such errors, so users are still notified about them, and the scrape continues past them only if the setting is ContinueOnError.

As I mentioned, the default behaviour is still the same.

P.S. We are currently running a forked version with this setting and everything works fine: we haven't lost any metrics we expected, the exporter returns its own metrics in any case, and we are notified if any problems happen.

SuperQ · 2025-03-10T13:50:11Z

The problem is that the linked error is an actual problem. By ignoring errors like that you're not going to get the data you expect.

This is a correctness problem that I don't think we want to paper over.

> we haven't lost any metrics we expected

Are you sure about that? Because the linked issue is very likely to be a real problem.

SuperQ · 2025-03-10T13:52:29Z

> So this setting is a good way to save collecting metrics which can be collected.

This is a very bad idea. Partial data is very difficult to track and debug. This is why the default behavior is the way it is. Likewise, when Prometheus polls targets, it treats any non-200 status code or timeout as "no data".

This violates the "fail fast" principle and is not something we are likely to want to put in an official exporter without some serious discussion.
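
To make the trade-off concrete, here is a small standard-library sketch of the two behaviors being discussed. It is a simplified model, not promhttp's implementation: `collector`, `metricsHandler`, `okCollector`, and `brokenCollector` are hypothetical names for this example. In fail-fast mode one failing collector turns the whole response into a 500, which a scraper treats as "no data"; in continue mode the response is a 200 that silently lacks the failing collector's samples.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// collector is a hypothetical source of exposition lines.
type collector func() ([]string, error)

func okCollector() ([]string, error)     { return []string{"healthy_metric 1"}, nil }
func brokenCollector() ([]string, error) { return nil, errors.New("broken descriptor") }

// metricsHandler serves the output of all collectors.
// With continueOnError=false, any collector error yields a 500 and no samples.
// With continueOnError=true, failing collectors are skipped and the rest are served.
func metricsHandler(cs []collector, continueOnError bool) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		var lines []string
		for _, c := range cs {
			out, err := c()
			if err != nil {
				if !continueOnError {
					http.Error(w, "collection failed: "+err.Error(), http.StatusInternalServerError)
					return
				}
				continue // partial response: this collector's samples are missing
			}
			lines = append(lines, out...)
		}
		fmt.Fprintln(w, strings.Join(lines, "\n"))
	})
}

// scrape performs one in-memory request and returns the status code and body.
func scrape(h http.Handler) (int, string) {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/metrics", nil))
	return rec.Code, rec.Body.String()
}

func main() {
	cs := []collector{okCollector, brokenCollector}
	for _, mode := range []bool{false, true} {
		code, body := scrape(metricsHandler(cs, mode))
		fmt.Printf("continueOnError=%v status=%d body=%q\n", mode, code, body)
	}
}
```

Note that in continue mode nothing in the response itself tells the scraper that a collector failed, which is why partial data is hard to track without a separate error signal.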

nnikitos95 · 2025-03-10T16:48:58Z

> Are you sure about that? Because the linked issue is very likely to be a real problem.

Yes. With the fail-fast approach we lose all the metrics anyway: /metrics returns 500, or the metrics go stale because we haven't collected them in time, so we can't react through the alerts that rely on them.

Obviously, we can split the type prefixes of the broken metrics out into separate stackdriver exporter instances so they don't affect the reliable prefixes. But that takes time in any case, even if we use fail-fast and can react quickly. And it usually happens at night, so we can still lose data.

> As well as when Prometheus polls targets it treats any non-200 status code or timeout as "no data".

I completely agree with that. But as I said above, for us it's not okay to have "no data" when we can have "some non-broken data".

> By ignoring errors like that you're not going to get the data you expect.

Usually those metrics are broken on Google's side and are in an ALPHA/BETA stage that can't be filtered out, so we can't do anything about it.

eenchev · 2025-09-11T19:12:37Z

@SuperQ, I completely agree with your reasoning. However, in some configurations teams knowingly add an ALPHA/BETA metric that could be problematic, and they don't want the whole stackdriver exporter to fail because of how that one metric behaves. Of course, fail-silent behavior (with only a log message) should not be the default but an opt-in flag, as this PR implements it.

This PR will not change anything for users who already use the exporter as is; it just enables the configuration for whoever wants to accept the risk. Issues such as this would not be raised.

kgeckhart · 2025-10-21T15:44:15Z

Sorry I'm a bit late on this one. @SuperQ I do agree with the folks in this PR that this exporter is a special case where it's safe to allow continue on error as there are many factors outside of the control of the exporter. We have the error logger hooked up as well so the errors won't go completely unnoticed.

I also noticed it's been the default behavior in node_exporter for quite some time now. I didn't track back why in history but thought it was interesting.

SuperQ · 2025-10-21T16:17:15Z

In the node_exporter we expose individual collector scrape success status metrics.

The node_exporter also dates from a very early time, when we hadn't yet discovered all the pitfalls of soft failures and developed policies around them. So, having partial responses in the node_exporter is more of a historical artifact than a specific design choice.
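
For readers unfamiliar with the per-collector success pattern mentioned above, here is a rough standard-library sketch of the idea. It is not node_exporter's code: `collectAll`, `errUnavailable`, and the metric name `example_scrape_collector_success` are made up for this example (node_exporter's own metric is `node_scrape_collector_success`). Each collector contributes a 0/1 gauge alongside its samples, so a partial response is visible in the data itself.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// errUnavailable is a hypothetical upstream failure used by the example.
var errUnavailable = errors.New("api unavailable")

// collectAll runs each named collector and returns exposition lines.
// For every collector it appends a success gauge (1 = ok, 0 = failed),
// so a missing set of samples can be detected and alerted on.
func collectAll(cs map[string]func() ([]string, error)) []string {
	names := make([]string, 0, len(cs))
	for n := range cs {
		names = append(names, n)
	}
	sort.Strings(names) // deterministic output order

	var lines []string
	for _, n := range names {
		out, err := cs[n]()
		success := 1
		if err != nil {
			success = 0 // drop this collector's samples but record the failure
		} else {
			lines = append(lines, out...)
		}
		lines = append(lines,
			fmt.Sprintf("example_scrape_collector_success{collector=%q} %d", n, success))
	}
	return lines
}

func main() {
	lines := collectAll(map[string]func() ([]string, error){
		"compute": func() ([]string, error) { return []string{"compute_metric 1"}, nil },
		"pubsub":  func() ([]string, error) { return nil, errUnavailable },
	})
	for _, l := range lines {
		fmt.Println(l)
	}
}
```

An alert on the success gauge being 0 is what makes a soft failure like this observable rather than silent.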

SuperQ · 2025-10-21T16:19:06Z

I think having a feature flag here is probably fine. As long as we strongly word the documentation about the impact it can have.

Also, I don't remember if this exporter has existing meta-metrics for individual feature failures or not.

kgeckhart · 2025-10-31T18:57:39Z

> I think having a feature flag here is probably fine. As long as we strongly word the documentation about the impact it can have.

👍 @nnikitos95 I know it's been a while, but is this something you would be willing to do while resolving the conflicts?

> Also, I don't remember if this exporter has existing meta-metrics for individual feature failures or not.

We do expose scrape errors vs failing the whole scrape when there's an issue collecting from GCP, https://github.com/prometheus-community/stackdriver_exporter/blob/master/collectors/monitoring_collector.go#L235-L249, which seems like it fits the definition?

add feature to configurate promhttp error handling

9bfc60d

Signed-off-by: Nikita Popov <nikita.popov@semrush.com>

nnikitos95 force-pushed the feat/add_promhttp_error_handling_flag branch from b5a9fc6 to 9bfc60d Compare February 10, 2025 12:44

SuperQ requested a review from kgeckhart October 21, 2025 16:19
