
Add PostgreSQL observability telemetry exposure#1808

Open
DmytroPI-dev wants to merge 11 commits into feature/database-controllers from postgres-operator-monitoring

Conversation

@DmytroPI-dev commented Apr 1, 2026

Description

Adds PostgreSQL observability telemetry for PostgresCluster using Prometheus pod-annotation-based scraping. Metrics are exposed by CNPG's built-in exporters on PostgreSQL pods (port 9187) and PgBouncer pooler pods (port 9127). The operator controls whether annotations are injected via class- and cluster-level configuration, with no dedicated metrics Service or ServiceMonitor required for PostgreSQL or PgBouncer scraping.

A ServiceMonitor is still supported for operator-controller metrics as an optional step.

Key Changes

api/v4/postgresclusterclass_types.go
Added class-level observability configuration (monitoring.postgresqlMetrics.enabled, monitoring.connectionPoolerMetrics.enabled) that controls whether scrape annotations are injected into CNPG pods.

api/v4/postgrescluster_types.go
Added cluster-level disable-only overrides (spec.monitoring.postgresqlMetrics.disabled, spec.monitoring.connectionPoolerMetrics.disabled) allowing per-cluster opt-out without changing the class.

pkg/postgresql/cluster/core/cluster.go
Wired observability flag resolution into PostgresCluster reconciliation. When enabled, sets InheritedMetadata.Annotations on the CNPG Cluster (for PostgreSQL pods) and Template.ObjectMeta.Annotations on CNPG Pooler resources (for PgBouncer pods).

pkg/postgresql/cluster/core/monitoring.go
Added isPostgreSQLMetricsEnabled / isConnectionPoolerMetricsEnabled flag resolution helpers.
Added buildPostgresScrapeAnnotations / buildPoolerScrapeAnnotations annotation builders.
Added removeScrapeAnnotations for the disable path.
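
The annotation builders and removal helper described above can be sketched roughly as follows. The annotation keys follow the common Prometheus pod-annotation convention, and the ports 9187/9127 come from the PR description; the exact function names and shapes are assumptions, not the PR's actual code:

```go
package main

import "fmt"

// Illustrative constants; the real code's identifiers may differ.
const (
	prometheusScrapeAnnotation = "prometheus.io/scrape"
	prometheusPortAnnotation   = "prometheus.io/port"
	postgresMetricsPort        = "9187" // CNPG PostgreSQL exporter
	poolerMetricsPort          = "9127" // PgBouncer exporter
)

// buildScrapeAnnotations returns the pod annotations that annotation-based
// discovery needs: enable scraping and point at the exporter port.
func buildScrapeAnnotations(port string) map[string]string {
	return map[string]string{
		prometheusScrapeAnnotation: "true",
		prometheusPortAnnotation:   port,
	}
}

// removeScrapeAnnotations deletes only the scrape keys on the disable path,
// leaving unrelated annotations untouched.
func removeScrapeAnnotations(annotations map[string]string) {
	delete(annotations, prometheusScrapeAnnotation)
	delete(annotations, prometheusPortAnnotation)
}

func main() {
	a := buildScrapeAnnotations(postgresMetricsPort)
	fmt.Println(a[prometheusPortAnnotation]) // 9187
	removeScrapeAnnotations(a)
	fmt.Println(len(a)) // 0
}
```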

pkg/postgresql/cluster/core/monitoring_unit_test.go
Added unit tests for flag resolution, scrape annotation builders, and annotation removal.

internal/controller/postgrescluster_controller_test.go
Added integration tests verifying that InheritedMetadata annotations are set on the CNPG Cluster when monitoring is enabled and removed when disabled by cluster override.

docs/PostgreSQLObservabilityDashboard.json
Reference Grafana dashboard covering PostgreSQL target count, RW/RO PgBouncer availability, WAL activity, database sizes, PgBouncer client load, controller reconcile metrics, and domain fleet metrics.

docs/postgresSQLMonitoring-e2e.md
End-to-end validation guide for the annotation-based scraping flow on KIND.

Testing and Verification

Added unit tests in pkg/postgresql/cluster/core/monitoring_unit_test.go for:

  • class/cluster observability enablement logic
  • scrape annotation builders for PostgreSQL (port 9187) and PgBouncer (port 9127)
  • annotation removal on the disable path

Added integration tests in internal/controller/postgrescluster_controller_test.go verifying:

  • InheritedMetadata.Annotations presence when monitoring is enabled
  • annotation removal when disabled by cluster-level override
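
The class-enable / cluster-disable-only resolution these tests cover can be illustrated with a minimal sketch; the types and field names below are hypothetical stand-ins for the PR's API, not its actual definitions:

```go
package main

import "fmt"

// Illustrative types mirroring the pattern in the description: the class
// turns a feature on, and a cluster can only opt out.
type FeatureFlag struct{ Enabled bool }
type FeatureDisableOverride struct{ Disabled bool }

// resolveEnabled returns the effective state: disabled unless the class
// enables it, and the cluster-level override can only switch it off.
func resolveEnabled(class *FeatureFlag, override *FeatureDisableOverride) bool {
	if class == nil || !class.Enabled {
		return false // class does not enable the feature
	}
	if override != nil && override.Disabled {
		return false // per-cluster opt-out
	}
	// No cluster override present: inherit the class setting.
	return true
}

func main() {
	fmt.Println(resolveEnabled(&FeatureFlag{Enabled: true}, nil))                                   // true
	fmt.Println(resolveEnabled(&FeatureFlag{Enabled: true}, &FeatureDisableOverride{Disabled: true})) // false
	fmt.Println(resolveEnabled(nil, nil))                                                           // false
}
```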

Related Issues

CPI-1853 — related JIRA ticket.

Grafana screenshot:


PR Checklist

  • Code changes adhere to the project's coding standards.
  • Relevant unit and integration tests are included.
  • Documentation has been updated accordingly.
  • All tests pass locally.
  • The PR description follows the project's guidelines.

DmytroPI-dev force-pushed the postgres-operator-monitoring branch from a1b796f to 976ecd1 on April 2, 2026 14:08
DmytroPI-dev changed the title from "Create ServiceMonitor and basic Grafana dashboard for metrics" to "Add PostgreSQL observability telemetry exposure via ServiceMonitors" on Apr 2, 2026
Comment thread docs/PostgreSQLObservabilityDashboard.md Outdated
Comment thread pkg/postgresql/cluster/core/cluster.go
Comment thread pkg/postgresql/cluster/core/cluster.go Outdated
Comment thread docs/PostgreSQLObservabilityDashboard.md
Comment thread pkg/postgresql/cluster/core/monitoring.go Outdated
Comment thread pkg/postgresql/cluster/core/cluster.go Outdated
Comment thread pkg/postgresql/cluster/core/cluster.go
@github-actions

CLA Assistant Lite bot:
Thank you for your submission, we really appreciate it. Like many open-source projects, we ask that you sign our Contribution License Agreement before we can accept your contribution. You can sign the CLA by just posting a Pull Request Comment with the exact sentence copied from below.


I have read the CLA Document and I hereby sign the CLA


1 out of 2 committers have signed the CLA.
@DmytroPI-dev
@limak9182
You can retrigger this bot by commenting recheck in this Pull Request

Comment thread api/v4/postgrescluster_types.go Outdated
Comment thread api/v4/postgrescluster_types.go Outdated
Comment thread api/v4/postgresclusterclass_types.go Outdated
Comment thread pkg/postgresql/cluster/core/monitoring.go Outdated
Comment thread pkg/postgresql/cluster/core/monitoring.go Outdated
Comment thread pkg/postgresql/cluster/core/monitoring.go Outdated
Comment thread pkg/postgresql/cluster/core/monitoring.go Outdated
DmytroPI-dev force-pushed the postgres-operator-monitoring branch from d710f58 to 63b5937 on April 13, 2026 09:54
DmytroPI-dev changed the title from "Add PostgreSQL observability telemetry exposure via ServiceMonitors" to "Add PostgreSQL observability telemetry exposure" on Apr 15, 2026
DmytroPI-dev force-pushed the postgres-operator-monitoring branch from 988138d to 08dfa16 on April 15, 2026 12:59
DmytroPI-dev marked this pull request as ready for review on April 15, 2026 18:52
ConnectionPoolerMetrics *FeatureDisableOverride `json:"connectionPoolerMetrics,omitempty"`
}

type FeatureDisableOverride struct {
@@ -0,0 +1,321 @@
# PostgreSQL Monitoring E2E on KIND

@mploski commented Apr 16, 2026


This file probably shouldn't be here; it sounds like an internal testing approach. Maybe move it to Confluence?

@@ -0,0 +1,434 @@
# PostgreSQL Monitoring E2E with OTel Collector
Collaborator


same here

-f test/postgresql/monitoring/prometheus-via-otel-values.yaml
```

This is important: Prometheus should scrape the OTel Collector exporter, not the PostgreSQL and PgBouncer pods directly. Otherwise Grafana will bypass OTel or you will get duplicate series.

@mploski commented Apr 16, 2026


What do you mean by this statement? We don't have Prometheus in this setup?

- `cnpg_pgbouncer_last_collection_error`
- `cnpg_pgbouncer_pools_cl_active`

## 7. Verify Prometheus is scraping OTel
Collaborator


Prometheus shouldn't be involved here at all; the OTel Collector should send it directly to Grafana: https://grafana.com/docs/opentelemetry/#ingest-otlp-data


func (r *PostgresClusterReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
rc := &clustercore.ReconcileContext{Client: r.Client, Scheme: r.Scheme, Recorder: r.Recorder, Metrics: r.Metrics}
metrics := r.Metrics
Collaborator


What problem do we solve here? Our reconciler should expect metrics to be properly initialized; the noop recorder is only for testing.

}
}

recreateClusterClass := func(modify func(*enterprisev4.PostgresClusterClass)) {

@mploski commented Apr 16, 2026


This is not readable; can we make it more straightforward? What problem are we trying to solve here? Why do we need to recreate the class? Especially since, in real life, a class cannot be recreated if it is already in use. Maybe instead let's create a few classes that we use in the different scenarios, and change the cluster config itself.

return ctrl.Result{}, err
}
rc := &dbcore.ReconcileContext{Client: r.Client, Scheme: r.Scheme, Recorder: r.Recorder, Metrics: r.Metrics}
metrics := r.Metrics
Collaborator


As previously: I don't think we should have it like that. Metrics should be initialized in the reconciler.

rc.emitPoolerReadyTransition(postgresCluster, oldConditions)
}

oldConditions := make([]metav1.Condition, len(postgresCluster.Status.Conditions))
Collaborator


Maybe instead of duplicating the oldConditions copy here and in line 337, we can have it before our switch and copy just once?

normalized.PgHBA = spec.PostgresConfiguration.PgHBA
}
if spec.InheritedMetadata != nil && len(spec.InheritedMetadata.Annotations) > 0 {
normalized.InheritedAnnotations = spec.InheritedMetadata.Annotations
Collaborator


I probably read it wrong, but why do we set normalized.InheritedAnnotations if we already have it injected into spec?

Comment thread pkg/postgresql/cluster/core/cluster.go Outdated
oldConditions := make([]metav1.Condition, len(postgresCluster.Status.Conditions))
copy(oldConditions, postgresCluster.Status.Conditions)

if err := reconcilePostgreSQLMetricsService(ctx, c, rc.Scheme, postgresCluster, postgresMetricsEnabled); err != nil {


That could potentially be packed into a unit-of-work style code block, as it's very repeatable across the logic statements. It could be packed into an executable interface which handles the componentMetrics, or something along this line of thought.
The code would then be separated into testable blocks and orchestrated cleanly.



units, nice!

return false
}
if cluster == nil || cluster.Spec.Monitoring == nil || cluster.Spec.Monitoring.PostgreSQLMetrics == nil {
return true


What does that mean? areMetricsEnabled returns true on nil values? Could you please explain?

return ctrl.Result{}, errors.Join(err, statusErr)
}

postgresMetricsEnabled := isPostgreSQLMetricsEnabled(postgresCluster, clusterClass)


Just an idea: could that be merged into a general check like "is metrics exposure enabled for this component"? Because then there is always the possibility to expand without adding isComponent1MetricsEnabled, isComponentNMetricsEnabled. Just one method, getComponentMetricsSettings(...), returning map[string(component)]bool. One config poll for anything coming in the future.
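
A minimal sketch of this suggestion, with illustrative names only (the function and its inputs are hypothetical, not part of the PR):

```go
package main

import "fmt"

// getComponentMetricsSettings resolves all component metrics flags in one
// place and returns a map keyed by component name, so adding a new component
// does not require a new isXMetricsEnabled helper. Inputs stand in for the
// class-level enables and cluster-level disable overrides.
func getComponentMetricsSettings(classEnabled, clusterDisabled map[string]bool) map[string]bool {
	settings := make(map[string]bool, len(classEnabled))
	for component, enabled := range classEnabled {
		// Effective state: enabled by class AND not disabled by the cluster.
		settings[component] = enabled && !clusterDisabled[component]
	}
	return settings
}

func main() {
	settings := getComponentMetricsSettings(
		map[string]bool{"postgresql": true, "pgbouncer": true},
		map[string]bool{"pgbouncer": true},
	)
	fmt.Println(settings["postgresql"], settings["pgbouncer"]) // true false
}
```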

}

func normalizeCNPGClusterSpec(spec cnpgv1.ClusterSpec, customDefinedParameters map[string]string) normalizedCNPGClusterSpec {
normalized := normalizedCNPGClusterSpec{


We could potentially map it via a JSON contract, unless the tags are already in use in our specs, which they probably are. If not, it would map straight into the CNPG spec.
Btw, what do inheritedAnnotations mean, tech-wise?

Collaborator


Inherited annotations set annotations on the k8s pods. Thanks to this we have a way to discover every pod with those annotations, and this is what the OTel Collector uses to find the pod endpoints to scrape.

assert.Equal(t, postgresMetricsPortString, cluster.Spec.InheritedMetadata.Annotations[prometheusPortAnnotation])
}

func TestClusterSecretExists(t *testing.T) {


nit: Isn't it more readable to split the units into chunks?
A naming idea, just for reference: TestClusterSecret_with_n_expected_rwro_poolers_exist
