Add PostgreSQL observability telemetry exposure #1808
DmytroPI-dev wants to merge 11 commits into `feature/database-controllers`
Conversation
CLA Assistant Lite bot: I have read the CLA Document and I hereby sign the CLA. 1 out of 2 committers have signed the CLA.
ConnectionPoolerMetrics *FeatureDisableOverride `json:"connectionPoolerMetrics,omitempty"`
}

type FeatureDisableOverride struct {
Could we align to what we already have? https://github.com/splunk/splunk-operator/pull/1808/changes#diff-e68b5e4731e03120f6625b75b8562e7845606ddaeae77144b79fc4368171e848L101
@@ -0,0 +1,321 @@
# PostgreSQL Monitoring E2E on KIND
This file probably shouldn't be here; it sounds like an internal testing approach. Maybe move it to Confluence?
@@ -0,0 +1,434 @@
# PostgreSQL Monitoring E2E with OTel Collector

```
-f test/postgresql/monitoring/prometheus-via-otel-values.yaml
```

This is important: Prometheus should scrape the OTel Collector exporter, not the PostgreSQL and PgBouncer pods directly. Otherwise Grafana will bypass OTel or you will get duplicate series.
What do you mean by this statement? We don't have Prometheus in this setup?
- `cnpg_pgbouncer_last_collection_error`
- `cnpg_pgbouncer_pools_cl_active`

## 7. Verify Prometheus is scraping OTel
Prometheus shouldn't be involved here at all; the OTel collector should send it directly to Grafana: https://grafana.com/docs/opentelemetry/#ingest-otlp-data
|
|
func (r *PostgresClusterReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	rc := &clustercore.ReconcileContext{Client: r.Client, Scheme: r.Scheme, Recorder: r.Recorder, Metrics: r.Metrics}
	metrics := r.Metrics
What problem do we solve here? Our reconciler should expect proper metrics to be initialized; a noop recorder is only for testing.
	}
}

recreateClusterClass := func(modify func(*enterprisev4.PostgresClusterClass)) {
This is not readable; can we make it more straightforward? What problem are we trying to solve here? Why do we need to recreate the class? Especially since in real life a class cannot be recreated if it is already in use. Maybe instead let's create a few classes that we use in the different scenarios, and change the cluster config itself.
	return ctrl.Result{}, err
}
rc := &dbcore.ReconcileContext{Client: r.Client, Scheme: r.Scheme, Recorder: r.Recorder, Metrics: r.Metrics}
metrics := r.Metrics
As previously, I don't think we should have it like that; metrics should be initialized in the reconciler.
	rc.emitPoolerReadyTransition(postgresCluster, oldConditions)
}

oldConditions := make([]metav1.Condition, len(postgresCluster.Status.Conditions))
Maybe instead of duplicating the oldConditions copy here and in line 337, we can make it once, before our switch?
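The single-snapshot suggestion could look like this (a minimal sketch: `Condition` is a simplified stand-in for `metav1.Condition`, and the helper name is invented for illustration):

```go
package main

import "fmt"

// Condition is a simplified stand-in for metav1.Condition.
type Condition struct {
	Type   string
	Status string
}

// snapshotConditions copies the status conditions once, before the reconcile
// switch, so every branch can diff against the same snapshot instead of
// re-copying per case.
func snapshotConditions(conditions []Condition) []Condition {
	old := make([]Condition, len(conditions))
	copy(old, conditions)
	return old
}

func main() {
	current := []Condition{{Type: "Ready", Status: "True"}}
	old := snapshotConditions(current)
	current[0].Status = "False" // later mutation does not touch the snapshot
	fmt.Println(old[0].Status)  // True
}
```

Each case of the switch would then receive the same `old` slice rather than building its own copy.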
	normalized.PgHBA = spec.PostgresConfiguration.PgHBA
}
if spec.InheritedMetadata != nil && len(spec.InheritedMetadata.Annotations) > 0 {
	normalized.InheritedAnnotations = spec.InheritedMetadata.Annotations
I probably read it wrong, but why do we set normalized.InheritedAnnotations if we already have it injected into the spec?
oldConditions := make([]metav1.Condition, len(postgresCluster.Status.Conditions))
copy(oldConditions, postgresCluster.Status.Conditions)

if err := reconcilePostgreSQLMetricsService(ctx, c, rc.Scheme, postgresCluster, postgresMetricsEnabled); err != nil {
This could potentially be packed into a unit-of-work style code block, since it's very repeatable across the logic statements. It could be packed into an executable interface which handles the component metrics, or something along those lines.
The code would then be separated into testable blocks and orchestrated cleanly.
	return false
}
if cluster == nil || cluster.Spec.Monitoring == nil || cluster.Spec.Monitoring.PostgreSQLMetrics == nil {
	return true
What does that mean? Metrics-enabled returns true on nil values? Could you please explain?
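The nil handling questioned here appears to implement "enabled unless explicitly disabled": the cluster-level override is disable-only, so an absent override falls through to the default. A minimal sketch of that reading (types are simplified stand-ins for the CRD types, and the helper is reduced to one argument, dropping the class parameter the real function takes):

```go
package main

import "fmt"

// Simplified stand-ins for the CRD types; real names and fields may differ.
type FeatureDisableOverride struct {
	Disabled bool
}

type Monitoring struct {
	PostgreSQLMetrics *FeatureDisableOverride
}

type ClusterSpec struct {
	Monitoring *Monitoring
}

type Cluster struct {
	Spec ClusterSpec
}

// isPostgreSQLMetricsEnabled treats an absent override as "no opt-out":
// a nil cluster, nil monitoring block, or nil override all mean the default
// (enabled) stands; only an explicit Disabled=true turns metrics off.
func isPostgreSQLMetricsEnabled(cluster *Cluster) bool {
	if cluster == nil || cluster.Spec.Monitoring == nil || cluster.Spec.Monitoring.PostgreSQLMetrics == nil {
		return true // nothing specified: fall back to the enabled default
	}
	return !cluster.Spec.Monitoring.PostgreSQLMetrics.Disabled
}

func main() {
	fmt.Println(isPostgreSQLMetricsEnabled(nil)) // true: no override present
	optOut := &Cluster{Spec: ClusterSpec{
		Monitoring: &Monitoring{PostgreSQLMetrics: &FeatureDisableOverride{Disabled: true}},
	}}
	fmt.Println(isPostgreSQLMetricsEnabled(optOut)) // false: explicit opt-out
}
```

If that reading is right, a short comment above the nil check would answer this review question in the code itself.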
	return ctrl.Result{}, errors.Join(err, statusErr)
}

postgresMetricsEnabled := isPostgreSQLMetricsEnabled(postgresCluster, clusterClass)
Just an idea: could that be merged into a general entry point like "is this component's metrics exposure enabled"? There is always the possibility to expand later without adding isComponent1MetricsEnabled ... isComponentNMetricsEnabled. Just one method, getComponentMetricsSettings(...), returning map[string]bool keyed by component. One config poll for anything coming in the future.
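The map-based suggestion could look like this (a sketch under stated assumptions: the component names and the `toggles` source struct are invented for illustration; the real inputs would be the class and cluster specs):

```go
package main

import "fmt"

// Component identifies a metrics-exposing component.
type Component string

const (
	ComponentPostgreSQL       Component = "postgresql"
	ComponentConnectionPooler Component = "connectionPooler"
)

// toggles is a simplified stand-in for the resolved class/cluster flags.
type toggles struct {
	postgresEnabled bool
	poolerEnabled   bool
}

// getComponentMetricsSettings returns one map for all components, so adding
// a new component means adding a map entry instead of another
// isXMetricsEnabled helper and another call site in the reconciler.
func getComponentMetricsSettings(t toggles) map[Component]bool {
	return map[Component]bool{
		ComponentPostgreSQL:       t.postgresEnabled,
		ComponentConnectionPooler: t.poolerEnabled,
	}
}

func main() {
	settings := getComponentMetricsSettings(toggles{postgresEnabled: true})
	for c, enabled := range settings {
		fmt.Printf("%s metrics enabled: %v\n", c, enabled)
	}
}
```

The reconciler would then poll the map once and pass the per-component flags down, rather than calling a dedicated helper per component.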
}

func normalizeCNPGClusterSpec(spec cnpgv1.ClusterSpec, customDefinedParameters map[string]string) normalizedCNPGClusterSpec {
	normalized := normalizedCNPGClusterSpec{
We could potentially map it via a JSON contract, unless the tags are already occupied in our specs, which they probably are. If not, it would be mapped straight into the cnpg spec.
Btw, what do inheritedAnnotations mean, tech-wise?
Inherited annotations set annotations on the k8s pods. Thanks to this we have a way to discover every pod with those annotations, and this is what the OTel collector uses to find pod endpoints to scrape from.
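Concretely, the inherited annotations carry the conventional Prometheus-style scrape hints. A sketch of what the builder might produce (the annotation keys follow the common `prometheus.io/*` convention and the ports come from the PR description; the helper name mirrors the diff but this is not the actual implementation):

```go
package main

import "fmt"

const (
	prometheusScrapeAnnotation = "prometheus.io/scrape"
	prometheusPortAnnotation   = "prometheus.io/port"
	postgresMetricsPort        = "9187" // CNPG built-in exporter on PostgreSQL pods
	poolerMetricsPort          = "9127" // exporter on PgBouncer pooler pods
)

// buildPostgresScrapeAnnotations returns the annotations that, once set as
// InheritedMetadata.Annotations on the CNPG Cluster, propagate to every
// PostgreSQL pod so the OTel collector can discover scrape endpoints.
func buildPostgresScrapeAnnotations() map[string]string {
	return map[string]string{
		prometheusScrapeAnnotation: "true",
		prometheusPortAnnotation:   postgresMetricsPort,
	}
}

// buildPoolerScrapeAnnotations does the same for PgBouncer pooler pods,
// via Template.ObjectMeta.Annotations on the CNPG Pooler resource.
func buildPoolerScrapeAnnotations() map[string]string {
	return map[string]string{
		prometheusScrapeAnnotation: "true",
		prometheusPortAnnotation:   poolerMetricsPort,
	}
}

func main() {
	fmt.Println(buildPostgresScrapeAnnotations()[prometheusPortAnnotation]) // 9187
	fmt.Println(buildPoolerScrapeAnnotations()[prometheusPortAnnotation])   // 9127
}
```

The collector's Kubernetes discovery then selects pods by these annotations and scrapes the annotated port, which is why no dedicated metrics `Service` is required.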
assert.Equal(t, postgresMetricsPortString, cluster.Spec.InheritedMetadata.Annotations[prometheusPortAnnotation])
}

func TestClusterSecretExists(t *testing.T) {
Nit: isn't it more readable to split the units into chunks?
A naming idea, just for reference: TestClusterSecret_with_n_expected_rwro_poolers_exist.
Description
Adds PostgreSQL observability telemetry for `PostgresCluster` using Prometheus pod-annotation-based scraping. Metrics are exposed by CNPG's built-in exporters on PostgreSQL pods (port `9187`) and PgBouncer pooler pods (port `9127`). The operator controls whether annotations are injected via class- and cluster-level configuration, with no dedicated metrics `Service` or `ServiceMonitor` required for PostgreSQL or PgBouncer scraping. A `ServiceMonitor` is still supported for operator-controller metrics as an optional step.
Key Changes
`api/v4/postgresclusterclass_types.go`: Added class-level observability configuration (`monitoring.postgresqlMetrics.enabled`, `monitoring.connectionPoolerMetrics.enabled`) that controls whether scrape annotations are injected into CNPG pods.

`api/v4/postgrescluster_types.go`: Added cluster-level disable-only overrides (`spec.monitoring.postgresqlMetrics.disabled`, `spec.monitoring.connectionPoolerMetrics.disabled`) allowing per-cluster opt-out without changing the class.

`pkg/postgresql/cluster/core/cluster.go`: Wired observability flag resolution into `PostgresCluster` reconciliation. When enabled, sets `InheritedMetadata.Annotations` on the CNPG `Cluster` (for PostgreSQL pods) and `Template.ObjectMeta.Annotations` on CNPG `Pooler` resources (for PgBouncer pods).

`pkg/postgresql/cluster/core/monitoring.go`: Added `isPostgreSQLMetricsEnabled`/`isConnectionPoolerMetricsEnabled` flag resolution helpers, `buildPostgresScrapeAnnotations`/`buildPoolerScrapeAnnotations` annotation builders, and `removeScrapeAnnotations` for the disable path.

`pkg/postgresql/cluster/core/monitoring_unit_test.go`: Added unit tests for flag resolution, scrape annotation builders, and annotation removal.

`internal/controller/postgrescluster_controller_test.go`: Added integration tests verifying that `InheritedMetadata` annotations are set on the CNPG `Cluster` when monitoring is enabled and removed when disabled by cluster override.

`docs/PostgreSQLObservabilityDashboard.json`: Reference Grafana dashboard covering PostgreSQL target count, RW/RO PgBouncer availability, WAL activity, database sizes, PgBouncer client load, controller reconcile metrics, and domain fleet metrics.

`docs/postgresSQLMonitoring-e2e.md`: End-to-end validation guide for the annotation-based scraping flow on KIND.
Testing and Verification
Added unit tests in `pkg/postgresql/cluster/core/monitoring_unit_test.go` covering flag resolution and the scrape annotation builders for PostgreSQL (port `9187`) and PgBouncer (port `9127`).

Added integration tests in `internal/controller/postgrescluster_controller_test.go` verifying `InheritedMetadata.Annotations` presence when monitoring is enabled.

Related Issues
CPI-1853 — related JIRA ticket.
Grafana screenshot:
PR Checklist