Volume integral evaluation time grows linearly across long runs (~80x in 3000 calls) #171

@lmoresi

Description

Symptom

In a long Rayleigh–Bénard run at Ra=1e7 (1/64 mesh, ~3100 timesteps), per-step solver costs (Stokes + advection–diffusion) stayed flat while the volume-integral diagnostic time (t_vol — used for Vrms, mean_T, Vmax) grew linearly with the number of integral evaluations.

| Quartile of run                      | Mean t_vol | First sample | Last sample |
|--------------------------------------|------------|--------------|-------------|
| Q1 (steps 645–1425, 157 diag. calls) | 44 s       | 5 s          | 84 s        |
| Q2 (steps 1430–2210)                 | 149 s      | 84 s         | 418 s       |
| Q3 (steps 2215–2990)                 | 267 s      | 194 s        | 833 s       |
| Q4 (steps 2995–3770)                 | 348 s      | 296 s        | 576 s       |

Linear fit over the diagnostic samples: +0.13 s per simulation step (≈ +0.65 s per integral call). By step ~3700 a single diagnostic step costs ~410 s, roughly 20× the actual solve (t_stokes + t_adv ≈ 21 s, also flat).

For comparison, t_stokes and t_adv over the same 3134 steps had slopes of +0.0001 s/step and +0.00009 s/step respectively (≈ 3% drift, well inside noise). So the growth is specific to the volume-integral path.

(Plot: t_vol vs step alongside t_stokes and t_adv, with the linear fit, can be attached.)
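As a quick cross-check of the quoted +0.13 s/step, a least-squares fit through the four quartile midpoints and mean t_vol values in the table above reproduces it. This is plain Python; the x-values are simply the centres of the step ranges in the table:

```python
# Least-squares slope of mean t_vol vs. quartile midpoint step,
# using the four (step range, mean t_vol) rows from the table above.
quartiles = [
    ((645, 1425), 44.0),    # Q1
    ((1430, 2210), 149.0),  # Q2
    ((2215, 2990), 267.0),  # Q3
    ((2995, 3770), 348.0),  # Q4
]

xs = [(lo + hi) / 2 for (lo, hi), _ in quartiles]  # midpoint step of each quartile
ys = [t for _, t in quartiles]

mx = sum(xs) / len(xs)
my = sum(ys) / len(ys)
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)

print(f"fitted slope: {slope:.3f} s/step")  # ~0.13 s/step, matching the quoted fit
```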

Reproduction

  • Script: the standard convection workflow (docs/examples/workflows/convection.py), whose _compute_diagnostics(...) makes three Integral.evaluate() calls per diagnostic step (Vrms, mean_T, Vmax).
  • The Integral objects themselves are cached (one object per integrand, reused; see convection_config.py:378–407), so this is not JIT recompilation: the JIT cache hits on every call.

Suspect codepath

Integral.evaluate() in src/underworld3/cython/petsc_maths.pyx:

self.dm = self.mesh.dm  # .clone()                           # line 89
...
cdef DS ds = self.dm.getDS()                                  # line 105
ierr = PetscDSSetObjective(ds.ds, 0, ext.fns_residual[0])     # line 109
ierr = DMPlexComputeIntegralFEM(dm.dm, cgvec.vec, ...)        # line 110

Every evaluate() call:

  1. takes a reference to the shared mesh DS (no clone, no clear),
  2. registers a new objective callback via PetscDSSetObjective(ds.ds, 0, fn),
  3. computes the integral.

The DS is never cleared between calls. Hypothesis (the obvious one): the integration function pointers are being accumulated on the shared DS rather than replaced — every diagnostic call adds another callback that gets walked at integration time, producing the linear growth observed.
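The hypothesized mechanism can be sketched in plain Python (FakeDS is a stand-in for the shared PetscDS, not real PETSc API): if the set-objective call appends rather than replaces, every integration walks every callback ever registered, which produces exactly this kind of linear growth:

```python
# Toy model of the hypothesized DS behaviour -- not PETSc code.
# FakeDS stands in for the shared PetscDS attached to the mesh DM.
class FakeDS:
    def __init__(self):
        self.objectives = []

    def set_objective_append(self, fn):
        # Hypothesized buggy behaviour: each call accumulates a callback
        # instead of replacing the one in slot 0.
        self.objectives.append(fn)

    def integrate(self):
        # Integration walks every registered callback; the return value
        # doubles as a count of the work done this call.
        return sum(fn() for fn in self.objectives)

ds = FakeDS()                      # shared DS, never cleared between calls
work_per_call = []
for step in range(5):
    ds.set_objective_append(lambda: 1)     # one new objective per evaluate()
    work_per_call.append(ds.integrate())   # callbacks walked this call

print(work_per_call)  # [1, 2, 3, 4, 5] -- linear growth, as seen in t_vol
```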

Supporting evidence inside the same file

The sister class CellWiseIntegral.evaluate() (lines 245–302) takes the opposite approach for the same operation:

dmc = self.mesh.dm.clone()    # line 287
dmc.setField(0, fec)
dmc.createDS()                # line 290
...
PetscDSSetObjective(...)      # on the fresh DS

It clones the DM and creates a fresh DS for every integration. The asymmetry suggests that the accumulation issue was at least implicitly known when CellWiseIntegral was written — Integral.evaluate() inherits the bug.

The commented-out # .clone() on line 89 also looks like a hint that someone tried cloning here and reverted.

Proposed fix (to investigate, not yet implemented)

Either:

  1. Clear the DS objective before re-setting: before line 109, call something equivalent to PetscDSSetObjective(ds.ds, 0, NULL), or whatever PETSc exposes for resetting per-field objectives.
  2. Match CellWiseIntegral: clone the DM and createDS() per call (heavier, but guaranteed isolation).

Option (1) is preferred if PETSc supports it cleanly, since option (2) allocates a new DM on every call.
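The two options can be contrasted in a plain-Python sketch (FakeDS is a stand-in, not PETSc API; whether PETSc actually accepts a NULL objective as a reset is exactly the part that needs checking):

```python
# Toy model contrasting the two proposed fixes -- not PETSc code.
class FakeDS:
    def __init__(self):
        self.objectives = []

# Option 1: reset the objective slot before setting it.
# Per-call work stays constant on the shared DS.
def evaluate_with_reset(ds, fn):
    ds.objectives.clear()      # analogue of setting the objective to NULL
    ds.objectives.append(fn)   # analogue of PetscDSSetObjective(ds, 0, fn)
    return sum(f() for f in ds.objectives)

# Option 2: fresh DS per call (the CellWiseIntegral pattern).
# Guaranteed isolation, but pays an allocation on every call.
def evaluate_with_clone(fn):
    ds = FakeDS()              # analogue of dm.clone() + createDS()
    ds.objectives.append(fn)
    return sum(f() for f in ds.objectives)

shared = FakeDS()
costs = []
for _ in range(5):
    evaluate_with_reset(shared, lambda: 1)
    costs.append(len(shared.objectives))   # callbacks registered after each call

print(costs)  # [1, 1, 1, 1, 1] -- flat, unlike the accumulating behaviour
```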

Impact

  • The convection example becomes diagnostic-bound after a few thousand steps. For longer scientific runs (the 3134-step run above ran for ~24 wall-hours; ~85% of the wall-time was in t_vol after step 2200), this is a hard ceiling on what the workflow can do unattended.
  • Affects any user calling mesh.integrate(...) (or Integral.evaluate()) repeatedly across a long run — diagnostics, time-series, mass-conservation checks, convergence monitors.
  • Likely related to the broader DM lifecycle / object-accumulation pattern noted elsewhere in the codebase.

Discovered while

Validating the SLCN advection–diffusion fix and the Rayleigh–Bénard workflow tooling — see the long Ra=1e7 1/64 cubic baseline run in /Users/lmoresi/+Simulations/RayleighBenard/Ra1e7_1over64/.

Underworld development team with AI support from Claude Code

Labels: bug