Skip to content

fix(opensearch): make shadow write log level configurable (#35302)#35375

Open
fabrizzio-dotCMS wants to merge 1 commit intomainfrom
fix/issue-35302-shadow-write-log-level
Open

fix(opensearch): make shadow write log level configurable (#35302)#35375
fabrizzio-dotCMS wants to merge 1 commit intomainfrom
fix/issue-35302-shadow-write-log-level

Conversation

@fabrizzio-dotCMS
Copy link
Copy Markdown
Member

Summary

  • Introduces DOTCMS_SHADOW_WRITE_LOG_LEVEL config property (default WARN) so operators can tune the visibility of OS shadow write failures in dual-write phases without code changes
  • Migrates all 6 fire-and-forget log sites (PhaseRouter ×3, BulkProcessorListener ×2, ContentletIndexAPIImpl ×1) to the new centralised IndexConfigHelper.logShadowWriteFailure() helper
  • Adds PhaseRouterTest with 12 unit tests covering the mismatched-index-name scenario (ES has working_T0, OS has working_T1) across all four migration phases

Usage

# dotmarketing-config.properties (or -D flag)
DOTCMS_SHADOW_WRITE_LOG_LEVEL=ERROR   # surface shadow failures in alerts during QA
DOTCMS_SHADOW_WRITE_LOG_LEVEL=DEBUG   # suppress during steady-state migration
# default: WARN

Test plan

  • ./mvnw test -pl :dotcms-core -Dtest=PhaseRouterTest — 12/12 pass
  • ./mvnw compile -pl :dotcms-core -DskipTests — compiles clean
  • Manual: set DOTCMS_SHADOW_WRITE_LOG_LEVEL=ERROR, trigger a Phase 1 shadow write failure, confirm ERROR appears in logs
  • Manual: set DOTCMS_SHADOW_WRITE_LOG_LEVEL=DEBUG, confirm shadow failures are silent at WARN level

Closes #35302

🤖 Generated with Claude Code

…e-and-forget tests (#35302)

Introduces DOTCMS_SHADOW_WRITE_LOG_LEVEL (default WARN) so operators can raise
shadow OS write failures to ERROR during QA or lower them to DEBUG during
steady-state migration, without code changes. Adds PhaseRouterTest covering
all four migration phases with the mismatched-index-name scenario.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@claude
Copy link
Copy Markdown
Contributor

claude bot commented Apr 17, 2026

Claude finished @fabrizzio-dotCMS's task in 1m 57s —— View job


PR Review

  • Read changed files
  • Analyze for issues

1. INFO silently drops the throwable — IndexConfigHelper.java:62

case "INFO":  Logger.info(clazz,  message);    break;   // <-- t is dropped
case "ERROR": Logger.error(clazz, message, t); break;   // t included
default:      Logger.warn(clazz,  message, t); break;   // t included

Logger.info has no (Class, String, Throwable) overload, so the Throwable is silently discarded at INFO level. The Javadoc for this method documents @param t the throwable, or null if none, implying it should be surfaced at every level. An operator who sets DOTCMS_SHADOW_WRITE_LOG_LEVEL=INFO expecting to see stack traces will silently lose them with no indication why. The least-invasive fix is to call Logger.warnAndDebug instead of Logger.info, or add a fallback: Logger.info(clazz, message + " | cause: " + (t != null ? t : "")). Either way, the asymmetry should be documented explicitly. Fix this →


2. TOCTOU: phase is read 3-4 times per routing call — PhaseRouter.java

writeChecked calls writeProviders() (which calls isMigrationNotStarted() + isMigrationComplete()) and then readProvider() — that's 3 separate Config.getIntProperty reads within a single operation. If the phase is changed concurrently (e.g., during a live migration step-up from Phase 1 → Phase 2), one call could decide "dual-write" from writeProviders() but then select the wrong primary from readProvider(), fanning out to the wrong provider set. Same issue in writeBoolean and writeReturningChecked. Snapshotting the phase at the top of each method (final MigrationPhase phase = MigrationPhase.current()) would eliminate the inconsistency window. Fix this →


3. No warning for invalid config values — IndexConfigHelper.java:60-65

default: Logger.warn(clazz, message, t); break;

A typo like DOTCMS_SHADOW_WRITE_LOG_LEVEL=WRAN silently falls to WARN with no operator feedback. Consider logging once (e.g., using Logger.warnEvery) that the configured value is unrecognised.


4. Test coverage gap — PhaseRouterTest.java

writeBoolean tests cover Phase 1 (OS shadow throws → ES result returned) and Phase 2 (OS primary throws → propagates), but the symmetric Phase 2 case — OS succeeds as primary, ES shadow throws → result returned and exception swallowed — is missing. Same gap for Phase 0 (only ES called) and Phase 3 (only OS called) with writeBoolean. Given that writeBoolean has the only hand-rolled dual-write loop (as opposed to delegating to writeChecked), it would be worth covering.


Minor notes

  • isPhase2() (PhaseRouter.java:154) reads the config twice: isReadEnabled() && !isMigrationComplete(). Trivial but inconsistent with the snapshot-per-call fix above.
  • The write() method's catch (Exception e) { throw new RuntimeException(e); } path is correctly marked unreachable, but a // unreachable comment alone won't satisfy static analysis tools that warn about swallowed checked exceptions. Consider throw new AssertionError("unreachable", e) to make intent clearer.
  • IndexConfigHelper is an interface with only static methods and one constant — effectively a utility class. This prevents it from being final, which is a minor design smell, though harmless here since static interface methods can't be overridden.

The core logic is correct: shadow failures are properly fire-and-forget, primary failures propagate, and the phase-routing table is accurate. Issues 1 and 2 are the ones worth fixing before merge.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

AI: Safe To Rollback Area : Backend PR changes Java/Maven backend code

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

[BUG] OS exceptions propagate to client in Phase 1 dual-write (should be fire-and-forget)

1 participant