Cleanup Gradle build daemon after Launcher tests #11499

Open
daniel-mohedano wants to merge 5 commits into master from daniel.mohedano/gradle-launcher-temp-cleanup

Conversation

@daniel-mohedano
Contributor

@daniel-mohedano daniel-mohedano commented May 29, 2026

What Does This Do

Makes the Gradle smoke tests robust against intermittent CI failures, with three related changes:

  1. GradleLauncherSmokeTest teardown flake: adds an @AfterEach hook that runs ./gradlew --stop to stop the Gradle build daemon before JUnit deletes the static @TempDir gradleUserHome.
  2. GradleDaemonSmokeTest teardown flake: adds an @AfterAll hook that calls DefaultGradleConnector.close() to stop the TestKit daemons before JUnit deletes the static @TempDir testKitFolder.
  3. succeed-gradle-plugin-test OutOfMemoryError: bumps the shared smoke-test build/daemon heap from -Xmx256m to -Xmx512m in CiVisibilitySmokeTest.

Motivation

GradleLauncherSmokeTest

GradleLauncherSmokeTest was flaky in CI. The test itself always passed, but teardown intermittently failed with:

org.junit.platform.commons.JUnitException: Failed to close extension context
Caused by: java.io.IOException: Failed to delete temp directory /tmp/junit-...
  The following paths could not be deleted: <root>, daemon
    Suppressed: DirectoryNotEmptyException: /tmp/junit-.../daemon

Even though the launcher runs with --no-daemon, Gradle still spawns a single-use build daemon that writes into $GRADLE_USER_HOME/daemon/<version>/. When that process hasn't fully released its file handles by the time JUnit's static @TempDir cleanup runs, the recursive delete races against it and fails with DirectoryNotEmptyException, surfacing as a class-level failure.

Explicitly stopping the daemon in @AfterEach (which runs before the static @TempDir cleanup) makes the deletion deterministic.
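
The teardown step described above might look roughly like the following. This is a minimal sketch using only the JDK: the class name, method names, and the 30-second timeout are illustrative assumptions, and the PR itself goes through the project's ShellCommandExecutor rather than ProcessBuilder.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.concurrent.TimeUnit;

final class GradleStopSketch {
  // Assumed value; the PR's GRADLE_STOP_TIMEOUT_MILLIS constant is not shown here.
  static final long STOP_TIMEOUT_SECONDS = 30;

  /** Builds the `./gradlew --stop` invocation without starting it. */
  static ProcessBuilder stopCommand(Path projectDir, Path gradleUserHome) {
    ProcessBuilder pb = new ProcessBuilder("./gradlew", "--stop");
    pb.directory(projectDir.toFile());
    // Point at the same Gradle user home the test used, and clear GRADLE_OPTS
    // so options inherited from the parent environment are not applied.
    pb.environment().put("GRADLE_USER_HOME", gradleUserHome.toString());
    pb.environment().put("GRADLE_OPTS", "");
    // Forward the child's output so it cannot block on a full pipe buffer.
    pb.redirectErrorStream(true);
    pb.redirectOutput(ProcessBuilder.Redirect.INHERIT);
    return pb;
  }

  /** Best effort: returns false instead of throwing when the stop fails or times out. */
  static boolean stopDaemonsBestEffort(Path projectDir, Path gradleUserHome) {
    try {
      Process process = stopCommand(projectDir, gradleUserHome).start();
      if (!process.waitFor(STOP_TIMEOUT_SECONDS, TimeUnit.SECONDS)) {
        process.destroyForcibly();
        return false;
      }
      return process.exitValue() == 0;
    } catch (IOException e) {
      // For example: the working directory or the wrapper script does not exist.
      return false;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    }
  }
}
```

In a JUnit 5 test, an @AfterEach method could call stopDaemonsBestEffort(projectFolder, gradleUserHome) and ignore the result, so that a failed stop never turns a passing test into a failing one.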

GradleDaemonSmokeTest

The TestKit-based GradleDaemonSmokeTest suffers from the same class of teardown failure:

org.junit.platform.commons.JUnitException: Failed to close extension context
Caused by: java.io.IOException: Failed to delete temp directory /tmp/junit-...
  paths could not be deleted: <root>, caches, caches/8.9, caches/8.9/groovy-dsl
    Suppressed: DirectoryNotEmptyException: /tmp/junit-.../caches/8.9/groovy-dsl

TestKit runs every build in a daemon rooted under <testKitFolder>/test-kit-daemon with a 120s idle timeout. At class teardown a daemon is often still alive and still holding file handles on its caches/<version> directory, so JUnit's recursive delete of the static @TempDir testKitFolder fails.

OutOfMemoryError in succeed-gradle-plugin-test

The test-succeed-gradle-plugin-test project applies java-gradle-plugin, which puts gradleApi() on its compile classpath. This could overflow the 256m daemon heap, making the build die with:

java.lang.OutOfMemoryError: Java heap space
> Task :compileJava FAILED
  > Could not resolve all files for configuration ':compileClasspath'.
    > Failed to create Jar file .../generated-gradle-jars/gradle-api-9.5.1.jar

Additional Notes

test-environment-trigger: skip

Contributor Checklist

  • Format the title according to the contribution guidelines
  • Assign the type: and (comp: or inst:) labels in addition to any other useful labels
  • Avoid using close, fix, or any linking keywords when referencing an issue
    Use solves instead, and assign the PR milestone to the issue
  • Update the CODEOWNERS file on source file addition, migration, or deletion
  • Update public documentation with any new configuration flags or behaviors
  • Add your completed PR to the merge queue by commenting /merge. You can also:
    • Customize the commit message associated with the merge with /merge --commit-message "..."
    • Remove your PR from the merge queue with /merge -c
    • Skip all merge queue checks with /merge -f --reason "reason"; please use this judiciously, as some checks do not run at the PR-level
    • Get more information in this doc

Jira ticket: [PROJ-IDENT]

@daniel-mohedano daniel-mohedano added the comp: ci visibility, type: bug, and tag: no release notes labels May 29, 2026
@daniel-mohedano daniel-mohedano marked this pull request as ready for review May 29, 2026 10:53
@daniel-mohedano daniel-mohedano requested a review from a team as a code owner May 29, 2026 10:53

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 136bb6ee87

env.put("GRADLE_USER_HOME", gradleUserHome.toString());
env.put("GRADLE_OPTS", "");
ShellCommandExecutor shellCommandExecutor =
    new ShellCommandExecutor(projectFolder.toFile(), GRADLE_STOP_TIMEOUT_MILLIS, env);

[P2] Use an existing project directory for daemon shutdown

Because projectFolder is an instance @TempDir from AbstractGradleTest, JUnit scopes it to each parameterized invocation and cleans it up when that invocation finishes; by the time this @AfterAll runs, the last projectFolder has already been deleted. In that state ShellCommandExecutor starts ./gradlew --stop with a non-existent working directory, the exception is swallowed by the best-effort catch, and the shared static gradleUserHome is still left for JUnit to delete while Gradle daemons may hold files open.
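
One way to guard against the deleted working directory described above is to check candidates before launching the stop command. The sketch below is hypothetical: the class and method names are invented for illustration, and since the follow-up commit is not shown here, this is not necessarily how the issue was addressed.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Optional;

final class WrapperDirSketch {
  /**
   * Returns the first candidate that still exists on disk and contains a
   * `gradlew` script, or empty when none of them is usable.
   */
  static Optional<Path> firstUsableProjectDir(Path... candidates) {
    for (Path candidate : candidates) {
      if (candidate != null
          && Files.isDirectory(candidate)
          && Files.isRegularFile(candidate.resolve("gradlew"))) {
        return Optional.of(candidate);
      }
    }
    return Optional.empty();
  }
}
```

A teardown hook could skip the stop attempt, or log a warning, when the result is empty instead of silently starting a process in a directory that no longer exists.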

Contributor Author

addressed in c36724f

Contributor

@bric3 bric3 left a comment

Maybe that's enough, but just in case: when writing tests for our Gradle plugins I had to handle this rather harshly, because stopping the daemon wasn't enough, or wasn't done in a timely enough manner for the CI. This is what I did in the end.

private val testKitDir: File get() = sharedTestKitDir

companion object {
    // JVM-wide testkit dir shared across all GradleFixture instances. One daemon
    // pool serves every test method, so kotlinc work on .gradle.kts scripts is
    // amortized instead of re-paid per test (recovers the +77 % wall-time
    // regression introduced by the Groovy→Kotlin DSL conversion).
    //
    // Lives outside any @TempDir so JUnit cleanup never races with daemon file
    // locks. See https://github.com/gradle/gradle/issues/12535
    //
    // TestKit may reuse the same daemon for builds with different withEnvironment()
    // values, so build logic must not cache environment-derived state in daemon-static
    // fields.
    private val sharedTestKitDir: File by lazy {
        Files.createTempDirectory("gradle-testkit-").toFile().also { dir ->
            Runtime.getRuntime().addShutdownHook(Thread {
                stopDaemonsIn(dir)
                dir.deleteRecursively()
            })
        }
    }

    /**
     * Kills Gradle daemons started by TestKit under the given testkit dir.
     *
     * The Gradle Tooling API (used by [GradleRunner]) always spawns a daemon and
     * provides no public API to stop it (https://github.com/gradle/gradle/issues/12535).
     * We replicate the strategy Gradle uses in its own integration tests
     * ([DaemonLogsAnalyzer.killAll()][1]):
     *
     * 1. Scan `<testkit>/daemon/<version>/` for log files matching
     *    `DaemonLogConstants.DAEMON_LOG_PREFIX + pid + DaemonLogConstants.DAEMON_LOG_SUFFIX`,
     *    i.e. `daemon-<pid>.out.log`.
     * 2. Extract the PID from the filename and kill the process.
     *
     * Trade-offs of the PID-from-filename approach:
     * - **PID recycling**: between the build finishing and `kill` being sent, the OS
     *   could theoretically recycle the PID. Now that this only runs at JVM exit
     *   (no longer per-test), the window is short — when called from the shutdown
     *   hook all daemons we own are still alive — so the risk remains negligible.
     * - **Filename convention is internal**: Gradle's `DaemonLogConstants.DAEMON_LOG_PREFIX`
     *   (`"daemon-"`) / `DAEMON_LOG_SUFFIX` (`".out.log"`) are not public API; a future
     *   Gradle version could change them. The `toLongOrNull()` guard safely skips entries
     *   that don't parse as a PID (including the UUID fallback Gradle uses when the PID
     *   is unavailable).
     * - **Java 8 compatible**: uses `kill`/`taskkill` via [ProcessBuilder] instead of
     *   `ProcessHandle` (Java 9+) because build logic targets JVM 1.8.
     *
     * [1]: https://github.com/gradle/gradle/blob/43b381d88/testing/internal-distribution-testing/src/main/groovy/org/gradle/integtests/fixtures/daemon/DaemonLogsAnalyzer.groovy
     */
    private fun stopDaemonsIn(testKitDir: File) {
        val daemonDir = File(testKitDir, "daemon")
        if (!daemonDir.exists()) return
        daemonDir.walkTopDown()
            .filter { it.isFile && it.name.endsWith(".out.log") && !it.name.startsWith("hs_err") }
            .forEach { logFile ->
                val pid = logFile.nameWithoutExtension // daemon-12345.out
                    .removeSuffix(".out") // daemon-12345
                    .removePrefix("daemon-") // 12345
                    .toLongOrNull() ?: return@forEach // skip UUIDs / unparseable names
                val isWindows = System.getProperty("os.name").lowercase().contains("win")
                val killProcess = if (isWindows) {
                    ProcessBuilder("taskkill", "/F", "/PID", pid.toString())
                } else {
                    ProcessBuilder("kill", pid.toString())
                }
                try {
                    val process = killProcess.redirectErrorStream(true).start()
                    process.waitFor(5, java.util.concurrent.TimeUnit.SECONDS)
                } catch (_: Exception) {
                    // best effort — daemon may already be stopped
                }
            }
    }
}
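
Since the smoke tests in this PR are written in Java, the filename-parsing step of the Kotlin helper above could be ported along these lines. This is only a sketch: the class and method names are hypothetical, the `daemon-<pid>.out.log` convention is taken from the comment above (where it is described as internal to Gradle), and rejecting non-positive values is an extra guard not present in the Kotlin code.

```java
import java.util.OptionalLong;

final class DaemonLogPidSketch {
  private static final String PREFIX = "daemon-";
  private static final String SUFFIX = ".out.log";

  /** Extracts the PID from a name like "daemon-12345.out.log", or returns empty. */
  static OptionalLong pidFromLogName(String fileName) {
    if (!fileName.startsWith(PREFIX) || !fileName.endsWith(SUFFIX)) {
      return OptionalLong.empty();
    }
    String middle =
        fileName.substring(PREFIX.length(), fileName.length() - SUFFIX.length());
    try {
      long pid = Long.parseLong(middle);
      // Skip values that cannot be a real process id.
      return pid > 0 ? OptionalLong.of(pid) : OptionalLong.empty();
    } catch (NumberFormatException e) {
      // UUID fallback names and other unparseable entries are skipped.
      return OptionalLong.empty();
    }
  }
}
```

The resulting PID could then be passed to `kill` or `taskkill` through ProcessBuilder, as in the Kotlin version.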

@dd-octo-sts
Contributor

dd-octo-sts Bot commented May 29, 2026

🟢 Java Benchmark SLOs — All performance SLOs passed

Suite Status
Startup 🟢 pass

SLO thresholds are defined here based on automatically generated metrics. A warning is raised when results are within 5% of the threshold.

PR vs. master results
Scenario                               Candidate   master    Δ (95% CI of mean)
startup:insecure-bank:iast:Agent       13.94 s     14.07 s   [-1.9%; +0.2%] (no difference)
startup:insecure-bank:tracing:Agent    12.92 s     13.04 s   [-2.3%; +0.5%] (no difference)
startup:petclinic:appsec:Agent         16.53 s     16.37 s   [-0.0%; +2.0%] (no difference)
startup:petclinic:iast:Agent           16.57 s     16.69 s   [-1.6%; +0.1%] (no difference)
startup:petclinic:profiling:Agent      16.45 s     16.59 s   [-1.8%; +0.2%] (no difference)
startup:petclinic:tracing:Agent        15.85 s     15.11 s   [-3.8%; +13.5%] (unstable)

Commit: 0d364248 · CI Pipeline · Benchmarking Platform UI


Load and DaCapo benchmarks can be triggered manually in the GitLab pipeline. Results will appear in the Benchmarking Platform UI after completion.

Contributor

@AlexeyKuznetsov-DD AlexeyKuznetsov-DD left a comment

  1. Let's test on CI with [NON_DEFAULT_JVMS=true].
  2. Change PR description AfterAll -> AfterEach.

@datadog-prod-us1-3

datadog-prod-us1-3 Bot commented May 29, 2026

Pipelines

Fix all issues with BitsAI

⚠️ Warnings

🚦 10 Pipeline jobs failed

DataDog/apm-reliability/dd-trace-java | test_smoke: [ibm8, 8/8]   View in Datadog   GitLab

🔧 Fix in code (Fix with Cursor). 4 tests failed in GradleDaemonSmokeTest due to marked verification errors. See report at file:///go/src/github.com/DataDog/apm-reliability/dd-trace-java/workspace/dd-smoke-tests/gradle/build/reports/tests/test/index.html

DataDog/apm-reliability/dd-trace-java | check_smoke 1/4   View in Datadog   GitLab

🔄 Retry job. This looks flaky and may succeed on retry. Could not read workspace metadata from Gradle cache, causing build failure.

DataDog/apm-reliability/dd-trace-java | check_smoke 2/4   View in Datadog   GitLab

🔄 Retry job. This looks flaky and may succeed on retry. Could not read workspace metadata from /go/src/github.com/DataDog/apm-reliability/dd-trace-java/.gradle/caches/8.14.5/groovy-dsl/001ccff1ba101051da4f247fcc9dc23c/metadata.bin due to buffer underflow.

View all 10 failed jobs.

🔗 Commit SHA: 0d36424 | Docs | Datadog PR Page | Give us feedback!

@daniel-mohedano daniel-mohedano added the tag: ai generated label May 29, 2026
  protected List<String> buildJvmArguments(
      String mockBackendIntakeUrl, String serviceName, Map<String, String> additionalArgs) {
-   List<String> arguments = new ArrayList<>(Arrays.asList("-Xms256m", "-Xmx256m"));
+   List<String> arguments = new ArrayList<>(Arrays.asList("-Xms512m", "-Xmx512m"));
Contributor

nit: we could try keeping the lower initial heap and only raising the maximum: "-Xms256m", "-Xmx512m". WDYT?

Contributor Author

good point, tackled in 0d36424

[ci: NON_DEFAULT_JVMS]

Labels

comp: ci visibility, tag: ai generated, tag: no release notes, type: bug

3 participants