Core: Fix flaky TestHadoopCommits.testConcurrentFastAppends #16770

Open
wombatu-kun wants to merge 1 commit into apache:main from wombatu-kun:fix-flaky-concurrent-fast-appends

Conversation

@wombatu-kun

Contributor

Summary

TestHadoopCommits.testConcurrentFastAppends is flaky and fails intermittently with org.awaitility.core.ConditionTimeoutException at the barrier wait. It most recently surfaced on an unrelated PR (#16664) in core-tests (17) while the same job passed on core-tests (21), the classic load-dependent flaky signature.

Root cause

The test runs 5 threads that each commit 10 files, using an AtomicInteger barrier to force all 5 threads to attempt newFastAppend().commit() simultaneously in lock-step rounds, maximizing optimistic-concurrency contention on the Hadoop file-based commit. It sets commit.retry.num-retries to threadsCount (5), only one above the default of 4. Under that forced contention, file-rename races are not fair, so a committer can lose more times than its retry budget allows, exhaust its retries, and throw CommitFailedException. The thread that throws never increments the barrier, so the surviving threads block on barrier >= round * threadsCount until the awaitility timeout fires. Because the barrier is permanently stalled, no amount of additional wait time helps.
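The stall can be reproduced without Iceberg. The sketch below is a minimal model, not the test itself: the names `BarrierStallDemo` and `countStalledWorkers` are hypothetical, a plain polling loop stands in for Awaitility, and worker 0 simply returns early to mimic a `CommitFailedException` after its retries run out.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

final class BarrierStallDemo {

  /**
   * Runs `threads` workers through `rounds` lock-step rounds (zero-based).
   * Worker 0 gives up in round `failingRound` (use -1 for "never") without
   * incrementing the barrier. Returns how many workers timed out on the barrier.
   */
  static int countStalledWorkers(int threads, int rounds, int failingRound, long timeoutMillis)
      throws InterruptedException {
    AtomicInteger barrier = new AtomicInteger(0);
    AtomicInteger stalled = new AtomicInteger(0);
    ExecutorService pool = Executors.newFixedThreadPool(threads);

    for (int t = 0; t < threads; t++) {
      final int id = t;
      pool.execute(() -> {
        for (int round = 0; round < rounds; round++) {
          // Same shape as the test's condition: barrier >= round * threadsCount.
          if (!awaitBarrier(barrier, round * threads, timeoutMillis)) {
            stalled.incrementAndGet();
            return;
          }
          if (id == 0 && round == failingRound) {
            return; // simulated commit failure: the barrier is never incremented
          }
          barrier.incrementAndGet(); // simulated successful commit
        }
      });
    }

    pool.shutdown();
    if (!pool.awaitTermination(timeoutMillis * (rounds + 1) + 1000, TimeUnit.MILLISECONDS)) {
      pool.shutdownNow();
    }
    return stalled.get();
  }

  /** Polls until the barrier reaches `target` or the timeout elapses. */
  private static boolean awaitBarrier(AtomicInteger barrier, int target, long timeoutMillis) {
    long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
    while (barrier.get() < target) {
      if (System.nanoTime() - deadline >= 0) {
        return false;
      }
      try {
        Thread.sleep(5);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) throws InterruptedException {
    System.out.println("stalled, no failure: " + countStalledWorkers(5, 3, -1, 2000));
    System.out.println("stalled, worker 0 fails in round 1: " + countStalledWorkers(5, 3, 1, 200));
  }
}
```

In this model, once one worker drops out the barrier can never reach the next round's target, so a longer timeout only delays the failure of the remaining workers.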

This is a recurrence of #11047. The previous fix (#12714) only raised the awaitility timeout from 10s to 60s, which addressed the symptom rather than the cause, so the failure came back at the 60s boundary.

Fix

Raise commit.retry.num-retries from threadsCount to 20, matching the two sibling optimistic-commit concurrency tests TestJdbcTableConcurrency and TestHiveTableConcurrency, which both use 20 retries for 7 concurrent committers. With comfortable retry headroom no committer exhausts its budget under the forced contention, so the barrier always advances and the test no longer stalls. This does not weaken the test: it still verifies that all concurrent fast appends succeed and produce the expected number of snapshots.
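To illustrate the retry headroom, here is a small sketch of a bounded retry loop. It is not Iceberg's retry code: `RetryBudgetSketch`, `commitWithRetries`, and `CommitLostException` are hypothetical names, and the sketch assumes that a budget of N retries allows N + 1 attempts in total.

```java
import java.util.function.IntPredicate;

final class RetryBudgetSketch {

  /** Hypothetical stand-in for a commit conflict; not Iceberg's CommitFailedException. */
  static final class CommitLostException extends RuntimeException {
    CommitLostException(String message) {
      super(message);
    }
  }

  /**
   * Tries the commit once, then retries up to `numRetries` more times.
   * `commitAttempt` receives the zero-based attempt index and returns true
   * when that attempt wins the race. Returns the number of attempts used.
   */
  static int commitWithRetries(int numRetries, IntPredicate commitAttempt) {
    for (int attempt = 0; attempt <= numRetries; attempt++) {
      if (commitAttempt.test(attempt)) {
        return attempt + 1;
      }
    }
    throw new CommitLostException(
        "retries exhausted after " + (numRetries + 1) + " attempts");
  }

  public static void main(String[] args) {
    // A committer that loses its first 7 attempts (an illustrative number only).
    IntPredicate losesSevenTimes = attempt -> attempt >= 7;
    System.out.println("attempts with 20 retries: " + commitWithRetries(20, losesSevenTimes));
    try {
      commitWithRetries(5, losesSevenTimes);
    } catch (CommitLostException e) {
      System.out.println("with 5 retries: " + e.getMessage());
    }
  }
}
```

The count of seven lost attempts is made up for the example; the point is only that a committer whose losses exceed the budget throws, while a larger budget absorbs the same run of losses.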

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot added the core label Jun 11, 2026

@huaxingao huaxingao left a comment

Contributor

LGTM

  // lock-step contention below (matches Test{Jdbc,Hive}TableConcurrency).
  COMMIT_NUM_RETRIES,
- String.valueOf(threadsCount),
+ "20",

Contributor

Why 20?

Contributor Author

Because the two sibling optimistic-commit concurrency tests, TestJdbcTableConcurrency and TestHiveTableConcurrency, both use 20 retries, and that looks like enough to avoid flakiness here as well.

@wombatu-kun wombatu-kun requested a review from singhpk234 June 12, 2026 02:19
3 participants